
Conversation

dgolden1
Contributor

While tools/extra/parse_log.sh is a beautiful piece of Bash engineering, it is rather convoluted and difficult to read and modify unless you're a real Bash scripting pro. So I'm introducing parse_log.py, a Python version of the log parser, which does basically the same thing but with a few functionality changes that I prefer over the default parse_log.sh implementation.

And of course, you can use it as a library instead of writing the output tables to files if you want to run the code from within some other Python program (e.g., plot_training_log.py).

Usage is very similar to parse_log.sh; see it with ./parse_log.py -h:

$ ./parse_log.py -h
usage: parse_log.py [-h] [--verbose] logfile_path output_dir

Parse a Caffe training log into two CSV files representing training and
testing information

positional arguments:
  logfile_path  Path to log file
  output_dir    Directory in which to place output CSV files

optional arguments:
  -h, --help    show this help message and exit
  --verbose     Print some extra info (e.g., output filenames)
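The command-line interface shown in the help text above can be sketched with the standard argparse module. This is an illustrative reconstruction from the --help output, not the shipped source:

```python
import argparse

def make_parser():
    # Mirrors the --help output above (a sketch, not the actual parse_log.py code)
    parser = argparse.ArgumentParser(
        description='Parse a Caffe training log into two CSV files '
                    'representing training and testing information')
    parser.add_argument('logfile_path', help='Path to log file')
    parser.add_argument('output_dir',
                        help='Directory in which to place output CSV files')
    parser.add_argument('--verbose', action='store_true',
                        help='Print some extra info (e.g., output filenames)')
    return parser

# Example invocation with hypothetical arguments
args = make_parser().parse_args(['train.log', '/tmp/out', '--verbose'])
```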

Differences between parse_log.py and parse_log.sh:

  • The output tables are CSVs, not space-separated; this aids reading in the files with, e.g., Excel or Matlab's readtable command
  • The first column is NumIters not #Iters because some programs (e.g., Matlab readtable again) do not like weird characters in the column headers
  • You must specify the output directory for the parsed tables; I didn't like how with parse_log.sh the output files ended up in the current working directory
  • The time of the first row of the test file is the number of seconds between when the first network test started and when it ended (which can be a few seconds to a couple of minutes), as opposed to in parse_log.sh where the time of the first row was typically near zero; however, the offsets are the same
  • Because information is read sequentially and, in the current log format, the learning rate isn't reported until after the results of the first network test are reported, the initial value for the learning rate in the train table will always be NaN; fixing this would require a kludge to read ahead to find the first lr = ... line and then rewind, which isn't very aesthetically pleasing
  • Unlike parse_log.sh, parse_log.py doesn't assume that Test net output #0 is Accuracy and Test net output #1 is loss, which turns out to depend on the network definition (e.g., it's wrong for the GoogLeNet network defined in GoogLeNet training in Caffe #1367, which has multiple loss and accuracy layers); it explicitly looks for the accuracy = ... and loss = ... lines.
  • Update: train loss is now parsed from lines like Iteration N, loss = X instead of lines like Train net output #M: loss = X to match parse_log.sh.
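The line matching described in the last two bullets can be sketched with a couple of regular expressions. The patterns below are illustrative, not the exact ones in parse_log.py; the sample line is taken from a real Caffe log excerpt later in this thread:

```python
import re

# Illustrative patterns for the "Iteration N, loss = X" and "Iteration N, lr = Y"
# lines that parse_log.py is described as matching
ITER_LOSS_RE = re.compile(r'Iteration (\d+), loss = ([\d.]+)')
ITER_LR_RE = re.compile(r'Iteration (\d+), lr = ([\d.eE+-]+)')

loss_line = 'I1130 13:42:12.935655 27974 solver.cpp:209] Iteration 366680, loss = 5.20852'
lr_line = 'I1130 13:42:12.935822 27974 solver.cpp:445] Iteration 366680, lr = 0.0096'

m = ITER_LOSS_RE.search(loss_line)
num_iters, loss = int(m.group(1)), float(m.group(2))
lr = float(ITER_LR_RE.search(lr_line).group(2))
```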

For your amusement, here are the first few lines of the parsed training file, from both parse_log.sh and parse_log.py:

parse_log.sh:

#Iters Seconds TrainingLoss LearningRate
0      51.768315     1.94563   0.002
20     61.075860     1.58658   0.002
40     70.376403     1.45276   0.002
60     79.681565     1.47321   0.002
80     88.986380     1.36988   0.002
100    149.623934    1.43918   0.002

parse_log.py:

NumIters,Seconds,TrainingLoss,LearningRate
0,51.768378,1.945630,nan
20,61.075930,1.586580,0.002000
40,70.376478,1.452760,0.002000
60,79.681636,1.473210,0.002000
80,88.986512,1.369880,0.002000
100,149.624000,1.439180,0.002000
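Because the header is plain "NumIters" with no "#", the CSV output can be read directly by standard tooling. A quick check using Python's csv module on the sample rows above (the `nan` in the first row matches the learning-rate caveat in the bullet list):

```python
import csv
import io
import math

# First two rows of the parse_log.py sample output above
sample = """NumIters,Seconds,TrainingLoss,LearningRate
0,51.768378,1.945630,nan
20,61.075930,1.586580,0.002000
"""
rows = list(csv.DictReader(io.StringIO(sample)))
first_lr = float(rows[0]['LearningRate'])
is_initial_lr_nan = math.isnan(first_lr)  # True, per the bullet list above
```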

@Yangqing
Member

Thanks for the PR! Just in case you are wondering - we are currently at the CVPR crunch mode (deadline Nov 15) so kindly expect things to move faster after the deadline.

@shelhamer
Member

@drdan14 this looks nice! Thanks for the improved tool.

For author / copyright date metadata we rely on commit messages instead of code comments. How about dropping the author and date lines in the comment block? If you are truly worried about attribution it's fine, since this is a bonus tool, but note that in the core we don't have any author comments in the code (since versioning tracks it perfectly).

Could you drop 83ce49d and instead ignore PyCharm in your local configuration?

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@shelhamer all done. I implemented your requested changes and squashed them.

Was previously using `Train net output #M: loss = X` lines, but there may not be exactly one of those (e.g., for GoogLeNet, which has multiple loss layers); I believe that `Iteration N, loss = X` is the aggregated loss.

If there's only one loss layer, these two values will be equal and it won't matter. Otherwise, we presumably want to report the aggregated loss.
Output format is unchanged (except that csv.DictWriter insists on writing ints as 0.0 instead of 0)
shelhamer added a commit that referenced this pull request Dec 8, 2014
@shelhamer shelhamer merged commit e11d258 into BVLC:dev Dec 8, 2014
@shelhamer
Member

Thanks for the orderly PR and useful improvement.

@sguada
Contributor

sguada commented Dec 8, 2014

@drdan14 you could get the initial learning rate from the solver file; just look for `base_lr: 0.01`

@sguada
Contributor

sguada commented Dec 8, 2014

I tested it with some logs and it failed where parse_log.sh worked, so I think it will need some review.
Assuming the names or the order of the tops for train or test isn't useful; they should be read from the log.

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@sguada if the training is continued from a snapshot and lr_policy is "step", then the initial learning rate may not be equal to the base_lr, right?
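The point about snapshots can be made concrete with the "step" policy formula, where the effective rate decays by a factor of gamma every stepsize iterations. This is a sketch; the parameter names follow the Caffe solver prototxt fields (base_lr, gamma, stepsize), and the numbers below are hypothetical:

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    # "step" policy: rate drops by a factor of gamma every stepsize iterations
    return base_lr * gamma ** (iteration // stepsize)

# Hypothetical example: resuming from a snapshot at iteration 200000 with
# base_lr 0.01, gamma 0.1, stepsize 100000 -- the effective rate at resume
# time is two decay steps below base_lr, so base_lr alone is not enough.
resumed_lr = step_lr(0.01, 0.1, 100000, 200000)
```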

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@sguada, the current version of parse_log.py requires the accuracy and loss layers to be explicitly named accuracy and loss. This is in contrast to parse_log.sh, which assumed that the first two tops were accuracy and loss, respectively, regardless of their names. I think this Python functionality is less fragile, but a compromise might be to use the names accuracy and loss if they're present, and otherwise, assume that the first two tops represent accuracy and loss regardless of their names. But this makes the script significantly more complicated.

What do you think?

@sguada
Contributor

sguada commented Dec 8, 2014

Yeah, that's true. It would probably be better to read everything from the log then; maybe just store all the available information per iteration and then save it with an appropriate heading.

I don't think it is good to assume the name of the top or the name of the layer, since the log will print them.

For example given a log containing:

I1130 13:42:12.935655 27974 solver.cpp:209] Iteration 366680, loss = 5.20852
I1130 13:42:12.935745 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 4.01471 (* 0.3 = 1.20441 loss)
I1130 13:42:12.935762 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.91581 (* 0.3 = 1.17474 loss)
I1130 13:42:12.935806 27974 solver.cpp:224]     Train net output #2: loss3 = 3.00787 (* 1 = 3.00787 loss)
I1130 13:42:12.935822 27974 solver.cpp:445] Iteration 366680, lr = 0.0096

I would expect an output for .train as:

Iteration, loss, loss1/loss1, loss2/loss1, loss3, lr
366680, 5.20852, 4.01471, 3.91581, 3.00787, 0.0096
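The generic approach sguada describes could be sketched as follows: collect every "Train net output #k: name = value" top under its logged name rather than assuming fixed names or order. The regexes are illustrative, applied to the log lines quoted above:

```python
import re

# Log lines from sguada's example above
log_lines = [
    'I1130 13:42:12.935655 27974 solver.cpp:209] Iteration 366680, loss = 5.20852',
    'I1130 13:42:12.935745 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 4.01471 (* 0.3 = 1.20441 loss)',
    'I1130 13:42:12.935762 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.91581 (* 0.3 = 1.17474 loss)',
    'I1130 13:42:12.935806 27974 solver.cpp:224]     Train net output #2: loss3 = 3.00787 (* 1 = 3.00787 loss)',
    'I1130 13:42:12.935822 27974 solver.cpp:445] Iteration 366680, lr = 0.0096',
]

row = {}
for line in log_lines:
    if m := re.search(r'Iteration (\d+), loss = ([\d.]+)', line):
        row['Iteration'], row['loss'] = int(m.group(1)), float(m.group(2))
    elif m := re.search(r'Train net output #\d+: (\S+) = ([\d.]+)', line):
        # Column name comes straight from the log, not from an assumption
        row[m.group(1)] = float(m.group(2))
    elif m := re.search(r'Iteration \d+, lr = ([\d.]+)', line):
        row['lr'] = float(m.group(1))
```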

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@sguada I think that's a reasonable change, but changing the output format may break other scripts down the line, such as plot_training_log.py.example. Is that worth it?

@sguada
Contributor

sguada commented Dec 8, 2014

@drdan14 Yeah, it is worth it; it is now common to have multiple losses and different accuracies, and the names of the columns should be read from the file.

Example:

I1130 13:15:14.285720 27974 solver.cpp:209] Iteration 363960, loss = 5.1968
I1130 13:15:14.285804 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 3.88811 (* 0.3 = 1.16643 loss)
I1130 13:15:14.285822 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.05069 (* 0.3 = 0.915209 loss)
I1130 13:15:14.285836 27974 solver.cpp:224]     Train net output #2: loss3 = 2.69015 (* 1 = 2.69015 loss)
I1130 13:15:14.285851 27974 solver.cpp:445] Iteration 363960, lr = 0.0096
I1130 13:15:34.253371 27974 solver.cpp:264] Iteration 364000, Testing net (#0)
I1130 13:19:59.756922 27974 solver.cpp:315]     Test net output #0: loss1/loss1 = 3.66044 (* 0.3 = 1.09813 loss)
I1130 13:19:59.757136 27974 solver.cpp:315]     Test net output #1: loss1/top-1 = 0.24982
I1130 13:19:59.757156 27974 solver.cpp:315]     Test net output #2: loss1/top-5 = 0.50206
I1130 13:19:59.757171 27974 solver.cpp:315]     Test net output #3: loss2/loss1 = 3.16358 (* 0.3 = 0.949075 loss)
I1130 13:19:59.757182 27974 solver.cpp:315]     Test net output #4: loss2/top-1 = 0.31974
I1130 13:19:59.757192 27974 solver.cpp:315]     Test net output #5: loss2/top-5 = 0.590359
I1130 13:19:59.757206 27974 solver.cpp:315]     Test net output #6: loss3 = 2.84699 (* 1 = 2.84699 loss)
I1130 13:19:59.757218 27974 solver.cpp:315]     Test net output #7: top-1 = 0.37248
I1130 13:19:59.757228 27974 solver.cpp:315]     Test net output #8: top-5 = 0.64302
I1130 13:20:00.006239 27974 solver.cpp:209] Iteration 364000, loss = 5.37725
I1130 13:20:00.006315 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 3.76329 (* 0.3 = 1.12899 loss)
I1130 13:20:00.006332 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.64319 (* 0.3 = 1.09296 loss)
I1130 13:20:00.006345 27974 solver.cpp:224]     Train net output #2: loss3 = 3.355 (* 1 = 3.355 loss)
I1130 13:20:00.006362 27974 solver.cpp:445] Iteration 364000, lr = 0.0096

PS: I forgot that the CSV files should include the time too, and that the delimiter should be allowed to be user-defined.
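The user-defined delimiter sguada requests maps naturally onto csv.DictWriter's standard `delimiter` parameter. A minimal sketch, with illustrative column names:

```python
import csv
import io

def write_rows(rows, fieldnames, delimiter=','):
    # Comma by default, but any single-character delimiter works
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical row, written space-separated instead of comma-separated
out = write_rows([{'Iteration': 366680, 'lr': 0.0096}],
                 ['Iteration', 'lr'], delimiter=' ')
```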

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

OK, that sounds reasonable. Look for an updated PR within a couple of days.

@dgolden1 dgolden1 deleted the log-parser-python branch December 8, 2014 23:35
dgolden1 added a commit to dgolden1/caffe that referenced this pull request Apr 22, 2015
Improvements over the version introduced in BVLC#1384

Highlights:
* Interface change: column order is now determined by using a list of `OrderedDict` objects instead of `dict` objects, which obviates the need to pass around a tuple with the column orders.
* The outputs are now named according to their names in the network protobuffer; e.g., if your top is named `loss`, then the corresponding column header will also be `loss`; we no longer rename it to, e.g., `TrainingLoss` or `TestLoss`.
* Fixed the bug/feature of the first version where the initial learning rate was always NaN.
* Add optional parameter to specify output table delimiter. It's still a comma by default.

You can use Matlab code from [this gist](https://gist.github.com/drdan14/d8b45999c4a1cbf7ad85) to verify that your results are the same before and after the changes introduced in this pull request. That code assumes that your `top` names are `accuracy` and `loss`, but you can modify the code if that's not true.
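The OrderedDict interface change in the first highlight can be illustrated briefly: because an OrderedDict preserves insertion order, the key order of each row doubles as the column order, so no separate column-order tuple needs to be passed around. Keys here are illustrative:

```python
from collections import OrderedDict

# Each parsed row is an OrderedDict; insertion order of the keys
# determines the column order of the output table.
row = OrderedDict()
row['NumIters'] = 0
row['Seconds'] = 51.768378
row['loss'] = 1.94563

columns = list(row.keys())  # column order falls out of the row itself
```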
ShaggO pushed a commit to ShaggO/caffe that referenced this pull request Jun 1, 2015
matthiasplappert pushed a commit to matthiasplappert/caffe that referenced this pull request Aug 10, 2015
cbfinn pushed a commit to cbfinn/caffe that referenced this pull request Aug 12, 2015
wangyida pushed a commit to wangyida/caffe that referenced this pull request Sep 14, 2015
wangyida pushed a commit to wangyida/caffe that referenced this pull request Sep 22, 2015