
Conversation

dgolden1
Contributor

While tools/extra/parse_log.sh is a beautiful piece of Bash engineering, it is rather convoluted and difficult to read and modify unless you're a real Bash scripting pro. So I'm introducing parse_log.py, a Python version of the log parser, which does basically the same thing but with a few functionality changes that I prefer over the default parse_log.sh implementation.

And of course, you can use it as a library instead of writing the output tables to files if you want to run the code from within some other Python program (e.g., plot_training_log.py).

Usage is very similar to parse_log.sh; see it with ./parse_log.py -h:

$ ./parse_log.py -h
usage: parse_log.py [-h] [--verbose] logfile_path output_dir

Parse a Caffe training log into two CSV files representing training and
testing information

positional arguments:
  logfile_path  Path to log file
  output_dir    Directory in which to place output CSV files

optional arguments:
  -h, --help    show this help message and exit
  --verbose     Print some extra info (e.g., output filenames)
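The command-line interface shown in the help text above can be sketched with the standard argparse module. This is an illustrative reconstruction from the --help output, not the shipped source:

```python
import argparse

def make_parser():
    # Mirrors the --help output above (a sketch, not the actual parse_log.py code)
    parser = argparse.ArgumentParser(
        description='Parse a Caffe training log into two CSV files '
                    'representing training and testing information')
    parser.add_argument('logfile_path', help='Path to log file')
    parser.add_argument('output_dir',
                        help='Directory in which to place output CSV files')
    parser.add_argument('--verbose', action='store_true',
                        help='Print some extra info (e.g., output filenames)')
    return parser

# Example invocation with hypothetical arguments
args = make_parser().parse_args(['train.log', '/tmp/out', '--verbose'])
```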

Differences between parse_log.py and parse_log.sh:

  • The output tables are CSVs, not space-separated; this aids reading in the files with, e.g., Excel or Matlab's readtable command
  • The first column is NumIters not #Iters because some programs (e.g., Matlab readtable again) do not like weird characters in the column headers
  • You must specify the output directory for the parsed tables; I didn't like how with parse_log.sh the output files ended up in the current working directory
  • The time of the first row of the test file is the number of seconds between when the first network test started and when it ended (which can be a few seconds to a couple of minutes), as opposed to in parse_log.sh where the time of the first row was typically near zero; however, the offsets are the same
  • Because information is read sequentially and, in the current log format, the learning rate isn't reported until after the results of the first network test are reported, the initial value for the learning rate in the train table will always be NaN; fixing this would require a kludge to read ahead to find the first lr = ... line and then rewind, which isn't very aesthetically pleasing
  • Unlike parse_log.sh, parse_log.py doesn't assume that Test net output #0 is Accuracy and Test net output #1 is loss, which turns out to depend on the network definition (e.g., it's wrong for the GoogLeNet network defined in GoogLeNet training in Caffe #1367, which has multiple loss and accuracy layers); it explicitly looks for the accuracy = ... and loss = ... lines.
  • Update: train loss is now parsed from lines like Iteration N, loss = X instead of lines like Train net output #M: loss = X to match parse_log.sh.
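The line matching described in the last two bullets can be sketched with a couple of regular expressions. The patterns below are illustrative, not the exact ones in parse_log.py; the sample line is taken from a real Caffe log excerpt later in this thread:

```python
import re

# Illustrative patterns for the "Iteration N, loss = X" and "Iteration N, lr = Y"
# lines that parse_log.py is described as matching
ITER_LOSS_RE = re.compile(r'Iteration (\d+), loss = ([\d.]+)')
ITER_LR_RE = re.compile(r'Iteration (\d+), lr = ([\d.eE+-]+)')

loss_line = 'I1130 13:42:12.935655 27974 solver.cpp:209] Iteration 366680, loss = 5.20852'
lr_line = 'I1130 13:42:12.935822 27974 solver.cpp:445] Iteration 366680, lr = 0.0096'

m = ITER_LOSS_RE.search(loss_line)
num_iters, loss = int(m.group(1)), float(m.group(2))
lr = float(ITER_LR_RE.search(lr_line).group(2))
```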

For your amusement, here are the first few lines of the parsed training file, from both parse_log.sh and parse_log.py:

parse_log.sh:

#Iters Seconds TrainingLoss LearningRate
0      51.768315     1.94563   0.002
20     61.075860     1.58658   0.002
40     70.376403     1.45276   0.002
60     79.681565     1.47321   0.002
80     88.986380     1.36988   0.002
100    149.623934    1.43918   0.002

parse_log.py:

NumIters,Seconds,TrainingLoss,LearningRate
0,51.768378,1.945630,nan
20,61.075930,1.586580,0.002000
40,70.376478,1.452760,0.002000
60,79.681636,1.473210,0.002000
80,88.986512,1.369880,0.002000
100,149.624000,1.439180,0.002000
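Because the header is plain "NumIters" with no "#", the CSV output can be read directly by standard tooling. A quick check using Python's csv module on the sample rows above (the `nan` in the first row matches the learning-rate caveat in the bullet list):

```python
import csv
import io
import math

# First two rows of the parse_log.py sample output above
sample = """NumIters,Seconds,TrainingLoss,LearningRate
0,51.768378,1.945630,nan
20,61.075930,1.586580,0.002000
"""
rows = list(csv.DictReader(io.StringIO(sample)))
first_lr = float(rows[0]['LearningRate'])
is_initial_lr_nan = math.isnan(first_lr)  # True, per the bullet list above
```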

@Yangqing
Member

Thanks for the PR! Just in case you are wondering - we are currently at the CVPR crunch mode (deadline Nov 15) so kindly expect things to move faster after the deadline.

@shelhamer
Member

@drdan14 this looks nice! Thanks for the improved tool.

For author / copyright date metadata we rely on commit messages instead of code comments. How about dropping the author and date lines in the comment block? If you are truly worried about attribution it's fine, since this is a bonus tool, but note that in the core we don't have any author comments in the code (since versioning tracks it perfectly).

Could you drop 83ce49d and instead ignore PyCharm in your local configuration?

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@shelhamer all done. I implemented your requested changes and squashed them.

Was previously using `Train net output #M: loss = X` lines, but there may not be exactly one of those (e.g., for GoogLeNet, which has multiple loss layers); I believe that `Iteration N, loss = X` is the aggregated loss.

If there's only one loss layer, these two values will be equal and it won't matter. Otherwise, we presumably want to report the aggregated loss.
Output format is unchanged (except that csv.DictWriter insists on writing ints as 0.0 instead of 0)
shelhamer added a commit that referenced this pull request Dec 8, 2014
@shelhamer shelhamer merged commit e11d258 into BVLC:dev Dec 8, 2014
@shelhamer
Member

Thanks for the orderly PR and useful improvement.

@sguada
Contributor

sguada commented Dec 8, 2014

@drdan14 you could get the initial learning rate from the solver file; just look for `base_lr: 0.01`

@sguada
Contributor

sguada commented Dec 8, 2014

I tested it with some logs and it failed where parse_log.sh worked, so I think it will need some review.
Assuming the names or the order of the tops for train or test isn't useful; they should be read from the log.

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@sguada if the training is continued from a snapshot and lr_policy is "step", then the initial learning rate may not be equal to the base_lr, right?
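The point about snapshots can be made concrete with the "step" policy formula, where the effective rate decays by a factor of gamma every stepsize iterations. This is a sketch; the parameter names follow the Caffe solver prototxt fields (base_lr, gamma, stepsize), and the numbers below are hypothetical:

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    # "step" policy: rate drops by a factor of gamma every stepsize iterations
    return base_lr * gamma ** (iteration // stepsize)

# Hypothetical example: resuming from a snapshot at iteration 200000 with
# base_lr 0.01, gamma 0.1, stepsize 100000 -- the effective rate at resume
# time is two decay steps below base_lr, so base_lr alone is not enough.
resumed_lr = step_lr(0.01, 0.1, 100000, 200000)
```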

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@sguada, the current version of parse_log.py requires the accuracy and loss layers to be explicitly named accuracy and loss. This is in contrast to parse_log.sh, which assumed that the first two tops were accuracy and loss, respectively, regardless of their names. I think this Python functionality is less fragile, but a compromise might be to use the names accuracy and loss if they're present, and otherwise, assume that the first two tops represent accuracy and loss regardless of their names. But this makes the script significantly more complicated.

What do you think?

@sguada
Contributor

sguada commented Dec 8, 2014

Yeah, that's true. It would probably be better to read everything from the log then; maybe just store all the available information per iteration and then save it with an appropriate heading.

I don't think it is good to assume the name of the top or the name of the layer, since the log will print them.

For example given a log containing:

I1130 13:42:12.935655 27974 solver.cpp:209] Iteration 366680, loss = 5.20852
I1130 13:42:12.935745 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 4.01471 (* 0.3 = 1.20441 loss)
I1130 13:42:12.935762 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.91581 (* 0.3 = 1.17474 loss)
I1130 13:42:12.935806 27974 solver.cpp:224]     Train net output #2: loss3 = 3.00787 (* 1 = 3.00787 loss)
I1130 13:42:12.935822 27974 solver.cpp:445] Iteration 366680, lr = 0.0096

I would expect an output for .train as:

Iteration, loss, loss1/loss1, loss2/loss1, loss3, lr
366680, 5.20852, 4.01471, 3.91581, 3.00787, 0.0096
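The generic approach sguada describes could be sketched as follows: collect every "Train net output #k: name = value" top under its logged name rather than assuming fixed names or order. The regexes are illustrative, applied to the log lines quoted above:

```python
import re

# Log lines from sguada's example above
log_lines = [
    'I1130 13:42:12.935655 27974 solver.cpp:209] Iteration 366680, loss = 5.20852',
    'I1130 13:42:12.935745 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 4.01471 (* 0.3 = 1.20441 loss)',
    'I1130 13:42:12.935762 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.91581 (* 0.3 = 1.17474 loss)',
    'I1130 13:42:12.935806 27974 solver.cpp:224]     Train net output #2: loss3 = 3.00787 (* 1 = 3.00787 loss)',
    'I1130 13:42:12.935822 27974 solver.cpp:445] Iteration 366680, lr = 0.0096',
]

row = {}
for line in log_lines:
    if m := re.search(r'Iteration (\d+), loss = ([\d.]+)', line):
        row['Iteration'], row['loss'] = int(m.group(1)), float(m.group(2))
    elif m := re.search(r'Train net output #\d+: (\S+) = ([\d.]+)', line):
        # Column name comes straight from the log, not from an assumption
        row[m.group(1)] = float(m.group(2))
    elif m := re.search(r'Iteration \d+, lr = ([\d.]+)', line):
        row['lr'] = float(m.group(1))
```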

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

@sguada I think that's a reasonable change, but changing the output format may break other scripts down the line, such as plot_training_log.py.example. Is that worth it?

@sguada
Contributor

sguada commented Dec 8, 2014

@drdan14 Yeah, it is worth it; it is now common to have multiple losses and different accuracies, and the names of the columns should be read from the file.

Example:

I1130 13:15:14.285720 27974 solver.cpp:209] Iteration 363960, loss = 5.1968
I1130 13:15:14.285804 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 3.88811 (* 0.3 = 1.16643 loss)
I1130 13:15:14.285822 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.05069 (* 0.3 = 0.915209 loss)
I1130 13:15:14.285836 27974 solver.cpp:224]     Train net output #2: loss3 = 2.69015 (* 1 = 2.69015 loss)
I1130 13:15:14.285851 27974 solver.cpp:445] Iteration 363960, lr = 0.0096
I1130 13:15:34.253371 27974 solver.cpp:264] Iteration 364000, Testing net (#0)
I1130 13:19:59.756922 27974 solver.cpp:315]     Test net output #0: loss1/loss1 = 3.66044 (* 0.3 = 1.09813 loss)
I1130 13:19:59.757136 27974 solver.cpp:315]     Test net output #1: loss1/top-1 = 0.24982
I1130 13:19:59.757156 27974 solver.cpp:315]     Test net output #2: loss1/top-5 = 0.50206
I1130 13:19:59.757171 27974 solver.cpp:315]     Test net output #3: loss2/loss1 = 3.16358 (* 0.3 = 0.949075 loss)
I1130 13:19:59.757182 27974 solver.cpp:315]     Test net output #4: loss2/top-1 = 0.31974
I1130 13:19:59.757192 27974 solver.cpp:315]     Test net output #5: loss2/top-5 = 0.590359
I1130 13:19:59.757206 27974 solver.cpp:315]     Test net output #6: loss3 = 2.84699 (* 1 = 2.84699 loss)
I1130 13:19:59.757218 27974 solver.cpp:315]     Test net output #7: top-1 = 0.37248
I1130 13:19:59.757228 27974 solver.cpp:315]     Test net output #8: top-5 = 0.64302
I1130 13:20:00.006239 27974 solver.cpp:209] Iteration 364000, loss = 5.37725
I1130 13:20:00.006315 27974 solver.cpp:224]     Train net output #0: loss1/loss1 = 3.76329 (* 0.3 = 1.12899 loss)
I1130 13:20:00.006332 27974 solver.cpp:224]     Train net output #1: loss2/loss1 = 3.64319 (* 0.3 = 1.09296 loss)
I1130 13:20:00.006345 27974 solver.cpp:224]     Train net output #2: loss3 = 3.355 (* 1 = 3.355 loss)
I1130 13:20:00.006362 27974 solver.cpp:445] Iteration 364000, lr = 0.0096

PS: I forgot that the CSV files should include the time too, and that the delimiter should be allowed to be user-defined.
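The user-defined delimiter sguada requests maps naturally onto csv.DictWriter's standard `delimiter` parameter. A minimal sketch, with illustrative column names:

```python
import csv
import io

def write_rows(rows, fieldnames, delimiter=','):
    # Comma by default, but any single-character delimiter works
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical row, written space-separated instead of comma-separated
out = write_rows([{'Iteration': 366680, 'lr': 0.0096}],
                 ['Iteration', 'lr'], delimiter=' ')
```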

@dgolden1
Contributor Author

dgolden1 commented Dec 8, 2014

OK, that sounds reasonable. Look for an updated PR within a couple of days.

@dgolden1 dgolden1 deleted the log-parser-python branch December 8, 2014 23:35
dgolden1 added a commit to dgolden1/caffe that referenced this pull request Apr 22, 2015
Improvements over the version introduced in BVLC#1384

Highlights:
* Interface change: column order is now determined by using a list of `OrderedDict` objects instead of `dict` objects, which obviates the need to pass around a tuple with the column orders.
* The outputs are now named according to their names in the network protobuffer; e.g., if your top is named `loss`, then the corresponding column header will also be `loss`; we no longer rename it to, e.g., `TrainingLoss` or `TestLoss`.
* Fixed the bug/feature of the first version where the initial learning rate was always NaN.
* Add optional parameter to specify output table delimiter. It's still a comma by default.

You can use Matlab code from [this gist](https://gist.github.com/drdan14/d8b45999c4a1cbf7ad85) to verify that your results are the same before and after the changes introduced in this pull request. That code assumes that your `top` names are `accuracy` and `loss`, but you can modify the code if that's not true.
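The OrderedDict interface change in the first highlight can be illustrated briefly: because an OrderedDict preserves insertion order, the key order of each row doubles as the column order, so no separate column-order tuple needs to be passed around. Keys here are illustrative:

```python
from collections import OrderedDict

# Each parsed row is an OrderedDict; insertion order of the keys
# determines the column order of the output table.
row = OrderedDict()
row['NumIters'] = 0
row['Seconds'] = 51.768378
row['loss'] = 1.94563

columns = list(row.keys())  # column order falls out of the row itself
```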
ShaggO pushed a commit to ShaggO/caffe that referenced this pull request Jun 1, 2015
matthiasplappert pushed a commit to matthiasplappert/caffe that referenced this pull request Aug 10, 2015
cbfinn pushed a commit to cbfinn/caffe that referenced this pull request Aug 12, 2015
wangyida pushed a commit to wangyida/caffe that referenced this pull request Sep 14, 2015
wangyida pushed a commit to wangyida/caffe that referenced this pull request Sep 22, 2015