upgrading Accuracy layer #5836
Conversation
Force-pushed from cda371e to b4ab3c5.
Thanks for boosting the efficiency! Sorry for the total nitpick, but can you fix the file permission changes that switch 100644 → 100755 before merge?
Force-pushed from b4ab3c5 to 31bc9c6.
@shelhamer Thank you for "focus"ing on this PR. Is there anything else I can do to ease the process of accepting this PR?
I just gave this PR a spin but the results don't look right. At first I noticed my DIGITS plots were broken, with accuracy going out of scale. Then I dug a little and found the weird thing: it tends to increase steadily even when the network is already at a loss minimum. It increases by about (accuracy for one minibatch) * (number of minibatches in a test epoch) at every epoch. It's just something I observed after 2 runs on 2 different nets and datasets (no time to investigate it deeper at the moment), so this may be a bit of a far-fetched theory, but something is definitely off. Maybe it's just for me though? Has anyone tested it in action beside doing

EDIT: The CPU version of this works well. Is it possible that you forgot to reset the accumulators (

EDIT2: Yep, I can confirm that is the case. Added a relevant comment below.
// Similarly, this memory is never used elsewhere, and thus we can use it
// to avoid having to allocate additional GPU memory.
Dtype* counts = bottom[1]->mutable_gpu_diff();
acc_data and counts need to be reset between iterations, right here. Add:

caffe_gpu_set(bottom[0]->count(), Dtype(0), acc_data);
caffe_gpu_set(bottom[1]->count(), Dtype(0), counts);
@Noiredd resetting all the arrays is a bit of overkill. The bug is much simpler to fix: in line 32 there is a += for the accumulator:

acc[index] += (num_better_predictions < top_k);

Replacing it with a simple = assignment is enough to fix the bug.
BTW, it might be interesting to see if this fix will run even faster than the fix you proposed.
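To make the difference between the two fixes concrete, here is a minimal CPU-side sketch of the bug discussed above (names and structure are illustrative, not Caffe's actual kernel). Because the diff buffer is reused as scratch space and never cleared between forward passes, `+=` keeps accumulating stale hits, inflating the accuracy each iteration; plain `=` overwrites the stale value, so no explicit reset is needed:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical analogue of the accumulation bug: run two forward passes
// over the same un-cleared scratch buffer and return the mean "accuracy".
// hits[i] is 1 if sample i was a top_k hit, else 0.
double accuracy_two_passes(const std::vector<int>& hits,
                           bool use_plain_assignment) {
  std::vector<int> acc(hits.size(), 0);   // scratch buffer, cleared only once
  for (int pass = 0; pass < 2; ++pass) {  // two passes, buffer NOT reset between
    for (std::size_t i = 0; i < hits.size(); ++i) {
      if (use_plain_assignment)
        acc[i] = hits[i];    // fixed: overwrite the stale value
      else
        acc[i] += hits[i];   // buggy: accumulates across passes
    }
  }
  double sum = 0;
  for (int a : acc) sum += a;
  return sum / hits.size();
}
```

With 3 hits out of 4 samples, the fixed version reports 0.75 on both passes, while the buggy version doubles to 1.5 on the second pass, matching the "accuracy grows by (minibatch accuracy) * (minibatches per epoch)" symptom reported above.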
@shaibagon Yep, this works too and is clearer. Regarding speed, it shaves off about 30 more seconds (though bear in mind I didn't do any serious benchmarks with a clean test environment, repeated runs, averaged times, etc.).
@Noiredd Thank you!!
You really upgraded this PR with all your tests and benchmarks.
I ran a quick benchmark on a TITAN Z using FCN AlexNet on PASCAL VOC 2012 data (1464 training images, 1449 validation, 21 classes) at top-3 accuracy, training for 6 epochs and testing on the entire val set every epoch (7 times). Results look very good:
One thing I noticed is that I keep getting a 100% result at the first test iteration; current master (and #5734) outputs much smaller (and more believable) numbers. Over the next epochs it looks normal again. This seems to only happen for segmentation networks, regardless of params (
@Noiredd Thank you for spotting the bug! I updated
@shelhamer After the very thorough review of @Noiredd, I made these changes:

Thank you @Noiredd for your work to make this PR very high quality!
Force-pushed from 7ca82da to e370424.
@shaibagon I can confirm that accuracies look good now. What I'm still worried about is that first iteration (before training starts) when training FCNs. I load a "convolutionized" AlexNet and initialize it with a converted pretrained model, except the last layer, which is initialized with Gaussian noise. Results at the first iteration should be bad, and that's what the current master build says:
When I switch to this PR, this is what I get instead:
Accuracy = 1 should imply a per-class accuracy of 1 for every class, shouldn't it? In further iterations it drops to about 0.87, which is consistent with master. But this first one just doesn't make sense.

PS: @shelhamer will probably remind you about the
(1) Regarding the high accuracy at first iteration:
Took me a while to get to the bottom of this; it turned out to be a problem with my model. Due to bad initialization, it output all zeros on the first iteration, so your answer (1) explains it. My fault, of course. On the other hand, though: if P out of N elements of the output probability vector have equal, high values, and the real answer is also among them, should that be treated as a hit even if P >
@Noiredd You are making a good point here: the Accuracy layer should only count predictions that are strictly within top_k. If there are other predictions with equal probability, they should be counted as part of top_k. I'll make the necessary changes.
@shaibagon Just modifying the inequality in
@Noiredd does that also solve the issue of Accuracy=1 while per-class accuracy is less than 1?
@shaibagon Yes. Accuracy=1 and per-class accuracy<1 only happened when the network produced an all-zero output. With those changes, both total and per-class accuracy will be 0.
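The tie-handling discussed above can be sketched as follows (a minimal illustration under my own naming, not the final Caffe code): count competitors whose score is greater than or equal to the true label's score, so that ties consume the top_k budget. Under this rule, an all-zero output ties everywhere and is correctly scored as a miss instead of a hit:

```cpp
#include <cstddef>
#include <vector>

// Tie-aware top_k hit test (illustrative sketch). A prediction is a hit
// only if strictly fewer than top_k other classes score >= the true
// label's score; equal-probability competitors count against top_k.
bool top_k_hit_tie_aware(const std::vector<float>& scores,
                         int label, int top_k) {
  int better_or_equal = 0;  // competitors scoring >= the label's score
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (static_cast<int>(i) != label && scores[i] >= scores[label])
      ++better_or_equal;
  }
  return better_or_equal < top_k;
}
```

For a 21-class all-zero output (the FCN first-iteration case above), all 20 competitors tie with the label, so the prediction is a miss for any reasonable top_k, and both total and per-class accuracy come out 0 as stated.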
Force-pushed from e370424 to 237328d.
Force-pushed from 237328d to ddfd4d3.
@shelhamer @Noiredd Thank you for your efforts on improving this PR!
I have no further objections ;) Great contribution by @shaibagon!
Hi, I've just created a new issue (#5981) which may be related to this merge!
As noted by @Noiredd in PR #5734, the current implementation of the "Accuracy" layer is inefficient. This PR makes two contributions:
(1) An efficient CPU implementation that takes only O(L) per pixel for top_k prediction. Unlike PR #5734, there is no need for a fancy priority_queue or other complex data structures.
(2) A GPU implementation: no need to sync host/device memory.
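The O(L) idea can be sketched in a few lines (an illustrative sketch; the function name and signature are mine, not the PR's). Instead of sorting scores or maintaining a priority_queue, a single linear scan counts how many class scores strictly beat the true label's score; the prediction is a top_k hit iff that count is below top_k, mirroring the `num_better_predictions < top_k` test quoted in the review above:

```cpp
#include <vector>

// O(L) top_k hit test for one pixel, L = number of classes.
// One linear scan: count classes whose score strictly exceeds the
// true label's score; a hit iff fewer than top_k such classes exist.
bool top_k_hit_linear(const std::vector<float>& scores,
                      int label, int top_k) {
  int num_better_predictions = 0;
  for (float s : scores)
    if (s > scores[label]) ++num_better_predictions;
  return num_better_predictions < top_k;
}
```

For a per-pixel segmentation accuracy over H*W pixels this gives O(H*W*L) total, versus O(H*W*L*log k) for a heap-based top_k selection, and the branch-free inner loop maps naturally onto a GPU kernel with one thread per pixel.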