Make eltwise product gradient stabler (fixes issues with WITHIN_CHANNEL LRN in cifar_full example) #981
The eltwise product layer's forward computation is (given inputs `x`, `y`, `z`) `p := x .* y .* z`. Previously I was computing the gradient w.r.t. `x` as `p ./ x` (analogously for `y`, `z`); this changes the layer to, by default, compute the gradient w.r.t. `x` as `y .* z`, which is asymptotically slower in the number of inputs (O(n^2) instead of O(n)) but stabler than dividing by the potentially near-zero `x`. For the case of two inputs (which is probably 99% of the uses of this layer, including the CIFAR example) it's actually faster (just copy the other input) and more accurate; but if you have lots of inputs and you're sure dividing by them will not cause any instability (e.g., if you specifically took measures to condition the inputs as such), you can still set the `stable_prod_grad: false` option to get the old method.
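For illustration, here is a minimal standalone sketch of the two gradient paths. This is hypothetical code, not the actual Caffe kernel; the function and variable names are made up:

```cpp
#include <cstddef>
#include <vector>

// Sketch: backward pass for an elementwise product p = x_0 .* x_1 .* ... .* x_{n-1}.
// Stable path:   grad w.r.t. x_i = top_diff .* (product of all x_j, j != i).
// Unstable path: grad w.r.t. x_i = top_diff .* p ./ x_i  (divides by x_i,
// which may be near zero).
void eltwise_prod_backward(const std::vector<const float*>& bottoms,
                           const float* top_data,  // p; only used by the unstable path
                           const float* top_diff,
                           size_t count,           // number of elements per blob
                           bool stable_prod_grad,
                           std::vector<float*>& bottom_diffs) {
  const size_t n = bottoms.size();
  for (size_t i = 0; i < n; ++i) {
    for (size_t k = 0; k < count; ++k) {
      float g;
      if (stable_prod_grad) {
        // O(n) work per input, O(n^2) overall: multiply the other inputs.
        g = 1.0f;
        for (size_t j = 0; j < n; ++j) {
          if (j != i) g *= bottoms[j][k];
        }
      } else {
        // O(1) per input, O(n) overall, but unstable for near-zero bottoms[i][k].
        g = top_data[k] / bottoms[i][k];
      }
      bottom_diffs[i][k] = top_diff[k] * g;
    }
  }
}
```

Note that for n = 2 the stable inner loop degenerates to reading the single other input, which is why the two-input case is both faster and more accurate than the division.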
This division by near-zero was causing NaNs in the cifar_full example, which uses the eltwise product as part of the WITHIN_CHANNEL LRN computation.
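For reference, a sketch of how opting out might look in a prototxt layer definition (the layer/blob names here are placeholders, and the snippet assumes the newer `layer { ... }` syntax; `stable_prod_grad` sits in `eltwise_param`):

```prototxt
layer {
  name: "prod"
  type: "Eltwise"
  bottom: "x"
  bottom: "y"
  bottom: "z"
  top: "p"
  eltwise_param {
    operation: PROD
    stable_prod_grad: false  # opt back into the old p ./ x gradient
  }
}
```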