If we are reusing weights in a linear layer, can we use the same approximation to compute the covariances, or are there some subtleties?
For example, if the weights w are used 4 times, we could compute \Omega as (1/(4M)) A A^T, where M is the batch size.
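To make the estimate concrete, here is a minimal sketch in plain Python. It assumes A has one column per (use, example) pair, so 4M columns in total, which is what the 1/(4M) normalization suggests. The names `input_covariance` and `acts_per_use` are made up for illustration and are not from any library.

```python
def input_covariance(acts_per_use):
    """Estimate Omega = (1 / (T * M)) * sum over uses t and examples m of a a^T.

    acts_per_use[t][m] is the input vector (length J) that the shared
    weights see at use t for batch example m. This layout is an assumption.
    """
    T = len(acts_per_use)        # number of times the weights are reused
    M = len(acts_per_use[0])     # batch size
    J = len(acts_per_use[0][0])  # input dimension
    omega = [[0.0] * J for _ in range(J)]
    for use in acts_per_use:
        for a in use:
            # Accumulate the outer product a a^T for this (use, example) pair.
            for i in range(J):
                for j in range(J):
                    omega[i][j] += a[i] * a[j]
    scale = 1.0 / (T * M)
    return [[x * scale for x in row] for row in omega]


# Toy data: T = 4 uses, M = 2 examples, J = 2 input features.
acts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, -1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 2.0], [2.0, 1.0]],
]
omega = input_covariance(acts)
print(omega)  # [[1.5, 0.5], [0.5, 1.5]]
```

With this normalization \Omega is an average over the uses rather than a sum, so the factor of 4 still has to show up somewhere in the full Fisher block estimate (either here or in the gradient covariance).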
Deriving from the definition of a Fisher block and assuming spatially uncorrelated derivatives seems to land you in the same place as the convolutional approximation.
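For reference, a sketch of how that derivation might go. The notation is an assumption, not from the post: a_t is the layer input and g_t the derivative with respect to the layer output at use t of the shared weights, with t = 1..T (T = 4 in the example), and \Gamma is the corresponding derivative covariance.

```latex
\begin{align*}
\mathcal{D}W &= \sum_{t=1}^{T} g_t a_t^\top \\
F &= \mathbb{E}\big[\operatorname{vec}(\mathcal{D}W)\,\operatorname{vec}(\mathcal{D}W)^\top\big]
   = \sum_{t,t'} \mathbb{E}\big[(a_t a_{t'}^\top) \otimes (g_t g_{t'}^\top)\big] \\
  &\approx \sum_{t,t'} \mathbb{E}[a_t a_{t'}^\top] \otimes \mathbb{E}[g_t g_{t'}^\top]
   && \text{(inputs independent of derivatives)} \\
  &= \sum_{t} \mathbb{E}[a_t a_t^\top] \otimes \mathbb{E}[g_t g_t^\top]
   && \text{($\mathbb{E}[g_t g_{t'}^\top] = 0$ for $t \neq t'$)} \\
  &\approx T \, \Omega \otimes \Gamma
   && \text{(same statistics at every use)}
\end{align*}
% Here \Omega = \frac{1}{T} \sum_t \mathbb{E}[a_t a_t^\top]
% and  \Gamma = \frac{1}{T} \sum_t \mathbb{E}[g_t g_t^\top].
```

Under these assumptions the uses of the weights play the role of spatial locations. The only bookkeeping difference is whether the factor T is kept explicit, as here, or absorbed into one of the two factors.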