Basic cuDNN v3 support #2737
Conversation
Just out of curiosity, where did you get cuDNN v3 from?
(Force-pushed cffc254 to 7c6d031.)
@philkr Sorry, I missed this question; I'm an engineer at NVIDIA.
I tried this PR with the cuDNN v3 RC on my customised GoogLeNet model and got only a marginal performance increase (16 ms per image vs. 17 ms with cuDNN v2), while memory usage grew 1.5x, from ~2 GB to 3 GB, on an Amazon g2.2xlarge instance.
I also tried this PR and got results similar to what @stas-sl describes. When I comment out the call to cudnnGetConvolutionForwardAlgorithm and force CUDNN_CONVOLUTION_FWD_ALGO_FFT as the algorithm, the call to cudnnGetConvolutionForwardWorkspaceSize returns CUDNN_STATUS_NOT_SUPPORTED. @slayton58, do you know what the limitations on these parameters are for selecting the FFT algorithm?
@stas-sl Most of the potential performance increases in cuDNN v3 come either from tunings for Maxwell or from the new FFT-based convolution algorithms. Given the GPUs you're using, I suspect you're currently not using either of those features. @talda There are a few restrictions: H and W must be packed (wStride = 1, hStride = W), the convolution stride must be 1 (the default for Caffe's convolution layers), and the padded size (H + 2*pad_h, W + 2*pad_w) must be <= (256, 256).
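The restrictions listed above can be expressed as a small predicate. This is a sketch for illustration only; the struct, field names, and the standalone function are hypothetical, since the real eligibility decision happens inside cuDNN:

```cpp
// Sketch: eligibility check for CUDNN_CONVOLUTION_FWD_ALGO_FFT, based on the
// restrictions described in the comment above. All names here are illustrative.
struct ConvDesc {
  int h, w;                // input spatial size
  int h_stride, w_stride;  // tensor strides (packed layout: h_stride == w, w_stride == 1)
  int stride_h, stride_w;  // convolution strides
  int pad_h, pad_w;        // zero padding
};

bool fft_algo_supported(const ConvDesc& d) {
  const bool packed = (d.w_stride == 1) && (d.h_stride == d.w);       // H and W packed
  const bool unit_stride = (d.stride_h == 1) && (d.stride_w == 1);    // stride must be 1
  const bool fits = (d.h + 2 * d.pad_h <= 256) &&                     // padded size
                    (d.w + 2 * d.pad_w <= 256);                       // <= (256, 256)
  return packed && unit_stride && fits;
}
```

For example, a packed 224x224 input with stride 1 and pad 3 passes, while a stride-3 convolution (as in a later comment) or a 300x300 input does not.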
@slayton58, on a Titan X, which is Maxwell, the speed gain for GoogLeNet is ~5%. Is that the maximum I can get for this architecture?
@slayton58 I tried this with my own data and networks (1 convolutional + 2 fully connected layers) and found that GPU + cuDNN v3 is much slower than GPU only (time: CPU >> GPU + cuDNN > GPU only). I tried different batch sizes, but the results showed that cuDNN is at least twice as slow as GPU only, especially on the backward pass (at least 3 times slower). I'm quite new to Caffe and currently working with C++ (VS2013) on Windows; do you have any idea of the possible reasons? Many thanks!
@slayton58 My problem was that, in order to save some memory, I had changed the stride of one of the convolution layers to 3. Now that I've fixed this I don't get the error, but the amount of memory needed as workspace just for the FFT forward convolution layers is so big that I can't really train my network on a K40. Can you post an example of a reasonably sized net that can be trained using the FFT algorithm?
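One way around the workspace-size problem described above is to fall back to a zero-workspace algorithm when the FFT workspace would not fit in a memory budget. The sketch below is hypothetical; it only illustrates the policy, with the workspace size assumed to come from a query such as cudnnGetConvolutionForwardWorkspaceSize:

```cpp
#include <cstddef>

// Illustrative enum standing in for the cuDNN forward-algorithm choices.
enum FwdAlgo { ALGO_IMPLICIT_GEMM, ALGO_FFT };

// Pick FFT only when its reported workspace fits the remaining GPU memory
// budget; otherwise fall back to the zero-workspace implicit-GEMM path.
// (Hypothetical helper; cuDNN itself can do this via its preference flags.)
FwdAlgo choose_fwd_algo(std::size_t workspace_bytes_fft, std::size_t budget_bytes) {
  return (workspace_bytes_fft <= budget_bytes) ? ALGO_FFT : ALGO_IMPLICIT_GEMM;
}
```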
Ping: are there plans to get this merged soon? Or is this the start of a longer-term project?
What's the status on this?
(Force-pushed 6cdf110 to 9fa56f0.)
Is cuDNN v3 currently supported?
Merging as #3160 to include a few last details. Thanks for this integration @slayton58, and sorry for the earlier holdup! I appreciate having the TODO for choosing the algo on
@slayton58 this is failing
@shelhamer Fixed the failing convolution-groups test: weight_offset_ wasn't being set correctly.
Introducing weight_offset_ caused the issue in #2737 (comment), since it was shadowing the BaseConvolutionLayer member that is set in BaseConvolutionLayer::LayerSetup().
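The shadowing bug described here can be reproduced in miniature. In this sketch the class and member names are simplified stand-ins for BaseConvolutionLayer / its cuDNN subclass and weight_offset_; the point is that a re-declared field in the derived class hides the value the base class set:

```cpp
// Sketch of the weight_offset_ shadowing bug: the derived class declares a
// member with the same name as a base-class member, so derived-class code
// reads its own (still-default) copy instead of the value set during setup.
// Names are simplified stand-ins for the Caffe classes involved.
class Base {
 public:
  void LayerSetUp() { weight_offset_ = 42; }  // base writes the real member
 protected:
  int weight_offset_ = 0;
};

class Derived : public Base {
 public:
  // Reads Derived::weight_offset_, NOT Base::weight_offset_.
  int offset_seen_by_derived() const { return weight_offset_; }
 protected:
  int weight_offset_ = 0;  // shadows Base::weight_offset_ -- the bug
};
```

After calling LayerSetUp(), offset_seen_by_derived() still returns the default 0 rather than 42, which is why removing the duplicate declaration (leaving only the base-class member) fixes the test.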
Once again, merging as #3160 to include the little details mentioned there and the minimal fix for the
This PR implements basic support for the new features in cuDNN v3. Summarized, these are: