@gongzg gongzg commented Mar 13, 2017

This PR is mainly for bug fixes and also includes some enhancements, as listed below:

  1. The kernel/program binary cache mechanism now works with the latest ViennaCL. For details, please refer to clcaffe's wiki page. With this feature enabled, the initialization time of an OpenCL Caffe application is reduced dramatically.
  2. Relaxed the image restriction for the spatial convolution kernels, so far fewer convolution kernels are needed when the application has to process different image sizes with the same net model.
  3. Fixed a race-condition bug; all test cases now pass consistently, and the random-failure symptom is gone.
  4. Added dilation support for the spatial convolution kernel.
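A binary cache of the kind mentioned in item 1 generally works by keying compiled binaries on a hash of the kernel source (a real cache would also hash the build options and the device/driver version). Here is a minimal sketch of that general idea in Python, with hypothetical names; this is not clcaffe's actual implementation:

```python
import hashlib
import pathlib
import tempfile

def build_program(source: str) -> bytes:
    # Stand-in for the expensive compile step (clBuildProgram & friends);
    # the returned bytes play the role of the compiled device binary.
    return source.encode()[::-1]

def build_with_cache(source: str, cache_dir: pathlib.Path) -> bytes:
    # Key the cache on a hash of the kernel source.
    key = hashlib.sha256(source.encode()).hexdigest()
    path = cache_dir / (key + ".bin")
    if path.exists():
        return path.read_bytes()      # cache hit: skip compilation entirely
    binary = build_program(source)    # cache miss: compile once
    path.write_bytes(binary)          # persist for later runs
    return binary

cache_dir = pathlib.Path(tempfile.mkdtemp())
first = build_with_cache("__kernel void k() {}", cache_dir)
second = build_with_cache("__kernel void k() {}", cache_dir)  # served from cache
```

On the second and subsequent runs every kernel comes straight from disk, which is where the dramatic reduction in initialization time comes from.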

Zhigang Gong and others added 19 commits March 13, 2017 16:23
In the to_gpu function, when the memory is uninitialized we do not
need to finish the queue; likewise, if the HEAD is on the CPU and
we support zero copy, we also don't need to finish the queue.

Signed-off-by: Zhigang Gong <[email protected]>
Caffe's timer has some overhead, and when the kernel being tuned is
very tiny, that overhead may cause very unstable timing results, so
I increased the iteration count to amortize it.

Signed-off-by: Zhigang Gong <[email protected]>
If the spatial dimension is relatively large, we should use the default code
path to achieve better parallelism.

Signed-off-by: Zhigang Gong <[email protected]>
Sub-buffer creation may sometimes fail; we need to handle that case.

Signed-off-by: Zhigang Gong <[email protected]>
Some features, e.g. opencl_unroll_hint, are not supported by the
Beignet compiler; use the __BEIGNET__ macro to choose whether to build
with these features.

Also add a helper function to facilitate detecting the Beignet driver.

Signed-off-by: Zhiwen Wu <[email protected]>
If the input image size changes at runtime and the kernel type
changes to 2 or 5, we need to swizzle the weights again.

Signed-off-by: Zhigang Gong <[email protected]>
Added a new basic convolution kernel that supports input images with
no padding, so image padding in the host code is no longer needed.

Signed-off-by: Zhiwen Wu <[email protected]>
Change-Id: I392c4e73319fcfc18e628f9476b9bfdcba3cc206
If we simply use the CPU code path to copy the data, we introduce a
race condition between the GPU queue and the CPU. The scenario: when
we call it in an iteration loop, the data blob is a zero-copy blob,
and the first pass may still be pending on the GPU side. The second
pass then modifies the data blob on the CPU side before the first
pass has accessed that data on the GPU side.

We could simply add a synchronization point between the two iterations, but
that is not a good fix, as it forces the GPU queue to flush and waits for it
to finish. The best way is to do the copy on the GPU side, in the same queue.
That way we no longer need to worry about this race condition, and we don't
interfere with the GPU queue at all.

Signed-off-by: Zhigang Gong <[email protected]>
This will cause the ReLU gradient to fail.
Prepare to support varying sizes.

Signed-off-by: Zhigang Gong <[email protected]>
No need to tune a different kernel for each input size.

Signed-off-by: Zhigang Gong <[email protected]>
Add the platform and driver information, and switch to using the
system cache directory where possible. After this change, we can
reuse offline-tuned configurations.

Signed-off-by: Zhigang Gong <[email protected]>
Signed-off-by: Zhigang Gong <[email protected]>
@naibaf7 naibaf7 merged commit 479d3a0 into BVLC:opencl Mar 13, 2017
