Description
When running Caffe on the ImageNet data, I observed that memory usage (as seen via the top command) inexorably increases to almost 100%. With batch size 256, this happens in around 2500 iterations. When I set the batch size to 100, training was faster, but by around 5000 iterations memory consumption again increased to almost 100%. At that point training slowed down dramatically and the loss stopped changing entirely. I suspect the slowdown may be due to thrashing. I am wondering whether there is a memory leak, or something in Caffe that is unintentionally allocating more and more memory at each iteration.
The same issue occurs on MNIST, although that dataset is small enough that training can actually run to completion.
I ran the MNIST training under valgrind with --leak-check=full, and it did report some memory leaks. These could be benign if the amount of leaked memory is constant, but it may be scaling with the number of batches, which would explain the ever-increasing memory consumption.
Any idea what could be the problem?
Update (12/13/2013): The problem may be in LevelDB. I was able to make it work by modifying src/caffe/layers/data_layer.cpp to set options.max_open_files = 100. The default is 1000, which was simply too much memory on the machine I was using. I also wonder whether this could be improved further by setting ReadOptions::fill_cache = false, since Caffe scans over the whole training set sequentially.
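To make the suggested change concrete, here is a minimal sketch of the two LevelDB knobs mentioned above. The helper functions (OpenTrainingDB, NewScanIterator) are hypothetical and not the actual data_layer.cpp code, but leveldb::Options::max_open_files and leveldb::ReadOptions::fill_cache are real LevelDB settings:

```cpp
// Sketch only: how the LevelDB options described above could be set.
// The surrounding code in src/caffe/layers/data_layer.cpp may differ.
#include <leveldb/db.h>
#include <leveldb/options.h>

#include <glog/logging.h>

#include <memory>
#include <string>

std::unique_ptr<leveldb::DB> OpenTrainingDB(const std::string& source) {
  leveldb::Options options;
  options.create_if_missing = false;
  // Default is 1000; every open table file keeps its index block (and
  // possibly cached data) in memory, so lowering this bounds LevelDB's
  // resident footprint.
  options.max_open_files = 100;

  leveldb::DB* db_ptr = NULL;
  leveldb::Status status = leveldb::DB::Open(options, source, &db_ptr);
  CHECK(status.ok()) << "Failed to open leveldb " << source
                     << ": " << status.ToString();
  return std::unique_ptr<leveldb::DB>(db_ptr);
}

leveldb::Iterator* NewScanIterator(leveldb::DB* db) {
  leveldb::ReadOptions read_options;
  // For a sequential pass over the whole training set, cached blocks are
  // never revisited, so skipping the block cache keeps it from growing.
  read_options.fill_cache = false;
  return db->NewIterator(read_options);
}
```

The idea behind fill_cache = false is that a one-pass sequential scan gets no benefit from the block cache, so there is no point letting it fill up with blocks that will not be read again.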