Add support for Dask SVD solver #109
Conversation
Hopefully this solves #107

@rabernat - are you able to test and review this please?

Thanks @ScottWales. I tried it and it seems to work.
This seems like a simple yet well-crafted update to enable the use of dask. Fantastic work! The tests look spot-on.
My one comment is about possibly targeting dask's svd_compressed function. However, it may be preferable to leave that for future work.
lib/eofs/standard.py
Outdated
```python
# Trim the arrays (since Dask doesn't support
# 'full_matrices=False')
A = A[:, :len(Lh)]
E = E[:len(Lh), :]
```
Just a comment for discussion. Dask also has an `svd_compressed` function.
This computes the approximate singular value decomposition of a large array. This algorithm is generally faster than the normal algorithm but does not provide exact results. One can balance between performance and accuracy with input parameters (see below).
You have to explicitly specify the rank of the desired thin SVD decomposition.
Obviously this is not part of the numpy API. Nevertheless, it seems like it could be useful to have the option to use it when dealing with dask arrays (which are presumably very large).
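For discussion, a minimal sketch of what that option might look like (the array shape and rank `k=50` are illustrative, not from the PR):

```python
import dask.array as da

x = da.random.random((100000, 1000), chunks=(10000, 1000))

# Approximate truncated SVD; the rank k of the desired thin
# decomposition must be specified explicitly
u, s, v = da.linalg.svd_compressed(x, k=50)

# Materialise the factors as numpy arrays
u, s, v = da.compute(u, s, v)
```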
The test failure is, as far as I can tell, unrelated to this PR. Edit: that's not true; there are also tests failing because dask is not available in some of the Travis envs.
This PR needs a few modifications to get it into shape. The algorithm choices need to be reviewed, and the testing needs to be revised. Hopefully not too much work.
```python
@classmethod
def modify_solution(cls):
    import dask
```
This means the test suite now requires dask or it will fail. This is not ideal; I don't want to make dask a required dependency at this stage. We should think about skipping this test when dask is not available. I'm not quite sure what the best approach would be right now.
I would follow what xarray itself does here, where dask is an optional dependency. Something like:

`dask = pytest.importorskip('dask')`

i.e. skip the tests where dask is not installed. See:
https://github.com/pydata/xarray/blob/master/xarray/tests/test_dask.py
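A minimal sketch of that pattern at module level (the test body is illustrative, not from the PR):

```python
import pytest

# Skip every test in this module when dask is not installed;
# importorskip returns the module when the import succeeds
dask = pytest.importorskip('dask')


def test_dask_sum():
    import dask.array as da
    x = da.ones((10, 10), chunks=(5, 5))
    assert x.sum().compute() == 100
```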
I've added `pytest.importorskip` to the Dask tests.
lib/eofs/standard.py
Outdated
```python
    A, Lh, E = (x.compute() for x in dsvd)
else:
    # Basic numpy algorithm
    A, Lh, E = np.linalg.svd(dataNoMissing)
```
This can be way more expensive for numpy arrays than with full_matrices=False; I don't think it is a good idea to change this.
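To illustrate the cost the reviewer is pointing at, a minimal sketch (the shape is illustrative):

```python
import numpy as np

a = np.random.rand(100000, 50)

# Thin SVD: U has shape (100000, 50); cheap, and all the EOF
# calculation needs
U, s, Vt = np.linalg.svd(a, full_matrices=False)

# Full SVD: U would have shape (100000, 100000), roughly 80 GB of
# float64 values, so it is left commented out here
# U, s, Vt = np.linalg.svd(a)
```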
No problem, changed back to the original and moved the array trim to the dask section.
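Putting the diff fragments in this review together, the revised branching would look roughly like this (a reconstruction for illustration, not the exact merged code; `dataNoMissing` is the 2D data array from `eofs.standard`):

```python
import numpy as np
import dask.array as da

if isinstance(dataNoMissing, da.Array):
    # Dask's parallel SVD; it doesn't support
    # 'full_matrices=False', so trim the factors afterwards
    dsvd = da.linalg.svd(dataNoMissing)
    A, Lh, E = (x.compute() for x in dsvd)
    A = A[:, :len(Lh)]
    E = E[:len(Lh), :]
else:
    # Basic numpy algorithm, using the cheaper thin SVD
    A, Lh, E = np.linalg.svd(dataNoMissing, full_matrices=False)
```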
@ajdawson, if this looks OK then I could add some mention to
Looks good now @ScottWales. Can I ask you to rebase on the current master (which I just updated)? I had forgotten to merge the 1.3 development branch back into master after the last release, which included an upgrade of the CI testing to have more coverage and test on more recent versions of Python. I'd just like to see the updated test system pass for this PR; otherwise good to merge.
The Dask array library includes a parallel SVD solver. Use this to
compute the EOF, provided the input data is a Dask array (e.g. from
using `xarray.open_dataset(..., chunks={'time': 10})`)
- Skip Dask tests if Dask is not available
- Use `full_matrices=False` for numpy solver
Thanks @ajdawson, rebase now done.
lib/eofs/tests/test_solution.py
Outdated
```python
@classmethod
def modify_solution(cls):
    # Skip this class if dask not available
    pytest.importorskip('dask.array', minversion='1.0')
```
I don't know that this works correctly. I'm seeing skipped dask tests in the test output:
```
SKIPPED [30] /home/travis/build/ajdawson/eofs/lib/eofs/tests/test_solution.py:289: module 'dask.array' has __version__ None, required is: '1.0'
```
Should this just be 'dask' instead of 'dask.array'?
Thanks for spotting. Using dask.array is intentional: dask can be installed with only some of its components, so this ensures the array component is installed.
There is a problem with checking the version here, though. The version of Dask used by the old Python 3.4 tests had a different syntax, so I wanted to disable tests for that version. However, dask.array doesn't have a __version__ attribute (that's part of the dask namespace), so the tests got skipped all the time.
Since the Python 3.4 tests are no longer being used, I've removed the version constraint. There are now no skipped tests in any of the 'extras' test runs.
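A sketch of the resulting distinction (the `minversion` value is illustrative):

```python
import pytest

# Ensure the array component specifically is available; dask can be
# installed without it
pytest.importorskip('dask.array')

# __version__ lives on the top-level 'dask' package, so any version
# constraint has to be checked there instead
pytest.importorskip('dask', minversion='1.0')
```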
Thanks @ScottWales, this is a nice update. And thanks @rabernat and @navidcy for testing/review.
The Dask array library includes a parallel SVD solver. Use this to compute the EOF, provided the input data is a Dask array (e.g. from using `xarray.open_dataset(..., chunks={'time': 10})`).

Dask arrays are only used for the input to the EOF calculation. Because of the way that masking works, the output from the SVD solver is converted into normal Numpy arrays. There should still be some benefit from the parallel SVD algorithm that Dask uses, though I've not measured performance.
Both the standard and Xarray solvers are supported. Iris can in theory use Dask, but I couldn't see how to set up test data for it.
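For reference, end-to-end usage would look something like this (a minimal sketch; the file name and variable name are hypothetical):

```python
import xarray as xr
from eofs.xarray import Eof

# Opening with chunks gives a Dask-backed variable
ds = xr.open_dataset('sst.nc', chunks={'time': 10})

# The SVD inside the solver then runs through Dask's parallel
# algorithm; results come back as regular numpy-backed DataArrays
solver = Eof(ds['sst'])
eofs = solver.eofs(neofs=5)
pcs = solver.pcs(npcs=5)
```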