
Conversation

@ScottWales

The Dask array library includes a parallel SVD solver. Use this to compute the EOF, provided the input data is a Dask array (e.g. from using xarray.open_dataset(..., chunks={'time': 10}))

Dask arrays are only used for the input to the EOF calculation. Because of the way that masking works, the output from the SVD solver is converted into normal Numpy arrays. There should still be some benefit from the parallel SVD algorithm that Dask uses, though I've not measured performance.

Both the standard and Xarray solvers are supported. Iris can in theory use Dask, but I couldn't see how to set up test data for it.
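
For anyone trying this out, a minimal usage sketch of the workflow described above might look like the following (the file name 'sst.nc' and variable name 'sst' are placeholders):

```python
import xarray as xr
from eofs.xarray import Eof

# Open the dataset lazily so the variable is backed by a dask array.
ds = xr.open_dataset('sst.nc', chunks={'time': 10})
sst = ds['sst']

# The solver picks up the dask-backed input and uses dask's parallel SVD;
# as noted above, the outputs come back as normal numpy-backed arrays.
solver = Eof(sst)
eofs = solver.eofs(neofs=5)
pcs = solver.pcs(npcs=5)
```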

@ScottWales
Author

Hopefully this solves #107

@ajdawson
Owner

ajdawson commented Apr 2, 2019

@rabernat - are you able to test and review this please?

@navidcy

navidcy commented Apr 2, 2019

Thanks @ScottWales. I tried it and it seems to work.
@ajdawson, why do most of the travis-ci tests fail?

Contributor

@rabernat left a comment

This seems like a simple yet well-crafted update to enable the use of dask. Fantastic work! The tests look spot-on.

My one comment is about possibly targeting dask's svd_compressed function. However, it may be preferable to leave that for future work.

# Trim the arrays (since Dask doesn't support
# 'full_matrices=False')
A = A[:, :len(Lh)]
E = E[:len(Lh), :]
Contributor

Just a comment for discussion. Dask also has a svd_compressed function.

This computes the approximate singular value decomposition of a large array. This algorithm is generally faster than the normal algorithm but does not provide exact results. One can balance between performance and accuracy with input parameters (see below).

You have to explicitly specify the rank of the desired thin SVD decomposition.

Obviously this is not part of the numpy API. Nevertheless, it seems like it could be useful to have the option to use it when dealing with dask arrays (which are presumably very large).
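
For reference, a rough sketch of what using it might look like (array size, chunking, and the rank k are purely illustrative):

```python
import dask.array as da

# A large chunked array standing in for the data matrix.
x = da.random.random((100000, 500), chunks=(10000, 500))

# Randomized, approximate SVD keeping only the leading k singular triplets;
# k (the rank of the thin decomposition) has to be chosen up front.
k = 20
u, s, v = da.linalg.svd_compressed(x, k)
u, s, v = da.compute(u, s, v)
```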

@rabernat
Contributor

rabernat commented Apr 2, 2019

The test failure is, as far as I can tell, unrelated to this PR:

__________ TestConstructorStandard.test_input_with_all_missing_values __________
self = <eofs.tests.test_error_handling.TestConstructorStandard object at 0x7f625167f6d8>
    def test_input_with_all_missing_values(self):
        # input with only missing values, missing values are propagated
        # because the default for the center argument is True
        solution = reference_solution('standard', 'equal')
        data = solution['sst']
        mask = data.mask
        mask[-1] = True
        with pytest.raises(ValueError):
>           solver = self.solver_class(data)
E           Failed: DID NOT RAISE <class 'ValueError'>

edit: that's not true, there are also tests failing because dask is not available in some of the travis envs.

Owner

@ajdawson left a comment

This PR needs a few modifications to get it into shape. The algorithm choices need to be reviewed, and the testing needs to be revised. Hopefully not too much work.


@classmethod
def modify_solution(cls):
    import dask
Owner

This means the test suite now requires dask or it will fail. This is not ideal; I don't want to make dask a required dependency at this stage. We should think about skipping this test when dask is not available. Not quite sure what the best approach would be right now.

Contributor

I would follow what xarray itself does here. Dask is an optional dependency.

Something like:

dask = pytest.importorskip('dask')

i.e. skip the tests where dask is not installed.

https://github.com/pydata/xarray/blob/master/xarray/tests/test_dask.py

Author

I've added pytest.importorskip to the Dask tests

    A, Lh, E = (x.compute() for x in dsvd)
else:
    # Basic numpy algorithm
    A, Lh, E = np.linalg.svd(dataNoMissing)
Owner

This can be way more expensive for numpy arrays than with full_matrices=False; I don't think it is a good idea to change this.

Author

No problem, changed back to the original and moved the array trim to the dask section
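
Roughly, the branch now looks like this (an illustrative sketch rather than the exact code; `dataNoMissing` is the 2-D data matrix with missing points removed):

```python
import numpy as np
import dask.array as da

if isinstance(dataNoMissing, da.Array):
    # Dask's SVD doesn't support full_matrices=False, so compute the
    # factorization and trim the factors to the thin shapes afterwards.
    dsvd = da.linalg.svd(dataNoMissing)
    A, Lh, E = (x.compute() for x in dsvd)
    A = A[:, :len(Lh)]
    E = E[:len(Lh), :]
else:
    # Original numpy path, restored to use the cheaper thin SVD.
    A, Lh, E = np.linalg.svd(dataNoMissing, full_matrices=False)
```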

@navidcy

navidcy commented Apr 10, 2019

@ajdawson, if this looks OK then I could add some mention to dask in the documentation.

@ajdawson
Owner

Looks good now @ScottWales. Can I ask you to rebase on the current master (which I just updated)? I had forgotten to merge the 1.3 development branch back into master after the last release, which included an upgrade of the CI testing to give more coverage and test on more recent versions of Python. I'd just like to see the updated test system pass for this PR, otherwise good to merge.

Scott Wales added 3 commits April 29, 2019 10:45
The Dask array library includes a parallel SVD solver. Use this to
compute the EOF, provided the input data is a Dask array (e.g. from
using `xarray.open_dataset(..., chunks={'time': 10})`)
- Skip Dask tests if Dask is not available
- Use `full_matrices=False` for numpy solver
- Add check for all missing data
- Pre v1.0 dask has different arguments, just check later versions
@ScottWales
Author

Thanks @ajdawson, rebase now done

@classmethod
def modify_solution(cls):
    # Skip this class if dask not available
    pytest.importorskip('dask.array', minversion='1.0')
Owner

@ajdawson Apr 29, 2019

I don't know that this works correctly. I'm seeing skipped dask tests in the test output:

SKIPPED [30] /home/travis/build/ajdawson/eofs/lib/eofs/tests/test_solution.py:289: module 'dask.array' has __version__ None, required is: '1.0'

Should this just be 'dask' instead of 'dask.array'?

Author

Thanks for spotting this. Using dask.array is intentional: dask can be installed with only some of its components, so this ensures the array component is installed.

There is a problem with checking the version here, though. The version of Dask used by the old Python 3.4 tests had a different syntax, so I wanted to disable the tests for that version. However, dask.array doesn't have a __version__ attribute (that lives on the top-level dask namespace), so the tests got skipped all the time.

Since the Python 3.4 tests are no longer being used, I've removed the version constraint. There are now no skipped tests in any of the 'extras' test runs.
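
For reference, the skip now amounts to something like this (illustrative):

```python
import pytest

# Ensure the array component of dask is importable; otherwise skip these tests.
# Note there is no minversion check: dask.array has no __version__ attribute,
# so a minimum version would have to be checked on the top-level dask package.
da = pytest.importorskip('dask.array')
```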

@ajdawson
Owner

Thanks @ScottWales, this is a nice update. And thanks @rabernat and @navidcy for testing/review.

@ajdawson merged commit efb7f4b into ajdawson:master Apr 30, 2019
@ajdawson
Owner

ajdawson commented Apr 30, 2019

> @ajdawson, if this looks OK then I could add some mention to dask in the documentation.

@navidcy - some mention of dask in the documentation should definitely be made now. If you would like to have a go I will happily review.

@ScottWales deleted the dask branch May 1, 2019 00:00
@ajdawson mentioned this pull request May 2, 2019