Several improvements to Ridge
#7410
Conversation
A quick benchmark comparing the new cupy-based solver, `bench.py`:

```python
from time import perf_counter
from itertools import product

import cuml
from cuml.datasets import make_regression

N_RUNS = 5
N_FEATURES = [10, 100, 1000]
N_SAMPLES = [1000, 10000, 50000]

for fit_intercept in [True, False]:
    print(f"# {fit_intercept = }")
    for n_features, n_samples in product(N_FEATURES, N_SAMPLES):
        X, y = make_regression(
            n_features=n_features, n_samples=n_samples, random_state=42
        )
        start = perf_counter()
        for _ in range(N_RUNS):
            cuml.Ridge(fit_intercept=fit_intercept, solver="svd").fit(X, y)
        duration = (perf_counter() - start) / N_RUNS
        print(f"- {X.shape}: {duration * 1e3:.3f} ms")
```

*(The collapsed "On main" and "On this PR" result sections were not preserved in this export.)*
| - "sklearn.tests.test_common::test_search_cv[GridSearchCV(cv=2,estimator=Ridge(),param_grid={'alpha':[0.1,1.0]})-check_complex_data]" | ||
| - "sklearn.tests.test_common::test_search_cv[GridSearchCV(cv=2,estimator=Ridge(),param_grid={'alpha':[0.1,1.0]})-check_dtype_object]" | ||
| - "sklearn.tests.test_common::test_search_cv[GridSearchCV(cv=2,estimator=Ridge(),param_grid={'alpha':[0.1,1.0]})-check_estimators_nan_inf]" | ||
| - "sklearn.tests.test_common::test_search_cv[GridSearchCV(cv=2,estimator=Ridge(),param_grid={'alpha':[0.1,1.0]})-check_fit1d]" |
Even though we're now more compatible with sklearn, we only see new xfails instead of any new passing tests. The new failures are due to the following:

- Several `test_search_cv` tests are now accelerated that previously would fall back, since the input data was wider than long. These tests just check for nice errors when there's nan/inf in the data; we'll fix them uniformly across all estimators when we revamp data ingestion. Nothing to do here now.
- Some `test_solver_consistency` tests check that the different solvers produce similar enough results. We're very close, but don't quite match with the tolerances used in the tests here.
- A test for `n_iter`: previously we'd fall back for the fit combo used; we now run accelerated but lack the attribute.
- A test for array-like `alpha`. We support all the cases used, but don't match their error message test, since their error message has a bug in it (the incorrect values are accidentally flipped).
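For reference, the closed-form SVD solve underlying both the wide-`X` support and per-target array-like `alpha` can be sketched in plain numpy. This is an illustration of the math only, not cuml's implementation; the function name is hypothetical.

```python
import numpy as np

def ridge_svd(X, y, alpha):
    """Closed-form ridge via SVD: w = V @ diag(s / (s^2 + alpha)) @ U.T @ y.

    The reduced SVD works uniformly for tall (n_samples >= n_features)
    and wide (n_features > n_samples) X. For 2D y, alpha may be an
    array with one entry per target. Hypothetical sketch, not cuml's API.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y  # (k,) for 1D y, (k, n_targets) for 2D y
    if np.ndim(y) == 1:
        d = s / (s**2 + alpha)  # scalar alpha
        return Vt.T @ (d * UTy)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=X.dtype), (y.shape[1],))
    # Per-target shrinkage factors, shape (k, n_targets)
    d = s[:, None] / (s[:, None] ** 2 + alpha[None, :])
    return Vt.T @ (d * UTy)

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 50))  # wide: n_features > n_samples
y = rng.normal(size=(20, 2))
coef = ridge_svd(X, y, alpha=[0.5, 2.0])

# Sanity check against the normal-equations solution for each target:
for j, a in enumerate([0.5, 2.0]):
    ref = np.linalg.solve(X.T @ X + a * np.eye(50), X.T @ y[:, j])
    assert np.allclose(coef[:, j], ref, atol=1e-6)
```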
```
# column 2D y. The sklearn 1.6+ behavior is what we implement in
# cuml.Ridge, here we adjust the shape of `coef_` after the fit to
# match the older behavior. This will also trickle down to change the
# output shape of `predict` to match the older behavior transparently.
```
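The shim described in that comment can be sketched in plain numpy: sklearn 1.6+ (and `cuml.Ridge`) keep `coef_` 2D with shape `(1, n_features)` when fit on a column 2D `y` of shape `(n_samples, 1)`, while older sklearn raveled it to `(n_features,)`. The helper name below is hypothetical.

```python
import numpy as np

def adjust_coef_for_legacy_sklearn(coef):
    """Mimic pre-1.6 sklearn: a single-target fit on a column 2D y
    yields a 1D coef_ of shape (n_features,), rather than the
    (1, n_features) that sklearn 1.6+ and cuml.Ridge produce.
    Hypothetical helper, for illustration only."""
    if coef.ndim == 2 and coef.shape[0] == 1:
        return coef.ravel()
    return coef

coef_new = np.array([[1.5, -0.25, 3.0]])  # (1, 3): sklearn 1.6+ style
coef_old = adjust_coef_for_legacy_sklearn(coef_new)
assert coef_old.shape == (3,)             # pre-1.6 style
```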
This is gross, but isolated to cuml.accel so I'm ok with it.
csadorf
left a comment
Great work!
I ran comprehensive benchmarks comparing the nightly build against this PR.

**Performance Results**

This PR delivers significant performance improvements across the board with no regressions.

**Performance Summary by Problem Size**

*(summary table not preserved in this export)*

**New Capabilities Benchmarked**

This PR also enables 8 previously unsupported configurations, all with competitive performance.

**Detailed Benchmark Results**

*(collapsed "Nightly Build Results" and "PR #7410 Results" sections, including version info, not preserved)*

**Benchmark Environment**

*(environment details not preserved)*
- Move SVD solver to pure python, simplifying maintenance and improving flexibility.
- Add support for multi-target inputs
- Add support for array-like `alpha`
- Add support for `n_features > n_samples` in SVD solver
- Add `copy_X` parameter, allowing the solver to mutate X instead of making a copy.
- Fix bug where solver would accidentally mutate y and sample_weight in some cases, and add a test.
- Improve testing coverage
- Cleanup docstring a bit
csadorf
left a comment
🚢
/merge
This PR includes several improvements to `Ridge`. I started trying to fix one issue, but they were all so coupled it's a bit of a large PR. Hopefully still understandable though.

- Move SVD solver to pure python. This simplifies maintenance, and improves flexibility. The new python-based solver is as fast or faster than the old C++ version, which makes sense since it's (mostly) just a thin layer over cublas and cusolver calls.
- Adds support for multi-target regression to `Ridge`. Fixes rapidsai#4412.
- Adds support for `X` with more features than samples to `Ridge` with `solver="svd"`. Fixes rapidsai#7198.
- Adds support for array-like `alpha`, improving our sklearn compatibility. Fixes a long-standing TODO in our docs.
- Adds a `copy_X` parameter, mirroring the sklearn equivalent. This lets the solver mutate X instead of making a copy, reducing memory usage.
- Fixes a bug where the solver would accidentally mutate `y` and `sample_weight` in some configurations, and adds a test.
- Moves `Ridge` tests out into their own file, the `test_linear_models.py` file was getting a bit untenable.
- Vastly improved testing coverage and hygiene.
- Cleanup docstrings to match conventions.

It does include one breaking change, which I think we want to do based on other conversations. Previously if a user explicitly passed `solver="eig"` but the `"eig"` solver couldn't handle the inputs, it'd warn and fall back to `"svd"`. We now error in this case, alerting the user that the solver they specified isn't supported. This better matches sklearn conventions, and is something I think we'll want to apply to the `LinearRegression` implementation as well (see rapidsai#7355 (comment)). With the default of `solver="auto"` we'll still fall back to `"svd"` in those cases; we only fail if the user _explicitly requested_ a solver that doesn't support the input types.

Fixes rapidsai#4412. Fixes rapidsai#7198. Followup to rapidsai#7330.
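The breaking change amounts to a small dispatch-policy shift, which can be sketched as follows. The helper name and the exact condition under which `"eig"` can't handle the input (assumed here to be wide `X`) are illustrative assumptions; cuml's real logic lives inside `Ridge.fit`.

```python
def resolve_solver(solver: str, n_samples: int, n_features: int) -> str:
    """Sketch of the new dispatch policy (hypothetical helper):
    solver="auto" silently falls back to "svd", but an explicitly
    requested solver that can't handle the input now raises."""
    eig_ok = n_samples >= n_features  # assumed limitation of "eig"
    if solver == "auto":
        # Default: silent fallback to "svd" when "eig" can't help.
        return "eig" if eig_ok else "svd"
    if solver == "eig" and not eig_ok:
        # Previously a warning + fallback; now an error.
        raise ValueError(
            "solver='eig' does not support X with more features than "
            "samples; use solver='svd' or solver='auto'"
        )
    return solver

assert resolve_solver("auto", 100, 10) == "eig"
assert resolve_solver("auto", 10, 100) == "svd"  # silent fallback
assert resolve_solver("svd", 10, 100) == "svd"
try:
    resolve_solver("eig", 10, 100)
except ValueError:
    pass  # explicit "eig" on wide X now raises
```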
Authors:
- Jim Crist-Harif (https://github.com/jcrist)

Approvers:
- Simon Adorf (https://github.com/csadorf)

URL: rapidsai#7410