-
Notifications
You must be signed in to change notification settings - Fork 611
Rewrite HDBSCAN python wrapper
#6913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| # | ||
|
|
||
| import cupy as cp | ||
| import hdbscan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All changes to the tests are due to changes made to _condense_hierarchy/_extract_clusters to keep them out of the HDBSCAN class definition.
| utilizing plotting tools. This requires the `hdbscan` CPU | ||
| Python package to be installed. | ||
| gen_single_linkage_tree_ : bool, optional (default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These parameters were never defined.
|
|
||
| labels_ = CumlArrayDescriptor() | ||
| probabilities_ = CumlArrayDescriptor() | ||
| outlier_scores_ = CumlArrayDescriptor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This attribute was never defined
| # Minimum Spanning Tree | ||
| mst_src_ = CumlArrayDescriptor() | ||
| mst_dst_ = CumlArrayDescriptor() | ||
| mst_weights_ = CumlArrayDescriptor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These attributes were undocumented before, and are unneeded with the new organization. I've opted to remove them fully.
| cdef lib.HDBSCANParams params | ||
| params.min_samples = self.min_samples | ||
| # params.alpha = self.alpha | ||
| params.alpha = self.alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alpha was never forwarded properly before.
|
|
||
| self.labels_test = CumlArray.empty(n_leaves, dtype="int32") | ||
| self.probabilities_test = CumlArray.empty(n_leaves, dtype="float32") | ||
| def cpu_to_gpu(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpu_to_gpu/gpu_to_cpu/get_attr_names will all go away in the followup to port over to InteropMixin. I've ensured we're at least as correct as we were before, but I wouldn't fuss too much about the code quality of these methods - they'll be ripped out pretty soon.
More hacks to workaround deficiences of `UniversalBase`. Will remove all of these in the followup.
|
/merge |
Finishing up a few lingering tasks needed to resolve `HDBSCAN` for #7317. Most of the work for `HDBSCAN` regarding this was already done in #6913. - Removes validation and hyperparam mutation in `__init__` - Releases the GIL where it makes sense Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Divye Gala (https://github.com/divyegala) URL: #7319
This is pretty much a full rewrite of the python side of our
HDBSCANwrapper. There were a few goals here:HDBSCANoutput arrays may become invalid #6541HDBSCANto ease Porthdbscanproxies to useProxyBase#6711HDBSCANinstance (and that resources are being properly tracked and freed)newthen passing it around via a pythonintbefore later freeing)The changes I've made here accomplish all these goals. There are still further simplifications that could be made, but those will require some API changes to the C++ layer. For now I've punted on those, once this is in I'll open some follow-up issues to discuss further improvements.
Note that for the
cuml.accellayer we're not in a fully correct state yet. We're better off than we were before (and all tests continue to pass), but forsklearn <> cumlinterop to be fully correct we'll want to finish up #6711. I've left that work out of this PR to minimize the already large diff.Fixes #6541.