Skip to content

Conversation

@jcrist
Copy link
Member

@jcrist jcrist commented Jun 16, 2025

This is pretty much a full rewrite of the python side of our HDBSCAN wrapper. There were a few goals here:

  • Fix HDBSCAN output arrays may become invalid #6541
  • Significantly simplify and reduce the state stored on an instance of HDBSCAN to ease Port hdbscan proxies to use ProxyBase #6711
  • Isolate the different code paths for initializing state (e.g. from fit, from sklearn, unpickling, ...) to better ensure we always have a valid HDBSCAN instance (and that resources are being properly tracked and freed)
  • Remove several cases of poor cython practice (e.g. allocating something via new then passing it around via a python int before later freeing)

The changes I've made here accomplish all these goals. There are still further simplifications that could be made, but those will require some API changes to the C++ layer. For now I've punted on those, once this is in I'll open some follow-up issues to discuss further improvements.

Note that for the cuml.accel layer we're not in a fully correct state yet. We're better off than we were before (and all tests continue to pass), but for sklearn <> cuml interop to be fully correct we'll want to finish up #6711. I've left that work out of this PR to minimize the already large diff.

Fixes #6541.

@jcrist jcrist self-assigned this Jun 16, 2025
@jcrist jcrist requested review from a team as code owners June 16, 2025 19:49
@jcrist jcrist added the bug Something isn't working label Jun 16, 2025
@jcrist jcrist requested review from divyegala and teju85 June 16, 2025 19:49
@jcrist jcrist added Cython / Python Cython or Python issue non-breaking Non-breaking change labels Jun 16, 2025
@github-actions github-actions bot added the CMake label Jun 16, 2025
#

import cupy as cp
import hdbscan
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes to the tests are due to changes made to _condense_hierarchy/_extract_clusters to keep them out of the HDBSCAN class definition.

utilizing plotting tools. This requires the `hdbscan` CPU
Python package to be installed.
gen_single_linkage_tree_ : bool, optional (default=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters were never defined.


labels_ = CumlArrayDescriptor()
probabilities_ = CumlArrayDescriptor()
outlier_scores_ = CumlArrayDescriptor()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This attribute was never defined

# Minimum Spanning Tree
mst_src_ = CumlArrayDescriptor()
mst_dst_ = CumlArrayDescriptor()
mst_weights_ = CumlArrayDescriptor()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These attributes were undocumented before, and are unneeded with the new organization. I've opted to remove them fully.

cdef lib.HDBSCANParams params
params.min_samples = self.min_samples
# params.alpha = self.alpha
params.alpha = self.alpha
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alpha was never forwarded properly before.


self.labels_test = CumlArray.empty(n_leaves, dtype="int32")
self.probabilities_test = CumlArray.empty(n_leaves, dtype="float32")
def cpu_to_gpu(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpu_to_gpu/gpu_to_cpu/get_attr_names will all go away in the followup to port over to InteropMixin. I've ensured we're at least as correct as we were before, but I wouldn't fuss too much about the code quality of these methods - they'll be ripped out pretty soon.

More hacks to workaround deficiences of `UniversalBase`. Will remove all
of these in the followup.
@jcrist jcrist force-pushed the cleanup-hdbscan branch from fff503a to 9ba30fe Compare June 16, 2025 21:44
@jcrist
Copy link
Member Author

jcrist commented Jun 17, 2025

/merge

@rapids-bot rapids-bot bot merged commit 767d7fe into rapidsai:branch-25.08 Jun 17, 2025
67 checks passed
@jcrist jcrist deleted the cleanup-hdbscan branch June 17, 2025 01:11
rapids-bot bot pushed a commit that referenced this pull request Oct 8, 2025
Finishing up a few lingering tasks needed to resolve `HDBSCAN` for #7317. Most of the work for `HDBSCAN` regarding this was already done in #6913.

- Removes validation and hyperparam mutation in `__init__`
- Releases the GIL where it makes sense

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #7319
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working CMake Cython / Python Cython or Python issue non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HDBSCAN output arrays may become invalid

2 participants