Rewrite `HDBSCAN` python wrapper #6913


Merged

rapids-bot merged 17 commits into rapidsai:branch-25.08 from jcrist:cleanup-hdbscan

Jun 17, 2025

Member

jcrist commented Jun 16, 2025

This is pretty much a full rewrite of the python side of our HDBSCAN wrapper. There were a few goals here:

- Fix #6541 (HDBSCAN output arrays may become invalid)
- Significantly simplify and reduce the state stored on an instance of `HDBSCAN`, to ease #6711 (Port hdbscan proxies to use ProxyBase)
- Isolate the different code paths for initializing state (e.g. from `fit`, from sklearn, unpickling, ...) to better ensure we always have a valid `HDBSCAN` instance (and that resources are being properly tracked and freed)
- Remove several cases of poor Cython practice (e.g. allocating something via `new`, then passing it around as a Python int before later freeing it)
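To illustrate the second and third goals, here is a minimal pure-Python sketch of keeping all fitted results in one small state object and routing every initialization path through it. The names `_State`, `Clusterer`, and `from_fitted_arrays` are hypothetical and chosen only for illustration; this is not the code in this PR, and the real wrapper manages GPU arrays and C++ resources that this toy leaves out.

```python
class _State:
    """Hypothetical container for everything a fit produces."""

    def __init__(self, labels, probabilities):
        labels = tuple(labels)
        probabilities = tuple(probabilities)
        # Validity is checked in exactly one place, whichever path built us.
        if len(labels) != len(probabilities):
            raise ValueError("labels and probabilities must have equal length")
        self.labels = labels
        self.probabilities = probabilities


class Clusterer:
    """Toy estimator used only to show the structure."""

    def __init__(self, min_samples=5):
        self.min_samples = min_samples
        self._state = None  # None means "not fitted"

    def fit(self, X):
        # Path 1: stand-in for the real computation, labels everything as noise.
        n = len(X)
        self._state = _State([-1] * n, [0.0] * n)
        return self

    @classmethod
    def from_fitted_arrays(cls, labels, probabilities, **params):
        # Path 2: results computed elsewhere (e.g. by another library).
        self = cls(**params)
        self._state = _State(labels, probabilities)
        return self

    def __getstate__(self):
        # Path 3 (pickling): hyperparameters plus the single state object.
        return {"params": {"min_samples": self.min_samples}, "state": self._state}

    def __setstate__(self, data):
        self.__init__(**data["params"])
        self._state = data["state"]

    @property
    def labels_(self):
        if self._state is None:
            raise AttributeError("labels_ is only available after fitting")
        return list(self._state.labels)
```

Because every path ends by assigning one validated `_State`, an instance is either unfitted or fully consistent, which is the property the goals above are after.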

The changes I've made here accomplish all these goals. There are still further simplifications that could be made, but those will require some API changes to the C++ layer. For now I've punted on those; once this is in, I'll open some follow-up issues to discuss further improvements.

Note that for the cuml.accel layer we're not in a fully correct state yet. We're better off than we were before (and all tests continue to pass), but for sklearn <> cuml interop to be fully correct we'll want to finish up #6711. I've left that work out of this PR to minimize the already large diff.

Fixes #6541.

jcrist added 12 commits

June 16, 2025 10:09


- Move headers to separate file (3a4932d)
- WIP (f3e9f18)
- WIP2 (ed37475)
- More simplifications (a059aef)
- Support pickling HDBSCAN models (b18ece2)
- Port prediction functions (0c5a102)
- A few fixups (e02a24c)
- Fixups (8c8eed5)
- Fixup prediction (a686b74)
- Docstrings (8b553cb)
- Add from_sklearn (aa2ddf0)
- More docstring improvements (b80a9ba)

jcrist self-assigned this

jcrist requested review from a team as code owners

June 16, 2025 19:49

jcrist added the bug label

jcrist requested review from divyegala and teju85

June 16, 2025 19:49

jcrist added the Cython / Python and non-breaking labels

github-actions bot added the CMake label

jcrist commented

python/cuml/cuml/cluster/hdbscan/CMakeLists.txt (resolved)

python/cuml/cuml/cluster/hdbscan/headers.pxd (resolved)

python/cuml/cuml/tests/test_hdbscan.py

    
    #
    import cupy as cp
    import hdbscan

Member Author

jcrist Jun 16, 2025

All changes to the tests are due to changes made to _condense_hierarchy/_extract_clusters to keep them out of the HDBSCAN class definition.

jcrist commented


python/cuml/cuml/cluster/hdbscan/hdbscan.pyx

    
        utilizing plotting tools. This requires the `hdbscan` CPU
        Python package to be installed.
    gen_single_linkage_tree_ : bool, optional (default=False)

Member Author

jcrist Jun 16, 2025

These parameters were never defined.

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx

    
    labels_ = CumlArrayDescriptor()
    probabilities_ = CumlArrayDescriptor()
    outlier_scores_ = CumlArrayDescriptor()

Member Author

jcrist Jun 16, 2025

This attribute was never defined.

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx

    
    # Minimum Spanning Tree
    mst_src_ = CumlArrayDescriptor()
    mst_dst_ = CumlArrayDescriptor()
    mst_weights_ = CumlArrayDescriptor()

Member Author

jcrist Jun 16, 2025

These attributes were undocumented before, and are unneeded with the new organization. I've opted to remove them fully.

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx

    
    cdef lib.HDBSCANParams params
    params.min_samples = self.min_samples
    # params.alpha = self.alpha
    params.alpha = self.alpha

Member Author

jcrist Jun 16, 2025

`alpha` was never forwarded properly before.

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx

    
        self.labels_test = CumlArray.empty(n_leaves, dtype="int32")
        self.probabilities_test = CumlArray.empty(n_leaves, dtype="float32")
    def cpu_to_gpu(self):

Member Author

jcrist Jun 16, 2025

`cpu_to_gpu`/`gpu_to_cpu`/`get_attr_names` will all go away in the follow-up that ports this over to `InteropMixin`. I've ensured we're at least as correct as we were before, but I wouldn't fuss too much about the code quality of these methods; they'll be ripped out pretty soon.


Fixup cpu_to_gpu

More hacks to work around deficiencies of `UniversalBase`. Will remove all of these in the followup.

divyegala reviewed

python/cuml/cuml/cluster/hdbscan/CMakeLists.txt (resolved)

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx (resolved)

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx (outdated, resolved)

python/cuml/cuml/cluster/hdbscan/headers.pxd (resolved)

divyegala reviewed

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx (resolved)

jcrist added 2 commits

June 16, 2025 14:31


- Add comments on _HDBSCANState attributes (09b11f4)
- Deprecate cuml.cluster.hdbscan.prediction namespace (9ba30fe)
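For readers unfamiliar with how a Python namespace can be deprecated while old imports keep working, the following is a generic sketch using a module-level `__getattr__` (PEP 562). It is not the code from commit 9ba30fe: the module names `toy.hdbscan` and `toy.hdbscan.prediction`, the placeholder function, and the choice of `FutureWarning` are all assumptions made for illustration.

```python
import types
import warnings


def _all_points_membership_vectors(clusterer):
    """Placeholder standing in for a function that moved to a new location."""
    return "membership vectors for %r" % (clusterer,)


# New home of the function (hypothetical module name).
new_home = types.ModuleType("toy.hdbscan")
new_home.all_points_membership_vectors = _all_points_membership_vectors

# Old namespace, kept only as a thin shim that warns on use.
old_home = types.ModuleType("toy.hdbscan.prediction")


def _deprecated_getattr(name):
    # Called only when `name` is not found on the old module (PEP 562).
    if not name.startswith("_") and hasattr(new_home, name):
        warnings.warn(
            f"{old_home.__name__}.{name} is deprecated, "
            f"use {new_home.__name__}.{name} instead",
            FutureWarning,  # assumption: the real warning class may differ
            stacklevel=2,
        )
        return getattr(new_home, name)
    raise AttributeError(
        f"module {old_home.__name__!r} has no attribute {name!r}"
    )


# In a real package this would be `def __getattr__(name)` in the old module file.
old_home.__getattr__ = _deprecated_getattr
```

Accessing `old_home.all_points_membership_vectors` then returns the same function object as the new location while emitting one warning per access.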

jcrist force-pushed the cleanup-hdbscan branch from fff503a to 9ba30fe

June 16, 2025 21:44


- Avoid unnecessary copy in init from pickle/sklearn (537bbd0)
- Fixup xfail list (c558015)

divyegala approved these changes


Member Author

jcrist commented Jun 17, 2025

/merge

rapids-bot bot merged commit 767d7fe into rapidsai:branch-25.08

67 checks passed

jcrist deleted the cleanup-hdbscan branch

June 17, 2025 01:11

This was referenced Jul 1, 2025

[BUG] HDBSCAN condensed tree and single linkage raise errors #5720

Closed

[BUG] Can't reconstruct serialized HDBSCAN model in cuml-cpu conda environment #5305

Closed

This was referenced Oct 8, 2025

A few HDBSCAN cleanups #7319

Merged

[TRACKER] Cleanup python estimator implementations #7317

Open

rapids-bot bot pushed a commit that referenced this pull request


          A few HDBSCAN cleanups (#7319)

57bd81a

Finishing up a few lingering tasks needed to resolve `HDBSCAN` for #7317. Most of the work for `HDBSCAN` regarding this was already done in #6913.

- Removes validation and hyperparam mutation in `__init__`
- Releases the GIL where it makes sense
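As a rough illustration of the first bullet, the sketch below follows the scikit-learn convention of storing hyperparameters untouched in `__init__` and deferring validation to `fit`. `ToyEstimator` and its accepted metric names are hypothetical and are not taken from the cuML source.

```python
class ToyEstimator:
    """Hypothetical estimator showing where validation lives."""

    def __init__(self, min_cluster_size=5, metric="euclidean"):
        # Store hyperparameters exactly as given: no validation, no mutation.
        self.min_cluster_size = min_cluster_size
        self.metric = metric

    def get_params(self):
        # Round-trips whatever the user passed, which keeps cloning reliable.
        return {
            "min_cluster_size": self.min_cluster_size,
            "metric": self.metric,
        }

    def fit(self, X):
        # Validate at fit time, normalizing into locals only.
        if self.min_cluster_size < 2:
            raise ValueError("min_cluster_size must be >= 2")
        metric = self.metric.lower()
        if metric not in {"euclidean", "l2"}:
            raise ValueError(f"unsupported metric: {self.metric!r}")
        self.n_samples_ = len(X)
        return self
```

With this layout, constructing an estimator never fails and `get_params()` returns exactly what was passed in, while invalid values surface on `fit`.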

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #7319


Labels

bug CMake Cython / Python non-breaking