New API: PathGraph dataclass #246

jni · 2025-06-02T03:25:37Z

The Skeleton class is a bit of a hodgepodge of data, generated data,
things that should be properties, manually cached properties, and
functions that should not be part of the class at all. Thanks to the
helpful prodding of @kevinyamauchi, this PR attempts to simplify the
concepts and data structures in skan into a simple class that can be
created without an image — since the path graph, not the image, is the
core of the computational abilities of skan.

This is still a work in progress but should already be useful: if you
have a graph as a scipy.sparse.csr_array (note: csr_array, not the
deprecated csr_matrix used by skan so far) generated through your own
means, you can make a "Skeleton" as follows:

from skan.csr import PathGraph, Skeleton, summarize

g = PathGraph.from_graph(
        node_coordinates=coordinates,  # (n, ndim) NumPy array
        graph=graph,  # scipy.sparse.csr_array
        )

s = Skeleton.from_path_graph(g)
summary = summarize(s, separator='_')
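For readers who do not yet have such a graph, the two inputs above can be built by hand. This is a minimal sketch, assuming an undirected graph whose edge weights are the Euclidean distances between node coordinates. The toy coordinates and edge list are made up for illustration, and whether skan expects a symmetric, distance-weighted matrix here is an assumption rather than something this PR states.

```python
import numpy as np
from scipy import sparse

# Toy data (made up): four nodes in 2D, connected in a chain 0-1-2-3.
coordinates = np.array([[0, 0], [0, 1], [0, 2], [1, 2]], dtype=float)
src = np.array([0, 1, 2])
dst = np.array([1, 2, 3])

# Assumption: edge weights are Euclidean distances between endpoints.
weights = np.linalg.norm(coordinates[src] - coordinates[dst], axis=1)

n = coordinates.shape[0]
upper = sparse.csr_array((weights, (src, dst)), shape=(n, n))

# Symmetrize so that every edge is stored in both directions.
graph = sparse.csr_array(upper + upper.T)
```

The resulting `coordinates` and `graph` have the types named in the comments of the snippet above: an (n, ndim) NumPy array and a `scipy.sparse.csr_array`.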

Still to do in this PR:

- allow making PathGraphs from Skeletons
- allow summarize to take in PathGraphs directly

In the future, we might want to deprecate Skeleton altogether, but I'm
happy to do this over a long time.

@kevinyamauchi, I'm curious what you think about paths being a data
attribute, even though it is generated. I think it's expensive enough to
compute that we want it to be data and serializable. But it kinda breaks
the dataclass paradigm a little bit.

@DragaDoncila, you should be able to pull down this branch and use it on
your networkx tracking graphs after exporting them to csr arrays (I'm
pretty sure that's built-in to nx).
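The export mentioned above can be sketched with `networkx.to_scipy_sparse_array` (available in networkx 2.7 and later). The toy graph and the node attribute names `t`, `y`, and `x` are hypothetical stand-ins for a tracking graph, not taken from any real dataset.

```python
import networkx as nx
import numpy as np

# Toy tracking-style graph; attribute names 't', 'y', 'x' are hypothetical.
g = nx.DiGraph()
g.add_node('a', t=0, y=1.0, x=1.0)
g.add_node('b', t=1, y=1.0, x=2.0)
g.add_node('c', t=2, y=2.0, x=3.0)
g.add_node('d', t=2, y=0.0, x=3.0)
g.add_edges_from([('a', 'b'), ('b', 'c'), ('b', 'd')])

# Fix the node order once so matrix rows and coordinate rows line up.
nodelist = list(g.nodes)
adj = nx.to_scipy_sparse_array(g, nodelist=nodelist, format='csr')

# (n, ndim) coordinate array in the same node order as the matrix.
coords = np.array(
    [[g.nodes[n][k] for k in ('t', 'y', 'x')] for n in nodelist],
    dtype=float,
)
```

A directed graph gives an asymmetric matrix. If a symmetric one is needed, `g.to_undirected()` can be exported instead; which of the two skan expects is not stated in this thread.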

kevinyamauchi

I like the direction this is going!

kevinyamauchi · 2025-06-02T07:06:30Z

src/skan/csr.py

+    - [Nt], [Np], Nr, Nc: the dimensions of the source image
+    """
+    node_coordinates: np.ndarray | None
+    node_values: np.ndarray | None


What is node_values used for? Are these annotations computed from an accompanying image? If so, I don't think this should be on the core graph model because the data structure one wants to use for this will be very application specific.

Well, this is one of the things that skan computes statistics on: the values along the path. If you don't want them, you can just leave it as None, but if you add them, skan can quickly compute statistics on them, such as the mean intensity along a path. (Useful for filtering branches.) Definitely open to discussion but I think that they are generic enough to keep on the graph. Since you mention application-specific data structures, one possible enhancement would be for this to have shape (n_nodes, n_props) and you would be able to compute stats on many attributes (branch image intensity, branch thickness, branch Hessian eigenvalues...) simultaneously. Alternatively, it could be a tuple of arrays, which could have arbitrary shapes, as long as the leading dimension of each had length n_nodes.

Thoughts?
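To make the (n_nodes, n_props) idea concrete, here is a minimal sketch of per-path means over several properties at once. It assumes a path membership matrix following the convention in this PR's diff (`paths[i, j] = 1` iff node j is in path i). The property columns and their values are made up, and this is not how skan computes its statistics.

```python
import numpy as np
from scipy import sparse

# Two paths over five nodes; node 2 belongs to both paths.
paths = sparse.csr_array(
    np.array([[1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1]], dtype=float)
)

# Hypothetical node_values of shape (n_nodes, n_props):
# column 0 = intensity, column 1 = thickness.
node_values = np.array([
    [1.0, 10.0],
    [2.0, 10.0],
    [3.0, 20.0],
    [4.0, 30.0],
    [5.0, 40.0],
])

# Number of nodes in each path, as a column vector for broadcasting.
nodes_per_path = np.asarray(paths.sum(axis=1)).reshape(-1, 1)

# Sparse-dense product sums each property along each path.
path_means = (paths @ node_values) / nodes_per_path
```

One matrix product covers every property column, which is the appeal of the 2D layout over a tuple of arrays.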

Got it. I think this is a sensible approach.

Intuitively, I might have gone for two separate objects (e.g., PathGraph, and PathData) and then functions that take a PathGraph and PathData object to do the computations you mentioned. You could still have a convenience class that has both a PathGraph and PathData object (via composition not inheritance). I think this approach has a couple of advantages:

- You keep the validation and serialization/deserialization logic with the data and it doesn't have to be repeated across the various functions that consume the data.
- If you have multiple types of PathData (e.g., one array vs. tuple of arrays), you can dispatch the analysis functions based on the class rather than inspecting the shapes, etc. I find this to be easier to read/understand. The functions also don't have to do as much checking, etc. at runtime because the input is already validated.
- As a developer depending on a library, I personally find it easier to grok the framework when the data being passed around is in a class that has been validated rather than plain arrays (e.g., you can read the docstring about it in your IDE and you don't have to do any checks because it is already validated).
- You can still "hide" these from users who don't want to know the details with convenience classes and convenience constructor/export methods.

Of course this comes with the trade-off that you now have extra custom dataclasses. I think some would argue this hurts interoperability with other libraries. In most cases users will want to do a chain of operations in library A and then pass the result off to library B. I think this is largely addressed by having convenience constructors and export methods. For me, this is a reasonable trade-off for the advantages I listed above, but I think the other way is also a reasonable approach.
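The dispatch-on-class idea can be sketched with the standard library alone. All class and function names below (`SingleArrayData`, `MultiArrayData`, `mean_values`) are hypothetical illustrations, not part of skan, and plain tuples stand in for arrays.

```python
from dataclasses import dataclass
from functools import singledispatch
from statistics import fmean


@dataclass
class SingleArrayData:
    """One value per node (hypothetical PathData variant)."""
    values: tuple[float, ...]

    def __post_init__(self):
        # Validation lives with the data, not in every consumer.
        if len(self.values) == 0:
            raise ValueError('values must not be empty')


@dataclass
class MultiArrayData:
    """Several per-node arrays (hypothetical PathData variant)."""
    arrays: tuple[tuple[float, ...], ...]

    def __post_init__(self):
        if len({len(a) for a in self.arrays}) != 1:
            raise ValueError('all arrays must have the same length')


@singledispatch
def mean_values(data):
    raise TypeError(f'unsupported data type: {type(data).__name__}')


@mean_values.register
def _(data: SingleArrayData):
    return [fmean(data.values)]


@mean_values.register
def _(data: MultiArrayData):
    return [fmean(a) for a in data.arrays]
```

Each implementation can assume its input was validated at construction, so no shape inspection happens inside the analysis functions.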

I definitely do worry about too many levels of indirection here. For example, one could say that the coordinates themselves are also optional, really. So we can do:

- CompleteSkeleton
  - AugmentedSpatialPathGraph
    - SpatialPathGraph
      - PathGraph
      - Coordinates
    - PathData
  - Images
    - SkeletonImage
    - SourceImage

I think I would rather just have optional things, especially when it comes to the graph, which could/should have:

- nodes
- edges
- (node properties)
- (edge properties)
- ((node coordinates))

The coordinates indeed feel even more optional for a "graph", and define rather a "spatial" graph. Also, things like spatial metadata (scale, offset/translate, pixel meaning) perhaps belong in another level of abstraction. But again I'm super worried about the depth of the tree here. ("Flat is better than nested.")

Here's another alternative, only two levels:

- SpatialPathGraph
  - pathgraph:
    - adjacency (defines nodes implicitly, edges/scalar edge property explicitly)
    - node properties
  - spatial info
    - node coordinates
    - edge arcs? (this could be where paths lives)
    - coordinate transformations (similar/identical to ngff)

It's late here, but I kinda like it. Thoughts?
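As a rough illustration of the two-level layout, here is a sketch using plain dataclasses. The class names `GraphPart` and `SpatialPart` and all field names are hypothetical, chosen only to mirror the outline above, and `Any` stands in for the actual array types.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class GraphPart:
    """Abstract graph: adjacency plus optional node properties."""
    adjacency: Any
    node_properties: Optional[Any] = None


@dataclass
class SpatialPart:
    """Spatial information layered on top of the abstract graph."""
    node_coordinates: Optional[Any] = None
    edge_arcs: Optional[Any] = None  # possible home for paths
    coordinate_transforms: list = field(default_factory=list)


@dataclass
class SpatialPathGraph:
    pathgraph: GraphPart
    spatial: Optional[SpatialPart] = None

    @property
    def is_spatial(self) -> bool:
        # True only when node coordinates are actually present.
        return (
            self.spatial is not None
            and self.spatial.node_coordinates is not None
        )


abstract = SpatialPathGraph(GraphPart(adjacency=[[0, 1], [1, 0]]))
embedded = SpatialPathGraph(
    GraphPart(adjacency=[[0, 1], [1, 0]]),
    SpatialPart(node_coordinates=[[0.0, 0.0], [0.0, 1.0]]),
)
```

Keeping the spatial part optional lets the same top-level class represent both an abstract graph and a spatially embedded one without adding further levels.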

Fwiw (I would say it's not worth much), my gut feeling leans towards the last presented option, SpatialPathGraph, as I feel like it's a natural differentiation between the abstract graph properties and the spatial information that is strictly speaking optional (but in reality fundamental) to most of what people would want to do with a skeleton.

I'm semi-surprised not to see edge properties also on the pathgraph when node properties are, as I would put those on the same "level" conceptually, but could be convinced otherwise.

My feelings on this are not strong, as our use-case for skan is pretty limited at the moment, so I'm more than happy for this opinion to be discounted!

I agree, I think SpatialPathGraph is the best option so far!

@kevinyamauchi fyi:

I started playing around with the ideas above and I was having a bit of trouble expressing the NGFF model in dataclasses*. Ultimately I ran out of time cos @DragaDoncila needs this for the Janelia trackathon, so I'm going to merge this as-is, which already allows a lot of what you wanted to do, and then we can make a new API or build upon this one while deprecating some attributes — I tried to make it as small to update as I could should we go with SpatialGraph in the future.

*: I think I need a discriminated union for NGFF transforms and I'm not sure whether it's enough to just do a union of several dataclasses or whether I need to be more clever. It's probably simple but I just ran out of time to play with it, and it's more of a priority to express the things expressed in this PR.
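One way the discriminated union could look is a plain union of dataclasses, each carrying a literal `type` tag. This is a minimal sketch covering only the NGFF `scale` and `translation` transforms. The names `Scale`, `Translation`, `transform_from_dict`, and `apply_transform` are hypothetical, and this is not the design that was adopted.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class Scale:
    scale: tuple[float, ...]
    type: Literal['scale'] = 'scale'


@dataclass(frozen=True)
class Translation:
    translation: tuple[float, ...]
    type: Literal['translation'] = 'translation'


Transform = Union[Scale, Translation]

# The 'type' tag selects the class. For these two transforms the tag
# is also the name of the key that holds the parameters.
_BY_TAG = {'scale': Scale, 'translation': Translation}


def transform_from_dict(d: dict) -> Transform:
    tag = d['type']
    if tag not in _BY_TAG:
        raise ValueError(f'unknown transform type: {tag!r}')
    return _BY_TAG[tag](tuple(d[tag]))


def apply_transform(transform: Transform, point: tuple) -> tuple:
    if isinstance(transform, Scale):
        return tuple(p * s for p, s in zip(point, transform.scale))
    return tuple(p + t for p, t in zip(point, transform.translation))
```

The dictionary lookup keeps parsing in one place. Adding a transform means adding one dataclass and one entry in `_BY_TAG`, although transforms whose parameters are not stored under their own tag name would need a custom constructor.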

kevinyamauchi · 2025-06-02T07:22:49Z

src/skan/csr.py

+    node_coordinates: np.ndarray | None
+    node_values: np.ndarray | None
+    graph: scipy.sparse.csr_array  # pixel/node-id neighbor graph
+    paths: scipy.sparse.csr_array  # paths[i, j] = 1 iff coord j is in path i


I am open to having the paths property as I know that it can take some time to generate so it would be nice to be able to serialize it. The downside is that it is possible for the graph and paths to get out of sync. For big graphs, this is probably worth the cost. I suppose these could also be properties without setters that only get mutated by class methods that make sure they are updated in sync.

> I am open to having the paths property as I know that it can take some time to generate so it would be nice to be able to serialize it.

Do you mean as an attribute?

> The downside is that it is possible for the graph and paths to get out of sync.

I think an important idea behind this refactor is that these attributes will be ~read-only after construction. So I'm not super concerned about this. We should just keep any in-place modifications well contained.

> > I am open to having the paths property as I know that it can take some time to generate so it would be nice to be able to serialize it.

> Do you mean as an attribute?

Yes, sorry.

> I think an important idea behind this refactor is that these attributes will be ~read-only after construction. So I'm not super concerned about this. We should just keep any in-place modifications well contained.

Will you do some validation on input to make sure they are compatible or should that rest on the user? I think both are fine - just trying to understand your mental model for what the dataclass should/shouldn't do.

> Will you do some validation on input to make sure they are compatible or should that rest on the user? I think both are fine - just trying to understand your mental model for what the dataclass should/shouldn't do.

It's kind of you to assume I have a mental model. 😂

Basically when you say "validation" I think "ok let's go full pydantic". 😂 I also think that validation would be as expensive as a rebuild. So, mostly, I'm thinking, user beware. Ser/deser should be simple and not mess with the data, and any messing that happens beyond those bounds is equivalent to corrupting data. Maybe?
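The "read-only after construction" idea can be sketched with a frozen dataclass, where any change goes through a rebuild that keeps the graph and paths in sync. Everything here is hypothetical: `ReadOnlyPathGraph` and `find_paths` are illustrative names, tuples stand in for sparse arrays, and the PathGraph in this PR is not claimed to be frozen.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ReadOnlyPathGraph:  # hypothetical name
    adj: tuple    # stand-in adjacency list: adj[i] = neighbors of node i
    paths: tuple  # stand-in for the expensive, serializable paths


def find_paths(adj):
    """Placeholder for the expensive path-building step."""
    return tuple((i, j) for i, row in enumerate(adj) for j in row if i < j)


def rebuilt(pg, new_adj):
    # The only sanctioned "mutation": a new object with both fields updated.
    return replace(pg, adj=new_adj, paths=find_paths(new_adj))


pg = ReadOnlyPathGraph(adj=((1,), (0, 2), (1,)), paths=((0, 1), (1, 2)))

try:
    pg.adj = ()  # attribute reassignment is rejected
except FrozenInstanceError:
    reassignment_blocked = True
else:
    reassignment_blocked = False

pg2 = rebuilt(pg, ((1,), (0,)))
```

Freezing only blocks attribute reassignment. A NumPy array stored on the object could still be modified in place unless it is also marked read-only, for example with `setflags(write=False)`, so this remains a "user beware" arrangement rather than full validation.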

alisterburt · 2025-06-07T16:28:15Z

The high level of this looks really great - haven't looked at the low level but would love to see the API evolve in this direction :-)

jni added 3 commits June 2, 2025 13:14

Initial work on a PathGraph dataclass

8fbecdc

Ensure index arrays are the right dtype for NBGraph

3670df5

Fix bool check

c38d7de

jni changed the title ~~New API: PathGraph dataclcass~~ New API: PathGraph dataclass Jun 2, 2025

kevinyamauchi reviewed Jun 2, 2025

jni added 5 commits July 17, 2025 16:52

Update summarize to allow pathgraph input

f783776

Ensure graph index data is correct dtype for numba

22579dd

Rename graph.graph to graph.adj

829be6f

Add tests for pathgraph

4af902e

Add docstrings to PathGraph

751cfa2

jni mentioned this pull request Jul 17, 2025

Use skan for computing CCA live-image-tracking-tools/traccuracy#251

Merged

jni marked this pull request as ready for review July 17, 2025 17:59

Add 0.13.0 release notes

e20facc

jni merged commit 153fe4c into main Jul 17, 2025
7 checks passed

jni mentioned this pull request Oct 1, 2025

Release Notes for 0.13 #252

Open
