-
Notifications
You must be signed in to change notification settings - Fork 43
New API: PathGraph dataclass #246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
kevinyamauchi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the direction this is going!
| - [Nt], [Np], Nr, Nc: the dimensions of the source image | ||
| """ | ||
| node_coordinates: np.ndarray | None | ||
| node_values: np.ndarray | None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is node_values used for? Are these annotations computed from an accompanying image? If so, I don't think this should be on the core graph model because the data structure one wants to use for this will be very application specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is one of the things that skan computes statistics on — the values along the path. If you don't want them, you can just leave it as None, but if you add them, skan can quickly compute statistics on them, such as the mean intensity along a path. (Useful for filtering branches.) Definitely open to discussion but I think that they are generic enough to keep on the graph. Since you mention application-specific data structures, one possible enhancement would be for this to have shape (n_nodes, n_props) and you would be able to compute stats on many attributes (branch image intensity, branch thickness, branch hessian eigenvalues...) simultaneously. Alternatively, it could be a tuple of arrays, which could have arbitrary shapes, as long as the leading dimension had shape (n_nodes,).
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I think this is a sensible approach.
Intuitively, I might have gone for two separate objects (e.g., PathGraph, and PathData) and then functions that take a PathGraph and PathData object to do the computations you mentioned. You could still have a convenience class that has both a PathGraph and PathData object (via composition not inheritance). I think this approach has a couple of advantages:
- you keep the validation and serialization/deserialization logic with the data and it doesn't have to be repeated across the various functions that consume the data.
- If you have multiple types of PathData (e.g., one array vs. tuple of arrays), you can dispatch the analysis functions based on the class rather than inspecting the shapes, etc. I find this to be easier to read/understand. The functions also don't have to do as much checking, etc. at runtime because the input is already validated
- As a developer depending on a library, I personally find it easier to grok the framework when the data being passed around is in a class that has been validated rather than plain arrays (e.g., you can read the docstring about it in your IDE and you don't have to do any checks because it is already validated)
- You can still "hide" these from users who don't want to know the details with convenience classes and convenience constructor/export methods
Of course this comes with the trade-off that you now have extra custom dataclasses. I think some would argue this hurts interoperability with other libraries. In most cases users will want to do a chain of operations in library A and then pass the result off to library B. I think this is largely addressed by having convience constructors and export methods. For me, this is a reasonable trade-off for the advantages I listed above, but I think the other way is also a reasonable approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely do worry about too many levels of indirection here. For example, one could say that the coordinates themselves are also optional, really. So we can do:
CompleteSkeleton
- AugmentedSpatialPathGraph
- SpatialPathGraph
- PathGraph
- Coordinates
- PathData
- SpatialPathGraph
- Images
- SkeletonImage
- SourceImage
I think I would rather just have optional things, especially when it comes to the graph, which could/should have:
- nodes
- edges
- (node properties)
- (edge properties)
- ((node coordinates))
The coordinates indeed feel even more optional for a "graph", and define rather a "spatial" graph. Also, things like spatial metadata (scale, offset/translate, pixel meaning) perhaps belong in another level of abstraction. But again I'm super worried about the depth of the tree here. ("Flat is better than nested.")
Here's another alternative, only two levels:
SpatialPathGraph
- pathgraph:
- adjacency (defines nodes implicitly, edges/scalar edge property explicitly)
- node properties
- spatial info
- node coordinates
- edge arcs? (this could be where
pathslives) - coordinate transformations (similar/identical to ngff)
It's late here, but I kinda like it. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fwiw (I would say it's not worth much), my gut feeling leans towards the last presented option, SpatialPathGraph as I feel like it's a natural differentiation between the abstract graph properties and the spatial information that is strictly speaking optional (but in reality fundamental) to most of what people would want to do with a skeleton.
I'm semi-surprised not to see edge properties also on the pathgraph, when node properties are as I would put those on the same "level" conceptually, but could be convinced otherwise.
My feelings on this are not strong, as our use-case for skan is pretty limited at the moment, so I'm more than happy for this opinion to be discounted!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I think SpatialPathGraph is the best option so far!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinyamauchi fyi:
I started playing around with the ideas above and I was having a bit of trouble expressing the NGFF model in dataclasses*. Ultimately I ran out of time cos @DragaDoncila needs this for the Janelia trackathon, so I'm going to merge this as-is, which already allows a lot of what you wanted to do, and then we can make a new API or build upon this one while deprecating some attributes — I tried to make it as small to update as I could should we go with SpatialGraph in the future.
*: I think I need a discriminated union for NGFF transforms and I'm not sure whether it's enough to just do a union of several dataclasses or whether I need to be more clever. It's probably simple but I just ran out of time to play with it, and it's more of a priority to express the things expressed in this PR.
| node_coordinates: np.ndarray | None | ||
| node_values: np.ndarray | None | ||
| graph: scipy.sparse.csr_array # pixel/node-id neighbor graph | ||
| paths: scipy.sparse.csr_array # paths[i, j] = 1 iff coord j is in path i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to having the paths property as I know that it can take some time to generate so it would be nice to be able to serialize it. The downside is that it is possible for the graph and paths to get out of sync. For big graphs, this is probably worth the cost. I suppose these could also be properties without setters that only get mutated by class methods that make sure they are updated in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to having the paths property as I know that it can take some time to generate so it would be nice to be able to serialize it.
Do you mean as an attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside is that it is possible for the graph and paths to get out of sync.
I think an important idea behind this refactor is that these attributes will be ~read-only after construction. So I'm not super concerned about this. We should just keep any in-place modifications well contained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to having the paths property as I know that it can take some time to generate so it would be nice to be able to serialize it.
Do you mean as an attribute?
Yes, sorry.
I think an important idea behind this refactor is that these attributes will be ~read-only after construction. So I'm not super concerned about this. We should just keep any in-place modifications well contained.
Will you do some validation on input to make sure they are compatible or should that rest on the user? I think both are fine - just trying to understand your mental model for what the dataclass should/shouldn't do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you do some validation on input to make sure they are compatible or should that rest on the user? I think both are fine - just trying to understand your mental model for what the dataclass should/shouldn't do.
It's kind of you to assume I have a mental model. 😂
Basically when you say "validation" I think "ok let's go full pydantic". 😂 I also think that validation would be as expensive as a rebuild. So, mostly, I'm thinking, user beware. Ser/deser should be simple and not mess with the data, and any messing that happens beyond those bounds is equivalent to corrupting data. Maybe?
|
The high level of this looks really great - haven't looked at the low level but would love to see the API evolve in this direction :-) |
The Skeleton class is a bit of a hodgepodge of data, generated data,
things that should be properties, manually cached properties, and
functions that should not be part of the class at all. Thanks to the
helpful prodding of @kevinyamauchi, this PR attempts to simplify the
concepts and data structures in skan into a simple class that can be
created without an image — since the path graph, not the image, is the
core of the computational abilities of skan.
This is still a work in progress but should already be useful: if you
have a graph as a
scipy.sparse.csr_array(note:csr_array, not thedeprecated
csr_matrixused by skan so far) generated through your ownmeans, you can make a "Skeleton" as follows:
Still to do in this PR:
summarizeto take in PathGraphs directlyIn the future, we might want to deprecate Skeleton altogether, but I'm
happy to do this over a long time.
@kevinyamauchi, I'm curious what you think about
pathsbeing a dataattribute, even though it is generated. I think it's expensive enough to
compute that we want it to be data and serializable. But it kinda breaks
the dataclass paradigm a little bit.
@DragaDoncila, you should be able to pull down this branch and use it on
your networkx tracking graphs after exporting them to csr arrays (I'm
pretty sure that's built-in to nx).