Materialization and Versioning

The MICrONS dataset is public, open access, and ready for analysis. However, manual edits to the segmentation continue to improve data quality, so the data changes over time.

When beginning your analysis in the MICrONS dataset, it is important to understand:

  1. Why the data changes
  2. What types of data change with time
  3. How to set the version for your analysis
  4. How to cross-reference data across time

The data is regularly versioned; that is, a long-term copy of the dataset is made available for users. We highly recommend setting the version or timestamp in your analysis for future consistency.

However, even if you do not set the version, there is a lineage graph of changes to the dataset. This means you can find the past version of your cell, annotation table, mesh, skeleton, etc., as long as you know the root id of the object you are interested in, or the date on which you performed an analysis.

Why the data changes

The automatic segmentation from EM imagery to 3D reconstruction was largely effective, and the only way to process data at this scale (The MICrONS Consortium et al. 2025). However, due to imaging defects and the nature of thin, branching axons, the automated methods do make mistakes that have large impacts on the biological accuracy of the reconstructions.

Manual Proofreading, or the correction of segmentation and adding of annotations, is an ongoing effort.

Different aspects of the data require different levels of manual intervention. For example, the segmentation methods produced highly accurate dendritic arbors before proofreading, enabling morphological identification of broad cell types. Most dendritic spines are properly associated with their dendritic trunk. Recovery of larger-caliber axons, those of inhibitory neurons, and the initial portions of excitatory neurons was also typically successful. Owing to the high frequency of imaging defects in the shallower and deeper portions of the dataset, processes near the pia and white matter often contain errors. Many non-neuronal objects are also well-segmented, including astrocytes, microglia and blood vessels. The two subvolumes of the dataset were segmented separately, but the alignment between the two is sufficient for manually tracing between them.

Changes to the dataset represent an improvement in accuracy, and reflect an investment in the long-term usefulness of this open-access resource.

What types of data change with time

Proofreading edits to the segmentation change which supervoxels (groups of locally aggregated voxels) are associated with which segmented object. Any time a supervoxel is associated with a different segmented object, all of the ids upstream of that supervoxel will update. In practice, this means the 18-digit segmentation id, or pt_root_id, of your neuron, microglia, axon, etc. will change every time it is proofread.
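The effect of an edit on ids can be illustrated with a toy model. This is only a sketch, not the ChunkedGraph implementation: the class name, the small integer ids, and the string supervoxel labels are all hypothetical, and real root ids are 18-digit integers assigned by the service.

```python
from itertools import count


class ToySegmentation:
    """Toy model: a root id labels one fixed set of supervoxels."""

    def __init__(self, objects):
        # objects: iterable of supervoxel collections, one per segmented object
        self._next_id = count(1)
        self.roots = {}  # current root_id -> frozenset of supervoxel labels
        for supervoxels in objects:
            self.roots[next(self._next_id)] = frozenset(supervoxels)

    def root_of(self, supervoxel):
        """Return the current root id that contains this supervoxel."""
        for root_id, members in self.roots.items():
            if supervoxel in members:
                return root_id
        raise KeyError(supervoxel)

    def merge(self, root_a, root_b):
        """Merge two objects: both old ids expire and a new id is minted."""
        members = self.roots.pop(root_a) | self.roots.pop(root_b)
        new_root = next(self._next_id)
        self.roots[new_root] = members
        return new_root


seg = ToySegmentation([{"sv1", "sv2"}, {"sv3"}])
before = seg.root_of("sv1")
after = seg.merge(seg.root_of("sv1"), seg.root_of("sv3"))
print(before, "->", after)  # the id for the object containing sv1 has changed
```

The supervoxel `sv1` never changes, but the id of the object that contains it does, which is why an id saved before an edit no longer appears among the current objects.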

a, Automated segmentation overlaid on EM data. Each color represents an individual putative cell. b, Different colors represent supervoxels that make up putative cells. c, Supervoxels belonging to a particular neuron, with an overlaid cartoon of its supervoxel graph. These data correspond to the framed square in a and the full panel in b. d, One-dimensional representation of the supervoxel graph. The ChunkedGraph data structure adds an octree structure to the graph to store the connected component information. Each abstract node (black nodes in levels >1) represents the connected component in the spatially underlying graph.

Figure from (Dorkenwald et al. 2025)

The pt_root_id is always associated with the same collection of supervoxels, and therefore the same mesh and same skeleton. But if that pt_root_id is expired, then you may not find that object in current Annotation Tables, Synapse Connectivity Tables, and Neuroglancer views of the current version of the dataset (default).

Creating a new pt_root_id for an edited object is the only way to have the flexibility of both merging two or more segments that should be connected (for example: extending an axon) and splitting an object into two, as in the following example:

f, To submit a split operation, users place labels for each side of the split (top right). The backend system first connects each set of labels on each side by identifying supervoxels between them in the graph (left). The extended sets are used to identify the edges needed to be cut with a maximum-flow minimum-cut algorithm.

Figure from (Dorkenwald et al. 2025)
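The cut step described in the caption can be sketched with a small max-flow min-cut routine. This is a toy Edmonds-Karp implementation under simplifying assumptions, not the backend's code: every supervoxel edge has unit capacity (real edges carry weights), the function name `min_cut_edges` is hypothetical, and the earlier step of extending the user's label sets is omitted.

```python
from collections import defaultdict, deque


def min_cut_edges(edges, sources, sinks):
    """Return the undirected edges to cut so no source label reaches a sink label."""
    SRC, SNK = "_source", "_sink"
    cap = defaultdict(int)   # residual capacity per directed edge
    adj = defaultdict(set)   # adjacency, kept symmetric for residual traversal

    def add(u, v, c):
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)

    for u, v in edges:       # an undirected edge becomes two directed unit edges
        add(u, v, 1)
        add(v, u, 1)
    big = len(edges) + 1     # larger than any possible flow in this graph
    for s in sources:
        add(SRC, s, big)
    for t in sinks:
        add(t, SNK, big)

    def bfs_parents():
        """Breadth-first search over edges with remaining capacity."""
        parents = {SRC: None}
        queue = deque([SRC])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parents and cap[(u, v)] > 0:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = bfs_parents()
        if SNK not in parents:
            break            # no augmenting path left: the flow is maximal
        path, v = [], SNK
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        bottleneck = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= bottleneck
            cap[(v, u)] += bottleneck

    reachable = set(parents)  # source side of the cut in the residual graph
    return sorted((u, v) for u, v in edges if (u in reachable) != (v in reachable))


# Two triangles of supervoxels joined by one bridge edge (a false merge).
toy_edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
             ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
             ("a3", "b1")]
print(min_cut_edges(toy_edges, sources=["a1"], sinks=["b2"]))  # [('a3', 'b1')]
```

Placing one label on each side is enough here because the bridge is the cheapest set of edges separating the two labels.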

But this also means we can track the histories of what ids used to be part of which segmented objects, which helps for finding the same cell, axon, or arbitrary segment across time. See Lineage Graphs below for details.

How to set the version for your analysis

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation for CAVEclient is available here.

Initial Setup

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

To initialize a CAVEclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONS public data, we use the datastack name minnie65_public.

from caveclient import CAVEclient
from datetime import datetime

# initialize cave client
client = CAVEclient('minnie65_public')

# see the available materialization versions
client.materialize.get_versions()
[1300, 1078, 117, 661, 343, 1181, 795, 943]

And these are their associated timestamps (all timestamps are in UTC):

for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")
Version 1300: 2025-01-13 10:10:01.286229+00:00
Version 1078: 2024-06-05 10:10:01.203215+00:00
Version 117: 2021-06-11 08:10:00.215114+00:00
Version 661: 2023-04-06 20:17:09.199182+00:00
Version 343: 2022-02-24 08:10:00.184668+00:00
Version 1181: 2024-09-16 10:10:01.121167+00:00
Version 795: 2023-08-23 08:10:01.404268+00:00
Version 943: 2024-01-22 08:10:01.497934+00:00
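If you know only the date of an earlier analysis, the listing above can be used to find the newest public version that existed at that time. The sketch below hard-codes the timestamps printed above (truncated to the second); the function name `latest_version_at` is hypothetical, and the result is only a heuristic, since it assumes the analysis used the newest version available on that date.

```python
from datetime import datetime, timezone

# Public version timestamps copied from the listing above (UTC, whole seconds).
VERSION_TIMESTAMPS = {
    117: datetime(2021, 6, 11, 8, 10, 0, tzinfo=timezone.utc),
    343: datetime(2022, 2, 24, 8, 10, 0, tzinfo=timezone.utc),
    661: datetime(2023, 4, 6, 20, 17, 9, tzinfo=timezone.utc),
    795: datetime(2023, 8, 23, 8, 10, 1, tzinfo=timezone.utc),
    943: datetime(2024, 1, 22, 8, 10, 1, tzinfo=timezone.utc),
    1078: datetime(2024, 6, 5, 10, 10, 1, tzinfo=timezone.utc),
    1181: datetime(2024, 9, 16, 10, 10, 1, tzinfo=timezone.utc),
    1300: datetime(2025, 1, 13, 10, 10, 1, tzinfo=timezone.utc),
}


def latest_version_at(when, version_timestamps=VERSION_TIMESTAMPS):
    """Return the newest version materialized at or before `when`, or None."""
    candidates = [v for v, ts in version_timestamps.items() if ts <= when]
    return max(candidates, key=version_timestamps.get) if candidates else None


analysis_date = datetime(2023, 6, 1, tzinfo=timezone.utc)
print(latest_version_at(analysis_date))  # 661
```

In a live session the same dictionary could be built from client.materialize.get_versions() and client.materialize.get_timestamp(version), as in the loop above.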

You can set the overall materialization version for the dataset using client.version. This will ensure all of the subsequent CAVE queries are performed at the same materialization, so you will get consistency between, for example, a cell type query and a synapse query.

# set materialization version, for consistency
client.version = 1300  # current public version as of 2025-01-13

However, you can also run an individual query at a different version with the optional argument materialization_version. For more about table queries, see CAVE Query Cell Types.

nuc_v661 = client.materialize.tables.nucleus_detection_v0().query(materialization_version=661)

nuc_v661.sample(3)
id created superceded_id valid volume pt_supervoxel_id pt_root_id pt_position bb_start_position bb_end_position
125443 398306 2020-09-28 22:44:49.542064+00:00 NaN t 286.630216 97758472447767564 864691136965779150 [239760, 182240, 17320] [nan, nan, nan] [nan, nan, nan]
128287 113295 2020-09-28 22:44:55.201266+00:00 NaN t 296.171274 79246820408581754 864691135213765688 [105024, 147216, 21941] [nan, nan, nan] [nan, nan, nan]
72688 292606 2020-09-28 22:43:04.266731+00:00 NaN t 288.664289 91628420580581006 864691135545198632 [195152, 122896, 19950] [nan, nan, nan] [nan, nan, nan]

You can do the same thing with an arbitrary timestamp, using the optional argument timestamp. However, due to how the ChunkedGraph operates, this will be more time-intensive than looking up a specific materialized version.

%%time
# query at an arbitrary timestamp instead of a materialized version
nuc_at_time = client.materialize.tables.nucleus_detection_v0().query(timestamp=datetime(2023,4,6,20))

nuc_at_time.sample(3)
CPU times: total: 672 ms
Wall time: 56.6 s
id valid volume pt_supervoxel_id pt_root_id pt_position bb_start_position bb_end_position
98260 565400 t 308.334428 109727549667948559 864691135195335722 [326704, 229712, 15842] [nan, nan, nan] [nan, nan, nan]
138720 423107 t 163.617997 100567105912819520 864691136594656674 [260336, 136800, 15156] [nan, nan, nan] [nan, nan, nan]
124152 633845 t 50.923274 0 0 [366304, 92704, 16738] [nan, nan, nan] [nan, nan, nan]

How to cross-reference data across time

Lineage Graphs

CAVEclient combines materialized snapshots with ChunkedGraph-based tracking of neuron edit histories to facilitate analysis queries for arbitrary time points. The ChunkedGraph tracks the edit lineage of neurons as they are being proofread, allowing us to map any segment used in a query to the closest available snapshot time point. This produces an overinclusive set of segments with which we query the snapshot database.

a, Edits change the assignment of synapses to segment IDs. Each of the four synapses is assigned to the segment IDs (colors) according to the presynaptic and postsynaptic points (point, bar). The identity of the segments changes through proofreading (time passed: ΔT) indicated by different colors. The lineage graph shows the current segment ID (color) for each point in time.

Figure from (Dorkenwald et al. 2025)

We then query the ‘live’ database for all changes to annotations since the materialization snapshot that was used, and add them to the set of annotations. The resulting set of annotations is then mapped back to the query timestamp using the lineage graph and supervoxel-to-root lookups, and finally reduced to include only the queried set of root IDs.

b, Analysis queries are not necessarily aligned to exported snapshots. Queries for other time points are supported by on-the-fly delta updates from both the annotations and segmentation through the use of the lineage graph.

Figure from (Dorkenwald et al. 2025)

Example querying the ChunkedGraph history for a root id

# return the current timestamp
client.materialize.get_timestamp()
datetime.datetime(2025, 1, 13, 10, 10, 1, 286229, tzinfo=datetime.timezone.utc)

Most commonly, you will want to look up the current root id for a pt_root_id from a previous analysis. This is not always trivial, for example in the case of a multi-soma object that has been manually split. Which of the two new cells was your original cell of interest?

The ChunkedGraph will make its best guess, given supervoxel overlap, with the function suggest_latest_roots().

example_id = 864691135919440816

# Access the ChunkedGraph service of caveclient
client.chunkedgraph.suggest_latest_roots(example_id, timestamp = client.materialize.get_timestamp())
np.int64(864691135970572133)

Now we have the updated pt_root_id for our cell at the current materialized version.

If you want to run this for a large number of root ids, you can first check if the pt_root_ids are current to your CAVEclient materialization version using is_latest_roots(), and then only update the ids that have expired:

# Check if roots are current
print(client.chunkedgraph.is_latest_roots(example_id))

# See when the id was generated (when the segment was last edited)
client.chunkedgraph.get_root_timestamps(example_id)
[False]
array([datetime.datetime(2023, 2, 1, 8, 47, 29, 891000, tzinfo=<UTC>)],
      dtype=object)
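The check-then-update pattern described above can be wrapped in a small helper. This is a sketch under assumptions: `update_root_ids` is a hypothetical name, it relies only on the two ChunkedGraph client calls shown above (is_latest_roots and suggest_latest_roots, each with a timestamp keyword), and it issues one request per expired id, which may be slow for very large lists.

```python
def update_root_ids(client, root_ids, timestamp=None):
    """Map each root id to a suggested current id, querying only expired ids."""
    root_ids = list(root_ids)
    # One batched call tells us which ids are still current at `timestamp`.
    is_latest = client.chunkedgraph.is_latest_roots(root_ids, timestamp=timestamp)
    updated = {}
    for root_id, latest in zip(root_ids, is_latest):
        if latest:
            updated[root_id] = root_id
        else:
            # Best guess by supervoxel overlap; this can be ambiguous after a split.
            updated[root_id] = int(
                client.chunkedgraph.suggest_latest_roots(root_id, timestamp=timestamp)
            )
    return updated


# Example usage (requires a configured CAVEclient, so it is left commented out):
# update_root_ids(client, [864691135919440816],
#                 timestamp=client.materialize.get_timestamp())
```

Passing the materialization timestamp keeps the suggested ids consistent with the tables you query at that version.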

Using the timestamp argument, you can also look up the suggested root at any arbitrary time. Here we use the timestamp for a different materialization, version 943:

client.chunkedgraph.suggest_latest_roots(
    example_id,
    timestamp=client.materialize.get_version_metadata(943)['time_stamp'],
)
np.int64(864691135808631069)

Sometimes you may want to check the lineage graph for a cell of interest, to better understand what was edited and why. You can access this and more advanced features with get_lineage_graph(). See the ChunkedGraph documentation for more use cases.

client.chunkedgraph.get_lineage_graph(example_id)
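A lineage graph can be traversed to list every id that a segment later became. The sketch below assumes the result is a node-link dictionary with a `links` list whose entries have `source` (older id) and `target` (newer id) keys; check this layout and the link direction against the actual output before relying on it. The function name and the small integer ids are hypothetical.

```python
def lineage_successors(lineage, root_id):
    """Return every id reachable from root_id by following links forward."""
    children = {}
    for link in lineage["links"]:
        children.setdefault(link["source"], []).append(link["target"])
    seen, stack = set(), [root_id]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hand-built example in the assumed layout:
# 10 and 11 merge into 20, then 20 is split into 30 and 31.
toy_lineage = {
    "nodes": [{"id": i} for i in (10, 11, 20, 30, 31)],
    "links": [
        {"source": 10, "target": 20},
        {"source": 11, "target": 20},
        {"source": 20, "target": 30},
        {"source": 20, "target": 31},
    ],
}
print(sorted(lineage_successors(toy_lineage, 10)))  # [20, 30, 31]
```

The two leaves, 30 and 31, show why a split makes the "current" id ambiguous: both descend from the original segment.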

Static annotations

The CAVE Materialization Engine updates segmentation data and creates databases that combine spatial annotation points and segmentation information.

The live database is written to by the Annotation service and is actively managed by the Materialization service to keep root IDs up to date for all BoundSpatialPoints in all tables. Snapshotted databases are copies of a time-locked state of the ‘live’ database’s segmentation and annotation information used to facilitate consistent querying.

This means if you use a static annotation label to index your analysis, for example a nucleus_id or a synapse_id which do not undergo proofreading, you can look up the current pt_root_id at any time by asking CAVEclient to materialize the new segmentation under the static point.

Using our example cell from above, let’s find its nucleus id. If we try to query the nucleus table with the expired id, we get no result:

example_id = 864691135919440816

client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query()
id created superceded_id valid volume pt_supervoxel_id pt_root_id pt_position bb_start_position bb_end_position

This is expected, since the example id is expired at the time of this materialization. Instead, let’s query a version at which we know this id existed, version 661:

client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query(materialization_version=661)
id created superceded_id valid volume pt_supervoxel_id pt_root_id pt_position bb_start_position bb_end_position
0 260620 2020-09-28 22:44:52.109195+00:00 NaN t 292.154409 88605588438852384 864691135919440816 [173584, 145120, 21127] [nan, nan, nan] [nan, nan, nan]

This returns both the expected pt_root_id 864691135919440816 and the id from nucleus_detection_v0, better known as the nucleus_id.

Given the nucleus_id, we can now query the current materialized version for the current root id.

nuc_id = client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query(materialization_version=661)['id']

client.materialize.tables.nucleus_detection_v0(id=nuc_id).query()
id created superceded_id valid volume pt_supervoxel_id pt_root_id pt_position bb_start_position bb_end_position
0 260620 2020-09-28 22:44:52.109195+00:00 NaN t 292.154409 88605588438852384 864691135970572133 [173584, 145120, 21127] [nan, nan, nan] [nan, nan, nan]

This returns the same pt_root_id as the lineage graph example above, with the added benefit of guaranteeing that the id belongs to your cell of interest and not to an arbitrary chunk of the previous segmented object.

If you are working with a neuron, glial cell, or any cell that has a nucleus detection, we recommend using the nucleus_id as your identifier rather than the pt_root_id.

If you do use pt_root_id, be sure to note the dataset materialization version in your analysis.
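When many cells need updating, the two nucleus queries above amount to a join on the static nucleus id. The following sketch uses plain dictionaries holding the single row shown in the outputs above; the function name `root_id_map` is hypothetical, and only the `id` and `pt_root_id` columns are kept.

```python
# Rows as they appear in the two nucleus_detection_v0 query results above
# (only the columns needed here).
rows_v661 = [{"id": 260620, "pt_root_id": 864691135919440816}]
rows_current = [{"id": 260620, "pt_root_id": 864691135970572133}]


def root_id_map(old_rows, new_rows):
    """Map old pt_root_id -> new pt_root_id by joining on the nucleus id."""
    new_by_nucleus = {row["id"]: row["pt_root_id"] for row in new_rows}
    return {
        row["pt_root_id"]: new_by_nucleus[row["id"]]
        for row in old_rows
        if row["id"] in new_by_nucleus  # skip nuclei missing from the new table
    }


print(root_id_map(rows_v661, rows_current))
# {864691135919440816: 864691135970572133}
```

With query results held as pandas DataFrames, rows in this shape can be obtained with `df.to_dict("records")`.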


References

Dorkenwald, Sven, Casey M. Schneider-Mizell, Derrick Brittain, Akhilesh Halageri, Chris Jordan, Nico Kemnitz, Manuel A. Castro, et al. 2025. “CAVE: Connectome Annotation Versioning Engine.” Nature Methods, April. https://doi.org/10.1038/s41592-024-02426-z.
The MICrONS Consortium, J. Alexander Bae, Mahaly Baptiste, Maya R. Baptiste, Caitlyn A. Bishop, Agnes L. Bodor, Derrick Brittain, et al. 2025. “Functional Connectomics Spanning Multiple Areas of Mouse Visual Cortex.” Nature 640 (8058): 435–47. https://doi.org/10.1038/s41586-025-08790-w.