Materialization and Versioning

Version update

This tutorial has recently been updated to materialization version 1412 (from 1300).

We have released a new public version 1412, as part of our quarterly release schedule. See details at Release Manifests: 1412.

The MICrONS dataset is public, open access, and ready for analysis. However, manual edits to the segmentation continue to improve data quality, so the data changes over time.

When beginning your analysis in the MICrONS dataset, it is important to understand:

Why the data changes
What types of data change with time
How to set the version for your analysis
How to cross-reference data across time

The data is regularly versioned; that is, a long-term copy of the dataset is made available for users. We highly recommend setting the version or timestamp in your analysis for future consistency.

However, even if you do not set the version, there is a lineage graph of changes to the dataset. This means you can find the past version of your cell, annotation table, mesh, skeleton, etc., as long as you know the root id of the object you are interested in, or the date on which you performed the analysis.

Why the data changes

The automatic segmentation from EM imagery to 3D reconstruction was largely effective, and is the only way to process data at this scale (The MICrONS Consortium et al. 2025). However, due to imaging defects and the nature of thin, branching axons, the automated methods do make mistakes that have large impacts on the biological accuracy of the reconstructions.

Manual proofreading, that is, correcting the segmentation and adding annotations, is an ongoing effort.

Different aspects of the data require different levels of manual intervention. For example, the segmentation methods produced highly accurate dendritic arbors before proofreading, enabling morphological identification of broad cell types. Most dendritic spines are properly associated with their dendritic trunk. Recovery of larger-caliber axons, those of inhibitory neurons, and the initial portions of excitatory neurons was also typically successful. Owing to the high frequency of imaging defects in the shallower and deeper portions of the dataset, processes near the pia and white matter often contain errors. Many non-neuronal objects are also well-segmented, including astrocytes, microglia and blood vessels. The two subvolumes of the dataset were segmented separately, but the alignment between the two is sufficient for manually tracing between them.

Changes to the dataset represent an improvement in accuracy, and reflect an investment in the long-term usefulness of this open-access resource.

What types of data change with time

Proofreading edits to the segmentation change which supervoxels (groups of locally aggregated voxels) are associated with which segmented object. Any time a supervoxel is associated with a different segmented object, all of the ids upstream of that supervoxel will update. In practice, this means the 18-digit segmentation id, or pt_root_id, of your neuron, microglia, axon, etc. will change every time it is proofread.

a, Automated segmentation overlaid on EM data. Each color represents an individual putative cell. b, Different colors represent supervoxels that make up putative cells. c, Supervoxels belonging to a particular neuron, with an overlaid cartoon of its supervoxel graph. These data correspond to the framed square in a and the full panel in b. d, One-dimensional representation of the supervoxel graph. The ChunkedGraph data structure adds an octree structure to the graph to store the connected component information. Each abstract node (black nodes in levels >1) represents the connected component in the spatially underlying graph.

Figure from (Dorkenwald et al. 2025)

A given pt_root_id is always associated with the same collection of supervoxels, and therefore the same mesh and the same skeleton. But if that pt_root_id has expired, you may not find the object in current Annotation Tables, Synapse Connectivity Tables, or Neuroglancer views, which show the current version of the dataset by default.

Creating a new pt_root_id for an edited object is the only way to have the flexibility of both merging two or more segments that should be connected (for example: extending an axon) and splitting an object into two, as in the following example:

f, To submit a split operation, users place labels for each side of the split (top right). The backend system first connects each set of labels on each side by identifying supervoxels between them in the graph (left). The extended sets are used to identify the edges needed to be cut with a maximum-flow minimum-cut algorithm.

Figure from (Dorkenwald et al. 2025)

This also means we can track the history of which ids used to be part of which segmented objects, which helps in finding the same cell, axon, or arbitrary segment across time. See Lineage Graphs below for details.
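As a rough illustration of why every edit produces new root ids, and why that makes histories traceable, here is a toy model in plain Python. It is not the ChunkedGraph implementation: the ToySegmentation class, its method names, and the small integer ids are all hypothetical, chosen only to mirror the merge and split behavior described above.

```python
from itertools import count


class ToySegmentation:
    """Toy model (hypothetical): each root id maps to a fixed set of supervoxels."""

    def __init__(self, objects):
        self._next_id = count(start=max(objects) + 1)
        self.roots = {rid: frozenset(svs) for rid, svs in objects.items()}
        self.current = set(objects)  # root ids that are valid "now"
        self.parents = {}            # new root id -> root ids it replaced

    def _new_root(self, supervoxels, replaced):
        rid = next(self._next_id)
        self.roots[rid] = frozenset(supervoxels)
        self.parents[rid] = list(replaced)
        self.current.add(rid)
        return rid

    def merge(self, a, b):
        """Join two objects; both old ids expire and one new id is created."""
        self.current -= {a, b}
        return self._new_root(self.roots[a] | self.roots[b], [a, b])

    def split(self, root, side_a):
        """Cut one object in two; the old id expires and two new ids are created."""
        side_a = frozenset(side_a)
        side_b = self.roots[root] - side_a
        self.current.discard(root)
        return self._new_root(side_a, [root]), self._new_root(side_b, [root])

    def descendants(self, old_root):
        """Current root ids that share at least one supervoxel with old_root."""
        return sorted(r for r in self.current if self.roots[r] & self.roots[old_root])


seg = ToySegmentation({1: {"sv_a", "sv_b"}, 2: {"sv_c"}})
merged = seg.merge(1, 2)                   # new id 3 replaces ids 1 and 2
left, right = seg.split(merged, {"sv_a"})  # new ids 4 and 5 replace id 3

print(sorted(seg.roots[1]))  # ['sv_a', 'sv_b']: the expired id keeps its supervoxels
print(seg.descendants(1))    # [4, 5]: old object 1 now lives in two current objects
```

Note that the expired id 1 has two current descendants after the split, which is the ambiguity a lookup by root id alone cannot resolve.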

How to set the version of your analysis

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation for CAVEclient is available here.

Initial Setup

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

To initialize a CAVEclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONS public data, we use the datastack name minnie65_public.

from caveclient import CAVEclient
from datetime import datetime, timezone

# initialize cave client
client = CAVEclient('minnie65_public')

# see the available materialization versions
client.materialize.get_versions()

[1300, 1078, 117, 661, 343, 1181, 795, 943, 1412]

And these are their associated timestamps (all timestamps are in UTC):

for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")

Version 1300: 2025-01-13 10:10:01.286229+00:00
Version 1078: 2024-06-05 10:10:01.203215+00:00
Version 117: 2021-06-11 08:10:00.215114+00:00
Version 661: 2023-04-06 20:17:09.199182+00:00
Version 343: 2022-02-24 08:10:00.184668+00:00
Version 1181: 2024-09-16 10:10:01.121167+00:00
Version 795: 2023-08-23 08:10:01.404268+00:00
Version 943: 2024-01-22 08:10:01.497934+00:00
Version 1412: 2025-04-29 10:10:01.200893+00:00

You can set the overall materialization version for the dataset using client.version. This will ensure all of the subsequent CAVE queries are performed at the same materialization, so you will get consistency between, for example, a cell type query and a synapse query.

# set materialization version, for consistency
client.version = 1412 # current public as of 4/29/2025
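Since the version matters for reproducibility, it can be convenient to pin it and record it in one step. The helper below is a minimal sketch: pin_version and the keys of the returned dictionary are hypothetical names, and it relies only on client.version and client.materialize.get_timestamp() as used above.

```python
def pin_version(client, version):
    """Pin `client` to one materialization version and return a record of it.

    The returned dict (hypothetical layout) can be saved next to analysis
    results so the same version or timestamp can be requested again later.
    """
    client.version = version
    timestamp = client.materialize.get_timestamp(version)
    return {
        "materialization_version": version,
        "timestamp_utc": timestamp.isoformat(),
    }
```

For example, pin_version(client, 1412) would set the version and return its number together with the ISO-formatted UTC timestamp of that materialization.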

However, you can also run individual queries at a different version with the optional argument materialization_version. For more about table queries, see CAVE Query Cell Types.

nuc_v661 = client.materialize.tables.nucleus_detection_v0().query(materialization_version=661)

nuc_v661.sample(3)

	id	created	superceded_id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
27095	717226	2020-09-28 22:41:36.353256+00:00	NaN	t	340.089078	118380363788982290	864691135386337877	[389712, 210880, 24904]	[nan, nan, nan]	[nan, nan, nan]
143388	164389	2020-09-28 22:45:24.957212+00:00	NaN	t	454.651412	81435535339497985	864691135475705920	[121056, 201904, 18463]	[nan, nan, nan]	[nan, nan, nan]
34048	126135	2020-09-28 22:41:48.617493+00:00	NaN	t	158.299423	78694108986133068	864691135625910878	[101216, 223936, 16893]	[nan, nan, nan]	[nan, nan, nan]

You can even do the same thing with an arbitrary timestamp, using the optional argument timestamp. However, due to how the ChunkedGraph operates, this will be more time-intensive than looking up a specific materialized version.

%%time
nuc_v661 = client.materialize.tables.nucleus_detection_v0().query(timestamp=datetime(2023, 4, 6, 20, tzinfo=timezone.utc))

nuc_v661.sample(3)

CPU times: total: 812 ms
Wall time: 59.7 s

	id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
99694	500984	t	245.666775	103042450989740874	864691136287472323	[278528, 229328, 21380]	[nan, nan, nan]	[nan, nan, nan]
110157	563160	t	333.836780	107894390181643519	864691136974411804	[313680, 203072, 25566]	[nan, nan, nan]	[nan, nan, nan]
120527	612800	t	291.528704	113866387252299373	864691135763026486	[357008, 133440, 22939]	[nan, nan, nan]	[nan, nan, nan]
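Because all materialization timestamps are in UTC, it can help to make the time zone of a query timestamp explicit instead of relying on how a naive datetime is interpreted. The following standard-library sketch shows one way to do that; as_utc is a hypothetical helper, and treating naive values as already being in UTC is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone


def as_utc(ts):
    """Return a timezone-aware UTC datetime.

    Naive datetimes are assumed to already be in UTC (an assumption of this
    sketch); aware datetimes are converted to UTC.
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


naive = datetime(2023, 4, 6, 20)
local = datetime(2023, 4, 6, 13, tzinfo=timezone(timedelta(hours=-7)))

print(as_utc(naive))  # 2023-04-06 20:00:00+00:00
print(as_utc(local))  # 2023-04-06 20:00:00+00:00
```

The result of as_utc() can then be passed as the timestamp argument of a query.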

How to set the timestamp to an expired version

Materialization versions expire at regular intervals. Indeed, every version between our major long-term public releases existed at some point, but has since expired.

This does not mean the data from those versions is gone.

It does mean it takes longer to materialize data from that date, because the ChunkedGraph has to calculate the differences between an extant materialized version and the requested time. To materialize data from an expired version, you must set the optional timestamp argument in every query:

# set the timestamp of a version that may or may not exist
example_timestamp = datetime(2022, 2, 24, 8, 10, 0, 184668, tzinfo=timezone.utc)

# example table query
nuc_timestamp = client.materialize.tables.nucleus_detection_v0().query(timestamp=example_timestamp, limit=100)
nuc_timestamp.head(3)

201 - "Limited query to 100 rows

	id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
0	11294	t	49.112842	73556435283294116	864691135269406572	[63968, 218032, 20683]	[nan, nan, nan]	[nan, nan, nan]
1	11300	t	323.577446	73626116832723681	864691135151717168	[64400, 213072, 20630]	[nan, nan, nan]	[nan, nan, nan]
2	11301	t	277.511864	0	0	[64496, 219104, 20408]	[nan, nan, nan]	[nan, nan, nan]

# example synapse query
syn_timestamp = client.materialize.synapse_query(post_ids=nuc_timestamp.pt_root_id, timestamp=example_timestamp)
syn_timestamp.head(3)

	id	valid	pre_pt_supervoxel_id	pre_pt_root_id	post_pt_supervoxel_id	size	pre_pt_position	post_pt_position	ctr_pt_position
0	507533913	t	120060554860852929	864691132931221450	120060554860853553	420	[402320, 146520, 23899]	[402354, 146594, 23900]	[402350, 146558, 23899]
1	512239447	t	107692559937453733	864691134764563389	107692560004163297	1300	[312342, 272446, 16895]	[312272, 272510, 16899]	[312298, 272460, 16898]
4	416272231	t	110295859874434808	864691135385181117	110295859874440074	1884	[330842, 269600, 16654]	[330878, 269532, 16662]	[330854, 269538, 16663]

Timestamps for all public release versions

Long-term releases are made available for analysis, but are not permanent. You can look up the timestamp associated with any version here.

CAVEclient materialization version timestamps (as datetime.datetime objects)
Version	Timestamp
117	datetime(2021, 6, 11, 8, 10, 0, 215114, tzinfo=timezone.utc)
343	datetime(2022, 2, 24, 8, 10, 0, 184668, tzinfo=timezone.utc)
661	datetime(2023, 4, 6, 20, 17, 9, 199182, tzinfo=timezone.utc)
795	datetime(2023, 8, 23, 8, 10, 1, 404268, tzinfo=timezone.utc)
943	datetime(2024, 1, 22, 8, 10, 1, 497934, tzinfo=timezone.utc)
1078	datetime(2024, 6, 5, 10, 10, 1, 203215, tzinfo=timezone.utc)
1181	datetime(2024, 9, 16, 10, 10, 1, 121167, tzinfo=timezone.utc)
1300	datetime(2025, 1, 13, 10, 10, 1, 286229, tzinfo=timezone.utc)
1412	datetime(2025, 4, 29, 10, 10, 1, 200893, tzinfo=timezone.utc)
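Given the date of an earlier analysis, the table above can be used to decide between a fast materialization_version query and a slower timestamp query. The sketch below copies the timestamps from the table; RELEASE_TIMESTAMPS, exact_release, and nearest_release are hypothetical names, and "nearest" here means smallest absolute time difference, which is an assumption of this example.

```python
from datetime import datetime, timezone

# Timestamps copied from the table above (all UTC).
RELEASE_TIMESTAMPS = {
    117: datetime(2021, 6, 11, 8, 10, 0, 215114, tzinfo=timezone.utc),
    343: datetime(2022, 2, 24, 8, 10, 0, 184668, tzinfo=timezone.utc),
    661: datetime(2023, 4, 6, 20, 17, 9, 199182, tzinfo=timezone.utc),
    795: datetime(2023, 8, 23, 8, 10, 1, 404268, tzinfo=timezone.utc),
    943: datetime(2024, 1, 22, 8, 10, 1, 497934, tzinfo=timezone.utc),
    1078: datetime(2024, 6, 5, 10, 10, 1, 203215, tzinfo=timezone.utc),
    1181: datetime(2024, 9, 16, 10, 10, 1, 121167, tzinfo=timezone.utc),
    1300: datetime(2025, 1, 13, 10, 10, 1, 286229, tzinfo=timezone.utc),
    1412: datetime(2025, 4, 29, 10, 10, 1, 200893, tzinfo=timezone.utc),
}


def exact_release(when):
    """Return the release materialized exactly at `when`, or None if there is none."""
    for version, timestamp in RELEASE_TIMESTAMPS.items():
        if timestamp == when:
            return version
    return None


def nearest_release(when):
    """Return the release whose timestamp is closest in time to `when` (UTC-aware)."""
    return min(RELEASE_TIMESTAMPS, key=lambda v: abs(RELEASE_TIMESTAMPS[v] - when))


analysis_date = datetime(2023, 2, 1, 8, 47, 29, tzinfo=timezone.utc)
print(exact_release(analysis_date))    # None: no long-term release at this time
print(nearest_release(analysis_date))  # 661
```

When exact_release() returns a version, that version number can be used directly; otherwise the query needs the timestamp argument.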

How to cross-reference across time

Lineage Graphs

CAVEclient combines materialized snapshots with ChunkedGraph-based tracking of neuron edit histories to facilitate analysis queries for arbitrary time points. The ChunkedGraph tracks the edit lineage of neurons as they are being proofread, allowing us to map any segment used in a query to the closest available snapshot time point. This produces an overinclusive set of segments with which we query the snapshot database.

a, Edits change the assignment of synapses to segment IDs. Each of the four synapses is assigned to the segment IDs (colors) according to the presynaptic and postsynaptic points (point, bar). The identity of the segments changes through proofreading (time passed: ΔT) indicated by different colors. The lineage graph shows the current segment ID (color) for each point in time.

Figure from (Dorkenwald et al. 2025)

We then query the ‘live’ database for all changes to annotations since the materialization snapshot that was used, and add them to the set of annotations. The resulting set of annotations is then mapped back to the query timestamp using the lineage graph and supervoxel-to-root lookups, and finally reduced to include only the queried set of root IDs.

b, Analysis queries are not necessarily aligned to exported snapshots. Queries for other time points are supported by on-the-fly delta updates from both the annotations and segmentation through the use of the lineage graph.

Figure from (Dorkenwald et al. 2025)

Example querying the ChunkedGraph history for a root id

Initial Setup

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

from caveclient import CAVEclient
from datetime import datetime

# initialize cave client
client = CAVEclient('minnie65_public')

# return the current timestamp
client.materialize.get_timestamp()

datetime.datetime(2025, 4, 29, 10, 10, 1, 200893, tzinfo=datetime.timezone.utc)

Most commonly, you will want to look up the current root id for a pt_root_id from a previous analysis. This is not always trivial, for example in the case of a multi-soma object that has been manually split. Which of the two new cells was your original cell of interest?

The ChunkedGraph will make its best guess, based on supervoxel overlap, with the function suggest_latest_roots():

example_id = 864691135919440816

# Access the ChunkedGraph service of caveclient
client.chunkedgraph.suggest_latest_roots(example_id, timestamp = client.materialize.get_timestamp())

np.int64(864691135970572133)

Now we have the updated pt_root_id for our cell at the current materialized version.

If you want to run this for a large number of root ids, you can first check if the pt_root_ids are current to your CAVEclient materialization version using is_latest_roots(), and then only update the ids that have expired:

# Check if roots are current
print(client.chunkedgraph.is_latest_roots(example_id))

# See when the id was generated (when the segment was last edited)
client.chunkedgraph.get_root_timestamps(example_id)

[False]

array([datetime.datetime(2023, 2, 1, 8, 47, 29, 891000, tzinfo=<UTC>)],
      dtype=object)
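The check-then-update pattern described above can be wrapped in a small helper. This is a sketch rather than a CAVEclient feature: update_root_ids is a hypothetical name, and it assumes that is_latest_roots() accepts a list of ids and that both ChunkedGraph calls take the same optional timestamp argument.

```python
def update_root_ids(client, root_ids, timestamp=None):
    """Map each root id to a root id that is valid at `timestamp`.

    Ids that are still current map to themselves. Expired ids are replaced by
    the ChunkedGraph's best guess, so only expired ids cost an extra lookup.
    """
    root_ids = list(root_ids)
    is_current = client.chunkedgraph.is_latest_roots(root_ids, timestamp=timestamp)
    updated = {}
    for root_id, current in zip(root_ids, is_current):
        if current:
            updated[root_id] = root_id
        else:
            updated[root_id] = int(
                client.chunkedgraph.suggest_latest_roots(root_id, timestamp=timestamp)
            )
    return updated
```

A call such as update_root_ids(client, [example_id], timestamp=client.materialize.get_timestamp()) would return a dictionary from old ids to suggested ids. Rows with a pt_root_id of 0 probably need to be dropped before calling it, on the assumption that 0 means no segment was found at that point.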

Using the timestamp argument, you can also look up the suggested root at any arbitrary time. Here we use the timestamp for a different materialization, version 943:

client.chunkedgraph.suggest_latest_roots(example_id, 
                                         timestamp=client.materialize.get_version_metadata(943)['time_stamp']
                                        )

np.int64(864691135808631069)

Sometimes you may want to check the lineage graph for a cell of interest, to better understand what was edited and why. You can access this and more advanced features with get_lineage_graph(). See the ChunkedGraph documentation for more use cases.

client.chunkedgraph.get_lineage_graph(example_id)

Static annotations

The CAVEclient Materialization Engine updates segmentation data and creates databases that combine spatial annotation points and segmentation information.

The live database is written to by the Annotation service and is actively managed by the Materialization service to keep root IDs up to date for all BoundSpatialPoints in all tables. Snapshotted databases are copies of a time-locked state of the ‘live’ database’s segmentation and annotation information used to facilitate consistent querying.

This means that if you use a static annotation label to index your analysis, for example a nucleus_id or a synapse_id, which does not change with proofreading, you can look up the current pt_root_id at any time by asking CAVEclient to materialize the new segmentation under the static point.

Using our example cell from above, let’s find its nucleus id. If we try to query the nucleus table with the expired id, we get no result:

example_id = 864691135919440816

client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query()

	id	created	superceded_id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position

This is expected, since the example id is expired at the time of this materialization. Instead, let’s query a version at which we know this id existed: version 661.

client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query(materialization_version=661)

	id	created	superceded_id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
0	260620	2020-09-28 22:44:52.109195+00:00	NaN	t	292.154409	88605588438852384	864691135919440816	[173584, 145120, 21127]	[nan, nan, nan]	[nan, nan, nan]

This returns both the expected pt_root_id, 864691135919440816, and the id from the nucleus_detection_v0 table, better known as the nucleus_id.

Given the nucleus_id, we can now query the current materialized version for the current root id.

nuc_id = client.materialize.tables.nucleus_detection_v0(pt_root_id=example_id).query(materialization_version=661)['id']

client.materialize.tables.nucleus_detection_v0(id=nuc_id).query()

	id	created	superceded_id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
0	260620	2020-09-28 22:44:52.109195+00:00	NaN	t	292.154409	88605588438852384	864691135970572133	[173584, 145120, 21127]	[nan, nan, nan]	[nan, nan, nan]

This returns the same pt_root_id as the lineage graph example above, with the added benefit of guaranteeing that the id belongs to your cell of interest and not to an arbitrary chunk of the previously segmented object.
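The two nucleus-table queries above can be combined into one helper that follows an expired root id to its current root id through the nucleus id. This is a minimal sketch: current_root_from_nucleus is a hypothetical name, it uses only the nucleus_detection_v0 queries shown above, and it assumes you know a materialization version at which the old root id was valid.

```python
def current_root_from_nucleus(client, old_root_id, old_version):
    """Follow an expired root id to current root ids via the nucleus id.

    Returns a dict {nucleus_id: current_pt_root_id}. More than one entry means
    the old object contained several nuclei (for example a multi-soma merge);
    an empty dict means no nucleus was found for that id at `old_version`.
    """
    nucleus_table = client.materialize.tables.nucleus_detection_v0

    # Step 1: find the nucleus id(s) at a version where the old root id existed.
    old_rows = nucleus_table(pt_root_id=old_root_id).query(
        materialization_version=old_version
    )

    # Step 2: look up each nucleus id at the client's current version.
    mapping = {}
    for nuc_id in old_rows["id"]:
        new_rows = nucleus_table(id=int(nuc_id)).query()
        for root_id in new_rows["pt_root_id"]:
            mapping[int(nuc_id)] = int(root_id)
    return mapping
```

With the values shown in the outputs above, current_root_from_nucleus(client, 864691135919440816, 661) would be expected to map nucleus 260620 to root id 864691135970572133.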

If you are working with a neuron, glial cell, or any cell that has a nucleus detection, we recommend using the `nucleus_id` as your identifier rather than the `pt_root_id`.

If you do use `pt_root_id`, be sure to note the dataset materialization version in your analysis.

References

Dorkenwald, Sven, Casey M. Schneider-Mizell, Derrick Brittain, Akhilesh Halageri, Chris Jordan, Nico Kemnitz, Manual A. Castro, et al. 2025. “CAVE: Connectome Annotation Versioning Engine.” Nature Methods, April. https://doi.org/10.1038/s41592-024-02426-z.

The MICrONS Consortium, J. Alexander Bae, Mahaly Baptiste, Maya R. Baptiste, Caitlyn A. Bishop, Agnes L. Bodor, Derrick Brittain, et al. 2025. “Functional Connectomics Spanning Multiple Areas of Mouse Visual Cortex.” Nature 640 (8058): 435–47. https://doi.org/10.1038/s41586-025-08790-w.
