CAVE Query: Proofread Cells

Version update

We have released a new public version 1507, as part of our quarterly release schedule. See details at Release Manifests: 1507.

Tutorials remain pinned to v1412 but will be updated in the coming weeks.

The Connectome Annotation Versioning Engine (CAVE) is a suite of tools developed at the Allen Institute and Seung Lab to manage large connectomics data.

Initial Setup

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

CAVEclient

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation for CAVEclient is available here.

To initialize a CAVEclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONs public data, we use the datastack name minnie65_public.

import numpy as np
import pandas as pd

from caveclient import CAVEclient
datastack_name = 'minnie65_public'
client = CAVEclient(datastack_name)

# Show the description of the datastack
client.info.get_datastack_info()['description']

'This is the publicly released version of the minnie65 volume and segmentation. '

Materialization versions

Data in CAVE is timestamped and periodically versioned: each (materialization) version corresponds to a specific timestamp. Individual versions are made publicly available. The Materialization client allows one to interact with the materialized annotation tables that were posted to the annotation service. These interactions are called queries to the dataset, and are available from client.materialize. For more, see the CAVEclient Documentation.

Periodic updates are made to the public datastack, which will include updates to the available tables. Some cells will have different pt_root_id because they have undergone proofreading.

Tip

For analysis consistency, it is worth checking the version of the data you are using, and considering whether to specify the version with client.version = your_version

Read more about setting the version of your analysis

# see the available materialization versions
client.materialize.get_versions()

[1300, 1078, 117, 661, 343, 1181, 795, 943, 1412]

And these are their associated timestamps (all timestamps are in UTC):

for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")

Version 1300: 2025-01-13 10:10:01.286229+00:00
Version 1078: 2024-06-05 10:10:01.203215+00:00
Version 117: 2021-06-11 08:10:00.215114+00:00
Version 661: 2023-04-06 20:17:09.199182+00:00
Version 343: 2022-02-24 08:10:00.184668+00:00
Version 1181: 2024-09-16 10:10:01.121167+00:00
Version 795: 2023-08-23 08:10:01.404268+00:00
Version 943: 2024-01-22 08:10:01.497934+00:00
Version 1412: 2025-04-29 10:10:01.200893+00:00
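
Note that the versions are not listed in chronological order. As a minimal sketch, the snippet below sorts a few of them by timestamp using only the standard library. The dictionary `version_timestamps` is a hypothetical local copy of values taken from the output above; with a live client, the same mapping could be built from `client.materialize.get_timestamp`.

```python
from datetime import datetime

# Hypothetical local copy of a few version/timestamp pairs from the output above
version_timestamps = {
    1300: "2025-01-13 10:10:01.286229+00:00",
    1078: "2024-06-05 10:10:01.203215+00:00",
    117: "2021-06-11 08:10:00.215114+00:00",
    1412: "2025-04-29 10:10:01.200893+00:00",
}

# Parse the ISO-formatted strings into timezone-aware datetimes
parsed = {v: datetime.fromisoformat(ts) for v, ts in version_timestamps.items()}

# Sort version numbers from oldest to newest by their timestamp
versions_by_time = sorted(parsed, key=parsed.get)
latest_version = versions_by_time[-1]

print(versions_by_time)
print(latest_version)
```

In this subset the version numbers increase with time, so sorting the numbers directly gives the same order; sorting by timestamp simply makes that assumption explicit.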

# set materialization version, for consistency
client.version = 1412 # current public as of 4/29/2025

Querying Proofread neurons

Proofread neurons

Proofreading is necessary to obtain accurate reconstructions of a cell. In the MICrONs dataset, the general rule is that dendrites of cells with a single cell body are sufficiently proofread to trust synaptic connections onto the cell. Axons, on the other hand, require so much proofreading that only ~1800 cells have axons reliable enough for their outputs to be used in analysis.

Table name: proofreading_status_and_strategy

The table proofreading_status_and_strategy describes the status of cells that have undergone manual proofreading.

Because of the inherent difference in the difficulty and time required for different kinds of proofreading, we describe the status of axons and dendrites separately.

Each compartment status may be either:

FALSE: indicates no comprehensive proofreading has been performed, or is not applicable.
TRUE: indicates that false merges have been comprehensively removed, and the compartment is at least ‘clean’. Consult the strategy column if completeness of the compartment is relevant to your research.

An axon or dendrite labeled as status=TRUE can be trusted to be correct, but may not be complete. The degree of completion can be read from the strategy column. For more information, please see Proofreading and Data Quality, or the microns-explorer page on proofreading strategies.

The key columns are:

Proofreading Status Table
Column	Description
`id`	Soma ID for the cell
`pt_position` `pt_supervoxel_id` `pt_root_id`	Bound spatial point columns associated with the centroid of the cell nucleus
`valid_id`	The root id of the neuron when the proofreading assessment was made. NOTE: if this does not match the `pt_root_id` then the cell has undergone further changes. This is usually an improvement in proofreading, but proceed with caution.
`status_dendrite`	The status of the dendrite proofreading. May be `TRUE` or `FALSE`
`status_axon`	The status of the axon proofreading. May be `TRUE` or `FALSE`
`strategy_dendrite`	The strategy employed to proofread the dendrite. See strategy table below for details
`strategy_axon`	The strategy employed to proofread the axon. See strategy table below for details
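
The `valid_id` note in the table can be checked with a simple column comparison. The following is a minimal sketch on a toy dataframe: the ids are made up for illustration, and only the column names `valid_id` and `pt_root_id` come from the table above.

```python
import pandas as pd

# Toy rows with made-up ids; a real table would come from a materialization query
toy_proof_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "valid_id": [100, 200, 300],
        "pt_root_id": [100, 201, 300],
    }
)

# True where the current root id differs from the id at assessment time
toy_proof_df["changed_since_assessment"] = (
    toy_proof_df["valid_id"] != toy_proof_df["pt_root_id"]
)

# Rows that deserve a second look before being trusted
changed_df = toy_proof_df[toy_proof_df["changed_since_assessment"]]
print(changed_df)
```

Rows flagged this way are not necessarily wrong, since later edits are usually improvements, but they are the ones where the recorded proofreading status may no longer describe the current object.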

The specific strategies are as follows (and will update over time):

Proofreading Strategies

Proofreading Strategy Table
Strategy	Description
`none`	No cleaning, and no extension. Indicates an entry in `proofreading_status` that is `FALSE` for that compartment
`dendrite_clean`	The dendrite had incorrectly-merged axon and dendritic segments comprehensively removed, meaning the input synapses are accurate. The dendrite may be incorrectly truncated by segmentation error. Not all dendrite tips have been checked for extension. No comprehensive attempt was made to re-attach spine heads.
`dendrite_extended`	The dendrite had incorrectly-merged axon and dendritic segments comprehensively removed, meaning the input synapses are accurate. Every tip was identified, manually inspected, and extended if possible. No comprehensive attempt was made to re-attach spine heads.
`axon_column_truncated`	The axon was extended within the V1 cortical column, with a preference for local connections. In some cases the axon was cut at the column boundary and/or the layer boundary, especially the boundary between layers 2/3 and layer 4. Output synapses represent a sampling of potential partners.
`axon_interareal`	The axon was extended with a preference for branches that projected to other brain areas. Some axon branches were fully extended, but local connections may be incomplete. Output synapses represent a sampling of potential partners.
`axon_partially_extended`	The axon was extended outward from the soma, following each branch to its termination. Output synapses represent a sampling of potential partners.
`axon_fully_extended`	Axon was extended outward from the soma, following each branch to its termination. After initial extension, every endpoint was identified, manually inspected, and extended again if possible. Output synapses represent a largely complete sampling of partners.

This table, proofreading_status_and_strategy, supersedes proofreading_status_public_release.

# Standard query
client.materialize.query_table('proofreading_status_and_strategy')

# Content-aware query
client.materialize.tables.proofreading_status_and_strategy(status_axon='t').query()

Here we query and return the table as of version 1412, the version set on the client above.

For the commands used for querying tables, see the previous quickstart notebook on CAVE queries.

proof_all_df = client.materialize.query_table("proofreading_status_and_strategy", 
                                              desired_resolution=[1, 1, 1], 
                                              split_positions=True)

proof_all_df["strategy_axon"].value_counts()

strategy_axon
axon_partially_extended    1702
axon_fully_extended         156
axon_interareal             124
none                         38
Name: count, dtype: int64

Filtering Queries by proofreading status

We can filter our query to only return rows that match a condition by adding a filter to our query:

proof_axon_df = client.materialize.query_table("proofreading_status_and_strategy", 
                                               filter_in_dict={"strategy_axon": ["axon_partially_extended", "axon_fully_extended", "axon_interareal"]}, 
                                               desired_resolution=[1, 1, 1], 
                                               split_positions=True)
proof_axon_df.tail()

	id	created	superceded_id	valid	pt_position_x	pt_position_y	pt_position_z	valid_id	status_dendrite	status_axon	strategy_dendrite	strategy_axon	pt_supervoxel_id	pt_root_id
1977	3639	2025-04-26 15:26:01.884071+00:00	NaN	t	768192.0	722624.0	865560.0	864691135124248359	t	t	dendrite_extended	axon_partially_extended	91213973684426061	864691135124248359
1978	3640	2025-04-26 15:26:01.897693+00:00	NaN	t	742784.0	743936.0	823560.0	864691135447653010	t	t	dendrite_extended	axon_partially_extended	90299867070611957	864691135447653010
1979	3641	2025-04-26 15:26:01.911083+00:00	NaN	t	726848.0	730944.0	864320.0	864691135849699166	t	t	dendrite_extended	axon_partially_extended	89736504934533504	864691135849699166
1980	3642	2025-04-26 15:26:01.924637+00:00	NaN	t	792832.0	936128.0	952600.0	864691135396485877	t	t	dendrite_extended	axon_partially_extended	92065545708721615	864691135396485877
1981	3643	2025-04-26 15:36:38.403644+00:00	NaN	t	738176.0	727168.0	893440.0	864691135646493679	t	t	dendrite_extended	axon_fully_extended	90158580028006378	864691135646493679
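
If the full table has already been downloaded, the same selection can be made locally in pandas. This is a minimal sketch using a toy dataframe with made-up root ids; `isin` plays the role that `filter_in_dict` plays in the server-side query.

```python
import pandas as pd

# Toy stand-in for the full proofreading table (root ids are made up)
toy_all_df = pd.DataFrame(
    {
        "pt_root_id": [11, 12, 13, 14],
        "strategy_axon": [
            "axon_partially_extended",
            "none",
            "axon_fully_extended",
            "axon_interareal",
        ],
    }
)

# Strategies that indicate some level of axon proofreading
axon_strategies = [
    "axon_partially_extended",
    "axon_fully_extended",
    "axon_interareal",
]

# Keep only rows whose strategy is in the list, mirroring filter_in_dict
toy_axon_df = toy_all_df[toy_all_df["strategy_axon"].isin(axon_strategies)]
print(toy_axon_df)
```

Filtering in the query reduces the amount of data transferred, while filtering locally is convenient when several different subsets of the same table are needed.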

A more unified filter interface is available through a “table manager” interface.

Rather than passing a table name to the query_table function, client.materialize.tables has a subproperty for each table in the database that can be used to filter that table.

The general pattern for usage is

client.materialize.tables.{table_name}({filter options}).query({format and timestamp options})

where {table_name} is the name of the table you want to filter, {filter options} is a collection of arguments for filtering the query, and {format and timestamp options} are those parameters controlling the format and timestamp of the query.

Caution

Use of this functionality will show a brief warning that the interface is experimental. This is because the interface is still being developed and may change in the near future in response to user feedback.

With this, we can easily query all cells with proofread axons:

proof_axon_df = client.materialize.tables.proofreading_status_and_strategy(strategy_axon=["axon_partially_extended", "axon_fully_extended", "axon_interareal"]).query(
    select_columns=['pt_root_id','status_axon','status_dendrite','strategy_axon','strategy_dendrite'],
)
proof_axon_df.tail()

	pt_root_id	status_axon	status_dendrite	strategy_axon	strategy_dendrite
1977	864691135124248359	t	t	axon_partially_extended	dendrite_extended
1978	864691135447653010	t	t	axon_partially_extended	dendrite_extended
1979	864691135849699166	t	t	axon_partially_extended	dendrite_extended
1980	864691135396485877	t	t	axon_partially_extended	dendrite_extended
1981	864691135646493679	t	t	axon_fully_extended	dendrite_extended

Combining proofread cells and cell types

For analysis, you are often interested in neurons at the intersection of two or more groups, for example proofread cells that are also layer 2/3 pyramidal cells. The general workflow for this type of analysis is to:

Query from one table, for example the proofreading_status_and_strategy table
Query from another table, for example the aibs_metamodel_celltypes_v661
Merge the two tables on the shared index, in this case pt_root_id

We covered querying cell types in the previous quickstart notebook. Now let's put that together with the proofreading query:

# Query proofread cells with status_axon==True
proof_df = client.materialize.tables.proofreading_status_and_strategy(status_axon="t").query(
    select_columns=['pt_root_id','status_axon','status_dendrite','strategy_axon','strategy_dendrite'],
)

# Query cell types
cell_type_df = client.materialize.tables.aibs_metamodel_celltypes_v661().query(
    select_columns = {'nucleus_detection_v0': ['pt_root_id', 'id'],
                      'aibs_metamodel_celltypes_v661': ['classification_system','cell_type'],
                     },
)

Tip

Note that the select_columns argument differs between the two queries. That is because the second table, aibs_metamodel_celltypes_v661, is itself a reference on nucleus_detection_v0. That means the id column returned here is the same as the nucleus id of the cell. This is handy for referencing the same cell across materialization versions, as the nucleus id does not change, whereas the pt_root_id will change with proofreading.
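
To illustrate why a stable nucleus id is useful, here is a minimal sketch that matches cells between two versions. Both dataframes are toy data with made-up ids, standing in for the same table queried at an older and a newer materialization version.

```python
import pandas as pd

# Toy stand-ins for one table queried at two versions (all ids are made up)
old_df = pd.DataFrame({"nucleus_id": [1, 2, 3], "pt_root_id": [100, 200, 300]})
new_df = pd.DataFrame({"nucleus_id": [1, 2, 3], "pt_root_id": [100, 250, 300]})

# Match cells on the stable nucleus id rather than the changing root id
across_df = old_df.merge(new_df, on="nucleus_id", suffixes=("_old", "_new"))

# Flag cells whose root id changed between the two versions
across_df["root_changed"] = (
    across_df["pt_root_id_old"] != across_df["pt_root_id_new"]
)
print(across_df)
```

Merging on `pt_root_id` instead would silently lose the cell whose root id changed, which is exactly the case where tracking matters most.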

Now we can merge the two tables together on the shared index!

But first it is worth checking whether there are duplicates in either of the tables. How you handle duplicates will depend on your question and the table you are using. Here we might see duplicates from multi-soma merges in the cell type table:

cell_type_df.value_counts('pt_root_id')

pt_root_id
864691135891586697    350
0                     175
864691137020183406    102
864691136974041116     60
864691135455264362     43
                     ... 
864691135568905862      1
864691135568905350      1
864691135568904326      1
864691135568904172      1
864691137199312705      1
Name: count, Length: 91446, dtype: int64

For analytical simplicity, we will drop any multi-soma objects. We will also rename the id column for clarity.

# Drop duplicate pt_root_id and rename the nucleus_id
cell_type_df = (cell_type_df
                .drop_duplicates('pt_root_id', keep=False)
                .rename(columns={'id': 'nucleus_id'})
               )
                        
cell_type_df.head()

	pt_root_id	nucleus_id	classification_system	cell_type
0	864691136274724621	336365	excitatory_neuron	5P-IT
1	864691135489403194	110648	excitatory_neuron	23P
2	864691136147292311	112071	excitatory_neuron	23P
3	864691135655940290	197927	nonneuron	oligo
4	864691135809440972	198087	nonneuron	astrocyte
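
The `keep=False` argument matters here. As a minimal sketch on toy data (all ids are made up), the default behaviour of `drop_duplicates` keeps the first row of each repeated `pt_root_id`, whereas `keep=False` removes every row that shares its `pt_root_id` with another row:

```python
import pandas as pd

# Toy cell type table: root ids 500 and 0 each appear twice (ids are made up)
toy_cell_type_df = pd.DataFrame(
    {
        "pt_root_id": [500, 500, 0, 0, 600, 700],
        "id": [1, 2, 3, 4, 5, 6],
    }
)

# Default: keep the first occurrence of each repeated root id
keep_first = toy_cell_type_df.drop_duplicates("pt_root_id")

# keep=False: drop all rows whose root id occurs more than once
keep_none = toy_cell_type_df.drop_duplicates("pt_root_id", keep=False)

print(keep_first["pt_root_id"].tolist())
print(keep_none["pt_root_id"].tolist())
```

With the default, one arbitrary nucleus from each multi-soma object would remain in the table; `keep=False` removes such objects entirely, which matches the intent of dropping them.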

Now we can merge the two tables with pandas.merge, on the shared pt_root_id column. We will keep the inner join of the two tables: cells that 1) are proofread, and 2) have a cell type.

proof_cell_type_df = pd.merge(proof_df, cell_type_df, on='pt_root_id', how='inner')
proof_cell_type_df.tail()

	pt_root_id	status_axon	status_dendrite	strategy_axon	strategy_dendrite	nucleus_id	classification_system	cell_type
1929	864691135124248359	t	t	axon_partially_extended	dendrite_extended	301119	excitatory_neuron	5P-IT
1930	864691135447653010	t	t	axon_partially_extended	dendrite_extended	300956	excitatory_neuron	5P-IT
1931	864691135849699166	t	t	axon_partially_extended	dendrite_extended	301083	excitatory_neuron	5P-IT
1932	864691135396485877	t	t	axon_partially_extended	dendrite_extended	342334	excitatory_neuron	6P-IT
1933	864691135646493679	t	t	axon_fully_extended	dendrite_extended	301203	inhibitory_neuron	BPC

And we have the list of all proofread cells, by their cell type!
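
An inner join silently drops proofread cells that have no matching row in the cell type table. As a minimal sketch on toy data (ids and labels are made up), a left join with the pandas `indicator` option shows which rows would be lost, and `validate` raises an error if either side still contains duplicate keys:

```python
import pandas as pd

# Toy stand-ins for the proofreading and cell type tables (ids are made up)
toy_proof = pd.DataFrame(
    {
        "pt_root_id": [1, 2, 3],
        "strategy_axon": ["axon_fully_extended"] * 3,
    }
)
toy_types = pd.DataFrame(
    {
        "pt_root_id": [1, 3, 4],
        "cell_type": ["23P", "BPC", "5P-IT"],
    }
)

# Left join keeps every proofread cell; _merge records where each row matched
checked = toy_proof.merge(
    toy_types,
    on="pt_root_id",
    how="left",
    indicator=True,
    validate="one_to_one",  # raises MergeError if keys are duplicated
)

# Proofread cells with no cell type entry
missing = checked[checked["_merge"] == "left_only"]
print(missing)
```

Counting the `left_only` rows is a quick way to see how many cells the inner join excludes before deciding whether that matters for the analysis.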

We can do the same kind of query more simply by filtering the second table on both the root ids of interest and the cell type of interest. If we wanted only the proofread 23P cells, we could do:

# Query the proofread 23P cells, and merge the proofreading status
proof_23p_df = (client.materialize.tables.aibs_metamodel_celltypes_v661(pt_root_id=proof_df.pt_root_id, cell_type='23P').query(
    select_columns = {'nucleus_detection_v0': ['pt_root_id', 'id'],
                      'aibs_metamodel_celltypes_v661': ['classification_system','cell_type'],
                     },  )
                .rename(columns={'id': 'nucleus_id'})
                .merge(proof_df, on='pt_root_id', how='inner')
               )

proof_23p_df.head()

	pt_root_id	nucleus_id	classification_system	cell_type	status_axon	status_dendrite	strategy_axon	strategy_dendrite
0	864691135473477426	258375	excitatory_neuron	23P	t	t	axon_partially_extended	dendrite_clean
1	864691135257669039	258377	excitatory_neuron	23P	t	t	axon_partially_extended	dendrite_clean
2	864691135763593014	258403	excitatory_neuron	23P	t	t	axon_partially_extended	dendrite_clean
3	864691135763433270	258225	excitatory_neuron	23P	t	t	axon_partially_extended	dendrite_clean
4	864691135645874159	292833	excitatory_neuron	23P	t	t	axon_partially_extended	dendrite_clean