CAVE Query: Proofread Cells

Version update

We have released a new public version 1412, as part of our quarterly release schedule. See details at Release Manifests: 1412.

Tutorials remain pinned to v1300 as the latest major version.

The Connectome Annotation Versioning Engine (CAVE) is a suite of tools developed at the Allen Institute and Seung Lab to manage large connectomics data.

Initial Setup

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

CAVEclient

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation for CAVEclient is available here.

To initialize a CAVEclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONS public data, we use the datastack name minnie65_public.

import numpy as np
import pandas as pd

from caveclient import CAVEclient
datastack_name = 'minnie65_public'
client = CAVEclient(datastack_name)

# Show the description of the datastack
client.info.get_datastack_info()['description']
'This is the publicly released version of the minnie65 volume and segmentation. '

Materialization versions

Data in CAVE is timestamped and periodically versioned; each materialization version corresponds to a specific timestamp. Individual versions are made publicly available. The materialization client lets you interact with the materialized annotation tables that were posted to the annotation service. These interactions are called queries to the dataset and are available from client.materialize. For more, see the CAVEclient Documentation.

Periodic updates are made to the public datastack, which will include updates to the available tables. Some cells will have a different pt_root_id between versions because they have undergone proofreading.

Tip

For analysis consistency, it is worth checking the version of the data you are using, and considering whether to specify the version with client.version = your_version

Read more about setting the version of your analysis

# see the available materialization versions
client.materialize.get_versions()
[1300, 1078, 117, 661, 343, 1181, 795, 943]

And these are their associated timestamps (all timestamps are in UTC):

for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")
Version 1300: 2025-01-13 10:10:01.286229+00:00
Version 1078: 2024-06-05 10:10:01.203215+00:00
Version 117: 2021-06-11 08:10:00.215114+00:00
Version 661: 2023-04-06 20:17:09.199182+00:00
Version 343: 2022-02-24 08:10:00.184668+00:00
Version 1181: 2024-09-16 10:10:01.121167+00:00
Version 795: 2023-08-23 08:10:01.404268+00:00
Version 943: 2024-01-22 08:10:01.497934+00:00
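The list returned by get_versions() above is not sorted. As a minimal sketch, the versions can be ordered chronologically with the standard library. The timestamps below are copied from the printed output above rather than fetched live; the names version_timestamps, versions_by_time, and latest_version are illustrative only. In a live session the values would come from client.materialize.get_timestamp(version) instead.

```python
from datetime import datetime

# Timestamps copied from the printed output above (all UTC).
version_timestamps = {
    1300: "2025-01-13 10:10:01.286229+00:00",
    1078: "2024-06-05 10:10:01.203215+00:00",
    117: "2021-06-11 08:10:00.215114+00:00",
    661: "2023-04-06 20:17:09.199182+00:00",
    343: "2022-02-24 08:10:00.184668+00:00",
    1181: "2024-09-16 10:10:01.121167+00:00",
    795: "2023-08-23 08:10:01.404268+00:00",
    943: "2024-01-22 08:10:01.497934+00:00",
}

# Sort version numbers by their parsed timestamp, oldest first.
versions_by_time = sorted(
    version_timestamps,
    key=lambda v: datetime.fromisoformat(version_timestamps[v]),
)
latest_version = versions_by_time[-1]

print(versions_by_time)
print(latest_version)
```

For these eight versions, chronological order and numeric order agree, so sorting the version numbers directly would give the same result.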
# set materialization version, for consistency
client.version = 1300 # current public as of 1/13/2025
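Before pinning a version, it can help to confirm that the version is actually available. The sketch below wraps the two calls shown above (client.materialize.get_versions() and client.version = ...) in a small helper. The helper name pin_version is hypothetical and not part of CAVEclient, and the stub_client object is a stand-in so the example runs without a token or network access.

```python
from types import SimpleNamespace


def pin_version(client, version):
    """Set client.version only if the version is listed as available.

    Hypothetical helper; relies only on get_versions() and the
    client.version attribute used elsewhere in this tutorial.
    """
    available = client.materialize.get_versions()
    if version not in available:
        raise ValueError(
            f"Version {version} not available; choose from {sorted(available)}"
        )
    client.version = version
    return client.version


# Stand-in for a real CAVEclient, using the version list printed above.
stub_client = SimpleNamespace(
    materialize=SimpleNamespace(
        get_versions=lambda: [1300, 1078, 117, 661, 343, 1181, 795, 943]
    ),
    version=None,
)

pinned = pin_version(stub_client, 1300)
print(pinned)
```

With a real client, the same call would be pin_version(client, 1300).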

Querying Proofread neurons

Proofread neurons

Proofreading is necessary to obtain accurate reconstructions of a cell. In the MICrONS dataset, the general rule is that dendrites of cells with a single cell body are sufficiently proofread to trust the synaptic connections onto that cell. Axons, on the other hand, require so much proofreading that only ~1800 cells have axons reliable enough for their outputs to be used in analysis.

Table name: proofreading_status_and_strategy

The table proofreading_status_and_strategy describes the status of cells that have undergone manual proofreading.

Because of the inherent difference in the difficulty and time required for different kinds of proofreading, we describe the status of axons and dendrites separately.

Each compartment status may be either:

  • FALSE: indicates no comprehensive proofreading has been performed, or is not applicable.
  • TRUE: indicates that false merges have been comprehensively removed, and the compartment is at least ‘clean’. Consult the strategy column if completeness of the compartment is relevant to your research.

An axon or dendrite labeled as status=TRUE can be trusted to be correct, but may not be complete. The degree of completion can be read from the strategy column. For more information, please see Proofreading and Data Quality, or the microns-explorer page on proofreading strategies.

The key columns are:

Proofreading Status Table
Column Description
id Soma ID for the cell
pt_position, pt_supervoxel_id, pt_root_id Bound spatial point columns associated with the centroid of the cell nucleus
valid_id The root id of the neuron when the proofreading assessment was made. NOTE: if this does not match the pt_root_id, then the cell has undergone further changes. This is usually an improvement in proofreading, but proceed with caution.
status_dendrite The status of the dendrite proofreading. May be TRUE or FALSE
status_axon The status of the axon proofreading. May be TRUE or FALSE
strategy_dendrite The strategy employed to proofread the dendrite. See strategy table below for details
strategy_axon The strategy employed to proofread the axon. See strategy table below for details

The specific strategies are as follows (and will update over time):

Proofreading Strategies

Proofreading Strategy Table
Strategy Description
none No cleaning, and no extension. Indicates an entry in proofreading_status that is FALSE for that compartment
dendrite_clean The dendrite had incorrectly-merged axon and dendritic segments comprehensively removed, meaning the input synapses are accurate. The dendrite may be incorrectly truncated by segmentation error. Not all dendrite tips have been checked for extension. No comprehensive attempt was made to re-attach spine heads.
dendrite_extended The dendrite had incorrectly-merged axon and dendritic segments comprehensively removed, meaning the input synapses are accurate. Every tip was identified, manually inspected, and extended if possible. No comprehensive attempt was made to re-attach spine heads.
axon_column_truncated The axon was extended within the V1 cortical column, with a preference for local connections. In some cases the axon was cut at the column boundary and/or the layer boundary, especially the boundary between layers 2/3 and layer 4. Output synapses represent a sampling of potential partners.
axon_interareal The axon was extended with a preference for branches that projected to other brain areas. Some axon branches were fully extended, but local connections may be incomplete. Output synapses represent a sampling of potential partners.
axon_partially_extended The axon was extended outward from the soma, following each branch to its termination. Output synapses represent a sampling of potential partners.
axon_fully_extended The axon was extended outward from the soma, following each branch to its termination. After initial extension, every endpoint was identified, manually inspected, and extended again if possible. Output synapses represent a largely complete sampling of partners.

This table, proofreading_status_and_strategy, supersedes proofreading_status_public_release.

# Standard query
client.materialize.query_table('proofreading_status_and_strategy')

# Content-aware query
client.materialize.tables.proofreading_status_and_strategy(status_axon='t').query()

Here we query and return the table as of version 1300.

For the commands used for querying tables, see the previous quickstart notebook on CAVE queries.

proof_all_df = client.materialize.query_table("proofreading_status_and_strategy", 
                                              desired_resolution=[1, 1, 1], 
                                              split_positions=True)
proof_all_df["strategy_axon"].value_counts()
strategy_axon
axon_partially_extended    1459
none                        149
axon_interareal             130
axon_fully_extended         127
Name: count, dtype: int64
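The valid_id column described in the table above can be compared against pt_root_id to find cells that changed after their proofreading assessment. The following is a minimal sketch on a toy table: the first two rows reuse ids from this tutorial's query output, the third row is made up to show a mismatch, and the helper name flag_changed_since_proofreading is hypothetical.

```python
import pandas as pd


def flag_changed_since_proofreading(df):
    """Return a copy with a boolean column that is True where
    valid_id differs from the current pt_root_id."""
    out = df.copy()
    out["changed_since_proofreading"] = out["valid_id"] != out["pt_root_id"]
    return out


toy_df = pd.DataFrame(
    {
        "id": [3062, 3063, 9999],  # 9999 is a made-up row
        "valid_id": [864691135867413893, 864691135276601317, 111],
        "pt_root_id": [864691135867413893, 864691135276601317, 222],
    }
)

flagged = flag_changed_since_proofreading(toy_df)
print(flagged[["id", "changed_since_proofreading"]])
```

Applied to the real table, the call would be flag_changed_since_proofreading(proof_all_df); rows flagged True are the ones to treat with caution.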

Filtering Queries by proofreading status

We can filter our query to only return rows that match a condition by adding a filter to our query:

proof_axon_df = client.materialize.query_table("proofreading_status_and_strategy", 
                                               filter_in_dict={"strategy_axon": ["axon_partially_extended", "axon_fully_extended", "axon_interareal"]}, 
                                               desired_resolution=[1, 1, 1], 
                                               split_positions=True)
proof_axon_df.tail()
id created superceded_id valid pt_position_x pt_position_y pt_position_z valid_id status_dendrite status_axon strategy_dendrite strategy_axon pt_supervoxel_id pt_root_id
1711 3062 2025-01-12 03:00:00.936352+00:00 NaN t 705920.0 705664.0 903960.0 864691135867413893 t t dendrite_clean axon_partially_extended 89031992993240476 864691135867413893
1712 3063 2025-01-12 03:00:00.954516+00:00 NaN t 684800.0 659968.0 891160.0 864691135276601317 t t dendrite_clean axon_partially_extended 88326724936602021 864691135276601317
1713 3064 2025-01-12 03:00:00.972946+00:00 NaN t 742400.0 667008.0 830000.0 864691135742247787 t t dendrite_clean axon_partially_extended 90297324450184090 864691135742247787
1714 3065 2025-01-12 03:00:00.992770+00:00 NaN t 689472.0 839744.0 904880.0 864691135591960075 t t dendrite_extended axon_fully_extended 88473509805848805 864691135591960075
1715 3066 2025-01-12 03:00:01.012280+00:00 NaN t 746880.0 612224.0 891720.0 864691135939049988 t t dendrite_clean axon_partially_extended 90436206713993499 864691135939049988
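If the full table has already been downloaded, the same selection can be reproduced locally with pandas instead of a second server query. The sketch below builds a toy stand-in for proof_all_df from the value_counts() output shown earlier (only the strategy_axon column is reconstructed) and filters it with isin. The names toy_proof_df and local_axon_df are illustrative.

```python
import numpy as np
import pandas as pd

# Counts copied from the value_counts() output above.
strategy_counts = {
    "axon_partially_extended": 1459,
    "none": 149,
    "axon_interareal": 130,
    "axon_fully_extended": 127,
}

# Toy stand-in for proof_all_df: one row per cell, strategy column only.
toy_proof_df = pd.DataFrame(
    {
        "strategy_axon": np.repeat(
            list(strategy_counts.keys()), list(strategy_counts.values())
        )
    }
)

proofread_strategies = [
    "axon_partially_extended",
    "axon_fully_extended",
    "axon_interareal",
]

# Local equivalent of the filter_in_dict argument used above.
local_axon_df = toy_proof_df[toy_proof_df["strategy_axon"].isin(proofread_strategies)]
n_proofread_axons = len(local_axon_df)
print(n_proofread_axons)  # 1716
```

The count of 1716 rows is consistent with the last row index of 1715 in the filtered output above.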

A more unified filter interface is available through a “table manager” interface.

Rather than passing a table name to the query_table function, client.materialize.tables has a subproperty for each table in the database that can be used to filter that table.

The general pattern for usage is

client.materialize.tables.{table_name}({filter options}).query({format and timestamp options})

where {table_name} is the name of the table you want to filter, {filter options} is a collection of arguments for filtering the query, and {format and timestamp options} are those parameters controlling the format and timestamp of the query.

Caution

Use of this functionality will show a brief warning that the interface is experimental. This is because the interface is still being developed and may change in the near future in response to user feedback.

With this, we can easily query all cells with proofread axons:

proof_axon_df = client.materialize.tables.proofreading_status_and_strategy(strategy_axon=["axon_partially_extended", "axon_fully_extended", "axon_interareal"]).query(
    select_columns=['pt_root_id','status_axon','status_dendrite','strategy_axon','strategy_dendrite'],
)
proof_axon_df.tail()
pt_root_id status_axon status_dendrite strategy_axon strategy_dendrite
1711 864691135867413893 t t axon_partially_extended dendrite_clean
1712 864691135276601317 t t axon_partially_extended dendrite_clean
1713 864691135742247787 t t axon_partially_extended dendrite_clean
1714 864691135591960075 t t axon_fully_extended dendrite_extended
1715 864691135939049988 t t axon_partially_extended dendrite_clean

Combining proofread cells and cell types

In analysis, you are often interested in neurons that are at the intersection of two or more groups, for example proofread cells that are also layer 2/3 pyramidal cells. The general workflow for this type of analysis is to:

  1. Query from one table, for example the proofreading_status_and_strategy table
  2. Query from another table, for example the aibs_metamodel_celltypes_v661
  3. Merge the two tables on the shared index, in this case pt_root_id

We covered querying cell types in the previous quickstart notebook. Now let's put that together with the proofreading query:

# Query proofread cells with status_axon==True
proof_df = client.materialize.tables.proofreading_status_and_strategy(status_axon="t").query(
    select_columns=['pt_root_id','status_axon','status_dendrite','strategy_axon','strategy_dendrite'],
)

# Query cell types
cell_type_df = client.materialize.tables.aibs_metamodel_celltypes_v661().query(
    select_columns = {'nucleus_detection_v0': ['pt_root_id', 'id'],
                      'aibs_metamodel_celltypes_v661': ['classification_system','cell_type'],
                     },
)
Tip

Note that the ‘select_columns’ argument differs between the two tables. That is because the second table, aibs_metamodel_celltypes_v661, is itself a reference table on nucleus_detection_v0. That means the id column returned here is the same as the nucleus id of the cell. This is handy for referencing the same cell across materialization versions, as the nucleus id does not change, whereas the pt_root_id will change with proofreading.
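As an illustration of why the stable nucleus id is useful, the sketch below joins two toy tables standing in for the same query run at two materialization versions. All values are made up, and the column name nucleus_id assumes the id column has been renamed; only the pandas merge pattern is the point.

```python
import pandas as pd

# Hypothetical results of the same query at an older and a newer version.
old_version_df = pd.DataFrame(
    {"nucleus_id": [101, 102, 103], "pt_root_id": [11, 12, 13]}
)
new_version_df = pd.DataFrame(
    {"nucleus_id": [101, 102, 103], "pt_root_id": [11, 22, 13]}
)

# Join on the nucleus id, which stays fixed across versions.
compared = old_version_df.merge(
    new_version_df, on="nucleus_id", suffixes=("_old", "_new")
)

# Cells whose root id changed between versions (for example, after proofreading).
changed = compared[compared["pt_root_id_old"] != compared["pt_root_id_new"]]
changed_nuclei = changed["nucleus_id"].tolist()
print(changed_nuclei)  # [102]
```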

Now we can merge the two tables together on the shared index!

But it is worth checking whether there are duplicates in either of the tables. How you handle duplicates will depend on your question and the table you are using. Here we might see duplicates from multi-soma merges in the cell type table.

cell_type_df.value_counts('pt_root_id')
pt_root_id
864691134966128927    350
0                     175
864691137020183406    102
864691136974041116     60
864691135303545767     43
                     ... 
864691135571655661      1
864691135571652845      1
864691135571650797      1
864691135571649317      1
864691135777169085      1
Name: count, Length: 91423, dtype: int64

For analytical simplicity, we will drop any multi-soma objects. We will also rename the id column for clarity.

# Drop duplicate pt_root_id and rename the nucleus_id
cell_type_df = (cell_type_df
                .drop_duplicates('pt_root_id', keep=False)
                .rename(columns={'id': 'nucleus_id'})
               )
                        
cell_type_df.head()
pt_root_id nucleus_id classification_system cell_type
0 864691136274724621 336365 excitatory_neuron 5P-IT
1 864691135489403194 110648 excitatory_neuron 23P
2 864691136147292311 112071 excitatory_neuron 23P
3 864691136050858227 197927 nonneuron oligo
4 864691135809440972 198087 nonneuron astrocyte
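The keep=False argument used above removes every row whose pt_root_id appears more than once, rather than keeping one representative. A small sketch on made-up ids shows the difference from the pandas default (keep='first'). The toy values, including the repeated 0, are illustrative; treating a root id of 0 as "no segment assigned" is an assumption here, not something stated in this tutorial.

```python
import pandas as pd

toy_cell_type_df = pd.DataFrame(
    {
        "pt_root_id": [10, 10, 20, 0, 0, 30],  # made-up ids
        "id": [1, 2, 3, 4, 5, 6],
    }
)

# keep=False: drop ALL rows that share a pt_root_id with another row.
unique_only = toy_cell_type_df.drop_duplicates("pt_root_id", keep=False)

# Default (keep='first'): keep the first row of each duplicated group.
first_kept = toy_cell_type_df.drop_duplicates("pt_root_id")

print(unique_only["pt_root_id"].tolist())  # [20, 30]
print(first_kept["pt_root_id"].tolist())   # [10, 20, 0, 30]
```

With keep=False, multi-soma objects and repeated placeholder ids are excluded entirely, which is the behavior this tutorial relies on.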

Now we can merge the two tables with pandas.merge, on the shared pt_root_id column. We will keep the inner join of the two tables: cells that 1) are proofread, and 2) have a cell type.

proof_cell_type_df = pd.merge(proof_df, cell_type_df, on='pt_root_id', how='inner')
proof_cell_type_df.tail()
pt_root_id status_axon status_dendrite strategy_axon strategy_dendrite nucleus_id classification_system cell_type
1700 864691135867413893 t t axon_partially_extended dendrite_clean 263073 excitatory_neuron 5P-IT
1701 864691135276601317 t t axon_partially_extended dendrite_clean 262889 excitatory_neuron 4P
1702 864691135742247787 t t axon_partially_extended dendrite_clean 298800 excitatory_neuron 4P
1703 864691135591960075 t t axon_fully_extended dendrite_extended 269591 excitatory_neuron 5P-IT
1704 864691135939049988 t t axon_partially_extended dendrite_clean 296761 excitatory_neuron 4P

And we have the list of all proofread cells, by their cell type!
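An inner join silently drops proofread cells that have no match in the cell type table. One way to audit this, sketched here on made-up tables, is a left join with pandas' indicator=True option, which adds a _merge column, together with validate='one_to_one', which raises an error if either table still contains duplicate keys. The table contents and variable names are illustrative.

```python
import pandas as pd

# Made-up stand-ins for proof_df and the de-duplicated cell_type_df.
toy_proof_df = pd.DataFrame(
    {
        "pt_root_id": [1, 2, 3],
        "strategy_axon": [
            "axon_partially_extended",
            "axon_fully_extended",
            "axon_interareal",
        ],
    }
)
toy_cell_type_df = pd.DataFrame(
    {"pt_root_id": [1, 3, 4], "cell_type": ["23P", "4P", "5P-IT"]}
)

audit = pd.merge(
    toy_proof_df,
    toy_cell_type_df,
    on="pt_root_id",
    how="left",            # keep every proofread cell
    indicator=True,        # adds a "_merge" column
    validate="one_to_one"  # fail loudly on duplicate keys
)

# Proofread cells with no cell type entry.
missing_type = audit.loc[audit["_merge"] == "left_only", "pt_root_id"].tolist()
print(missing_type)  # [2]
```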

We can do the same kind of query more simply by filtering the second table on both the root ids of interest and the cell type of interest. If we wanted only the proofread 23P cells, we could do:

# Query the proofread 23P cells, and merge the proofreading status
proof_23p_df = (
    client.materialize.tables.aibs_metamodel_celltypes_v661(
        pt_root_id=proof_df.pt_root_id, cell_type='23P'
    )
    .query(
        select_columns={
            'nucleus_detection_v0': ['pt_root_id', 'id'],
            'aibs_metamodel_celltypes_v661': ['classification_system', 'cell_type'],
        },
    )
    .rename(columns={'id': 'nucleus_id'})
    .merge(proof_df, on='pt_root_id', how='inner')
)

proof_23p_df.head()
pt_root_id nucleus_id classification_system cell_type status_axon status_dendrite strategy_axon strategy_dendrite
0 864691135473477426 258375 excitatory_neuron 23P t t axon_partially_extended dendrite_clean
1 864691135257669039 258377 excitatory_neuron 23P t t axon_partially_extended dendrite_clean
2 864691135763593014 258403 excitatory_neuron 23P t t axon_partially_extended dendrite_clean
3 864691135763433270 258225 excitatory_neuron 23P t t axon_partially_extended dendrite_clean
4 864691135645874159 292833 excitatory_neuron 23P t t axon_partially_extended dendrite_clean