Datasets

Listing datasets

cxg list (alias for cxg dataset list) fetches all public datasets and applies client-side filters.

Filtering

Repeating the same filter creates an OR condition. Combining different filters creates an AND condition.

# Human datasets from brain or lung tissue
cxg list --organism "Homo sapiens" --tissue brain --tissue lung

# Large datasets with at least 100k cells
cxg list --min-cells 100000

# Search by title
cxg list --title "aging"

# Spatial datasets only
cxg list --has-spatial

# Combine filters (AND logic across different fields)
cxg list --organism "Mus musculus" --tissue heart --assay "10x"

Available filters

| Option | Type | Description |
| --- | --- | --- |
| --organism | text (repeatable) | Filter by organism label |
| --tissue | text (repeatable) | Filter by tissue label |
| --assay | text (repeatable) | Filter by assay label |
| --cell-type | text (repeatable) | Filter by cell type label |
| --disease | text (repeatable) | Filter by disease label |
| --suspension-type | text (repeatable) | Filter by suspension type: cell, nucleus, or na |
| --title | text | Substring match on dataset title |
| --collection | text | Substring match on collection name |
| --collection-id | text | Exact match on collection ID |
| --min-cells | integer | Minimum cell count |
| --max-cells | integer | Maximum cell count |
| --schema-version | text | Exact match on schema version |
| --has-spatial | flag | Only include datasets with spatial data |

Text filters for ontology fields (organism, tissue, assay, cell type, disease) use case-insensitive substring matching. Use cxg field values to discover valid values.
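
The two matching rules combine as follows: a field passes when any of its patterns occurs in the label, ignoring case, and a dataset is kept only when every field passes. The shell function below is a rough model of that behavior for illustration, not the tool's own code; label_matches_any is a hypothetical helper name and the labels are sample values.

```shell
# Succeeds if LABEL contains at least one PATTERN, ignoring case.
# Usage: label_matches_any LABEL PATTERN...
label_matches_any() {
  label=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  shift
  for pattern in "$@"; do
    pattern=$(printf '%s' "$pattern" | tr '[:upper:]' '[:lower:]')
    case "$label" in
      *"$pattern"*) return 0 ;;   # substring hit: this field passes
    esac
  done
  return 1
}

# Models: --organism "homo sapiens" --tissue brain --tissue lung
# OR within the tissue field, AND between organism and tissue.
if label_matches_any "Homo sapiens" "homo sapiens" &&
   label_matches_any "lung parenchyma" brain lung; then
  echo "dataset kept"
fi
```

Because matching is by substring, a short pattern such as lung also matches longer labels that contain it.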

Persistent filter defaults

Set default filters in the config file to avoid repeating them on every command. For example, a mouse genetics researcher could set defaults.organism: [Mus musculus] so that every query is limited to mouse datasets. CLI flags, when provided, override the config defaults.

Output formats

# Default: Rich table
cxg list --organism "Homo sapiens"

# JSON for programmatic use
cxg list --organism "Homo sapiens" --output json

# TSV for spreadsheets or command-line tools
cxg list --tissue lung --output tsv > lung_datasets.tsv

# Just the count
cxg list --organism "Homo sapiens" --count

# One dataset ID per line (useful for piping)
cxg list --tissue retina --id-only
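
In scripts, --count can gate a later step. The sketch below assumes that --count prints a bare integer on stdout, which this page does not state explicitly; count_at_most is a hypothetical helper name.

```shell
# Succeeds if the number of matching datasets is at most MAX.
# Returns 2 if the cxg call itself fails.
# Usage: count_at_most MAX [cxg list filters...]
count_at_most() {
  max="$1"
  shift
  # Assumption: --count prints a bare integer on stdout.
  n=$(cxg list "$@" --count) || return 2
  [ "$n" -le "$max" ]
}

# Example (not run here): only list a query that is small enough to review.
# count_at_most 50 --tissue retina && cxg list --tissue retina
```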

Customizing table output

# Select specific columns
cxg list --columns "title,organism,cell_count"

# Limit the number of results
cxg list --limit 10

# Sort by cell count (descending)
cxg list --sort-by "cell_count desc"

# Sort by title (ascending)
cxg list --sort-by "title asc"

Dataset details

Retrieve full details for a specific dataset using its UUID or numeric index from cxg dataset list:

# By UUID
cxg dataset view DATASET_ID

# By numeric index
cxg dataset view 42

# As JSON
cxg dataset view DATASET_ID --output json

# Force-refresh the cache before lookup
cxg dataset view DATASET_ID --refresh

# Open the dataset's collection on cellxgene.cziscience.com
cxg dataset view DATASET_ID -w

Individual datasets do not have their own page on cellxgene.cziscience.com; -w/--web opens the collection that owns the dataset.
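
To collect details for many datasets, the ID list from cxg list --id-only can drive a loop over cxg dataset view. A minimal sketch using only the flags shown above; save_details and the one-file-per-ID layout are hypothetical, and stdin is redirected inside the loop on the assumption that the command might otherwise consume the remaining IDs.

```shell
# Save the JSON details of every dataset ID read from stdin into DIR.
# Usage: ... | save_details DIR
save_details() {
  dir="$1"
  mkdir -p "$dir" || return 1
  while IFS= read -r id; do
    [ -n "$id" ] || continue   # skip blank lines
    # </dev/null keeps cxg from reading the ID list meant for the loop.
    cxg dataset view "$id" --output json < /dev/null > "$dir/$id.json"
  done
}

# Example (not run here):
# cxg list --tissue retina --id-only | save_details ./details
```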

Downloading datasets

Download one or more datasets by UUID or numeric index:

# Download by UUID
cxg dataset download ID1 ID2 ID3

# Download by numeric index
cxg dataset download 42 57

# Download as RDS instead of H5AD (default)
cxg dataset download ID1 --filetype rds

# Specify output directory
cxg dataset download ID1 --output-dir ./data

# Skip confirmation prompt
cxg dataset download ID1 --yes

# Overwrite existing files
cxg dataset download ID1 --overwrite

Downloading a whole collection

Pass --collection-id to download every dataset in a collection without listing IDs by hand:

cxg dataset download --collection-id 4fefa187-5d14-4f1e-915b-c892ed320aab

--collection-id is mutually exclusive with positional dataset IDs and stdin input.

Parallel downloads

Multiple datasets are downloaded concurrently, each with its own progress bar. The default is 3 parallel downloads; tune it with --parallel/-j:

# Download 5 datasets, 5 in parallel
cxg dataset download ID1 ID2 ID3 ID4 ID5 -j 5

# Disable parallelism (sequential)
cxg dataset download ID1 ID2 ID3 -j 1

The valid range is 1-16. Higher values rarely help on a single network connection and can clutter the terminal.
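
When the job count comes from a variable in a script, it can be clamped to the documented range before being passed to -j. A small sketch; clamp_jobs and JOBS are hypothetical names, and how the CLI itself reacts to out-of-range values is not covered here.

```shell
# Print N limited to the documented 1-16 range for --parallel/-j.
# Usage: clamp_jobs N
clamp_jobs() {
  n="$1"
  if [ "$n" -lt 1 ]; then
    n=1
  elif [ "$n" -gt 16 ]; then
    n=16
  fi
  printf '%s\n' "$n"
}

# Example (not run here): fall back to the default of 3 when JOBS is unset.
# cxg dataset download ID1 ID2 ID3 -j "$(clamp_jobs "${JOBS:-3}")"
```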

Piped workflows

Dataset IDs can be piped from one command to another:

# Filter datasets, then download the matches
cxg list --organism "Homo sapiens" --tissue retina --id-only | cxg dataset download

# Download the 5 largest human datasets
cxg list --organism "Homo sapiens" --sort-by "cell_count desc" --limit 5 --id-only \
  | cxg dataset download --output-dir ./large_datasets
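
The same pipe pattern can be wrapped in a loop, for example to keep each tissue's files in a separate directory. This is a sketch under assumptions: download_by_tissue is a hypothetical helper, --yes is added because stdin is occupied by the ID list, and the behavior of cxg dataset download when no IDs arrive on stdin is not specified on this page.

```shell
# Download the datasets for each tissue into BASE/<tissue>.
# Usage: download_by_tissue BASE TISSUE...
download_by_tissue() {
  base="$1"
  shift
  for tissue in "$@"; do
    mkdir -p "$base/$tissue" || return 1
    cxg list --tissue "$tissue" --id-only |
      cxg dataset download --output-dir "$base/$tissue" --yes
  done
}

# Example (not run here):
# download_by_tissue ./data retina lung
```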