Datasets
Listing datasets
cxg list (alias for cxg dataset list) fetches all public datasets and applies client-side filters.
Filtering
Repeating the same filter creates an OR condition. Combining different filters creates an AND condition.
# Human datasets from brain or lung tissue
cxg list --organism "Homo sapiens" --tissue brain --tissue lung
# Large datasets with at least 100k cells
cxg list --min-cells 100000
# Search by title
cxg list --title "aging"
# Spatial datasets only
cxg list --has-spatial
# Combine filters (AND logic across different fields)
cxg list --organism "Mus musculus" --tissue heart --assay "10x"
Available filters
| Option | Type | Description |
|---|---|---|
| --organism | text (repeatable) | Filter by organism label |
| --tissue | text (repeatable) | Filter by tissue label |
| --assay | text (repeatable) | Filter by assay label |
| --cell-type | text (repeatable) | Filter by cell type label |
| --disease | text (repeatable) | Filter by disease label |
| --suspension-type | text (repeatable) | Filter by suspension type: cell, nucleus, or na |
| --title | text | Substring match on dataset title |
| --collection | text | Substring match on collection name |
| --collection-id | text | Exact match on collection ID |
| --min-cells | integer | Minimum cell count |
| --max-cells | integer | Maximum cell count |
| --schema-version | text | Exact match on schema version |
| --has-spatial | flag | Only include datasets with spatial data |
Text filters for ontology fields (organism, tissue, assay, cell type, disease) use case-insensitive substring matching. Use cxg field values to discover valid values.
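The matching rules can be modelled in a few lines of shell. This is only a rough sketch of the documented behaviour (case-insensitive substring match, OR across repeated values, AND across different fields), not the tool's actual code; the function name matches_any and the sample labels are made up for illustration.

```shell
# Rough model of a repeatable text filter: the label (first argument) matches
# when it contains ANY of the remaining arguments, ignoring case.
# matches_any is a hypothetical helper, not part of cxg.
matches_any() {
  label=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  shift
  for value in "$@"; do
    needle=$(printf '%s' "$value" | tr '[:upper:]' '[:lower:]')
    case "$label" in
      *"$needle"*) return 0 ;;
    esac
  done
  return 1
}

# Repeated values are ORed inside one call; different fields are ANDed with &&.
# The labels below are made-up sample inputs.
if matches_any "Homo sapiens" "homo" && matches_any "lung parenchyma" brain lung; then
  echo "match"
fi
```

Under this model, a short value such as "10x" matches every label that contains it, which is why checking the real labels with cxg field values first can save surprises.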
Persistent filter defaults
Set default filters in the config file to avoid repeating them on every command. For example, a mouse genetics researcher could set defaults.organism: [Mus musculus] to always filter to mouse datasets. CLI flags override config defaults when provided.
Output formats
# Default: Rich table
cxg list --organism "Homo sapiens"
# JSON for programmatic use
cxg list --organism "Homo sapiens" --output json
# TSV for spreadsheets or command-line tools
cxg list --tissue lung --output tsv > lung_datasets.tsv
# Just the count
cxg list --organism "Homo sapiens" --count
# One dataset ID per line (useful for piping)
cxg list --tissue retina --id-only
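When the TSV output is consumed by other command-line tools, selecting a column by name is more robust than counting fields. The helper below is a sketch that assumes the TSV starts with a header row and that header names resemble those accepted by --columns (for example cell_count); neither assumption is confirmed here, and tsv_column is a hypothetical name.

```shell
# Print one column of a tab-separated stream, selected by header name.
# Assumes the first line is a header row (an assumption about the TSV output).
# tsv_column is a hypothetical helper, not part of cxg.
tsv_column() {
  awk -F '\t' -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { print $col }
  '
}

# Possible usage (requires cxg, so it is left commented out):
# cxg list --tissue lung --output tsv | tsv_column cell_count
```

If the requested header is not present, the helper prints nothing rather than guessing a column.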
Customizing table output
# Select specific columns
cxg list --columns "title,organism,cell_count"
# Limit the number of results
cxg list --limit 10
# Sort by cell count (descending)
cxg list --sort-by "cell_count desc"
# Sort by title (ascending)
cxg list --sort-by "title asc"
Dataset details
Retrieve full details for a specific dataset using its UUID or numeric index from cxg dataset list:
# By UUID
cxg dataset view DATASET_ID
# By numeric index
cxg dataset view 42
# As JSON
cxg dataset view DATASET_ID --output json
# Force-refresh the cache before lookup
cxg dataset view DATASET_ID --refresh
# Open the dataset's collection on cellxgene.cziscience.com
cxg dataset view DATASET_ID -w
Individual datasets do not have their own page on cellxgene.cziscience.com; -w/--web opens the collection that owns the dataset.
Downloading datasets
Download one or more datasets by UUID or numeric index:
# Download by UUID
cxg dataset download ID1 ID2 ID3
# Download by numeric index
cxg dataset download 42 57
# Download as RDS instead of the default H5AD
cxg dataset download ID1 --filetype rds
# Specify output directory
cxg dataset download ID1 --output-dir ./data
# Skip confirmation prompt
cxg dataset download ID1 --yes
# Overwrite existing files
cxg dataset download ID1 --overwrite
Downloading a whole collection
Pass --collection-id to download every dataset in a collection without listing IDs by hand. The option is mutually exclusive with positional dataset IDs and stdin input.
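In its simplest form this would presumably be cxg dataset download --collection-id COLLECTION_ID, where COLLECTION_ID is a placeholder. For scripts that need several collections, a small loop is one option. The sketch below assumes --collection-id takes a single ID per invocation and can be combined with --yes; download_collections is a hypothetical helper name.

```shell
# Download several collections one after another.
# Each argument is a collection ID; stop at the first failed download.
# Assumes --collection-id and --yes can be combined (not confirmed here).
download_collections() {
  for collection_id in "$@"; do
    cxg dataset download --collection-id "$collection_id" --yes || return 1
  done
}

# Possible usage (placeholders, left commented out):
# download_collections COLLECTION_ID_1 COLLECTION_ID_2
```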
Parallel downloads
Multiple datasets are downloaded concurrently, each with its own progress bar. The default is 3 parallel downloads; tune it with --parallel/-j:
# Download 5 datasets, 5 in parallel
cxg dataset download ID1 ID2 ID3 ID4 ID5 -j 5
# Disable parallelism (sequential)
cxg dataset download ID1 ID2 ID3 -j 1
The valid range is 1-16. Higher values rarely help on a single network connection and can clutter the terminal.
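Scripts that compute the job count (for example from the number of IDs) may want to keep it inside the documented 1-16 range before passing it to -j. How the tool reacts to out-of-range values is not stated here, so the sketch below simply clamps the number; clamp_jobs is a hypothetical helper name.

```shell
# Clamp a requested job count into the documented 1-16 range.
# clamp_jobs is a hypothetical helper, not part of cxg.
clamp_jobs() {
  n=$1
  if [ "$n" -lt 1 ]; then n=1; fi
  if [ "$n" -gt 16 ]; then n=16; fi
  echo "$n"
}

# Possible usage (left commented out because it needs cxg):
# cxg dataset download ID1 ID2 ID3 -j "$(clamp_jobs 32)"
```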
Piped workflows
Dataset IDs can be piped from one command to another:
# Filter datasets, then download the matches
cxg list --organism "Homo sapiens" --tissue retina --id-only | cxg dataset download
# Download the 5 largest human datasets
cxg list --organism "Homo sapiens" --sort-by "cell_count desc" --limit 5 --id-only \
| cxg dataset download --output-dir ./large_datasets
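Because IDs are read from stdin, the list can also be saved to a file, reviewed or trimmed by hand, and fed back in later. The sketch below assumes the downloader accepts one ID per line on stdin, as the --id-only pipeline suggests; download_from_file and the comment-line convention are illustrative additions, not features of cxg.

```shell
# Feed previously saved dataset IDs (one per line) to the downloader on stdin.
# Blank lines and lines starting with '#' are dropped first, so the file can
# carry notes. Extra arguments are passed through to cxg dataset download.
# download_from_file is a hypothetical helper, not part of cxg.
download_from_file() {
  id_file=$1
  shift
  grep -v -e '^[[:space:]]*$' -e '^#' "$id_file" | cxg dataset download "$@"
}

# Possible usage (left commented out because it needs cxg):
# cxg list --tissue retina --id-only > retina_ids.txt
# download_from_file retina_ids.txt --output-dir ./data
```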