Data structure
Data types and directory structure
The ABCD Study® provides a rich multimodal dataset that includes a variety of data types and file formats. This page offers an overview of the structure of the ABCD data, including the different types of data available and how they are organized within the dataset. For more detailed information on specific data types and their usage, please refer to the domain-specific documentation pages.
At a high level, the ABCD release data can be categorized into two main types—tabulated data and file-based data—which are described below.
Tabulated data
ABCD’s tabulated data resource consists of a set of tables that represent the main database of the study. Data from all assessment domains undergo a rigorous curation process while being prepared for inclusion in the tabulated data resource, ensuring that they are standardized, consistent, and ready for analysis.
The data are organized into tables, each containing a set of related variables (e.g., all items, summary scores, and administrative variables for a given assessment instrument; acquisition information and analysis results for a biospecimen collection; scores for a given imaging measure across all regions of interest for a specified brain atlas, etc.). The data tables are linked by identifier columns for participant and event, with one row per participant for static tables and one row per participant/event for longitudinal tables. Accompanying metadata describes all variables and tables and provides essential information for understanding the data and its use in analyses.
The tabulated data resource can be queried and accessed using different tools available through the NBDC Datahub. Additionally, it is provided as a set of flat files, one for each database table, as part of the file-based data resource (see below).
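As a minimal sketch of how linked tables can be combined in practice, the snippet below joins a static table (one row per participant) onto a longitudinal table (one row per participant/event) via a shared identifier column. The column names here are made up for illustration; consult the release metadata for the actual identifier variables.

```python
import pandas as pd

# Hypothetical identifier and variable names, for illustration only.
static = pd.DataFrame({
    "participant_id": ["sub-001", "sub-002"],
    "demo_group": ["A", "B"],  # static table: one row per participant
})
longitudinal = pd.DataFrame({
    "participant_id": ["sub-001", "sub-001", "sub-002"],
    "session_id": ["ses-00A", "ses-02A", "ses-00A"],
    "score": [10, 12, 9],  # longitudinal table: one row per participant/event
})

# Join the static table onto the longitudinal one via the participant identifier;
# the static values are repeated for each of a participant's events.
merged = longitudinal.merge(static, on="participant_id", how="left")
print(merged.shape)  # one row per participant/event, columns from both tables
```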
File-based data
ABCD’s file-based data resource consists of raw and processed data files generated from various instruments and modalities, including magnetic resonance imaging (MRI), genetic analyses, wearable sensors, neurocognitive experiments, and other assessments. The resulting data are provided either as individual-level files (one file, or collection of files, per participant/event) or in a concatenated format (one file per instrument or modality, containing data for all participants/events).
Brain Imaging Data Structure (BIDS)
Where possible, the file-based data are organized in accordance with the Brain Imaging Data Structure (BIDS) standard, with some modifications to meet the specific needs of the ABCD Study®. BIDS is a widely adopted standard for organizing and formatting neuroimaging data, facilitating data sharing, processing, and analysis across various platforms and tools.
The top-level directory structure of the file-based data resource is organized as follows:
├── abcc
│   ├── derivatives
│   └── rawdata
└── dairc
    ├── concat
    ├── derivatives
    ├── rawdata
    └── sourcedata
- The dairc/ directory contains the full file-based dataset prepared by the data core of the ABCD Study, the Data Analysis, Informatics & Resource Center (DAIRC), which is described in more detail below and in the domain-specific documentation pages.
- The abcc/ directory contains imaging raw data and derivatives from the ABCD-BIDS Community Collection (ABCC), which is described in more detail here.
Concatenated data
dairc
└── concat
    ├── genetics
    │   └── genotype_microarray
    ├── imaging
    │   ├── corrmat
    │   ├── vertexwise
    │   └── voxelwise
    ├── novel_technologies
    │   └── ears
    └── substance_use
        └── tlfb
The concat/ directory contains concatenated data files for various instruments and modalities, organized into subdirectories based on the type of data. The 6.0 release includes the following concatenated data:
- concat/genetics/: Genotype microarray data (see here).
- concat/imaging/: Concatenated imaging data for different modalities (see here).
- concat/novel_technologies/: Raw data for the screentime assessment (see here).
- concat/substance_use/: Raw data for Timeline Followback interviews (see here).
The BIDS specification does not account for concatenated data (besides the tabulated data, which is organized as one file per table in the rawdata/phenotype/ directory; see below). As such, the concat/ directory is not part of the BIDS structure but is an ABCD-specific addition to the file-based data resource.
Files in the concat/ directory are provided in a variety of formats, depending on the data type and intended use. The data in the novel_technologies/ears/ and substance_use/tlfb/ directories are available in both csv (plain-text) and parquet format (see below for more information about these formats).
Derivatives
dairc
└── derivatives
    ├── freesurfer
    │   ├── sub-<participant>_ses-<event>
    │   ├── ...
    │   └── dataset_description.json
    └── mmps_mproc
        ├── sub-<participant>
        │   ├── ses-<event>
        │   └── ...
        ├── ...
        └── dataset_description.json
The derivatives/ directory contains processed file-based data that are outputs from imaging processing pipelines, organized into subdirectories by derivative. In the 6.0 release these are freesurfer/ and mmps_mproc/, as shown in the directory tree above.
Rawdata
dairc
└── rawdata
    ├── phenotype
    │   ├── <table_name>.json
    │   ├── <table_name>.parquet
    │   ├── <table_name>.tsv
    │   └── ...
    ├── sub-<participant>
    │   ├── ses-<event>
    │   └── ...
    ├── ...
    ├── dataset_description.json
    ├── participants.json
    ├── participants.tsv
    ├── scans.json
    ├── sessions.json
    ├── task-<experiment>_beh.json
    └── ...
The rawdata/ directory contains raw data from MR imaging, neurocognitive experiments, and wearable sensors, as well as a copy of the tabulated data resource. The raw data for the 6.0 release is organized as follows:
- rawdata/phenotype/: Tabulated data, with one file per database table provided in both tsv (plain-text) and parquet formats, along with a BIDS sidecar json file that provides metadata for each table (see the domain-specific documentation pages).
- rawdata/sub-<participant>/: Raw imaging data (in NIfTI format), neurocognitive data, and wearable data, with one directory per participant and, nested within each participant, one directory per event (see here for imaging raw data and the domain-specific documentation pages for raw data from other domains).
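A small sketch of how the nested sub-<participant>/ses-<event> layout can be traversed programmatically; the directory names below are fabricated for the example, and real releases contain further modality subdirectories and files inside each event directory.

```python
import tempfile
from pathlib import Path

# Build a toy BIDS-style rawdata tree (fabricated participant/event names).
root = Path(tempfile.mkdtemp()) / "rawdata"
for sub, ses in [("sub-001", "ses-00A"), ("sub-001", "ses-02A"),
                 ("sub-002", "ses-00A")]:
    (root / sub / ses).mkdir(parents=True)

# Collect one (participant, event) pair per nested session directory.
events = sorted((p.parent.name, p.name) for p in root.glob("sub-*/ses-*"))
print(events)
```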
The BIDS specification currently does not officially support parquet files. To allow users to take advantage of the features of this modern and efficient open-source format, we include parquet files for the tabulated data in the rawdata/phenotype/ directory as an alternative to the tsv (plain-text) file format. This should be considered an ABCD-specific addition to the file-based data resource. See below for more information about the different file formats.
Sourcedata
dairc
└── sourcedata
    ├── sub-<participant>
    │   ├── ses-<event>
    │   └── ...
    ├── ...
    ├── dataset_description.json
    ├── scans.json
    └── sessions.json
The sourcedata/ directory contains the MR imaging source data. The source data for the 6.0 release is organized as follows:
- sourcedata/sub-<participant>/: Imaging source data (in DICOM format), with one directory per participant and, nested within each participant, one directory per event (see here).
Plain-text vs. Parquet files
Data from several assessment types and domains are made available in both delimiter-separated (CSV/TSV) plain-text format and Apache Parquet format in the file-based data resource.
Delimiter-separated formats (TSV/CSV) are widely compatible and easy to inspect, but they are less efficient for large datasets. These formats do not support selective column loading and cannot embed metadata, such as data type specifications. As a result, programming languages like Python or R must guess the data types during import, which can lead to errors. For example, categorical values provided as numbers that are formatted as strings (e.g., "0"/"1" to represent "Yes"/"No"; see data type standards for more details) may be interpreted as numeric, and columns with mostly missing values may be treated as empty if the first few rows lack data. To avoid these issues, users need to manually specify column types using the accompanying metadata upon import. The NBDCtools R package offers a helper function, read_dsv_formatted(), to automate this process (see the R packages page for details).
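The type-guessing pitfall can be illustrated in Python as well. In this sketch (column names invented for the example), default import turns string-coded categories into integers, while specifying types from metadata preserves them; the same idea underlies metadata-driven import helpers.

```python
import io
import pandas as pd

# A tiny tsv with a string-coded categorical column and a mostly-missing column.
tsv = (
    "participant_id\tanswered_yes\trare_measure\n"
    "sub-001\t0\t\n"
    "sub-002\t1\t\n"
    "sub-003\t0\t4.2\n"
)

# Default import: the "0"/"1" codes are guessed to be integers.
guessed = pd.read_csv(io.StringIO(tsv), sep="\t")
print(guessed["answered_yes"].dtype)  # int64, not the intended string codes

# Manually specifying column types (as documented in the metadata) avoids this.
typed = pd.read_csv(io.StringIO(tsv), sep="\t",
                    dtype={"answered_yes": "string", "rare_measure": "Float64"})
print(typed["answered_yes"].tolist())  # ['0', '1', '0']
```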
Apache Parquet is a modern, compressed, columnar format optimized for large datasets and commonly used in the data science community. In contrast to plain-text files, Parquet supports selective column loading and results in smaller file sizes. This improves loading speed and memory usage, enhancing performance for analytical workflows. Crucially, parquet files can embed metadata (including column types, variable/value labels, and categorical coding), allowing for reliable and reproducible import of data without any manual steps.
Both csv/tsv and parquet formats are provided to support a range of tools and user preferences. However, since the parquet format ensures that data is imported with correctly specified data types and facilitates faster loading speeds and lower memory usage, we recommend using parquet files over plain-text files whenever possible.