Curation standards

Overview

ABCD’s curation standards are designed to ensure that the tabulated data resource is well-organized, consistent, and user-friendly. These standards encompass a variable naming convention, table- and variable-level standards, as well as general improvements to the accompanying metadata that enhance the usability and accessibility of the data. This page highlights the key elements of ABCD’s curation standards to facilitate effective use of the data and the data dictionary.

Why are curation standards important?

Well-developed and systematically maintained “curation standards” are essential for organizing published datasets and making them more accessible to the user community. Large multi-modal longitudinal studies, such as the ABCD Study®, often struggle to manage and organize data over time. The resulting accumulation of deficiencies in data management is known as “data curation debt” (Butters, Wilson, and Burton 2020).

To proactively address these issues, the ABCD consortium undertook a comprehensive recuration effort before the 6.0 data release and developed curation standards that were applied consistently across the entire tabulated data resource. This initiative aimed to enhance transparency and functionality for users, while also supporting ABCD’s open science model.

The variable naming convention is the cornerstone of ABCD’s data curation standards. Variable names constructed using this convention provide structured information about the assessment domain, data source, and the measure a variable belongs to. The standard incorporates a keyword system that links variables to their respective summary scores, indicates branching logic and versioning, and connects concepts across domains. For more details on the variable naming convention, see the Naming convention page.

The ABCD data dictionary implements standards for variable and data types, units, variable labels, and the coding of non-responses/missingness. It also includes administrative information and hyperlinks to relevant domain-, table-, and variable-specific documentation, as well as responsible data use and data quality warnings. Additionally, the data dictionary provides historic variable and table names to relate the new names to previously used ones. For more details on ABCD’s metadata, see the Metadata page.

Table-level standards

Identifier columns

All tables in the tabulated data resource utilize the following columns to uniquely identify participants and events, as well as to link data between tables:

Name | Description | Example
participant_id | Unique identifier for a participant | sub-ABCD1234
session_id | Unique identifier for a session/event | ses-00A

Tables with longitudinal data use both participant_id and session_id to uniquely identify an assessment, while tables with static data use only the participant_id column. The column(s) used for each variable or table are listed in the identifier_columns column of the data dictionary.
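
As a rough illustration of how the identifier columns link tables, the sketch below attaches static (one row per participant) data to longitudinal (one row per participant and session) data. The function name, the row-as-dictionary representation, and the columns static_var and score are hypothetical and only serve the example; they are not part of the ABCD data resource.

```python
def link_tables(longitudinal_rows, static_rows):
    """Attach static columns to each longitudinal row via participant_id."""
    # Static tables are keyed by participant_id only.
    static_by_participant = {row["participant_id"]: row for row in static_rows}
    linked = []
    for row in longitudinal_rows:
        static = static_by_participant.get(row["participant_id"], {})
        # Longitudinal values win if a column name appears in both tables.
        linked.append({**static, **row})
    return linked


# Hypothetical example rows (column names are made up for illustration).
static_rows = [{"participant_id": "sub-ABCD1234", "static_var": "a"}]
longitudinal_rows = [
    {"participant_id": "sub-ABCD1234", "session_id": "ses-00A", "score": 1.5},
    {"participant_id": "sub-ABCD1234", "session_id": "ses-01A", "score": 2.0},
]
linked = link_tables(longitudinal_rows, static_rows)
```

Each linked row remains uniquely identified by the pair (participant_id, session_id), while the static values are repeated across that participant's sessions.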

The identifier columns adhere to the BIDS (Brain Imaging Data Structure) naming convention to ensure consistency and compatibility across all released data types. In ABCD, the participant_id values consist of an 8-character random alphanumeric code prefixed by sub-, while the session_id values use a 3-character code prefixed by ses-, following the standard outlined below:

  • Core study events:
    • Two digits to indicate the year of assessment (e.g., 01 for the 1-year follow-up)
    • A letter to indicate the type of event (A for annual assessments; M for mid-years; S for screening)
  • Substudy events (see footnote 1):
    • A letter indicating the substudy (e.g., C for the COVID substudy)
    • Two digits to indicate the assessment wave

List of session/event identifiers

The table below shows the session identifiers and their corresponding labels for data included in the ABCD 6.0 data release.

Study | Session/event ID | Session/event label
Core | ses-00S | Screener
Core | ses-00A | Baseline
Core | ses-00M | 0.5 Year
Core | ses-01A | 1 Year
Core | ses-01M | 1.5 Year
Core | ses-02A | 2 Year
Core | ses-02M | 2.5 Year
Core | ses-03A | 3 Year
Core | ses-03M | 3.5 Year
Core | ses-04A | 4 Year
Core | ses-04M | 4.5 Year
Core | ses-05A | 5 Year
Core | ses-05M | 5.5 Year
Core | ses-06A | 6 Year
Substudy | ses-C01 | COVID Wave 1
Substudy | ses-C02 | COVID Wave 2
Substudy | ses-C03 | COVID Wave 3
Substudy | ses-C04 | COVID Wave 4
Substudy | ses-C05 | COVID Wave 5
Substudy | ses-C06 | COVID Wave 6
Substudy | ses-C07 | COVID Wave 7
Substudy | ses-S01 | SDev Wave 1
Substudy | ses-S02 | SDev Wave 2
Substudy | ses-S03 | SDev Wave 3
Substudy | ses-S04 | SDev Wave 4
Substudy | ses-S05 | SDev Wave 5
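
The identifier scheme described above can be checked and decoded with a regular expression. The following is a minimal sketch, not an official parser: the function name, the returned dictionary keys, and the event-type wording are assumptions made for illustration, and the participant_id pattern assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# participant_id: "sub-" followed by 8 alphanumeric characters (assumed ASCII).
PARTICIPANT_RE = re.compile(r"^sub-[A-Za-z0-9]{8}$")

# session_id: "ses-" followed by either
#   core event:     two digits (year) + one letter (A, M, or S)
#   substudy event: one letter (substudy) + two digits (wave)
SESSION_RE = re.compile(
    r"^ses-(?:(?P<year>\d{2})(?P<event>[AMS])|(?P<substudy>[A-Z])(?P<wave>\d{2}))$"
)

EVENT_TYPES = {"A": "annual", "M": "mid-year", "S": "screening"}


def parse_session_id(session_id):
    """Split a session_id into its components; raise ValueError if malformed."""
    match = SESSION_RE.match(session_id)
    if match is None:
        raise ValueError(f"not a valid session_id: {session_id!r}")
    if match.group("year") is not None:
        return {
            "study": "core",
            "year": int(match.group("year")),
            "event_type": EVENT_TYPES[match.group("event")],
        }
    return {
        "study": "substudy",
        "substudy": match.group("substudy"),
        "wave": int(match.group("wave")),
    }
```

For example, parse_session_id("ses-01M") yields a core event in year 1 of type "mid-year", and parse_session_id("ses-C03") yields substudy "C", wave 3. Note that the letter S denotes screening in core IDs (ses-00S) but a substudy in substudy IDs (ses-S01); the position of the letter disambiguates the two.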

Date timestamps and participant age at administration

All tables with assessment data include a timestamp indicating when data collection began, as well as the age of the youth participant (see footnote 2) at that timepoint (reported in years with day-level precision, calculated as the difference between the participant’s anonymized date of birth and the timestamp). These are standardized using the following naming conventions:

Variable suffix | Description
{table_name}_dtt | Indicates the timestamp when data collection for this table started.
{table_name}_age | Indicates the youth participant’s age (in years, with decimals) at the time data collection for this table started; it is calculated based on the youth’s anonymized date of birth and {table_name}_dtt.

The {table_name}_dtt and {table_name}_age variables are based on the actual date and time when the data was collected, when available. These variables allow for more precise temporal alignment and age-related analyses.

Issues with timestamps and ages

Please be aware of the following issues with the timestamps and ages reported for each table:

  1. Timestamps were not always collected for all measures and tables, especially in earlier events. In those cases, {table_name}_dtt has a missing value and {table_name}_age is computed using the timestamp indicating the start of the event, ab_g_dyn__visit_dtt.
  2. Timestamps, especially those collected in earlier events, are sometimes imprecise. We corrected or removed extreme outliers and are working on a more thorough QC process, but please use these values with caution.
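
The age computation and the fallback described in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the released algorithm: the source only says that age is the difference between the anonymized date of birth and the timestamp, reported in years with day-level precision, so the divisor of 365.25 days per year and the function name are assumptions.

```python
from datetime import date, datetime

DAYS_PER_YEAR = 365.25  # assumption: the exact conversion is not documented here


def age_in_years(anonymized_dob, table_dtt, visit_dtt):
    """Age at administration; falls back to the visit start if table_dtt is missing.

    anonymized_dob: date (e.g., the value of ab_g_stc__cohort_dob)
    table_dtt:      datetime or None (the {table_name}_dtt value)
    visit_dtt:      datetime (the value of ab_g_dyn__visit_dtt)
    """
    reference = table_dtt if table_dtt is not None else visit_dtt
    # Day-level precision: only the calendar date of the timestamp is used.
    return (reference.date() - anonymized_dob).days / DAYS_PER_YEAR


# Hypothetical values for illustration only.
dob = date(2010, 1, 1)
visit = datetime(2021, 3, 1, 9, 0, 0)
age_with_table_dtt = age_in_years(dob, datetime(2021, 3, 2, 14, 30, 0), visit)
age_with_fallback = age_in_years(dob, None, visit)
```

Because of the fallback, a table whose own timestamp is missing can still carry an age value; it then reflects the start of the event rather than the start of that specific assessment.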

Anonymized birthdays

To anonymize birthdays, we applied the following procedure:

  1. Determine whether a birthday falls in the first (1st–15th day) or second (16th–last day) half of the month.
  2. Randomly draw another day from the respective half of the month and use this day as the anonymized birthday.

We randomly assigned anonymized birthdays for each unique birthday rather than each participant, meaning that participants who share the same real birthday also share the same anonymized birthday. The anonymized birthday is provided in the variable ab_g_stc__cohort_dob.
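
The procedure can be sketched in a few lines. This is a minimal illustration, not the code used by ABCD: the function name and the seed parameter are assumptions, and since the source does not say whether the drawn day may coincide with the real day, this sketch allows it.

```python
import calendar
import random
from datetime import date


def anonymize_birthdays(birthdays, seed=None):
    """Map each unique birthday to a random day in the same half of its month."""
    rng = random.Random(seed)
    mapping = {}
    # Draw once per unique birthday, so shared birthdays stay shared.
    for birthday in sorted(set(birthdays)):
        last_day = calendar.monthrange(birthday.year, birthday.month)[1]
        if birthday.day <= 15:
            candidates = range(1, 16)              # first half: 1st-15th
        else:
            candidates = range(16, last_day + 1)   # second half: 16th-last day
        mapping[birthday] = birthday.replace(day=rng.choice(candidates))
    return mapping


# Hypothetical birthdays; two participants share 2008-02-20.
real = [date(2008, 2, 20), date(2008, 2, 20), date(2008, 2, 3)]
anonymized = anonymize_birthdays(real, seed=1)
```

Month and year are preserved, so the anonymized date differs from the real one by at most 14 days in the first half of a month and by at most 15 days in the second half of a 31-day month.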

Variable-level standards

Variable types

To differentiate between variables in the dataset that serve different purposes, each variable is assigned one of the following four types (indicated in the type_var column of the data dictionary):

  • administrative: Variables that provide supplementary information about the assessment (e.g., collection dates/timestamps, language of administration, or quality control (QC) information).
  • item: Variables that capture original data provided by the participant (e.g., responses to questions in a questionnaire or anthropometric measurements).
  • derived item: Variables computed from one or more original items, often representing recoded or reformatted information (e.g., a height value derived from separate original entries for feet and inches).
  • summary score: Variables that summarize a set of items or raw data based on specified algorithms (e.g., scores for validated psychometric scales, biospecimen results, or derived imaging scores).

Summary score standards

The code for all summary scores computed by the DAIRC (see footnote 3) is available in the ABCDscores R package on GitHub, along with accompanying online documentation. The R package aims to support transparency and reproducibility of ABCD release data by providing the exact algorithms and code used to compute the released summary scores, allowing users to tie a specific data release version to a corresponding version of the codebase (see also here for the rationale behind creating this R package).

For standardized or normed scales, the computation follows the published algorithms. For internal scores, unless otherwise specified by domain experts, the following standards are applied:

  • Allowed missingness is set to a maximum of 20%, meaning that at least 80% of the input items must have a value for a summary score to be computed.
  • Means are preferred over (prorated) sums whenever possible.
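
The two default rules above can be illustrated with a small sketch. It is written in Python for illustration only and is not taken from the ABCDscores R package; the function name and the use of None for missing items are assumptions, and non-response codes are assumed to have been converted to None beforehand.

```python
def mean_score(item_values, max_missing=0.20):
    """Mean of the available items, or None if too many items are missing."""
    n_items = len(item_values)
    if n_items == 0:
        return None
    present = [value for value in item_values if value is not None]
    n_missing = n_items - len(present)
    # At most 20% of the input items may be missing.
    if n_missing > max_missing * n_items:
        return None
    return sum(present) / len(present)
```

With five input items, one missing value is tolerated (20%), so mean_score([1, 2, 3, 4, None]) is computed from the four available items, whereas mean_score([1, 2, 3, None, None]) returns None. Using the mean keeps scores on the same scale regardless of how many items were answered, which is one reason to prefer it over a prorated sum.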

Data types

To provide clarity about the format of the underlying values and their use in exploratory and inferential analysis, each variable is assigned one of the following data types (indicated in the type_data column of the data dictionary):

  • character: Used exclusively for categorical variables, i.e., variables with defined levels. These variables store numeric codes as character strings (e.g., "0", "1", "777") that stand for categorical labels (e.g., "No", "Yes", "Decline to answer"); storing the codes as strings distinguishes them from variables with type_data: 'integer'. The value-to-label correspondence is defined in the levels table in the metadata (see below for more information on categorical coding standards).
  • double: Numeric values with decimals (e.g., 2.5, 17.325).
  • integer: Whole numbers without decimals (e.g., 0, 1, 2).
  • date: Calendar dates in YYYY-MM-DD format; often denoted by a variable name ending in _dt.
  • time: Time of day formatted as a character string "HH:MM:SS", representing a time without a date component; often denoted by a variable name ending in _t (see footnote 4).
  • timestamp: Combined date and time values (e.g., 2019-09-16 10:49:00); often denoted by a variable name ending in _dtt.
  • text: Used for arbitrary-length string values, such as administrative information like IDs, scanner details, medication names, or RxNorm codes.
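
When reading the tabulated data from a text-based format, the type_data value can guide how each raw value is converted. The sketch below is one possible approach in Python, not an ABCD-provided tool: the function name is hypothetical, and treating None or an empty string as missing is an assumption about the input.

```python
from datetime import date, datetime, time


def cast_value(raw, type_data):
    """Convert a raw string to a Python value according to its type_data."""
    if raw is None or raw == "":
        return None  # assumption: empty input means missing
    if type_data in ("character", "text"):
        return str(raw)  # categorical codes stay strings, e.g. "777"
    if type_data == "integer":
        return int(raw)
    if type_data == "double":
        return float(raw)
    if type_data == "date":
        return date.fromisoformat(raw)  # YYYY-MM-DD
    if type_data == "time":
        return time.fromisoformat(raw)  # HH:MM:SS
    if type_data == "timestamp":
        return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    raise ValueError(f"unknown type_data: {type_data!r}")
```

Keeping character values as strings matters: casting "777" to an integer would make a non-response code look like a quantity.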

Measurement levels

To help researchers further understand the data and determine which types of analyses are appropriate, each variable is assigned one of the following levels of measurement (indicated in the type_level column of the data dictionary):

  • nominal: For categorical variables (type_data: 'character' or 'text') that represent categories with no inherent order (e.g., race/ethnicity, type of visit, language).
  • ordinal: For categorical variables (type_data: 'character') that represent categories with a meaningful order (e.g., Likert scales, education levels, frequency ratings like “never” to “often”).
  • interval: For quantitative variables (type_data: 'date', 'timestamp', 'time', 'double', or 'integer') with meaningful intervals between values but no true zero point (e.g., temperature in Celsius, or dates).
  • ratio: For quantitative variables (type_data: 'double' or 'integer') with equal intervals and a true zero point, allowing both differences and ratios to be interpreted meaningfully (e.g., age, reaction time, income).
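
The pairing of measurement levels and data types listed above can be written down as a lookup table, for example to sanity-check rows of a data dictionary. This is a sketch based only on the combinations named in the list; the names ALLOWED_TYPE_DATA and is_consistent are hypothetical.

```python
# Data types named for each measurement level in the list above.
ALLOWED_TYPE_DATA = {
    "nominal": {"character", "text"},
    "ordinal": {"character"},
    "interval": {"date", "timestamp", "time", "double", "integer"},
    "ratio": {"double", "integer"},
}


def is_consistent(type_level, type_data):
    """True if the data type is one of those listed for the measurement level."""
    return type_data in ALLOWED_TYPE_DATA.get(type_level, set())
```

For instance, is_consistent("ordinal", "character") is True, while is_consistent("ratio", "date") is False because dates have no true zero point.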

Units

Wherever appropriate, units are provided for numeric fields (type_data: integer or double) in the unit column of the data dictionary. They are also included as part of the variable label. Units are reported using both the full term and the standard abbreviation (e.g., degrees Celsius (°C), grams (g), milliseconds (ms)).

Labels

Variable labels are standardized and made unique throughout the data dictionary, enabling researchers to understand the information provided by each variable without relying on descriptive fields or other variables within the table for additional context (see footnote 5).

Label standards

  • Administrative variables that exist in more than one table are prepended with the full table name, including source/respondent, of the measure to which they belong.
  • Administrative variables that can be collected more than once, such as toxicology or MRI screeners, additionally include the visit day number and run number (when applicable) within round brackets () after the table name in the label.
  • Variables where the youth and parent/caregiver are asked the exact same question include the respondent information in square brackets [] at the end of the label.
  • Variables that are duplicated across forms, such as items from pilot versions of forms, pre/post surveys, or substudy measures that are also used in the core protocol, include those details in square brackets [] at the end of the label.
  • Cross-listed variables, i.e., variables that are duplicated from their original table to another table, include the tag Cross listed: followed by the original variable name in square brackets [] at the end of the label.
  • Multi-select variable labels include the root question at the beginning of the label, followed by [Multi-select]: and the specific multi-select response option.
  • Non-response variables, which indicate that the participant did not want to or could not respond to a question, have the same label as the corresponding question, followed by [Non-response] at the end of the label.
  • Variables that have changed over time, creating longitudinal or multiple versions, include Longitudinal or Version # in square brackets [] at the end of the label.

Examples

  • We put a lot of energy into what we do at home. [Parent]
  • Breathalyzer (Day 4; Test 2): Result
  • What is the biological father's current height?: [Non-response]
  • Detentions or suspensions: For what? [Multi-select]: Talking Back to a Teacher

Other improvements

  • When the original label of an item presented during data collection is not sufficiently descriptive—such as when a core part of the question is included in a header or descriptive field preceding the item—that information is incorporated into the variable label in the format Context of question: original item label/question to ensure that items can be understood individually.

Categorical coding standards

Non-response/missingness codes

Categorical variables often include levels that indicate non-response or missing data (e.g., “Don’t know”, “Decline to answer”, etc.). To ensure consistency and clarity, ABCD has established standardized codes for handling missing/non-response options across all categorical variables in the dataset. This standardization ensures that non-responses can be reliably excluded and/or assessed for their relevance to a given analysis.

The full set of standardized non-response/missingness codes is as follows:

Value | Label
222 | Don’t understand / I don’t understand this question
444 | Not Applicable
555 | Not administered
666 | Quantity not sufficient
777 | Decline to answer
888 | Not asked due to branching logic
999 | Don’t know / I don’t know
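
With standardized codes, non-responses can be excluded or tallied in one place. The sketch below assumes that categorical values are stored as character strings, as described under data types; the function names are hypothetical. It should only be applied to categorical variables, since a value such as 222 can be a legitimate quantity in an integer or double variable.

```python
from collections import Counter

# Standardized non-response/missingness codes from the table above.
NON_RESPONSE_CODES = {
    "222": "Don’t understand / I don’t understand this question",
    "444": "Not Applicable",
    "555": "Not administered",
    "666": "Quantity not sufficient",
    "777": "Decline to answer",
    "888": "Not asked due to branching logic",
    "999": "Don’t know / I don’t know",
}


def drop_non_response(values):
    """Replace non-response codes with None, keeping substantive codes."""
    return [None if value in NON_RESPONSE_CODES else value for value in values]


def non_response_counts(values):
    """Count how often each non-response code occurs."""
    return Counter(value for value in values if value in NON_RESPONSE_CODES)


# Hypothetical responses to a categorical item.
responses = ["0", "1", "777", "1", "999", "777"]
cleaned = drop_non_response(responses)
counts = non_response_counts(responses)
```

Tallying the codes before dropping them makes it possible to judge whether, for example, a high rate of "Decline to answer" is itself relevant to the analysis.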

Binary standardized codes

Coding of categorical variables with binary response options (e.g., “Yes”/“No” or “True”/“False”) follows a standardized pattern: The negative response option is coded as 0, while the positive response is coded as 1. This standardization allows researchers to interpret and analyze these responses more consistently across the dataset.

The standardized coding for binary responses is as follows:

Label | Value
No / None / Never | 0
FALSE / False | 0
Yes | 1
TRUE / True | 1

Variables re-coded for the 6.0 data release

To comply with the coding standards described above, some previously released items were recoded for the 6.0 data release. For a list of affected variables, please see the release note on re-coding.

References

Butters, Oliver W., Rebecca C. Wilson, and Paul R. Burton. 2020. “Recognizing, Reporting and Reducing the Data Curation Debt of Cohort Studies.” International Journal of Epidemiology 49 (4): 1067–74. doi:10.1093/ije/dyaa087.

Footnotes

  1. Some substudies are conducted as part of visits for core events; they use the same session_id as the core event.

  2. The youth’s age is also computed for tables with data provided by the parent/caregiver or other sources, as all ABCD instruments inquire about or are interpreted in relation to the youth participant.

  3. Other summary scores, such as proprietary scores or summary scores imported from external sources, are not included in the ABCDscores R package.

  4. For help converting times to HMS format, we provide a function in the NBDCtools R package.

  5. Spanish translations of the labels for the parent/caregiver forms are provided in their own data dictionary column, label_es, to indicate the version received by any Spanish-speaking parent or caregiver. The Spanish labels are not standardized in the way the English labels are.

 

ABCD Study®, Teen Brains. Today’s Science. Brighter Future.® and the ABCD Study Logo are registered marks of the U.S. Department of Health & Human Services (HHS). Adolescent Brain Cognitive Development℠ Study is a service mark of the U.S. Department of Health & Human Services (HHS).