
The ABCD 7.0 data has been released, and the Data Documentation has been updated with the 7.0 data release notes.



Genetics

Domain overview


This document describes the zygosity, genetic, and genetically derived data available for the ABCD sample in the 7.0 data release. It covers both the tabulated data in the release tables listed below and bulk genetic data, which include the following:

Illumina Short Read Whole Genome Sequencing (WGS) data:

  1. Population-level Variant Call Files (VCF) covering 8,710 individuals at ~169 million genetic variants.

  2. Gene burden matrices covering 21k genes across 5 variant effect/impact categories.

  3. Sample level Picard variant calling summary metrics for single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs).

Single Nucleotide Polymorphism Micro-array data:

  1. Curated genotyping data from the Smokescreen array: a set of PLINK files containing 11,670 unique subjects at ~515k variants.

  2. Imputed data based on the TOPMED r3 reference panel: a set of VCF files containing all imputed genotype data for 11,670 unique subjects at ~260 million variants.

  3. Genetic relatedness and \(\hat{\pi}\) estimates across the full sample, using methods that correct for ancestry background.

  4. Genetic principal component weights to enable projection of other samples onto the ABCD genetic PC space.

For comprehensive details of quality control steps performed and a description of the genetic data within the ABCD sample please refer to and cite the following work:

Key reference: Fan, C. C., Loughnan, R., Wilson, S., Hewitt, J. K., ABCD Genetic Working Group, Agrawal, A., Dowling, G., Garavan, H., LeBlanc, K., Neale, M., Friedman, N., Madden, P., Little, R., Brown, S. A., Jernigan, T., & Thompson, W. K. (2023). Behavior Genetics, 53(3), 159–168. https://doi.org/10.1007/s10519-023-10143-0
Fan et al. (2023)
Caution - Release notes: gn

7.0

  • New Data: Inclusion of Illumina short read whole genome sequencing data on 8,710 individuals
  • Genetic Relatedness Matrix: now provided as a GCTA-style GRM fileset (.grm.bin, .grm.N.bin, .grm.id)

6.1

  • Genetic Relatedness Matrix

6.0

  • File naming & inclusion
  • Variant call files
Important - Responsible use consideration: Genetics

Balanced Research Practices

Researchers must adhere to ethical guidelines and ensure that genetic data are analyzed and interpreted responsibly. Research should aim to advance scientific understanding and avoid misinterpretation or misuse of genetic findings. Evidence of users engaging in stigmatizing research will result in termination of data access.

Consideration of Population Descriptors

The use of population descriptors in genetic research can often be varied and inconsistent. We encourage users to review the NASEM report for consideration of appropriate population descriptors for their analysis. Self-reported race and ethnicity may reflect social and environmental experiences that do not directly correspond to genetic variation. Genetic principal components are provided to more accurately represent genetic variation without reliance on self-reported categories. Principal components provide unlabelled variables capturing genetic variation, whereas ancestry labels capture similar variation but typically are labeled in terms of continental ancestry groups (European, African, etc.).

To ensure genetic analyses are conducted with scientific rigor and precision, this data release provides genetic principal components rather than ancestry labels, following best practices in genomic research.

Key reference: Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. (2023). National Academies Press. https://doi.org/10.17226/26902
Warning - Data consideration: Genetics principal components

Outliers in the genetic principal components appear in the gn_y_popstruct table; they should be noted and analyzed as such.

Key reference: Fan, C. C., Loughnan, R., Wilson, S., Hewitt, J. K., ABCD Genetic Working Group, Agrawal, A., Dowling, G., Garavan, H., LeBlanc, K., Neale, M., Friedman, N., Madden, P., Little, R., Brown, S. A., Jernigan, T., & Thompson, W. K. (2023). Behavior Genetics, 53(3), 159–168. https://doi.org/10.1007/s10519-023-10143-0
Fan et al. (2023)

Youth tables (tabulated data)

Genetic relatedness

gn_y_genrel
tabulated

Measure description: Data in this instrument describe genetic relatedness between participants. gn_y_genrel_id__fam contains a unique number for each set of individuals who appear genetically related (genetic relatedness > 0.35). We also include gn_y_genrel_id__birth which, for individuals in the same (genetic) family, indicates those whose birthdays are within three months of each other (i.e., twins or triplets). These fields are cross-listed in the “ABCD (general)” domain (ab_g_stc__design_id__fam__gen and ab_g_stc__design_id__birth__gen).

gn_y_genrel_id__fam can be used as a random effect in mixed effects models to account for relatedness in the ABCD sample. Depending on the analysis, users may select either one or a combination of these variables to account for familial/shared environmental effects within the sample.

Twin Analysis Variables: To identify siblings or twins/triplets in the sample, researchers can use the gn_y_genrel_id__paired__{N}, gn_y_genrel_zyg__{N}, and gn_y_genrel_pihat__{N} columns. These columns are derived from pairs of related individuals in which both individuals have been genotyped. They are aligned such that, for a given subject, gn_y_genrel_id__paired__01 identifies the other subject in the sample to whom they are related, and gn_y_genrel_zyg__01 indicates whether the relationship is monozygotic (1), dizygotic (2), or singleton sibling (3). Monozygotic relationships have \(\hat{\pi} > 0.8\); dizygotic relationships have \(0.35 < \hat{\pi} < 0.8\) and matching birth dates between pairs; sibling relationships have \(0.35 < \hat{\pi} < 0.8\) and birth dates more than 3 months apart. Finally, gn_y_genrel_pihat__01 gives the genetic relatedness of this relationship (as captured by \(\hat{\pi}\)). These columns are defined only for pairs with \(\hat{\pi} > 0.35\). Genetic relatedness across all pairs in the sample (thresholded) is available as part of the bulk genetic data (see the GENESIS section below, which describes how genetic relatedness was calculated).
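The thresholds described above can be illustrated with a minimal Python sketch. The function name classify_pair and the boolean births_match argument are hypothetical simplifications introduced here: the text distinguishes "matching birth dates" from "birth dates more than 3 months apart", and this sketch collapses that into a single flag. Pairs with \(\hat{\pi}\) exactly equal to 0.8, or at or below 0.35, are left undefined, as in the description.

```python
from typing import Optional

# Codes used by the gn_y_genrel_zyg__{N} columns, as described in the text.
MZ, DZ, SIBLING = 1, 2, 3


def classify_pair(pihat: float, births_match: bool) -> Optional[int]:
    """Return the zygosity code for a related pair, or None if undefined.

    births_match is a hypothetical flag: True when the two birth dates match
    (twins/triplets), False when they are more than 3 months apart.
    """
    if pihat > 0.8:
        return MZ
    if 0.35 < pihat < 0.8:
        return DZ if births_match else SIBLING
    return None  # below the 0.35 threshold (or exactly 0.8): not defined


if __name__ == "__main__":
    for pihat, match in [(0.99, True), (0.5, True), (0.5, False), (0.2, True)]:
        print(pihat, match, classify_pair(pihat, match))
```

This is only a restatement of the published thresholds; the released gn_y_genrel_zyg__{N} values should be treated as authoritative.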

Modifications since initial administration: From data release 5.0 onwards, genetic relatedness has been computed with PC-Relate (see the GENESIS section below for details). Previous data releases used PLINK --genome for this calculation, which is less suitable for the population structure of the ABCD study.

Caution - Release notes: gn_y_genrel

6.0

  • Missing genotypes
  • Family relatedness

Key references:

  • Gogarten, S. M., Sofer, T., Chen, H., Yu, C., Brody, J. A., Thornton, T. A., Rice, K. M., & Conomos, M. P. (2019). Bioinformatics (Oxford, England), 35(24), 5346–5348. https://doi.org/10.1093/bioinformatics/btz567
    Gogarten et al. (2019)
  • Conomos, M. P., Miller, M. B., & Thornton, T. A. (2015). Genetic Epidemiology, 39(4), 276–293. https://doi.org/10.1002/gepi.21896
    Conomos et al. (2015)
  • Conomos, M. P., Reiner, A. P., Weir, B. S., & Thornton, T. A. (2016). The American Journal of Human Genetics, 98(1), 127–148. https://doi.org/10.1016/j.ajhg.2015.11.022
    Conomos et al. (2016)

Population structure

gn_y_popstruct
tabulated

Measure description: Genetic principal components (PCs), computed using GENESIS (Gogarten et al., Bioinformatics, 2019), are provided in the fields gn_y_popstruct_pc__{N}. For details of this procedure, including where to find PC weights that enable projection of other samples onto this PC space, please see the section GENESIS Derived Genetic Principal Component Weights and Relatedness Estimates below.

Modifications since initial administration: Genetic ancestry factors were released with the ABCD 4.0 data release and earlier releases. See the “Non-tabulated Genetic Data” section of this document for resources for computing these measures if necessary. From the 5.0 data release forward, population structure is captured by the genetic principal components described below in the “GENESIS…” section.

Twin zygosity rating

gn_e_zygrat
tabulated

Measure description: The Twin Zygosity Rating study used in-person ratings by two research assistants (RAs) on degree of similarity for a range of physical characteristics (such as hair color, hair texture, and eye color) in twin siblings to estimate zygosity. Photographs taken of each twin were available to help resolve discrepancies in the ratings taken by the RAs. Please note, as described above, genetic-based zygosity is provided as part of the gn_y_genrel table.

Key reference:

  • Nichols, R. C., & Bilbro, W. C., Jr. (1966). Human Heredity, 16(3), 265–275. https://doi.org/10.1159/000151973
    Nichols & Bilbro (1966)

File-based data

Whole Genome Sequencing (WGS)

Genome Build: GRCh38

Number of variants: ~169 million variants

Number of individuals: 8,710

Read depth: 30x

Sequencing was performed on Illumina NovaSeq instruments at 30x coverage, with reads aligned to the GRCh38 reference genome. Blood and saliva samples were collected across participants roughly equally, with blood preferred for sequencing where available. DNA extraction and biospecimen storage were carried out by SAMPLED. Sentieon (versions 202112.06, 202112.01 and 202010.02) was used to map reads and perform sample-level variant calling. Sample-level variant calling metrics were calculated using Picard and GATK (Genome Analysis Toolkit).

Sample-level quality control was anchored by cross-validating WGS data against existing microarray genotype data on the same individuals, with discordant samples flagged for resequencing. A total of 95 samples were removed on this basis, yielding the final set of 8,877 samples covering 8,710 individuals (including 168 technical replicates). All sample IDs within files correspond to participant IDs of the ABCD study. For technical replicates, a suffix marks whether the replicate is a different sample type (e.g. _saliva or _wholeblood) or the same sample type (e.g. _wholeblood_replicate).
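As a rough illustration of the replicate suffix convention described above, the following Python sketch separates a sample ID into a participant ID and a replicate tag. The function name split_sample_id and the example ID sub-0001 are hypothetical, and the _saliva_replicate suffix is an assumption by analogy with _wholeblood_replicate; the text lists its suffixes only as examples, so the actual files may use others.

```python
from typing import Optional, Tuple

# Longest suffixes first so "_wholeblood_replicate" is not mistaken for "_wholeblood".
# "_saliva_replicate" is an assumed suffix, not confirmed by the documentation.
KNOWN_SUFFIXES = (
    "_wholeblood_replicate",
    "_saliva_replicate",
    "_wholeblood",
    "_saliva",
)


def split_sample_id(sample_id: str) -> Tuple[str, Optional[str]]:
    """Split a WGS sample ID into (participant_id, replicate_tag).

    replicate_tag is None when the ID carries no known replicate suffix.
    """
    for suffix in KNOWN_SUFFIXES:
        if sample_id.endswith(suffix):
            return sample_id[: -len(suffix)], suffix[1:]
    return sample_id, None


if __name__ == "__main__":
    # "sub-0001" is a made-up participant ID used only for illustration.
    for sid in ["sub-0001", "sub-0001_saliva", "sub-0001_wholeblood_replicate"]:
        print(sid, "->", split_sample_id(sid))
```

Matching on explicit suffixes rather than splitting on underscores avoids mis-parsing participant IDs that themselves contain underscores.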

Full methodological details - including library preparation, alignment parameters, joint genotyping strategy, and QC thresholds - will be provided in an upcoming data resource paper. Data related to whole genome sequencing can be found under the following directory /concatenated/genetics/sequencing/ and all file paths in this section are relative to that root path.

SNP and INDEL Population Level VCFs

Population-level VCF files were generated by joint genotyping across all samples using GATK (v4.4.0). Files are released per chromosome and chunked into 977 separate files, named as follows:

./snv_indel/population_vcf/abcd_cohort_chr1_block0.vcf.gz

./snv_indel/population_vcf/abcd_cohort_chr1_block0.vcf.gz.tbi

./snv_indel/population_vcf/abcd_cohort_chr1_block1.vcf.gz

./snv_indel/population_vcf/abcd_cohort_chr1_block1.vcf.gz.tbi

./snv_indel/population_vcf/abcd_cohort_chr1_block2.vcf.gz

./snv_indel/population_vcf/abcd_cohort_chr1_block2.vcf.gz.tbi

…

Because of poor identifiability and computational tractability, low-complexity regions were also removed (these regions correspond to major repetitive elements such as centromeres and large segmental duplications). The low-complexity region file and the chunking scheme file are described in the Support Files section below.

For variant-level QC we used excess heterozygosity filtering (>54.69) and VQSR with GATK (v4.6.2) to set the FILTER field of each VCF record to one of the following values:

| Filter | Interpretation |
| --- | --- |
| PASS | High-confidence variants |
| VQSRTrancheSNP99.80to99.90 | Borderline quality SNPs |
| VQSRTrancheSNP99.90to99.95 | Lower confidence SNPs |
| VQSRTrancheSNP99.95to100.00 | Near-certain false positive SNPs |
| VQSRTrancheINDEL99.80to99.90 | Borderline quality INDELs |
| VQSRTrancheINDEL99.90to99.95 | Lower confidence INDELs |
| VQSRTrancheINDEL99.95to100.00 | Near-certain false positive INDELs |
| ExcessHet | Filtered pre-VQSR (likely non-diploid sites) |

For the majority of users we recommend using high-confidence sites, which can be filtered with bcftools as follows:

bcftools view -f PASS input.vcf.gz -o output.vcf.gz -O z
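Before filtering, it can be useful to see how many records in a block fall into each FILTER category. The following Python sketch tallies FILTER values using only the standard library; it is an illustration rather than part of the release pipeline, and the function names tally_filters and tally_filters_in_file are hypothetical. It relies only on the VCF convention that FILTER is the seventh tab-separated column.

```python
import gzip
from collections import Counter
from typing import Iterable


def tally_filters(lines: Iterable[str]) -> Counter:
    """Count FILTER values over VCF data lines, skipping header lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t", 7)
        counts[fields[6]] += 1  # FILTER is the 7th VCF column
    return counts


def tally_filters_in_file(path: str) -> Counter:
    """Stream a (b)gzip-compressed VCF and tally its FILTER values."""
    with gzip.open(path, "rt") as handle:
        return tally_filters(handle)


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tG\t50\tPASS\tAC=1\n",
        "chr1\t200\t.\tC\tT\t50\tExcessHet\tAC=1\n",
    ]
    print(tally_filters(demo))
```

For production filtering of the full population VCFs, bcftools remains the more efficient choice; this sketch is intended for quick inspection of a single block.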

Each genomic block VCF is normalised by splitting multiallelic sites with bcftools norm, then annotated using Ensembl VEP (version 112.0, GRCh38, offline cache) to append functional consequence predictions, gene symbols, regulatory context, canonical transcript flags, and existing variant information to each variant record. The resulting annotated VCFs are indexed and retained in compressed format for downstream filtering.

Gene Burden Matrix

Following VEP annotation, functional variants are extracted by filtering on a configurable set of VEP consequences, restricted to PASS-filtered sites. Variants are categorised into five masks of increasing breadth based on their predicted consequence and VEP impact rating (see table below), then grouped per gene to construct a burden feature matrix per genomic block. Each cell records the aggregate count of qualifying rare variants (MAF 0–1%) carried by a given sample for a given gene–mask combination. Block-level matrices are concatenated into a single genome-wide sparse matrix (.mtx format) with one feature per gene–mask pair (named GENESYMBOL.MaskName) and one row per sample, ready for downstream association or machine-learning analyses.

| Mask | Included VEP Consequences | Description |
| --- | --- | --- |
| LoF | stop_gained, stop_lost, start_lost, splice_donor_variant, splice_acceptor_variant, frameshift_variant, transcript_ablation | Loss-of-function variants predicted to disrupt or eliminate the protein product |
| Missense | missense_variant, inframe_deletion, inframe_insertion, protein_altering_variant | Non-synonymous variants that alter the amino acid sequence while preserving the reading frame |
| HIGH | Any variant with VEP IMPACT = HIGH | VEP’s own high-impact tier (largely overlaps LoF but assigned directly by VEP) |
| MODERATE | Any variant with VEP IMPACT = MODERATE | VEP’s moderate-impact tier (largely overlaps Missense) |
| Functional | Union of all above | Catch-all mask containing every variant assigned to at least one other mask |

The genome-wide output consists of the following three files (which can be found at ./snv_indel/gene_burden_mat):

| File | Contents |
| --- | --- |
| genome_wide_gene_burden_matrix.mtx | Sparse Matrix Market format; rows = samples, cols = gene-masks |
| genome_wide_gene_burden_matrix.samples.txt | Ordered sample IDs (one per line) |
| genome_wide_gene_burden_matrix.genemasks.txt | Ordered gene-mask labels, e.g. BRCA2.LoF (one per line) |

SciPy in Python can be used to read these files with the following code:

import scipy.io
import pandas as pd

# ── paths ────────────────────────────────────────────────────────────────────
# Prefix shared by the three output files; adjust to where the files were downloaded.
base = "results/gene_burden/genome_wide_gene_burden_matrix"

# ── load labels ──────────────────────────────────────────────────────────────
samples = pd.read_csv(f"{base}.samples.txt", header=None, names=["sample_id"])["sample_id"].tolist()
genemasks = pd.read_csv(f"{base}.genemasks.txt", header=None, names=["gene_mask"])["gene_mask"].tolist()

# ── load sparse matrix ───────────────────────────────────────────────────────
sparse_mat = scipy.io.mmread(f"{base}.mtx").tocsr()  # shape: (n_samples, n_gene_masks)
burden_df = pd.DataFrame.sparse.from_spmatrix(sparse_mat, index=samples, columns=genemasks)
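Once the labels are loaded, a common next step is to restrict the matrix to a single mask. The sketch below selects the column indices for one mask from the GENESYMBOL.MaskName labels. The function name columns_for_mask is hypothetical, and splitting on the last "." is an assumption made here so that gene symbols that themselves contain a "." are handled; the label AC0001.1.LoF in the example is made up for illustration.

```python
from typing import List, Tuple


def columns_for_mask(genemasks: List[str], mask: str) -> Tuple[List[int], List[str]]:
    """Return (column_indices, gene_symbols) for all labels ending in '.<mask>'.

    Labels are split on the LAST '.', so a gene symbol containing a '.'
    keeps its full name.
    """
    indices: List[int] = []
    genes: List[str] = []
    for i, label in enumerate(genemasks):
        gene, _, label_mask = label.rpartition(".")
        if label_mask == mask:
            indices.append(i)
            genes.append(gene)
    return indices, genes


if __name__ == "__main__":
    labels = ["BRCA2.LoF", "BRCA2.Missense", "AC0001.1.LoF"]
    idx, genes = columns_for_mask(labels, "LoF")
    print(idx, genes)
```

With the variables from the loading code above, the returned indices can be used to slice the sparse matrix directly, for example sparse_mat[:, idx], which avoids building a dense data frame for all gene–mask pairs.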

Metadata and Metrics

Although the same GATK and VEP versions were used for this release, different Sentieon versions were used across samples. Additionally, approximately 40% of samples were saliva while the rest were whole blood. We recommend including both Sentieon version and sample type as covariates in downstream analyses of the WGS data. This information can be found at ./snv_indel/metadata/sample_info.tsv.

Variant callsets were evaluated using two Picard tools: 1) CollectVariantCallingMetrics was run on the VQSR-filtered VCF(s) against a dbSNP reference VCF, restricted to a target interval list. It produces per-callset detail and summary metrics files (.variant_calling_detail_metrics, .variant_calling_summary_metrics) capturing statistics such as SNP/indel counts, novelty rates, Ti/Tv ratios, and dbSNP concordance. 2) AccumulateVariantCallingMetrics was used to aggregate the per-shard detail and summary metrics files from the sharded collection step into a single combined output, enabling cohort-level reporting across scattered VCF chunks. The output can be found at ./snv_indel/metadata/picard_metrics.tsv.

Support Files

./snv_indel/supportfiles contains the following files:

  • GRCh38.no_alt_analysis_set.fa / .fa.fai: The GRCh38 human reference genome (no-alt analysis set) used for alignment and variant calling, with accompanying samtools FASTA index.

  • low_complexity_regions.tsv: A TSV file listing genomic regions excluded from haplotype calling and VCF output due to low sequence complexity. Regions were identified by scanning the GRCh38 reference genome using a sliding-window k-mer uniqueness approach (10 kb windows, 25-mer complexity score < 0.6), retaining only contiguous low-complexity blocks ≥ 1 Mb. These regions correspond to major repetitive elements such as centromeres and large segmental duplications.

  • pvcf_blocks.txt: Defines the chromosome-level genomic block partitioning used to split the callset into population VCF (pVCF) shards. The blocking scheme follows the UK Biobank convention, enabling compatibility with downstream tools and pipelines designed around that standard.

Micro Array Based Data

To extract single genetic variants, users can use tools such as bed-reader for Python and snpStats for R to parse the PLINK bed files described below. Methods such as PRS-CSx have been developed to generate polygenic scores in samples with the high degree of genetic diversity found in ABCD. PRS-CS shows comparable performance to PRS-CSx in ABCD data, whilst requiring fewer analysis steps (e.g. cross validation) (Ahern et al., Behavior Genetics, 2023). The performance of polygenic scores varies as genomic distance from the training sample increases (Ding et al., 2023). Because of this known issue, stratified analysis of individuals who share similar continental ancestry is currently considered best practice, although this is an evolving field with new approaches constantly being developed. Tools such as ADMIXTURE can generate genetic factors to enable this type of analysis.

Key references:

  • Ahern, J., Thompson, W., Fan, C. C., & Loughnan, R. (2023). Behavior Genetics, 53(3), 292–309. https://doi.org/10.1007/s10519-023-10139-w
    Ahern et al. (2023)
  • Ding, Y., Hou, K., Xu, Z., Pimplaskar, A., Petter, E., Boulier, K., Privé, F., Vilhjálmsson, B. J., Olde Loohuis, L. M., & Pasaniuc, B. (2023). Nature, 618(7966), 774–781. https://doi.org/10.1038/s41586-023-06079-4
    Ding et al. (2023)

The PLINK files described here do not contain family relatedness or sex information. For family relatedness, please refer to the “GENESIS Derived Genetic Principal Component Weights and Relatedness Estimates” section below or “Genetic Relatedness” above. For sex, refer to ab_g_stc__cohort_sex in ab_g_stc.

Non-tabulated genetic data are split into three directories under ../dairc/concatenated/genetics/genotype_microarray/.. as follows:

  • /smokescreen/ smokescreen genotype array data (non-imputed)
  • /imputed/ TOPMED imputed array data - derived from smokescreen
  • /genesis/ GENESIS derived variables.

The contents of each of these directories are described in the sections below.

Smokescreen binarized PLINK files and batch info

Files: ../smokescreen/..

  • merged_chroms.bed
  • merged_chroms.bim
  • merged_chroms.fam
  • batch.info
  • removed_individuals.txt

Measure description: After dish quality control and profile checks, genotypes were called using Axiom Analysis Suite (apt version 2.11) on raw intensities from the Affymetrix Smokescreen array. Following the best practices analysis workflow from Thermo Fisher, only probe set classifications that passed the final SNP quality controls were recommended, resulting in ~515K recommended probe sets in each genotyping batch. Blood and saliva DNA samples were genotyped separately. We include one genotype result for each subject, using whichever sample has the best QC metrics (call rates and missingness). There were nine genotyping batches in Data Release 6.0, spanning 147 plates (see batch.info in the downloaded files). After obtaining the genotype batches, we mapped the probe sets to SNPs using annotations derived from genome build hg19. After the mapping, we merged all nine batches into one study cohort and then performed additional study-level QC, requiring missingness of less than 10% at the SNP level and less than 20% at the sample level. 515,279 variants and 11,670 people passed filters and QC. The subsequent imputation and relatedness inferences were based on the final curated genotype data. The batch information can be found in batch.info. removed_individuals.txt lists individuals who were removed between the 6.0 and 7.0 data releases due to either withdrawn consent or indications of sample mix-up (e.g. genetic relatedness that does not match known family members).
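The study-level missingness thresholds described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline that produced the release: the function names are hypothetical, missing genotypes are represented as None, and the order of operations (SNP filter first, then sample filter on the retained SNPs) is an assumption, since the text does not specify it.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0/1/2 allele count, or None when missing


def missing_rate(values: Sequence[Genotype]) -> float:
    """Fraction of missing calls; an empty sequence counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def qc_filter(
    genotypes: List[List[Genotype]],
    max_snp_missing: float = 0.10,
    max_sample_missing: float = 0.20,
) -> Tuple[List[int], List[int]]:
    """Return (kept_sample_indices, kept_snp_indices).

    genotypes has one row per sample and one column per SNP.
    Assumed order: filter SNPs first, then samples on the retained SNPs.
    """
    n_snps = len(genotypes[0]) if genotypes else 0
    kept_snps = [
        j for j in range(n_snps)
        if missing_rate([row[j] for row in genotypes]) < max_snp_missing
    ]
    kept_samples = [
        i for i, row in enumerate(genotypes)
        if missing_rate([row[j] for j in kept_snps]) < max_sample_missing
    ]
    return kept_samples, kept_snps


if __name__ == "__main__":
    toy = [[0] * 5 for _ in range(11)]
    toy[0][0] = toy[0][1] = None          # one sample with many missing calls
    for i in range(1, 6):
        toy[i][4] = None                  # one SNP with many missing calls
    print(qc_filter(toy))
```

In practice this kind of filtering is usually done with PLINK on the binary fileset; the sketch only makes the two thresholds concrete.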

Genome Build: hg19

Number of variants: ~515k

Number of individuals: 11,670

Modifications since initial release: Includes genotyping of missing individuals from previous data releases due to sample mix up or failing quality control measures.

Key reference: Baurley, J. W., Edlund, C. K., Pardamean, C. I., Conti, D. V., & Bergen, A. W. (2016). BMC Genomics, 17(1), 145. https://doi.org/10.1186/s12864-016-2495-7
Baurley et al. (2016)

Imputed VCF files using TOPMED imputation panel

Files: ../imputed/..

  • chr{c}_dose.vcf.gz
  • chr{c}_dose.vcf.gz.tbi
  • …
  • qcreport.html

Measure description: The curated genotype data were imputed using the bioinformatic pipelines and recommendations of the TOPMED Server, with the TOPMED r3 reference panel. We submitted unphased genotypes, which were phased with Eagle, with the population set to “all”. TOPMED includes rsID numbers automatically in the output files.

The TOPMED imputation scores and post-imputation quality report can be found at qcreport.html in this folder. In addition to estimated allele dosages, an R2 field in vcf files contains an estimated imputation accuracy which can be used to filter high quality imputed variants.
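As a minimal sketch of how the R2 field might be used, the following Python functions read R2 from the INFO column of a VCF data line and apply a threshold. The function names and the default cut-off of 0.8 are hypothetical choices made here for illustration; the appropriate threshold depends on the analysis. The code relies only on the VCF convention that INFO is the eighth tab-separated column, with entries separated by semicolons.

```python
from typing import Optional


def info_r2(info: str) -> Optional[float]:
    """Extract the R2 value from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "R2" and sep:
            return float(value)
    return None


def passes_r2(vcf_line: str, min_r2: float = 0.8) -> bool:
    """True if the record's R2 is present and at least min_r2.

    min_r2=0.8 is an illustrative default, not a recommendation from the release.
    """
    info = vcf_line.rstrip("\n").split("\t", 8)[7]  # INFO is the 8th column
    r2 = info_r2(info)
    return r2 is not None and r2 >= min_r2


if __name__ == "__main__":
    line = "chr1\t100\trs1\tA\tG\t.\tPASS\tAF=0.1;R2=0.93\tGT:DS\t0|0:0.01\n"
    print(info_r2("AF=0.1;R2=0.93"), passes_r2(line))
```

For whole-chromosome files, the same filter is more efficiently expressed as a bcftools include expression on the R2 INFO tag; the Python version is intended to make the field's location explicit.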

ABCD Classification: Genetic

Genome Build: GRCh38

Number of variants: ~260 million

Number of individuals: 11,670

Key references:

  • Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A. E., Kwong, A., Vrieze, S. I., Chew, E. Y., Levy, S., McGue, M., Schlessinger, D., Stambolian, D., Loh, P., Iacono, W. G., Swaroop, A., Scott, L. J., Cucca, F., Kronenberg, F., Boehnke, M., Abecasis, G. R., & Fuchsberger, C. (2016). Nature Genetics, 48(10), 1284–1287. https://doi.org/10.1038/ng.3656
    Das et al. (2016)
  • Loh, P., Danecek, P., Palamara, P. F., Fuchsberger, C., A Reshef, Y., K Finucane, H., Schoenherr, S., Forer, L., McCarthy, S., Abecasis, G. R., Durbin, R., & L Price, A. (2016). Nature Genetics, 48(11), 1443–1448. https://doi.org/10.1038/ng.3679
    Loh et al. (2016)

GENESIS derived genetic principal component weights and relatedness estimates

Files: ../genesis/..

  • pcair_weights.tsv
  • pcrelate_relatedness.tsv
  • pcrelate_relatedness_grm.bin
  • pcrelate_relatedness_grm.id
  • pcrelate_relatedness_grm.N.bin
  • unrelateds_individuals.txt
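The GCTA-style fileset listed above can be read into a dense matrix with NumPy. The sketch below assumes the standard GCTA binary layout: the .bin file holds the lower triangle of the GRM (diagonal included) as 4-byte floats in row-major order, and the .id file holds one line per individual with family ID and individual ID columns. Whether the ABCD files follow this layout exactly is an assumption, and the function name read_gcta_grm is hypothetical.

```python
from typing import List, Tuple

import numpy as np


def read_gcta_grm(bin_path: str, id_path: str) -> Tuple[List[Tuple[str, ...]], np.ndarray]:
    """Read a GCTA-style GRM into (ids, symmetric n x n float32 matrix).

    Assumes the standard GCTA layout: float32 lower triangle including the
    diagonal, stored row by row, with IDs given one per line in id_path.
    """
    with open(id_path) as handle:
        ids = [tuple(line.split()[:2]) for line in handle if line.strip()]
    n = len(ids)

    lower = np.fromfile(bin_path, dtype="<f4")
    expected = n * (n + 1) // 2
    if lower.size != expected:
        raise ValueError(f"expected {expected} values for {n} IDs, found {lower.size}")

    grm = np.zeros((n, n), dtype=np.float32)
    grm[np.tril_indices(n)] = lower          # fill lower triangle row by row
    grm = grm + np.tril(grm, -1).T           # mirror the strict lower triangle
    return ids, grm
```

A call would look like read_gcta_grm("pcrelate_relatedness_grm.bin", "pcrelate_relatedness_grm.id"), with paths adjusted to the download location. For 11,670 individuals the dense float32 matrix occupies roughly 0.5 GB, so working with the lower triangle directly may be preferable on memory-constrained machines.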

Measure description: Accounting for genetic principal components (PCs) in genetic studies (both GWAS and Polygenic Score analysis) is considered best practice to account for effects of population stratification that can lead to spurious results. Traditional approaches for calculating PCs (e.g., FlashPCA), although considered best practice for many genetic studies, may not be suited for samples with large known or cryptic relatedness, as is observed in ABCD. As such, we have replaced these genetic PCs with ones calculated using PC-AiR, a method developed and validated for samples with large family structure. PC-AiR captures ancestry information not confounded by relatedness by finding a set of unrelated individuals in the sample that have the highest divergent ancestry, and computing PCs in this set. The remaining related individuals are then projected into this space. This method is used by the Population Architecture through Genomics and Environment (PAGE) Consortium, which is principally concerned with genetic studies in diverse ancestry populations.

PC-AiR was run using the default suggested settings from the GENESIS package. We used non-imputed SNPs passing QC from the 6.0 data release (~500k variants and 11,670 individuals). PC-AiR takes kinship estimates as input to define its unrelated set of individuals with divergent ancestry; these were computed using snpgdsIBDKING, as suggested by the GENESIS authors. SNPs were LD pruned using snpgdsLDpruning with parameters method=“corr”, slide.max.bp=10e6 and ld.threshold=sqrt(0.1), leaving 137,980 SNPs after pruning. Using the computed kinship matrix, PC-AiR was then run on this pruned set of SNPs. This identified 8,180 unrelated individuals from which PCs were derived, leaving 3,493 related individuals to be projected onto this space. Subsequent analysis indicated a sample mix-up affecting 3 samples, which were then removed from the other genetic data. This is why the sum of unrelated and related individuals exceeds the number of individuals in the PLINK files (8,180 + 3,493 = 11,673 > 11,670). The weights, which can be used to project other samples into the same PC space, can be found in pcair_weights.tsv, with the file /smokescreen/merged_chroms.bim indicating allele codings. The list of 8,180 unrelated individuals used for deriving the PCs is available in unrelateds_individuals.txt.

After computing PCs with PC-AiR, we computed a genetic relatedness matrix (GRM) using PC-Relate. PC-Relate aims to compute a GRM that is independent of the ancestry effects captured by PC-AiR. PC-Relate was run on the same pruned set of SNPs described above using the first two PCs computed from PC-AiR. Identity by descent probabilities between individuals i and j were calculated as \(\hat{\pi}_{ij} = \hat{k}_{ij}^{(2)} + 0.5 \times \hat{k}_{ij}^{(1)}\), where \(\hat{k}_{ij}^{(2)}\) and \(\hat{k}_{ij}^{(1)}\) represent the probabilities, estimated by PC-Relate, that individuals i and j share 2 or 1 alleles at a locus, respectively. For all off-diagonal elements of the GRM we provide estimates of \(\hat{k}_{ij}^{(0)}\), \(\hat{k}_{ij}^{(1)}\), \(\hat{k}_{ij}^{(2)}\), \(\hat{\pi}_{ij}\) and genetic relatedness in pcrelate_relatedness.tsv. A GCTA-style R version of the full GRM is provided in the pcrelate_relatedness_grm.* fileset. The range of \(\hat{\pi}\) is between 0 and 1. In most other genetic relatedness models GRM values are bounded by 1; however, due to high inbreeding coefficients of some subjects and minor allele frequencies close to zero, a small subset of GRM values estimated by PC-Relate in ABCD exceed 1. If you are concerned about how this may affect downstream modeling, we recommend clipping these values to 1. For a subset of related individuals (\(\hat{\pi}>0.35\)) we include estimates in the tabulated data in the gn_y_genrel instrument described above. Code used to perform the processes described in this section can be found here: https://github.com/robloughnan/ABCD_GeneticPCs_and_Relatedness.
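The \(\hat{\pi}\) formula and the suggested clipping step can be written out in a short Python sketch. The function names pihat and clip_relatedness are hypothetical; the example inputs are the textbook expected sharing probabilities for monozygotic twins, parent-offspring pairs, and full siblings, not values taken from the ABCD files.

```python
from typing import Iterable, List


def pihat(k1: float, k2: float) -> float:
    """Proportion of the genome shared identical by descent: k2 + 0.5 * k1."""
    return k2 + 0.5 * k1


def clip_relatedness(values: Iterable[float], upper: float = 1.0) -> List[float]:
    """Cap relatedness estimates at `upper`, leaving smaller values unchanged."""
    return [min(v, upper) for v in values]


if __name__ == "__main__":
    # Expected (k1, k2) for idealised relationships, for illustration only.
    print("monozygotic twins:", pihat(k1=0.0, k2=1.0))
    print("parent-offspring: ", pihat(k1=1.0, k2=0.0))
    print("full siblings:    ", pihat(k1=0.5, k2=0.25))
    print(clip_relatedness([0.2, 1.07]))
```

Clipping only affects the small subset of GRM values above 1; whether it is needed depends on how sensitive the downstream model is to those entries.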

Number of individuals: 11,670

Modifications since initial release: The number of principal components used for ancestry-adjusted allele frequency estimation has increased from 2 to 7 to better capture population structure. The release includes genotyping of individuals missing from previous data releases due to sample mix-up or failed quality control. Relatedness is now computed with PC-Relate; previous data releases used PLINK --genome for this calculation.

Key references:

  • Gogarten, S. M., Sofer, T., Chen, H., Yu, C., Brody, J. A., Thornton, T. A., Rice, K. M., & Conomos, M. P. (2019). Bioinformatics (Oxford, England), 35(24), 5346–5348. https://doi.org/10.1093/bioinformatics/btz567
    Gogarten et al. (2019)
  • Conomos, M. P., Miller, M. B., & Thornton, T. A. (2015). Genetic Epidemiology, 39(4), 276–293. https://doi.org/10.1002/gepi.21896
    Conomos et al. (2015)
  • Conomos, M. P., Reiner, A. P., Weir, B. S., & Thornton, T. A. (2016). The American Journal of Human Genetics, 98(1), 127–148. https://doi.org/10.1016/j.ajhg.2015.11.022
    Conomos et al. (2016)

References

Ahern, J., Thompson, W., Fan, C. C., & Loughnan, R. (2023). Behavior Genetics, 53(3), 292–309. https://doi.org/10.1007/s10519-023-10139-w
Baurley, J. W., Edlund, C. K., Pardamean, C. I., Conti, D. V., & Bergen, A. W. (2016). BMC Genomics, 17(1), 145. https://doi.org/10.1186/s12864-016-2495-7
Conomos, M. P., Miller, M. B., & Thornton, T. A. (2015). Genetic Epidemiology, 39(4), 276–293. https://doi.org/10.1002/gepi.21896
Conomos, M. P., Reiner, A. P., Weir, B. S., & Thornton, T. A. (2016). The American Journal of Human Genetics, 98(1), 127–148. https://doi.org/10.1016/j.ajhg.2015.11.022
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A. E., Kwong, A., Vrieze, S. I., Chew, E. Y., Levy, S., McGue, M., Schlessinger, D., Stambolian, D., Loh, P.-R., Iacono, W. G., Swaroop, A., Scott, L. J., Cucca, F., Kronenberg, F., Boehnke, M., … Fuchsberger, C. (2016). Nature Genetics, 48(10), 1284–1287. https://doi.org/10.1038/ng.3656
Ding, Y., Hou, K., Xu, Z., Pimplaskar, A., Petter, E., Boulier, K., Privé, F., Vilhjálmsson, B. J., Olde Loohuis, L. M., & Pasaniuc, B. (2023). Nature, 618(7966), 774–781. https://doi.org/10.1038/s41586-023-06079-4
Fan, C. C., Loughnan, R., Wilson, S., Hewitt, J. K., ABCD Genetic Working Group, Agrawal, A., Dowling, G., Garavan, H., LeBlanc, K., Neale, M., Friedman, N., Madden, P., Little, R., Brown, S. A., Jernigan, T., & Thompson, W. K. (2023). Behavior Genetics, 53(3), 159–168. https://doi.org/10.1007/s10519-023-10143-0
Gogarten, S. M., Sofer, T., Chen, H., Yu, C., Brody, J. A., Thornton, T. A., Rice, K. M., & Conomos, M. P. (2019). Bioinformatics (Oxford, England), 35(24), 5346–5348. https://doi.org/10.1093/bioinformatics/btz567
Loh, P.-R., Danecek, P., Palamara, P. F., Fuchsberger, C., A Reshef, Y., K Finucane, H., Schoenherr, S., Forer, L., McCarthy, S., Abecasis, G. R., Durbin, R., & L Price, A. (2016). Nature Genetics, 48(11), 1443–1448. https://doi.org/10.1038/ng.3679
Nichols, R. C., & Bilbro, W. C., Jr. (1966). Human Heredity, 16(3), 265–275. https://doi.org/10.1159/000151973
Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. (2023). National Academies Press. https://doi.org/10.17226/26902