Genetics
Domain overview
This document contains information on zygosity, genetic, and genetically derived data that are available for the ABCD sample for the 7.0 data release. We describe both tabulated data that is available in release tables listed below, in addition to bulk genetic data that includes the following:
Illumina Short Read Whole Genome Sequencing (WGS) data:
Population-level Variant Call Files (VCF) covering 8,710 individuals at ~169 million genetic variants.
Gene burden matrices covering 21k genes at 5 variant effect/impact categories
Sample level Picard variant calling summary metrics for single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs).
Single Nucleotide Polymorphism Micro-array data:
Curated genotyping data from Smokescreen array - set of PLINK files containing 11,670 unique subjects at ~515k variants
Imputed data based on TOPMED r3 reference panel - set of vcf files containing all imputed genotype data for 11,670 unique subjects at ~260 million variants
Genetic relatedness and \(\hat{\pi}\) estimates across the full sample using methods correcting ancestry background
Genetic principal component weights to enable projection of other samples to ABCD genetic PC space
For comprehensive details of quality control steps performed and a description of the genetic data within the ABCD sample please refer to and cite the following work:
Key reference: Fan, C. C., Loughnan, R., Wilson, S., Hewitt, J. K., ABCD Genetic Working Group, Agrawal, A., Dowling, G., Garavan, H., LeBlanc, K., Neale, M., Friedman, N., Madden, P., Little, R., Brown, S. A., Jernigan, T., & Thompson, W. K. (2023). Behavior Genetics, 53(3), 159–168. https://doi.org/10.1007/s10519-023-10143-0
7.0
- New Data: Inclusion of Illumina short read whole genome sequencing data on 8,710 individuals
- Genetic Relatedness Matrix: Now as GCTA-style formatted GRM fileset: .grm.bin + .grm.N.bin + .grm.id
6.1
6.0
Balanced Research Practices
Researchers must adhere to ethical guidelines and ensure that genetic data are analyzed and interpreted responsibly. Research should aim to advance scientific understanding and avoid misinterpretation or misuse of genetic findings. Evidence of users engaging in stigmatizing research will result in termination of data access.
Consideration of Population Descriptors
The use of population descriptors in genetic research can often be varied and inconsistent. We encourage users to review the NASEM report for consideration of appropriate population descriptors for their analysis. Self-reported race and ethnicity may reflect social and environmental experiences that do not directly correspond to genetic variation. Genetic principal components are provided to more accurately represent genetic variation without reliance on self-reported categories. Principal components provide unlabelled variables capturing genetic variation, whereas ancestry labels capture similar variation but typically are labeled in terms of continental ancestry groups (European, African, etc.).
To ensure genetic analyses are conducted with scientific rigor and precision, this data release provides genetic principal components rather than ancestry labels, following best practices in genomic research.
Key reference: National Academies of Sciences, Engineering, and Medicine. (2023). https://doi.org/10.17226/26902
Outliers of the genetic principal components that appear in the gn_y_popstruct table should be noted and analyzed as such.
Youth tables (tabulated data)
Genetic relatedness
Measure description: Data in this instrument indicate genetic and related similarity. gn_y_genrel_id__fam contains a unique number for each set of individuals who appear genetically related (genetic relatedness>0.35). We include gn_y_genrel_id__birth which, for individuals in the same (genetic) family, indicates those whose birthdays are within three months of each other (i.e., twins or triplets). These fields are cross listed in the “ABCD (general)” domain (ab_g_stc__design_id__fam__gen and ab_g_stc__design_id__birth__gen).
gn_y_genrel_id__fam can be used as a random effect in mixed effects models to account for relatedness in the ABCD sample. Depending on the analysis, users may select either or a combination of these variables to account for familial/shared environmental effects within the sample.
Twin Analysis Variables: To identify siblings or twins/triplets in the sample, researchers can use the gn_y_genrel_id__paired__{N}, gn_y_genrel_zyg__{N} and gn_y_genrel_pihat__{N} columns. These columns are derived from pairs of related individuals where both individuals have been genotyped. The columns are aligned such that, for a given subject, gn_y_genrel_id__paired__01 indicates the other subject in the sample they are related to, and gn_y_genrel_zyg__01 indicates whether this relationship is monozygotic (1), dizygotic (2) or singleton siblings (3). Monozygotic relationships have \(\hat{\pi}\)>0.8, dizygotic relationships have 0.8>\(\hat{\pi}\)>0.35 and matching birth dates between pairs, and sibling relationships have 0.8>\(\hat{\pi}\)>0.35 and birth dates more than 3 months apart. Finally, gn_y_genrel_pihat__01 represents the genetic relatedness of this relationship (as captured by \(\hat{\pi}\)). These columns are only defined for pairs of individuals with \(\hat{\pi}\)>0.35. Genetic relatedness across all pairs in the sample (thresholded) is available as part of the bulk genetic data (see section GENESIS below, which describes how genetic relatedness was calculated).
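The thresholds above can be written as a small decision rule. The following is a minimal sketch for illustration, not the code used to derive the release tables: the function name classify_pair and the same_birth flag (standing in for the birth date comparison) are assumptions, while the returned codes follow the gn_y_genrel_zyg__{N} coding described above.

```python
from typing import Optional

# Thresholds taken from the description above.
MZ_THRESHOLD = 0.8
RELATED_THRESHOLD = 0.35


def classify_pair(pihat: float, same_birth: bool) -> Optional[int]:
    """Return 1 (monozygotic), 2 (dizygotic), 3 (sibling) or None.

    `same_birth` is a hypothetical flag: True when the birth dates of the
    pair match, False when they are more than three months apart.
    """
    if pihat <= RELATED_THRESHOLD:
        return None  # pair columns are only defined above 0.35
    if pihat > MZ_THRESHOLD:
        return 1
    # The text does not say how a value of exactly 0.8 is handled;
    # in this sketch it falls into the dizygotic/sibling branch.
    return 2 if same_birth else 3
```

For example, classify_pair(0.5, same_birth=False) yields the sibling code 3, and a pair with a relatedness of 0.1 yields None.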
Modifications since initial administration: From data release 5.0 onwards, genetic relatedness has been computed by PC-Relate (see GENESIS section below for details). Previous data releases used PLINK --genome for this calculation which is less suitable for the population structure of the ABCD study.
gn_y_genrel
Key references:
- Gogarten, S. M., Sofer, T., Chen, H., Yu, C., Brody, J. A., Thornton, T. A., Rice, K. M., & Conomos, M. P. (2019). Bioinformatics (Oxford, England), 35(24), 5346–5348. https://doi.org/10.1093/bioinformatics/btz567
- Conomos, M. P., Miller, M. B., & Thornton, T. A. (2015). Genetic Epidemiology, 39(4), 276–293. https://doi.org/10.1002/gepi.21896
- Conomos, M. P., Reiner, A. P., Weir, B. S., & Thornton, T. A. (2016). The American Journal of Human Genetics, 98(1), 127–148. https://doi.org/10.1016/j.ajhg.2015.11.022
Population structure
Measure description: Genetic principal components (PCs), computed using GENESIS (Gogarten et al., Bioinformatics, 2019), are provided under fields gn_y_popstruct_pc__{N}. For details of this procedure, including where to find PC weights that enable projection of other samples onto this PC space, please see section GENESIS Derived Principal Component Weights and Relatedness Estimates below.
Modifications since initial administration: Genetic ancestry factors had been previously released with ABCD 4.0 data release and earlier releases. See “Non-tabulated Genetic Data” section of this document to find some resources for computing these measures if necessary. From the 5.0 data release forward, population structure is captured by genetic principal components described below in “GENESIS…” section.
Twin zygosity rating
Measure description: The Twin Zygosity Rating study used in-person ratings by two research assistants (RAs) on degree of similarity for a range of physical characteristics (such as hair color, hair texture, and eye color) in twin siblings to estimate zygosity. Photographs taken of each twin were available to help resolve discrepancies in the ratings taken by the RAs. Please note, as described above, genetic-based zygosity is provided as part of the gn_y_genrel table.
Key reference:
- Nichols, R. & Bilbro, J. W. (1966). Human Heredity, 16(3), 265–275. https://doi.org/10.1159/000151973
File-based data
Whole Genome Sequencing (WGS)
Genome Build: GRCh38
Number of variants: ~169 million variants
Number of individuals: 8,710
Read depth: 30x
Sequencing was performed on Illumina NovaSeq instruments at 30x coverage, with reads aligned to the GRCh38 reference genome. Blood and saliva samples were collected across participants roughly equally, with blood preferred for sequencing where available. DNA extraction and biospecimen storage were carried out by SAMPLED. Sentieon (versions 202112.06, 202112.01 and 202010.02) were used to map reads and perform sample-level variant calling. Sample level variant calling metrics were calculated using picard and GATK (Genome Analysis Toolkit).
Sample-level quality control was anchored by cross-validating WGS data against existing microarray genotype data on the same individuals, with discordant samples flagged for resequencing. A total of 95 samples were removed on this basis, yielding the final set of 8,877 samples covering 8,710 individuals (including 168 technical replicates). All sample IDs within files correspond to participant IDs of the ABCD study. For technical replicates, a suffix marks whether the replicate is a different sample type (e.g. _saliva or _wholeblood) or the same sample type (e.g. _wholeblood_replicate).
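When linking WGS samples back to tabulated data, the replicate suffix has to be separated from the participant ID. The sketch below shows one way to do this, assuming that only the three suffixes given as examples above occur; the released files may use others, and the function name split_sample_id and the ID SUBJ001 are hypothetical.

```python
from typing import Optional, Tuple

# Suffixes mentioned in the text, longest first so that
# "_wholeblood_replicate" is matched before any shorter suffix.
KNOWN_SUFFIXES = ("_wholeblood_replicate", "_wholeblood", "_saliva")


def split_sample_id(sample_id: str) -> Tuple[str, Optional[str]]:
    """Split a sample ID into (participant ID, replicate label or None)."""
    for suffix in KNOWN_SUFFIXES:
        if sample_id.endswith(suffix):
            # Drop the suffix and return it without the leading underscore.
            return sample_id[: -len(suffix)], suffix[1:]
    return sample_id, None
```

Matching on known suffixes rather than splitting on underscores keeps participant IDs intact even if they contain underscores themselves.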
Full methodological details - including library preparation, alignment parameters, joint genotyping strategy, and QC thresholds - will be provided in an upcoming data resource paper. Data related to whole genome sequencing can be found under the following directory /concatenated/genetics/sequencing/ and all file paths in this section are relative to that root path.
SNP and INDEL Population Level VCF’s
Population-level VCF files were generated by joint genotyping across all samples using GATK (v4.4.0). Files are released per chromosome and chunked into 977 separate files named as follows:
./snv_indel/population_vcf/abcd_cohort_chr1_block0.vcf.gz
./snv_indel/population_vcf/abcd_cohort_chr1_block0.vcf.gz.tbi
./snv_indel/population_vcf/abcd_cohort_chr1_block1.vcf.gz
./snv_indel/population_vcf/abcd_cohort_chr1_block1.vcf.gz.tbi
./snv_indel/population_vcf/abcd_cohort_chr1_block2.vcf.gz
./snv_indel/population_vcf/abcd_cohort_chr1_block2.vcf.gz.tbi
…
Due to poor identifiability and computational tractability, we also remove low complexity regions (these regions correspond to major repetitive elements such as centromeres and large segmental duplications). Both low complexity region files and chunking scheme files are described in the support file sections below.
For variant-level QC we applied excess heterozygosity filtering (>54.69) and VQSR with GATK (v4.6.2), which attach a flag to the FILTER field of the VCF files. The flags indicate the following:
| Filter | Interpretation |
|---|---|
| PASS | High-confidence variants |
| VQSRTrancheSNP99.80to99.90 | Borderline quality SNPs |
| VQSRTrancheSNP99.90to99.95 | Lower confidence SNPs |
| VQSRTrancheSNP99.95to100.00 | Near-certain false positive SNPs |
| VQSRTrancheINDEL99.80to99.90 | Borderline quality INDELs |
| VQSRTrancheINDEL99.90to99.95 | Lower confidence INDELs |
| VQSRTrancheINDEL99.95to100.00 | Near-certain false positive INDELs |
| ExcessHet | Filtered pre-VQSR (likely non-diploid sites) |
For the majority of users we recommend using high-confidence sites, which can be filtered with bcftools as follows:
bcftools view -f PASS input.vcf.gz -o output.vcf.gz -O z
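Before filtering, it can be useful to see how many records in a block carry each FILTER value. The following is a minimal standard-library sketch, assuming the files follow the usual VCF column layout (FILTER as the seventh tab-separated field); the function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_filters(lines: Iterable[str]) -> Counter:
    """Tally the FILTER column (7th field) over VCF data lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        fields = line.rstrip("\n").split("\t")
        # A record with several filters keeps them as one
        # semicolon-joined string, which is tallied as-is here.
        counts[fields[6]] += 1
    return counts


def count_filters_in_file(path: str) -> Counter:
    """Tally FILTER values in a bgzip- or gzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        return count_filters(handle)
```

Calling count_filters_in_file on one of the block files listed above would return a Counter keyed by the FILTER values in the table.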
Each genomic block VCF is normalised by splitting multiallelic sites with bcftools norm, then annotated using Ensembl VEP (version 112.0, GRCh38, offline cache) to append functional consequence predictions, gene symbols, regulatory context, canonical transcript flags, and existing variant information to each variant record. The resulting annotated VCFs are indexed and retained in compressed format for downstream filtering.
Gene Burden Matrix
Following VEP annotation, functional variants are extracted by filtering on a configurable set of VEP consequences, restricted to PASS-filtered sites. Variants are categorised into five masks of increasing breadth based on their predicted consequence and VEP impact rating (see table below), then grouped per gene to construct a burden feature matrix per genomic block. Each cell records the aggregate count of qualifying rare variants (MAF 0–1%) carried by a given sample for a given gene–mask combination. Block-level matrices are concatenated into a single genome-wide sparse matrix (.mtx format) with one feature per gene–mask pair (named GENESYMBOL.MaskName) and one row per sample, ready for downstream association or machine-learning analyses.
| Mask | Included VEP Consequences | Description |
|---|---|---|
| LoF | stop_gained, stop_lost, start_lost, splice_donor_variant, splice_acceptor_variant, frameshift_variant, transcript_ablation | Loss-of-function variants predicted to disrupt or eliminate the protein product |
| Missense | missense_variant, inframe_deletion, inframe_insertion, protein_altering_variant | Non-synonymous variants that alter the amino acid sequence while preserving the reading frame |
| HIGH | Any variant with VEP IMPACT = HIGH | VEP’s own high-impact tier (largely overlaps LoF but assigned directly by VEP) |
| MODERATE | Any variant with VEP IMPACT = MODERATE | VEP’s moderate-impact tier (largely overlaps Missense) |
| Functional | Union of all above | Catch-all mask containing every variant assigned to at least one other mask |
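The mask definitions in the table can be read as a simple membership rule. The sketch below restates that rule in code, assuming each variant arrives as a set of VEP consequence terms plus its IMPACT value; it is an illustration, not the pipeline's actual implementation, and the function name assign_masks is hypothetical.

```python
from typing import Set

# Consequence lists copied from the mask table above.
LOF_TERMS = {
    "stop_gained", "stop_lost", "start_lost", "splice_donor_variant",
    "splice_acceptor_variant", "frameshift_variant", "transcript_ablation",
}
MISSENSE_TERMS = {
    "missense_variant", "inframe_deletion", "inframe_insertion",
    "protein_altering_variant",
}


def assign_masks(consequences: Set[str], impact: str) -> Set[str]:
    """Return the set of mask names a variant would fall into."""
    masks: Set[str] = set()
    if consequences & LOF_TERMS:
        masks.add("LoF")
    if consequences & MISSENSE_TERMS:
        masks.add("Missense")
    if impact in ("HIGH", "MODERATE"):
        masks.add(impact)
    if masks:
        masks.add("Functional")  # union of all other masks
    return masks
```

Under this rule a stop_gained variant with IMPACT = HIGH belongs to the LoF, HIGH and Functional masks, while a synonymous variant with IMPACT = LOW belongs to none.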
The genome-wide output consists of the following three files (which can be found at ./snv_indel/gene_burden_mat):
| File | Contents |
|---|---|
| genome_wide_gene_burden_matrix.mtx | Sparse Matrix Market format; rows = samples, cols = gene-masks |
| genome_wide_gene_burden_matrix.samples.txt | Ordered sample IDs (one per line) |
| genome_wide_gene_burden_matrix.genemasks.txt | Ordered gene-mask labels, e.g. BRCA2.LoF (one per line) |
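Because each column label has the form GENESYMBOL.MaskName, the columns belonging to one mask can be picked out from the labels file alone. The helper below is a hedged sketch of that step: the function names are hypothetical, and splitting on the last dot is an assumption made so that gene symbols containing dots are kept whole.

```python
from typing import List, Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split 'GENESYMBOL.MaskName' into (gene, mask) at the last dot."""
    gene, mask = label.rsplit(".", 1)
    return gene, mask


def columns_for_mask(labels: List[str], mask: str) -> List[int]:
    """Return the column indices whose label ends in the given mask name."""
    return [i for i, label in enumerate(labels) if split_label(label)[1] == mask]
```

The returned indices can be used to slice the columns of the sparse matrix once it has been loaded, for example to keep only the LoF features.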
SciPy in Python can be used to read these files, for example:
import scipy.io
import scipy.sparse as sp
import pandas as pd
# ── paths ────────────────────────────────────────────────────────────────────
base = "results/gene_burden/genome_wide_gene_burden_matrix"
# ── load labels ──────────────────────────────────────────────────────────────
samples = pd.read_csv(f"{base}.samples.txt", header=None, names=["sample_id"])["sample_id"].tolist()
genemasks = pd.read_csv(f"{base}.genemasks.txt", header=None, names=["gene_mask"])["gene_mask"].tolist()
# ── load sparse matrix ───────────────────────────────────────────────────────
sparse_mat = scipy.io.mmread(f"{base}.mtx").tocsr() # shape: (n_samples, n_gene_masks)
burden_df = pd.DataFrame.sparse.from_spmatrix(sparse_mat, index=samples, columns=genemasks)
Metadata and Metrics
Although the same GATK and VEP versions were used for this release, different Sentieon versions were used across samples. Additionally, approximately 40% of samples were saliva while the rest were whole blood samples. We recommend using both Sentieon version and sample type as covariates for downstream analyses using the WGS data. These data can be found at ./snv_indel/metadata/sample_info.tsv
Variant callsets are evaluated using two Picard tools: 1) CollectVariantCallingMetrics was run on the VQSR-filtered VCF(s) against a dbSNP reference VCF, restricted to a target interval list. It produces per-callset detail and summary metrics files (.variant_calling_detail_metrics, .variant_calling_summary_metrics) capturing statistics such as SNP/indel counts, novelty rates, Ti/Tv ratios, and dbSNP concordance. 2) AccumulateVariantCallingMetrics was used to aggregate the per-shard detail and summary metrics files from the sharded collection step into a single combined output, enabling cohort-level reporting across scattered VCF chunks. The output from this can be found at ./snv_indel/metadata/picard_metrics.tsv.
Support Files
./snv_indel/supportfiles contains the following files:
- GRCh38.no_alt_analysis_set.fa / .fa.fai: The GRCh38 human reference genome (no-alt analysis set) used for alignment and variant calling, with accompanying samtools FASTA index.
- low_complexity_regions.tsv: A TSV file listing genomic regions excluded from haplotype calling and VCF output due to low sequence complexity. Regions were identified by scanning the GRCh38 reference genome using a sliding-window k-mer uniqueness approach (10 kb windows, 25-mer complexity score < 0.6), retaining only contiguous low-complexity blocks ≥ 1 Mb. These regions correspond to major repetitive elements such as centromeres and large segmental duplications.
- pvcf_blocks.txt: Defines the chromosome-level genomic block partitioning used to split the callset into population VCF (pVCF) shards. The blocking scheme follows the UK Biobank convention, enabling compatibility with downstream tools and pipelines designed around that standard.
Micro Array Based Data
For extraction of single genetic variants, users can use tools like bed-reader for Python and snpStats for R to parse the PLINK bed files described below. Methods such as PRS-CSx have been developed to generate polygenic scores in samples with the high degree of genetic diversity found in ABCD. PRS-CS shows comparable performance to PRS-CSx in ABCD data, whilst requiring fewer analysis steps (e.g. cross validation) (Ahern et al., Behavior Genetics, 2023). The performance of polygenic scores varies as genetic distance from the training sample increases (Ding et al., 2023). Due to this known issue, conducting stratified analysis of individuals that share similar continental ancestry is currently considered best practice, although this is an evolving field with new approaches constantly being developed. Tools such as ADMIXTURE can generate genetic factors to enable this type of analysis.
Key references:
- Ahern, J., Thompson, W., Fan, C. C., & Loughnan, R. (2023). Behavior Genetics, 53(3), 292–309. https://doi.org/10.1007/s10519-023-10139-w
- Ding, Y., Hou, K., Xu, Z., Pimplaskar, A., Petter, E., Boulier, K., Privé, F., Vilhjálmsson, B. J., Olde Loohuis, L. M., & Pasaniuc, B. (2023). Nature, 618(7966), 774–781. https://doi.org/10.1038/s41586-023-06079-4
Plink files described do not contain family relatedness or sex information. For family relatedness please refer to “GENESIS Derived Genetic Principal Component Weights and Relatedness Estimates” section below or “Genetic Relatedness” above. For sex, refer to ab_g_stc__cohort_sex in ab_g_stc.
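To make clear what tools such as bed-reader do when they extract a single variant, the sketch below reads one variant directly from a PLINK 1 binary .bed file using only the standard library. It assumes the variant-major layout of the PLINK 1 format (three header bytes followed by two bits per sample); the function name read_variant is hypothetical, and n_samples would be the number of lines in the matching .fam file. For real analyses the dedicated tools named above remain the better choice.

```python
import math
from typing import List, Optional

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 variant-major .bed header

# Two-bit code -> copies of the first .bim allele (None = missing call).
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def read_variant(bed_path: str, variant_index: int,
                 n_samples: int) -> List[Optional[int]]:
    """Return per-sample counts of the first .bim allele for one variant."""
    bytes_per_variant = math.ceil(n_samples / 4)  # four samples per byte
    with open(bed_path, "rb") as handle:
        if handle.read(3) != BED_MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        handle.seek(3 + variant_index * bytes_per_variant)
        block = handle.read(bytes_per_variant)
    if len(block) != bytes_per_variant:
        raise IndexError("variant index out of range")
    calls: List[Optional[int]] = []
    for sample in range(n_samples):
        # Samples are packed from the lowest-order bits upwards.
        code = (block[sample // 4] >> (2 * (sample % 4))) & 0b11
        calls.append(CODE_TO_A1_COUNT[code])
    return calls
```

The position of a variant in the .bim file gives its variant_index, and the order of the returned list matches the order of individuals in the .fam file.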
Non-tabulated genetic data are split into three directories under ../dairc/concatenated/genetics/genotype_microarray/.. as follows:
- /smokescreen/: smokescreen genotype array data (non-imputed)
- /imputed/: TOPMED imputed array data, derived from smokescreen
- /genesis/: GENESIS derived variables
The contents of each of these directories are described in the sections below.
Smokescreen binarized PLINK files and batch info
Files: ../smokescreen/..
- merged_chroms.bed
- merged_chroms.bim
- merged_chroms.fam
- batch.info
- removed_individuals.txt
Measure description: After dish quality control and profile checks, genotypes were called using Axiom Analysis Suite (apt version 2.11) on raw intensities from the Affymetrix Smokescreen array. Based on the best practices analysis workflow by Thermo Fisher, classifications that passed the final SNP quality controls were recommended, resulting in ~515K recommended probe sets in each genotyping batch. Blood and saliva DNA samples were genotyped separately. We include one genotype result for each subject, using whichever sample has the best QC metrics (call rates and missingness). There were nine genotyping batches in Data Release 6.0, spanning 147 plates (see batch.info in downloaded files). After obtaining the genotype batches, we mapped the probe sets to SNPs using annotations derived from genome build hg19. After the mapping, we merged all nine batches into one study cohort and then performed additional study-level QC, requiring missingness of less than 10% at the SNP level and less than 20% at the sample level. 515,279 variants and 11,670 people passed filters and QC. The subsequent imputation and relatedness inferences were based on the final curated genotype data. The batch information can be found in batch.info. The file removed_individuals.txt lists individuals that were removed between the 6.0 and 7.0 data releases due to either withdrawn consent or indications of sample mix-up (e.g. mismatching of genetic relatedness with known family members).
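The study-level missingness filter can be illustrated on a toy genotype matrix. The sketch below is an assumption-laden illustration rather than the QC code used for the release: the order of the two steps (variants first, then samples), the use of None for missing calls, and the function names are all hypothetical, while the default thresholds come from the description above.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # None marks a missing call


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of missing calls; an empty list counts as fully missing."""
    if not calls:
        return 1.0
    return sum(call is None for call in calls) / len(calls)


def qc_filter(matrix: List[List[Genotype]],
              max_snp_missing: float = 0.10,
              max_sample_missing: float = 0.20) -> Tuple[List[int], List[int]]:
    """Return (kept sample indices, kept variant indices).

    matrix[s][v] holds the call of sample s at variant v.
    """
    n_samples = len(matrix)
    n_variants = len(matrix[0]) if matrix else 0
    # Step 1 (assumed order): drop variants with too many missing calls.
    keep_variants = [
        v for v in range(n_variants)
        if missing_rate([matrix[s][v] for s in range(n_samples)]) < max_snp_missing
    ]
    # Step 2: drop samples with too many missing calls at the kept variants.
    keep_samples = [
        s for s in range(n_samples)
        if missing_rate([matrix[s][v] for v in keep_variants]) < max_sample_missing
    ]
    return keep_samples, keep_variants
```

In practice this kind of filter is applied with dedicated genetics software on the full array data; the toy version only makes the two thresholds concrete.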
Genome Build: hg19
Number of variants: ~515k
Number of individuals: 11,670
Modifications since initial release: Includes genotyping of missing individuals from previous data releases due to sample mix up or failing quality control measures.
Key reference: Baurley, J. W., Edlund, C. K., Pardamean, C. I., Conti, D. V., & Bergen, A. W. (2016). BMC Genomics, 17(1), 145. https://doi.org/10.1186/s12864-016-2495-7
Imputed VCF files using TOPMED imputation panel
Files: ../imputed/..
- chr{c}_dose.vcf.gz
- chr{c}_dose.vcf.gz.tbi
- …
- qcreport.html
Measure description: The curated genotype data were used for the imputation, following the bioinformatic pipelines and recommendations of the TOPMED Server, with the TOPMED r3 reference panel. We submitted unphased genotypes, which were phased with Eagle before imputation, with the population set to “all”. TOPMED includes rsID numbers automatically in the output files.
The TOPMED imputation scores and post-imputation quality report can be found at qcreport.html in this folder. In addition to estimated allele dosages, an R2 field in vcf files contains an estimated imputation accuracy which can be used to filter high quality imputed variants.
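Filtering on imputation accuracy can be sketched in a few lines of standard-library code. This example assumes that R2 is stored as a key in the INFO column (the eighth tab-separated field) of each record; the function names and any threshold passed as min_r2 are illustrative choices, not recommendations from the release.

```python
from typing import Iterable, Iterator, Optional


def info_r2(info: str) -> Optional[float]:
    """Extract the R2 value from a VCF INFO string, if present."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "R2" and value:  # exact match, so other keys such as ER2 are ignored
            return float(value)
    return None


def filter_by_r2(lines: Iterable[str], min_r2: float) -> Iterator[str]:
    """Yield header lines plus records whose R2 is at least min_r2."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and header lines
            continue
        fields = line.split("\t")
        r2 = info_r2(fields[7].strip())
        if r2 is not None and r2 >= min_r2:
            yield line
```

Records without an R2 entry are dropped in this sketch; whether that is appropriate depends on how typed-only variants are represented in the files.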
ABCD Classification: Genetic
Genome Build: GRCh38
Number of variants: ~260 million
Number of individuals: 11,670
Key references:
- Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A. E., Kwong, A., Vrieze, S. I., Chew, E. Y., Levy, S., McGue, M., Schlessinger, D., Stambolian, D., Loh, P., Iacono, W. G., Swaroop, A., Scott, L. J., Cucca, F., Kronenberg, F., Boehnke, M., Abecasis, G. R., & Fuchsberger, C. (2016). Nature Genetics, 48(10), 1284–1287. https://doi.org/10.1038/ng.3656
- Loh, P., Danecek, P., Palamara, P. F., Fuchsberger, C., A Reshef, Y., K Finucane, H., Schoenherr, S., Forer, L., McCarthy, S., Abecasis, G. R., Durbin, R., & L Price, A. (2016). Nature Genetics, 48(11), 1443–1448. https://doi.org/10.1038/ng.3679