Calculate SNP Heritability

Using SumHer

Posted by Yuan on September 29, 2022

A hero is one who cherishes great ambition in his heart and sound strategy within, who has the capacity to embrace the universe and the will to swallow heaven and earth.

Taggings in heritability modeling

All applications of SumHer require a tagging file, which records the (relative) expected heritability tagged by each predictor. Recommendations from Dougspeed:

  1. Creating the tagging file requires a (well-matched) Reference Panel. Use an extensive panel (e.g., imputed or sequencing data), retaining SNPs with MAF above 0.005 and information score above 0.8 (a filtering sketch follows this list).
  2. When calculating the tagging file, you must specify a Heritability Model.
  3. When analysing human data: use the BLD-LDAK Model to estimate SNP Heritability or Heritability Enrichments (for pre-defined categories) from summary statistics; use the LDAK-Thin Model to estimate Genetic Correlations; use the LDAK-Thin Model to analyse individual-level data; use the BLD-LDAK+Alpha Model to estimate the selection-related parameter alpha.
  4. When analysing non-human data, always use the LDAK-Thin Model.
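The filtering step itself is not shown here; below is a minimal sketch using awk and PLINK, assuming a hypothetical table panel.info with columns SNP, MAF and INFO (adapt the file name and column numbers to your data).

#Keep SNPs with MAF above 0.005 and info score above 0.8
#panel.info is a hypothetical three-column table: SNP, MAF, INFO
awk '(NR>1 && $2>0.005 && $3>0.8){print $1}' panel.info > keep.snps

#Subset the reference panel to those SNPs
plink --bfile panel --extract keep.snps --make-bed --out ref.panel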

Different heritability models

In the additive model
$y_j = m + \sum_i{h_i z_{ij}} + \epsilon_j$         (1)

we have (assuming the phenotype and genotypes are standardized, so $Var(y_j) = Var(z_{ij}) = 1$, and SNPs are independent):
$1 = Var(y_j) = \sum_i{Var(h_i z_{ij})} + Var(\epsilon_j) = \sum_i{h_i^2 Var(z_{ij})} + Var(\epsilon_j) = \sum_i{h_i^2} + Var(\epsilon_j)$

If $h_i$ is also treated as a random variable, which is what most current GWAS methods do (there are far more SNPs than samples, so a prior is placed on $h_i$ and equation (1) is fitted with a Bayesian approach), then heritability (the variance contributed by genetics) can be expressed as:
$E(\sum_i{h_i^2}) = \sum_i{E(h_i^2)}$
For each $h_i$, we have:
$E(h_i^2) = (E(h_i))^2 + Var(h_i)$         (2)

In a GWAS, $h_i$ is almost always small, and thus it is assumed that:
$h_i \overset{\mathrm{iid}}{\sim} N(0,\sigma^2)$

Thus $E(h_i)=0$, and eq. (2) becomes:
$E(h_i^2) = Var(h_i) = \sigma^2$         (3)

Many software packages, such as GCTA, LDSC, lassosum and LDpred, use a uniform-$\sigma^2$ model, in which $\sigma^2$ is constant across all SNPs.
Others, such as LDAK, assume $\sigma_i^2$ is affected by MAF, LD and information score, with the general form:

$\sigma_i^2 \propto (p_i(1-p_i))^{1+\alpha}w_{i}r_{i}$         (4)

Here:

  1. $p_i$ is the MAF of $SNP_i$;
  2. $w_i$ is a SNP-specific weight that is a function of the inverse of the LD score of $SNP_i$, so SNPs in regions of low LD contribute more than those in high-LD regions;
  3. $r_i \in [0,1]$ is an information score measuring genotype certainty, so that high-quality SNPs contribute more than low-quality ones (a worked example follows this list).
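To make eq. (4) concrete, here is a small worked example (my own numbers, not from the LDAK documentation). Take a SNP with $p_i=0.3$, $w_i=1$ and $r_i=1$:

$\alpha=-0.25:\quad \sigma_i^2 \propto (0.3 \times 0.7)^{0.75} = 0.21^{0.75} \approx 0.31$
$\alpha=-1:\quad \sigma_i^2 \propto (0.3 \times 0.7)^{0} = 1$

Under GCTA's $\alpha=-1$, every SNP has the same expected per-SNP heritability, while LDAK's $\alpha=-0.25$ gives common, well-imputed, low-LD SNPs a larger expected contribution.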

Model Comparisons

For each model, the entries below give $\alpha$, the weighting $w_i$, comments, and a summary of the implementation in the LDAK software.

GCTA: $\alpha=-1$, $w_i=1$. Constant expected variance of $h_i$ across the genome. In LDAK, this model is achieved by adding the options --ignore-weights YES and --power -1 when Calculating Kinships or Calculating Taggings.

LDAK: $\alpha=-0.25$, weighting $w_i$. Accounts for LD: in high-LD regions $w_i$ should be smaller. In LDAK, this model is achieved by adding the options --weights <weightsfile> and --power -0.25 when Calculating Kinships or Calculating Taggings, where the weightsfile provides the LDAK Weightings.

LDAK-Thin: $\alpha=-0.25$, weighting $I_iw_i$, where $I_i$ indicates whether the SNP remains after thinning for duplicate SNPs. In LDAK, the LDAK-Thin Model is achieved by adding the options --weights <weightsfile> and --power -0.25 when Calculating Kinships or Calculating Taggings, where the weightsfile gives weight one to the SNPs that remain after Thinning Predictors with the options --window-kb 100 and --window-prune 0.98.

LDSC: $\alpha=-1$, weighting $\sum_{j=1}^{74}{\tau_ja_{ji}} + \tau_{75}$, where the $a$ values can be derived from the LDSC website.
  1. Download the Baseline LD annot.gz files from the LDSC website, then make files called baselineLD1, ..., baselineLD74 (where baselineLD$k has two columns, providing the SNP names then the values of Annotation k), the same as for the BLD-LDAK model below.
  2. Calculate Taggings adding --annotation-number 74, --annotation-prefix baselineLD, --ignore-weights YES and --power -1. Equivalently, you could make an extra file called baselineLD75 that contains the names of all SNPs, then replace --annotation-number 74 and --annotation-prefix baselineLD with --partition-number 75 and --partition-prefix baselineLD.

BLD-LDAK: $\alpha=-0.25$, weighting $\sum_{j=1}^{64}{\tau_jb_{ji}} + \tau_{65}w_i+\tau_{66}$, where $b_1, b_2, \ldots, b_{64}$ are the non-MAF annotations from the Baseline LD Model and $w_i$ is the LDAK weighting (computed using only high-quality SNPs).
  1. Download the files bld1, bld2, ..., bld64 from the BLD-LDAK Annotations.
  2. Calculate the LDAK Weightings from your reference panel and rename them bld65.
  3. Calculate Taggings adding --annotation-number 65, --annotation-prefix bld, --ignore-weights YES and --power -0.25.

BLD-LDAK+Alpha: weighting as in BLD-LDAK. The 67-parameter BLD-LDAK+Alpha Model generalizes the BLD-LDAK Model by allowing alpha to vary (instead of fixing it at -0.25).
  1. Get the 65 bld files, the same as for the BLD-LDAK model;
  2. Create 31 instances of the model, corresponding to alpha equal to -1, -0.95, ..., 0.45, 0.5;
  3. For each alpha, calculate taggings adding --annotation-number 65, --annotation-prefix bld, --ignore-weights YES and --power <alpha>.

Bash script for the above task of extracting the LDSC scores to get bld0-bld64:

#Remove any existing bld0 and base1-base74 files (-f suppresses errors if they do not exist)
rm -f bld0 base{1..74}
#Extract the LD annotations and append each annotation to its own base$i file
for j in {1..22}; 
do 
   gunzip -c baselineLD_v2.1/baselineLD.$j.annot.gz | awk '(NR>1){for(i=1;i<=74;i++){if($(5+i)!=0){print $1":"$2, $(5+i) >> "base"i}}print $1":"$2 >> "bld0"}'; 
done

#Rename base to bld, skipping the 10 MAF annotations (base59-base68)
for j in {1..58}; do mv base$j bld$j; done
for j in {59..64}; do mv base$((10+j)) bld$j; done
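A quick sanity check (my own addition, not from the original tutorial): every bld file should now be non-empty.

#Each of bld0-bld64 should be non-empty; empty files suggest the awk column offsets do not match your annot.gz version
for j in {0..64}; do wc -l bld$j; done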

Bash scripts to get bld65 (the LDAK weightings); three equivalent approaches are shown below.

#Calculate LDAK SNP weights; here, we calculate weights separately for each chromosome then merge
#The on-screen instructions explain how to further parallelize (use --calc-weights not --calc-weights-all) 
for j in {1..22}; do
ldak.linux --cut-weights alz_chr$j --bfile AD.ref --extract alz.txt --chr $j
ldak.linux --calc-weights-all alz_chr$j --bfile AD.ref --extract alz.txt --chr $j
done

#Merge weights across chromosomes
cat alz_chr{1..22}/weights.short > bld65
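A quick check (my own addition): a weightsfile should contain two columns, the predictor name then its weighting.

#bld65 should have exactly two columns on every line
awk '{print NF}' bld65 | sort -u
head -3 bld65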

Alternatively, cut the predictors into sections and compute the weights for each section separately (the number of sections is data-dependent and is reported by --cut-weights, so adjust the loop range accordingly):
ldak.linux --cut-weights sections --bfile AD.ref --extract alz.txt
for j in {1..22}; do
ldak.linux --calc-weights sections --bfile AD.ref --section $j --extract alz.txt
done
ldak.linux --join-weights sections --bfile AD.ref 

cp sections/weights.short bld65

# Alternatively, we could just run all chromosomes as one job
ldak.linux --cut-weights sections --bfile AD.ref --extract alz.txt 
ldak.linux --calc-weights-all sections --bfile AD.ref --extract alz.txt 
cp sections/weights.short bld65

#The weightings will be saved in <folder>/weights.short; this contains only predictors with non-zero weightings (as predictors with zero weighting can be ignored). A more detailed version will be saved in <folder>/weights.all; for each predictor, this reports the weighting, the number of neighbours (how many predictors were within the window), the tagging (the sum of squared-correlations with predictors within the window), the information score of the predictor and a "check" (the total tagging of the predictor after weighting; if the weightings are accurate, this value should match the information score).
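To eyeball the detailed output, here is my own snippet; I assume the weights.all columns appear in the order listed above (predictor name, weighting, neighbours, tagging, information score, check) and that the file has a header line, so verify against your actual output.

#Count predictors whose "check" value deviates from the information score by more than 0.01
awk '(NR>1){d=$6-$5; if(d<0)d=-d; if(d>0.01)n++} END{print n+0, "predictors with |check - info| > 0.01"}' sections/weights.all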

Calculating taggings

Below are the tagging calculations in the LDAK software under the different heritability models. This section mainly follows Dougspeed.
The main argument is --calc-tagging <outfile>.
This requires the options

--bfile/--gen/--sp/--speed - to specify the genetic data files (see File Formats).

--weights <weightsfile> or --ignore-weights YES - to specify the predictor weightings (a weightsfile should have two columns, providing the predictor names then the weightings; note that in the BLD-LDAK models these weights are instead supplied as annotation bld65).

--power - to specify how predictors are scaled.

--window-cm - to specify the window size (how far to search for tagging predictors). It generally suffices to use 1 cM. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.

You can use --keep and/or --remove to restrict to a subset of samples, and --extract and/or --exclude to restrict to a subset of predictors (for more details, see Data Filtering).

To provide SNP annotations, use either --partition-number and --partition-prefix, or --annotation-number and --annotation-prefix (more details below).

To save the heritability matrix, add --save-matrix YES (necessary if performing Prediction).

To parallelize this process, you can add --chr to calculate taggings for each chromosome separately, then combine these with the argument --join-tagging, using the option --taglist to specify the per-chromosome tagging files (see the example below).

GCTA / LDSC

The simplest model is GCTA. LDSC also assumes --power -1; I have not tested where the LDSC scores enter in ldak.linux, but from the Model Comparisons above they should enter as annotations, via --annotation-number 74 and --annotation-prefix baselineLD (see the sketch after the GCTA command).

ldak.linux --calc-tagging GCTA --bfile human --ignore-weights YES --power -1 --window-cm 1
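Following that recipe, an LDSC-style tagging command would look like this (an untested sketch; it assumes the baselineLD1-baselineLD74 annotation files have been prepared as described in the Model Comparisons):

ldak.linux --calc-tagging LDSC --bfile human --ignore-weights YES --power -1 --window-cm 1 --annotation-number 74 --annotation-prefix baselineLD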

LDAK

Here we need --weights bld65, where bld65 is the LDAK weightings file computed above for the BLD-LDAK model.

ldak.linux --calc-tagging LDAK --bfile human --weights bld65 --power -.25 --window-cm 1

LDAK-Thin

The taggings are saved in LDAK-Thin.tagging.

# 1. Create a weightsfile that gives weighting one to the predictors that remain after thinning for duplicates
#  The predictors that remain are listed in thin.in and get weighting one in weights.thin; all other predictors get weight zero.
ldak.linux --thin thin --bfile human --window-prune .98 --window-kb 100
awk < thin.in '{print $1, 1}' > weights.thin

# 2. Calculate tagging
ldak.linux --calc-tagging LDAK-Thin --bfile human --weights weights.thin --power -.25 --window-cm 1 --save-matrix YES

To parallelize, calculate the taggings per chromosome, then join them:
ldak.linux --thin thin --bfile human --window-prune .98 --window-kb 100
awk < thin.in '{print $1, 1}' > weights.thin

for j in {1..22}; do
ldak.linux --calc-tagging LDAK-Thin$j --bfile human --weights weights.thin --power -.25 --window-cm 1 --chr $j
done

rm list.txt; for j in {1..22}; do echo "LDAK-Thin$j.tagging" >> list.txt; done
ldak.linux --join-tagging LDAK-Thin --taglist list.txt

To compute taggings for use with summary statistics that cover only a subset of SNPs, specify those SNPs with --regression-predictors:

ldak.linux --calc-tagging LDAK-Thin.gwas --bfile human --weights weights.thin --power -.25 --window-cm 1 --regression-predictors gwasSummarySNPs.txt

BLD-LDAK

The results will be saved in BLD-LDAK.tagging, and the heritability matrix will be saved in BLD-LDAK.matrix.

#Note we add --ignore-weights YES because the LDAK weightings enter as annotation bld65 (via --annotation-prefix)
ldak.linux --calc-tagging BLD-LDAK --bfile ref.AD --ignore-weights YES --power -.25 --window-cm 1 --annotation-number 65 --annotation-prefix bld --save-matrix YES

BLD-LDAK+Alpha

for j in {-20..10}; 
do
   #alpha runs from -1 to 0.5 in steps of 0.05 (alpha = j/20)
   alpha=`echo $j | awk '{print $1/20}'`; 
   echo $alpha
   ldak.linux --calc-tagging BLD-LDAK+Alpha.$j --bfile human --ignore-weights YES --power $alpha --window-cm 1 --annotation-number 65 --annotation-prefix bld
done
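The tutorial stops after creating the taggings; a minimal follow-up sketch (my own addition, assuming the tagging files above and the alz_kunkle2019.txt summary statistics prepared later in this post) fits each of the 31 models so their fits (loglSS, reported in the .extra files) can be compared:

for j in {-20..10}; do
ldak.linux --sum-hers BLD-LDAK+Alpha.$j --tagfile BLD-LDAK+Alpha.$j.tagging --summary alz_kunkle2019.txt --exclude alz.excl
done
#Inspect the 31 .extra files and keep the alpha with the best model fit
grep -H "" BLD-LDAK+Alpha.*.extra | less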

Estimate SNP heritability

SNPs with unusually large effect sizes may not be accommodated by the assumed heritability model, and thus may be treated separately. In Alzheimer's disease, APOE4 is a well-known large risk factor, and the most successful PRS models treat APOE and the other genetic risk factors separately, modelling APOE as a fixed effect.

SNP-heritability analyses in humans usually omit the ~9Mb extended MHC region on chromosome six because of:

  1. large effect sizes, particularly for immune-related traits;
  2. unusual patterns of LD in this region.

This can be achieved as follows:

#Get list of MHC SNPs (from reference panel), for hg19 genome structure
awk < ../h1000/hg19/ref.AD.bim '($1==6 && $4>25000000 && $4<34000000){print $2}' > mhc.snps

#Identify large-effect SNPs (those explaining more than 1% of phenotypic variance)
#This command uses the fact that the variance explained by each SNP is stat/(stat+n)
awk < alz_kunkle2019.txt '(NR>1 && $5>$6/99){print $1}' > alz.big
#Find SNPs tagging the alzheimers large-effect loci (within 1cM and correlation squared >.1)
ldak5.linux --remove-tags alz --bfile ../h1000/hg19/ref.AD --targets alz.big --window-cm 1 --min-cor .1

#Create exclusion files, containing mhc and (for alzheimers) SNPs tagging large-effect SNPs
cat mhc.snps alz.out > alz.excl
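#Quick check (my own addition): count how many SNPs end up excluded
wc -l mhc.snps alz.out alz.excl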

Calculate heritabilities for each SNP

To estimate SNP heritability, the first step is to obtain a tagging file.
The second step is to regress the (correctly formatted) Summary Statistics onto the tagging file.

The main argument is --sum-hers <outfile>

This requires the options

--tagfile <taggingfile> - to specify the tagging file.

--summary <sumsfile> - to specify the file containing the summary statistics.

LDAK will report an error if the summary statistics file does not provide summary statistics for all predictors in the tagging file. If a relatively small proportion of predictors are affected (say, less than 20%), it should be OK to override this error by adding --check-sums NO.

ldak.linux --sum-hers alz_bldldak --tagfile BLD-LDAK.tagging --summary alz_kunkle2019.txt --exclude alz.excl

#Using the BLD-LDAK model
## Different versions of the command
ldak.linux --sum-hers snpher1 --summary alz_kunkle2019.txt --tagfile BLD-LDAK.tagging
ldak.linux --sum-hers snpher2 --summary alz_kunkle2019.txt --tagfile BLD-LDAK.tagging --exclude alz.excl
ldak.linux --sum-hers snpher3 --summary alz_kunkle2019.txt --tagfile BLD-LDAK.tagging --cutoff 0.01
## Calculate predictor-level heritabilities (the output name differs so snpher3 is not overwritten)
ldak.linux --sum-hers snpher4 --summary alz_kunkle2019.txt --tagfile ../SumHer/BLD-LDAK.tagging --cutoff 0.01 --matrix ../SumHer/BLD-LDAK.matrix --check-sums NO

#Using the LDAK-Thin model
ldak.linux --sum-hers snpher-thin --summary alz_kunkle2019.txt --tagfile ../SumHer/LDAK-Thin.tagging

To compare the LDAK-Thin and BLD-LDAK Models, we can examine the model fits (measured by loglSS) in the files snpher.extra. First it is important to confirm that the null log likelihoods are the same (in this case, both -854), as this indicates that the two model fits were computed using the same predictors and predictor weightings (and thus ensures a fair comparison). Next, we see that the alternative likelihood for the one-parameter LDAK-Thin Model is -826 (i.e., 28 above the null), while the alternative likelihood for the 66-parameter BLD-LDAK Model is -813 (41 above the null).
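Since the two models have different numbers of parameters, the raw likelihood gains are not directly comparable. One standard yardstick (my own addition, not part of the LDAK output) is AIC, where smaller is better:

$\mathrm{AIC} = 2k - 2\log L$
$\mathrm{AIC}_{\text{LDAK-Thin}} = 2(1) - 2(-826) = 1654$
$\mathrm{AIC}_{\text{BLD-LDAK}} = 2(66) - 2(-813) = 1758$

By this criterion, the 65 extra parameters would need to improve the log likelihood by more than 65 to pay for themselves, whereas here they improve it by only 13.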

Wrap-up script for Alzheimer’s BLD-LDAK model

The script below is modified from Dougspeed.

# Step 1
#Tidy the alzheimers summary statistics into the format:
#Predictor(=Chr:Position) A1 A2 Direction Stat n
#If possible, compute a chi-squared test statistic for Stat (otherwise, could use P)
#For linear regression, can use (beta/SD)^2; for logistic regression, (log.odds/SD)^2
#The allele filter keeps only non-ambiguous SNPs (one allele in A/T, the other in C/G)
awk < Kunkle_etal_Stage1_results.txt '(NR==1){print "Predictor A1 A2 Direction Stat n"}\
(NR>1){snp=$1":"$2;a1=$4;a2=$5;dir=$6;stat=($6/$7)*($6/$7);n=73926;\
if((a1 ~ /^[AT]$/ && a2 ~ /^[CG]$/) || (a1 ~ /^[CG]$/ && a2 ~ /^[AT]$/)){print snp, a1, a2, dir, stat, n}}' > alz_kunkle2019.txt

#Check for duplicates using the unix functions sort and uniq
#If this command results in no output, it means all predictors are unique
awk < alz_kunkle2019.txt '{print $1}' | sort | uniq -d | head
# Or 
awk < alz_kunkle2019.txt 'seen[$1]++' | head

#If there are duplicates, we can remove them using
mv alz_kunkle2019.txt alz2.txt; awk '!seen[$1]++' alz2.txt > alz_kunkle2019.txt

# Step 1.1, get the ref.AD reference panel, based on the EUR population and the SNPs in alz_kunkle2019.txt (done in a previous blog post)

#Get list of MHC SNPs (from reference panel)
awk < ref.bim '($1==6 && $4>25000000 && $4<34000000){print $2}' > mhc.snps

#Identify large-effect SNPs (those explaining more than 1% of phenotypic variance)
#This command uses the fact that the variance explained by each SNP is stat/(stat+n)
awk < alz_kunkle2019.txt '(NR>1 && $5>$6/99){print $1}' > alz.big
#Find SNPs tagging the alzheimers large-effect loci (within 1cM and correlation squared >.1)
ldak5.linux --remove-tags alz --bfile ../h1000/hg19/ref.AD --top-preds alz.big --window-cm 1 --min-cor .1

#Create exclusion files, containing mhc and (for alzheimers) SNPs tagging large-effect SNPs
cat mhc.snps alz.out > alz.excl

# Step 2. Calculate taggings using the BLD-LDAK model
#Estimate SNP heritability. There are two steps: first we compute a tagfile, which contains q_j + sum_{l in N_j} q_l r^2_{jl} for each SNP; then we perform the regression to estimate the SNP heritability h2_SNP. To compute an LDAK tagfile, we must first compute LDAK weights; this can take a few hours, but can be efficiently parallelized.
## Step 2.1 Get bld0-64
#Download LDSC scores v2.1, baseline in LDSC folder and extract
cd ../SumHer/LDSC
mkdir BLD_LDAK
cd BLD_LDAK
#Remove any existing bld0 and base1-base74 files (-f suppresses errors if they do not exist)
rm -f bld0 base{1..74}
#Extract the LD annotations and append each annotation to its own base$i file
for j in {1..22}; 
do 
   gunzip -c ../1000G_Phase3_baselineLD_v2.1_ldscores/baselineLD.$j.annot.gz | awk '(NR>1){for(i=1;i<=74;i++){if($(5+i)!=0){print $1":"$2, $(5+i) >> "base"i}}print $1":"$2 >> "bld0"}'; 
   echo "chr$j finished"
done

#Rename base to bld, skipping the 10 MAF annotations (base59-base68)
for j in {1..58}; do mv base$j bld$j; done
for j in {59..64}; do mv base$((10+j)) bld$j; done

## Step 2.2 Calculate LDAK SNP weights
#here, we calculate weights separately for each chromosome then merge
#The on-screen instructions explain how to further parallelize (use --calc-weights not --calc-weights-all) 
cd ../..

ldak.linux --cut-weights sections --bfile ../h1000/hg19/ref.AD
for j in {1..22}; do
ldak.linux --calc-weights sections --bfile ../h1000/hg19/ref.AD --section $j
done
ldak.linux --join-weights sections --bfile ../h1000/hg19/ref.AD

cp sections/weights.short LDSC/BLD_LDAK/bld65

## 2.3 Calculate taggings, save weight matrix
ldak.linux --calc-tagging BLD-LDAK --bfile ../h1000/hg19/ref.AD --ignore-weights YES --power -.25 --window-cm 1 --annotation-number 65 --annotation-prefix LDSC/BLD_LDAK/bld --save-matrix YES

#Calculate the (1-part) LDAK Model tagfile
#ldak5.linux --calc-tagging alz_ldak --bfile ref --extract alz.txt --weights bld65 --power -.25 --window-cm 1

# Step 3. Calculate SNP heritabilies
#Perform the regression (remember to exclude the MHC / large-effect SNPs)
#Pay attention to the screen output (for this example it says I should add --check-sums NO)
cd ../kunkle2019
ldak.linux --sum-hers alz_bldldak --tagfile ../SumHer/BLD-LDAK.tagging --summary alz_kunkle2019.txt --cutoff 0.01 --matrix BLD-LDAK.matrix --check-sums NO

#SNP-level heritabilities are in alz_bldldak.ind.hers; merge them with the GWAS summary statistics
awk '(NR==FNR){h[$1]=$2;next;}(FNR==1){print $1,$2,$3,$4,$5,"h",$6;next;}(h[$1]){print $1,$2,$3,$4,$5,h[$1],$6}' alz_bldldak.ind.hers alz_kunkle2019.txt >alzHer_kunkle2019.txt
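#Quick check (my own addition): the merged file should have a header plus the new h column
head -3 alzHer_kunkle2019.txt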

#Estimates of SNP heritability will be in alz_bldldak.hers

# Other related analyses
#Estimate confounding bias. We recommend estimating the scaling factor C, but SumHer can also estimate the LDSC intercept 1+A. There is no need to compute a new tagfile; just add --genomic-control YES or --intercept YES when performing the regression.


#Estimate the scaling factor C
ldak.linux --sum-hers alz_bldldak.gcon --tagfile ../SumHer/BLD-LDAK.tagging --summary alz_kunkle2019.txt --exclude alz.excl --genomic-control YES
#The scaling factor will be in alz_bldldak.gcon.extra

#Estimate the intercept 1+A
ldak.linux --sum-hers alz_gcta.cept --tagfile ../SumHer/BLD-LDAK.tagging --summary alz_kunkle2019.txt --exclude alz.excl --intercept YES
#The intercept will be in alz_gcta.cept.extra