Taggings in heritability modeling
All applications of SumHer require a tagging file, which records the (relative) expected heritability tagged by each predictor. Recommendations from Dougspeed:
- Creating the tagging file requires a (well-matched) reference panel. Use an extensive panel (e.g., imputed or sequencing data), retaining SNPs with MAF above 0.005 and information score above 0.8.
- When calculating the tagging file, you must specify a Heritability Model.
- When analysing human data: use the BLD-LDAK Model to estimate SNP heritability or heritability enrichments (for pre-defined categories) from summary statistics; use the LDAK-Thin Model to estimate genetic correlations; use the LDAK-Thin Model to analyse individual-level data; use the BLD-LDAK+Alpha Model to estimate the selection-related parameter alpha.
- When analysing non-human data, always use the LDAK-Thin Model.
Different heritability models
In the additive model
$y_j=m+\sum_i{h_i z_{ij}} + \epsilon_j$ (1)
we have (assuming the phenotype and genotypes are standardized, so $Var(z_{ij})=1$):
$1 = Var(y_j) = \sum_i{Var(h_i z_{ij})} + Var(\epsilon_j) = \sum_i{h_i^2 Var(z_{ij})} + Var(\epsilon_j) = \sum_i{h_i^2} + Var(\epsilon_j)$
Most current GWAS methods also treat $h_i$ as a random variable: because there are far more SNPs than samples, they place a prior on $h_i$ and fit equation (1) using a Bayesian approach. Under this approach, the heritability (the variance contributed by genetics) can be expressed as:
$E(\sum_i{h_i^2}) = \sum_i{E(h_i^2)}$
For each $h_i$, we have:
$E(h_i^2) = (E(h_i))^2 + Var(h_i)$ (2)
In a GWAS, each $h_i$ is almost always small, and thus it is assumed:
$h_i \overset{\mathrm{iid}}{\sim} N(0,\sigma^2)$
Thus $E(h_i)=0$, and eq. (2) can be written as:
$E(h_i^2) = \sigma^2$ (3)
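Combining equations (1) and (3): under the uniform prior, the total expected heritability is simply proportional to the number of SNPs,

$E(\sum_i{h_i^2}) = \sum_i{E(h_i^2)} = M\sigma^2$

where $M$ is the number of SNPs, so each SNP is expected to contribute equally regardless of its MAF or LD.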
Many software packages, such as GCTA, LDSC, lassosum, and LDpred, use a uniform $\sigma^2$ model, where $\sigma^2$ is constant across all SNPs.
Other software, such as LDAK, assumes $\sigma_i^2$ is affected by MAF, LD, and information score, with the general form:
$\sigma_i^2$ $\propto$ $(p_i(1-p_i))^{1+\alpha}w_{i}r_{i}$ (4)
Here:
- $p_i$ is MAF of $SNP_i$;
- $w_i$ is a SNP specific weight that is a function of the inverse of the LD score of $SNP_i$, so SNPs in regions of low LD contribute more than those in high LD regions;
- $r_i \in [0,1]$ is an information score measuring genotype certainty, so that high-quality SNPs contribute more than low-quality ones.
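As a toy illustration (not part of the LDAK workflow), the awk snippet below evaluates the MAF-dependent factor $(p_i(1-p_i))^{1+\alpha}$ of equation (4) for a few MAF values, contrasting the GCTA ($\alpha=-1$) and LDAK ($\alpha=-0.25$) choices; under GCTA the factor is always 1, while under LDAK rarer SNPs are expected to contribute less. The MAF values are made up.

```shell
# Compare the MAF factor (p(1-p))^(1+alpha) of equation (4),
# ignoring the w_i and r_i terms.
for p in 0.01 0.1 0.5; do
  awk -v p=$p 'BEGIN{
    gcta=(p*(1-p))^(1+(-1));     # GCTA: exponent 0, so MAF has no effect
    ldak=(p*(1-p))^(1+(-0.25));  # LDAK: rarer SNPs get a smaller factor
    printf "MAF=%.2f GCTA=%.3f LDAK=%.3f\n", p, gcta, ldak}'
done
# e.g. MAF=0.50 gives GCTA=1.000 LDAK=0.354
```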
Model Comparisons
Model | $\alpha$ | $w_i$ | Comments | Implementation in the LDAK software |
---|---|---|---|---|
GCTA | -1 | 1 | Constant expected variance of $h_i$ across the genome | In LDAK, this model is achieved by adding the options --ignore-weights YES and --power -1 when Calculating Kinships or Calculating Taggings. |
LDAK | -0.25 | $w_i$ | Adds LD: for high-LD regions, $w_i$ should be smaller | In LDAK, this model is achieved by adding the options --weights <weightsfile> and --power -.25. |
LDAK-Thin | -0.25 | $I_iw_i$ | $I_i$ indicates whether the SNP remains after thinning for duplicate SNPs | In LDAK, the LDAK-Thin Model is achieved by adding the options --weights weights.thin and --power -.25 (see the thinning commands below). |
LDSC | -1 | $\sum_{j=1}^{74}{\tau_ja_{ji}} + \tau_{75}$ | The $a$ values can be derived from the LDSC website | 1. Download the Baseline LD annot.gz files from the LDSC website, then make files called baselineLD1, ..., baselineLD74 (where baselineLD$k has two columns, providing the SNP names then the values of Annotation k), the same as for the BLD-LDAK model below. 2. Calculate Taggings adding --annotation-number 74, --annotation-prefix baselineLD, --ignore-weights YES and --power -1. Equivalently, you could make an extra file called baselineLD75 that contains the names of all SNPs, then replace --annotation-number 74 and --annotation-prefix baselineLD with --partition-number 75 and --partition-prefix baselineLD. |
BLD-LDAK | -0.25 | $\sum_{j=1}^{64}{\tau_jb_{ji}} + \tau_{65}w_i+\tau_{66}$ | b1, b2, ..., b64 are the non-MAF annotations from the Baseline LD Model and $w_i$ is the LDAK weighting (computed using only high-quality SNPs). | 1. Download the files bld1, bld2, ..., bld64 from the BLD-LDAK Annotations. 2. Calculate the LDAK weightings from your reference panel and rename them bld65. 3. Calculate Taggings adding --annotation-number 65, --annotation-prefix bld, --ignore-weights YES and --power -0.25. |
BLD-LDAK+Alpha | -1 to 0.5 | $\sum_{j=1}^{64}{\tau_jb_{ji}} + \tau_{65}w_i+\tau_{66}$ | Same annotations as BLD-LDAK. The 67-parameter BLD-LDAK+Alpha Model generalizes the BLD-LDAK Model by allowing alpha to vary (instead of fixing it at -0.25). | 1. Get the 65 bld files as for the BLD-LDAK model. 2. Create 31 instances of the model, corresponding to alpha = -1, -0.95, ..., 0.45, 0.5. 3. For each alpha, calculate taggings adding --annotation-number 65, --annotation-prefix bld, --ignore-weights YES and --power $alpha. |
To get the BLD-LDAK annotations and weights (this part is adapted from Dougspeed's website):
1. Get the 64 annotations: download the folder 1000G_Phase3_baselineLD_v2.1_ldscores.tgz from https://data.broadinstitute.org/alkesgroup/LDSCORE. Within this folder, the .annot.gz files contain the 74 annotations of the Baseline LD Model. The BLD-LDAK Model uses Annotations 1-58 and 69-74 (the 10 MAF bins, Annotations 59-68, are excluded).
2. Extract all 74 annotations (plus Annotation 0, the base category) in the format chr:position annotation_value, and merge chr1-22 into one bld file per annotation.
3. Exclude the 10 MAF bins (Annotations 59-68).
Bash script for the above task of extracting the LDSC scores to obtain bld0-bld64:
#Remove any existing bld0 and base1~base74
rm -f bld0 base{1..74}
#Extract the LD annotations for each annotation and append into base$i
for j in {1..22};
do
gunzip -c baselineLD_v2.1/baselineLD.$j.annot.gz | awk '(NR>1){for(i=1;i<=74;i++){if($(5+i)!=0){print $1":"$2, $(5+i) >> "base"i}}print $1":"$2 >> "bld0"}';
done
#Rename base to bld (Annotations 59-68, the MAF bins, are dropped)
for j in {1..58}; do mv base$j bld$j; done
for j in {59..64}; do mv base$((10+j)) bld$j; done
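To sanity-check the extraction logic, the snippet below runs the same awk program on a tiny made-up annot file (two SNPs and 3 annotations instead of 74); the toy file names are illustrative only.

```shell
# Demo of the extraction awk on a toy annot file
# (columns: CHR BP SNP CM base A1 A2 A3)
rm -f toy_bld0 toy_base{1..3} toy.annot
printf 'CHR BP SNP CM base A1 A2 A3\n1 100 rs1 0 1 0 2 5\n1 200 rs2 0 1 3 0 0\n' > toy.annot
awk '(NR>1){for(i=1;i<=3;i++){if($(5+i)!=0){print $1":"$2, $(5+i) >> "toy_base"i}}print $1":"$2 >> "toy_bld0"}' toy.annot
cat toy_base1    # prints "1:200 3" (only rs2 has a non-zero value for annotation 1)
wc -l < toy_bld0 # 2 lines: every SNP goes into the base category
```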
Bash script to get bld65 (the LDAK weightings):
#Calculate LDAK SNP weights; here, we calculate weights separately for each chromosome then merge
#The on-screen instructions explain how to further parallelize (use --calc-weights not --calc-weights-all)
for j in {1..22}; do
ldak.linux --cut-weights alz_chr$j --bfile AD.ref --extract alz.txt --chr $j
ldak.linux --calc-weights-all alz_chr$j --bfile AD.ref --extract alz.txt --chr $j
done
#Merge weights across chromosomes
cat alz_chr{1..22}/weights.short > bld65
An alternative version, cutting the genome into sections rather than chromosomes:
ldak.linux --cut-weights sections --bfile AD.ref --extract alz.txt
for j in {1..22}; do
ldak.linux --calc-weights sections --bfile AD.ref --section $j --extract alz.txt
done
ldak.linux --join-weights sections --bfile AD.ref
cp sections/weights.short bld65
# Alternatively, we could just run all chromosomes as one
ldak.linux --cut-weights sections --bfile AD.ref --extract alz.txt
ldak.linux --calc-weights-all sections --bfile AD.ref --extract alz.txt
cp sections/weights.short bld65
#The weightings will be saved in <folder>/weights.short; this contains only predictors with non-zero weightings (as predictors with zero weighting can be ignored). A more detailed version will be saved in <folder>/weights.all; for each predictor, this reports the weighting, the number of neighbours (how many predictors were within the window), the tagging (the sum of squared-correlations with predictors within the window), the information score of the predictor and a "check" (the total tagging of the predictor after weighting; if the weightings are accurate, this value should match the information score).
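Quick format checks on the weights output can catch problems early. This is my own addition, assuming weights.short is a two-column file (predictor then weighting) as described above; point the path at your own output folder.

```shell
# Check weights.short is two columns with positive weightings
# (sections/weights.short is an example path)
awk 'NF!=2{bad++}($2+0<=0){nonpos++}END{
  printf "lines:%d malformed:%d non-positive:%d\n", NR, bad+0, nonpos+0}' sections/weights.short
```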
Calculating taggings
Below are the tagging calculations in the LDAK software under the different heritability models. This section mainly follows Dougspeed.
The main argument is --calc-tagging <outfile>.
This requires the options:
- --bfile/--gen/--sp/--speed to specify the genotype data (the reference panel);
- --weights <weightsfile> or --ignore-weights YES;
- --power <float>;
- --window-cm <float> (we suggest --window-cm 1).
You can use --keep and/or --extract to restrict the analysis to subsets of samples and predictors.
To provide SNP annotations, you should use either --partition-number <integer> with --partition-prefix <prefix>, or --annotation-number <integer> with --annotation-prefix <prefix>.
To save the heritability matrix, add --save-matrix YES (necessary if performing Prediction).
To parallelize this process, you can add --chr <integer> to analyse one chromosome at a time, then combine the results with --join-tagging.
GCTA / LDSC
The simplest model is GCTA. LDSC also assumes --power -1; to implement the LDSC model in LDAK, supply the 74 baselineLD annotations via --annotation-number 74 --annotation-prefix baselineLD (see the LDSC row of the table above).
ldak.linux --calc-tagging GCTA --bfile human --ignore-weights YES --power -1 --window-cm 1
LDAK
Here, we need --weights bld65, where 'bld65' is the file of LDAK weights computed in the BLD-LDAK steps above.
ldak.linux --calc-tagging LDAK --bfile human --weights bld65 --power -.25 --window-cm 1
LDAK-Thin
The taggings are saved in LDAK-Thin.tagging.
# 1. Create a weights file that gives weighting one to the predictors that remain after thinning for duplicates
# Predictors that survive thinning get weighting one in weights.thin; all other predictors get weight zero
ldak.linux --thin thin --bfile human --window-prune .98 --window-kb 100
awk < thin.in '{print $1, 1}' > weights.thin
# 2. Calculate tagging
ldak.linux --calc-tagging LDAK-Thin --bfile human --weights weights.thin --power -.25 --window-cm 1 --save-matrix YES
#The taggings are saved in LDAK-Thin.tagging.
To parallelize across chromosomes, calculate the taggings per chromosome and then join them:
ldak.linux --thin thin --bfile human --window-prune .98 --window-kb 100
awk < thin.in '{print $1, 1}' > weights.thin
for j in {1..22}; do
ldak.linux --calc-tagging LDAK-Thin$j --bfile human --weights weights.thin --power -.25 --window-cm 1 --chr $j
done
rm -f list.txt; for j in {1..22}; do echo "LDAK-Thin$j.tagging" >> list.txt; done
ldak.linux --join-tagging LDAK-Thin --taglist list.txt
A variant restricting the taggings to the SNPs present in the GWAS summary statistics, via --regression-predictors:
ldak.out --calc-tagging LDAK-Thin.gwas --bfile human --weights weights.thin --power -.25 --window-cm 1 --regression-predictors gwasSummarySNPs.txt
BLD-LDAK
The results will be saved in BLD-LDAK.tagging, and the heritability matrix will be saved in BLD-LDAK.matrix.
#Note: here we use --ignore-weights YES; the LDAK weights enter instead as the 65th annotation (bld65) via --annotation-prefix bld
ldak.linux --calc-tagging BLD-LDAK --bfile ref.AD --ignore-weights YES --power -.25 --window-cm 1 --annotation-number 65 --annotation-prefix bld --save-matrix YES
BLD-LDAK+Alpha
for j in {-20..10};
do
alpha=`echo $j | awk '{print $1/20}'`;  # alpha runs from -1 to 0.5 in steps of 0.05
echo $alpha
ldak.linux --calc-tagging BLD-LDAK+Alpha.$j --bfile human --ignore-weights YES --power $alpha --window-cm 1 --annotation-number 65 --annotation-prefix bld
done
Estimate SNP heritability
SNPs with unusually large effect sizes may not be accommodated by the assumed heritability model and may therefore be treated separately. In Alzheimer's disease, APOE4 is a well-known large risk factor, and the most successful PRS models are those that treat APOE and the remaining genetic risk factors separately, modelling APOE as a fixed effect.
SNP-heritability analyses in humans usually omit the ~9Mb extended MHC region on chromosome 6 because of:
- large effect sizes, particularly for immune-related traits;
- unusual patterns of LD in this region.
This could be achieved by:
#Get list of MHC SNPs (from reference panel), for hg19 genome structure
awk < ../h1000/hg19/ref.AD.bim '($1==6 && $4>25000000 && $4<34000000){print $2}' > mhc.snps
#Identify large-effect SNPs (those explaining more than 1% of phenotypic variance)
#This command uses the fact that the variance explained by each SNP is stat/(stat+n)
awk < alz_kunkle2019.txt '(NR>1 && $5>$6/99){print $1}' > alz.big
#Find SNPs tagging the alzheimers large-effect loci (within 1cM and correlation squared >.1)
ldak5.linux --remove-tags alz --bfile ../h1000/hg19/ref.AD --targets alz.big --window-cm 1 --min-cor .1
#Create exclusion files, containing mhc and (for alzheimers) SNPs tagging large-effect SNPs
cat mhc.snps alz.out > alz.excl
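The exclusion threshold above can be checked numerically: the variance explained by a SNP is approximately stat/(stat+n), and the condition stat > n/99 is the 1% rule rearranged. The chi-squared values below are made up for illustration.

```shell
# Variance explained ~ stat/(stat+n); n=73926 is the Kunkle sample size used above
for stat in 100 800 2000; do
  awk -v s=$stat -v n=73926 'BEGIN{printf "stat=%d -> %.4f\n", s, s/(s+n)}'
done
# stat=800 -> 0.0107, i.e. just over the 1% cutoff (n/99 = 746.7)
```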
Previously, we recommended first using --remove-tags <outfile> to identify predictors tagging loci that explain more than 1% of phenotypic variance, then excluding these using --extract <extractfile> and/or --exclude <excludefile>. However, this can now be done more easily using the option --cutoff <float>; for example, to remove predictors that explain more than 1% of phenotypic variance, add --cutoff 0.01 (note that this will not also remove predictors tagging the large-effect loci, but in practice we find this makes little difference). Whether the MHC region needs to be excluded is still debatable, so we could try both models.
Calculate heritabilities for each SNP
To estimate SNP heritability, the first step is to obtain a tagging file.
The second step is to regress (correctly-formatted) Summary Statistics onto the tagging file.
The main argument is --sum-hers <outfile>.
This requires the options:
- --tagfile <taggingfile> to specify the tagging file;
- --summary <sumsfile> to specify the file containing the summary statistics.
LDAK will report an error if the summary statistics file does not provide summary statistics for all predictors in the tagging file. If a relatively small proportion of predictors are affected (say, less than 20%), it should be OK to override this error by adding --check-sums NO.
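Before overriding with --check-sums NO, it can help to measure the overlap directly. The snippet below is a generic sketch using the file names from this walkthrough; it assumes predictor names sit in column 1 of both files and that each has a header row (the tagging file's trailing summary rows will be counted too, so treat the result as approximate).

```shell
# Fraction of tagging-file predictors with no summary statistic
awk '(NR==FNR){seen[$1]=1;next}\
(FNR>1){tot++; if(!seen[$1]) miss++}\
END{printf "missing %d of %d (%.1f%%)\n", miss+0, tot, 100*miss/tot}' \
  alz_kunkle2019.txt BLD-LDAK.tagging
```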
ldak.linux --sum-hers alz_bldldak --tagfile BLD-LDAK.tagging --summary alz_kunkle2019.txt --exclude alz.excl
#Using BLD-LDAK model
## Different version of commands
ldak.out --sum-hers snpher1 --summary alz_kunkle2019.txt --tagfile BLD-LDAK.tagging
ldak.out --sum-hers snpher2 --summary alz_kunkle2019.txt --tagfile BLD-LDAK.tagging --exclude alz.excl
ldak.out --sum-hers snpher3 --summary alz_kunkle2019.txt --tagfile BLD-LDAK.tagging --cutoff 0.01
## Calculate predictor level heritabilities
ldak.out --sum-hers snpher3 --summary alz_kunkle2019.txt --tagfile ../SumHer/BLD-LDAK.tagging --cutoff 0.01 --matrix ../SumHer/BLD-LDAK.matrix --check-sums NO
#Using LDAK-Thin model
ldak.out --sum-hers snpher-thin --summary alz_kunkle2019.txt --tagfile ../SumHer/LDAK-Thin.tagging
To compare the LDAK-Thin and BLD-LDAK Models, we can examine the model fits (measured by loglSS) in the snpher.extra files. First it is important to confirm that the null log likelihoods are the same (in this case, both -854), as this indicates that the two model fits were computed using the same predictors and predictor weightings (and thus ensures a fair comparison). Next, we see that the alternative likelihood for the one-parameter LDAK-Thin Model is -826 (i.e., 28 above the null), while the alternative likelihood for the 66-parameter BLD-LDAK Model is 41 above the null.
Wrap-up script for Alzheimer’s BLD-LDAK model
The script below is modified from Dougspeed's.
# Step 1
#Tidy alzheimers summary statistics - Get the format of:
#Chr:Position A1 A2 dir STAT N
#If possible, compute a chi-squared test statistic for STAT (otherwise, could use P)
#For linear regression, can use (beta/SD)^2, for logistic regression, (log.odds/SD)^2
#Keep only non-ambiguous SNPs (exclude A/T and C/G pairs); predictor names are Chr:Position
awk < Kunkle_etal_Stage1_results.txt '(NR==1){print "Predictor A1 A2 Direction Stat n"}\
(NR>1){snp=$1":"$2; a1=$4; a2=$5; dir=$6; stat=($6/$7)*($6/$7); n=73926;\
if((a1 ~ /^[AT]$/ && a2 ~ /^[CG]$/) || (a1 ~ /^[CG]$/ && a2 ~ /^[AT]$/)) print snp, a1, a2, dir, stat, n}' > alz_kunkle2019.txt
#Check for duplicates using the unix functions sort and uniq
#If this command results in no output, it means all predictors are unique
awk < alz_kunkle2019.txt '{print $1}' | sort | uniq -d | head
# Or
awk < alz_kunkle2019.txt 'seen[$1]++' | head
#If there are duplicates, we can remove them using
mv alz_kunkle2019.txt alz2.txt; awk '!seen[$1]++' alz2.txt > alz_kunkle2019.txt
# Step 1.1, get ref.AD reference based on EUR population and SNPs in alz_kunkle2019.txt. (Finished in previous blog)
#Get list of MHC SNPs (from reference panel)
awk < ref.bim '($1==6 && $4>25000000 && $4<34000000){print $2}' > mhc.snps
#Identify large-effect SNPs (those explaining more than 1% of phenotypic variance)
#This command uses the fact that the variance explained by each SNP is stat/(stat+n)
awk < alz_kunkle2019.txt '(NR>1 && $5>$6/99){print $1}' > alz.big
#Find SNPs tagging the alzheimers large-effect loci (within 1cM and correlation squared >.1)
ldak5.linux --remove-tags alz --bfile ../h1000/hg19/ref.AD --top-preds alz.big --window-cm 1 --min-cor .1
#Create exclusion files, containing mhc and (for alzheimers) SNPs tagging large-effect SNPs
cat mhc.snps alz.out > alz.excl
# Step 2. Calculate taggings using BLD-LDAK model
#Estimate SNP heritability. There are two steps: first we compute a (1-part) tagfile, which contains q_j + sum_{l in N_j} q_l*r^2_{jl} for each SNP; then we perform the regression to estimate the SNP heritability. To compute an LDAK tagfile, we must first compute the LDAK weights; this can take a few hours, but can be efficiently parallelized.
## Step 2.1 Get bld0-64
#Download the baselineLD v2.1 LDSC scores into the LDSC folder and extract them
cd ../SumHer/LDSC
mkdir BLD_LDAK
cd BLD_LDAK
#Remove any existing bld0 and base1~base74
rm -f bld0 base{1..74}
#Extract the LD annotations for each annotation and append into base$i
for j in {1..22};
do
gunzip -c ../1000G_Phase3_baselineLD_v2.1_ldscores/baselineLD.$j.annot.gz | awk '(NR>1){for(i=1;i<=74;i++){if($(5+i)!=0){print $1":"$2, $(5+i) >> "base"i}}print $1":"$2 >> "bld0"}';
echo "chr$j finished"
done
done
#Rename base to bld
for j in {1..58}; do mv base$j bld$j; done
for j in {59..64}; do mv base$((10+j)) bld$j; done
## Step 2.2 Calculate LDAK SNP weights
#here, we calculate weights separately for each chromosome then merge
#The on-screen instructions explain how to further parallelize (use --calc-weights not --calc-weights-all)
cd ../..
ldak.linux --cut-weights sections --bfile ../h1000/hg19/ref.AD
for j in {1..22}; do
ldak.linux --calc-weights sections --bfile ../h1000/hg19/ref.AD --section $j
done
ldak.linux --join-weights sections --bfile ../h1000/hg19/ref.AD
cp sections/weights.short LDSC/BLD_LDAK/bld65
## Step 2.3 Calculate taggings, save the heritability matrix
ldak.linux --calc-tagging BLD-LDAK --bfile ../h1000/hg19/ref.AD --ignore-weights YES --power -.25 --window-cm 1 --annotation-number 65 --annotation-prefix LDSC/BLD_LDAK/bld --save-matrix YES
#Calculate the (1-part) LDAK Model tagfile
#ldak5.linux --calc-tagging alz_ldak --bfile ref --extract alz.txt --weights bld65 --power -.25 --window-cm 1
# Step 3. Calculate SNP heritabilities
#Perform the regression (remember to exclude the MHC / large-effect SNPs)
#Pay attention to the screen output (for this example it says I should add --check-sums NO)
cd ../kunkle2019
ldak.linux --sum-hers alz_bldldak --tagfile ../SumHer/BLD-LDAK.tagging --summary alz_kunkle2019.txt --cutoff 0.01 --matrix BLD-LDAK.matrix --check-sums NO
#SNP-level heritabilities are in alz_bldldak.ind.hers; merge them with the GWAS summary statistics
awk '(NR==FNR){h[$1]=$2;next;}(FNR==1){print $1,$2,$3,$4,$5,"h",$6;next;}(h[$1]){print $1,$2,$3,$4,$5,h[$1],$6}' alz_bldldak.ind.hers alz_kunkle2019.txt >alzHer_kunkle2019.txt
#Estimates of SNP heritability will be in alz_bldldak.hers
# Other related
#Estimate confounding bias. We recommend estimating the scaling factor C, but SumHer can also estimate the LDSC intercept 1+A. There is no need to compute a new tagfile; just add --genomic-control YES or --intercept YES when performing the regression.
#Estimate the scaling factor C
ldak.linux --sum-hers alz_bldldak.gcon --tagfile ../SumHer/BLD-LDAK.tagging --summary alz_kunkle2019.txt --exclude alz.excl --genomic-control YES
#The scaling factor will be in alz_bldldak.gcon.extra
#Estimate the intercept 1+A
ldak.linux --sum-hers alz_gcta.cept --tagfile ../SumHer/BLD-LDAK.tagging --summary alz_kunkle2019.txt --exclude alz.excl --intercept YES
#The intercept will be in alz_gcta.cept.extra