君子博学而日参省乎己,则知明而行无过矣
This is part 2 of PRS score calculation blog. The first part focuses on QC in base and target GWAS
Required files
Before calculating PRS scores, the following files are required:
File Name | Description |
---|---|
Height.QC.gz | The post-QCed summary Statistic |
EUR.QC.bed | The genotype file after performing some basic filtering |
EUR.QC.bim | This file contains the SNPs that passed the basic filtering |
EUR.QC.fam | This file contains the samples that passed the basic filtering |
EUR.height | This file contains the phenotype of the samples |
EUR.cov | This file contains the covariates of the samples |
Update Effect Size
When the effect size relates to disease risk and is thus given as an odds ratio (OR), rather than BETA (for continuous traits), then the PRS is computed as a product of ORs. To simplify this calculation, we take the natural logarithm of the OR so that the PRS can be computed using summation instead (which can be back-transformed afterwards).
1
2
3
4
library(data.table)
dat <- fread("Height.QC.gz")
fwrite(dat[,BETA:=log(OR)], "Height.QC.Transformed", sep="\t")
q() # exit R
Clumping
Linkage disequilibrium, which corresponds to the correlation between the genotypes of genetic variants across the genome, makes identifying the contribution from causal independent genetic variants extremely challenging. One way of approximately capturing the right level of causal signal is to perform clumping, which removes SNPs in ways that only weakly correlated SNPs are retained but preferentially retaining the SNPs most associated with the phenotype under study. Clumping can be performed using the following command in plink:
1
2
3
4
5
6
7
8
9
plink \
--bfile EUR.QC \
--clump-p1 1 \
--clump-r2 0.1 \
--clump-kb 250 \
--clump Height.QC.Transformed \
--clump-snp-field SNP \
--clump-field P \
--out EUR
parameters used above:
Parameter | Value | Description |
---|---|---|
clump-p1 | 1 | P-value threshold for a SNP to be included as an index SNP. 1 is selected such that all SNPs are include for clumping |
clump-r2 | 0.1 | SNPs having $r^2$ higher than 0.1 with the index SNPs will be removed |
clump-kb | 250 | SNPs within 250k of the index SNP are considered for clumping |
clump | Height.QC.Transformed | Base data (summary statistic) file containing the P-value information |
clump-snp-field | SNP | Specifies that the column SNP contains the SNP IDs |
clump-field | P | Specifies that the column P contains the P-value information |
The EUR.clumped files contain the index SNPs, which could be extract:
1
awk 'NR!=1{print $3}' EUR.clumped >EUR.valid.snp
Generate PRS using plink
plink provides a convenient function –score and –q-score-range for calculating polygenic scores. Three files are required here:
- The base data file: Height.QC.Transformed
- A file containing SNP IDs and their corresponding P-values
1
awk '{print $3,$8}' Height.QC.Transformed > SNP.pvalue
- A file containing the different P-value thresholds for inclusion of SNPs in the PRS.
1
2
3
4
5
6
7
echo "0.001 0 0.001" > range_list
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list
The format of range_list is:
1
Name_of_Threshold Lower_bound Upper_Bound
calculating PRS using plink
1
2
3
4
5
6
plink \
--bfile EUR.QC \
--score Height.QC.Transformed 3 4 12 header \
--q-score-range range_list SNP.pvalue \
--extract EUR.valid.snp \
--out EUR
Parameter | Value | Description |
score | Height.QC.Transformed 3 4 12 header | We read from the Height.QC.Transformed file, assuming that the 3st column is the SNP ID; 4th column is the effective allele information; the 12th column is the effect size estimate; and that the file contains a header |
q-score-range | range_list SNP.pvalue | We want to calculate PRS based on the thresholds defined in range_list, where the threshold values (P-values) were stored in SNP.pvalue |
The above command and range_list will generate 7 files:
- EUR.0.5.profile
- EUR.0.4.profile
- EUR.0.3.profile
- EUR.0.2.profile
- EUR.0.1.profile
- EUR.0.05.profile
- EUR.0.001.profile
$$ PRS_j =\frac{ \sum_i^NS_i*G_{ij}}{P*M_j} $$
Where the effect size of SNP $i$ is $S_i$; the number of effect alleles observed in sample $j$ is $G_{ij}$; the ploidy of the sample is $P$ (is genrally 2 for human); the total number of SNPs included in the PRS is N;and the number of non-missing SNPs observed in sample $j$ is $M_j$. If the sample has a missing genotype for SNP $i$, then the population minor allele frequency multiplied by the ploidy ($MAF_i * P$) is used instead of $G_{ij}$.
Accounting for Population Stratification
Population structure is the principal source of confounding in GWAS and is usually accounted for by incorporating principal components (PCs) as covariates. We can incorporate PCs into our PRS analysis to account for population stratification.
1
2
3
4
5
6
7
8
9
10
11
# First, we need to perform prunning
plink \
--bfile EUR.QC \
--indep-pairwise 200 50 0.25 \
--out EUR
# Then we calculate the first 6 PCs
plink \
--bfile EUR.QC \
--extract EUR.prune.in \
--pca 6 \
--out EUR
Here the PCs have been stored in the EUR.eigenvec file and can be used as covariates in the regression model to account for population stratification, which would be used in PRSice.
Finding the “best-fit” PRS
The P-value threshold that provides the “best-fit” PRS under the C+T method is usually unknown. To approximate the “best-fit” PRS, we can perform a regression between PRS calculated at a range of P-value thresholds and then select the PRS that explains the highest phenotypic variance.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
# Read in the phenotype file
phenotype <- read.table("EUR.height", header=T)
# Read in the PCs
pcs <- read.table("EUR.eigenvec", header=F)
# The default output from plink does not include a header
# To make things simple, we will add the appropriate headers
# (1:6 because there are 6 PCs)
colnames(pcs) <- c("FID", "IID", paste0("PC",1:6))
# Read in the covariates (here, it is sex)
covariate <- read.table("EUR.cov", header=T)
# Now merge the files
pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))
# We can then calculate the null model (model with PRS) using a linear regression
# (as height is quantitative)
null.model <- lm(Height~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])
# And the R2 of the null model is
null.r2 <- summary(null.model)$r.squared
prs.result <- NULL
for(i in p.threshold){
# Go through each p-value threshold
prs <- read.table(paste0("EUR.",i,".profile"), header=T)
# Merge the prs with the phenotype matrix
# We only want the FID, IID and PRS from the PRS file, therefore we only select the
# relevant columns
pheno.prs <- merge(pheno, prs[,c("FID","IID", "SCORE")], by=c("FID", "IID"))
# Now perform a linear regression on Height with PRS and the covariates
# ignoring the FID and IID from our model
model <- lm(Height~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")])
# model R2 is obtained as
model.r2 <- summary(model)$r.squared
# R2 of PRS is simply calculated as the model R2 minus the null R2
prs.r2 <- model.r2-null.r2
# We can also obtain the coeffcient and p-value of association of PRS as follow
prs.coef <- summary(model)$coeff["SCORE",]
prs.beta <- as.numeric(prs.coef[1])
prs.se <- as.numeric(prs.coef[2])
prs.p <- as.numeric(prs.coef[4])
# We can then store the results
prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta,SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]
q() # exit R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
library(data.table)
library(magrittr)
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
phenotype <- fread("EUR.height")
pcs <- fread("EUR.eigenvec", header=F) %>%
setnames(., colnames(.), c("FID", "IID", paste0("PC",1:6)) )
covariate <- fread("EUR.cov")
pheno <- merge(phenotype, covariate) %>%
merge(., pcs)
null.r2 <- summary(lm(Height~., data=pheno[,-c("FID", "IID")]))$r.squared
prs.result <- NULL
for(i in p.threshold){
pheno.prs <- paste0("EUR.", i, ".profile") %>%
fread(.) %>%
.[,c("FID", "IID", "SCORE")] %>%
merge(., pheno, by=c("FID", "IID"))
model <- lm(Height~., data=pheno.prs[,-c("FID","IID")]) %>%
summary
model.r2 <- model$r.squared
prs.r2 <- model.r2-null.r2
prs.coef <- model$coeff["SCORE",]
prs.result %<>% rbind(.,
data.frame(Threshold=i, R2=prs.r2,
P=as.numeric(prs.coef[4]),
BETA=as.numeric(prs.coef[1]),
SE=as.numeric(prs.coef[2])))
}
print(prs.result[which.max(prs.result$R2),])
q() # exit R