君子和而不同
This is the third part of the PRS blog, calculating PRS scores using PRSice2 and lassosum.
The formula to calculate PRS score is given in "PRS score 2" blog: $$ PRS_j =\frac{ \sum_i^NS_i*G_{ij}}{P*M_j} $$ Here, the effect size of SNP $i$ is $S_i$; the number of effect alleles observed in sample $j$ is $G_{ij}$; the ploidy of the sample is $P$ (is genrally 2 for human); the total number of SNPs included in the PRS is N;and the number of non-missing SNPs observed in sample $j$ is $M_j$. If the sample has a missing genotype for SNP $i$, then the population minor allele frequency multiplied by the ploidy ($MAF_i * P$) is used instead of $G_{ij}$.
where 'a hyperparameter'(pvalue threshold) that determine the set of SNPs to be included when calculating PRS scores.
This hyperparameter is tuned by selecting the largest $R^2$ diff between Full regression model and the reduced model.
Full Model: $$ phenotypeTrail \backsim PRS + COV $$
Reduced Model: $$ phenotypeTrail \backsim COV $$
Required Files
File Name | Description |
---|---|
Height.QC.gz | The post QC base data file. While PRSice-2 can automatically apply most filtering on the base file, it cannot remove duplicated SNPs |
EUR.QC.bed | This file contains the genotype data that passed the QC steps |
EUR.QC.bim | This file contains the list of SNPs that passed the QC steps |
EUR.QC.fam | This file contains the samples that passed the QC steps |
EUR.height | This file contains the phenotype data of the samples |
EUR.cov | This file contains the covariates of the samples |
EUR.eigenvec | This file contains the principal components (PCs) of the samples |
Running PRS analysis
To run PRSice-2, we need merge EUR.cov and EUR.eigenvec.
1
2
3
4
5
6
7
library(data.table)
covariate <- fread("EUR.cov")
pcs <- fread("EUR.eigenvec", header=F)
colnames(pcs) <- c("FID","IID", paste0("PC",1:6))
cov <- merge(covariate, pcs)
fwrite(cov,"EUR.covariate", sep="\t")
q()
Then, PRS could calculated through PRSice
1
2
3
4
5
6
7
8
9
10
11
12
Rscript ~/softwares/PRSice_mac/PRSice.R \
--prsice ~/softwares/PRSice_mac/PRSice_mac \
--base Height.QC.gz \
--target EUR.QC \
--binary-target F \
--pheno EUR.height \
--cov EUR.covariate \
--base-maf MAF:0.01 \
--base-info INFO:0.8 \
--stat OR \
--or \
--out EUR
This will automatically perform “high-resolution scoring” and generate the “best-fit” PRS (in EUR.best), with associated plots of the results.
Tables of parameters
Parameter | Value | Description |
---|---|---|
prsice | PRSice_xxx | Informs PRSice.R that the location of the PRSice binary |
base | Height.QC.gz | Informs PRSice that the name of the GWAS summary statistic |
target | EUR.QC | Informs PRSice that the input genotype files should have a prefix of EUR.QC |
binary-target | F | Indicate if the phenotype of interest is a binary trait. F for no, T for yes |
pheno | EUR.height | Provide PRSice with the phenotype file |
cov | EUR.covariate | Provide PRSice with the covariate file |
base-maf | MAF:0.01 | Filter out SNPs with MAF < 0.01 in the GWAS summary statistics, using information in the MAF column |
base-info | INFO:0.8 | Filter out SNPs with INFO < 0.8 in the GWAS summary statistics, using information in the INFO column |
stat | OR | Column name of the column containing the effect size |
or | - | Inform PRSice that the effect size is an Odd Ratio |
out | EUR | Informs PRSice that all output should have a prefix of EUR |
Using LDpred-2
LDpred-2 is an R package that uses a Bayesian approach to polygenic risk scoring. The tutorial along with LDpred-2 is also worthing trying. Based on their suggestions, “In practice, until we find a better set of variants, we recommend using the HapMap3 variants used in the PRS-CS and LDpred2 papers. If you do not have enough data to use as LD reference (e.g. at least 2000 individuals)”, you could use one a good LD references prepared by them. This part is too complicated for my current PRS applications, so… just ignore it for now.
Calculate PRS using lassosum
lassosum is an R package that uses penalised regression (LASSO) in its approach to PRS calculation.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#install.packages(c("devtools","RcppArmadillo", "data.table", "Matrix"), dependencies=TRUE)
#library(devtools)
#install_github("tshmak/lassosum")
library(lassosum)
# Prefer to work with data.table as it speeds up file reading
library(data.table)
library(methods)
library(magrittr)
# For multi-threading, you can use the parallel package and
# invoke cl which is then passed to lassosum.pipeline
library(parallel)
# This will invoke 2 threads.
cl <- makeCluster(2)
sum.stat <- "Height.QC.gz"
bfile <- "EUR.QC"
# Read in and process the covariates
covariate <- fread("EUR.cov")
pcs <- fread("EUR.eigenvec") %>%
setnames(., colnames(.), c("FID","IID", paste0("PC",1:6)))
# Need as.data.frame here as lassosum doesn't handle data.table
# covariates very well
cov <- merge(covariate, pcs)
# We will need the EUR.hg19 file provided by lassosum
# which are LD regions defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
ld.file <- "EUR.hg19"
# output prefix
prefix <- "EUR"
# Read in the target phenotype file
target.pheno <- fread("EUR.height")[,c("FID", "IID", "Height")]
# Read in the summary statistics
ss <- fread(sum.stat)
# Remove P-value = 0, which causes problem in the transformation
ss <- ss[!P == 0]
# Transform the P-values into correlation
cor <- p2cor(p = ss$P,
n = ss$N,
sign = log(ss$OR)
)
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
# Run the lassosum pipeline
# The cluster parameter is used for multi-threading
# You can ignore that if you do not wish to perform multi-threaded processing
out <- lassosum.pipeline(
cor = cor,
chr = ss$CHR,
pos = ss$BP,
A1 = ss$A1,
A2 = ss$A2,
ref.bfile = bfile,
test.bfile = bfile,
LDblocks = ld.file,
cluster=cl
)
# Store the R2 results
target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
# Get the maximum R2
r2 <- max(target.res$validation.table$value)^2