PRS score (3)

using PRSice-2, lassosum

Posted by Yuan on May 31, 2022

君子和而不同

This is the third part of the PRS blog, calculating PRS scores using PRSice2 and lassosum.

Required Files

File Name Description
Height.QC.gz The post QC base data file. While PRSice-2 can automatically apply most filtering on the base file, it cannot remove duplicated SNPs
EUR.QC.bed This file contains the genotype data that passed the QC steps
EUR.QC.bim This file contains the list of SNPs that passed the QC steps
EUR.QC.fam This file contains the samples that passed the QC steps
EUR.height This file contains the phenotype data of the samples
EUR.cov This file contains the covariates of the samples
EUR.eigenvec This file contains the principal components (PCs) of the samples

Running PRS analysis

To run PRSice-2, we need merge EUR.cov and EUR.eigenvec.

1
2
3
4
5
6
7
library(data.table)
covariate <- fread("EUR.cov")
pcs <- fread("EUR.eigenvec", header=F)
colnames(pcs) <- c("FID","IID", paste0("PC",1:6))
cov <- merge(covariate, pcs)
fwrite(cov,"EUR.covariate", sep="\t")
q()

Then, PRS could calculated through PRSice

1
2
3
4
5
6
7
8
9
10
11
12
Rscript ~/softwares/PRSice_mac/PRSice.R \
    --prsice ~/softwares/PRSice_mac/PRSice_mac \
    --base Height.QC.gz \
    --target EUR.QC \
    --binary-target F \
    --pheno EUR.height \
    --cov EUR.covariate \
    --base-maf MAF:0.01 \
    --base-info INFO:0.8 \
    --stat OR \
    --or \
    --out EUR

This will automatically perform “high-resolution scoring” and generate the “best-fit” PRS (in EUR.best), with associated plots of the results.

Tables of parameters

Parameter Value Description
prsice PRSice_xxx Informs PRSice.R that the location of the PRSice binary
base Height.QC.gz Informs PRSice that the name of the GWAS summary statistic
target EUR.QC Informs PRSice that the input genotype files should have a prefix of EUR.QC
binary-target F Indicate if the phenotype of interest is a binary trait. F for no, T for yes
pheno EUR.height Provide PRSice with the phenotype file
cov EUR.covariate Provide PRSice with the covariate file
base-maf MAF:0.01 Filter out SNPs with MAF < 0.01 in the GWAS summary statistics, using information in the MAF column
base-info INFO:0.8 Filter out SNPs with INFO < 0.8 in the GWAS summary statistics, using information in the INFO column
stat OR Column name of the column containing the effect size
or - Inform PRSice that the effect size is an Odd Ratio
out EUR Informs PRSice that all output should have a prefix of EUR

Using LDpred-2

LDpred-2 is an R package that uses a Bayesian approach to polygenic risk scoring. The tutorial along with LDpred-2 is also worthing trying. Based on their suggestions, “In practice, until we find a better set of variants, we recommend using the HapMap3 variants used in the PRS-CS and LDpred2 papers. If you do not have enough data to use as LD reference (e.g. at least 2000 individuals)”, you could use one a good LD references prepared by them. This part is too complicated for my current PRS applications, so… just ignore it for now.

Calculate PRS using lassosum

lassosum is an R package that uses penalised regression (LASSO) in its approach to PRS calculation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#install.packages(c("devtools","RcppArmadillo", "data.table", "Matrix"), dependencies=TRUE)
#library(devtools)
#install_github("tshmak/lassosum")
library(lassosum)
# Prefer to work with data.table as it speeds up file reading
library(data.table)
library(methods)
library(magrittr)
# For multi-threading, you can use the parallel package and 
# invoke cl which is then passed to lassosum.pipeline
library(parallel)
# This will invoke 2 threads. 
cl <- makeCluster(2)

sum.stat <- "Height.QC.gz"
bfile <- "EUR.QC"
# Read in and process the covariates
covariate <- fread("EUR.cov")
pcs <- fread("EUR.eigenvec") %>%
    setnames(., colnames(.), c("FID","IID", paste0("PC",1:6)))
# Need as.data.frame here as lassosum doesn't handle data.table 
# covariates very well
cov <- merge(covariate, pcs)

# We will need the EUR.hg19 file provided by lassosum 
# which are LD regions defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
ld.file <- "EUR.hg19"
# output prefix
prefix <- "EUR"
# Read in the target phenotype file
target.pheno <- fread("EUR.height")[,c("FID", "IID", "Height")]
# Read in the summary statistics
ss <- fread(sum.stat)
# Remove P-value = 0, which causes problem in the transformation
ss <- ss[!P == 0]
# Transform the P-values into correlation
cor <- p2cor(p = ss$P,
        n = ss$N,
        sign = log(ss$OR)
        )
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]


# Run the lassosum pipeline
# The cluster parameter is used for multi-threading
# You can ignore that if you do not wish to perform multi-threaded processing
out <- lassosum.pipeline(
    cor = cor,
    chr = ss$CHR,
    pos = ss$BP,
    A1 = ss$A1,
    A2 = ss$A2,
    ref.bfile = bfile,
    test.bfile = bfile,
    LDblocks = ld.file, 
    cluster=cl
)
# Store the R2 results
target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
# Get the maximum R2
r2 <- max(target.res$validation.table$value)^2