PRS score

A tutorial based on PLINK, PRSice 2, LDpred 2 and lassosum

Posted by Yuan on May 24, 2022
  • My Weakness, My Strength

Recently, I have genetic data to be added to my AD models. For each subject, I intended to add their PRS score of AD as a new feature.

This blog is based on two tutorials, one from PRSice and the other from Andries T. Marees.

The power and validity of PRS analysis depend on the quality of the base and target data. THus, both data sets must undergo QC, not only to the standards of specific GWAS studies, but also to some PRS specific requirements.

1.QC of Base Data

The first step in Polygenic Risk Score (PRS) analyses is to generate or obtain the base data (GWAS summary statistics). The GWAS summary statistics file contains the following columns:

The column headers correspond to the following:

CHR BP SNP A1 A2 N SE P OR INFO MAF
1 756604 rs3131962 A G 388028 0.00301666 0.483171 0.997886915712657 0.890557941364774 0.369389592764921
1 768448 rs12562034 A G 388028 0.00329472 0.834808 1.00068731609353 0.895893511351165 0.336845754096289
  • CHR: The chromosome in which the SNP resides
  • BP: Chromosomal co-ordinate of the SNP
  • SNP: SNP ID, usually in the form of rs-ID
  • A1: The effect allele of the SNP
  • A2: The non-effect allele of the SNP
  • N: Number of samples used to obtain the effect size estimate
  • SE: The standard error (SE) of the effect size esimate
  • P: The P-value of association between the SNP genotypes and the base phenotype
  • OR: The effect size estimate of the SNP, if the outcome is binary/case-control. If the outcome is continuous or treated as continuous then this will usually be BETA
  • INFO: The imputation information score
  • MAF: The minor allele frequency (MAF) of the SNP

QC relevant to base data only

  • Heritability check: A critical factor in the accuracy and predictive power of PRSs is the power of the base data. To avoid reaching misleading conclusions, PRS analysis using GWAS data with h_SNP^2^ > 0.05 is recommended. h_SNP^2^ could be estimated from GWAS summary statistics using LD Score regression or SumHer.
  • Effect Allele: Some GWAS results do not specify which allele is the effect allele and which is non-effect allele, which will lead to wrong directions of the effect of PRS in the target data if the incorrect assumption is made.
  • Genome build: The height summary statistic are on the same genome build as the target data that we will be using.If they are not then use a tool such as LiftOver to make the builds consistent across the data sets.
  • Standard GWAS QC: both the base and target data should be subjected to the standard stringent QC steps performed in GWAS. If the base data have been obtained as summary statistics from a public source, then the typical QC steps that you will be able to perform on them are to filter the SNPs according to INFO score and MAF. SNPs with low minor allele frequency (MAF) or imputation information score (INFO) are more likely to generate false positive results due to their lower statistical power (and higher probability of genotyping errors in the case of low MAF). Therefore, SNPs with low MAF (MAF < 1%) and INFO(INFO < 0.8) are typically removed before performing downstream analyses. With very large base sample sizes, these thresholds could be reduced if sensitivity checks indicate reliable results.

    1
    2
    3
    
      gunzip -c Height.gwas.txt.gz |\
      awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' |\
      gzip  > Height.gz
    
    1
    2
    3
    4
    5
    6
    7
    8
    
      # Alternatively, you can use R, with data.table v1.11.8+
      library(data.table)
      # Read in file
      dat <- fread("Height.gwas.txt.gz")
      # Filter out SNPs
      result <- dat[INFO > 0.8 & MAF > 0.01]
      # Output the gz file
      fwrite(result, "Height.gz", sep="\t")
    
  • Mismatching SNPs: SNPs that have mismatching alleles reported in the base and target data are either resolvable by “strand-flipping” the alleles to their complementary alleles in e.g. the target data, such as for a SNP with A/C in the base data and G/T in the target, or non-resolvable, such as for a SNP with C/G in the base and C/T in the target. Most polygenic score software perform strand-flipping automatically for SNPs that are resolvable, and remove non-resolvable mismatching SNPs.
  • Duplicate SNPs Most PRS software do not allow duplicated SNPs in the base data input and thus they should be removed:
    1
    2
    3
    4
    
    #assuming the third column contian the SNP ID
      gunzip -c Height.gz |\
      awk '{seen[$3]++; if(seen[$3]==1){ print}}' |\
      gzip - > Height.nodup.g
    
  • Ambiguous SNPs: If the base and target data were generated using different genotyping chips and the chromosome strand (+/-) that was used for either is unknown, then it is not possible to pair-up the alleles of ambiguous SNPs (i.e. those with complementary alleles, either C/G or A/T SNPs) across the data sets, because it will be unknown whether the base and target data are referring to the same allele or not. While allele frequencies could be used to infer which alleles are on the same strand, the accuracy of this could be low for SNPs with MAF close to 50% or when the base and target data are from different populations. Therefore, we recommend removing all ambiguous SNPs to avoid introducing this potential source of systematic error.
    1
    2
    3
    4
    5
    6
    
      gunzip -c Height.nodup.gz |\
      awk '!( ($4=="A" && $5=="T") || \
              ($4=="T" && $5=="A") || \
              ($4=="G" && $5=="C") || \
              ($4=="C" && $5=="G")) {print}' |\
          gzip > Height.QC.gz
    

    2.QC of Target Data

    Target data consist of individual-level genotype-phenotype data, usually generated within your lab/department/collaboration. plink is required to do GWAS quality control similar to standard GWAS analysis.

QC checklist: Target data

  • Sample size: We recommend that users only perform PRS analyses on target data of at least 100 individuals.
  • Genome build: Again, make sure target and base data are using the same genome build
  • Standard GWAS QC:
    1. Fileter samples by MAF, allele frequencey, etc

    The target data must be quality controlled to at least the standards implemented in GWAS studies, e.g. removing SNPs with low genotyping rate, low minor allele frequency, out of Hardy-Weinberg Equilibrium, removing individuals with low genotyping rate.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    
      plink \
          --bfile EUR \
          --maf 0.01 \
          --hwe 1e-6 \
          --geno 0.01 \
          --mind 0.01 \
          --write-snplist \
          --make-just-fam \
          --out EUR.QC
    

    Some useful Plink options:

    Option Value Description
    bfile EUR Informs plink that the input genotype files should have a prefix of EUR
    maf 0.01 Removes all SNPs with minor allele frequency less than 0.01. Genotyping errors typically have a larger influence on SNPs with low MAF. Studies with large sample sizes could apply a lower MAF threshold
    hwe 1e-6 Removes SNPs with low P-value from the Hardy-Weinberg Equilibrium Fisher’s exact or chi-squared test. SNPs with significant P-values from the HWE test are more likely affected by genotyping error or the effects of natural selection. Filtering should be performed on the control samples to avoid filtering SNPs that are causal (under selection in cases). When phenotype information is included, plink will automatically perform the filtering in the controls.
    geno 0.01 Excludes SNPs that are missing in a high fraction of subjects. A two-stage filtering process is usually performed
    mind 0.01 Excludes individuals who have a high rate of genotype missingness, since this may indicate problems in the DNA sample or processing.Similar to geno, two stage steps is usually performed.
    make-just-fam - Informs plink to only generate the QC’ed sample name to avoid generating the .bed file.
    write-snplist - Informs plink to only generate the QC’ed SNP list to avoid generating the .bed file.
    out EUR.QC Informs plink that all output should have a prefix of EUR.QC

    After this QC steps, for the similated samples:

    • 14 samples were removed due to a high rate of genotype missingness
    • 5,353 SNP were removed due to missing genotype data
    • 944 SNPs were removed due to being out of Hardy-Weinberg Equilibrium
    • 5,061 SNPs were removed due to low minor allele frequency

    Normally, we can generate a new genotype file using the new sample list. However, this will use up a lot of storage space. Using plink’s –extract, –exclude, –keep, –remove, –make-just-fam and –write-snplist functions, we can work solely on the list of samples and SNPs without duplicating the genotype file, reducing the storage space usage.

    1. Remove high heterozygosity samples

    Very high or low heterozygosity rates in individuals could be due to DNA contamination or to high levels of inbreeding. Therefore, samples with extreme heterozygosity are typically removed prior to downstream analyses. First, we perform pruning to remove highly correlated SNPs:

    1
    2
    3
    4
    5
    6
    
      plink \
          --bfile EUR \
          --keep EUR.QC.fam \
          --extract EUR.QC.snplist \
          --indep-pairwise 200 50 0.25 \
          --out EUR.QC
    

    This will generate two files 1) EUR.QC.prune.in and 2) EUR.QC.prune.out. All SNPs within EUR.QC.prune.in have a pairwise r^2^ < 0.25.

    Parameter Value Description
    keep EUR.QC.fam Informs plink that we only want to use samples in EUR.QC.fam in the analysis
    extract EUR.QC.snplist Informs plink that we only want to use SNPs in EUR.QC.snplist in the analysis
    indep-pairwise 200 50 0.25 Informs plink that we wish to perform pruning with a window size of 200 variants, sliding across the genome with step size of 50 variants at a time, and filter out any SNPs with LD r^2^ higher than 0.25
    1. Heterozygosity rates can then be computed
    1
    2
    3
    4
    5
    6
    
      plink \
          --bfile EUR \
          --extract EUR.QC.prune.in \
          --keep EUR.QC.fam \
          --het \
          --out EUR.QC
    

    This will generate the EUR.QC.het file, which contains F coefficient estimates for assessing heterozygosity. We will remove individuals with F coefficients that are more than 3 standard deviation (SD) units from the mean:

    1
    2
    3
    4
    5
    6
    7
    8
    
      library(data.table)
      # Read in file
      dat <- fread("EUR.QC.het")
      # Get samples with F coefficient within 3 SD of the population mean
      valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
      # print FID and IID for valid samples
      fwrite(valid[,c("FID","IID")], "EUR.valid.sample", sep="\t") 
      q() # exit R
    
  • Ambiguous SNPs: These were removed during the base data QC.
  • Mismatching SNPs: SNPs that have mismatching alleles reported in the base and target data may be resolvable by strand-flipping the alleles to their complementary alleles in e.g. the target data, such as for a SNP with A/C in the base data and G/T in the target. This can be achieved with the following steps:

    Note:Most PRS software will perform strand-flipping automatically, thus this step is usually not required.

    1. Load the bim file, the summary statistic and the QC SNP list into R
    2. Identify SNPs that require strand flipping
    3. Identify SNPs that require recoding in the target (to ensure the coding allele in the target data is the effective allele in the base summary statistic)
    4. Identify SNPs that have different allele in base and target (usually due to difference in genome build or Indel)
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    
      # magrittr allow us to do piping, which help to reduce the 
      # amount of intermediate data types
      library(data.table)
      library(magrittr)
      #1. Load the bim file, the summary statistic and the QC SNP list into R
      # Read in bim file 
      bim <- fread("EUR.bim") %>%
          # Note: . represents the output from previous step
          # The syntax here means, setnames of the data read from
          # the bim file, and replace the original column names by 
          # the new names
          setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
          # And immediately change the alleles to upper cases
          .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
      # Read in summary statistic data (require data.table v1.12.0+)
      height <- fread("Height.QC.gz") %>%
          # And immediately change the alleles to upper cases
          .[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
      # Read in QCed SNPs
      qc <- fread("EUR.QC.snplist", header=F)
    
      #2. Identify SNPs that require strand flipping
      # Merge summary statistic with target
      info <- merge(bim, height, by=c("SNP", "CHR", "BP")) %>%
          # And filter out QCed SNPs
          .[SNP %in% qc[,V1]]
    
      # Function for calculating the complementary allele
      complement <- function(x){
          switch (x,
              "A" = "T",
              "C" = "G",
              "T" = "A",
              "G" = "C",
              return(NA)
          )
      } 
      # Get SNPs that have the same alleles across base and target
      info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
      # Identify SNPs that are complementary between base and target
      com.snps <- info[sapply(B.A1, complement) == A1 &
                          sapply(B.A2, complement) == A2, SNP]
      # Now update the bim file
      bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
              list(sapply(B.A1, complement),
                  sapply(B.A2, complement))]
    
      #3. Identify SNPs that require recoding in the target
      # identify SNPs that need recoding
      recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]
      # Update the bim file
      bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
              list(B.A2, B.A1)]
    
      # identify SNPs that need recoding & complement
      com.recode <- info[sapply(B.A1, complement) == A2 &
                          sapply(B.A2, complement) == A1, SNP]
      # Now update the bim file
      bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
              list(sapply(B.A2, complement),
                  sapply(B.A1, complement))]
      # Write the updated bim file
      fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")
    
      #4. Identify SNPs that have different allele in base and target
      mismatch <- bim[!(SNP %in% info.match |
                      SNP %in% com.snps |
                      SNP %in% recode.snps |
                      SNP %in% com.recode), SNP]
      write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
      q() # exit R
    
  • Duplicate SNPs: Make sure to remove any duplicate SNPs in your target data
  • Sex chromosomes: Sometimes sample mislabelling can occur, which may lead to invalid results. A sex check can be performed in PLINK, in which individuals are called as females if their X chromosome homozygosity estimate (F statistic) is < 0.2 and as males if the estimate is > 0.8.

    Before performing a sex check, pruning should be performed. A sex check can then easily be conducted using plink:

    1
    2
    3
    4
    5
    6
    
      plink \
          --bfile EUR \
          --extract EUR.QC.prune.in \
          --keep EUR.valid.sample \
          --check-sex \
          --out EUR.QC
    

    This will generate a file called EUR.QC.sexcheck containing the F-statistics. Then:

    1
    2
    3
    4
    5
    6
    
      library(data.table)
      # Read in file
      valid <- fread("EUR.valid.sample")
      dat <- fread("EUR.QC.sexcheck")[FID%in%valid$FID]
      fwrite(dat[STATUS=="OK",c("FID","IID")], "EUR.QC.valid", sep="\t") 
      q() # exit R
    
  • Sample overlap: Avoiding sample overlap between target and base dataset
  • Relatedness: Closely related individuals in the target data may lead to overfitted results, limiting the generalisability of the results.

    Before calculating the relatedness, pruning should be performed.Individuals that have a first or second degree relative in the sample (π > 0.125 ) can be removed with the following command:

    1
    2
    3
    4
    5
    6
    
      plink \
          --bfile EUR \
          --extract EUR.QC.prune.in \
          --keep EUR.QC.valid \
          --rel-cutoff 0.125 \
          --out EUR.QC
    

    Note: A greedy algorithm is used to remove closely related individuals in a way that optimizes the size of the sample retained. However, the algorithm is dependent on the random seed used, which can generate different results. Therefore, to reproduce the same result, you will need to specify the same random seed.

    PLINK’s algorithm for removing related individuals does not account for the phenotype under study. To minimize the removal of cases of a disease, the following algorithm can be used instead: GreedyRelated.

  • Generate final QC’ed target data file: After performing the full analysis, you can generate a QC’ed data set with the following command:

    1
    2
    3
    4
    5
    6
    7
    8
    
      plink \
          --bfile EUR \
          --make-bed \
          --keep EUR.QC.rel.id \
          --out EUR.QC \
          --extract EUR.QC.snplist \
          --exclude EUR.mismatch \
          --a1-allele EUR.a1