My Weakness, My Strength

Recently, I have genetic data to be added to my AD models. For each subject, I intended to add their PRS score of AD as a new feature.

This blog is based on two tutorials, one from PRSice and the other from Andries T. Marees.

The power and validity of PRS analysis depend on the quality of the base and target data. THus, both data sets must undergo QC, not only to the standards of specific GWAS studies, but also to some PRS specific requirements.

1.QC of Base Data

The first step in Polygenic Risk Score (PRS) analyses is to generate or obtain the base data (GWAS summary statistics). The GWAS summary statistics file contains the following columns:

The column headers correspond to the following:

CHR	BP	SNP	A1	A2	N	SE	P	OR	INFO	MAF
1	756604	rs3131962	A	G	388028	0.00301666	0.483171	0.997886915712657	0.890557941364774	0.369389592764921
1	768448	rs12562034	A	G	388028	0.00329472	0.834808	1.00068731609353	0.895893511351165	0.336845754096289

CHR: The chromosome in which the SNP resides
BP: Chromosomal co-ordinate of the SNP
SNP: SNP ID, usually in the form of rs-ID
A1: The effect allele of the SNP
A2: The non-effect allele of the SNP
N: Number of samples used to obtain the effect size estimate
SE: The standard error (SE) of the effect size esimate
P: The P-value of association between the SNP genotypes and the base phenotype
OR: The effect size estimate of the SNP, if the outcome is binary/case-control. If the outcome is continuous or treated as continuous then this will usually be BETA
INFO: The imputation information score
MAF: The minor allele frequency (MAF) of the SNP

QC relevant to base data only

Heritability check: A critical factor in the accuracy and predictive power of PRSs is the power of the base data. To avoid reaching misleading conclusions, PRS analysis using GWAS data with h_SNP^2^ > 0.05 is recommended. h_SNP^2^ could be estimated from GWAS summary statistics using LD Score regression or SumHer.
Effect Allele: Some GWAS results do not specify which allele is the effect allele and which is non-effect allele, which will lead to wrong directions of the effect of PRS in the target data if the incorrect assumption is made.
Genome build: The height summary statistic are on the same genome build as the target data that we will be using.If they are not then use a tool such as LiftOver to make the builds consistent across the data sets.

Standard GWAS QC: both the base and target data should be subjected to the standard stringent QC steps performed in GWAS. If the base data have been obtained as summary statistics from a public source, then the typical QC steps that you will be able to perform on them are to filter the SNPs according to INFO score and MAF. SNPs with low minor allele frequency (MAF) or imputation information score (INFO) are more likely to generate false positive results due to their lower statistical power (and higher probability of genotyping errors in the case of low MAF). Therefore, SNPs with low MAF (MAF < 1%) and INFO(INFO < 0.8) are typically removed before performing downstream analyses. With very large base sample sizes, these thresholds could be reduced if sensitivity checks indicate reliable results.

Using bash
Using R

  gunzip -c Height.gwas.txt.gz |\
  awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' |\
  gzip  > Height.gz

  # Alternatively, you can use R, with data.table v1.11.8+
  library(data.table)
  # Read in file
  dat <- fread("Height.gwas.txt.gz")
  # Filter out SNPs
  result <- dat[INFO > 0.8 & MAF > 0.01]
  # Output the gz file
  fwrite(result, "Height.gz", sep="\t")

Mismatching SNPs: SNPs that have mismatching alleles reported in the base and target data are either resolvable by “strand-flipping” the alleles to their complementary alleles in e.g. the target data, such as for a SNP with A/C in the base data and G/T in the target, or non-resolvable, such as for a SNP with C/G in the base and C/T in the target. Most polygenic score software perform strand-flipping automatically for SNPs that are resolvable, and remove non-resolvable mismatching SNPs.

Duplicate SNPs Most PRS software do not allow duplicated SNPs in the base data input and thus they should be removed:

#assuming the third column contian the SNP ID
  gunzip -c Height.gz |\
  awk '{seen[$3]++; if(seen[$3]==1){ print}}' |\
  gzip - > Height.nodup.g

Ambiguous SNPs: If the base and target data were generated using different genotyping chips and the chromosome strand (+/-) that was used for either is unknown, then it is not possible to pair-up the alleles of ambiguous SNPs (i.e. those with complementary alleles, either C/G or A/T SNPs) across the data sets, because it will be unknown whether the base and target data are referring to the same allele or not. While allele frequencies could be used to infer which alleles are on the same strand, the accuracy of this could be low for SNPs with MAF close to 50% or when the base and target data are from different populations. Therefore, we recommend removing all ambiguous SNPs to avoid introducing this potential source of systematic error.

  gunzip -c Height.nodup.gz |\
  awk '!( ($4=="A" && $5=="T") || \
          ($4=="T" && $5=="A") || \
          ($4=="G" && $5=="C") || \
          ($4=="C" && $5=="G")) {print}' |\
      gzip > Height.QC.gz

2.QC of Target Data

Target data consist of individual-level genotype-phenotype data, usually generated within your lab/department/collaboration. plink is required to do GWAS quality control similar to standard GWAS analysis.

QC checklist: Target data

Sample size: We recommend that users only perform PRS analyses on target data of at least 100 individuals.
Genome build: Again, make sure target and base data are using the same genome build

Standard GWAS QC:

Fileter samples by MAF, allele frequencey, etc

The target data must be quality controlled to at least the standards implemented in GWAS studies, e.g. removing SNPs with low genotyping rate, low minor allele frequency, out of Hardy-Weinberg Equilibrium, removing individuals with low genotyping rate.

  plink \
      --bfile EUR \
      --maf 0.01 \
      --hwe 1e-6 \
      --geno 0.01 \
      --mind 0.01 \
      --write-snplist \
      --make-just-fam \
      --out EUR.QC

Some useful Plink options:

Option	Value	Description
bfile	EUR	Informs plink that the input genotype files should have a prefix of EUR
maf	0.01	Removes all SNPs with minor allele frequency less than 0.01. Genotyping errors typically have a larger influence on SNPs with low MAF. Studies with large sample sizes could apply a lower MAF threshold
hwe	1e-6	Removes SNPs with low P-value from the Hardy-Weinberg Equilibrium Fisher’s exact or chi-squared test. SNPs with significant P-values from the HWE test are more likely affected by genotyping error or the effects of natural selection. Filtering should be performed on the control samples to avoid filtering SNPs that are causal (under selection in cases). When phenotype information is included, plink will automatically perform the filtering in the controls.
geno	0.01	Excludes SNPs that are missing in a high fraction of subjects. A two-stage filtering process is usually performed
mind	0.01	Excludes individuals who have a high rate of genotype missingness, since this may indicate problems in the DNA sample or processing.Similar to geno, two stage steps is usually performed.
make-just-fam	-	Informs plink to only generate the QC’ed sample name to avoid generating the .bed file.
write-snplist	-	Informs plink to only generate the QC’ed SNP list to avoid generating the .bed file.
out	EUR.QC	Informs plink that all output should have a prefix of EUR.QC

After this QC steps, for the similated samples:

14 samples were removed due to a high rate of genotype missingness
5,353 SNP were removed due to missing genotype data
944 SNPs were removed due to being out of Hardy-Weinberg Equilibrium
5,061 SNPs were removed due to low minor allele frequency

Normally, we can generate a new genotype file using the new sample list. However, this will use up a lot of storage space. Using plink’s –extract, –exclude, –keep, –remove, –make-just-fam and –write-snplist functions, we can work solely on the list of samples and SNPs without duplicating the genotype file, reducing the storage space usage.

Remove high heterozygosity samples

Very high or low heterozygosity rates in individuals could be due to DNA contamination or to high levels of inbreeding. Therefore, samples with extreme heterozygosity are typically removed prior to downstream analyses. First, we perform pruning to remove highly correlated SNPs:

  plink \
      --bfile EUR \
      --keep EUR.QC.fam \
      --extract EUR.QC.snplist \
      --indep-pairwise 200 50 0.25 \
      --out EUR.QC

This will generate two files 1) EUR.QC.prune.in and 2) EUR.QC.prune.out. All SNPs within EUR.QC.prune.in have a pairwise r^2^ < 0.25.

Parameter	Value	Description
keep	EUR.QC.fam	Informs plink that we only want to use samples in EUR.QC.fam in the analysis
extract	EUR.QC.snplist	Informs plink that we only want to use SNPs in EUR.QC.snplist in the analysis
indep-pairwise	200 50 0.25	Informs plink that we wish to perform pruning with a window size of 200 variants, sliding across the genome with step size of 50 variants at a time, and filter out any SNPs with LD r^2^ higher than 0.25

Heterozygosity rates can then be computed

  plink \
      --bfile EUR \
      --extract EUR.QC.prune.in \
      --keep EUR.QC.fam \
      --het \
      --out EUR.QC

This will generate the EUR.QC.het file, which contains F coefficient estimates for assessing heterozygosity. We will remove individuals with F coefficients that are more than 3 standard deviation (SD) units from the mean:

  library(data.table)
  # Read in file
  dat <- fread("EUR.QC.het")
  # Get samples with F coefficient within 3 SD of the population mean
  valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
  # print FID and IID for valid samples
  fwrite(valid[,c("FID","IID")], "EUR.valid.sample", sep="\t") 
  q() # exit R

Ambiguous SNPs: These were removed during the base data QC.

Mismatching SNPs: SNPs that have mismatching alleles reported in the base and target data may be resolvable by strand-flipping the alleles to their complementary alleles in e.g. the target data, such as for a SNP with A/C in the base data and G/T in the target. This can be achieved with the following steps:

Note:Most PRS software will perform strand-flipping automatically, thus this step is usually not required.

Load the bim file, the summary statistic and the QC SNP list into R
Identify SNPs that require strand flipping
Identify SNPs that require recoding in the target (to ensure the coding allele in the target data is the effective allele in the base summary statistic)
Identify SNPs that have different allele in base and target (usually due to difference in genome build or Indel)

  # magrittr allow us to do piping, which help to reduce the 
  # amount of intermediate data types
  library(data.table)
  library(magrittr)
  #1. Load the bim file, the summary statistic and the QC SNP list into R
  # Read in bim file 
  bim <- fread("EUR.bim") %>%
      # Note: . represents the output from previous step
      # The syntax here means, setnames of the data read from
      # the bim file, and replace the original column names by 
      # the new names
      setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
      # And immediately change the alleles to upper cases
      .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
  # Read in summary statistic data (require data.table v1.12.0+)
  height <- fread("Height.QC.gz") %>%
      # And immediately change the alleles to upper cases
      .[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
  # Read in QCed SNPs
  qc <- fread("EUR.QC.snplist", header=F)

  #2. Identify SNPs that require strand flipping
  # Merge summary statistic with target
  info <- merge(bim, height, by=c("SNP", "CHR", "BP")) %>%
      # And filter out QCed SNPs
      .[SNP %in% qc[,V1]]

  # Function for calculating the complementary allele
  complement <- function(x){
      switch (x,
          "A" = "T",
          "C" = "G",
          "T" = "A",
          "G" = "C",
          return(NA)
      )
  } 
  # Get SNPs that have the same alleles across base and target
  info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
  # Identify SNPs that are complementary between base and target
  com.snps <- info[sapply(B.A1, complement) == A1 &
                      sapply(B.A2, complement) == A2, SNP]
  # Now update the bim file
  bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
          list(sapply(B.A1, complement),
              sapply(B.A2, complement))]

  #3. Identify SNPs that require recoding in the target
  # identify SNPs that need recoding
  recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]
  # Update the bim file
  bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
          list(B.A2, B.A1)]

  # identify SNPs that need recoding & complement
  com.recode <- info[sapply(B.A1, complement) == A2 &
                      sapply(B.A2, complement) == A1, SNP]
  # Now update the bim file
  bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
          list(sapply(B.A2, complement),
              sapply(B.A1, complement))]
  # Write the updated bim file
  fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")

  #4. Identify SNPs that have different allele in base and target
  mismatch <- bim[!(SNP %in% info.match |
                  SNP %in% com.snps |
                  SNP %in% recode.snps |
                  SNP %in% com.recode), SNP]
  write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
  q() # exit R

Duplicate SNPs: Make sure to remove any duplicate SNPs in your target data

Sex chromosomes: Sometimes sample mislabelling can occur, which may lead to invalid results. A sex check can be performed in PLINK, in which individuals are called as females if their X chromosome homozygosity estimate (F statistic) is < 0.2 and as males if the estimate is > 0.8.

Before performing a sex check, pruning should be performed. A sex check can then easily be conducted using plink:

  plink \
      --bfile EUR \
      --extract EUR.QC.prune.in \
      --keep EUR.valid.sample \
      --check-sex \
      --out EUR.QC

This will generate a file called EUR.QC.sexcheck containing the F-statistics. Then:

  library(data.table)
  # Read in file
  valid <- fread("EUR.valid.sample")
  dat <- fread("EUR.QC.sexcheck")[FID%in%valid$FID]
  fwrite(dat[STATUS=="OK",c("FID","IID")], "EUR.QC.valid", sep="\t") 
  q() # exit R

Sample overlap: Avoiding sample overlap between target and base dataset
Relatedness: Closely related individuals in the target data may lead to overfitted results, limiting the generalisability of the results.

Before calculating the relatedness, pruning should be performed.Individuals that have a first or second degree relative in the sample (π > 0.125 ) can be removed with the following command:
```
1
2
3
4
5
6
  plink \
      --bfile EUR \
      --extract EUR.QC.prune.in \
      --keep EUR.QC.valid \
      --rel-cutoff 0.125 \
      --out EUR.QC
```
Note: A greedy algorithm is used to remove closely related individuals in a way that optimizes the size of the sample retained. However, the algorithm is dependent on the random seed used, which can generate different results. Therefore, to reproduce the same result, you will need to specify the same random seed.

PLINK’s algorithm for removing related individuals does not account for the phenotype under study. To minimize the removal of cases of a disease, the following algorithm can be used instead: GreedyRelated.

Generate final QC’ed target data file: After performing the full analysis, you can generate a QC’ed data set with the following command:

  plink \
      --bfile EUR \
      --make-bed \
      --keep EUR.QC.rel.id \
      --out EUR.QC \
      --extract EUR.QC.snplist \
      --exclude EUR.mismatch \
      --a1-allele EUR.a1

PRS score

A tutorial based on PLINK, PRSice 2, LDpred 2 and lassosum

1.QC of Base Data

QC relevant to base data only

2.QC of Target Data

QC checklist: Target data

CATALOG

FEATURED TAGS

FRIENDS