Updated RNAseq Pipeline Using NextFlow

A modular, scalable, and reproducible workflow for RNA-seq data analysis with built-in QC, alignment, and quantification steps

Posted by Yuan on October 1, 2023

πŸš€ Overview

As RNA-Seq continues to be a core technology for transcriptomic analysis, having a reproducible, scalable, and modular pipeline is essential. In this post, I’ll walk through recent updates to our RNA-Seq pipeline built with Nextflow, including improvements in preprocessing, quantification, and downstream QC.

πŸ”§ Why Nextflow?

Nextflow enables scalable and reproducible scientific workflows using software containers. It integrates smoothly with cloud and HPC environments and is particularly well-suited for complex bioinformatics pipelines.

Key advantages:

  • Easy parallelization
  • Support for Docker/Singularity
  • Seamless integration with GitHub and CI/CD tools
  • Built-in traceability and logging

Example running bash code

Directory Structure

1
2
3
4
5
6
7
8
9
10
11
12
rna-seq-pipeline/
β”œβ”€β”€ main.nf
β”œβ”€β”€ nextflow.config
β”œβ”€β”€ modules/
β”‚   β”œβ”€β”€ qc/
β”‚   β”œβ”€β”€ align/
β”‚   β”œβ”€β”€ quant/
β”œβ”€β”€ results/
β”‚   └── multiqc/
└── resources/
    β”œβ”€β”€ genome.fa
    └── annotation.gtf
1
2
3
4
5
nextflow run main.nf \
  --reads "data/*_R{1,2}.fastq.gz" \
  --genome GRCh38 \
  --aligner salmon \
  -profile docker
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// Define parameters
params.reads = "data/*_R{1,2}.fastq.gz"
params.genome = "resources/genome.fa"
params.gtf = "resources/annotation.gtf"
params.aligner = "salmon" // or 'star'
params.outdir = "results"

workflow {

    reads_ch = Channel.fromFilePairs(params.reads, flat: true)

    qc_out = qc(reads_ch)
    trimmed_reads = trim(qc_out)

    if (params.aligner == 'salmon') {
        quant_out = quant_salmon(trimmed_reads)
    } else if (params.aligner == 'star') {
        aligned = align_star(trimmed_reads)
        quant_out = count_featureCounts(aligned)
    }

    multiqc_report(quant_out)
}

// QC Step
process qc {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from reads_ch
    output:
    set sample_id, file("*_fastqc.zip") into qc_ch

    script:
    """
    fastqc -o ./ ${reads.join(' ')}
    """
}

// Trimming Step
process trim {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from qc_ch
    output:
    set sample_id, file("*.fq.gz") into trimmed_ch

    script:
    """
    fastp -i ${reads[0]} -I ${reads[1]} -o ${sample_id}_R1_trimmed.fq.gz -O ${sample_id}_R2_trimmed.fq.gz
    """
}

// Salmon Quantification
process quant_salmon {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from trimmed_ch
    output:
    set sample_id, file("quant.sf") into quant_out_ch

    script:
    """
    salmon quant -i salmon_index -l A \
        -1 ${reads[0]} -2 ${reads[1]} \
        -p 8 -o ${sample_id}_quant
    cp ${sample_id}_quant/quant.sf ./
    """
}

// STAR Alignment
process align_star {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from trimmed_ch
    output:
    set sample_id, file("Aligned.out.sam") into aligned_ch

    script:
    """
    STAR --genomeDir star_index \
         --readFilesIn ${reads[0]} ${reads[1]} \
         --runThreadN 8 \
         --outFileNamePrefix ${sample_id}_ \
         --outSAMtype SAM
    cp ${sample_id}_Aligned.out.sam Aligned.out.sam
    """
}

// featureCounts (optional downstream after STAR)
process count_featureCounts {
    input:
    set sample_id, file(aligned) from aligned_ch
    output:
    set sample_id, file("counts.txt") into quant_out_ch

    script:
    """
    featureCounts -a ${params.gtf} -o counts.txt -T 4 ${aligned}
    """
}

// MultiQC summary
process multiqc_report {
    input:
    set sample_id, file(qcfiles) from quant_out_ch.collect()
    output:
    file("multiqc_report.html") into final_report

    script:
    """
    multiqc . -o ${params.outdir}/multiqc
    """
}

🧬 Pipeline Components

Here’s an overview of the major steps in the updated pipeline:

  1. Quality Control (FastQC + MultiQC)
    Initial QC of raw FASTQ files to assess base quality, adapter content, and GC bias.

  2. Trimming (TrimGalore or fastp)
    Adapter removal and quality trimming for cleaner downstream alignment or quantification.

  3. Alignment or Pseudoalignment
    • Option A: STAR for genome alignment
    • Option B: Salmon for transcript-level quantification (faster, alignment-free)
  4. QC Metrics Aggregation
    • Mapping rate summaries, read distribution, duplication rates, etc., compiled using MultiQC.
  5. Differential Expression (optional module)