π Overview
As RNA-Seq continues to be a core technology for transcriptomic analysis, having a reproducible, scalable, and modular pipeline is essential. In this post, Iβll walk through recent updates to our RNA-Seq pipeline built with Nextflow, including improvements in preprocessing, quantification, and downstream QC.
π§ Why Nextflow?
Nextflow enables scalable and reproducible scientific workflows using software containers. It integrates smoothly with cloud and HPC environments and is particularly well-suited for complex bioinformatics pipelines.
Nextflow is a domain-specific language (DSL) for data-driven workflows (like RNA-seq), built on top of the Java Virtual Machine.
It uses Groovy, a scripting language that runs on the JVM (and blends with Java syntax).
So when you run Nextflow, it uses Java under the hood.
Key advantages:
- Easy parallelization
- Support for Docker/Singularity
- Seamless integration with GitHub and CI/CD tools
- Built-in traceability and logging
Example running bash code
Directory Structure
1
2
3
4
5
6
7
8
9
10
11
12
rna-seq-pipeline/
βββ main.nf
βββ nextflow.config
βββ modules/
β βββ qc/
β βββ align/
β βββ quant/
βββ results/
β βββ multiqc/
βββ resources/
βββ genome.fa
βββ annotation.gtf
1
2
3
4
5
nextflow run main.nf \
--reads "data/*_R{1,2}.fastq.gz" \
--genome GRCh38 \
--aligner salmon \
-profile docker
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// Define parameters
params.reads = "data/*_R{1,2}.fastq.gz"
params.genome = "resources/genome.fa"
params.gtf = "resources/annotation.gtf"
params.aligner = "salmon" // or 'star'
params.outdir = "results"
workflow {
reads_ch = Channel.fromFilePairs(params.reads, flat: true)
qc_out = qc(reads_ch)
trimmed_reads = trim(qc_out)
if (params.aligner == 'salmon') {
quant_out = quant_salmon(trimmed_reads)
} else if (params.aligner == 'star') {
aligned = align_star(trimmed_reads)
quant_out = count_featureCounts(aligned)
}
multiqc_report(quant_out)
}
// QC Step
process qc {
tag "$sample_id"
input:
set sample_id, file(reads) from reads_ch
output:
set sample_id, file("*_fastqc.zip") into qc_ch
script:
"""
fastqc -o ./ ${reads.join(' ')}
"""
}
// Trimming Step
process trim {
tag "$sample_id"
input:
set sample_id, file(reads) from qc_ch
output:
set sample_id, file("*.fq.gz") into trimmed_ch
script:
"""
fastp -i ${reads[0]} -I ${reads[1]} -o ${sample_id}_R1_trimmed.fq.gz -O ${sample_id}_R2_trimmed.fq.gz
"""
}
// Salmon Quantification
process quant_salmon {
tag "$sample_id"
input:
set sample_id, file(reads) from trimmed_ch
output:
set sample_id, file("quant.sf") into quant_out_ch
script:
"""
salmon quant -i salmon_index -l A \
-1 ${reads[0]} -2 ${reads[1]} \
-p 8 -o ${sample_id}_quant
cp ${sample_id}_quant/quant.sf ./
"""
}
// STAR Alignment
process align_star {
tag "$sample_id"
input:
set sample_id, file(reads) from trimmed_ch
output:
set sample_id, file("Aligned.out.sam") into aligned_ch
script:
"""
STAR --genomeDir star_index \
--readFilesIn ${reads[0]} ${reads[1]} \
--runThreadN 8 \
--outFileNamePrefix ${sample_id}_ \
--outSAMtype SAM
cp ${sample_id}_Aligned.out.sam Aligned.out.sam
"""
}
// featureCounts (optional downstream after STAR)
process count_featureCounts {
input:
set sample_id, file(aligned) from aligned_ch
output:
set sample_id, file("counts.txt") into quant_out_ch
script:
"""
featureCounts -a ${params.gtf} -o counts.txt -T 4 ${aligned}
"""
}
// MultiQC summary
process multiqc_report {
input:
set sample_id, file(qcfiles) from quant_out_ch.collect()
output:
file("multiqc_report.html") into final_report
script:
"""
multiqc . -o ${params.outdir}/multiqc
"""
}
𧬠Pipeline Components
Hereβs an overview of the major steps in the updated pipeline:
-
Quality Control (FastQC + MultiQC)
Initial QC of raw FASTQ files to assess base quality, adapter content, and GC bias. -
Trimming (TrimGalore or fastp)
Adapter removal and quality trimming for cleaner downstream alignment or quantification. - Alignment or Pseudoalignment
- Option A: STAR for genome alignment
- Option B: Salmon for transcript-level quantification (faster, alignment-free)
- QC Metrics Aggregation
- Mapping rate summaries, read distribution, duplication rates, etc., compiled using MultiQC.
- Differential Expression (optional module)
- The salmon counts could be used as input for my RNASeq pipline.