Updated RNAseq Pipeline Using NextFlow

🚀 Overview

As RNA-Seq continues to be a core technology for transcriptomic analysis, having a reproducible, scalable, and modular pipeline is essential. In this post, I’ll walk through recent updates to our RNA-Seq pipeline built with Nextflow, including improvements in preprocessing, quantification, and downstream QC.

🔧 Why Nextflow?

Nextflow enables scalable and reproducible scientific workflows using software containers. It integrates smoothly with cloud and HPC environments and is particularly well-suited for complex bioinformatics pipelines.

Note:

Nextflow is a domain-specific language (DSL) for data-driven workflows (like RNA-seq), built on top of the Java Virtual Machine.

It uses Groovy, a scripting language that runs on the JVM (and blends with Java syntax).

So when you run Nextflow, it uses Java under the hood.

Key advantages:

Easy parallelization
Support for Docker/Singularity
Seamless integration with GitHub and CI/CD tools
Built-in traceability and logging

Example running bash code

Directory Structure

rna-seq-pipeline/
├── main.nf
├── nextflow.config
├── modules/
│   ├── qc/
│   ├── align/
│   ├── quant/
├── results/
│   └── multiqc/
└── resources/
    ├── genome.fa
    └── annotation.gtf

nextflow run main.nf \
  --reads "data/*_R{1,2}.fastq.gz" \
  --genome GRCh38 \
  --aligner salmon \
  -profile docker

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// Define parameters
params.reads = "data/*_R{1,2}.fastq.gz"
params.genome = "resources/genome.fa"
params.gtf = "resources/annotation.gtf"
params.aligner = "salmon" // or 'star'
params.outdir = "results"

workflow {

    reads_ch = Channel.fromFilePairs(params.reads, flat: true)

    qc_out = qc(reads_ch)
    trimmed_reads = trim(qc_out)

    if (params.aligner == 'salmon') {
        quant_out = quant_salmon(trimmed_reads)
    } else if (params.aligner == 'star') {
        aligned = align_star(trimmed_reads)
        quant_out = count_featureCounts(aligned)
    }

    multiqc_report(quant_out)
}

// QC Step
process qc {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from reads_ch
    output:
    set sample_id, file("*_fastqc.zip") into qc_ch

    script:
    """
    fastqc -o ./ ${reads.join(' ')}
    """
}

// Trimming Step
process trim {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from qc_ch
    output:
    set sample_id, file("*.fq.gz") into trimmed_ch

    script:
    """
    fastp -i ${reads[0]} -I ${reads[1]} -o ${sample_id}_R1_trimmed.fq.gz -O ${sample_id}_R2_trimmed.fq.gz
    """
}

// Salmon Quantification
process quant_salmon {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from trimmed_ch
    output:
    set sample_id, file("quant.sf") into quant_out_ch

    script:
    """
    salmon quant -i salmon_index -l A \
        -1 ${reads[0]} -2 ${reads[1]} \
        -p 8 -o ${sample_id}_quant
    cp ${sample_id}_quant/quant.sf ./
    """
}

// STAR Alignment
process align_star {
    tag "$sample_id"
    input:
    set sample_id, file(reads) from trimmed_ch
    output:
    set sample_id, file("Aligned.out.sam") into aligned_ch

    script:
    """
    STAR --genomeDir star_index \
         --readFilesIn ${reads[0]} ${reads[1]} \
         --runThreadN 8 \
         --outFileNamePrefix ${sample_id}_ \
         --outSAMtype SAM
    cp ${sample_id}_Aligned.out.sam Aligned.out.sam
    """
}

// featureCounts (optional downstream after STAR)
process count_featureCounts {
    input:
    set sample_id, file(aligned) from aligned_ch
    output:
    set sample_id, file("counts.txt") into quant_out_ch

    script:
    """
    featureCounts -a ${params.gtf} -o counts.txt -T 4 ${aligned}
    """
}

// MultiQC summary
process multiqc_report {
    input:
    set sample_id, file(qcfiles) from quant_out_ch.collect()
    output:
    file("multiqc_report.html") into final_report

    script:
    """
    multiqc . -o ${params.outdir}/multiqc
    """
}

🧬 Pipeline Components

Here’s an overview of the major steps in the updated pipeline:

Quality Control (FastQC + MultiQC)
Initial QC of raw FASTQ files to assess base quality, adapter content, and GC bias.
Trimming (TrimGalore or fastp)
Adapter removal and quality trimming for cleaner downstream alignment or quantification.
Alignment or Pseudoalignment
- Option A: STAR for genome alignment
- Option B: Salmon for transcript-level quantification (faster, alignment-free)
QC Metrics Aggregation
- Mapping rate summaries, read distribution, duplication rates, etc., compiled using MultiQC.
Differential Expression (optional module)
- The salmon counts could be used as input for my RNASeq pipline.

Updated RNAseq Pipeline Using NextFlow

A modular, scalable, and reproducible workflow for RNA-seq data analysis with built-in QC, alignment, and quantification steps

🚀 Overview

🔧 Why Nextflow?

Example running bash code

🧬 Pipeline Components

CATALOG

FEATURED TAGS

FRIENDS