Workflow¶
VIP consists of four workflows depending on the type of input data: fastq, bam/cram, gvcf or vcf.
The fastq workflow is an extension of the cram workflow. The cram and gvcf workflows are extensions of the vcf workflow.
The vcf workflow produces the pipeline outputs as described here.
The following sections provide an overview of the steps of each of these workflows.
FASTQ¶
The fastq workflow consists of the following steps:
- Parallelize sample sheet per sample and for each sample
  - Quality reporting and preprocessing using fastp
  - Alignment using minimap2 producing a `cram` file per sample (see the sketch below this list)
  - In case of multiple fastq files per sample, concatenate the `cram` output files
- Continue with step 3. of the `cram` workflow
For details, see here.
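VIP runs these tools from its own Nextflow processes with its own parameters, so the snippet below is only a rough sketch of equivalent command-line calls for a single paired-end Illumina sample. The file names, the read group string and the `-x sr` preset are assumptions for illustration, not VIP's actual configuration.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


# Hypothetical paired-end Illumina inputs for one sample.
sample = "sample0"
r1, r2 = "sample0_R1.fastq.gz", "sample0_R2.fastq.gz"
reference = "GRCh38.fasta"

# Quality reporting and preprocessing using fastp.
run(
    f"fastp --in1 {r1} --in2 {r2} "
    f"--out1 {sample}_trimmed_R1.fastq.gz --out2 {sample}_trimmed_R2.fastq.gz "
    f"--json {sample}_fastp.json --html {sample}_fastp.html"
)

# Alignment using minimap2 (short-read preset), sorted and stored as CRAM.
# The read group is needed later for phasing (see the cram workflow note).
read_group = rf"@RG\tID:{sample}\tSM:{sample}"
run(
    f"minimap2 -a -x sr -R '{read_group}' {reference} "
    f"{sample}_trimmed_R1.fastq.gz {sample}_trimmed_R2.fastq.gz "
    f"| samtools sort -O cram --reference {reference} -o {sample}.cram -"
)
run(f"samtools index {sample}.cram")

# With multiple fastq pairs per sample, the per-pair CRAM files could be
# combined with `samtools merge` before continuing with the cram workflow.
```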
CRAM¶
The cram workflow consists of the following steps:
- Parallelize sample sheet per sample and for each sample
  - Create validated, indexed `.bam` file from `bam`/`cram`/`sam` input
  - Generate coverage metrics using MosDepth (see the first sketch below this list), using the regions file provided in the sample sheet if present. If no regions file is provided, a default `.bed` file will be used containing all exons in case of WES data and all genes in case of WGS data.
  - Discover short tandem repeats and publish as intermediate result.
    - Using ExpansionHunter for Illumina short read data.
    - Using this fork of Straglr for PacBio and Nanopore long read data. It is a fork of this fork (https://github.com/philres/straglr) and is chosen over the original Straglr because of its VCF output, which enables VIP to combine the results with the SV and SNV data in the `vcf` workflow.
  - Discover copy number variants for PacBio and Nanopore long read data using Spectre and publish as intermediate result.
- Parallelize cram in chunks consisting of one or more contigs and for each chunk
  - Perform short variant calling with DeepVariant producing a `gvcf` file per chunk per sample; the gvcfs of the samples in a project are then merged into one vcf per project using GLnexus and phased using WhatsHap 1) (see the second sketch below).
  - Perform structural variant calling with Manta or cuteSV producing a `vcf` file per chunk per project.
- Concatenate short variant calling and structural variant calling `vcf` files per chunk per sample
- Continue with step 3. of the `vcf` workflow
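As a first sketch of the per-sample steps above, the commands below run MosDepth for coverage metrics and ExpansionHunter for short tandem repeats on one short-read sample. This is an illustration only: the regions file, sample name and ExpansionHunter variant catalog are assumptions, and VIP's actual parameters live in its Nextflow configuration.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


# Hypothetical inputs for one short-read sample.
sample = "sample0"
cram = f"{sample}.cram"
reference = "GRCh38.fasta"
regions_bed = "exons.bed"              # regions file from the sample sheet
str_catalog = "variant_catalog.json"   # ExpansionHunter repeat catalog

# Coverage metrics with MosDepth, restricted to the regions of interest.
run(f"mosdepth --fasta {reference} --by {regions_bed} {sample}_coverage {cram}")

# Short tandem repeat discovery with ExpansionHunter (Illumina short reads).
run(
    f"ExpansionHunter --reads {cram} --reference {reference} "
    f"--variant-catalog {str_catalog} --output-prefix {sample}_str"
)

# For PacBio/Nanopore data VIP uses the Straglr fork (STRs) and Spectre (CNVs)
# instead; their invocations are not shown here.
```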
For details, see here.
1): In case of projects containing pedigree information, phasing requires the bam/cram/sam file to have read group information in order to determine which reads belong to which family member. Phasing is skipped if this information is not present.
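A second sketch for the chunked variant calling: the commands below call short variants with DeepVariant for one chunk of a two-sample project, merge the resulting gvcfs with GLnexus and phase with WhatsHap. File names, the single-contig chunk, the `DeepVariantWGS` GLnexus preset and the bare `run_deepvariant` invocation (normally executed from the DeepVariant container image) are assumptions; structural variant calling with Manta or cuteSV is omitted.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


reference = "GRCh38.fasta"
samples = ["sample0", "sample1"]   # one project with two samples
chunk = "chr20"                    # a chunk consisting of a single contig

# Short variant calling with DeepVariant: one gvcf per chunk per sample.
for sample in samples:
    run(
        f"run_deepvariant --model_type=WGS --ref={reference} "
        f"--reads={sample}.cram --regions {chunk} "
        f"--output_vcf={sample}.{chunk}.vcf.gz "
        f"--output_gvcf={sample}.{chunk}.g.vcf.gz"
    )

# Merge the per-sample gvcfs into one project vcf with GLnexus.
gvcfs = " ".join(f"{s}.{chunk}.g.vcf.gz" for s in samples)
run(f"glnexus_cli --config DeepVariantWGS {gvcfs} > project.{chunk}.bcf")
run(f"bcftools view project.{chunk}.bcf -Oz -o project.{chunk}.vcf.gz")
run(f"bcftools index -t project.{chunk}.vcf.gz")

# Phase with WhatsHap; read group information in the CRAM files links reads to
# samples (see note 1), and a pedigree could be supplied with `--ped`.
crams = " ".join(f"{s}.cram" for s in samples)
run(
    f"whatshap phase --reference {reference} "
    f"-o project.{chunk}.phased.vcf.gz project.{chunk}.vcf.gz {crams}"
)

# Structural variant calling with Manta (short reads) or cuteSV (long reads)
# produces an additional vcf per chunk per project and is omitted here.
```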
gVCF¶
The gvcf workflow consists of the following steps:
- For each project in the sample sheet
  - Create validated, indexed `.g.vcf.gz` file from `bcf`/`bcf.gz`/`bcf.bgz`/`gvcf`/`gvcf.gz`/`gvcf.bgz`/`vcf`/`vcf.gz`/`vcf.bgz` inputs
  - Merge `.g.vcf.gz` files using GLnexus resulting in one `vcf.gz` per project (see the sketch below this list)
- Continue with step 3. of the `vcf` workflow
For details, see here.
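A minimal sketch of the first two steps, assuming two hypothetical per-sample inputs in mixed formats: each input is re-encoded as an indexed `.g.vcf.gz` with bcftools and the results are merged into one project `vcf.gz` with GLnexus. The `DeepVariantWGS` preset is an assumption; VIP selects its own GLnexus configuration.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


# Hypothetical per-sample gvcf inputs in mixed formats for one project.
inputs = {"sample0": "sample0.g.vcf", "sample1": "sample1.bcf"}

# Create validated, indexed .g.vcf.gz files from the bcf/vcf/gvcf inputs.
gvcfs = []
for sample, path in inputs.items():
    gvcf = f"{sample}.g.vcf.gz"
    run(f"bcftools view {path} -Oz -o {gvcf}")   # re-encode as bgzipped VCF
    run(f"bcftools index -t {gvcf}")             # tabix index
    gvcfs.append(gvcf)

# Merge the .g.vcf.gz files with GLnexus, resulting in one vcf.gz per project.
run(f"glnexus_cli --config DeepVariantWGS {' '.join(gvcfs)} > project.bcf")
run("bcftools view project.bcf -Oz -o project.vcf.gz")
run("bcftools index -t project.vcf.gz")
```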
VCF¶
The vcf workflow consists of the following steps:
- For each project in the sample sheet
  - Create validated, indexed `.vcf.gz` file from `bcf`|`bcf.gz`|`bcf.bgz`|`vcf`|`vcf.gz`|`vcf.bgz` input
  - Chunk `vcf.gz` files and for each chunk (see the sketch below this list)
    - Normalize
    - Annotate
    - Classify
    - Filter
    - Perform inheritance matching
    - Classify in the context of samples
    - Filter in the context of samples
  - Concatenate chunks resulting in one `vcf.gz` file per project
  - If `cram` data is available, slice the `cram` files to only keep relevant reads
  - Create report
For details, see here.
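To make the chunk, concatenate and slice steps concrete, the sketch below chunks a project `vcf.gz` per contig, normalizes each chunk with bcftools, concatenates the chunks and slices one cram down to the regions of the remaining variants. Annotation, classification, filtering and inheritance matching are VIP-specific steps that are not reproduced here; all file names and the per-contig chunking are assumptions.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


reference = "GRCh38.fasta"
chunks = ["chr20", "chr21"]   # one contig per chunk, for illustration

# Assumes project.vcf.gz is the validated, indexed vcf.gz of the project.
# Chunk the vcf.gz per contig and normalize each chunk.
for chunk in chunks:
    run(f"bcftools view -r {chunk} project.vcf.gz -Oz -o project.{chunk}.vcf.gz")
    run(
        f"bcftools norm -f {reference} -m -both "
        f"project.{chunk}.vcf.gz -Oz -o project.{chunk}.norm.vcf.gz"
    )

# Concatenate the processed chunks into one vcf.gz per project.
chunk_vcfs = " ".join(f"project.{c}.norm.vcf.gz" for c in chunks)
run(f"bcftools concat {chunk_vcfs} -Oz -o project.final.vcf.gz")
run("bcftools index -t project.final.vcf.gz")

# Slice the cram files down to the regions of the remaining variants so only
# relevant reads are kept for the report.
run(
    "bcftools query -f '%CHROM\\t%POS0\\t%END\\n' project.final.vcf.gz "
    "> relevant_regions.bed"
)
run(
    f"samtools view -C -T {reference} -L relevant_regions.bed "
    "-o sample0.sliced.cram sample0.cram"
)
run("samtools index sample0.sliced.cram")
```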