Workflow¶
VIP consists of four workflows depending on the type of input data: fastq, bam/cram, gvcf or vcf.
The fastq workflow is an extension of the cram workflow. The cram and gvcf workflows are extensions of the vcf workflow.
The vcf workflow produces the pipeline outputs as described here.
The following sections provide an overview of the steps of each of these workflows.
FASTQ¶
The fastq workflow consists of the following steps:
- Parallelize sample sheet per sample and for each sample
  - Quality reporting and preprocessing using fastp
  - Alignment using minimap2 producing a `cram` file per sample (see the sketch below this list)
  - In case of multiple fastq files per sample, concatenate the `cram` output files
- Continue with step 3. of the `cram` workflow
For details, see here.
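VIP runs these tools from its own Nextflow processes with its own parameters, so the snippet below is only a rough sketch of equivalent command-line calls for a single paired-end Illumina sample. The file names, the read group string and the `-x sr` preset are assumptions for illustration, not VIP's actual configuration.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


# Hypothetical paired-end Illumina inputs for one sample.
sample = "sample0"
r1, r2 = "sample0_R1.fastq.gz", "sample0_R2.fastq.gz"
reference = "GRCh38.fasta"

# Quality reporting and preprocessing using fastp.
run(
    f"fastp --in1 {r1} --in2 {r2} "
    f"--out1 {sample}_trimmed_R1.fastq.gz --out2 {sample}_trimmed_R2.fastq.gz "
    f"--json {sample}_fastp.json --html {sample}_fastp.html"
)

# Alignment using minimap2 (short-read preset), sorted and stored as CRAM.
# The read group is needed later for phasing (see the cram workflow note).
read_group = rf"@RG\tID:{sample}\tSM:{sample}"
run(
    f"minimap2 -a -x sr -R '{read_group}' {reference} "
    f"{sample}_trimmed_R1.fastq.gz {sample}_trimmed_R2.fastq.gz "
    f"| samtools sort -O cram --reference {reference} -o {sample}.cram -"
)
run(f"samtools index {sample}.cram")

# With multiple fastq pairs per sample, the per-pair CRAM files could be
# combined with `samtools merge` before continuing with the cram workflow.
```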
CRAM¶
The cram workflow consists of the following steps:
- Parallelize sample sheet per sample and for each sample
  - Create validated, indexed `.bam` file from `bam`/`cram`/`sam` input
  - Generate coverage metrics using MosDepth (see the first sketch below this list), using the regions file provided in the sample sheet if present. If no regions file is provided, a default `.bed` file will be used containing all exons in case of WES data and all genes in case of WGS data.
  - Discover short tandem repeats and publish as intermediate result.
    - Using ExpansionHunter for Illumina short read data.
    - Using this fork of Straglr for PacBio and Nanopore long read data. It is a fork of this fork (https://github.com/philres/straglr) and is chosen over the original Straglr because of its VCF output, which enables VIP to combine the results with the SV and SNV data in the `vcf` workflow.
  - Discover copy number variants for PacBio and Nanopore long read data using Spectre and publish as intermediate result.
- Parallelize cram in chunks consisting of one or more contigs and for each chunk
  - Perform short variant calling with DeepVariant producing a `gvcf` file per chunk per sample; the gvcfs of the samples in a project are then merged into one vcf per project using GLnexus and phased using WhatsHap 1) (see the second sketch below).
  - Perform structural variant calling with Manta or cuteSV producing a `vcf` file per chunk per project.
- Concatenate short variant calling and structural variant calling `vcf` files per chunk per sample
- Continue with step 3. of the `vcf` workflow
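As a first sketch of the per-sample steps above, the commands below run MosDepth for coverage metrics and ExpansionHunter for short tandem repeats on one short-read sample. This is an illustration only: the regions file, sample name and ExpansionHunter variant catalog are assumptions, and VIP's actual parameters live in its Nextflow configuration.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


# Hypothetical inputs for one short-read sample.
sample = "sample0"
cram = f"{sample}.cram"
reference = "GRCh38.fasta"
regions_bed = "exons.bed"              # regions file from the sample sheet
str_catalog = "variant_catalog.json"   # ExpansionHunter repeat catalog

# Coverage metrics with MosDepth, restricted to the regions of interest.
run(f"mosdepth --fasta {reference} --by {regions_bed} {sample}_coverage {cram}")

# Short tandem repeat discovery with ExpansionHunter (Illumina short reads).
run(
    f"ExpansionHunter --reads {cram} --reference {reference} "
    f"--variant-catalog {str_catalog} --output-prefix {sample}_str"
)

# For PacBio/Nanopore data VIP uses the Straglr fork (STRs) and Spectre (CNVs)
# instead; their invocations are not shown here.
```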
For details, see here.
1): In case of projects containing pedigree information, phasing requires the bam/cram/sam file to have read group information in order to determine which reads belong to which family member. Phasing is skipped if this information is not present.
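A second sketch for the chunked variant calling: the commands below call short variants with DeepVariant for one chunk of a two-sample project, merge the resulting gvcfs with GLnexus and phase with WhatsHap. File names, the single-contig chunk, the `DeepVariantWGS` GLnexus preset and the bare `run_deepvariant` invocation (normally executed from the DeepVariant container image) are assumptions; structural variant calling with Manta or cuteSV is omitted.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


reference = "GRCh38.fasta"
samples = ["sample0", "sample1"]   # one project with two samples
chunk = "chr20"                    # a chunk consisting of a single contig

# Short variant calling with DeepVariant: one gvcf per chunk per sample.
for sample in samples:
    run(
        f"run_deepvariant --model_type=WGS --ref={reference} "
        f"--reads={sample}.cram --regions {chunk} "
        f"--output_vcf={sample}.{chunk}.vcf.gz "
        f"--output_gvcf={sample}.{chunk}.g.vcf.gz"
    )

# Merge the per-sample gvcfs into one project vcf with GLnexus.
gvcfs = " ".join(f"{s}.{chunk}.g.vcf.gz" for s in samples)
run(f"glnexus_cli --config DeepVariantWGS {gvcfs} > project.{chunk}.bcf")
run(f"bcftools view project.{chunk}.bcf -Oz -o project.{chunk}.vcf.gz")
run(f"bcftools index -t project.{chunk}.vcf.gz")

# Phase with WhatsHap; read group information in the CRAM files links reads to
# samples (see note 1), and a pedigree could be supplied with `--ped`.
crams = " ".join(f"{s}.cram" for s in samples)
run(
    f"whatshap phase --reference {reference} "
    f"-o project.{chunk}.phased.vcf.gz project.{chunk}.vcf.gz {crams}"
)

# Structural variant calling with Manta (short reads) or cuteSV (long reads)
# produces an additional vcf per chunk per project and is omitted here.
```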
gVCF¶
The gvcf workflow consists of the following steps:
- For each project in the sample sheet
  - Create validated, indexed `.g.vcf.gz` file from `bcf`/`bcf.gz`/`bcf.bgz`/`gvcf`/`gvcf.gz`/`gvcf.bgz`/`vcf`/`vcf.gz`/`vcf.bgz` inputs
  - Merge `.g.vcf.gz` files using GLnexus resulting in one `vcf.gz` per project (see the sketch below this list)
- Continue with step 3. of the `vcf` workflow
For details, see here.
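A minimal sketch of the first two steps, assuming two hypothetical per-sample inputs in mixed formats: each input is re-encoded as an indexed `.g.vcf.gz` with bcftools and the results are merged into one project `vcf.gz` with GLnexus. The `DeepVariantWGS` preset is an assumption; VIP selects its own GLnexus configuration.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


# Hypothetical per-sample gvcf inputs in mixed formats for one project.
inputs = {"sample0": "sample0.g.vcf", "sample1": "sample1.bcf"}

# Create validated, indexed .g.vcf.gz files from the bcf/vcf/gvcf inputs.
gvcfs = []
for sample, path in inputs.items():
    gvcf = f"{sample}.g.vcf.gz"
    run(f"bcftools view {path} -Oz -o {gvcf}")   # re-encode as bgzipped VCF
    run(f"bcftools index -t {gvcf}")             # tabix index
    gvcfs.append(gvcf)

# Merge the .g.vcf.gz files with GLnexus, resulting in one vcf.gz per project.
run(f"glnexus_cli --config DeepVariantWGS {' '.join(gvcfs)} > project.bcf")
run("bcftools view project.bcf -Oz -o project.vcf.gz")
run("bcftools index -t project.vcf.gz")
```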
VCF¶
The vcf workflow consists of the following steps:
- For each project in the sample sheet
  - Create validated, indexed `.vcf.gz` file from `bcf`|`bcf.gz`|`bcf.bgz`|`vcf`|`vcf.gz`|`vcf.bgz` input
  - Chunk `vcf.gz` files and for each chunk (see the sketch below this list)
    - Normalize
    - Annotate
    - Classify
    - Filter
    - Perform inheritance matching
    - Classify in the context of samples
    - Filter in the context of samples
  - Concatenate chunks resulting in one `vcf.gz` file per project
  - If `cram` data is available, slice the `cram` files to only keep relevant reads
  - Create report
For details, see here.
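To make the chunk, concatenate and slice steps concrete, the sketch below chunks a project `vcf.gz` per contig, normalizes each chunk with bcftools, concatenates the chunks and slices one cram down to the regions of the remaining variants. Annotation, classification, filtering and inheritance matching are VIP-specific steps that are not reproduced here; all file names and the per-contig chunking are assumptions.

```python
import subprocess


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


reference = "GRCh38.fasta"
chunks = ["chr20", "chr21"]   # one contig per chunk, for illustration

# Assumes project.vcf.gz is the validated, indexed vcf.gz of the project.
# Chunk the vcf.gz per contig and normalize each chunk.
for chunk in chunks:
    run(f"bcftools view -r {chunk} project.vcf.gz -Oz -o project.{chunk}.vcf.gz")
    run(
        f"bcftools norm -f {reference} -m -both "
        f"project.{chunk}.vcf.gz -Oz -o project.{chunk}.norm.vcf.gz"
    )

# Concatenate the processed chunks into one vcf.gz per project.
chunk_vcfs = " ".join(f"project.{c}.norm.vcf.gz" for c in chunks)
run(f"bcftools concat {chunk_vcfs} -Oz -o project.final.vcf.gz")
run("bcftools index -t project.final.vcf.gz")

# Slice the cram files down to the regions of the remaining variants so only
# relevant reads are kept for the report.
run(
    "bcftools query -f '%CHROM\\t%POS0\\t%END\\n' project.final.vcf.gz "
    "> relevant_regions.bed"
)
run(
    f"samtools view -C -T {reference} -L relevant_regions.bed "
    "-o sample0.sliced.cram sample0.cram"
)
run("samtools index sample0.sliced.cram")
```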