QuaC
🦆🦆 Don't duck that QC thingy 🦆🦆
############### NOTE ###############
This repo has been archived and is now hosted at https://github.com/uab-cgds-worthey/quac. Please use that site for the most recent changes to the repo.
###################################
What is QuaC?
QuaC is a snakemake-based pipeline that runs several QC tools for WGS/WES samples and then summarizes their results using pre-defined, configurable QC thresholds.
In summary, QuaC performs the following:
- Runs several QC tools using BAM and VCF files as input. At our center, CGDS, these files are produced as part of the small variant caller pipeline.
- Using the QuaC-Watch tool, performs QC checkup based on the expected thresholds for certain QC metrics and summarizes the results for easier human consumption.
- Aggregates QC output produced here, as well as that produced by the small variant caller pipeline, using MultiQC, both at the sample level and the project level.
- Optionally, the above QC checkup and QC aggregation steps can accept pre-run results from a few QC tools (FastQC, FastQ Screen, Picard's MarkDuplicates). At CGDS, these files are produced as part of the small variant caller pipeline.
QC tools utilized
QuaC quacks using the tools listed below:
Tool | Use |
---|---|
BAM QC | |
Qualimap | QCs alignment data in SAM/BAM files |
Picard-CollectMultipleMetrics | Summarizes alignment metrics from a SAM/BAM file using several modules |
Picard-CollectWgsMetrics | Collects metrics about coverage and performance of whole-genome sequencing (WGS) experiments |
mosdepth | Fast BAM/CRAM depth calculation |
indexcov | Estimates coverage from a whole-genome BAM or CRAM index (not run in exome mode) |
covviz | Identifies large, coverage-based anomalies (not run in exome mode) |
VCF QC | |
bcftools stats | Computes statistics for variants in VCF files |
Within-species contamination | |
verifybamid | Estimates within-species (i.e., cross-sample) contamination |
Sex, ancestry and relatedness estimation | |
somalier | Estimates sex, ancestry and relatedness |
Optional tools' results consumption
In addition to the above tools, QuaC can optionally utilize QC results produced by the tools below when run with the flag --include_prior_qc. At CGDS, these files are produced as part of the small variant caller pipeline.
Tool | Use |
---|---|
fastqc | Performs QC checks on raw sequence data (fastq) |
FastQ Screen | Screens fastq for other-species contamination |
Picard's MarkDuplicates | Determines level of read duplication on bam files |
QuaC-Watch
QuaC includes a tool called QuaC-Watch. After running all the QC tools for the samples, QuaC-Watch summarizes whether samples have passed the configurable QC thresholds defined in config files (available at configs/quac_watch/), both at the sample level and at the project level. This summary makes it easy to quickly review whether samples are of sufficient quality, and it highlights samples that need further review.
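For reference, the bundled threshold configs referenced elsewhere in this README can be inspected directly (a quick look, not a required step; the filenames are taken from the defaults and examples used later in this document):
# view the bundled QuaC-Watch threshold configs
less configs/quac_watch/wgs_quac_watch_config.yaml    # default, used for WGS runs
less configs/quac_watch/exome_quac_watch_config.yaml  # passed via --quac_watch_config for exome runs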
Installation
Installation requires
- fetching the source code
- creating the conda environment
Requirements
- Git v2.0+
- CGDS GitLab access
- SSH key for access to the Cheaha cluster
- Anaconda/miniconda
  - Tested with Anaconda3/2020.02
  - Available as a module on Cheaha at UAB
Retrieve pipeline source code
Go to the directory of your choice and run the command below.
git clone -b master \
git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/pipelines/quac.git
Create conda environment
The conda environment installs all dependencies necessary to run the QuaC workflow, including snakemake.
cd /path/to/quac/repo
# For use only at Cheaha in UAB. Load conda into environment.
module reset
module load Anaconda3/2020.02
# create conda environment. Needed only the first time.
conda env create --file configs/env/quac.yaml
# activate conda environment
conda activate quac
# if you need to update the existing environment
conda env update --file configs/env/quac.yaml
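As an optional sanity check (assuming the environment was created cleanly), confirm that the key dependencies are on PATH after activation:
# confirm key dependencies installed by the conda environment
snakemake --version   # tested with v6.0.5 (see Requirements below)
python --version      # tested with v3.6.13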
How to run QuaC
In order to run the QuaC pipeline, the user needs to:
- Install the pipeline and set up the conda environment (see above)
- Set up config files specifying paths required by QC tools used in the pipeline
- Run the QuaC pipeline just to create the singularity+conda environments using the testing dataset (optional)
Requirements
Direct
- Snakemake-minimal
  - Tested with v6.0.5
  - Gets installed as part of the conda environment
- Python
  - Tested with v3.6.13
  - Gets installed as part of the conda environment
- slurmpy
  - Tested with v0.0.8
  - Gets installed as part of the conda environment
Indirect
- Anaconda/miniconda
  - Tested with Anaconda3/2020.02
  - Available as a module on Cheaha at UAB
- Singularity
  - Tested with v3.5.2
  - Available as a module on Cheaha at UAB
The tools below are used in the QuaC pipeline, and snakemake automatically installs them in conda environments inside the singularity container; therefore, they do not need to be manually installed. For the tool versions used, refer to the snakemake rules.
- qualimap
- picard
- mosdepth
- indexcov
- covviz
- bcftools
- verifybamid
- somalier
- multiqc
Set up workflow config file
QuaC requires a workflow config file in YAML format (default: configs/workflow.yaml), which provides:
- Filepaths to necessary dataset dependencies required by certain QC tools
- Hardware resource configs
A custom workflow config file can be provided to QuaC via --workflow_config.
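For example, a run that points QuaC at a customized workflow config might look like the sketch below; my_workflow.yaml is a hypothetical copy of configs/workflow.yaml with edited paths, and the project and pedigree values are placeholders:
# copy the default workflow config and edit dataset paths and resource settings as needed
cp configs/workflow.yaml my_workflow.yaml
# point QuaC at the customized copy
python src/run_quac.py \
--project_name PROJECT_DUCK \
--pedigree "path/to/lake/with/pedigree_file.ped" \
--workflow_config my_workflow.yaml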
Prepare verifybamid datasets for exome analysis
This step is necessary only if QuaC will be run in exome mode (--exome).
verifybamid provides auxiliary resource files, which are necessary for analysis. However, in its exome resource files the chromosome contigs do not include the chr prefix, which is expected for our analyses at CGDS. Follow these steps to set up resource files with the chr prefix in their contig names.
# cd into exome resources dir
cd <path-to>/VerifyBamID-2.0.1/resource/exome/
# prepend "chr" to the contig name at the start of each line in the .bed and .mu files
sed -e 's/^/chr/' 1000g.phase3.10k.b38.exome.vcf.gz.dat.bed > 1000g.phase3.10k.b38_chr.exome.vcf.gz.dat.bed
sed -e 's/^/chr/' 1000g.phase3.10k.b38.exome.vcf.gz.dat.mu > 1000g.phase3.10k.b38_chr.exome.vcf.gz.dat.mu
# the .UD and .V files are copied unchanged under the new name
cp 1000g.phase3.10k.b38.exome.vcf.gz.dat.UD 1000g.phase3.10k.b38_chr.exome.vcf.gz.dat.UD
cp 1000g.phase3.10k.b38.exome.vcf.gz.dat.V 1000g.phase3.10k.b38_chr.exome.vcf.gz.dat.V
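If the commands above succeed, the four chr-prefixed copies should sit next to the originals; a quick, optional check:
# list the newly created chr-prefixed resource files
ls -1 1000g.phase3.10k.b38_chr.exome.vcf.gz.dat.*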
Run pipeline
After activating the conda environment, the QuaC pipeline can be run using the wrapper script src/run_quac.py. Here are all the available options:
$ python src/run_quac.py -h
usage: run_quac.py [-h] [--project_name] [--projects_path] [--pedigree]
[--quac_watch_config] [--workflow_config] [--outdir]
[--exome] [--include_prior_qc] [--allow_sample_renaming]
[--cluster_config] [--log_dir] [-e] [-n] [-l]
[--rerun_failed] [--slurm_partition]
Wrapper tool for QuaC pipeline.
optional arguments:
-h, --help show this help message and exit
QuaC workflow options:
--project_name Project name (default: None)
--projects_path Path where all projects are hosted. Do not include
project name here. (default:
/data/project/worthey_lab/projects/)
--pedigree Pedigree filepath. Must correspond to the project
supplied via --project_name (default: None)
--quac_watch_config YAML config path specifying QC thresholds for QuaC-
Watch (default:
configs/quac_watch/wgs_quac_watch_config.yaml)
--workflow_config YAML config path specifying filepath to dependencies
of tools used in QuaC (default: configs/workflow.yaml)
--outdir Out directory path (default:
$USER_SCRATCH/tmp/quac/results/test_project/analysis)
--exome Flag to run the workflow in exome mode. WARNING:
Please provide appropriate configs via
--quac_watch_config. (default: False)
--include_prior_qc Flag to additionally use prior QC data as input. See
documentation for more info. (default: False)
--allow_sample_renaming
Flag to allow sample renaming in MultiQC report. See
documentation for more info. (default: False)
QuaC wrapper options:
--cluster_config Cluster config json file. Needed for snakemake to run
jobs in cluster. (default: configs/cluster_config.json)
--log_dir Directory path where logs (both workflow's and
wrapper's) will be stored (default:
$USER_SCRATCH/tmp/quac/logs)
-e , --extra_args Pass additional custom args to snakemake. Equal symbol
is needed for assignment as in this example: -e='--
forceall' (default: None)
-n, --dryrun Flag to dry-run snakemake. Does not execute anything,
and just display what would be done. Equivalent to '--
extra_args "-n"' (default: False)
-l, --run_locally Flag to run the snakemake locally and not as a Slurm
job. Useful for testing purposes. (default: False)
--rerun_failed Number of times snakemake restarts failed jobs. This
may be set to >0 to avoid pipeline failing due to job
fails due to random SLURM issues (default: 1)
--slurm_partition Request a specific partition for the slurm resource
allocation for QuaC workflow. Available partitions in
Cheaha are: express(max 2 hrs), short(max 12 hrs),
medium(max 50 hrs), long(max 150 hrs) (default: short)
Minimal example to run the wrapper script, which in turn will execute the QuaC pipeline:
module reset
module load Anaconda3/2020.02
module load Singularity/3.5.2-GCC-5.4.0-2.26
conda activate quac
python src/run_quac.py \
--project_name PROJECT_DUCK \
--pedigree "path/to/lake/with/pedigree_file.ped"
NOTE:
Besides the basic features, the wrapper script src/run_quac.py offers the following:
- Pass custom snakemake args using the option --extra_args.
- Dry-run snakemake using the flag --dryrun. Note that this is the same as --extra_args='-n'.
- Override the cluster config file passed to snakemake using --cluster_config.
- Run snakemake locally, instead of submitting it to Slurm, using --run_locally. Note that jobs from snakemake will still be submitted to Slurm.
- Rerun failed jobs; by default they are retried once. This may be modified using --rerun_failed.
- Override the Slurm partition used via --slurm_partition.
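For instance, the dry-run and extra-args options can be combined with a normal invocation to preview or tweak what snakemake would do. A sketch using the documented options, with placeholder project and pedigree values:
# preview the jobs snakemake would run, without executing anything
python src/run_quac.py \
--project_name PROJECT_DUCK \
--pedigree "path/to/lake/with/pedigree_file.ped" \
--dryrun

# pass extra args through to snakemake, e.g. to force re-execution of all rules
python src/run_quac.py \
--project_name PROJECT_DUCK \
--pedigree "path/to/lake/with/pedigree_file.ped" \
-e='--forceall'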
Create singularity+conda environments for tools used in QuaC pipeline
All jobs initiated by QuaC run inside conda environments, which are themselves created inside a singularity container. It is a good idea to create these environments before running the pipeline on actual samples. While this step is optional, it ensures that there will not be any conflicts when running multiple instances of the pipeline.
Running the commands below will submit a slurm job that only creates these singularity+conda environments. Note that this slurm job will exit right after creating the environments, and it will not run any QC analyses on the provided input samples.
module reset
module load Anaconda3/2020.02
module load Singularity/3.5.2-GCC-5.4.0-2.26
conda activate quac
# WGS mode
python src/run_quac.py \
--project_name test_project \
--projects_path ".test/ngs-data/" \
--pedigree ".test/configs/no_priorQC/project_2samples.ped" \
--outdir "$USER_SCRATCH/tmp/quac/results/test_project_wgs/analysis" \
-e="--conda-create-envs-only"
Input requirements
- Pedigree file, supplied via --pedigree. Only the samples supplied in the pedigree file will be processed by QuaC, and all of these samples must belong to the same project. A minimal example is sketched after this list.
  - For CGDS use only: This repo includes a handy script src/create_dummy_ped.py that can create a dummy pedigree file, which will lack sex (unless a project tracking sheet is provided), relatedness and affected-status info. See the header of the script for usage instructions. Note that we plan to use Phenotips in the future to produce a fully capable pedigree file. One could also create pedigree files manually, but this can be error-prone.
- Output produced by the small variant caller pipeline. This includes BAM, VCF and QC output. Refer to the test sample dataset, which is representative of the required input.
- QuaC workflow config file. Refer to the section here for more info.
- When run in exome mode, QuaC requires a capture-regions BED file at path path_to_sample/configs/small_variant_caller/<capture_regions>.bed for each sample.
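For reference, here is a minimal sketch of a pedigree file in the standard 6-column PED format (family ID, sample ID, father ID, mother ID, sex, affected status). The family, sample and file names are placeholders, and columns must be tab-separated:
# columns: family_id  sample_id  father_id  mother_id  sex (1=male, 2=female, 0=unknown)  affected (1=unaffected, 2=affected, 0=unknown)
printf 'FAM_A\tA\t0\t0\t1\t1\n' >  PROJECT_DUCK.ped
printf 'FAM_B\tB\t0\t0\t2\t1\n' >> PROJECT_DUCK.ped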
Example usage
# to quack on a WGS project, which also has prior QC data
PROJECT="Quack_Quack"
python src/run_quac.py \
--project_name $PROJECT \
--pedigree "data/raw/ped/${PROJECT}.ped" \
--include_prior_qc \
--allow_sample_renaming
# to quack on a WGS project, run in a medium slurm partition and write results to a dir of choice
PROJECT="Quack_This"
python src/run_quac.py \
--slurm_partition medium \
--project_name $PROJECT \
--pedigree "data/raw/ped/${PROJECT}.ped" \
--outdir "$USER_SCRATCH/tmp/quac/results/test_${PROJECT}/analysis"
# to quack on an exome project
PROJECT="Quack_That"
python src/run_quac.py \
--project_name $PROJECT \
--pedigree "data/raw/ped/${PROJECT}.ped" \
--quac_watch_config "configs/quac_watch/exome_quac_watch_config.yaml" \
--exome
# to quack on an exome project by providing the path where the project is hosted
PROJECT="Quack_That"
python src/run_quac.py \
--project_name $PROJECT \
--projects_path "/path/to/projects/" \
--pedigree "data/raw/ped/${PROJECT}.ped" \
--quac_watch_config "configs/quac_watch/exome_quac_watch_config.yaml" \
--exome
Output
QuaC results are stored at the path specified via the option --outdir (default: $USER_SCRATCH/tmp/quac/results/test_project/analysis). Refer to the testing output to learn more about the output directory structure. Users may primarily be interested in the aggregated QC results produced by MultiQC, both at the sample level and at the project level. These MultiQC reports also include a summary of the QuaC-Watch results.
Note that QuaC's output directory structure has been designed based on the output structure of the small variant caller pipeline.
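Once a run finishes, the MultiQC reports can be located under the chosen output directory. Based on the expected output tree shown in the testing section below, something like this should find them (the path shown is the default --outdir):
# locate sample-level and project-level MultiQC reports
find "$USER_SCRATCH/tmp/quac/results/test_project/analysis" -name "*multiqc*.html"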
Testing pipeline
The system-level testing implemented for this pipeline tests whether the pipeline runs from start to finish without any error. This testing uses the test datasets present in .test/ngs-data/test_project, which reflect a test project containing four samples (two with the input needed when --include_prior_qc is used and two without prior QC data). See here for more info on how these test datasets were created.
NOTE: This testing does not verify that the pipeline's output is correct. Instead, its purpose is to ensure that the pipeline runs from beginning to end without any execution error for the given test dataset.
How to run
module reset
module load Anaconda3/2020.02
module load Singularity/3.5.2-GCC-5.4.0-2.26
conda activate quac
########## No prior QC data involved ##########
PROJECT_CONFIG="project_2samples"
PRIOR_QC_STATUS="no_priorQC"
# WGS mode
python src/run_quac.py \
--project_name test_project \
--projects_path ".test/ngs-data/" \
--pedigree ".test/configs/${PRIOR_QC_STATUS}/${PROJECT_CONFIG}.ped" \
--outdir "$USER_SCRATCH/tmp/quac/results/test_${PROJECT_CONFIG}_wgs-${PRIOR_QC_STATUS}/analysis"
# Exome mode
python src/run_quac.py \
--project_name test_project \
--projects_path ".test/ngs-data/" \
--pedigree ".test/configs/${PRIOR_QC_STATUS}/${PROJECT_CONFIG}.ped" \
--outdir "$USER_SCRATCH/tmp/quac/results/test_${PROJECT_CONFIG}_exome-${PRIOR_QC_STATUS}/analysis" \
--quac_watch_config "configs/quac_watch/exome_quac_watch_config.yaml" \
--exome
########## Includes prior QC data and allows sample renaming ##########
PROJECT_CONFIG="project_2samples"
PRIOR_QC_STATUS="include_priorQC"
# WGS mode
python src/run_quac.py \
--project_name test_project \
--projects_path ".test/ngs-data/" \
--pedigree ".test/configs/${PRIOR_QC_STATUS}/${PROJECT_CONFIG}.ped" \
--outdir "$USER_SCRATCH/tmp/quac/results/test_${PROJECT_CONFIG}_wgs-${PRIOR_QC_STATUS}/analysis" \
--include_prior_qc \
--allow_sample_renaming
# Exome mode
python src/run_quac.py \
--project_name test_project \
--projects_path ".test/ngs-data/" \
--pedigree ".test/configs/${PRIOR_QC_STATUS}/${PROJECT_CONFIG}.ped" \
--outdir "$USER_SCRATCH/tmp/quac/results/test_${PROJECT_CONFIG}_exome-${PRIOR_QC_STATUS}/analysis" \
--quac_watch_config "configs/quac_watch/exome_quac_watch_config.yaml" \
--include_prior_qc \
--allow_sample_renaming \
--exome
Note: Use PROJECT_CONFIG="project_1sample" to test a project with only one sample.
Expected output files
$ tree $USER_SCRATCH/tmp/quac/results/test_project_2_samples/ -d -L 4
/data/scratch/manag/tmp/quac/results/test_project_2_samples/
└── analysis
├── A
│ └── qc
│ ├── bcftools-index
│ │ └── ...
│ ├── bcftools-stats
│ │ └── ...
│ ├── mosdepth
│ │ └── ...
│ ├── multiqc_final_pass
│ │ ├── ...
│ │ └── A_multiqc.html
│ ├── multiqc_initial_pass
│ │ ├── ...
│ │ └── A_multiqc.html
│ ├── picard-stats
│ │ └── ...
│ ├── quac_watch
│ │ └── ...
│ ├── qualimap
│ │ └── ...
│ ├── samtools-stats
│ │ └── ...
│ └── verifyBamID
│ └── ...
├── B
│ └── qc
│ └── same directory structure as that of sample A
└── project_level_qc
├── covviz
│ └── ...
├── indexcov
│ └── ...
├── mosdepth
│ └── ...
├── multiqc
│ ├── ...
│ └── multiqc_report.html
└── somalier
├── ancestry
│ └── ...
├── extract
│ └── ...
└── relatedness
└── ...
Note: Certain tools (e.g., indexcov and covviz) are not executed when QuaC is run in exome mode (--exome).
Visualization of workflow
Visualizations of the pipeline based on the test datasets are available in the directory pipeline_visualized. Commands used to create these visualizations:
# open interactive node
srun --ntasks=1 --cpus-per-task=1 --mem-per-cpu=4096 --partition=express --pty /bin/bash
# setup environment
module reset
module load Anaconda3/2020.02
module load Singularity/3.5.2-GCC-5.4.0-2.26
conda activate quac
DAG_DIR="pipeline_visualized"
###### WGS mode ######
# DAG
python src/run_quac.py \
--project_name test_project \
--projects_path .test/ngs-data/ \
--pedigree .test/configs/project_2_samples.ped \
--run_locally --extra_args "--dag -F | dot -Tpng > ${DAG_DIR}/wgs_dag.png"
# Rulegraph - less informative than the DAG at the sample level, but its lower density makes it easier to skim
python src/run_quac.py \
--project_name test_project \
--projects_path .test/ngs-data/ \
--pedigree .test/configs/project_2_samples.ped \
--run_locally --extra_args "--rulegraph -F | dot -Tpng > ${DAG_DIR}/wgs_rulegraph.png"
###### Exome mode ######
# DAG
python src/run_quac.py \
--project_name test_project \
--projects_path .test/ngs-data/ \
--pedigree .test/configs/project_2_samples.ped \
--exome \
--quac_watch_config "configs/quac_watch/exome_quac_watch_config.yaml" \
--run_locally --extra_args "--dag -F | dot -Tpng > ${DAG_DIR}/exome_dag.png"
# Rulegraph - less informative than the DAG at the sample level, but its lower density makes it easier to skim
python src/run_quac.py \
--project_name test_project \
--projects_path .test/ngs-data/ \
--pedigree .test/configs/project_2_samples.ped \
--exome \
--quac_watch_config "configs/quac_watch/exome_quac_watch_config.yaml" \
--run_locally --extra_args "--rulegraph -F | dot -Tpng > ${DAG_DIR}/exome_rulegraph.png"
Contributing
If you would like to make changes to the source code, please see the contribution guidelines.
Changelog
See here.
Repo owner
- Manavalan Gajapathy