Commit bec75027 authored by Manavalan Gajapathy

Preps QuaC for public availability

parent df9cde51
1 merge request: !6 Preps QuaC for public availability
Showing changes with 28 additions and 20 deletions
[submodule "configs/snakemake_slurm_profile"]
path = configs/snakemake_slurm_profile
url = git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/external-projects/snakemake_slurm_profile.git
[submodule "src/utility_cgds"]
path = src/utility_cgds
url = git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/utility-images.git
# Testing
Output from [Small variant caller
pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline)
are the inputs to QuaC pipeline. Hence following datasets are necessary for testing:
Input directory structure to QuaC is based on the output directory structure of the [Small variant caller
pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline).
Following files are necessary for testing:
1. bams
2. vcfs
3. QC output (from tools fastqc, fastq-screen and picard-markduplicates)
4. Sample rename config
3. Capture regions bed file - Required only for exome mode
4. QC output from tools fastqc, fastq-screen and picard-markduplicates - Required only if `priorQC` is used
5. Sample rename config - Required only if `priorQC` is used
Note: Be sure to preserve directory structure used in the output of Small variant caller
**Note**: If `priorQC` is used, be sure to preserve directory structure used in the output of CGDS Small variant caller
pipeline.
## Setup test datasets
* To setup test bam and vcf files, which are from sub-sampled NA12878 data, run:
### Required
* To setup test bam, vcf and capture region bed files, which are from sub-sampled NA12878 data, run:
```sh
cd .test
./setup_test_datasets.sh
```
* QuaC also needs test QC outputs for fastq (and sample rename config), which get created by small var caller pipeline.
This was achieved by running the small variant caller pipeline using its test datasets with some modifications. Steps
are briefly shown here:
### Optional - priorQC mode
* If used in `priorQC` mode, QuaC also needs test QC outputs for fastq (and sample rename config), which at CGDS get
created by the small var caller pipeline. Below, we create fastq QC and sample rename config using the small variant
caller pipeline for samples `A` and `B`.
```sh
cd <small_var_caller_pipeline_dir>
......
#family_id sample_id paternal_id maternal_id sex phenotype
unknown C father_1 mother_1 -9 -9
#family_id sample_id paternal_id maternal_id sex phenotype
unknown C father_1 mother_1 -9 -9
unknown D father_1 mother_1 -9 -9
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/dedup/A-1.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/A/dedup/A-1.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:39:58 UTC 2021
......
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-2.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/dedup/A-2.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/A/mapped/A-2.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/A/dedup/A-2.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:40:06 UTC 2021
......
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/mapped/B-1.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/dedup/B-1.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/B/mapped/B-1.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/B/dedup/B-1.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:39:58 UTC 2021
......
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/mapped/B-2.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/dedup/B-2.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/B/mapped/B-2.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/B/dedup/B-2.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:40:06 UTC 2021
......
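The metrics-file hunks above differ only in their path prefix: `/data/scratch/manag/test_pipeline` becomes `/test_pipeline`. The commit does not show how that change was made. As a minimal sketch, a prefix like this could be scrubbed with `sed`; the helper name `scrub_prefix` and the shortened sample line are illustrative assumptions, not part of the repository.

```sh
#!/bin/sh
# Hypothetical helper (not from the repository): rewrite the user-specific
# scratch prefix seen in the old metrics headers to the shorter prefix seen
# in the new ones. Reads stdin, writes stdout.
scrub_prefix() {
    # '#' is the sed delimiter because the pattern itself contains '/'.
    sed 's#/data/scratch/manag/test_pipeline#/test_pipeline#g'
}

# Example on a shortened, illustrative header line (not a full metrics file).
printf '%s\n' '# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam]' | scrub_prefix
```

Piping a metrics file through the function and writing the result to a new file keeps the original untouched until the output has been checked.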
File added
File added
chr20 59992 3653078
File added
File added
File added
File added
chr20 59992 3653078
File added
File added