R/create_background_panel.R
create_background_panel.Rd
This function scans the targets regions in the list of bam files, and reports the number of reference, non-reference reads for each loci in addition to the non-reference (VAF) allele frequency. Loci with VAF higher than vaf_threshold are masked with NA.
create_background_panel(bam_list, targets, reference, vaf_threshold = 0.05, bam_list_tags = rep("", length(bam_list)), min_base_quality = 10, max_depth = 1e+05, min_mapq = 20, substitution_specific = TRUE)
bam_list | A character vector of paths to bam files. |
---|---|
targets | The targets data frame must have the columns chr, start and end. |
reference | The reference genome as BSgenome object. |
vaf_threshold | Loci with the fraction of non-reference reads above this value are masked with NA. |
bam_list_tags | RG tags for the list of bam files. By default, the whole bam file will be used. |
min_base_quality | The minimum base quality to count a read for a loci. |
max_depth | Maximum depth for the pileup |
min_mapq | The minimum mapping quality to count a read for a loci |
substitution_specific | logical, whether to have the loci by substitutions. |
A named list having depth, alt and vaf data frames. Each has the same order of loci in rows and the input samples in columns.
Extracts the depth, variant allele counts and variant allele frequency (VAF) for each genomic position in the input targets
across a panel of bam files (e.g. from healthy samples to represent only technical noise). The extracted information can be fed to
create_black_list
in order to extract a black listed loci according to defined criteria
The function support two modes, either loci-specific regardless of the basepair substitution, or substitution-specific where each substitution class (e.g. C>T, C>G) are quantified separately. This behavior is controlled by the substitution_specific parameter.
VAF above vaf_threshold parameters are masked with NA, to exclude real SNPs/mutations.
Since this function can take a long time when the bam_list comprises a large number of bam files, the function supports multi-threading using the furrr and future R packages. All you need to do is call 'plan(multiprocess)' or other multi-threading strategies before calling this function.
create_black_list
test_ctDNA
# \donttest{ ## Load example data data("targets", package = "ctDNAtools") bamN1 <- system.file("extdata", "N1.bam", package = "ctDNAtools") bamN2 <- system.file("extdata", "N2.bam", package = "ctDNAtools") bamN3 <- system.file("extdata", "N3.bam", package = "ctDNAtools") ## Use human reference genome from BSgenome.Hsapiens.UCSC.hg19 library suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19)) ## Use a black list based on loci bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = FALSE )#> #>#>#> #>#>#> #>#>#> #>bl1 <- create_black_list(bg_panel, mean_vaf_quantile = 0.99, min_samples_one_read = 2, min_samples_two_reads = 1 )#>#>#>#>## Use a substitution-specific black list bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = TRUE ) bl2 <- create_black_list(bg_panel, mean_vaf_quantile = 0.99, min_samples_one_read = 2, min_samples_two_reads = 1 )#>#>#>#>## Multi-threading (commented) ## library(furrr) ## plan(multiprocess) ## plan(multiprocess, workers = 3) bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = TRUE ) # }