Given a set of reporter mutation, this functions counts the reads matching the reporter mutations in the sample to be tested, estimates the mismatch rate for the sample to be tested, and then runs a Monte Carlo simulation test to determine whether the tested sample is positive or negative.
test_ctDNA(mutations, bam, targets, reference, tag = "", ID_column = NULL, black_list = NULL, substitution_specific = TRUE, vaf_threshold = 0.1, min_base_quality = 30, max_depth = 1e+05, min_mapq = 40, bam_list = NULL, bam_list_tags = rep("", length(bam_list)), min_alt_reads = 1, min_samples = ceiling(length(bam_list)/10), n_simulations = 10000, pvalue_threshold = 0.05, seed = 123, informative_reads_threshold = 10000)
mutations | A data frame with the reporter mutations. Should have the columns CHROM, POS, REF, ALT. |
---|---|
bam | path to bam file |
targets | a data frame with the target regions. Must have three columns: chr, start and end |
reference | the reference genome in BSgenome format |
tag | the RG tag if the bam has more than one sample |
ID_column | The name of the column that contains the ID of mutations in phase. All mutations in Phase should have the same ID in that column. |
black_list | a character vector of genomic loci to filter. The format is chr_pos if substitution_specific is false or
chr_pos_ref_alt if substitution_specific is true. If given, will override |
substitution_specific | logical, whether to have the loci of black_list by substitutions. |
vaf_threshold | When calculating the background rate, the bases with higher than this VAF threshold will be ignored (real mutations/SNPs). |
min_base_quality | minimum base quality for a read to be counted |
max_depth | maximum depth above which sampling will happen |
min_mapq | the minimum mapping quality for a read to be counted |
bam_list | A vector containing the paths to bam files used to filter mutations. Mutations that have more than min_alt_reads in more than min_samples will be filtered. Using black_list is more recommended. |
bam_list_tags | the RG tags for the bams included in bams list. |
min_alt_reads | When bam_list is provided, this sets the minimum number of alternative allele reads for a sample to be counted. |
min_samples | When number of samples having more than |
n_simulations | the number of Monte Carlo simulations. |
pvalue_threshold | the p-value threshold used to decide positivity or negativity. |
seed | the random seed to make the Monte Carlo simulations reproducible. |
informative_reads_threshold | the number of informative reads (unique reads mapping to specified mutations) under which the test will be undetermined. |
a data frame with the following columns:
sample: The sample name taken from SM field in the bam file or file base name
n_mutations: The number of mutations used in the test.
n_nonzero_alt: Number of mutations that have at least one read supporting alternative allele.
total_alt_reads: Total number of reads supporting alternative alleles of all mutations in input.
mutations_filtered: The number of filtered mutations.
background_rate: The background rate of the tested sample (after all adjustments)
informative_reads: The number of unique reads covering the mutations used.
multi_support_reads: The number of reads that support more than one mutations in phase. Non-zero values is a sign of positivity not used in the p-value calculation.
pvalue: The empirical p-value from the Monte Carlo test.
decision: The decision can be positive, negative or undetermined.
This is the main function to test minimal residual disease by ctDNA positivity in a follow-up sample (e.g. after treatment). The inputs include a bam file for the follow-up sample to be tested, a list of reporter mutations (detected for example before treatment in a ctDNA positive sample), and an optional black_list (recommended to use) computed from a list of bam files of healthy-like samples or bam_list of the bam_files to use instead of black_list.
The workflow includes the following steps:
Filtering mutations (optional but recommended): The mutations in the input will be filtered removing the ones reported in the black list. If bam_list is provided,
the mutations will be filtered according to the min_samples and min_alt_reads parameters. See filter_mutations
.
The background rate will be computed for the input bam. The black list will be plugged in to exclude the black-listed loci when computing the background rate.
The black list can be substitution_specific or not, but in both cases, the background rate will be adjusted accordingly. See get_background_rate
.
Counting reference and alternative alleles of the reporter mutations in the bam file.
Merging mutations in phase (optional but recommended): If the ID_column is specified in the mutations input, mutations with the same ID (in phase) will be merged.
While doing so, it is expected that real traces of mutations will be exhibited jointly in the reads spanning phased mutations, whereas artifacts will not show in all the covered
mutations in phase. Therefore, the mismatches that map only to a subset of the mutations in phase but not the others (which are covered and show reference allele) will be removed.
This will lead to reduction of the observed mismatches, and therefore, the computed background rate will be adjusted accordingly:
new rate = old rate * (1 - purification_probability). See merge_mutations_in_phase
.
Monte Carlo test: this approach was used by Newman et al., Nature Biotechnology 2016.
Given \(N\) reporter mutations each with depth Di, randomly sample variant allele reads Xi under the background rate \(p\) using binomial distribution Xi ~ Binom(Di, \(p\)).
Repeat n_simuations
times.
Count the number of simulations, where simulated data equal or exceed observed data in jointly two measurements: (1) the average VAF for the \(N\) mutations, and (2) the number of mutations with non-zero VAF.
Compute an empirical p-value as (#successes + 1)/(#simulations + 1)
Make a decision: If number of informative reads is lower than informative_reads_threshold
parameter,
the decision will be undetermined. Otherwise, the pvalue_threshold
parameters will be used to determine positivity or negativity.
get_background_rate
merge_mutations_in_phase
create_black_list
create_background_panel
filter_mutations
## Load example data data("mutations", package = "ctDNAtools") data("targets", package = "ctDNAtools") bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools") bamT2 <- system.file("extdata", "T2.bam", package = "ctDNAtools") bamN1 <- system.file("extdata", "N1.bam", package = "ctDNAtools") bamN2 <- system.file("extdata", "N2.bam", package = "ctDNAtools") bamN3 <- system.file("extdata", "N3.bam", package = "ctDNAtools") ## Use human reference genome from BSgenome.Hsapiens.UCSC.hg19 library suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19)) # \donttest{ ## basic usage test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100 )#>#>#>#>#>#>#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 9 13 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001403503 1327 NA 0.00990099 undetermined## More options test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, min_base_quality = 20, min_mapq = 30, vaf_threshold = 0.05 )#>#>#>#>#>#>#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 9 14 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001508991 1346 NA 0.00990099 undetermined## Use phasing information test_ctDNA( mutations = mutations, bam = bamT2, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, ID_column = "PHASING" )#>#>#>#>#>#>#>#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T2 10 0 0 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 9.547441e-05 2413 0 1 undetermined## Use a black list based on loci bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = FALSE ) bl1 <- create_black_list(bg_panel, mean_vaf_quantile = 0.99, min_samples_one_read = 2, min_samples_two_reads = 1 )#>#>#>#>test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, ID_column = "PHASING", black_list = bl1, substitution_specific = FALSE )#>#>#>#>#>#>#>#>#>#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 6 9 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001193859 1327 1 0.00990099 undetermined# } ## Use a substitution-specific black list bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = TRUE ) bl2 <- create_black_list(bg_panel, mean_vaf_quantile = 0.99, min_samples_one_read = 2, min_samples_two_reads = 1 )#>#>#>#>test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, ID_column = "PHASING", black_list = bl2, substitution_specific = TRUE )#>#>#>#>#>#>#>#>#>#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 6 9 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001111016 1327 1 0.00990099 undetermined