Given a set of reporter mutation, this functions counts the reads matching the reporter mutations in the sample to be tested, estimates the mismatch rate for the sample to be tested, and then runs a Monte Carlo simulation test to determine whether the tested sample is positive or negative.

test_ctDNA(mutations, bam, targets, reference, tag = "",
  ID_column = NULL, black_list = NULL, substitution_specific = TRUE,
  vaf_threshold = 0.1, min_base_quality = 30, max_depth = 1e+05,
  min_mapq = 40, bam_list = NULL, bam_list_tags = rep("",
  length(bam_list)), min_alt_reads = 1,
  min_samples = ceiling(length(bam_list)/10), n_simulations = 10000,
  pvalue_threshold = 0.05, seed = 123,
  informative_reads_threshold = 10000)

Arguments

mutations

A data frame with the reporter mutations. Should have the columns CHROM, POS, REF, ALT.

bam

path to bam file

targets

a data frame with the target regions. Must have three columns: chr, start and end

reference

the reference genome in BSgenome format

tag

the RG tag if the bam has more than one sample

ID_column

The name of the column that contains the ID of mutations in phase. All mutations in Phase should have the same ID in that column.

black_list

a character vector of genomic loci to filter. The format is chr_pos if substitution_specific is false or chr_pos_ref_alt if substitution_specific is true. If given, will override bam_list.

substitution_specific

logical, whether to have the loci of black_list by substitutions.

vaf_threshold

When calculating the background rate, the bases with higher than this VAF threshold will be ignored (real mutations/SNPs).

min_base_quality

minimum base quality for a read to be counted

max_depth

maximum depth above which sampling will happen

min_mapq

the minimum mapping quality for a read to be counted

bam_list

A vector containing the paths to bam files used to filter mutations. Mutations that have more than min_alt_reads in more than min_samples will be filtered. Using black_list is more recommended.

bam_list_tags

the RG tags for the bams included in bams list.

min_alt_reads

When bam_list is provided, this sets the minimum number of alternative allele reads for a sample to be counted.

min_samples

When number of samples having more than min_alt_reads exceeds this number, the mutation will be filtered.

n_simulations

the number of Monte Carlo simulations.

pvalue_threshold

the p-value threshold used to decide positivity or negativity.

seed

the random seed to make the Monte Carlo simulations reproducible.

informative_reads_threshold

the number of informative reads (unique reads mapping to specified mutations) under which the test will be undetermined.

Value

a data frame with the following columns:

  • sample: The sample name taken from SM field in the bam file or file base name

  • n_mutations: The number of mutations used in the test.

  • n_nonzero_alt: Number of mutations that have at least one read supporting alternative allele.

  • total_alt_reads: Total number of reads supporting alternative alleles of all mutations in input.

  • mutations_filtered: The number of filtered mutations.

  • background_rate: The background rate of the tested sample (after all adjustments)

  • informative_reads: The number of unique reads covering the mutations used.

  • multi_support_reads: The number of reads that support more than one mutations in phase. Non-zero values is a sign of positivity not used in the p-value calculation.

  • pvalue: The empirical p-value from the Monte Carlo test.

  • decision: The decision can be positive, negative or undetermined.

Details

This is the main function to test minimal residual disease by ctDNA positivity in a follow-up sample (e.g. after treatment). The inputs include a bam file for the follow-up sample to be tested, a list of reporter mutations (detected for example before treatment in a ctDNA positive sample), and an optional black_list (recommended to use) computed from a list of bam files of healthy-like samples or bam_list of the bam_files to use instead of black_list.

The workflow includes the following steps:

1.

Filtering mutations (optional but recommended): The mutations in the input will be filtered removing the ones reported in the black list. If bam_list is provided, the mutations will be filtered according to the min_samples and min_alt_reads parameters. See filter_mutations.

2.

The background rate will be computed for the input bam. The black list will be plugged in to exclude the black-listed loci when computing the background rate. The black list can be substitution_specific or not, but in both cases, the background rate will be adjusted accordingly. See get_background_rate.

3.

Counting reference and alternative alleles of the reporter mutations in the bam file.

4.

Merging mutations in phase (optional but recommended): If the ID_column is specified in the mutations input, mutations with the same ID (in phase) will be merged. While doing so, it is expected that real traces of mutations will be exhibited jointly in the reads spanning phased mutations, whereas artifacts will not show in all the covered mutations in phase. Therefore, the mismatches that map only to a subset of the mutations in phase but not the others (which are covered and show reference allele) will be removed. This will lead to reduction of the observed mismatches, and therefore, the computed background rate will be adjusted accordingly: new rate = old rate * (1 - purification_probability). See merge_mutations_in_phase.

5.

Monte Carlo test: this approach was used by Newman et al., Nature Biotechnology 2016.

  • Given \(N\) reporter mutations each with depth Di, randomly sample variant allele reads Xi under the background rate \(p\) using binomial distribution Xi ~ Binom(Di, \(p\)).

  • Repeat n_simuations times.

  • Count the number of simulations, where simulated data equal or exceed observed data in jointly two measurements: (1) the average VAF for the \(N\) mutations, and (2) the number of mutations with non-zero VAF.

  • Compute an empirical p-value as (#successes + 1)/(#simulations + 1)

6.

Make a decision: If number of informative reads is lower than informative_reads_threshold parameter, the decision will be undetermined. Otherwise, the pvalue_threshold parameters will be used to determine positivity or negativity.

See also

get_background_rate merge_mutations_in_phase create_black_list create_background_panel filter_mutations

Examples

## Load example data data("mutations", package = "ctDNAtools") data("targets", package = "ctDNAtools") bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools") bamT2 <- system.file("extdata", "T2.bam", package = "ctDNAtools") bamN1 <- system.file("extdata", "N1.bam", package = "ctDNAtools") bamN2 <- system.file("extdata", "N2.bam", package = "ctDNAtools") bamN3 <- system.file("extdata", "N3.bam", package = "ctDNAtools") ## Use human reference genome from BSgenome.Hsapiens.UCSC.hg19 library suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19)) # \donttest{ ## basic usage test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100 )
#> Analyzing Sample T1 ...
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> Running Monte Carlo simulations
#> Pvalue = 0.0099009900990099
#> Sample T1 is undetermined
#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 9 13 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001403503 1327 NA 0.00990099 undetermined
## More options test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, min_base_quality = 20, min_mapq = 30, vaf_threshold = 0.05 )
#> Analyzing Sample T1 ...
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> Running Monte Carlo simulations
#> Pvalue = 0.0099009900990099
#> Sample T1 is undetermined
#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 9 14 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001508991 1346 NA 0.00990099 undetermined
## Use phasing information test_ctDNA( mutations = mutations, bam = bamT2, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, ID_column = "PHASING" )
#> Analyzing Sample T2 ...
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> merging mutations in phase ...
#> Running Monte Carlo simulations
#> Pvalue = 1
#> Sample T2 is undetermined
#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T2 10 0 0 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 9.547441e-05 2413 0 1 undetermined
## Use a black list based on loci bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = FALSE ) bl1 <- create_black_list(bg_panel, mean_vaf_quantile = 0.99, min_samples_one_read = 2, min_samples_two_reads = 1 )
#> 7 loci added satisfying Mean VAF condition
#> 0 loci added satisfying one read condition
#> 2 loci added satisfying two reads condition
#> Black list has 7 loci
test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, ID_column = "PHASING", black_list = bl1, substitution_specific = FALSE )
#> Analyzing Sample T1 ...
#> Filtering mutations ...
#> Dropped 0 mutations
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> merging mutations in phase ...
#> Running Monte Carlo simulations
#> Pvalue = 0.0099009900990099
#> Sample T1 is undetermined
#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 6 9 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001193859 1327 1 0.00990099 undetermined
# } ## Use a substitution-specific black list bg_panel <- create_background_panel( bam_list = c(bamN1, bamN2, bamN3), targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, substitution_specific = TRUE ) bl2 <- create_black_list(bg_panel, mean_vaf_quantile = 0.99, min_samples_one_read = 2, min_samples_two_reads = 1 )
#> 19 loci added satisfying Mean VAF condition
#> 0 loci added satisfying one read condition
#> 2 loci added satisfying two reads condition
#> Black list has 19 loci
test_ctDNA( mutations = mutations, bam = bamT1, targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19, n_simulation = 100, ID_column = "PHASING", black_list = bl2, substitution_specific = TRUE )
#> Analyzing Sample T1 ...
#> Filtering mutations ...
#> Dropped 0 mutations
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> merging mutations in phase ...
#> Running Monte Carlo simulations
#> Pvalue = 0.0099009900990099
#> Sample T1 is undetermined
#> sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered #> 1 T1 10 6 9 0 #> background_rate informative_reads multi_support_reads pvalue decision #> 1 0.0001111016 1327 1 0.00990099 undetermined