Tests the ctDNA positivity of a sample

Given a set of reporter mutation, this functions counts the reads matching the reporter mutations in the sample to be tested, estimates the mismatch rate for the sample to be tested, and then runs a Monte Carlo simulation test to determine whether the tested sample is positive or negative.

test_ctDNA(mutations, bam, targets, reference, tag = "",
  ID_column = NULL, black_list = NULL, substitution_specific = TRUE,
  vaf_threshold = 0.1, min_base_quality = 30, max_depth = 1e+05,
  min_mapq = 40, bam_list = NULL, bam_list_tags = rep("",
  length(bam_list)), min_alt_reads = 1,
  min_samples = ceiling(length(bam_list)/10), n_simulations = 10000,
  pvalue_threshold = 0.05, seed = 123,
  informative_reads_threshold = 10000)

Arguments

mutations	A data frame with the reporter mutations. Should have the columns CHROM, POS, REF, ALT.
bam	path to bam file
targets	a data frame with the target regions. Must have three columns: chr, start and end
reference	the reference genome in BSgenome format
tag	the RG tag if the bam has more than one sample
ID_column	The name of the column that contains the ID of mutations in phase. All mutations in Phase should have the same ID in that column.
black_list	a character vector of genomic loci to filter. The format is chr_pos if substitution_specific is false or chr_pos_ref_alt if substitution_specific is true. If given, will override `bam_list`.
substitution_specific	logical, whether to have the loci of black_list by substitutions.
vaf_threshold	When calculating the background rate, the bases with higher than this VAF threshold will be ignored (real mutations/SNPs).
min_base_quality	minimum base quality for a read to be counted
max_depth	maximum depth above which sampling will happen
min_mapq	the minimum mapping quality for a read to be counted
bam_list	A vector containing the paths to bam files used to filter mutations. Mutations that have more than min_alt_reads in more than min_samples will be filtered. Using black_list is more recommended.
bam_list_tags	the RG tags for the bams included in bams list.
min_alt_reads	When bam_list is provided, this sets the minimum number of alternative allele reads for a sample to be counted.
min_samples	When number of samples having more than `min_alt_reads` exceeds this number, the mutation will be filtered.
n_simulations	the number of Monte Carlo simulations.
pvalue_threshold	the p-value threshold used to decide positivity or negativity.
seed	the random seed to make the Monte Carlo simulations reproducible.
informative_reads_threshold	the number of informative reads (unique reads mapping to specified mutations) under which the test will be undetermined.

Value

a data frame with the following columns:

sample: The sample name taken from SM field in the bam file or file base name
n_mutations: The number of mutations used in the test.
n_nonzero_alt: Number of mutations that have at least one read supporting alternative allele.
total_alt_reads: Total number of reads supporting alternative alleles of all mutations in input.
mutations_filtered: The number of filtered mutations.
background_rate: The background rate of the tested sample (after all adjustments)
informative_reads: The number of unique reads covering the mutations used.
multi_support_reads: The number of reads that support more than one mutations in phase. Non-zero values is a sign of positivity not used in the p-value calculation.
pvalue: The empirical p-value from the Monte Carlo test.
decision: The decision can be positive, negative or undetermined.

Details

This is the main function to test minimal residual disease by ctDNA positivity in a follow-up sample (e.g. after treatment). The inputs include a bam file for the follow-up sample to be tested, a list of reporter mutations (detected for example before treatment in a ctDNA positive sample), and an optional black_list (recommended to use) computed from a list of bam files of healthy-like samples or bam_list of the bam_files to use instead of black_list.

The workflow includes the following steps:

1.

Filtering mutations (optional but recommended): The mutations in the input will be filtered removing the ones reported in the black list. If bam_list is provided, the mutations will be filtered according to the min_samples and min_alt_reads parameters. See filter_mutations.

2.

The background rate will be computed for the input bam. The black list will be plugged in to exclude the black-listed loci when computing the background rate. The black list can be substitution_specific or not, but in both cases, the background rate will be adjusted accordingly. See get_background_rate.

3.

Counting reference and alternative alleles of the reporter mutations in the bam file.

4.

Merging mutations in phase (optional but recommended): If the ID_column is specified in the mutations input, mutations with the same ID (in phase) will be merged. While doing so, it is expected that real traces of mutations will be exhibited jointly in the reads spanning phased mutations, whereas artifacts will not show in all the covered mutations in phase. Therefore, the mismatches that map only to a subset of the mutations in phase but not the others (which are covered and show reference allele) will be removed. This will lead to reduction of the observed mismatches, and therefore, the computed background rate will be adjusted accordingly: new rate = old rate * (1 - purification_probability). See merge_mutations_in_phase.

5.

Monte Carlo test: this approach was used by Newman et al., Nature Biotechnology 2016.

Given \(N\) reporter mutations each with depth D_i, randomly sample variant allele reads X_i under the background rate \(p\) using binomial distribution X_i ~ Binom(D_i, \(p\)).
Repeat n_simuations times.
Count the number of simulations, where simulated data equal or exceed observed data in jointly two measurements: (1) the average VAF for the \(N\) mutations, and (2) the number of mutations with non-zero VAF.
Compute an empirical p-value as (#successes + 1)/(#simulations + 1)

6.

Make a decision: If number of informative reads is lower than informative_reads_threshold parameter, the decision will be undetermined. Otherwise, the pvalue_threshold parameters will be used to determine positivity or negativity.

Examples

## Load example data
data("mutations", package = "ctDNAtools")
data("targets", package = "ctDNAtools")
bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools")
bamT2 <- system.file("extdata", "T2.bam", package = "ctDNAtools")
bamN1 <- system.file("extdata", "N1.bam", package = "ctDNAtools")
bamN2 <- system.file("extdata", "N2.bam", package = "ctDNAtools")
bamN3 <- system.file("extdata", "N3.bam", package = "ctDNAtools")

## Use human reference genome from BSgenome.Hsapiens.UCSC.hg19 library
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))
# \donttest{
## basic usage
test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100
)
#> Analyzing Sample T1 ...
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> Running Monte Carlo simulations
#> Pvalue =  0.0099009900990099
#> Sample T1 is undetermined
#>   sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered
#> 1     T1          10             9              13                  0
#>   background_rate informative_reads multi_support_reads     pvalue     decision
#> 1    0.0001403503              1327                  NA 0.00990099 undetermined

## More options
test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, min_base_quality = 20, min_mapq = 30,
  vaf_threshold = 0.05
)
#> Analyzing Sample T1 ...
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> Running Monte Carlo simulations
#> Pvalue =  0.0099009900990099
#> Sample T1 is undetermined
#>   sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered
#> 1     T1          10             9              14                  0
#>   background_rate informative_reads multi_support_reads     pvalue     decision
#> 1    0.0001508991              1346                  NA 0.00990099 undetermined

## Use phasing information
test_ctDNA(
  mutations = mutations, bam = bamT2,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING"
)
#> Analyzing Sample T2 ...
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> merging mutations in phase ...
#> Running Monte Carlo simulations
#> Pvalue =  1
#> Sample T2 is undetermined
#>   sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered
#> 1     T2          10             0               0                  0
#>   background_rate informative_reads multi_support_reads pvalue     decision
#> 1    9.547441e-05              2413                   0      1 undetermined

## Use a black list based on loci
bg_panel <- create_background_panel(
  bam_list = c(bamN1, bamN2, bamN3),
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  substitution_specific = FALSE
)

bl1 <- create_black_list(bg_panel,
  mean_vaf_quantile = 0.99,
  min_samples_one_read = 2, min_samples_two_reads = 1
)
#> 7 loci added satisfying Mean VAF condition
#> 0 loci added satisfying one read condition
#> 2 loci added satisfying two reads condition
#> Black list has 7 loci

test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING", black_list = bl1,
  substitution_specific = FALSE
)
#> Analyzing Sample T1 ...
#> Filtering mutations ...
#> Dropped 0 mutations
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> merging mutations in phase ...
#> Running Monte Carlo simulations
#> Pvalue =  0.0099009900990099
#> Sample T1 is undetermined
#>   sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered
#> 1     T1          10             6               9                  0
#>   background_rate informative_reads multi_support_reads     pvalue     decision
#> 1    0.0001193859              1327                   1 0.00990099 undetermined
# }
## Use a substitution-specific black list
bg_panel <- create_background_panel(
  bam_list = c(bamN1, bamN2, bamN3),
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  substitution_specific = TRUE
)

bl2 <- create_black_list(bg_panel,
  mean_vaf_quantile = 0.99,
  min_samples_one_read = 2, min_samples_two_reads = 1
)
#> 19 loci added satisfying Mean VAF condition
#> 0 loci added satisfying one read condition
#> 2 loci added satisfying two reads condition
#> Black list has 19 loci

test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING", black_list = bl2,
  substitution_specific = TRUE
)
#> Analyzing Sample T1 ...
#> Filtering mutations ...
#> Dropped 0 mutations
#> Estimating background rate ...
#> Getting ref and alt Counts ...
#> merging mutations in phase ...
#> Running Monte Carlo simulations
#> Pvalue =  0.0099009900990099
#> Sample T1 is undetermined
#>   sample n_mutations n_nonzero_alt total_alt_reads mutations_filtered
#> 1     T1          10             6               9                  0
#>   background_rate informative_reads multi_support_reads     pvalue     decision
#> 1    0.0001111016              1327                   1 0.00990099 undetermined

Arguments

Value

Details

See also

Examples

Contents