R/merge_mutations_in_phase.R
merge_mutations_in_phase.Rd
Given a mutations data frame and a BAM file, this function collapses mutations in phase, as identified by ID_column, into one event. In doing so, it ignores reads that support the reference allele for one mutation and the alternative allele for another mutation in the same phased group.
merge_mutations_in_phase(mutations, bam, tag = "", ID_column = "phasingID", min_base_quality = 20, min_mapq = 30)
Argument | Description |
---|---|
mutations | A data frame with the reporter mutations. Should have the columns CHROM, POS, REF, ALT. |
bam | Path to the BAM file. |
tag | The RG tag, if the BAM file has more than one sample. |
ID_column | The name of the column in the mutations data frame that holds the IDs for mutations in phase. NA values are filled automatically with unique mutation identifiers. |
min_base_quality | Minimum base quality for a read to be counted. |
min_mapq | Integer specifying the minimum mapping quality for reads to be included. |
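The automatic filling of NA values in ID_column can be pictured with a small base R sketch. The toy data frame below is invented for illustration, and the CHROM:POS_REF_ALT identifier format is an assumption inferred from IDs such as chr14:106327838_T_A in the example output; the package's internal code may differ.

```r
# Toy input with the required columns plus a phasing ID column (illustrative data only).
mutations <- data.frame(
  CHROM = c("chr14", "chr14", "chr14"),
  POS = c(106327838L, 106327869L, 106327880L),
  REF = c("T", "C", "G"),
  ALT = c("A", "A", "T"),
  phasingID = c(NA, "event1", "event1"),
  stringsAsFactors = FALSE
)

# Assumed fallback identifier format: CHROM:POS_REF_ALT.
fallback_id <- paste0(
  mutations$CHROM, ":", mutations$POS, "_", mutations$REF, "_", mutations$ALT
)

# Replace only the missing IDs; phased mutations keep their shared ID.
filled <- mutations
missing_id <- is.na(filled$phasingID)
filled$phasingID[missing_id] <- fallback_id[missing_id]

filled$phasingID
```

With this filling, an unphased mutation becomes its own single-mutation event, while mutations that share an ID are still grouped together.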
A list with the following slots:
out: A data frame that has the columns:
  Phasing_id: the ID of the mutations/event.
  ref: number of reference reads.
  alt: number of alternative reads.
  n_reads_multi_mutation: number of reads that span more than one mutation in phase.
  all_reads: total number of reads.
  multi_support: number of reads that support the alt allele of multiple mutations in phase.
purification_prob: Probability of purification, computed as sum(n_reads_multi_mutation)/sum(all_reads).
multi_support: Number of multi-support reads across all mutations/events.
informative_reads: Number of unique reads covering the mutations/events.
Mutations in phase are those that are supported by the same reads (same allele). The function does not identify mutations in phase itself; instead, it uses the input column named by ID_column to determine which mutations are in phase.
Since two or more mutations can be supported by the same evidence, this function merges these mutations into one event. The function also removes mismatches that are not exhibited in all of the phased mutations a read covers, because it was developed for minimal residual disease testing.
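The merging rule can be illustrated with a toy example in base R. This is a conceptual sketch, not the package's implementation: the `calls` matrix, the `classify_read` helper, and the exact counting rules (for instance, excluding mixed reads from every count) are assumptions made for illustration.

```r
# Hypothetical per-read allele calls for one phased event:
# rows are reads, columns are mutations; NA means the read does not cover the site.
calls <- rbind(
  read1 = c(m1 = "alt", m2 = "alt"),
  read2 = c(m1 = "ref", m2 = "ref"),
  read3 = c(m1 = "alt", m2 = "ref"),  # mixed support: ignored
  read4 = c(m1 = "alt", m2 = NA),
  read5 = c(m1 = NA, m2 = "ref")
)

# Label a read by the alleles it shows at the sites it covers.
classify_read <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  if (all(x == "alt")) return("alt")
  if (all(x == "ref")) return("ref")
  "mixed"
}

read_class <- apply(calls, 1, classify_read)
covered <- rowSums(!is.na(calls))   # number of phased sites each read spans
kept <- read_class != "mixed"       # drop reads with conflicting alleles

# One merged row for the whole event (assumed counting rules).
summary_row <- data.frame(
  ref = sum(read_class == "ref"),
  alt = sum(read_class == "alt"),
  n_reads_multi_mutation = sum(kept & covered > 1),
  all_reads = sum(kept),
  multi_support = sum(read_class == "alt" & covered > 1)
)
summary_row
```

Here read3 shows the alt allele at one site and the ref allele at the other, so it is excluded, which mirrors the idea that a mismatch must appear at every covered phased site to count as support.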
The output includes the merged mutations and the probability of purification, which is computed as sum(n_reads_multi_mutation)/sum(all_reads): the number of reads covering at least two mutations in phase divided by the total read count summed over all events. The informative reads count, reported separately, is the total number of unique reads mapping to the input mutations (including both mutations in phase and other mutations).
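As a quick check of the formula listed for the return value, sum(n_reads_multi_mutation)/sum(all_reads), the sketch below recomputes the summary values by hand. The data frame is typed in from the printed example output for T1.bam, so no BAM file is needed; it is not produced by calling the package.

```r
# The $out table from the printed example, entered manually.
out <- data.frame(
  Phasing_id = c("chr14:106327838_T_A", "106327869_C_A", "chr14:106327966_C_T"),
  ref = c(359, 469, 319),
  alt = c(0, 1, 2),
  n_reads_multi_mutation = c(0, 407, 0),
  all_reads = c(359, 470, 321),
  multi_support = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Reads spanning more than one phased mutation, over all counted reads.
purification_prob <- sum(out$n_reads_multi_mutation) / sum(out$all_reads)

# Reads supporting the alt allele of several phased mutations, over all events.
multi_support <- sum(out$multi_support)

round(purification_prob, 6)  # 407 / 1150
multi_support
```

The result, 407/1150, rounds to the 0.353913 shown in the example, and the multi-support total is 1.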
test_ctDNA
get_mutations_read_names
data("mutations", package = "ctDNAtools")
bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools")
merge_mutations_in_phase(mutations = mutations[5:10, ], bam = bamT1, ID_column = "PHASING")
#> $out
#>            Phasing_id ref alt n_reads_multi_mutation all_reads multi_support
#> 1 chr14:106327838_T_A 359   0                      0       359             0
#> 2       106327869_C_A 469   1                    407       470             1
#> 3 chr14:106327966_C_T 319   2                      0       321             0
#>
#> $purification_prob
#> [1] 0.353913
#>
#> $multi_support
#> [1] 1
#>
#> $informative_reads
#> [1] 610
#>