A function to extract fragment lengths from a bam file. Optionally, given a mutation data frame, it can categorize read lengths in mutated vs non-mutated reads.

get_fragment_size(bam, mutations = NULL, targets = NULL, tag = "",
  isProperPair = NA, mapqFilter = 30, min_size = 1, max_size = 400,
  ignore_trimmed = TRUE, different_strands = TRUE,
  simple_cigar = FALSE)

Arguments

bam

path to bam file.

mutations

An optional data frame with mutations. Must have the columns CHROM, POS, REF, ALT.

targets

a data frame with the target regions to restrict the reads in the bam. Must have three columns: chr, start and end

tag

the RG tag if the bam has more than one sample.

isProperPair

a logical whether to return only proper pairs (true), only improper pairs (false), or it does not matter (NA).

mapqFilter

mapping quality threshold for considering reads.

min_size

Integer with the lowest fragment length.

max_size

Integer with the highest fragment length.

ignore_trimmed

logical, whether to remove reads that have been hard trimmed.

different_strands

logical, whether to keep only reads whose mates map to different strand.

simple_cigar

logical, whether to include only reads with simple cigar.

Value

A data frame with the columns:

  • Sample: The SM tag in bam or file name

  • ID: the read ID

  • chr: chromosome

  • start: the left most end of either the read or mate

  • end: the right most end of either the read or mate.

  • size: the fragment size

  • category (only if mutations is provided): either ref, alt, or other

Details

Extracts the fragment size of reads in the input bam that satisfy the following conditions:

  • Paired, and optionally properly paired depending on the isProperPair parameter.

  • Both the reads and mate are mapped.

  • Not secondary or supplementary alignment

  • Not duplicate

  • Passing quality control

  • Read and mate on different strands (optional depending on the different_strands parameter)

  • Not trimmed (optional depending on the ignore_trimmed parameter), i.e. will keep only reads with the max length.

  • Having a simple cigar (optional depending on the simple_cigar parameter).

  • Satisfy the mapping quality threshold specified in the mapqFilter parameter.

  • Reads and mates on the same chromosome when min_size > 0.

When the input mutations is given, the output will label the reads that support the variant alleles of the mutation in a separate column.

See also

summarize_fragment_size bin_fragment_size analyze_fragmentation get_mutations_fragment_size

Examples

# \donttest{ data("mutations", package = "ctDNAtools") data("targets", package = "ctDNAtools") bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools") ## basic usage fs <- get_fragment_size(bam = bamT1) ## More options fs1 <- get_fragment_size( bam = bamT1, isProperPair = TRUE, min_size = 70, max_size = 200, ignore_trimmed = FALSE, different_strands = FALSE, simple_cigar = TRUE ) ## with mutations input fs2 <- get_fragment_size(bam = bamT1, mutations = mutations) ## using targets fs3 <- get_fragment_size(bam = bamT1, targets = targets) # }