This tool estimates the number of reads that would be filtered given a set of settings and prints this to the terminal. Further, it tracks the number of singleton reads. The following metrics will always be tracked regardless of what you specify (the order output also matches this):

• Total reads (including unmapped)
• Reads in blacklisted regions (–blackListFileName)
The following metrics are estimated according to the –binSize and –distanceBetweenBins parameters
• Estimated mapped reads filtered (the total number of mapped reads filtered for any reason)
• Alignments with a below threshold MAPQ (–minMappingQuality)
• Alignments with at least one missing flag (–samFlagInclude)
• Alignments with undesirable flags (–samFlagExclude)
• Duplicates determined by deepTools (–ignoreDuplicates)
• Duplicates marked externally (e.g., by picard)
• Singletons (paired-end reads with only one mate aligning)
• Wrong strand (due to –filterRNAstrand)

The sum of these may be more than the total number of reads. Note that alignments are sampled from bins of size –binSize spaced –distanceBetweenBins apart.

usage: Example usage: estimateReadFiltering.py -b sample1.bam sample2.bam > log.txt


## Required arguments¶

 --bamfiles, -b List of indexed bam files separated by spaces.

## General arguments¶

 --outFile, -o The file to write results to. By default, results are printed to the console --sampleLabels Labels for the samples. The default is to use the file name of the sample. The sample labels should be separated by spaces and quoted if a label itselfcontains a space E.g. –sampleLabels label-1 “label 2” --smartLabels Instead of manually specifying labels for the input BAM files, this causes deepTools to use the file name after removing the path and extension. --binSize, -bs Length in bases of the window used to sample the genome. (Default: 1000000) --distanceBetweenBins, -n To reduce the computation time, not every possible genomic bin is sampled. This option allows you to set the distance between bins actually sampled from. Larger numbers are sufficient for high coverage samples, while smaller values are useful for lower coverage samples. Note that if you specify a value that results in too few (<1000) reads sampled, the value will be decreased. (Default: 10000) --numberOfProcessors, -p Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors. (Default: 1) --verbose, -v Set to see processing messages. --version show program’s version number and exit

## Background¶

Many tools within deepTools allow one to filter BAM files according to alignment mapping qualities or other criteria. It’s difficult to know ahead of time how these various settings will affect the number of filtered reads. Consequently, estimateReadFiltering can be used to approximate the number of reads in a BAM file or files that will be filtered according to one or more criteria. This can also be used the quickly estimate the duplication level in a BAM file.

## Usage example¶

estimateReadFiltering needs one or more sorted and indexed BAM files and the desired filtering criteria.

\$ estimateReadFiltering -b paired_chr2L.bam \
--minMappingQuality 5 --samFlagInclude 16 \
--samFlagExclude 256 --ignoreDuplicates


By default, the output is printed to the screen. You can change this with the -o option. The output is a tab-separated file:

Sample Total Reads Mapped Reads Alignments in blacklisted regions Estimated mapped reads filtered Below MAPQ Missing Flags Excluded Flags Internally-determined Duplicates Marked Duplicates Singletons Wrong strand paired_chr2L.bam 12644 12589 0 6313.2 4114.0 6340.0 0.0 1163.0 0.0 55.0 0.0

The columns are as follows:

• Total reads (including unmapped)