correctGCBias

Hint

For background information about the GC bias assessment and correction, see computeGCBias.

This tool corrects the GC bias using the method proposed by [Benjamini & Speed (2012). Nucleic Acids Research, 40(10)]. It removes reads from regions where coverage is higher than expected (typically GC-rich regions) and adds reads to regions where coverage is lower than expected (typically AT-rich regions). computeGCBias must be run first to generate the frequency table that this tool requires.
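
The idea of removing and adding reads can be illustrated with a minimal Python sketch. This is not the deepTools implementation: the function name `copies_to_keep` and the stochastic rounding rule are assumptions made for illustration only. Given the observed-to-expected read ratio for the GC content of a read's fragment, the sketch returns how many copies of that read would be written to the output.

```python
import random


def copies_to_keep(obs_exp_ratio, rng=random):
    """Return how many copies of a read to emit for an observed/expected ratio.

    Hypothetical helper (not part of deepTools): a ratio above 1 means the
    GC content is over-represented, so on average fewer than one copy is
    kept; a ratio below 1 means it is under-represented, so more than one
    copy is emitted.
    """
    if obs_exp_ratio <= 0:
        raise ValueError("ratio must be positive")
    target = 1.0 / obs_exp_ratio   # expected number of copies
    whole = int(target)            # copies that are always emitted
    # Emit one extra copy with probability equal to the fractional part,
    # so the average number of copies equals `target`.
    extra = 1 if rng.random() < (target - whole) else 0
    return whole + extra
```

With a ratio of 0.5 every read is emitted twice, which is also why a corrected BAM file can contain duplicated reads; with a ratio of 2.0 roughly half of the reads are dropped.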

usage: An example usage is:
 correctGCBias -b file.bam --effectiveGenomeSize 2150570000 -g mm9.2bit --GCbiasFrequenciesFile freq.txt -o gc_corrected.bam [options]
Required arguments
--bamfile, -b Sorted BAM file to correct.
--effectiveGenomeSize
 The effective genome size is the portion of the genome that is mappable. Large fractions of the genome are stretches of NNNN that should be discarded. Also, if repetitive regions were not included in the mapping of reads, the effective genome size needs to be adjusted accordingly. Common values are: mm9: 2150570000, hg19: 2451960000, dm3: 121400000 and ce10: 93260000. See Table 2 of http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030377 or http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html for several effective genome sizes. This value is needed to detect enriched regions that, if not discarded, could bias the results.
--genome, -g Genome in 2bit format. Most genomes can be found here: http://hgdownload.cse.ucsc.edu/gbdb/. Search for the .2bit ending. Otherwise, FASTA files can be converted to 2bit using faToTwoBit, which is available here: http://hgdownload.cse.ucsc.edu/admin/exe/
--GCbiasFrequenciesFile, -freq
 Indicate the output file from computeGCBias containing the observed and expected read frequencies per GC-content.
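
As a rough illustration of why stretches of N reduce the effective genome size, the sketch below counts the non-N bases in a set of sequences. The function name `count_non_n_bases` is hypothetical, and the count is only an upper bound: the published values listed above also account for mappability and repeats, which this sketch ignores.

```python
def count_non_n_bases(sequences):
    """Count bases that are not N (case-insensitive) across sequence strings.

    Hypothetical helper for illustration; it does not model mappability.
    """
    return sum(len(seq) - seq.upper().count("N") for seq in sequences)


# Two toy "chromosomes": 4 real bases each, plus masked N stretches.
toy_genome = ["ACGTNNNN", "nnacgt"]
non_n = count_non_n_bases(toy_genome)
```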
Output options
--correctedFile, -o
 Name of the corrected file. The file ending determines the output format. The options are ".bam", ".bw" for a bigWig file, and ".bg" for a bedGraph file.
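
The mapping from file ending to output format can be sketched as follows. The function `output_format` and the `FORMATS` table are illustrative names, not deepTools internals, and the sketch assumes the ending is matched exactly as written.

```python
import os

# File endings listed in the option description above.
FORMATS = {".bam": "BAM", ".bw": "bigWig", ".bg": "bedGraph"}


def output_format(filename):
    """Return the output format implied by the ending of `filename`.

    Hypothetical helper; raises ValueError for an unrecognized ending.
    """
    ending = os.path.splitext(filename)[1]
    if ending not in FORMATS:
        raise ValueError("unsupported file ending: %r" % ending)
    return FORMATS[ending]
```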
Optional arguments
--version show program’s version number and exit
--binSize, -bs Size of the bins, in bases, for the output of the bigwig/bedgraph file.
--region, -r Region of the genome to limit the operation to. This is useful when testing parameters, because it reduces the computing time. The format is chr:start:end, for example --region chr10 or --region chr10:456700:891000.
--numberOfProcessors, -p
 Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors.
--verbose, -v Set to see processing messages.
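
The value formats accepted by --region and --numberOfProcessors can be illustrated with a short Python sketch. Both function names are hypothetical and the actual parsing in the tool may differ; the sketch only covers the forms described above (chr, chr:start:end, a number, "max" and "max/2").

```python
import multiprocessing


def parse_region(region):
    """Split 'chr' or 'chr:start:end' into (chrom, start, end).

    Hypothetical helper: start and end are None when only a chromosome
    name is given.
    """
    parts = region.split(":")
    if len(parts) == 1:
        return parts[0], None, None
    if len(parts) == 3:
        start, end = int(parts[1]), int(parts[2])
        if start >= end:
            raise ValueError("start must be smaller than end")
        return parts[0], start, end
    raise ValueError("expected chr or chr:start:end, got %r" % region)


def resolve_processors(value):
    """Turn 'max', 'max/2' or a number string into a processor count.

    Hypothetical helper: 'max/2' never resolves to fewer than 1.
    """
    total = multiprocessing.cpu_count()
    if value == "max":
        return total
    if value == "max/2":
        return max(1, total // 2)
    return int(value)
```

For example, `parse_region("chr10:456700:891000")` yields the chromosome name together with integer start and end coordinates, while `parse_region("chr10")` selects the whole chromosome.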

Usage example

Note

correctGCBias requires the output of computeGCBias and a genome file in 2bit format. Most genomes can be found here: http://hgdownload.cse.ucsc.edu/gbdb/. Search for the .2bit ending. Otherwise, FASTA files can be converted to 2bit using faToTwoBit, which is available here: http://hgdownload.cse.ucsc.edu/admin/exe/

# freq_test.txt is the output of computeGCBias
$ correctGCBias -b H3K27Me3.bam \
   --effectiveGenomeSize 2695000000 \
   --genome genome.2bit \
   --GCbiasFrequenciesFile freq_test.txt \
   -o gc_corrected.bam

Warning

The GC-corrected BAM file will most likely contain several duplicated reads in regions where the coverage had to be increased in order to match the expected read density. This means that you should absolutely avoid filtering duplicate reads during your downstream analyses!
