correctGCBias

Hint

For background information about the GC bias assessment and correction, see computeGCBias.

This tool corrects the GC-bias using the method proposed by [Benjamini & Speed (2012). Nucleic Acids Research, 40(10)]. It will remove reads from regions with too high coverage compared to the expected values (typically GC-rich regions) and will add reads to regions where too few reads are seen (typically AT-rich regions). The tool computeGCBias needs to be run first to generate the frequency table needed here.

usage: correctGCBias -b file.bam --effectiveGenomeSize 2150570000 -g mm9.2bit --GCbiasFrequenciesFile freq.txt -o gc_corrected.bam
help: correctGCBias -h / correctGCBias --help

Required arguments

--bamfile, -b: Sorted BAM file to correct.
--effectiveGenomeSize: The effective genome size is the portion of the genome that is mappable. Large fractions of the genome are stretches of NNNN that should be discarded. Also, if repetitive regions were not included in the mapping of reads, the effective genome size needs to be adjusted accordingly. A table of values is available here: http://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html .
--genome, -g: Genome in two bit format. Most genomes can be found here: http://hgdownload.cse.ucsc.edu/gbdb/ Search for the .2bit ending. Otherwise, fasta files can be converted to 2bit using faToTwoBit available here: http://hgdownload.cse.ucsc.edu/admin/exe/
--GCbiasFrequenciesFile, -freq: Indicate the output file from computeGCBias containing the observed and expected read frequencies per GC-content.

Output options

--correctedFile, -o: Name of the corrected file. The file extension determines the output format: “.bam” for a BAM file, “.bw” for a bigWig file, or “.bg” for a bedGraph file.

Optional arguments

--version: show program’s version number and exit
--binSize, -bs: Size of the bins, in bases, for the output of the bigwig/bedgraph file. (Default: 50)
--region, -r: Region of the genome to limit the operation to. This is useful when testing parameters, as it reduces the computing time. The format is chr:start:end, for example --region chr10 or --region chr10:456700:891000.
--numberOfProcessors, -p: Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors. (Default: 1)
--verbose, -v: Set to see processing messages.
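
When testing parameters, the optional arguments above can be combined as in the sketch below. It assumes the placeholder file names from the usage line, and gc_corrected_chr10.bam is a hypothetical output name. The command is only printed, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Restrict the correction to one region and use half of the available processors.
# File names are placeholders; gc_corrected_chr10.bam is a hypothetical output name.
REGION="chr10:456700:891000"   # chr:start:end, as described for --region

CMD="correctGCBias -b file.bam \
  --effectiveGenomeSize 2150570000 -g mm9.2bit \
  --GCbiasFrequenciesFile freq.txt \
  --region $REGION --numberOfProcessors max/2 --verbose \
  -o gc_corrected_chr10.bam"

# Print the command for review; to execute it, use: sh -c "$CMD"
echo "$CMD"
```

Once the settings look right on the small region, dropping --region applies the same command to the whole genome.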

Usage example

Note

correctGCBias requires the output of computeGCBias and a genome file in 2bit format. Most genomes can be found here: http://hgdownload.cse.ucsc.edu/gbdb/. Search for the .2bit ending. Otherwise, FASTA files can be converted to 2bit using faToTwoBit, which is available here: http://hgdownload.cse.ucsc.edu/admin/exe/

# freq_test.txt is the output of computeGCBias
$ correctGCBias -b H3K27Me3.bam \
   --effectiveGenomeSize 2695000000 \
   --genome genome.2bit \
   --GCbiasFrequenciesFile freq_test.txt \
   -o gc_corrected.bam
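
Because computeGCBias has to be run first, the two steps can be chained in one script. The following is a minimal sketch, not an official recipe. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences. The fragment length of 200 is an assumed value that should be replaced with the one for your library. The file names are the placeholders from the example above.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (requires deepTools on the PATH).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

BAM=H3K27Me3.bam
GENOME=genome.2bit
GENOME_SIZE=2695000000
FREQ=freq_test.txt
FRAGMENT_LENGTH=200   # assumption: adjust to your sequencing library

# Step 1: compute the observed and expected read frequencies per GC content.
run computeGCBias -b "$BAM" \
    --effectiveGenomeSize "$GENOME_SIZE" \
    -g "$GENOME" \
    -l "$FRAGMENT_LENGTH" \
    --GCbiasFrequenciesFile "$FREQ"

# Step 2: use that frequency table to correct the BAM file.
run correctGCBias -b "$BAM" \
    --effectiveGenomeSize "$GENOME_SIZE" \
    -g "$GENOME" \
    --GCbiasFrequenciesFile "$FREQ" \
    -o gc_corrected.bam
```

Both steps share the same BAM file, genome and effective genome size, so keeping them in variables helps avoid a mismatch between the two calls.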

Warning

The GC-corrected BAM file will most likely contain several duplicated reads in regions where the coverage had to be increased in order to match the expected read density. This means that you should absolutely avoid filtering out duplicate reads in your downstream analyses!
