Effective Genome Size

A number of tools can accept an “effective genome size”. This is defined as the length of the “mappable” genome. There are two common alternative ways to calculate this:

1. The number of non-N bases in the genome.
2. The number of regions (of some size) in the genome that are uniquely mappable (possibly given some maximal edit distance).
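
As a rough illustration of the first definition, a minimal Python sketch that counts non-N bases in FASTA-formatted text might look like the following. The function name `count_non_n_bases` and the toy record are made up for this example; they are not part of faCount or deepTools, and the sketch assumes plain, uncompressed FASTA input.

```python
import io


def count_non_n_bases(fasta_handle):
    """Count bases other than N/n across all records of a FASTA stream.

    Hypothetical helper for illustration only.
    """
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and record headers such as ">chr1".
        if not line or line.startswith(">"):
            continue
        # Soft-masked (lowercase) bases are counted; only N/n is excluded.
        total += len(line) - line.upper().count("N")
    return total


# Toy input: 16 bases in total, 6 of which are N or n.
example = io.StringIO(">chr_demo\nACGTNNNNAC\nggnnTT\n")
print(count_non_n_bases(example))  # prints 10
```

For a real genome the same function could be given an open file handle instead of the in-memory example.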

Option 1 can be computed using faCount from Kent’s tools. The effective genome size for a number of genomes using this method is given below:

Genome     Effective size (bp)
GRCh37     2864785220
GRCh38     2913022398
GRCm37     2620345972
GRCm38     2652783500
dm3        162367812
dm6        142573017
GRCz10     1369631918
WBcel235   100286401

These values are only appropriate if multimapping reads are included. If they are excluded (or any MAPQ filter is applied), then values derived from option 2 are more appropriate. Such values depend on the read length. They can be approximated for various read lengths using the khmer program, in particular its unique-kmers.py script. A table of effective genome sizes for a given read length, computed with this method, is provided below:

Read length  GRCh37      GRCh38      GRCm37      GRCm38      dm3        dm6        GRCz10      WBcel235
50           2685511504  2701495761  2304947926  2308125349  130428560  125464728  1195445591  95159452
75           2736124973  2747877777  2404646224  2407883318  135004462  127324632  1251132686  96945445
100          2776919808  2805636331  2462481010  2467481108  139647232  129789873  1280189044  98259998
150          2827437033  2862010578  2489384235  2494787188  144307808  129941135  1312207169  98721253
200          2855464000  2887553303  2513019276  2520869189  148524010  132509163  1321355241  98672758
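
To make the idea behind the second definition more concrete, the sketch below counts the distinct k-mers of a toy sequence exactly, with k standing in for the read length. This is not how the table was produced: khmer's unique-kmers.py gives an estimate rather than an exact count, and a Python set would not be practical for a whole genome. The function name `count_distinct_kmers` is hypothetical. Considering only the forward strand and skipping any k-mer that contains a non-ACGT character are simplifying assumptions.

```python
def count_distinct_kmers(sequence, k):
    """Exactly count the distinct k-mers made up only of A, C, G and T.

    Hypothetical helper for illustration; forward strand only.
    """
    sequence = sequence.upper()
    valid_bases = set("ACGT")
    seen = set()
    # Slide a window of width k along the sequence.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Windows overlapping N (or any other ambiguity code) are ignored.
        if set(kmer) <= valid_bases:
            seen.add(kmer)
    return len(seen)


toy = "ACGTACGTNNACGA"
# Distinct 4-mers: ACGT, CGTA, GTAC, TACG, ACGA.
print(count_distinct_kmers(toy, 4))  # prints 5
```

The repeated ACGT in the toy sequence is counted once, which is the sense in which repetitive regions contribute less to this kind of estimate than to the non-N base count.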