computeMatrix

This tool calculates scores per genomic region and prepares an intermediate file that can be used with plotHeatmap and plotProfile. Typically, the genomic regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple region files (BED format). This tool can also be used to filter and sort regions according to their score.

To learn more about the specific parameters, type:

$ computeMatrix reference-point --help

or

$ computeMatrix scale-regions --help

usage: computeMatrix [-h] [--version]  ...
optional arguments
--version show program’s version number and exit
Commands

Possible choices: scale-regions, reference-point

Sub-commands:
scale-regions

In the scale-regions mode, all regions in the BED file are stretched or shrunken to the length (in bases) indicated by the user.
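
As a rough illustration of what "stretched or shrunken" means, the following Python sketch resamples the per-base scores of a region into a fixed number of bins. It is not the deepTools implementation: `scale_region` is a hypothetical helper, and mapping each bin onto a proportional slice of the region is an assumption made for illustration.

```python
from statistics import mean


def scale_region(scores, body_length=1000, bin_size=10):
    """Resample per-base scores into body_length // bin_size bins.

    Hypothetical helper for illustration only; deepTools may compute
    the bin boundaries differently.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("region has no scores")
    n_bins = body_length // bin_size
    scaled = []
    for i in range(n_bins):
        # Map bin i onto a proportional slice of the original region.
        lo = i * n // n_bins
        hi = max(lo + 1, (i + 1) * n // n_bins)
        scaled.append(mean(scores[lo:hi]))
    return scaled


# A 2000-base region and a 500-base region both end up with 100 bins.
long_region = scale_region([1.0] * 2000)
short_region = scale_region([2.0] * 500)
print(len(long_region), len(short_region))  # 100 100
```

Regions longer than the target length are averaged down, and shorter regions are stretched by reusing bases, so every row of the resulting matrix has the same width.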

usage: An example usage is:
  computeMatrix scale-regions -S <bigwig file> -R <bed file> -b 1000
Required arguments
--regionsFileName, -R
 File name, in BED format, containing the regions to plot. If multiple bed files are given, each one is considered a group that can be plotted separately. Also, adding a “#” symbol in the bed file causes all the regions until the previous “#” to be considered one group.
--scoreFileName, -S
 bigWig file(s) containing the scores to be plotted. BigWig files can be obtained by using the bamCoverage or bamCompare tools. More information about the bigWig file format can be found at http://genome.ucsc.edu/goldenPath/help/bigWig.html
Output options
--outFileName, -out
 File name to save the gzipped matrix file needed by the “plotHeatmap” and “plotProfile” tools.
--outFileNameMatrix
 If this option is given, then the matrix of values underlying the heatmap will be saved using the indicated name, e.g. IndividualValues.tab. This matrix can easily be loaded into R or other programs.
--outFileSortedRegions
 File name in which the regions are saved after skipping zeros or min/max threshold values. The order of the regions in the file follows the sorting order selected. This is useful, for example, to generate other heatmaps keeping the sorting of the first heatmap. Example: Heatmap1sortedRegions.bed
Optional arguments
--version show program’s version number and exit
--regionBodyLength=1000, -m=1000
 Distance in bases to which all regions will be fit.
--startLabel=TSS
 Label shown in the plot for the start of the region. Default is TSS (transcription start site), but could be changed to anything, e.g. “peak start”. Note that this is only useful if you plan to plot the results yourself and not, for example, with plotHeatmap, which will override this.
--endLabel=TES Label shown in the plot for the region end. Default is TES (transcription end site). See the --startLabel option for more information.
--beforeRegionStartLength=0, -b=0, --upstream=0
 Distance upstream of the start site of the regions defined in the region file. If the regions are genes, this would be the distance upstream of the transcription start site.
--afterRegionStartLength=0, -a=0, --downstream=0
 Distance downstream of the end site of the given regions. If the regions are genes, this would be the distance downstream of the transcription end site.
--unscaled5prime=0
 Number of bases at the 5-prime end of the region to exclude from scaling. By default, each region is scaled to a given length (see the --regionBodyLength option). In some cases it is useful to look at unscaled signals around region boundaries, so this setting specifies the number of unscaled bases on the 5-prime end of each boundary.
--unscaled3prime=0
 Like --unscaled5prime, but for the 3-prime end.
--binSize=10, -bs=10
 Length, in bases, of the non-overlapping bins used for averaging the score over the length of each region.
--sortRegions=no
 

Whether the output file should present the regions sorted. The default is to not sort the regions. Note that this is only useful if you plan to plot the results yourself and not, for example, with plotHeatmap, which will override this.

Possible choices: descend, ascend, no

--sortUsing=mean
 

Indicate which method should be used for sorting. The value is computed for each row. Note that the region_length option will lead to a dotted line within the heatmap that indicates the end of the regions.

Possible choices: mean, median, max, min, sum, region_length

--averageTypeBins=mean
 

Define the type of statistic that should be used over the bin size range. The options are: “mean”, “median”, “min”, “max”, “sum” and “std”. The default is “mean”.

Possible choices: mean, median, min, max, std, sum

--missingDataAsZero=False
 If set, missing data (NAs) will be treated as zeros. The default is to ignore such cases, which will be depicted as black areas in a heatmap (see the --missingDataColor argument of the plotHeatmap command for additional options).
--skipZeros=False
 Whether regions with only scores of zero should be included or not. Default is to include them.
--minThreshold Numeric value. Any region containing a value that is less than or equal to this will be skipped. This is useful to skip, for example, genes where the read count is zero for any of the bins. This could be the result of unmappable areas and can bias the overall results.
--maxThreshold Numeric value. Any region containing a value greater than or equal to this will be skipped. The maxThreshold is useful to skip those few regions with very high read counts (e.g. microsatellites) that may bias the average values.
--blackListFileName, -bl
 A BED file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered.
--quiet=False, -q=False
 Set to remove any warning or processing messages.
--scale=1 If set, all values are multiplied by this number.
--numberOfProcessors=max/2, -p=max/2
 Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors.
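
The filtering options above (--skipZeros, --minThreshold, --maxThreshold) can be read as a per-region test on the row of bin values. The sketch below mirrors the option descriptions only; `keep_region` is a hypothetical helper, not code from deepTools, and it ignores missing data (NAs).

```python
def keep_region(values, skip_zeros=False, min_threshold=None, max_threshold=None):
    """Return True if a region's row of bin values passes the filters.

    Hypothetical helper that follows the option descriptions above.
    """
    if skip_zeros and all(v == 0 for v in values):
        return False  # --skipZeros: the region has only scores of zero
    if min_threshold is not None and any(v <= min_threshold for v in values):
        return False  # --minThreshold: some value is <= the threshold
    if max_threshold is not None and any(v >= max_threshold for v in values):
        return False  # --maxThreshold: some value is >= the threshold
    return True


rows = {
    "geneA": [0, 0, 0],    # dropped by skip_zeros
    "geneB": [1, 5, 2],    # kept
    "geneC": [3, 900, 4],  # dropped by max_threshold
}
kept = [name for name, vals in rows.items()
        if keep_region(vals, skip_zeros=True, max_threshold=500)]
print(kept)  # ['geneB']
```

Note that the thresholds are inclusive in the descriptions above, so a single bin equal to the threshold is enough to skip the whole region.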
reference-point

Reference-point refers to a position within a BED region (e.g., the starting point). In this mode, only those genomic positions before (upstream) and/or after (downstream) the reference point will be plotted.
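
To make the geometry concrete, here is a minimal Python sketch of how a window around the reference point could be derived from a BED region. `reference_window` is a hypothetical helper, and the strand handling (for "-" strand regions the TSS is taken to be the region end, with upstream and downstream mirrored) is an assumption, not something stated on this page.

```python
def reference_window(start, end, strand="+", reference_point="TSS",
                     upstream=500, downstream=1500):
    """Return (window_start, window_end) around the reference point.

    Hypothetical helper; strand-aware behaviour is an assumption.
    """
    if reference_point not in ("TSS", "TES", "center"):
        raise ValueError("reference_point must be TSS, TES or center")
    if reference_point == "center":
        ref = (start + end) // 2
    elif (reference_point == "TSS") == (strand != "-"):
        ref = start  # TSS on the "+" strand, or TES on the "-" strand
    else:
        ref = end    # TES on the "+" strand, or TSS on the "-" strand
    if strand == "-":
        # Upstream lies at higher coordinates on the "-" strand.
        return max(0, ref - downstream), ref + upstream
    return max(0, ref - upstream), ref + downstream


print(reference_window(10000, 12000, "+"))  # (9500, 11500)
print(reference_window(10000, 12000, "-"))  # (10500, 12500)
```

The defaults of 500 and 1500 are the documented defaults of --beforeRegionStartLength and --afterRegionStartLength in this mode.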

usage: An example usage is:
  computeMatrix reference-point -S <bigwig file> -R <bed file> -a 3000 -b 3000
Required arguments
--regionsFileName, -R
 File name, in BED format, containing the regions to plot. If multiple bed files are given, each one is considered a group that can be plotted separately. Also, adding a “#” symbol in the bed file causes all the regions until the previous “#” to be considered one group.
--scoreFileName, -S
 bigWig file(s) containing the scores to be plotted. BigWig files can be obtained by using the bamCoverage or bamCompare tools. More information about the bigWig file format can be found at http://genome.ucsc.edu/goldenPath/help/bigWig.html
Output options
--outFileName, -out
 File name to save the gzipped matrix file needed by the “plotHeatmap” and “plotProfile” tools.
--outFileNameMatrix
 If this option is given, then the matrix of values underlying the heatmap will be saved using the indicated name, e.g. IndividualValues.tab. This matrix can easily be loaded into R or other programs.
--outFileSortedRegions
 File name in which the regions are saved after skipping zeros or min/max threshold values. The order of the regions in the file follows the sorting order selected. This is useful, for example, to generate other heatmaps keeping the sorting of the first heatmap. Example: Heatmap1sortedRegions.bed
Optional arguments
--version show program’s version number and exit
--referencePoint=TSS
 

The reference point for the plotting could be either the region start (TSS), the region end (TES) or the center of the region. Note that regardless of what you specify, plotHeatmap/plotProfile will default to using “TSS” as the label.

Possible choices: TSS, TES, center

--beforeRegionStartLength=500, -b=500, --upstream=500
 Distance upstream of the reference-point selected.
--afterRegionStartLength=1500, -a=1500, --downstream=1500
 Distance downstream of the reference-point selected.
--nanAfterEnd=False
 If set, any values after the region end are discarded. This is useful to visualize the region end when not using the scale-regions mode and when the reference-point is set to the TSS.
--binSize=10, -bs=10
 Length, in bases, of the non-overlapping bins used for averaging the score over the length of each region.
--sortRegions=no
 

Whether the output file should present the regions sorted. The default is to not sort the regions. Note that this is only useful if you plan to plot the results yourself and not, for example, with plotHeatmap, which will override this.

Possible choices: descend, ascend, no

--sortUsing=mean
 

Indicate which method should be used for sorting. The value is computed for each row. Note that the region_length option will lead to a dotted line within the heatmap that indicates the end of the regions.

Possible choices: mean, median, max, min, sum, region_length

--averageTypeBins=mean
 

Define the type of statistic that should be used over the bin size range. The options are: “mean”, “median”, “min”, “max”, “sum” and “std”. The default is “mean”.

Possible choices: mean, median, min, max, std, sum

--missingDataAsZero=False
 If set, missing data (NAs) will be treated as zeros. The default is to ignore such cases, which will be depicted as black areas in a heatmap (see the --missingDataColor argument of the plotHeatmap command for additional options).
--skipZeros=False
 Whether regions with only scores of zero should be included or not. Default is to include them.
--minThreshold Numeric value. Any region containing a value that is less than or equal to this will be skipped. This is useful to skip, for example, genes where the read count is zero for any of the bins. This could be the result of unmappable areas and can bias the overall results.
--maxThreshold Numeric value. Any region containing a value greater than or equal to this will be skipped. The maxThreshold is useful to skip those few regions with very high read counts (e.g. microsatellites) that may bias the average values.
--blackListFileName, -bl
 A BED file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered.
--quiet=False, -q=False
 Set to remove any warning or processing messages.
--scale=1 If set, all values are multiplied by this number.
--numberOfProcessors=max/2, -p=max/2
 Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors.

An example usage is:
computeMatrix reference-point -S <bigwig file(s)> -R <bed file(s)> -b 1000
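
The interplay of --binSize and --averageTypeBins, available in both modes, can be pictured as summarizing consecutive, non-overlapping slices of the per-base signal. The sketch below is illustrative only: `summarize_bins` and the `STATS` table are hypothetical, and the use of the population standard deviation for "std" is an assumption.

```python
import statistics

# Mapping from the --averageTypeBins choices to Python functions.
STATS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "sum": sum,
    "std": statistics.pstdev,  # assumption: population standard deviation
}


def summarize_bins(scores, bin_size=10, average_type="mean"):
    """Summarize per-base scores in non-overlapping bins of bin_size bases."""
    stat = STATS[average_type]
    return [stat(scores[i:i + bin_size])
            for i in range(0, len(scores), bin_size)]


per_base = [0.0, 0.0, 2.0, 2.0, 4.0, 4.0, 6.0, 6.0]
print(summarize_bins(per_base, bin_size=4, average_type="mean"))  # [1.0, 5.0]
print(summarize_bins(per_base, bin_size=4, average_type="max"))   # [2.0, 6.0]
```

A larger bin size yields fewer columns per region and a smoother signal, at the cost of resolution.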

Details

computeMatrix has two main modes of use:

  • for computing the signal distribution relative to a point (reference-point), e.g., the beginning or end of each genomic region
  • for computing the signal over a set of regions (scale-regions) where all regions are scaled to the same size
[Figure: computeMatrix modes (reference-point and scale-regions)]

computeMatrix is tightly connected to plotHeatmap and plotProfile: it takes the values of all the signal files and all genomic regions that you would like to plot and computes the corresponding data matrix.

See plotHeatmap and plotProfile for example plots.

[Figure: computeMatrix overview]

In addition to generating the intermediate, gzipped file for plotHeatmap and plotProfile, computeMatrix can also be used to simply output the values underlying the heatmap or to filter and sort BED files using, for example, the --skipZeros and the --sortUsing parameters.
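
As a sketch of what sorting by a per-row value amounts to, the following Python snippet orders regions by a summary of their matrix row, following the --sortRegions and --sortUsing choices listed earlier. `sort_rows` is a hypothetical helper, not the deepTools code; the region_length choice is left out because it needs the region coordinates rather than the scores.

```python
import statistics

# Per-row summaries for the score-based --sortUsing choices.
KEY_FUNCS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "min": min,
    "sum": sum,
}


def sort_rows(rows, sort_using="mean", sort_regions="descend"):
    """Sort (region_name, values) pairs by a per-row summary value."""
    if sort_regions == "no":
        return list(rows)
    key = KEY_FUNCS[sort_using]
    return sorted(rows, key=lambda row: key(row[1]),
                  reverse=(sort_regions == "descend"))


rows = [("geneA", [1, 2, 3]), ("geneB", [9, 9, 9]), ("geneC", [0, 0, 0])]
print([name for name, _ in sort_rows(rows)])  # ['geneB', 'geneA', 'geneC']
```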

The following table summarizes the kinds of optional outputs that are available with the three tools.

optional output type            command                  computeMatrix  plotHeatmap  plotProfile
values underlying the heatmap   --outFileNameMatrix      yes            yes          no
values underlying the profile   --outFileNameData        no             yes          yes
sorted and/or filtered regions  --outFileSortedRegions   yes            yes          yes

Tip

computeMatrix can use multiple threads (-p option), which significantly decreases the time for calculating the values.

Examples

The following examples should give you an idea of some of the most often used settings for computeMatrix. As you can see, computeMatrix offers myriad tweaks and may turn out to be more useful to you than “just” to calculate heatmap matrices.

Example 1: single input files (reference-point mode)

Here, we start with a single bigWig file and a single BED file. computeMatrix will:

  1. take the beginning of the regions specified in the BED file
  2. add the values indicated with --beforeRegionStartLength (-b) and --afterRegionStartLength (-a)
  3. split the resulting region up into 10 bp bins (the default; can be changed via --binSize)
  4. calculate the mean score based on the scores given in the bigWig file (the kind of score can be changed via --averageTypeBins)
  5. write out the values where each row corresponds to one region in the BED file (note that you can, for example, skip regions with zero coverage; sorting is also possible)
# reference-point mode; alternatives for --referencePoint: TES, center
# -b and -a define the region you are interested in
# the -o file is to be used with plotHeatmap and plotProfile
$ computeMatrix reference-point \
       --referencePoint TSS \
       -b 3000 -a 10000 \
       -R testFiles/genes.bed \
       -S testFiles/log2ratio_H3K4Me3_chr19.bw \
       --skipZeros \
       -o matrix1_H3K4me3_l2r_TSS.gz \
       --outFileSortedRegions regions1_H3K4me3_l2r_genes.bed

Let’s have a closer look at the regions’ output:

$ wc -l testFiles/genes.bed # original file
   18257 testFiles/genes.bed
$ wc -l regions1_H3K4me3_l2r_genes.bed # file generated by computeMatrix
   12423 regions1_H3K4me3_l2r_genes.bed

As you can see, the number of regions is drastically reduced. The remaining genes happen to be the ones on chromosome 19 for which there was at least one overlapping read. This makes sense since the bigWig file used above only contained reads for chromosome 19.

# the original file contained genes for chr.19 and chr.X
$ cut -f 1 testFiles/genes.bed | sort | uniq -c
    12439 19
    5818 X

# the regions used for the computation of the matrix for the heatmap are all located on chr.19 due to the --skipZeros setting (see above)
$ cut -f 1 regions1_H3K4me3_l2r_genes.bed | sort | uniq -c
    1 #genes
    12422 19
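
If you prefer to do the same per-chromosome tally in a script, a minimal Python equivalent of `cut -f 1 | sort | uniq -c` could look as follows. `count_chromosomes` is a hypothetical helper, and the sample lines are made up for illustration rather than taken from the test files.

```python
from collections import Counter


def count_chromosomes(bed_lines):
    """Count the values of the first tab-separated column of each line."""
    return Counter(line.split("\t", 1)[0].strip()
                   for line in bed_lines if line.strip())


# Made-up sample lines in the shape of the sorted-regions output.
sample = [
    "19\t60104\t70951\tENST00000592209\t0.0\t-\n",
    "19\t60950\t70966\tENST00000606728\t0.0\t-\n",
    "X\t100\t200\texampleGene\t0.0\t+\n",
    "#genes\n",
]
for chrom, n in sorted(count_chromosomes(sample).items()):
    print(n, chrom)
```

To run it on a real file, pass an open file handle, e.g. `count_chromosomes(open("regions1_H3K4me3_l2r_genes.bed"))`. As with the shell pipeline, a "#" marker line is counted as its own entry.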

Example 2: multiple input files (scale-regions mode)

# separate multiple files with spaces (as for -R) or use the wild card approach (as for -S)
$ deepTools2.0/bin/computeMatrix scale-regions \
  -R genes_chr19_firstHalf.bed genes_chr19_secondHalf.bed \
  -S testFiles/log2ratio_*.bw \
  -b 3000 -a 3000 \
  --regionBodyLength 5000 \
  --skipZeros -o matrix2_multipleBW_l2r_twoGroups_scaled.gz \
  --outFileNameMatrix matrix2_multipleBW_l2r_twoGroups_scaled.tab \
  --outFileSortedRegions regions2_multipleBW_l2r_twoGroups_genes.bed

Note that the reported regions will have the same coordinates as the ones in the originally supplied file, not the region that was used for the heatmap matrix.

The groups of regions supplied as two individual files are written to a single output file, with the group name recorded for each region:

$ head -n 2 regions2_multipleBW_l2r_twoGroups_genes.bed
19  60104   70951   ENST00000592209 0.0     -       genes_chr19_firstHalf
19  60950   70966   ENST00000606728 0.0     -       genes_chr19_firstHalf

$ tail -n 3 regions2_multipleBW_l2r_twoGroups_genes.bed
19  59108549        59110722        ENST00000596427 0.0     -       genes_chr19_secondHalf
19  59110333        59110802        ENST00000464061 0.0     +       genes_chr19_secondHalf
#genes_chr19_secondHalf
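
A file with "#" group markers like the one above can be split back into its groups with a few lines of Python. This is a sketch under the rule stated for --regionsFileName (a "#" line closes the group formed by the regions since the previous "#"); `split_groups` is a hypothetical helper, and the sample input, including the marker line for the first group, is assumed for illustration.

```python
def split_groups(bed_lines, default_label="ungrouped"):
    """Group BED records by the '#' marker line that follows them."""
    groups = {}
    pending = []
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # The marker names all regions seen since the previous marker.
            label = line[1:].strip() or default_label
            groups.setdefault(label, []).extend(pending)
            pending = []
        else:
            pending.append(line.split("\t"))
    if pending:
        # Regions after the last marker have no explicit group name.
        groups.setdefault(default_label, []).extend(pending)
    return groups


sample = [
    "19\t60104\t70951\tENST00000592209\t0.0\t-\n",
    "#genes_chr19_firstHalf\n",
    "19\t59110333\t59110802\tENST00000464061\t0.0\t+\n",
    "#genes_chr19_secondHalf\n",
]
print({name: len(regions) for name, regions in split_groups(sample).items()})
# {'genes_chr19_firstHalf': 1, 'genes_chr19_secondHalf': 1}
```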

Tip

More examples can be found in our Gallery.
