plotPCA
Tool for generating a principal component analysis (PCA) plot from multiBamSummary or multiBigwigSummary output. By default, the loadings for each sample in each principal component is plotted. If the data is transposed, the projections of each sample on the requested principal components is plotted instead.
Detailed help:
plotPCA -h
usage: plotPCA -in coverage.npz -o pca.png
help: plotPCA -h / plotPCA --help
Named Arguments
- --transpose
Perform the PCA on the transposed matrix, (i.e., on the matrix where rows are samples and columns are bins/features. This then matches what is typically done in R.
- --rowCenter
When specified, each row (bin, gene, etc.) in the matrix is centered at 0 before the PCA is computed. This is useful only if you have a strong bin/gene/etc. correlation and the resulting principal component has samples stacked vertically. This option is not applicable if –transpose is specified.
Required arguments
- --corData, -in
Coverage file (generated by multiBamSummary or multiBigwigSummary)
Optional arguments
- --plotFile, -o
File name to save the plot to. The extension determines the file format. For example: pca.pdf will save the PCA plot in PDF format. The available options are: .png, .eps, .pdf and .svg. If this option is omitted, then you MUST specify –outFileNameData
- --labels, -l
User defined labels instead of default labels from file names. Multiple labels have to be separated by spaces, e.g. –labels sample1 sample2 sample3
- --plotTitle, -T
Title of the plot, to be printed on top of the generated image. Leave blank for no title. (Default: )
- --plotFileFormat
Possible choices: png, pdf, svg, eps, plotly
Image format type. If given, this option overrides the image format based on the plotFile ending. The available options are: png, eps, pdf, plotly and svg.
- --plotHeight
Plot height in cm. (Default: 10)
- --plotWidth
Plot width in cm. The minimum value is 1 cm. (Default: 10)
- --outFileNameData
File name to which the data underlying the plot should be saved, such as myPCA.tab. For untransposed data, this is the loading per-sample and PC as well as the eigenvalues. For transposed data, this is the rotation per-sample and PC and the eigenvalues. The projections are truncated to the number of eigenvalues for transposed data.
- --ntop
Use only the top N most variable rows in the original matrix. Specifying 0 will result in all rows being used. If the matrix is to be transposed, rows with 0 variance are always excluded, even if a values of 0 is specified. The default is 1000. (Default: 1000)
- --PCs
The principal components to plot. If specified, you must provide two different integers, greater than zero, separated by a space. An example (and the default) is “1 2”. (Default: [1, 2])
- --log2
log2 transform the datapoints prior to computing the PCA. Note that 0.01 is added to all values to prevent 0 values from becoming -infinity. Using this option with input that contains negative values will result in an error.
- --colors
A list of colors for the symbols. Color names and html hex string (e.g., #eeff22) are accepted. The color names should be space separated. For example, –colors red blue green. If not specified, the symbols will be given automatic colors.
- --markers
A list of markers for the symbols. (e.g., ‘<’,’>’,’o’) are accepted. The marker values should be space separated. For example, –markers ‘s’ ‘o’ ‘s’ ‘o’. If not specified, the symbols will be given automatic shapes.
- --version
show program’s version number and exit
example usages: plotPCA -in coverages.npz -o pca.png
Background
Principal component analysis (PCA) can be used, for example, to determine whether samples display greater variability between experimental conditions than between replicates of the same treatment. PCA is also useful to identify unexpected patterns, such as those caused by batch effects or outliers. Principal components represent the directions along which the variation in the data is maximal, so that the information (e.g., read coverage values) from thousands of regions can be represented by just a few dimensions.
Note
PCA is not designed to identify unknown groupings or clustering and given an unexpected result, it is up to the researcher to determine the experimental or technical reason underlying the principal components.
Usage example
plotPCA
needs the compressed numpy array
output from either multiBamSummary or multiBigwigSummary
$ deepTools2.0/bin/plotPCA -in readCounts.npz \
-o PCA_readCounts.png \
-T "PCA of read counts"
After perfoming the PCA on the values supplied as the input, plotPCA
will sort the principal components according to the amount of variability of the data that they explain. Based on this, you will obtain two plots:
the eigenvalues of the top two principal components
the Scree plot for the top five principal components where the bars represent the amount of variability explained by the individual factors and the red line traces the amount of variability is explained by the individual components in a cumulative manner