# deeptools package modules¶

## deeptools.SES_scaleFactor module¶

class deeptools.SES_scaleFactor.Tester[source]

Bases: object

deeptools.SES_scaleFactor.estimateScaleFactor(bamFilesList, binLength, numberOfSamples, normalizationLength, avg_method='median', blackListFileName=None, numberOfProcessors=1, verbose=False, chrsToSkip=[], mappingStatsList=[])[source]

Subdivides the genome into chunks to be analyzed in parallel using several processors. The code handles the creation of workers that compute fragment counts (coverage) for different regions and then collect and integrates the results.

bamFilesList : list
list of bam files to normalize
binLength : int
the window size in bp, where reads are going to be counted.
numberOfSamples : int
number of sites to sample from the genome. For more info see the documentation of the CountReadsPerBin class
normalizationLength : int
length, in bp, to normalize the data. For a value of 1, on average 1 read per base pair is found
avg_method : str
defines how the different values are to be summarized. The options are ‘mean’ and ‘median’
chrsToSkip : list
name of the chromosomes to be excluded from the scale estimation. Usually the chrX is included.
blackListFileName : str
BED file containing blacklisted regions
mappingStatsList : list
List of the number of mapped reads per file
dict
Dictionary with the following keys::
>>> test = Tester()
>>> bin_length = 50
>>> num_samples = 4
>>> _dict = estimateScaleFactor([test.bamFile1, test.bamFile2], bin_length, num_samples,  1)
>>> _dict['size_factors']
array([1. , 0.5])
>>> _dict['size_factors_based_on_mean']
array([1. , 0.5])


## deeptools.bamHandler module¶

deeptools.bamHandler.countReadsInInterval(args)[source]
deeptools.bamHandler.getMappingStats(bam, nThreads)[source]

This is used for CRAM files, since idxstats() and .mapped/.unmapped are meaningless

This requires pysam > 0.13.0

deeptools.bamHandler.openBam(bamFile, returnStats=False, nThreads=1, minimalDecoding=True)[source]

A wrapper for opening BAM/CRAM files.

bamFile: str
A BAM/CRAM file name
returnStats: bool
Return a tuple of (file_handle, nMappedReads, nUnmappedReads, statsDict). These additional values are needed by some downstream functions, since one can’t use file_handle.mapped on CRAM files (or idxstats())
If returnStats is True, number of threads to use for computing statistics
minimalDecoding: Bool
For CRAM files, don’t decode the read name, sequence, qual, or auxiliary tag fields (these aren’t used by most functions).

Returns either the file handle or a tuple as described in returnStats

## deeptools.correlation module¶

class deeptools.correlation.Correlation(matrix_file, corr_method=None, labels=None, remove_outliers=False, skip_zeros=False, log1p=False)[source]

class to work with matrices having sample data to compute correlations, plot them and make scatter plots

compute_correlation()[source]

computes spearman or pearson correlation for the samples in the matrix

The matrix should contain the values of each sample per column that’s why the transpose is used.

>>> matrix = np.array([[1, 2, 3, np.nan],
...                    [1, 2, 3, 4],
...                    [6, 4, 3, 1]]).T
>>> np.savez_compressed("/tmp/test_matrix.npz", matrix=matrix, labels=['a', 'b', 'c'])

>>> c = Correlation("/tmp/test_matrix.npz", corr_method='pearson')


the results should be as in R

>>> c.compute_correlation().filled(np.nan)
array([[ 1.        ,  1.        , -0.98198051],
[ 1.        ,  1.        , -0.98198051],
[-0.98198051, -0.98198051,  1.        ]])
>>> c.corr_method = 'spearman'
>>> c.corr_matrix = None
>>> c.compute_correlation()
array([[ 1.,  1., -1.],
[ 1.,  1., -1.],
[-1., -1.,  1.]])

static get_outlier_indices(data, max_deviation=200)[source]

The method is based on the median absolute deviation. See Boris Iglewicz and David Hoaglin (1993), “Volume 16: How to Detect and Handle Outliers”, The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

returns the list, without the outliers

The max_deviation=200 is like selecting a z-score larger than 200, just that it is based on the median and the median absolute deviation instead of the mean and the standard deviation.

load_matrix(matrix_file)[source]

loads a matrix file saved using the numpy savez method. Two keys are expected: ‘matrix’ and ‘labels’. The matrix should contain one sample per row

plot_correlation(plot_filename, plot_title='', vmax=None, vmin=None, colormap='jet', image_format=None, plot_numbers=False, plotWidth=11, plotHeight=9.5)[source]

plots a correlation using a symmetric heatmap

plot_pca(plot_filename=None, PCs=[1, 2], plot_title='', image_format=None, log1p=False, plotWidth=5, plotHeight=10, cols=None, marks=None)[source]

Plot the PCA of a matrix

Returns the matrix of plotted values.

plot_scatter(plot_filename, plot_title='', image_format=None, log1p=False, xRange=None, yRange=None)[source]

Plot the scatter plots of a matrix in which each row is a sample

plotly_correlation(corr_matrix, plot_filename, labels, plot_title='', vmax=None, vmin=None, plot_numbers=True, colormap='jet')[source]

plot_correlation, but using plotly

plotly_pca(plotFile, Wt, pvar, PCs, eigenvalues, cols, plotTitle)[source]

A plotly version of plot_pca, that’s called by it to do the actual plotting

plotly_scatter(plot_filename, corr_matrix, plot_title='', minXVal=None, maxXVal=None, minYVal=None, maxYVal=None)[source]

Make the scatter plot of a matrix with plotly

remove_outliers(verbose=True)[source]

get the outliers per column using the median absolute deviation method

Returns the filtered matrix

remove_rows_of_zeros()[source]
save_corr_matrix(file_handle)[source]

saves the correlation matrix

## deeptools.correlation_heatmap module¶

deeptools.correlation_heatmap.plot_correlation(corr_matrix, labels, plotFileName, vmax=None, vmin=None, colormap='jet', image_format=None, plot_numbers=False, plot_title='')[source]

class deeptools.countReadsPerBin.CountReadsPerBin(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, out_file_for_raw_data=None, statsList=[], mappedList=[])[source]

Bases: object

Collects coverage over multiple bam files using multiprocessing

This function collects read counts (coverage) from several bam files and returns an numpy array with the results. This class uses multiprocessing to compute the coverage.

bamFilesList : list
List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]
binLength : int
Length of the window/bin. This value is overruled by bedFile if present.
numberOfSamples : int
Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize in case such value is present and by bedFile in which case the number of samples and bins are defined in the bed file
numberOfProcessors : int
Number of processors to use. Default is 4
verbose : bool
Output messages. Default: False
region : str
Region to limit the computation in the form chrom:start:end.
bedFile : list of file_handles.
Each file handle corresponds to a bed file containing the regions for which to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.
blackListFileName : str
A string containing a BED file with blacklist regions.

Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is ‘int’, then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False
minMappingQuality : int
Reads of a mapping quality less than the give value are not considered. Default: None
ignoreDuplicates : bool
Whether read duplicates (same start, end position. If paired-end, same start-end for mates) are to be excluded. Default: false
chrToSkip: list
List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. ‘random’ or ‘Het’).
stepSize : int
the positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If seepSize is smaller than the binLength, then teh bins will overlap.
Determines if reads should be centered with respect to the fragment length.
samFlag_include : int
Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.
samFlag_exclude : int
Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.
zerosToNans : bool
If true, zero values encountered are transformed to Nans. Default false.
minFragmentLength : int
If greater than 0, fragments below this size are excluded.
maxFragmentLength : int
If greater than 0, fragments above this size are excluded.
out_file_for_raw_data : str
File name to save the raw counts computed
statsList : list
For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam
mappedList : list
For each BAM file in bamFilesList, the number of mapped reads in the file.

numpy array

Each row correspond to each bin/bed region and each column correspond to each of the bamFiles.

The test data contains reads for 200 bp.

>>> test = Tester()


The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[0., 0., 1., 1.],
[0., 1., 1., 2.]])

count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]

Counts the reads in each bam file at each ‘stepSize’ position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

chrom : str
Chrom name
start : int
start coordinate
end : int
end coordinate
bed_regions_list: list
List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.
numpy array
The result is a numpy array that as rows each bin and as columns each bam file.

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)


The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[0., 0.],
[0., 1.],
[1., 1.],
[1., 2.]])

getReadLength(read)[source]
getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]

Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition

---------------|==================|------------------
tileStart
|--------------------------------------|
|    <--      smoothRange     -->      |
|
tileStart - (smoothRange-tileSize)/2


Test for a smooth range that spans 3 tiles.

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)


Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)


Test smooth range truncated on start.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)


Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)


Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)

get_chunk_length(bamFilesHandles, genomeSize, chromSizes, chrLengths)[source]
get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam


For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])


In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000090, 5000100), (5000100, 5000110)])
array([ 1., nan])


In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([1., 2., 2.])

get_fragment_from_read(read)[source]

Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.

The following values are defined (for forward reads):

     |--          -- read.tlen --              --|
-----|===============>------------<==============|----
|               |            |

-----|===============>-----------<===============|----
|                           |               |


this is a sketch of a pair-end reads

The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.

list of tuples
[(fragment start, fragment end)]
>>> test = Tester()
>>> c.defaultFragmentLength=100
[(5000000, 5000100)]
[(5000000, 5000100)]
>>> c.defaultFragmentLength = 200
[(5001491, 5001691)]
[(5001536, 5001736)]
[(5001491, 5001527)]
[(5000000, 5000036)]


>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)
>>> c.defaultFragmentLength = 100
>>> c.defaultFragmentLength = 200

static is_proper_pair(read, maxPairedFragmentLength)[source]

Checks if a read is proper pair meaning that both mates are facing each other and are in the same chromosome and are not to far away. The sam flag for proper pair can not always be trusted. Note that if the fragment size is > maxPairedFragmentLength (~2kb usually) that False will be returned. :return: bool

>>> import pysam
>>> import os
>>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bam = pysam.AlignmentFile("{}/test_proper_pair_filtering.bam".format(root))
>>> iter = bam.fetch()
True
False
>>> cr.is_proper_pair(read, 1000) # "improper pair"
False
>>> cr.is_proper_pair(read, 1000) # "mismatch chr"
False
>>> cr.is_proper_pair(read, 1000) # "same orientation1"
False
>>> cr.is_proper_pair(read, 1000) # "same orientation2"
False
>>> cr.is_proper_pair(read, 1000) # "rev first"
False
>>> cr.is_proper_pair(read, 1000) # "rev first OK"
True
>>> cr.is_proper_pair(read, 1000) # "for first"
False
>>> cr.is_proper_pair(read, 1000) # "for first"
True

run(allArgs=None)[source]
class deeptools.countReadsPerBin.Tester[source]

Bases: object

getRead(readType)[source]

prepare arguments for test

deeptools.countReadsPerBin.countReadsInRegions_wrapper(args)[source]

Passes the arguments to countReadsInRegions_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the countReadsPerBin object

deeptools.countReadsPerBin.remove_row_of_zeros(matrix)[source]

## deeptools.deepBlue¶

class deeptools.deepBlue.deepBlue(sample, url='http://deepblue.mpi-inf.mpg.de/xmlrpc', userKey='anonymous_key')[source]

Bases: object

chroms(chrom=None)[source]

Like the chroms() function in pyBigWig, returns either chromsDict (chrom is None) or the length of a given chromosome

close()[source]
getChroms()[source]

Determines and sets the chromosome names/sizes for a given sample. On error, this raises a runtime exception.

self.chroms is then a dictionary of chromosome:length pairs

getEID()[source]

Given a sample name, return its associated experiment ID (or None on error).

self.experimentID is then the internal ID (e.g., e52525)

getGenome()[source]

Determines and sets the genome assigned to a given sample. On error, this raises a runtime exception.

self.genome is then the internal genome ID.

preload(regions, tmpDir=None)[source]

Given a sample and a set of regions, write a bigWig file containing the underlying signal.

This function returns the file name, which needs to be deleted by the calling function at some point.

This sends queries one chromosome at a time, due to memory limits on deepBlue

deeptools.deepBlue.isDeepBlue(fname)[source]

Returns true if the file ends in .wig, .wiggle, or .bedgraph, since these indicate a file on the deepBlue server

deeptools.deepBlue.makeChromTiles(db)[source]

Make a region for each chromosome

deeptools.deepBlue.makeRegions(BED, args)[source]

Given a list of BED/GTF files, make a list of regions. These are vaguely extended as appropriate. For simplicity, the maximum of –beforeRegionStartLength and –afterRegionStartLength are tacked on to each end and transcripts are used for GTF files.

deeptools.deepBlue.makeTiles(db, args)[source]

Given a deepBlue object, return a list of regions that will be queried

deeptools.deepBlue.mergeRegions(regions)[source]

Given a list of [(chrom, start, end), …], merge all overlapping regions

This returns a dict, where values are sorted lists of [start, end].

deeptools.deepBlue.preloadWrapper(foo)[source]

This is a wrapper around the preload function for multiprocessing

deeptools.getFragmentAndReadSize.getFragmentLength_worker(chrom, start, end, bamFile, distanceBetweenBins)[source]

chrom : str
chromosome name
start : int
region start
end : int
region end
bamFile : str
BAM file name
distanceBetweenBins : int
the number of bases at the end of each bin to ignore
np.array
an np.array, where first column is fragment length, the second is for read length
deeptools.getFragmentAndReadSize.getFragmentLength_wrapper(args)[source]
deeptools.getFragmentAndReadSize.get_read_and_fragment_length(bamFile, return_lengths=False, blackListFileName=None, binSize=50000, distanceBetweenBins=1000000, numberOfProcessors=None, verbose=False)[source]

Estimates the fragment length and read length through sampling

bamFile : str
BAM file name

return_lengths : bool numberOfProcessors : int verbose : bool binSize : int distanceBetweenBins : int

d : dict
tuple of two dictionaries, one for the fragment length and the other for the read length. The dictionaries summarise the mean, median etc. values

## deeptools.getRatio module¶

deeptools.getRatio.compute_ratio(value1, value2, args)[source]
deeptools.getRatio.getRatio(tileCoverage, args)[source]

The mapreduce method calls this function for each tile. The parameters (args) are fixed in the main method.

>>> funcArgs= {'valueType': 'ratio', 'scaleFactors': (1,1), 'pseudocount': 1}
>>> getRatio([9, 19], funcArgs)
0.5
>>> getRatio([0, 0], funcArgs)
1.0
>>> getRatio([np.nan, np.nan], funcArgs)
nan
>>> getRatio([np.nan, 1.0], funcArgs)
nan
>>> funcArgs['valueType'] ='subtract'
>>> getRatio([20, 10], funcArgs)
10
>>> funcArgs['scaleFactors'] = (1, 0.5)
>>> getRatio([10, 20], funcArgs)
0.0


The reciprocal ratio is of a and b is: is a/b if a/b > 1 else -1* b/a >>> funcArgs[‘valueType’] =’reciprocal_ratio’ >>> funcArgs[‘scaleFactors’] = (1, 1) >>> funcArgs[‘pseudocount’] = 0 >>> getRatio([2, 1], funcArgs) 2.0 >>> getRatio([1, 2], funcArgs) -2.0 >>> getRatio([1, 1], funcArgs) 1.0

## deeptools.getScorePerBigWigBin module¶

class deeptools.getScorePerBigWigBin.Tester[source]

Bases: object

deeptools.getScorePerBigWigBin.countFragmentsInRegions_worker(chrom, start, end, bigWigFiles, stepSize, binLength, save_data, bedRegions=None)[source]

returns the average score in each bigwig file at each ‘stepSize’ position within the interval start, end for a ‘binLength’ window. Because the idea is to get counts for window positions at different positions for sampling the bins are equally spaced and not adjacent.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Test dataset with two samples covering 200 bp. >>> test = Tester()

Fragment coverage. >>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 50, 25, False)[0]) array([[1., 1., 2., 2.],

[1., 1., 1., 3.]])
>>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 200, 200, False)[0])
array([[1.5],
[1.5]])


BED regions: >>> bedRegions = [[test.chrom, [(45, 55)]], [test.chrom, [(95, 105)]], [test.chrom, [(145, 155)]]] >>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200,[test.bwFile1, test.bwFile2], 200, 200, False, … bedRegions=bedRegions)[0]) array([[1. , 1.5, 2. ],

[1. , 1. , 2. ]])
deeptools.getScorePerBigWigBin.countReadsInRegions_wrapper(args)[source]
deeptools.getScorePerBigWigBin.getChromSizes(bigwigFilesList)[source]

Get chromosome sizes from bigWig file with pyBigWig

Test dataset with two samples covering 200 bp. >>> test = Tester()

Chromosome name(s) and size(s). >>> assert(getChromSizes([test.bwFile1, test.bwFile2]) == ([(‘3R’, 200)], set([])))

deeptools.getScorePerBigWigBin.getScorePerBin(bigWigFiles, binLength, numberOfProcessors=1, verbose=False, region=None, bedFile=None, blackListFileName=None, stepSize=None, chrsToSkip=[], out_file_for_raw_data=None, allArgs=None)[source]

This function returns a matrix containing scores (median) for the coverage of fragments within a region. Each row corresponds to a sampled region. Likewise, each column corresponds to a bigwig file.

Test dataset with two samples covering 200 bp. >>> test = Tester() >>> np.transpose(getScorePerBin([test.bwFile1, test.bwFile2], 50, 3)) array([[1., 1., 2., 2.],

[1., 1., 1., 3.]])

## deeptools.heatmapper module¶

deeptools.heatmapper.chopRegions(exonsInput, left=0, right=0)[source]

exons is a list of (start, end) tuples. The goal is to chop these into separate lists of tuples, to take care or unscaled regions. “left” and “right” denote regions of a given size to exclude from the normal binning process (unscaled regions).

This outputs three lists of (start, end) tuples:

leftBins: 5’ unscaled regions bodyBins: body bins for scaling rightBins: 3’ unscaled regions

In addition are two integers padLeft: Number of bases of padding on the left (due to not being able to fulfill “left”) padRight: As above, but on the right side

deeptools.heatmapper.chopRegionsFromMiddle(exonsInput, left=0, right=0)[source]

Like chopRegions(), above, but returns two lists of tuples on each side of the center point of the exons.

The steps are as follow:

1. Find the center point of the set of exons (e.g., [(0, 200), (300, 400), (800, 900)] would be centered at 200)
• If a given exon spans the center point then the exon is split
1. The given number of bases at the end of the left-of-center list are extracted
• If the set of exons don’t contain enough bases, then padLeft is incremented accordingly
1. As above but for the right-of-center list
2. A tuple of (#2, #3, pading on the left, and padding on the right) is returned
deeptools.heatmapper.compute_sub_matrix_wrapper(args)[source]
class deeptools.heatmapper.heatmapper[source]

Bases: object

Class to handle the reading and plotting of matrices.

static change_chrom_names(chrom)[source]

Changes UCSC chromosome names to ensembl chromosome names and vice versa.

computeMatrix(score_file_list, regions_file, parameters, blackListFileName=None, verbose=False, allArgs=None)[source]

Splits into multiple cores the computation of the scores per bin for each region (defined by a hash ‘#’ in the regions (BED/GFF) file.

static compute_sub_matrix_worker(self, chrom, start, end, score_file_list, parameters, regions)[source]
numpy matrix
A numpy matrix that contains per each row the values found per each of the regions given
static coverage_from_array(valuesArray, zones, binSize, avgType)[source]
static coverage_from_big_wig(bigwig, chrom, zones, binSize, avgType, nansAsZeros=False, verbose=True)[source]

uses pyBigWig to query a region define by chrom and zones. The output is an array that contains the bigwig value per base pair. The summary over bins is done in a later step when coverage_from_array is called. This method is more reliable than querying the bins directly from the bigwig, which should be more efficient.

By default, any region, even if no chromosome match is found on the bigwig file, produces a result. In other words no regions are skipped.

zones: array as follows zone0: region before the region start,
zone1: 5’ unscaled region (if present) zone2: the body of the region (not always present) zone3: 3’ unscaled region (if present) zone4: the region from the end of the region downstream

each zone is a tuple containing start, end, and number of bins

This is useful if several matrices wants to be merged or if the sorted BED output of one computeMatrix operation needs to be used for other cases

getTicks(idx)[source]

This is essentially a wrapper around getProfileTicks to accomdate the fact that each column has its own ticks.

get_individual_matrices(matrix)[source]

In case multiple matrices are saved one after the other this method splits them appart. Returns a list containing the matrices

get_num_individual_matrix_cols()[source]

returns the number of columns that each matrix should have. This is done because the final matrix that is plotted can be composed of smaller matrices that are merged one after the other.

static matrix_avg(matrix, avgType='mean')[source]
matrix_from_dict(matrixDict, regionsDict, parameters)[source]
static my_average(valuesArray, avgType='mean')[source]

computes the mean, median, etc but only for those values that are not Nan

read_matrix_file(matrix_file)[source]
save_BED(file_handle)[source]
save_matrix(file_name)[source]

saves the data required to reconstruct the matrix the format is: A header containing the parameters used to create the matrix encoded as: @key:value key2:value2 etc… The rest of the file has the same first 5 columns of a BED file: chromosome name, start, end, name, score and strand, all separated by tabs. After the fifth column the matrix values are appended separated by tabs. Groups are separated by adding a line starting with a hash (#) and followed by the group name.

The file is gzipped.

save_matrix_values(file_name)[source]
save_tabulated_values(file_handle, reference_point_label='TSS', start_label='TSS', end_label='TES', averagetype='mean')[source]

Saves the values averaged by col using the avg_type given

Args:
file_handle: file name to save the file reference_point_label: Name of the reference point label start_label: Name of the star label end_label: Name of the end label averagetype: average type (e.g. mean, median, std)
deeptools.heatmapper.trimZones(zones, maxLength, binSize, padRight)[source]

Given a (variable length) list of lists of (start, end) tuples, trim/remove and tuple that extends past maxLength (e.g., the end of a chromosome)

Returns the trimmed zones and padding

## deeptools.heatmapper_utilities module¶

deeptools.heatmapper_utilities.getProfileTicks(hm, referencePointLabel, startLabel, endLabel, idx)[source]

returns the position and labelling of the xticks that correspond to the heatmap

As of deepTools 3, the various parameters can be lists, in which case we then need to index things (the idx parameter)

deeptools.heatmapper_utilities.plot_single(ax, ma, average_type, color, label, plot_type='simple')[source]

Adds a line to the plot in the given ax using the specified method

ax : matplotlib axis
matplotlib axis
ma : numpy array
numpy array The data on this matrix is summarized according to the average_type argument.
average_type : str
string values are sum mean median min max std
color : str
a valid color: either a html color name, hex (e.g #002233), RGB + alpha tuple or list or RGB tuple or list
label : str
label
plot_type: str
type of plot. Either ‘se’ for standard error, ‘std’ for standard deviation, ‘overlapped_lines’ to plot each line of the matrix, fill to plot the area between the x axis and the value or None, just to plot the average line.
ax
matplotlib axis
>>> import matplotlib.pyplot as plt
>>> import os
>>> fig = plt.figure()
>>> matrix = np.array([[1,2,3],
...                    [4,5,6],
...                    [7,8,9]])
>>> ax = plot_single(ax, matrix -2, 'mean', color=[0.6, 0.8, 0.9], label='fill light blue', plot_type='fill')
>>> ax = plot_single(ax, matrix, 'mean', color='blue', label='red')
>>> ax = plot_single(ax, matrix + 5, 'mean', color='red', label='red', plot_type='std')
>>> ax = plot_single(ax, matrix + 10, 'mean', color='#cccccc', label='gray se', plot_type='se')
>>> ax = plot_single(ax, matrix + 20, 'mean', color=(0.9, 0.5, 0.9), label='violet', plot_type='std')
>>> ax = plot_single(ax, matrix + 30, 'mean', color=(0.9, 0.5, 0.9, 0.5), label='violet with alpha', plot_type='std')
>>> leg = ax.legend()
>>> plt.savefig("/tmp/test.pdf")
>>> plt.close()
>>> fig = plt.figure()
>>> os.remove("/tmp/test.pdf")

deeptools.heatmapper_utilities.plotly_single(ma, average_type, color, label, plot_type='simple')[source]

A plotly version of plot_single. Returns a list of traces

## deeptools.mapReduce module¶

deeptools.mapReduce.blSubtract(t, chrom, chunk)[source]

If a genomic region overlaps with a blacklisted region, then subtract that region out

returns a list of lists

deeptools.mapReduce.getUserRegion(chrom_sizes, region_string, max_chunk_size=1000000.0)[source]

Verifies if a given region argument, given by the user is valid. The format of the region_string is chrom:start:end:tileSize where start, end and tileSize are optional.

Parameters: chrom_sizes – dictionary of chromosome/scaffold size. Key=chromosome name region_string – a string of the form chr:start:end max_chunk_size – upper limit for the chunk size tuple chrom_size for the region start, region end, chunk size

#>>> data = getUserRegion({‘chr2’: 1000}, “chr1:10:10”) #Traceback (most recent call last): # … #NameError: Unknown chromosome: chr1 #Known chromosomes are: [‘chr2’]

If the region end is biger than the chromosome size, this value is used instead >>> getUserRegion({‘chr2’: 1000}, “chr2:10:1001”) ([(‘chr2’, 1000)], 10, 1000, 990)

Test chunk and regions size reduction to match tile size >>> getUserRegion({‘chr2’: 200000}, “chr2:10:123344:3”) ([(‘chr2’, 123344)], 9, 123345, 123336)

Test chromosome name mismatch >>> getUserRegion({‘2’: 200000}, “chr2:10:123344:3”) ([(‘2’, 123344)], 9, 123345, 123336) >>> getUserRegion({‘chrM’: 200000}, “MT:10:123344:3”) ([(‘chrM’, 123344)], 9, 123345, 123336)

deeptools.mapReduce.mapReduce(staticArgs, func, chromSize, genomeChunkLength=None, region=None, bedFile=None, blackListFileName=None, numberOfProcessors=4, verbose=False, includeLabels=False, keepExons=False, transcriptID='transcriptID', exonID='exonID', transcript_id_designator='transcript_id', self_=None)[source]

Split the genome into parts that are sent to workers using a defined number of procesors. Results are collected and returned.

For each genomic region the given ‘func’ is called using the following parameters:

chrom, start, end, staticArgs

The arg are static, pickable variables that need to be sent to workers.

The genome chunk length corresponds to a fraction of the genome, in bp, that is send to each of the workers for processing.

Depending on the type of process a larger or shorter regions may be preferred

Parameters: chromSize – A list of duples containing the chromosome name and its length region – The format is chr:start:end:tileSize (see function getUserRegion) staticArgs – tuple of arguments that are sent to the given ‘func’ func – function to call. The function is called using the following parameters (chrom, start, end, staticArgs) bedFile – Is a bed file is given, the args to the func to be called are extended to include a list of bed defined regions. blackListFileName – A list of regions to exclude from all computations. Note that this has genomeChunkLength resolution… self – In case mapreduce should make a call to an object the self variable has to be passed. includeLabels – Pass group and transcript labels into the calling function. These are added to the static args (groupLabel and transcriptName).

If “includeLabels” is true, a tuple of (results, labels) is returned

## deeptools.sumCoveragePerBin¶

class deeptools.sumCoveragePerBin.SumCoveragePerBin(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, out_file_for_raw_data=None, statsList=[], mappedList=[])[source]

This is an extension of CountReadsPerBin for use with plotFingerprint. There, we need to sum the per-base coverage.

get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam
>>> c = SumCoveragePerBin([], stepSize=1, extendReads=300)


For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested. Note that reads are NOT extended, due to there being a 0 length input list of BAM files!

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])


In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([2., 4., 4.])

class deeptools.sumCoveragePerBin.Tester[source]

Bases: object

## deeptools.utilities module¶

deeptools.utilities.bam_blacklisted_reads(bam_handle, chroms_to_ignore, blackListFileName=None, numberOfProcessors=1)[source]
deeptools.utilities.bam_blacklisted_worker(args)[source]
deeptools.utilities.bam_total_reads(bam_handle, chroms_to_ignore, stats)[source]

Count the total number of mapped reads in a BAM file, filtering the chromosome given in chroms_to_ignore list

deeptools.utilities.convertCmap(c, vmin=0, vmax=1)[source]
deeptools.utilities.copyFileInMemory(filePath, suffix='')[source]

copies a file into the special /dev/shm device which moves the file into memory. This process speeds ups the multiprocessor access to such files

deeptools.utilities.getCommonChrNames(bamFileHandles, verbose=True)[source]

Compares the names and lengths of a list of bam file handles. The input is list of pysam file handles.

The function returns a duple containing the common chromosome names and the common chromome lengths.

Hopefully, only _random and chrM are not common.

deeptools.utilities.getGC_content(tb, chrom, fragStart, fragEnd, fraction=True)[source]
deeptools.utilities.getTLen(read, notAbs=False)[source]

Get the observed template length of a read. For a paired-end read, this is normally just the TLEN field. For SE reads this is the observed coverage of the genome (excluding splicing).

deeptools.utilities.getTempFileName(suffix='')[source]

Return a temporary file name. The calling function is responsible for deleting this upon completion.

deeptools.utilities.gtfOptions(allArgs=None)[source]

This is used a couple places to setup arguments to mapReduce

deeptools.utilities.mungeChromosome(chrom, chromList)[source]

A generic chromosome munging function. “chrom” is munged by adding/removing “chr” such that it appears in chromList

On error, None is returned, but a common chromosome list should be used beforehand to avoid this possibility

deeptools.utilities.smartLabel(label)[source]

Given a file name, likely with a path, return the file name without the path and with the file extension removed. Thus, something like /path/to/some.special.file should return some.special, since only the first extension (if present) should be stripped.

deeptools.utilities.smartLabels(labels)[source]
deeptools.utilities.tbitToBamChrName(tbitNames, bamNames)[source]

checks if the chromosome names from the two-bit and bam file coincide. In case they do not coincide, a fix is tried. If successful, then a mapping table is returned. tbitNames and bamNames should be lists

deeptools.utilities.toBytes(s)[source]

Like toString, but for functions requiring bytes in python3

deeptools.utilities.toString(s)[source]

This takes care of python2/3 differences

## deeptools.writeBedGraph module¶

class deeptools.writeBedGraph.WriteBedGraph(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, out_file_for_raw_data=None, statsList=[], mappedList=[])[source]

Reads bam files coverages and writes a bedgraph or bigwig file

Extends the CountReadsPerBin object such that the coverage of bam files is writen to multiple bedgraph files at once.

The bedgraph files are later merge into one and converted into a bigwig file if necessary.

The constructor arguments are the same as for CountReadsPerBin. However, when calling the run method, the following parameters have to be passed

Given the following distribution of reads that cover 200 on a chromosome named ‘3R’:

  0                              100                           200
|------------------------------------------------------------|
A                                ===============
===============

B                 ===============               ===============
===============
===============

>>> import tempfile
>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"

>>> outFile = tempfile.NamedTemporaryFile()
>>> bam_file = test_path +  "testA.bam"


For the example a simple scaling function is going to be used. This function takes the coverage found at each region and multiplies it to the scaling factor. In this case the scaling factor is 1.5

>>> function_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.5}


Restrict process to a region between positions 0 and 200 of chromosome 3R

>>> region = '3R:0:200'


Set up such that coverage is computed for consecutive bins of length 25 bp >>> bin_length = 25 >>> step_size = 25

>>> num_sample_sites = 0 #overruled by step_size
>>> c = WriteBedGraph([bam_file], binLength=bin_length, region=region, stepSize=step_size)
>>> c.run(function_to_call, funcArgs, outFile.name)
>>> f = open(outFile.name, 'r')
['3R\t0\t100\t0\n', '3R\t100\t200\t1.5\n']
>>> f.close()
>>> outFile.close()

run(func_to_call, func_args, out_file_name, blackListFileName=None, format='bedgraph', smoothLength=0)[source]

Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.

func_to_call : str
function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.
func_args : dict
dict of arguments to pass to func. E.g. {‘scaleFactor’:1.0}
out_file_name : str
name of the file to save the resulting data.
smoothLength : int
Distance in bp for smoothing the coverage per tile.
writeBedGraph_worker(chrom, start, end, func_to_call, func_args, bed_regions_list=None)[source]

Writes a bedgraph based on the read coverage found on bamFiles

The given func is called to compute the desired bedgraph value using the funcArgs

chrom : str
Chrom name
start : int
start coordinate
end : int
end coordinate
func_to_call : str
function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.
func_args : dict
dict of arguments to pass to func.
smoothLength : int
Distance in bp for smoothing the coverage per tile.
bed_regions_list: list
List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

A list of [chromosome, start, end, temporary file], where the temporary file contains the bedgraph results for the region queried.

>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bamFile1 = test_path +  "testA.bam"
>>> bin_length = 50
>>> number_of_samples = 0 # overruled by step_size
>>> func_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.0}

>>> c = WriteBedGraph([bamFile1], bin_length, number_of_samples, stepSize=50)
>>> tempFile = c.writeBedGraph_worker( '3R', 0, 200, func_to_call, funcArgs)
>>> f = open(tempFile[3], 'r')
['3R\t0\t100\t0\n', '3R\t100\t200\t1\n']
>>> f.close()
>>> os.remove(tempFile[3])

deeptools.writeBedGraph.bedGraphToBigWig(chromSizes, bedGraphFiles, bigWigPath)[source]

Takes a sorted list of bedgraph files and write them to a single bigWig file using pyBigWig. The order of bedGraphFiles must match that of chromSizes!

deeptools.writeBedGraph.getGenomeChunkLength(bamHandles, tile_size, mappedList)[source]

Tries to estimate the length of the genome sent to the workers based on the density of reads per bam file and the number of bam files.

The chunk length should be a multiple of the tileSize

deeptools.writeBedGraph.ratio(tile_coverage, args)[source]

tileCoverage should be an list of two elements

deeptools.writeBedGraph.scaleCoverage(tile_coverage, args)[source]

tileCoverage should be an list with only one element

deeptools.writeBedGraph.writeBedGraph_wrapper(args)[source]

Passes the arguments to writeBedGraph_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the WriteBedGraph object

## deeptools.writeBedGraph_bam_and_bw module¶

deeptools.writeBedGraph_bam_and_bw.getCoverageFromBigwig(bigwigHandle, chrom, start, end, tileSize, missingDataAsZero=False)[source]
deeptools.writeBedGraph_bam_and_bw.writeBedGraph(bamOrBwFileList, outputFileName, fragmentLength, func, funcArgs, tileSize=25, region=None, blackListFileName=None, numberOfProcessors=1, format='bedgraph', extendPairedEnds=True, missingDataAsZero=False, smoothLength=0, fixed_step=False, verbose=False)[source]

Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.

deeptools.writeBedGraph_bam_and_bw.writeBedGraph_worker(chrom, start, end, tileSize, defaultFragmentLength, bamOrBwFileList, func, funcArgs, extendPairedEnds=True, smoothLength=0, missingDataAsZero=False, fixed_step=False)[source]

Writes a bedgraph having as base a number of bam files.

The given func is called to compute the desired bedgraph value using the funcArgs

tileSize

deeptools.writeBedGraph_bam_and_bw.writeBedGraph_wrapper(args)[source]