deeptools package modules

deeptools.SES_scaleFactor module

class deeptools.SES_scaleFactor.Tester[source]

Bases: object

deeptools.SES_scaleFactor.estimateScaleFactor(bamFilesList, binLength, numberOfSamples, normalizationLength, avg_method='median', blackListFileName=None, numberOfProcessors=1, verbose=False, chrsToSkip=[], mappingStatsList=[])[source]

Subdivides the genome into chunks to be analyzed in parallel using several processors. The code handles the creation of workers that compute fragment counts (coverage) for different regions, then collects and integrates the results.

Parameters

bamFilesList : list

list of bam files to normalize

binLength : int

the window size in bp, where reads are going to be counted.

numberOfSamples : int

number of sites to sample from the genome. For more info see the documentation of the CountReadsPerBin class

normalizationLength : int

length, in bp, to normalize the data. For a value of 1, on average 1 read per base pair is found

avg_method : str

defines how the different values are to be summarized. The options are ‘mean’ and ‘median’

chrsToSkip : list

names of the chromosomes to be excluded from the scale estimation; chrX is commonly included in this list

blackListFileName : str

BED file containing blacklisted regions

mappingStatsList : list

List of the number of mapped reads per file

Returns

dict
Dictionary with the following keys:

'size_factors', 'size_factors_based_on_mapped_reads', 'size_factors_SES', 'size_factors_based_on_mean', 'size_factors_based_on_median', 'mean', 'meanSES', 'median', 'reads_per_bin', 'std', 'sites_sampled'

Examples

>>> test = Tester()
>>> bin_length = 50
>>> num_samples = 4
>>> _dict = estimateScaleFactor([test.bamFile1, test.bamFile2], bin_length, num_samples,  1)
>>> _dict['size_factors']
array([1. , 0.5])
>>> _dict['size_factors_based_on_mean']
array([1. , 0.5])

deeptools.bamHandler module

deeptools.bamHandler.countReadsInInterval(args)[source]
deeptools.bamHandler.getMappingStats(bam, nThreads)[source]

This is used for CRAM files, since idxstats() and .mapped/.unmapped are meaningless

This requires pysam > 0.13.0

deeptools.bamHandler.openBam(bamFile, returnStats=False, nThreads=1, minimalDecoding=True)[source]

A wrapper for opening BAM/CRAM files.

bamFile: str

A BAM/CRAM file name

returnStats: bool

Return a tuple of (file_handle, nMappedReads, nUnmappedReads, statsDict). These additional values are needed by some downstream functions, since one can’t use file_handle.mapped on CRAM files (or idxstats())

nThreads: int

If returnStats is True, number of threads to use for computing statistics

minimalDecoding: bool

For CRAM files, don’t decode the read name, sequence, qual, or auxiliary tag fields (these aren’t used by most functions).

Returns either the file handle or a tuple as described in returnStats
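A minimal usage sketch (the BAM file name is a placeholder, not part of the package):

from deeptools.bamHandler import openBam

# Plain open: returns a pysam.AlignmentFile handle.
bam = openBam("example.bam")

# With returnStats=True, a 4-tuple is returned as described above.
bam, mapped, unmapped, stats = openBam("example.bam", returnStats=True, nThreads=4)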

deeptools.correlation module

class deeptools.correlation.Correlation(matrix_file, corr_method=None, labels=None, remove_outliers=False, skip_zeros=False, log1p=False)[source]

Bases: object

class to work with matrices of per-sample data: compute correlations, plot them, and make scatter plots

compute_correlation()[source]

computes spearman or pearson correlation for the samples in the matrix

The matrix should contain the values of each sample per column; that is why the transpose is used.

>>> matrix = np.array([[1, 2, 3, np.nan],
...                    [1, 2, 3, 4],
...                    [6, 4, 3, 1]]).T
>>> np.savez_compressed("/tmp/test_matrix.npz", matrix=matrix, labels=['a', 'b', 'c'])
>>> c = Correlation("/tmp/test_matrix.npz", corr_method='pearson')

the results should be as in R

>>> c.compute_correlation().filled(np.nan)
array([[ 1.        ,  1.        , -0.98198051],
       [ 1.        ,  1.        , -0.98198051],
       [-0.98198051, -0.98198051,  1.        ]])
>>> c.corr_method = 'spearman'
>>> c.corr_matrix = None
>>> c.compute_correlation()
array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])
static get_outlier_indices(data, max_deviation=200)[source]

The method is based on the median absolute deviation. See Boris Iglewicz and David Hoaglin (1993), “Volume 16: How to Detect and Handle Outliers”, The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

returns the indices of the outliers in the data list

The max_deviation=200 is like selecting a z-score larger than 200, just that it is based on the median and the median absolute deviation instead of the mean and the standard deviation.
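A minimal sketch of this modified z-score idea (an illustration of the Iglewicz and Hoaglin approach, not the deepTools implementation itself):

import numpy as np

def mad_outlier_indices(data, max_deviation=200):
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))       # median absolute deviation
    if mad == 0:
        return np.array([], dtype=int)           # no spread, no outliers flagged
    modified_z = 0.6745 * (data - median) / mad  # 0.6745 makes the MAD comparable to sigma
    return np.flatnonzero(np.abs(modified_z) > max_deviation)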

load_matrix(matrix_file)[source]

loads a matrix file saved using the numpy savez method. Two keys are expected: ‘matrix’ and ‘labels’. The matrix should contain one sample per row

plot_correlation(plot_filename, plot_title='', vmax=None, vmin=None, colormap='jet', image_format=None, plot_numbers=False, plotWidth=11, plotHeight=9.5)[source]

plots a correlation using a symmetric heatmap

plot_pca(plot_filename=None, PCs=[1, 2], plot_title='', image_format=None, log1p=False, plotWidth=5, plotHeight=10, cols=None, marks=None)[source]

Plot the PCA of a matrix

Returns the matrix of plotted values.

plot_scatter(plot_filename, plot_title='', image_format=None, log1p=False, xRange=None, yRange=None)[source]

Plot the scatter plots of a matrix in which each row is a sample

plotly_correlation(corr_matrix, plot_filename, labels, plot_title='', vmax=None, vmin=None, plot_numbers=True, colormap='jet')[source]

plot_correlation, but using plotly

plotly_pca(plotFile, Wt, pvar, PCs, eigenvalues, cols, plotTitle)[source]

A plotly version of plot_pca, that’s called by it to do the actual plotting

plotly_scatter(plot_filename, corr_matrix, plot_title='', minXVal=None, maxXVal=None, minYVal=None, maxYVal=None)[source]

Make the scatter plot of a matrix with plotly

remove_outliers(verbose=True)[source]

get the outliers per column using the median absolute deviation method

Returns the filtered matrix

remove_rows_of_zeros()[source]
save_corr_matrix(file_handle)[source]

saves the correlation matrix

deeptools.correlation_heatmap module

deeptools.correlation_heatmap.plot_correlation(corr_matrix, labels, plotFileName, vmax=None, vmin=None, colormap='jet', image_format=None, plot_numbers=False, plot_title='')[source]

deeptools.countReadsPerBin module

class deeptools.countReadsPerBin.CountReadsPerBin(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, out_file_for_raw_data=None, bed_and_bin=False, statsList=[], mappedList=[])[source]

Bases: object

Collects coverage over multiple bam files using multiprocessing

This function collects read counts (coverage) from several bam files and returns a numpy array with the results. This class uses multiprocessing to compute the coverage.

Parameters

bamFilesList : list

List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]

binLength : int

Length of the window/bin. This value is overruled by bedFile if present.

numberOfSamples : int

Total number of samples. The genome is divided into numberOfSamples windows, each with a length equal to binLength. This value is overruled by stepSize if present, and by bedFile, in which case the number of samples and bins are defined in the bed file.

numberOfProcessors : int

Number of processors to use. Default: 1

verbose : bool

Output messages. Default: False

region : str

Region to limit the computation in the form chrom:start:end.

bedFile : list of file handles

Each file handle corresponds to a bed file containing the regions for which to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.

blackListFileName : str

A string containing a BED file with blacklist regions.

extendReads : bool, int

Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the region expected to be covered by single-end reads). If the value is an int, this is interpreted as the fragment length to which unpaired reads are extended. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is True and no length is given, the fragment size is sampled from the library, but only if the library is paired-end. Default: False

minMappingQuality : int

Reads with a mapping quality less than the given value are not considered. Default: None

ignoreDuplicates : bool

Whether to exclude duplicate reads (same start and end position; if paired-end, same start and end for both mates). Default: False

chrsToSkip : list

List of chromosome names to exclude from the coverage computation. This is useful to remove unwanted chromosomes (e.g. ‘random’ or ‘Het’).

stepSize : int

The positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1 will compute the coverage at each base pair. If the stepSize is equal to the binLength, the coverage is computed for consecutive bins. If stepSize is smaller than the binLength, the bins will overlap.

center_read : bool

Determines if reads should be centered with respect to the fragment length.

samFlag_include : int

Extracts only those reads having the SAM flag. For example, to get only reads that are first mates, a samFlag of 64 could be used. Similarly, samFlag_include can be used to select only reads mapping to the reverse strand or to get only properly paired reads.

samFlag_exclude : int

Removes reads that match the SAM flag. For example, to get all reads that map to the forward strand, a samFlag_exclude of 16 should be used, which excludes all reads mapping to the reverse strand.

zerosToNans : bool

If True, zero values encountered are transformed to NaNs. Default: False

skipZeroOverZero : bool

If true, skip bins where all input BAM files have no coverage (only applicable to bamCompare).

minFragmentLength : int

If greater than 0, fragments below this size are excluded.

maxFragmentLength : int

If greater than 0, fragments above this size are excluded.

out_file_for_raw_data : str

File name to save the raw counts computed

statsList : list

For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam

mappedList : list

For each BAM file in bamFilesList, the number of mapped reads in the file.

bed_and_bin : boolean

If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile

genomeChunkSize : int

If not None, the length of the genome used for multiprocessing.

Returns

numpy array

Each row corresponds to a bin/bed region and each column corresponds to one of the bamFiles.

Examples

The test data contains reads for 200 bp.

>>> test = Tester()

The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[0., 0., 1., 1.],
       [0., 1., 1., 2.]])
count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]

Counts the reads in each bam file at each ‘stepSize’ position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Parameters

chrom : str

Chrom name

start : int

start coordinate

end : int

end coordinate

bed_regions_list: list

List of lists of tuples of the form (start, end) corresponding to bed regions to be processed. If no bed file was passed to the object constructor, this list is empty.

Returns

numpy array

The result is a numpy array with one row per bin and one column per bam file.

Examples

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)

The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[0., 0.],
       [0., 1.],
       [1., 1.],
       [1., 2.]])
getReadLength(read)[source]
getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]

Given a tile index position and a tile size (length), return new indices over a larger range, called the smoothRange. This region is centered on the tileIndex and spans both sides to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition.

---------------|==================|------------------
           tileStart
      |--------------------------------------|
      |    <--      smoothRange     -->      |
      |
tileStart - (smoothRange-tileSize)/2

Test for a smooth range that spans 3 tiles.

Examples

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)

Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)

Test smooth range truncated on start and end.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)

Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)

Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)
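The doctests above are reproduced by the following re-implementation sketch (an assumption about the logic, not the actual method body): the smooth range is split into whole tiles, and when the tile count is even the extra tile goes to the left.

import math

def smooth_range(tileIndex, tileSize, smoothRange, maxPosition):
    smoothTiles = smoothRange // tileSize            # whole tiles covered by smoothRange
    if smoothTiles <= 1:
        return (tileIndex, tileIndex + 1)
    side = (smoothTiles - 1) / 2.0                   # tiles on each side of tileIndex
    left = int(math.ceil(side))                      # an even count favours the left side
    right = int(math.floor(side)) + 1
    return (max(tileIndex - left, 0), min(maxPosition, tileIndex + right))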
get_chunk_length(bamFilesHandles, genomeSize, chromSizes, chrLengths)[source]
get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam
>>> c = CountReadsPerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])

In the following example a paired read is extended to the fragment length, which is 100. The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently. At positions 5000090-5000100 one fragment of length 100 overlaps, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000090, 5000100), (5000100, 5000110)])
array([ 1., nan])

In the following case the read length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([1., 2., 2.])
get_fragment_from_read(read)[source]

Get the start and end positions of a read. If requested, the reads are extended as follows: if reads are paired-end, each read mate is extended to match the fragment length; otherwise, a default fragment length is used. If reads are split (as given by the CIGAR string), the multiple positions of the read are returned. When reads are extended, the CIGAR information is skipped.

Parameters

read: pysam object.

The following values are defined (for forward reads):

     |--          -- read.tlen --              --|
     |-- read.alen --|
-----|===============>------------<==============|----
     |               |            |
read.reference_start
            read.reference_end  read.pnext

  and for reverse reads


     |--             -- read.tlen --           --|
                                 |-- read.alen --|
-----|===============>-----------<===============|----
     |                           |               |
  read.pnext           read.reference_start  read.reference_end

This is a sketch of a paired-end read.

The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.

Parameters

read : pysam read object

Returns

list of tuples

[(fragment start, fragment end)]

>>> test = Tester()
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True)
>>> c.defaultFragmentLength=100
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000100)]
>>> c.get_fragment_from_read(test.getRead("paired-reverse"))
[(5000000, 5000100)]
>>> c.defaultFragmentLength = 200
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001691)]
>>> c.get_fragment_from_read(test.getRead("single-reverse"))
[(5001536, 5001736)]
>>> c.defaultFragmentLength = 'read length'
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001527)]
>>> c.defaultFragmentLength = 'read length'
>>> c.extendReads = False
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000036)]

Tests for read centering.

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)
>>> c.defaultFragmentLength = 100
>>> assert c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)]
>>> c.defaultFragmentLength = 200
>>> assert c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)]
static is_proper_pair(read, maxPairedFragmentLength)[source]

Checks if a read is a proper pair, meaning that both mates face each other, map to the same chromosome, and are not too far apart. The SAM flag for proper pair cannot always be trusted. Note that if the fragment size is > maxPairedFragmentLength (~2kb usually) then False will be returned. :return: bool

>>> import pysam
>>> import os
>>> from deeptools.countReadsPerBin import CountReadsPerBin as cr
>>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bam = pysam.AlignmentFile("{}/test_proper_pair_filtering.bam".format(root))
>>> iter = bam.fetch()
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "keep" read
True
>>> cr.is_proper_pair(read, 200) # "keep" read, but maxPairedFragmentLength is too short
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "improper pair"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "mismatch chr"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "same orientation1"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "same orientation2"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "rev first"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "rev first OK"
True
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "for first"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "for first"
True
run(allArgs=None)[source]
class deeptools.countReadsPerBin.Tester[source]

Bases: object

getRead(readType)[source]

prepare arguments for test

deeptools.countReadsPerBin.countReadsInRegions_wrapper(args)[source]

Passes the arguments to countReadsInRegions_worker. This is a step required given the constraints of the multiprocessing module. The args var contains as first element the ‘self’ value from the CountReadsPerBin object.

deeptools.countReadsPerBin.estimateSizeFactors(m)[source]

Compute size factors in the same way as DESeq2. The inverse of that is returned, as it’s then compatible with bamCoverage.

m : a numpy ndarray

>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]])
>>> sf = estimateSizeFactors(m)
>>> assert np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4)
>>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
>>> sf = estimateSizeFactors(m)
>>> assert np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4)
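The median-of-ratios computation can be sketched as follows (an illustration of the DESeq2 approach the docstring refers to, not the deepTools code itself); it reproduces the doctest values above:

import numpy as np

def median_of_ratios(m):
    m = np.asarray(m, dtype=float)
    rows = m[np.all(m > 0, axis=1)]              # keep bins with counts in every sample
    loggeomeans = np.mean(np.log(rows), axis=1)  # log geometric mean per bin
    sf = np.exp(np.median(np.log(rows) - loggeomeans[:, None], axis=0))
    return 1.0 / sf                              # inverted, for bamCoverage compatibility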
deeptools.countReadsPerBin.remove_row_of_zeros(matrix)[source]

deeptools.getFragmentAndReadSize module

deeptools.getFragmentAndReadSize.getFragmentLength_worker(chrom, start, end, bamFile, distanceBetweenBins)[source]

Queries the reads at the given region for the distance between reads and the read length

Parameters

chrom : str

chromosome name

start : int

region start

end : int

region end

bamFile : str

BAM file name

distanceBetweenBins : int

the number of bases at the end of each bin to ignore

Returns

np.array

an np.array whose first column is the fragment length and whose second column is the read length

deeptools.getFragmentAndReadSize.getFragmentLength_wrapper(args)[source]
deeptools.getFragmentAndReadSize.get_read_and_fragment_length(bamFile, return_lengths=False, blackListFileName=None, binSize=50000, distanceBetweenBins=1000000, numberOfProcessors=None, verbose=False)[source]

Estimates the fragment length and read length through sampling

bamFile : str

BAM file name

return_lengths : bool
numberOfProcessors : int
verbose : bool
binSize : int
distanceBetweenBins : int

d : dict

tuple of two dictionaries, one for the fragment length and the other for the read length. The dictionaries summarise the mean, median, etc. values

deeptools.getRatio module

deeptools.getRatio.compute_ratio(value1, value2, args)[source]
deeptools.getRatio.getRatio(tileCoverage, args)[source]

The mapreduce method calls this function for each tile. The parameters (args) are fixed in the main method.

>>> funcArgs= {'valueType': 'ratio', 'scaleFactors': (1,1), 'pseudocount': [1, 1]}
>>> getRatio([9, 19], funcArgs)
0.5
>>> getRatio([0, 0], funcArgs)
1.0
>>> getRatio([np.nan, np.nan], funcArgs)
nan
>>> getRatio([np.nan, 1.0], funcArgs)
nan
>>> funcArgs['valueType'] ='subtract'
>>> getRatio([20, 10], funcArgs)
10
>>> funcArgs['scaleFactors'] = (1, 0.5)
>>> getRatio([10, 20], funcArgs)
0.0

The reciprocal ratio of a and b is a/b if a/b > 1, else -1 * b/a.

>>> funcArgs['valueType'] = 'reciprocal_ratio'
>>> funcArgs['scaleFactors'] = (1, 1)
>>> funcArgs['pseudocount'] = [0, 0]
>>> getRatio([2, 1], funcArgs)
2.0
>>> getRatio([1, 2], funcArgs)
-2.0
>>> getRatio([1, 1], funcArgs)
1.0

deeptools.getScorePerBigWigBin module

class deeptools.getScorePerBigWigBin.Tester[source]

Bases: object

deeptools.getScorePerBigWigBin.countFragmentsInRegions_worker(chrom, start, end, bigWigFiles, stepSize, binLength, save_data, bedRegions=None)[source]

returns the average score in each bigwig file at each ‘stepSize’ position within the interval start, end for a ‘binLength’ window. Because the idea is to sample counts at different window positions, the bins are equally spaced and not adjacent.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Test dataset with two samples covering 200 bp.

>>> test = Tester()

Fragment coverage.

>>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 50, 25, False)[0])
array([[1., 1., 2., 2.],
       [1., 1., 1., 3.]])

>>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 200, 200, False)[0])
array([[1.5],
       [1.5]])

BED regions:

>>> bedRegions = [[test.chrom, [(45, 55)]], [test.chrom, [(95, 105)]], [test.chrom, [(145, 155)]]]
>>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 200, 200, False,
... bedRegions=bedRegions)[0])
array([[1. , 1.5, 2. ],
       [1. , 1. , 2. ]])

deeptools.getScorePerBigWigBin.countReadsInRegions_wrapper(args)[source]
deeptools.getScorePerBigWigBin.getChromSizes(bigwigFilesList)[source]

Get chromosome sizes from bigWig file with pyBigWig

Test dataset with two samples covering 200 bp.

>>> test = Tester()

Chromosome name(s) and size(s).

>>> assert getChromSizes([test.bwFile1, test.bwFile2]) == ([('3R', 200)], set([]))

deeptools.getScorePerBigWigBin.getScorePerBin(bigWigFiles, binLength, numberOfProcessors=1, verbose=False, region=None, bedFile=None, blackListFileName=None, stepSize=None, chrsToSkip=[], out_file_for_raw_data=None, allArgs=None)[source]

This function returns a matrix containing scores (median) for the coverage of fragments within a region. Each row corresponds to a sampled region. Likewise, each column corresponds to a bigwig file.

Test dataset with two samples covering 200 bp.

>>> test = Tester()
>>> np.transpose(getScorePerBin([test.bwFile1, test.bwFile2], 50, 3))
array([[1., 1., 2., 2.],
       [1., 1., 1., 3.]])

deeptools.heatmapper module

deeptools.heatmapper.chopRegions(exonsInput, left=0, right=0)[source]

exons is a list of (start, end) tuples. The goal is to chop these into separate lists of tuples, to take care of unscaled regions. “left” and “right” denote regions of a given size to exclude from the normal binning process (unscaled regions).

This outputs three lists of (start, end) tuples:

leftBins: 5’ unscaled regions
bodyBins: body bins for scaling
rightBins: 3’ unscaled regions

In addition, two integers are returned:

padLeft: number of bases of padding on the left (due to not being able to fulfill “left”)
padRight: as above, but on the right side

deeptools.heatmapper.chopRegionsFromMiddle(exonsInput, left=0, right=0)[source]

Like chopRegions(), above, but returns two lists of tuples on each side of the center point of the exons.

The steps are as follows:

  1. Find the center point of the set of exons (e.g., [(0, 200), (300, 400), (800, 900)] would be centered at 200).

  • If a given exon spans the center point then the exon is split.

  2. The given number of bases at the end of the left-of-center list are extracted.

  • If the set of exons doesn’t contain enough bases, then padLeft is incremented accordingly.

  3. As above, but for the right-of-center list.

  4. A tuple of (#2, #3, padding on the left, and padding on the right) is returned.

deeptools.heatmapper.computeSilhouetteScore(d, idx, labels)[source]

Given a square distance matrix with NaN diagonals, compute the silhouette score of a given row (idx). Each row should have an associated label (labels).

deeptools.heatmapper.compute_sub_matrix_wrapper(args)[source]
class deeptools.heatmapper.heatmapper[source]

Bases: object

Class to handle the reading and plotting of matrices.

static change_chrom_names(chrom)[source]

Changes UCSC chromosome names to ensembl chromosome names and vice versa.

computeMatrix(score_file_list, regions_file, parameters, blackListFileName=None, verbose=False, allArgs=None)[source]

Splits across multiple cores the computation of the scores per bin for each region (region groups are delimited by a hash ‘#’ in the regions (BED/GFF) file).

static compute_sub_matrix_worker(self, chrom, start, end, score_file_list, parameters, regions)[source]

Returns

numpy matrix

A numpy matrix that contains, in each row, the values found for each of the given regions

static coverage_from_array(valuesArray, zones, binSize, avgType)[source]
static coverage_from_big_wig(bigwig, chrom, zones, binSize, avgType, nansAsZeros=False, verbose=True)[source]

uses pyBigWig to query a region defined by chrom and zones. The output is an array that contains the bigwig value per base pair. The summary over bins is done in a later step when coverage_from_array is called. This method is more reliable than querying the bins directly from the bigwig, although the latter should be more efficient.

By default, any region produces a result, even if no chromosome match is found in the bigwig file. In other words, no regions are skipped.

zones: array as follows:

zone0: region before the region start
zone1: 5’ unscaled region (if present)
zone2: the body of the region (not always present)
zone3: 3’ unscaled region (if present)
zone4: the region from the end of the region downstream

each zone is a tuple containing start, end, and number of bins

This is useful if several matrices are to be merged or if the sorted BED output of one computeMatrix operation needs to be used for other cases.
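The per-base querying can be sketched with pyBigWig directly (file name and region are placeholders; this is not the method itself):

import numpy as np
import pyBigWig

bw = pyBigWig.open("scores.bw")
# values() returns one value per base pair, with NaN where there is no data.
per_base = np.array(bw.values("chr1", 1000, 2000), dtype=float)
per_base[np.isnan(per_base)] = 0.0  # the nansAsZeros behaviour
bw.close()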

getTicks(idx)[source]

This is essentially a wrapper around getProfileTicks to accommodate the fact that each column has its own ticks.

get_individual_matrices(matrix)[source]

In case multiple matrices are saved one after the other, this method splits them apart. Returns a list containing the matrices.

get_num_individual_matrix_cols()[source]

returns the number of columns that each matrix should have. This is done because the final matrix that is plotted can be composed of smaller matrices that are merged one after the other.

static matrix_avg(matrix, avgType='mean')[source]
matrix_from_dict(matrixDict, regionsDict, parameters)[source]
static my_average(valuesArray, avgType='mean')[source]

computes the mean, median, etc but only for those values that are not Nan

read_matrix_file(matrix_file)[source]
save_BED(file_handle)[source]
save_matrix(file_name)[source]

saves the data required to reconstruct the matrix. The format is: a header containing the parameters used to create the matrix, encoded as @key:value key2:value2 etc… The rest of the file has the same first six columns as a BED file: chromosome name, start, end, name, score and strand, all separated by tabs. After these columns the matrix values are appended, separated by tabs. Groups are separated by adding a line starting with a hash (#) and followed by the group name.

The file is gzipped.
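A sketch of reading such a file back, following the layout just described (the file name is a placeholder):

import gzip

with gzip.open("matrix.gz", "rt") as fh:           # save_matrix output is gzipped
    header = fh.readline().rstrip()                # "@key:value key2:value2 ..." parameter line
    for line in fh:
        if line.startswith("#"):                   # group separator followed by the group name
            continue
        fields = line.rstrip("\n").split("\t")
        bed_part, values = fields[:6], fields[6:]  # BED columns, then the matrix values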

save_matrix_values(file_name)[source]
save_tabulated_values(file_handle, reference_point_label='TSS', start_label='TSS', end_label='TES', averagetype='mean')[source]

Saves the values averaged by column using the given average type

Args:

file_handle: file name to save the file
reference_point_label: name of the reference point label
start_label: name of the start label
end_label: name of the end label
averagetype: average type (e.g. mean, median, std)

deeptools.heatmapper.trimZones(zones, maxLength, binSize, padRight)[source]

Given a (variable length) list of lists of (start, end) tuples, trim/remove any tuple that extends past maxLength (e.g., the end of a chromosome)

Returns the trimmed zones and padding

deeptools.heatmapper_utilities module

deeptools.heatmapper_utilities.getProfileTicks(hm, referencePointLabel, startLabel, endLabel, idx)[source]

returns the position and labelling of the xticks that correspond to the heatmap

As of deepTools 3, the various parameters can be lists, in which case we then need to index things (the idx parameter)

As of matplotlib 3 the ticks in the heatmap need to have 0.5 added to them.

As of matplotlib 3.1 there is no longer padding added to all ticks. Reference point ticks will be adjusted by width/2 or width for spacing and the last half of scaled ticks will be shifted by 1 bin so the ticks are at the beginning of bins.

deeptools.heatmapper_utilities.plot_single(ax, ma, average_type, color, label, plot_type='lines')[source]

Adds a line to the plot in the given ax using the specified method

Parameters

ax : matplotlib axis

matplotlib axis

ma : numpy array

The data in this matrix is summarized according to the average_type argument.

average_type : str

string values are sum, mean, median, min, max, std

color : str

a valid color: either an HTML color name, hex (e.g. #002233), an RGB + alpha tuple or list, or an RGB tuple or list

label : str

label

plot_type: str

type of plot. Either ‘se’ for standard error, ‘std’ for standard deviation, ‘overlapped_lines’ to plot each line of the matrix, ‘fill’ to plot the area between the x axis and the value, or any other string to just plot the average line.

Returns

ax

matplotlib axis

Examples

>>> import matplotlib.pyplot as plt
>>> import os
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> matrix = np.array([[1,2,3],
...                    [4,5,6],
...                    [7,8,9]])
>>> ax = plot_single(ax, matrix -2, 'mean', color=[0.6, 0.8, 0.9], label='fill light blue', plot_type='fill')
>>> ax = plot_single(ax, matrix, 'mean', color='blue', label='red')
>>> ax = plot_single(ax, matrix + 5, 'mean', color='red', label='red', plot_type='std')
>>> ax = plot_single(ax, matrix + 10, 'mean', color='#cccccc', label='gray se', plot_type='se')
>>> ax = plot_single(ax, matrix + 20, 'mean', color=(0.9, 0.5, 0.9), label='violet', plot_type='std')
>>> ax = plot_single(ax, matrix + 30, 'mean', color=(0.9, 0.5, 0.9, 0.5), label='violet with alpha', plot_type='std')
>>> leg = ax.legend()
>>> plt.savefig("/tmp/test.pdf")
>>> plt.close()
>>> fig = plt.figure()
>>> os.remove("/tmp/test.pdf")
deeptools.heatmapper_utilities.plotly_single(ma, average_type, color, label, plot_type='line')[source]

A plotly version of plot_single. Returns a list of traces

deeptools.mapReduce module

deeptools.mapReduce.blSubtract(t, chrom, chunk)[source]

If a genomic region overlaps with a blacklisted region, then subtract that region out

returns a list of lists

deeptools.mapReduce.getUserRegion(chrom_sizes, region_string, max_chunk_size=1000000.0)[source]

Verifies that a region argument given by the user is valid. The format of the region_string is chrom:start:end:tileSize, where start, end and tileSize are optional.

Parameters:
  • chrom_sizes – dictionary of chromosome/scaffold size. Key=chromosome name

  • region_string – a string of the form chr:start:end

  • max_chunk_size – upper limit for the chunk size

Returns:

a tuple of ([(chromosome, size)] for the region, region start, region end, chunk size)

#>>> data = getUserRegion({'chr2': 1000}, "chr1:10:10")
#Traceback (most recent call last):
#    ...
#NameError: Unknown chromosome: chr1
#Known chromosomes are: ['chr2']

If the region end is bigger than the chromosome size, this value is used instead.

>>> getUserRegion({'chr2': 1000}, "chr2:10:1001")
([('chr2', 1000)], 10, 1000, 990)

Test chunk and region size reduction to match tile size.

>>> getUserRegion({'chr2': 200000}, "chr2:10:123344:3")
([('chr2', 123344)], 9, 123345, 123336)

Test chromosome name mismatch.

>>> getUserRegion({'2': 200000}, "chr2:10:123344:3")
([('2', 123344)], 9, 123345, 123336)
>>> getUserRegion({'chrM': 200000}, "MT:10:123344:3")
([('chrM', 123344)], 9, 123345, 123336)

deeptools.mapReduce.mapReduce(staticArgs, func, chromSize, genomeChunkLength=None, region=None, bedFile=None, blackListFileName=None, numberOfProcessors=4, verbose=False, includeLabels=False, keepExons=False, transcriptID='transcriptID', exonID='exonID', transcript_id_designator='transcript_id', self_=None)[source]

Split the genome into parts that are sent to workers using a defined number of processors. Results are collected and returned.

For each genomic region the given ‘func’ is called using the following parameters:

chrom, start, end, staticArgs

The args are static, picklable variables that need to be sent to workers.

The genome chunk length corresponds to a fraction of the genome, in bp, that is sent to each of the workers for processing.

Depending on the type of process, longer or shorter regions may be preferred.

Parameters:
  • chromSize – A list of (chromosome name, length) tuples

  • region – The format is chr:start:end:tileSize (see function getUserRegion)

  • staticArgs – tuple of arguments that are sent to the given ‘func’

  • func – function to call. The function is called using the following parameters (chrom, start, end, staticArgs)

  • bedFile – If a bed file is given, the args to the called func are extended to include a list of bed-defined regions.

  • blackListFileName – A list of regions to exclude from all computations. Note that this has genomeChunkLength resolution…

  • self_ – In case mapReduce should make a call to an object, the self variable has to be passed.

  • includeLabels – Pass group and transcript labels into the calling function. These are added to the static args (groupLabel and transcriptName).

If “includeLabels” is true, a tuple of (results, labels) is returned

deeptools.sumCoveragePerBin module

class deeptools.sumCoveragePerBin.SumCoveragePerBin(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, out_file_for_raw_data=None, bed_and_bin=False, statsList=[], mappedList=[])[source]

Bases: CountReadsPerBin

This is an extension of CountReadsPerBin for use with plotFingerprint. There, we need to sum the per-base coverage.

get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam
>>> c = SumCoveragePerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested. Note that reads are NOT extended, due to there being a 0 length input list of BAM files!

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])

In the following case the read length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([2., 4., 4.])
class deeptools.sumCoveragePerBin.Tester[source]

Bases: object

deeptools.utilities module

deeptools.utilities.bam_blacklisted_reads(bam_handle, chroms_to_ignore, blackListFileName=None, numberOfProcessors=1)[source]
deeptools.utilities.bam_blacklisted_worker(args)[source]
deeptools.utilities.bam_total_reads(bam_handle, chroms_to_ignore, stats)[source]

Count the total number of mapped reads in a BAM file, filtering out the chromosomes given in the chroms_to_ignore list

deeptools.utilities.convertCmap(c, vmin=0, vmax=1)[source]
deeptools.utilities.copyFileInMemory(filePath, suffix='')[source]

copies a file into the special /dev/shm device, which moves the file into memory. This process speeds up multiprocessor access to such files

deeptools.utilities.getCommonChrNames(bamFileHandles, verbose=True)[source]

Compares the names and lengths of a list of bam file handles. The input is a list of pysam file handles.

The function returns a tuple containing the common chromosome names and the common chromosome lengths.

Hopefully, only _random and chrM are not common.

deeptools.utilities.getGC_content(tb, chrom, fragStart, fragEnd, fraction=True)[source]
deeptools.utilities.getTLen(read, notAbs=False)[source]

Get the observed template length of a read. For a paired-end read, this is normally just the TLEN field. For SE reads this is the observed coverage of the genome (excluding splicing).

deeptools.utilities.getTempFileName(suffix='')[source]

Return a temporary file name. The calling function is responsible for deleting this upon completion.

deeptools.utilities.gtfOptions(allArgs=None)[source]

This is used in a couple of places to set up arguments to mapReduce

deeptools.utilities.mungeChromosome(chrom, chromList)[source]

A generic chromosome munging function. “chrom” is munged by adding/removing “chr” such that it appears in chromList

On error, None is returned, but a common chromosome list should be used beforehand to avoid this possibility
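Based on the description above, the munging amounts to toggling the “chr” prefix, roughly as in this sketch (an assumption, not the function body):

def munge_chromosome(chrom, chromList):
    if chrom in chromList:
        return chrom
    # toggle the "chr" prefix and retry
    alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    return alt if alt in chromList else None  # None signals an error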

deeptools.utilities.smartLabel(label)[source]

Given a file name, likely with a path, return the file name without the path and with the file extension removed. Thus, something like /path/to/some.special.file should return some.special, since only the first extension (if present) should be stripped.
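A sketch of the described behaviour (not the function body):

import os

def smart_label(label):
    base = os.path.basename(label)       # drop the path
    root, _ext = os.path.splitext(base)  # strip one extension, if present
    return root

assert smart_label("/path/to/some.special.file") == "some.special"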

deeptools.utilities.smartLabels(labels)[source]
deeptools.utilities.tbitToBamChrName(tbitNames, bamNames)[source]

checks if the chromosome names from the two-bit and bam file coincide. In case they do not coincide, a fix is tried. If successful, then a mapping table is returned. tbitNames and bamNames should be lists

deeptools.utilities.toBytes(s)[source]

Like toString, but for functions requiring bytes in python3

deeptools.utilities.toString(s)[source]

This takes care of python2/3 differences
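A minimal sketch of the assumed behaviour:

def to_string(s):
    # bytes (python3) are decoded; anything else is returned unchanged
    return s.decode("ascii") if isinstance(s, bytes) else s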

deeptools.writeBedGraph module

class deeptools.writeBedGraph.WriteBedGraph(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, out_file_for_raw_data=None, bed_and_bin=False, statsList=[], mappedList=[])[source]

Bases: CountReadsPerBin

Reads coverage from bam files and writes a bedgraph or bigwig file

Extends the CountReadsPerBin object such that the coverage of bam files is written to multiple bedgraph files at once.

The bedgraph files are later merged into one and converted into a bigwig file if necessary.

The constructor arguments are the same as for CountReadsPerBin. However, when calling the run method, the following parameters have to be passed

Examples

Given the following distribution of reads that cover 200 bp on a chromosome named ‘3R’:

  0                              100                           200
  |------------------------------------------------------------|
A                                ===============
                                                ===============


B                 ===============               ===============
                                 ===============
                                                ===============
>>> import tempfile
>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> outFile = tempfile.NamedTemporaryFile()
>>> bam_file = test_path +  "testA.bam"

For this example a simple scaling function is used. This function takes the coverage found at each region and multiplies it by the scaling factor. In this case the scaling factor is 1.5.

>>> function_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.5}

Restrict process to a region between positions 0 and 200 of chromosome 3R

>>> region = '3R:0:200'

Set up such that coverage is computed for consecutive bins of length 25 bp.

>>> bin_length = 25
>>> step_size = 25

>>> num_sample_sites = 0 #overruled by step_size
>>> c = WriteBedGraph([bam_file], binLength=bin_length, region=region, stepSize=step_size)
>>> c.run(function_to_call, funcArgs, outFile.name)
>>> f = open(outFile.name, 'r')
>>> f.readlines()
['3R\t0\t100\t0\n', '3R\t100\t200\t1.5\n']
>>> f.close()
>>> outFile.close()
run(func_to_call, func_args, out_file_name, blackListFileName=None, format='bedgraph', smoothLength=0)[source]

Given a list of bamfiles, a function and function arguments, this method writes a bedgraph (or bigwig) file for a partition of the genome into tiles of a given size, with a value for each tile that corresponds to the given function and is related to the coverage underlying the tile.

Parameters

func_to_call : str

function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.

func_args : dict

dict of arguments to pass to func. E.g. {‘scaleFactor’:1.0}

out_file_name : str

name of the file to save the resulting data.

smoothLength : int

Distance in bp for smoothing the coverage per tile.

writeBedGraph_worker(chrom, start, end, func_to_call, func_args, bed_regions_list=None)[source]

Writes a bedgraph based on the read coverage found on bamFiles

The given func is called to compute the desired bedgraph value using the funcArgs

Parameters

chrom : str

Chrom name

start : int

start coordinate

end : int

end coordinate

func_to_call : str

function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.

func_args : dict

dict of arguments to pass to func.

smoothLength : int

Distance in bp for smoothing the coverage per tile.

bed_regions_list: list

List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If no bed file was passed to the object constructor, this list is empty.

Returns

A list of [chromosome, start, end, temporary file], where the temporary file contains the bedgraph results for the region queried.

Examples

>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bamFile1 = test_path +  "testA.bam"
>>> bin_length = 50
>>> number_of_samples = 0 # overruled by step_size
>>> func_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.0}
>>> c = WriteBedGraph([bamFile1], bin_length, number_of_samples, stepSize=50)
>>> tempFile = c.writeBedGraph_worker( '3R', 0, 200, func_to_call, funcArgs)
>>> f = open(tempFile[3], 'r')
>>> f.readlines()
['3R\t0\t100\t0\n', '3R\t100\t200\t1\n']
>>> f.close()
>>> os.remove(tempFile[3])
deeptools.writeBedGraph.bedGraphToBigWig(chromSizes, bedGraphFiles, bigWigPath)[source]

Takes a sorted list of bedgraph files and writes them to a single bigWig file using pyBigWig. The order of bedGraphFiles must match that of chromSizes!
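The conversion can be sketched with pyBigWig (placeholder file names; a simplified single-file illustration, not the deepTools function):

import pyBigWig

chrom_sizes = [("3R", 200)]            # must cover every chromosome that is written
bw = pyBigWig.open("out.bw", "w")
bw.addHeader(chrom_sizes)
with open("in.bedgraph") as f:         # entries must be position-sorted
    for line in f:
        chrom, start, end, value = line.strip().split("\t")
        bw.addEntries([chrom], [int(start)], ends=[int(end)], values=[float(value)])
bw.close()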

deeptools.writeBedGraph.getGenomeChunkLength(bamHandles, tile_size, mappedList)[source]

Tries to estimate the length of the genome sent to the workers based on the density of reads per bam file and the number of bam files.

The chunk length should be a multiple of the tileSize

deeptools.writeBedGraph.ratio(tile_coverage, args)[source]

tileCoverage should be a list of two elements

deeptools.writeBedGraph.scaleCoverage(tile_coverage, args)[source]

tileCoverage should be a list with only one element

deeptools.writeBedGraph.writeBedGraph_wrapper(args)[source]

Passes the arguments to writeBedGraph_worker. This is a step required given the constraints of the multiprocessing module. The args var contains as first element the ‘self’ value from the WriteBedGraph object

deeptools.writeBedGraph_bam_and_bw module

deeptools.writeBedGraph_bam_and_bw.getCoverageFromBigwig(bigwigHandle, chrom, start, end, tileSize, missingDataAsZero=False)[source]
deeptools.writeBedGraph_bam_and_bw.writeBedGraph(bamOrBwFileList, outputFileName, fragmentLength, func, funcArgs, tileSize=25, region=None, blackListFileName=None, numberOfProcessors=1, format='bedgraph', extendPairedEnds=True, missingDataAsZero=False, skipZeroOverZero=False, smoothLength=0, fixedStep=False, verbose=False)[source]

Given a list of bamfiles, a function and function arguments, this method writes a bedgraph (or bigwig) file for a partition of the genome into tiles of a given size, with a value for each tile that corresponds to the given function and is related to the coverage underlying the tile.

deeptools.writeBedGraph_bam_and_bw.writeBedGraph_worker(chrom, start, end, tileSize, defaultFragmentLength, bamOrBwFileList, func, funcArgs, extendPairedEnds=True, smoothLength=0, skipZeroOverZero=False, missingDataAsZero=False, fixedStep=False)[source]

Writes a bedgraph based on a number of bam files.

The given func is called to compute the desired bedgraph value using the funcArgs

tileSize

deeptools.writeBedGraph_bam_and_bw.writeBedGraph_wrapper(args)[source]
