deeptools package modules¶

deeptools.SES_scaleFactor module¶

class deeptools.SES_scaleFactor.Tester[source]¶: Bases: object

deeptools.SES_scaleFactor.estimateScaleFactor(bamFilesList, binLength, numberOfSamples, normalizationLength, avg_method='median', blackListFileName=None, numberOfProcessors=1, verbose=False, chrsToSkip=[])[source]¶

Subdivides the genome into chunks to be analyzed in parallel using several processors. The code handles the creation of workers that compute fragment counts (coverage) for different regions and then collect and integrates the results.

Parameters:

bamFilesList : list

list of bam files to normalize

binLength : int

the window size in bp, where reads are going to be counted.

numberOfSamples : int

number of sites to sample from the genome. For more info see the documentation of the CountReadsPerBin class

normalizationLength : int

length, in bp, to normalize the data. For a value of 1, on average 1 read per base pair is found

avg_method : str

defines how the different values are to be summarized. The options are ‘mean’ and ‘median’

chrsToSkip : list

name of the chromosomes to be excluded from the scale estimation. Usually the chrX is included.

blackListFileName : str

BED file containing blacklisted regions

Returns:

dict

Dictionary with the following keys::

‘size_factors’ ‘size_factors_based_on_mapped_reads’ ‘size_factors_SES’ ‘size_factors_based_on_mean’ ‘size_factors_based_on_median’ ‘mean’ ‘meanSES’ ‘median’ ‘reads_per_bin’ ‘std’ ‘sites_sampled’

Examples

>>> test = Tester()
>>> bin_length = 50
>>> num_samples = 4
>>> _dict = estimateScaleFactor([test.bamFile1, test.bamFile2], bin_length, num_samples,  1)
>>> _dict['size_factors']
array([ 1. ,  0.5])
>>> _dict['size_factors_based_on_mean']
array([ 1. ,  0.5])

deeptools.bamHandler module¶

deeptools.bamHandler.openBam(bamFile)[source]¶

deeptools.correctReadCounts module¶

deeptools.correctReadCounts.computeCorrectedReadcounts(tileCoverage, args)[source]¶

This function is called by the writeBedGraph workers for every tile in the genome that is considered

It computes a pvalue based on an expected lambda comming from the correction of treatment when the input is considered.

deeptools.correctReadCounts.computeLambda(tileCoverage, args)[source]¶: This function is called by the writeBedGraph workers for every tile in the genome that is considered

deeptools.correctReadCounts.computePvalue(tileCoverage, args)[source]¶

This function is called by the writeBedGraph workers for every tile in the genome that is considered

It computes a pvalue based on an expected lambda comming from the correction of treatment when the input is considered.

deeptools.correctReadCounts.controlLambda(tileCoverage, args)[source]¶

deeptools.correctReadCounts.correctReadCounts(bamFilesList, binLength, numberOfSamples, defaultFragmentLength, outFileName, outFileFormat, outFileNameCorr=None, region=None, extendPairedEnds=True, numberOfProcessors=1, Nsigmas=2, maxSignalRatio=10, blackListFileName=None, verbose=False)[source]¶

deeptools.correlation module¶

class deeptools.correlation.Correlation(matrix_file, corr_method=None, labels=None, remove_outliers=False, skip_zeros=False, log1p=False)[source]¶

class to work with matrices having sample data to compute correlations, plot them and make scatter plots

compute_correlation()[source]¶

computes spearman or pearson correlation for the samples in the matrix

The matrix should contain the values of each sample per column that’s why the transpose is used.

>>> matrix = np.array([[1, 2, 3, np.nan],
...                    [1, 2, 3, 4],
...                    [6, 4, 3, 1]]).T
>>> np.savez_compressed("/tmp/test_matrix.npz", matrix=matrix, labels=['a', 'b', 'c'])

>>> c = Correlation("/tmp/test_matrix.npz", corr_method='pearson')

the results should be as in R

>>> c.compute_correlation().filled(np.nan)
array([[ 1.        ,  1.        , -0.98198051],
       [ 1.        ,  1.        , -0.98198051],
       [-0.98198051, -0.98198051,  1.        ]])
>>> c.corr_method = 'spearman'
>>> c.corr_matrix = None
>>> c.compute_correlation()
array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])

static get_outlier_indices(data, max_deviation=200)[source]¶

The method is based on the median absolute deviation. See Boris Iglewicz and David Hoaglin (1993), “Volume 16: How to Detect and Handle Outliers”, The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

returns the list, without the outliers

The max_deviation=200 is like selecting a z-score larger than 200, just that it is based on the median and the median absolute deviation instead of the mean and the standard deviation.

load_matrix(matrix_file)[source]¶: loads a matrix file saved using the numpy savez method. Two keys are expected: ‘matrix’ and ‘labels’. The matrix should contain one sample per row

plot_correlation(plot_fiilename, plot_title='', vmax=None, vmin=None, colormap='jet', image_format=None, plot_numbers=False)[source]¶: plots a correlation using a symmetric heatmap

plot_pca(plot_filename, plot_title='', image_format=None, log1p=False)[source]¶: Plot the PCA of a matrix

plot_scatter(plot_fiilename, plot_title='', image_format=None, log1p=False)[source]¶: Plot the scatter plots of a matrix in which each row is a sample

remove_outliers(verbose=True)[source]¶

get the outliers per column using the median absolute deviation method

Returns the filtered matrix

remove_rows_of_zeros()[source]¶

save_corr_matrix(file_handle)[source]¶: saves the correlation matrix

deeptools.correlation_heatmap module¶

deeptools.correlation_heatmap.plot_correlation(corr_matrix, labels, plotFileName, vmax=None, vmin=None, colormap='jet', image_format=None, plot_numbers=False, plot_title='')[source]¶

deeptools.countReadsPerBin module¶

class deeptools.countReadsPerBin.CountReadsPerBin(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, smoothLength=0, out_file_for_raw_data=None)[source]¶

Bases: object

Collects coverage over multiple bam files using multiprocessing

This function collects read counts (coverage) from several bam files and returns an numpy array with the results. This class uses multiprocessing to compute the coverage.

Parameters:

bamFilesList : list

List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]

binLength : int

Length of the window/bin. This value is overruled by bedFile if present.

numberOfSamples : int

Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize in case such value is present and by bedFile in which case the number of samples and bins are defined in the bed file

numberOfProcessors : int

Number of processors to use. Default is 4

verbose : bool

Output messages. Default: False

region : str

Region to limit the computation in the form chrom:start:end.

bedFile : file_handle

File handle of a bed file containing the regions for wich to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.

blackListFileName : str

A string containing a BED file with blacklist regions.

extendReads : bool, int

Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is ‘int’, then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False

minMappingQuality : int

Reads of a mapping quality less than the give value are not considered. Default: None

ignoreDuplicates : bool

Whether read duplicates (same start, end position. If paired-end, same start-end for mates) are to be excluded. Default: false

chrToSkip: list

List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. ‘random’ or ‘Het’).

stepSize : int

the positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If seepSize is smaller than the binLength, then teh bins will overlap.

center_read : bool

Determines if reads should be centered with respect to the fragment length.

samFlag_include : int

Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.

samFlag_exclude : int

Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.

zerosToNans : bool

If true, zero values encountered are transformed to Nans. Default false.

out_file_for_raw_data : str

File name to save the raw counts computed

Returns:

numpy array

Each row correspond to each bin/bed region and each column correspond to each of the bamFiles.

Examples

The test data contains reads for 200 bp.

>>> test = Tester()

The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[ 0.,  0.,  1.,  1.],
       [ 0.,  1.,  1.,  2.]])

count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]¶

Counts the reads in each bam file at each ‘stepSize’ position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Parameters:

chrom : str

Chrom name

start : int

start coordinate

end : int

end coordinate

bed_regions_list: list

List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns:

numpy array

The result is a numpy array that as rows each bin and as columns each bam file.

Examples

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)

The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[ 0.,  0.],
       [ 0.,  1.],
       [ 1.,  1.],
       [ 1.,  2.]])

getReadLength(read)[source]¶

getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]¶

Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition

---------------|==================|------------------
           tileStart
      |--------------------------------------|
      |    <--      smoothRange     -->      |
      |
tileStart - (smoothRange-tileSize)/2

Test for a smooth range that spans 3 tiles.

Examples

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)

Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)

Test smooth range truncated on start.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)

Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)

Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)

get_coverage_of_region(bamHandle, chrom, start, end, tileSize, fragmentFromRead_func=None)[source]¶

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam
>>> c = CountReadsPerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... 5000833, 5000835, 1)
array([ 4.,  5.])

In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2', 5000090, 5000110, 10)
array([  1.,  nan])

In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', 148, 154, 2)
array([ 1.,  2.,  2.])

get_fragment_from_read(read)[source]¶

Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.

Parameters:

read : pysam read object

Returns:

list of tuples

[(fragment start, fragment end)]

>>> test = Tester()

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True)

>>> c.defaultFragmentLength=100

>>> c.get_fragment_from_read(test.getRead("paired-forward"))

[(5000000, 5000100)]

>>> c.get_fragment_from_read(test.getRead("paired-reverse"))

[(5000000, 5000100)]

>>> c.defaultFragmentLength = 200

>>> c.get_fragment_from_read(test.getRead("single-forward"))

[(5001491, 5001691)]

>>> c.get_fragment_from_read(test.getRead("single-reverse"))

[(5001536, 5001736)]

>>> c.defaultFragmentLength = 'read length'

>>> c.get_fragment_from_read(test.getRead("single-forward"))

[(5001491, 5001527)]

>>> c.defaultFragmentLength = 'read length'

>>> c.extendReads = False

>>> c.get_fragment_from_read(test.getRead("paired-forward"))

[(5000000, 5000036)]

Tests for read centering.

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)

>>> c.defaultFragmentLength = 100

>>> c.get_fragment_from_read(test.getRead("paired-forward"))

[(5000032, 5000068)]

>>> c.defaultFragmentLength = 200

>>> c.get_fragment_from_read(test.getRead("single-reverse"))

[(5001618, 5001654)]

run()[source]¶

class deeptools.countReadsPerBin.Tester[source]¶

Bases: object

getRead(readType)[source]¶: prepare arguments for test

deeptools.countReadsPerBin.countReadsInRegions_wrapper(args)[source]¶: Passes the arguments to countReadsInRegions_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the countReadsPerBin object

deeptools.countReadsPerBin.remove_row_of_zeros(matrix)[source]¶

deeptools.getFragmentAndReadSize module¶

deeptools.getFragmentAndReadSize.getFragmentLength_worker(chrom, start, end, bamFile, distanceBetweenBins)[source]¶

Queries the reads at the given region for the distance between reads and the read length

Parameters:

chrom : str

chromosome name

start : int

region start

end : int

region end

bamFile : str

BAM file name

distanceBetweenBins : int

the number of bases at the end of each bin to ignore

Returns:

np.array

an np.array, where first column is fragment length, the second is for read length

deeptools.getFragmentAndReadSize.getFragmentLength_wrapper(args)[source]¶

deeptools.getFragmentAndReadSize.get_read_and_fragment_length(bamFile, return_lengths=False, blackListFileName=None, binSize=50000, distanceBetweenBins=1000000, numberOfProcessors=None, verbose=False)[source]¶

Estimates the fragment length and read length through sampling

Parameters:

bamFile : str

BAM file name

return_lengths : bool

numberOfProcessors : int

verbose : bool

binSize : int

distanceBetweenBins : int

Returns:

d : dict

tuple of two dictionaries, one for the fragment length and the other for the read length. The dictionaries summarise the mean, median etc. values

deeptools.getRatio module¶

deeptools.getRatio.compute_ratio(value1, value2, args)[source]¶

deeptools.getRatio.getRatio(tileCoverage, args)[source]¶

The mapreduce method calls this function for each tile. The parameters (args) are fixed in the main method.

>>> funcArgs= {'valueType': 'ratio', 'scaleFactors': (1,1), 'pseudocount': 1}
>>> getRatio([9, 19], funcArgs)
0.5
>>> getRatio([0, 0], funcArgs)
1.0
>>> getRatio([np.nan, np.nan], funcArgs)
nan
>>> getRatio([np.nan, 1.0], funcArgs)
nan
>>> funcArgs['valueType'] ='subtract'
>>> getRatio([20, 10], funcArgs)
10
>>> funcArgs['scaleFactors'] = (1, 0.5)
>>> getRatio([10, 20], funcArgs)
0.0

The reciprocal ratio is of a and b is: is a/b if a/b > 1 else -1* b/a >>> funcArgs[‘valueType’] =’reciprocal_ratio’ >>> funcArgs[‘scaleFactors’] = (1, 1) >>> funcArgs[‘pseudocount’] = 0 >>> getRatio([2, 1], funcArgs) 2.0 >>> getRatio([1, 2], funcArgs) -2.0 >>> getRatio([1, 1], funcArgs) 1.0

deeptools.getScorePerBigWigBin module¶

class deeptools.getScorePerBigWigBin.Tester[source]¶: Bases: object

deeptools.getScorePerBigWigBin.countFragmentsInRegions_worker(chrom, start, end, bigWigFiles, stepSize, binLength, save_data, bedRegions=None)[source]¶

returns the average score in each bigwig file at each ‘stepSize’ position within the interval start, end for a ‘binLength’ window. Because the idea is to get counts for window positions at different positions for sampling the bins are equally spaced and not adjacent.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Test dataset with two samples covering 200 bp. >>> test = Tester()

Fragment coverage. >>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 50, 25, False)[0]) array([[ 1., 1., 2., 2.],

[ 1., 1., 1., 3.]])

>>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200, [test.bwFile1, test.bwFile2], 200, 200, False)[0])
array([[ 1.5],
       [ 1.5]])

BED regions: >>> bedRegions = [(test.chrom, 45, 55), (test.chrom, 95, 105), (test.chrom, 145, 155)] >>> np.transpose(countFragmentsInRegions_worker(test.chrom, 0, 200,[test.bwFile1, test.bwFile2], 200, 200, False, ... bedRegions=bedRegions)[0]) array([[ 1. , 1.5, 2. ],

[ 1. , 1. , 2. ]])

deeptools.getScorePerBigWigBin.countReadsInRegions_wrapper(args)[source]¶

deeptools.getScorePerBigWigBin.getChromSizes(bigwigFilesList)[source]¶

Get chromosome sizes from bigWig file with pyBigWig

Test dataset with two samples covering 200 bp. >>> test = Tester()

Chromosome name(s) and size(s). >>> getChromSizes([test.bwFile1, test.bwFile2]) ([(‘3R’, 200L)], set([]))

deeptools.getScorePerBigWigBin.getScorePerBin(bigWigFiles, binLength, numberOfProcessors=1, verbose=False, region=None, bedFile=None, blackListFileName=None, stepSize=None, chrsToSkip=[], out_file_for_raw_data=None)[source]¶

This function returns a matrix containing scores (median) for the coverage of fragments within a region. Each row corresponds to a sampled region. Likewise, each column corresponds to a bigwig file.

Test dataset with two samples covering 200 bp. >>> test = Tester() >>> np.transpose(getScorePerBin([test.bwFile1, test.bwFile2], 50, 3)) array([[ 1., 1., 2., 2.],

[ 1., 1., 1., 3.]])

deeptools.heatmapper module¶

deeptools.heatmapper.compute_sub_matrix_wrapper(args)[source]¶

class deeptools.heatmapper.heatmapper[source]¶

Bases: object

Class to handle the reading and plotting of matrices.

static change_chrom_names(chrom)[source]¶: Changes UCSC chromosome names to ensembl chromosome names and vice versa.

computeMatrix(score_file_list, regions_file, parameters, blackListFileName=None, verbose=False)[source]¶: Splits into multiple cores the computation of the scores per bin for each region (defined by a hash ‘#’ in the regions (BED/GFF) file.

static compute_sub_matrix_worker(score_file_list, regions, parameters)[source]¶

Parameters:

score_file_list : list

List of strings. Contains the list of bam or bigwig files to be used.

regions : list of dictionaries

Each item in the list is a dictionary containing containing the fields: ‘chrom’, ‘start’, ‘end’, ‘name’ and ‘strand.

parameters : dict

Contains values that specify the length of bins, the number of bp after a reference point etc.

Returns:

numpy matrix

A numpy matrix that contains per each row the values found per each of the regions given

static coverage_from_array(valuesArray, zones, binSize, avgType)[source]¶

static coverage_from_bam(bamfile, chrom, zones, binSize, avgType, verbose=True)[source]¶: currently this method is deactivated because is too slow. It is preferred to create a coverage bigiwig file from the bam file and then run heatmapper.

static coverage_from_big_wig(bigwig, chrom, zones, binSize, avgType, nansAsZeros=False, verbose=True)[source]¶

uses bigwig file reader from pyBigWib to query a region defined by chrom and zones. The output is an array that contains the bigwig value per base pair. The summary over bins is done in a later step when coverage_from_array is called. This method is more reliable than querying the bins directly from the bigwig, which should be more efficient.

By default, any region, even if no chromosome match is found on the bigwig file, produces a result. In other words no regions are skipped.

zones: array as follows zone0: region before the region start,: zone1: 5’ unscaled region (if present) zone2: the body of the region (not always present) zone3: 3’ unscaled region (if present) zone4: the region from the end of the region downstream

each zone is a tuple containing start, end, and number of bins

This is useful if several matrices wants to be merged or if the sorted BED output of one computeMatrix operation needs to be used for other cases

get_individual_matrices(matrix)[source]¶: In case multiple matrices are saved one after the other this method splits them appart. Returns a list containing the matrices

get_num_individual_matrix_cols()[source]¶: returns the number of columns that each matrix should have. This is done because the final matrix that is plotted can be composed of smaller matrices that are merged one after the other.

static get_regions_and_groups(regions_file, onlyMultiplesOf=1, default_group_name='genes', blackListFileName=None, verbose=None)[source]¶

Reads a bed file. In case is hash sign ‘#’ is found in the file, this is considered as a delimiter to split the heatmap into groups

Returns a list of regions with a label index appended to each and a list of labels

static matrix_avg(matrix, avgType='mean')[source]¶

matrix_from_dict(matrixDict, regionsDict, parameters)[source]¶

static my_average(valuesArray, avgType='mean')[source]¶: computes the mean, median, etc but only for those values that are not Nan

read_matrix_file(matrix_file)[source]¶

save_BED(file_handle)[source]¶

save_matrix(file_name)[source]¶

saves the data required to reconstruct the matrix the format is: A header containing the parameters used to create the matrix encoded as: @key:value key2:value2 etc... The rest of the file has the same first 5 columns of a BED file: chromosome name, start, end, name, score and strand, all separated by tabs. After the fifth column the matrix values are appended separated by tabs. Groups are separated by adding a line starting with a hash (#) and followed by the group name.

The file is gzipped.

save_matrix_values(file_name)[source]¶

save_tabulated_values(file_handle, reference_point_label='TSS', start_label='TSS', end_label='TES', averagetype='mean')[source]¶

Saves the values averaged by col using the avg_type given

Args:: file_handle: file name to save the file reference_point_label: Name of the reference point label start_label: Name of the star label end_label: Name of the end label averagetype: average type (e.g. mean, median, std)

deeptools.heatmapper_utilities module¶

deeptools.heatmapper_utilities.getProfileTicks(hm, referencePointLabel, startLabel, endLabel)[source]¶: returns the position and labelling of the xticks that correspond to the heatmap

deeptools.heatmapper_utilities.plot_single(ax, ma, average_type, color, label, plot_type='simple')[source]¶

Adds a line to the plot in the given ax using the specified method

Parameters:

ax : matplotlib axis

matplotlib axis

ma : numpy array

numpy array The data on this matrix is summarized according to the average_type argument.

average_type : str

string values are sum mean median min max std

color : str

a valid color: either a html color name, hex (e.g #002233), RGB + alpha tuple or list or RGB tuple or list

label : str

label

plot_type: str

type of plot. Either ‘se’ for standard error, ‘std’ for standard deviation, ‘overlapped_lines’ to plot each line of the matrix, fill to plot the area between the x axis and the value or None, just to plot the average line.

Returns:

ax

matplotlib axis

Examples

>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> matrix = np.array([[1,2,3],
...                    [4,5,6],
...                    [7,8,9]])
>>> ax = plot_single(ax, matrix -2, 'mean', color=[0.6, 0.8, 0.9], label='fill light blue', plot_type='fill')
>>> ax = plot_single(ax, matrix, 'mean', color='blue', label='red')
>>> ax = plot_single(ax, matrix + 5, 'mean', color='red', label='red', plot_type='std')
>>> ax =plot_single(ax, matrix + 10, 'mean', color='#cccccc', label='gray se', plot_type='se')
>>> ax = plot_single(ax, matrix + 20, 'mean', color=(0.9, 0.5, 0.9), label='violet', plot_type='std')
>>> ax = plot_single(ax, matrix + 30, 'mean', color=(0.9, 0.5, 0.9, 0.5), label='violet with alpha', plot_type='std')
>>> leg = ax.legend()
>>> plt.savefig("/tmp/test.pdf")
>>> plt.close()
>>> fig = plt.figure()

deeptools.mapReduce module¶

deeptools.mapReduce.BED_to_interval_tree(BED_file)[source]¶

Creates an index of intervals, using an interval tree, for each BED entry

Parameters:	BED_file – file handler of a BED file

:return interval tree

deeptools.mapReduce.blOverlap(t, chrom, chunk)[source]¶

Test for an overlap between an IntervalTree and a given genomic chunk.

This attempts to account for differences in chromosome naming.

Returns the overlaps

deeptools.mapReduce.blSubtract(t, chrom, chunk)[source]¶

If a genomic region overlaps with a blacklisted region, then subtract that region out

returns a list of lists

deeptools.mapReduce.getUserRegion(chrom_sizes, region_string, max_chunk_size=1000000.0)[source]¶

Verifies if a given region argument, given by the user is valid. The format of the region_string is chrom:start:end:tileSize where start, end and tileSize are optional.

Parameters:	chrom_sizes – dictionary of chromosome/scaffold size. Key=chromosome name region_string – a string of the form chr:start:end max_chunk_size – upper limit for the chunk size
Returns:	tuple chrom_size for the region start, region end, chunk size

#>>> data = getUserRegion({‘chr2’: 1000}, “chr1:10:10”) #Traceback (most recent call last): # ... #NameError: Unknown chromosome: chr1 #Known chromosomes are: [‘chr2’]

If the region end is biger than the chromosome size, this value is used instead >>> getUserRegion({‘chr2’: 1000}, “chr2:10:1001”) ([(‘chr2’, 1000)], 10, 1000, 990)

Test chunk and regions size reduction to match tile size >>> getUserRegion({‘chr2’: 200000}, “chr2:10:123344:3”) ([(‘chr2’, 123344)], 9, 123345, 123336)

deeptools.mapReduce.mapReduce(staticArgs, func, chromSize, genomeChunkLength=None, region=None, bedFile=None, blackListFileName=None, numberOfProcessors=4, verbose=False, self_=None)[source]¶

Split the genome into parts that are sent to workers using a defined number of procesors. Results are collected and returned.

For each genomic region the given ‘func’ is called using the following parameters:

chrom, start, end, staticArgs

The arg are static, pickable variables that need to be sent to workers.

The genome chunk length corresponds to a fraction of the genome, in bp, that is send to each of the workers for processing.

Depending on the type of process a larger or shorter regions may be preferred

Parameters:

chromSize – A list of duples containing the chromosome name and its length
region – The format is chr:start:end:tileSize (see function getUserRegion)
staticArgs – tuple of arguments that are sent to the given ‘func’
func – function to call. The function is called using the following parameters (chrom, start, end, staticArgs)
bedFile – Is a bed file is given, the args to the func to be called are extended to include a list of bed defined regions.
blackListFileName – A list of regions to exclude from all computations. Note that this has genomeChunkLength resolution...
self – In case mapreduce should make a call to an object the self variable has to be passed.

deeptools.readBed module¶

class deeptools.readBed.BedInterval(field_data_dict, line)[source]¶

Bases: object

simple place holder for the line data of a bed file

class deeptools.readBed.ReadBed(file_handle)[source]¶

Bases: object

Reads a bed file. Based on the number of fields it tries to guess the type of bed file used. Current options are bed3, bed6 and bed12

Examples

bed = readBed(open(“file.bed”, ‘r’)) for interval in bed: ... print bed[‘start’]

get_no_comment_line()[source]¶: Skips comment lines starting with ‘#’ “track” or “browser” in the bed files

get_values_from_line(bed_line)[source]¶: Processes each bed line from a bed file and casts the values

guess_file_type(line_values)[source]¶: try to guess type of bed file by counting the fields

next()[source]¶

Returns:	bedInterval object

deeptools.utilities module¶

deeptools.utilities.copyFileInMemory(filePath, suffix='')[source]¶: copies a file into the special /dev/shm device which moves the file into memory. This process speeds ups the multiprocessor access to such files

deeptools.utilities.getCommonChrNames(bamFileHandlers, verbose=True)[source]¶

Compares the names and lengths of a list of bam file handlers. The input is list of pysam file handlers.

The function returns a duple containing the common chromosome names and the common chromome lengths.

Hopefully, only _random and chrM are not common.

deeptools.utilities.getGC_content(dnaString, as_fraction=True)[source]¶

deeptools.utilities.getTempFileName(suffix='')[source]¶: returns a temporary file name. If the special /dev/shm device is available, the temporary file would be located in that folder. /dv/shm is a folder that resides in memory and which has much faster accession.

deeptools.utilities.tbitToBamChrName(tbitNames, bamNames)[source]¶: checks if the chromosome names from the two-bit and bam file coincide. In case they do not coincide, a fix is tried. If successful, then a mapping table is returned. tbitNames and bamNames should be lists

deeptools.utilities.which(program)[source]¶: method to identify if a program is on the user PATH variable. From: http://stackoverflow.com/questions/377017/test-if-executable-exists-in-python

deeptools.writeBedGraph module¶

class deeptools.writeBedGraph.WriteBedGraph(bamFilesList, binLength=50, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, blackListFileName=None, minMappingQuality=None, ignoreDuplicates=False, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, smoothLength=0, out_file_for_raw_data=None)[source]¶

Bases: deeptools.countReadsPerBin.CountReadsPerBin

Reads bam files coverages and writes a bedgraph or bigwig file

Extends the CountReadsPerBin object such that the coverage of bam files is writen to multiple bedgraph files at once.

The bedgraph files are later merge into one and converted into a bigwig file if necessary.

The constructor arguments are the same as for CountReadsPerBin. However, when calling the run method, the following parameters have to be passed

Examples

Given the following distribution of reads that cover 200 on a chromosome named ‘3R’:

  0                              100                           200
  |------------------------------------------------------------|
A                                ===============
                                                ===============

B                 ===============               ===============
                                 ===============
                                                ===============

>>> import tempfile
>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"

>>> outFile = tempfile.NamedTemporaryFile()
>>> bam_file = test_path +  "testA.bam"

For the example a simple scaling function is going to be used. This function takes the coverage found at each region and multiplies it to the scaling factor. In this case the scaling factor is 1.5

>>> function_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.5}

Restrict process to a region between positions 0 and 200 of chromosome 3R

>>> region = '3R:0:200'

Set up such that coverage is computed for consecutive bins of length 25 bp >>> bin_length = 25 >>> step_size = 25

>>> num_sample_sites = 0 #overruled by step_size
>>> c = WriteBedGraph([bam_file], binLength=bin_length, region=region, stepSize=step_size)
>>> c.run(function_to_call, funcArgs, outFile.name)
>>> open(outFile.name, 'r').readlines()
['3R\t0\t100\t0.00\n', '3R\t100\t200\t1.5\n']
>>> outFile.close()

run(func_to_call, func_args, out_file_name, blackListFileName=None, format='bedgraph', smoothLength=0)[source]¶

Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.

Parameters:

func_to_call : str

function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.

func_args : dict

dict of arguments to pass to func. E.g. {‘scaleFactor’:1.0}

out_file_name : str

name of the file to save the resulting data.

smoothLength : int

Distance in bp for smoothing the coverage per tile.

writeBedGraph_worker(chrom, start, end, func_to_call, func_args, bed_regions_list=None)[source]¶

Writes a bedgraph based on the read coverage found on bamFiles

The given func is called to compute the desired bedgraph value using the funcArgs

Parameters:

chrom : str

Chrom name

start : int

start coordinate

end : int

end coordinate

func_to_call : str

function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.

func_args : dict

dict of arguments to pass to func.

smoothLength : int

Distance in bp for smoothing the coverage per tile.

bed_regions_list: list

List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns:

temporary file with the bedgraph results for the region queried.

Examples

>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bamFile1 = test_path +  "testA.bam"
>>> bin_length = 50
>>> number_of_samples = 0 # overruled by step_size
>>> func_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.0}

>>> c = WriteBedGraph([bamFile1], bin_length, number_of_samples, stepSize=50)
>>> tempFile = c.writeBedGraph_worker( '3R', 0, 200, func_to_call, funcArgs)
>>> open(tempFile, 'r').readlines()
['3R\t0\t100\t0.00\n', '3R\t100\t200\t1.0\n']
>>> os.remove(tempFile)

deeptools.writeBedGraph.bedGraphToBigWig(chromSizes, bedGraphPath, bigWigPath, sort=True)[source]¶: takes a bedgraph file, orders it and converts it to a bigwig file using pyBigWig.

deeptools.writeBedGraph.getGenomeChunkLength(bamHandlers, tile_size)[source]¶

Tries to estimate the length of the genome sent to the workers based on the density of reads per bam file and the number of bam files.

The chunk length should be a multiple of the tileSize

deeptools.writeBedGraph.ratio(tile_coverage, args)[source]¶: tileCoverage should be an list of two elements

deeptools.writeBedGraph.scaleCoverage(tile_coverage, args)[source]¶: tileCoverage should be an list with only one element

deeptools.writeBedGraph.writeBedGraph_wrapper(args)[source]¶: Passes the arguments to writeBedGraph_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the WriteBedGraph object

deeptools.writeBedGraph_bam_and_bw module¶

deeptools.writeBedGraph_bam_and_bw.getCoverageFromBigwig(bigwigHandle, chrom, start, end, tileSize, missingDataAsZero=False)[source]¶

deeptools.writeBedGraph_bam_and_bw.writeBedGraph(bamOrBwFileList, outputFileName, fragmentLength, func, funcArgs, tileSize=25, region=None, blackListFileName=None, numberOfProcessors=None, format='bedgraph', extendPairedEnds=True, missingDataAsZero=False, smoothLength=0, fixed_step=False)[source]¶: Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.

deeptools.writeBedGraph_bam_and_bw.writeBedGraph_worker(chrom, start, end, tileSize, defaultFragmentLength, bamOrBwFileList, func, funcArgs, extendPairedEnds=True, smoothLength=0, missingDataAsZero=False, fixed_step=False)[source]¶

Writes a bedgraph having as base a number of bam files.

The given func is called to compute the desired bedgraph value using the funcArgs

tileSize

deeptools.writeBedGraph_bam_and_bw.writeBedGraph_wrapper(args)[source]¶

Module contents¶

deepTools Galaxy.

code @ github.