oncodrivefml package¶

Subpackages¶

oncodrivefml.executors package

Submodules¶

oncodrivefml.compute module¶

oncodrivefml.compute.gmean(a)[source]¶

oncodrivefml.compute.gmean_weighted(vectors, weights)[source]¶

oncodrivefml.compute.random_scores(num_samples, sampling_size, background, signature, statistic_name)[source]¶

oncodrivefml.config module¶

This module contains code related to the configuration file (see Configuration).

Additionally, it includes other file-related code, especially from bgconfig.

oncodrivefml.config.load_configuration(config_file, override=None)[source]¶

Loads the configuration file and checks its format.

Parameters:	config_file – configuration file path
Returns:	configuration as a `dict`
Return type:	`bgconfig.BGConfig`

oncodrivefml.config.possible_extensions = ['.gz', '.xz', '.bz2', '.tsv', '.txt']¶: Some expected extensions

oncodrivefml.config.remove_extension_and_replace_special_characters(file_path)[source]¶

Modifies the name of a file by removing any extension in possible_extensions and replacing any character in special_characters with -.

Parameters:	file_path – path to a file
Returns:	file name modified
Return type:	str

oncodrivefml.config.special_characters = ['.', '_']¶: Some special characters
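
As a rough illustration of the renaming described for remove_extension_and_replace_special_characters(), the sketch below strips the known extensions and then replaces the special characters. It is not the package's implementation: the function name `strip_and_sanitize` is hypothetical, and both the repeated stripping of stacked extensions (such as `.tsv.gz`) and the use of the base name only are assumptions.

```python
import os

# Values copied from the module-level constants documented above
POSSIBLE_EXTENSIONS = ['.gz', '.xz', '.bz2', '.tsv', '.txt']
SPECIAL_CHARACTERS = ['.', '_']


def strip_and_sanitize(file_path):
    """Hypothetical helper: drop known extensions, then replace special characters."""
    name = os.path.basename(file_path)  # assumption: only the file name is kept
    stripped = True
    while stripped:  # assumption: stacked extensions are all removed
        stripped = False
        for ext in POSSIBLE_EXTENSIONS:
            if name.endswith(ext):
                name = name[:-len(ext)]
                stripped = True
    for char in SPECIAL_CHARACTERS:
        name = name.replace(char, '-')
    return name
```

Under these assumptions, a path such as `/data/my_cohort.v2.tsv.gz` would be turned into `my-cohort-v2`.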

oncodrivefml.indels module¶

This module contains all utilities to process insertions and deletions.

Currently 3 methods have been implemented to compute the impact of the indels.

As a set of substitutions ('max'):

The indel is treated as a set of substitutions. It is used for non-coding regions.

The functional impact of the observed mutation is the maximum of all the substitutions. The background is simulated as substitutions are.

As a stop ('stop'):

The indel is expected to produce a stop in the genome, unless it is a frame-shift indel. It is used for coding regions.

The functional impact is derived from the functional impact of the stops of the gene. The background is also simulated as stops.

class oncodrivefml.indels.Indel(scores, strand)[source]¶

Bases: object

Methods to compute the impact of indels for the observed and the background

Parameters:

scores (`Scores`) – functional impact per position
signature (dict) – see signature
signature_id (str) – classifier for the signatures
method (str) – identifies which method to use to compute the functional impact (see methods)
strand (str) – if the element being analysed has positive, negative or unknown strand (+,-,.)

compute_scores(reference, alternation, initial_position, size)[source]¶

Compute the scores of all substitutions between the reference and altered sequences

Parameters:

reference (str) – sequence
alternation (str) – sequence
initial_position (int) – position where the indel occurs
size (int) – number of positions to look at

Returns:	Scores of the substitution in the indel. `nan` when it is not possible to compute a value.
Return type:	list

get_background_indel_scores_as_stops()[source]¶

Returns:	Values of the stop scores of the gene
Return type:	list

get_background_indel_scores_as_substitutions_without_signature()[source]¶

Return the values of the scores of all possible substitutions

Returns:	list.

get_indel_score_from_stop(mutation)[source]¶

Compute the indel score as a stop

A function is applied to the values of the scores in the gene

Parameters:	mutation (dict) – a mutation object as in here
Returns:	Score value. `nan` if it is not possible to compute it
Return type:	float

get_indel_score_max_of_subs(mutation)[source]¶

Compute the score of an indel by treating each alteration as a substitution.

Parameters:	mutation (dict) – a mutation object as in here
Returns:	Maximum value of all substitutions
Return type:	float

get_mutation_sequences(mutation, size)[source]¶

Get the reference and altered sequence of the indel along the window size

Parameters:

mutation (dict) – a mutation object as in here
size (int) – window length

Returns:	Reference and alternated sequences
Return type:	tuple

static is_frameshift(size)[source]¶

Parameters:	size (int) – length of the indel
Returns:	bool. Whether the size is a multiple of 3 (if the frames have been enabled in the configuration)

is_in_repetitive_region(mutation)[source]¶

Check if an indel falls in a repetitive region

Looking in a window with the indel in the middle, check whether the sequence of the indel appears at least a certain number of times, as specified in the configuration. The length of the window is twice the size of the indel multiplied by that number of times.

Parameters:	mutation (dict) – a mutation object as in here
Returns:	Whether the indel falls in a repetitive region or not
Return type:	bool
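
The window logic described for is_in_repetitive_region() can be sketched as follows. This is only an illustration under stated assumptions, not the method's actual code: the name `looks_repetitive`, the 0-based `position`, the `min_repeats` parameter (standing in for the configured number of times) and the use of non-overlapping counts are all hypothetical.

```python
def looks_repetitive(reference, position, indel_seq, min_repeats):
    """Hypothetical check for an indel sitting in a repetitive stretch.

    reference   -- reference sequence as a plain string (assumption)
    position    -- 0-based index in `reference` where the indel starts (assumption)
    indel_seq   -- inserted or deleted bases
    min_repeats -- number of times the indel sequence must appear
    """
    # The window has twice the size of the indel multiplied by min_repeats,
    # with the indel position in the middle.
    half = len(indel_seq) * min_repeats
    start = max(0, position - half)
    window = reference[start:position + half]
    # str.count counts non-overlapping occurrences (assumption about the counting rule)
    return window.count(indel_seq) >= min_repeats
```

For example, a `CA` indel inside a run of `CACACA...` would be flagged with `min_repeats=3`, whereas the same indel in a sequence containing `CA` only once would not.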

not_found(mutation)[source]¶

class oncodrivefml.indels.StopsScore(funct_type)[source]¶

Bases: object

choose(x)[source]¶

function(x)[source]¶

mean(x)[source]¶

median(x)[source]¶

random(x)[source]¶

oncodrivefml.indels.init_indels_module(indels_config)[source]¶

Initialize the indels module

Parameters:	indels_config (dict) – configuration of how to compute the impact of indels

oncodrivefml.load module¶

This module contains the methods used to load and parse the input files: elements and mutations

elements (dict)

contains all the segments related to one element. The information is taken from the elements_file. Basic structure:

{ element_id:
    [
        {
        'CHROMOSOME': chromosome,
        'START': start_position_of_the_segment,
        'STOP': end_position_of_the_segment,
        'STRAND': strand (+ -> positive | - -> negative)
        'ELEMENT': element_id,
        'SEGMENT': segment_id,
        'SYMBOL': symbol_id
        }
    ]
}

mutations (dict)

contains all the mutations for each element. Most of the information is taken from the mutations_file, except the element_id and the segment, which are taken from the elements. More information is added during the execution. Basic structure:

{ element_id:
    [
        {
        'CHROMOSOME': chromosome,
        'POSITION': position_where_the_mutation_occurs,
        'REF': reference_sequence,
        'ALT': alteration_sequence,
        'SAMPLE': sample_id,
        'ALT_TYPE': type_of_the_mutation,
        'CANCER_TYPE': group to which the mutation belongs,
        'SIGNATURE': a different grouping category,
        }
    ]
}

mutations_data (dict)

contains the mutations dict and some metadata information about the mutations. Currently, the number of substitutions and indels. Basic structure:

{
    'data':
        {
            mutations dict
        },
    'metadata':
        {
            'snp': amount of SNP mutations
            'mnp': amount of MNP mutations
            'mnp_length': total length of the MNP mutations
            'indel': amount of indels
        }
}
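
To make the mutations_data structure above more concrete, here is a minimal sketch that groups a few hand-written mutations by element and fills in the metadata counters. It is not how the module loads its data: the rows are invented, the helper `group_mutations` is hypothetical, and it is assumed that 'ALT_TYPE' takes the values 'snp', 'mnp' and 'indel', that each row carries an 'ELEMENT' key, and that the length of an MNP is the length of its 'REF'.

```python
from collections import defaultdict

# Invented example rows (not real data)
rows = [
    {'CHROMOSOME': '1', 'POSITION': 100, 'REF': 'A', 'ALT': 'T',
     'SAMPLE': 's1', 'ALT_TYPE': 'snp', 'ELEMENT': 'geneA'},
    {'CHROMOSOME': '1', 'POSITION': 150, 'REF': 'AC', 'ALT': 'TG',
     'SAMPLE': 's2', 'ALT_TYPE': 'mnp', 'ELEMENT': 'geneA'},
    {'CHROMOSOME': '2', 'POSITION': 900, 'REF': 'A', 'ALT': '-',
     'SAMPLE': 's1', 'ALT_TYPE': 'indel', 'ELEMENT': 'geneB'},
]


def group_mutations(rows):
    """Hypothetical helper returning a dict shaped like mutations_data."""
    data = defaultdict(list)
    metadata = {'snp': 0, 'mnp': 0, 'mnp_length': 0, 'indel': 0}
    for row in rows:
        data[row['ELEMENT']].append(row)
        kind = row['ALT_TYPE']
        metadata[kind] += 1
        if kind == 'mnp':
            # assumption: the MNP length is the length of the reference allele
            metadata['mnp_length'] += len(row['REF'])
    return {'data': dict(data), 'metadata': metadata}


mutations_data = group_mutations(rows)
```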

oncodrivefml.load.build_regions_tree(regions)[source]¶

Generates a binary tree with the intervals of the regions

Parameters:	regions (dict) – segments grouped by elements.
Returns:	for each chromosome, one `IntervalTree`, which is a binary tree. The leaves are intervals [low limit, high limit) and the value associated with each interval is the `tuple` (element, segment). It can be interpreted as: { chromosome: (start_position, stop_position +1): (element, segment) }
Return type:	dict of `IntervalTree`
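
The half-open interval convention [start_position, stop_position + 1) can be illustrated without an interval tree. The sketch below uses plain lists and a linear scan as a stand-in for `IntervalTree`, so it shows only the mapping, not the performance characteristics. Both function names are hypothetical.

```python
def build_region_index(regions):
    """Hypothetical stand-in: {chromosome: [(low, high, (element, segment)), ...]}."""
    index = {}
    for element, segments in regions.items():
        for segment in segments:
            # STOP is inclusive in the elements dict, so the half-open upper bound is STOP + 1
            interval = (segment['START'], segment['STOP'] + 1,
                        (element, segment['SEGMENT']))
            index.setdefault(segment['CHROMOSOME'], []).append(interval)
    return index


def find_regions(index, chromosome, position):
    """Return the (element, segment) pairs whose interval contains the position."""
    return [value for low, high, value in index.get(chromosome, [])
            if low <= position < high]
```

With a single segment spanning positions 100 to 200, a query at position 200 returns the segment while a query at 201 returns nothing, which is the effect of storing `stop_position + 1` as the upper limit.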

oncodrivefml.load.load_and_map_variants(variants_file, elements_file, blacklist=None, save_pickle=False)[source]¶

From the elements and variants file, get dictionaries with the segments grouped by element ID and the mutations grouped in the same way, as well as some information related to the mutations.

Parameters:

variants_file – mutations file (see OncodriveFML)
elements_file – elements file (see OncodriveFML)
blacklist (optional) – file with blacklisted samples (see OncodriveFML). Defaults to None. If the blacklist option is passed, the mutations are not loaded from a pickle file.
save_pickle (bool, optional) – save pickle files

Returns:

mutations and elements

Elements: elements dict

Mutations: mutations data dict

Return type:

tuple

The process is done in 3 steps:

load_regions()
build_regions_tree().
each mutation (load_mutations()) is associated with the right element ID

oncodrivefml.load.load_mutations(file, blacklist=None, metadata_dict=None)[source]¶

Parses the mutations file

Parameters:

file – mutations file (see `OncodriveFML`)
metadata_dict (dict) – dict that the function will fill with useful information
blacklist (optional) – file with blacklisted samples (see `OncodriveFML`). Defaults to None.

Yields:	One line from the mutations file as a dictionary. Each dictionary is one of the inner elements of mutations

oncodrivefml.main module¶

oncodrivefml.mtc module¶

Module containing functions related to multiple test correction

oncodrivefml.mtc.multiple_test_correction(results, num_significant_samples=2)[source]¶

Performs a multiple test correction on the analysis results

Parameters:

results (dict) – dictionary with the results
num_significant_samples (int) – minimum samples that a gene must have in order to perform the correction

Returns:	`DataFrame`. DataFrame with the q-values obtained from a multiple test correction
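
The documentation above does not state which correction procedure is applied. As an illustration of how q-values can be derived from p-values, the sketch below implements the Benjamini-Hochberg procedure in plain Python. The choice of Benjamini-Hochberg is an assumption, and the function name is hypothetical. The filtering by num_significant_samples is not reproduced.

```python
def benjamini_hochberg(pvalues):
    """Hypothetical helper: Benjamini-Hochberg q-values, in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues
```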

oncodrivefml.scores module¶

This module contains the methods associated with the scores that are assigned to the mutations.

The scores are read from a file.

Information about the stop scores.

As of December 2016, we have only measured the stops using CADD1.0.

The stops of a gene are retrieved only if there are at least 3 stops in the regions being analysed. If not, a formula is applied to derive the value of the stops from the rest of the values.

Note

This formula was obtained using the CADD scores of the coding regions. Using different regions or scores files will make the function return meaningless values.

class oncodrivefml.scores.PackScoresReader(conf)[source]¶

Bases: object

BIT_TO_REF = {(1, 0, 0): 'G', (0, 1, 1): 'C', (0, 1, 0): 'A', (0, 0, 0): '?', (0, 0, 1): 'T'}¶

SCORE_ALT = {'G': 'ACT', 'C': 'AGT', 'T': 'ACG', 'A': 'CGT'}¶

SCORE_ORDER = {'G': {'C': 1, 'T': 2, 'A': 0}, 'C': {'G': 1, 'T': 2, 'A': 0}, 'T': {'G': 2, 'C': 1, 'A': 0}, 'A': {'C': 0, 'T': 2, 'G': 1}}¶

STRUCT_SIZE = 6¶

get(chromosome, start, stop, *args, **kwargs)[source]¶

unpack(block)[source]¶

exception oncodrivefml.scores.ReaderError(msg)[source]¶: Bases: Exception

exception oncodrivefml.scores.ReaderGetError(chr, start, stop)[source]¶: Bases: oncodrivefml.scores.ReaderError

class oncodrivefml.scores.ScoreValue(ref, alt, value, ref_triplet, alt_triplet)¶

Bases: tuple

Tuple that contains the reference, the alteration, the score value and the triplets

Parameters:

ref (str) – reference base
alt (str) – altered base
value (float) – score value of that substitution
ref_triplet (str) – reference triplet
alt_triplet (str) – altered triplet

alt¶: Alias for field number 1

alt_triplet¶: Alias for field number 4

ref¶: Alias for field number 0

ref_triplet¶: Alias for field number 3

value¶: Alias for field number 2

class oncodrivefml.scores.Scores(element: str, segments: list, config: dict)[source]¶

Bases: object

Parameters:

element (str) – element ID
segments (list) – list of the segments associated to the element
config (dict) – configuration

scores_by_pos¶

dict – for each position, all possible changes and, for each change, the triplets

{ position:
    [
        ScoreValue(
            ref,
            alt_1,
            value,
            ref_triplet,
            alt_triplet
        ),
        ScoreValue(
            ref,
            alt_2,
            value,
            ref_triplet,
            alt_triplet
        ),
        ScoreValue(
            ref,
            alt_3,
            value,
            ref_triplet,
            alt_triplet
        )
    ]
}
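
A small, self-contained sketch of this structure is shown below. `ScoreValue` is redefined locally with `collections.namedtuple` so that the example runs on its own; the field order follows the field numbers documented above. The scores and the helper `score_for` are invented for illustration.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields (0: ref, 1: alt, 2: value,
# 3: ref_triplet, 4: alt_triplet)
ScoreValue = namedtuple('ScoreValue',
                        ['ref', 'alt', 'value', 'ref_triplet', 'alt_triplet'])

# Invented scores for one position whose reference base is C in the context ACG
scores_by_pos = {
    1000: [
        ScoreValue('C', 'A', 12.5, 'ACG', 'AAG'),
        ScoreValue('C', 'G', 3.1, 'ACG', 'AGG'),
        ScoreValue('C', 'T', 20.4, 'ACG', 'ATG'),
    ]
}


def score_for(scores_by_pos, position, alt):
    """Hypothetical lookup: score of a given change, or None if it is unknown."""
    for score in scores_by_pos.get(position, []):
        if score.alt == alt:
            return score.value
    return None
```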

get_all_positions() → typing.List[int][source]¶

Get all positions in the element

Returns:	list of positions
Return type:	`list` of `int`

get_score_by_position(position: int) → typing.List[oncodrivefml.scores.ScoreValue][source]¶

Get all ScoreValue objects that are associated with that position

Parameters:	position (int) – position
Returns:	list of all ScoreValue related to that position
Return type:	`list` of `ScoreValue`

get_stop_scores()[source]¶: Get the scores of the stops in a gene that fall in the regions being analyzed

class oncodrivefml.scores.ScoresTabixReader(conf)[source]¶

Bases: object

get(chromosome, start, stop, element=None)[source]¶

oncodrivefml.scores.init_scores_module(conf)[source]¶

oncodrivefml.scores.null(x)[source]¶

oncodrivefml.scores.stop_function(x)¶

oncodrivefml.signature module¶

This module contains information related to the signature.

The signature is a way of assigning probabilities to certain mutations that have some relation amongst them (e.g. cancer type, sample…).

This relation is identified by the signature_id.

The classifier parameter in the configuration of the signature specifies which column of the mutations file (MUTATIONS_HEADER) is used as the identifier for the different signature groups. If the column does not exist the classifier itself is used as value for the signature_id.

The probabilities are taken only from substitutions. For them, the two bases that surround the mutated one are taken into account. This is called the triplet. For a certain mutation in a position x the reference triplet is the base in the reference genome in position x-1, the base in x and the base in the x+1. The altered triplet of the same mutation is equal for the bases in x-1 and x+1 but the base in x is the one observed in the mutation.

signature (dict)

{ signature_id:
    {
        (ref_triplet, alt_triplet): prob
    }
}
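
The triplet construction described above can be sketched with a plain string standing in for the reference genome. The function name and the 0-based `index` are assumptions made for the example; the module itself reads the bases from the reference genome.

```python
def mutation_triplets(sequence, index, alt):
    """Hypothetical helper returning (ref_triplet, alt_triplet) for a substitution.

    sequence -- reference bases as a string (stand-in for the reference genome)
    index    -- 0-based position x of the mutated base (assumption)
    alt      -- base observed in the mutation
    """
    if index < 1 or index > len(sequence) - 2:
        raise ValueError('the mutated base needs one neighbour on each side')
    # Bases at x-1, x and x+1
    ref_triplet = sequence[index - 1:index + 2]
    # Same flanking bases, with the central base replaced by the observed one
    alt_triplet = ref_triplet[0] + alt + ref_triplet[2]
    return ref_triplet, alt_triplet
```

For a C>T change whose neighbours are A and G, the pair is `('ACG', 'ATG')`, which is the kind of key used in the signature dict.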

oncodrivefml.signature.change_ref_build(build)[source]¶

Modify the default build of the reference genome

Parameters:	build (str) – genome reference build

oncodrivefml.signature.chunkizator(iterable, size=1000)[source]¶

Creates chunks from an iterable

Parameters:

iterable –
size (int) – elements in the chunk

Returns:	list. Chunk
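
One common way to create chunks from an arbitrary iterable is `itertools.islice`. The sketch below is written as a generator that yields one list per chunk, which is an assumption about the behaviour; the name `chunks` is hypothetical.

```python
from itertools import islice


def chunks(iterable, size=1000):
    """Hypothetical generator yielding lists with at most `size` elements."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # the iterable is exhausted
        yield chunk
```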

oncodrivefml.signature.collapse_complementaries(signature)[source]¶

Add to the amount of a certain pair (ref_triplet, alt_triplet) the amount of the complementary.

Parameters:	signature (dict) – { (ref_triplet, alt_triplet): amount }
Returns:	{ (ref_triplet, alt_triplet): new_amount }. New_amount is the addition of the amount for (ref_triplet, alt_triplet) and the amount for (complementary_ref_triplet, complementary_alt_triplet)
Return type:	dict
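
A minimal sketch of the collapsing step follows. It assumes that the complementary triplet is obtained base by base (A with T, C with G) without reversing the sequence, and that only the pairs already present in the input appear in the output. Neither point is stated above, and the function name is hypothetical.

```python
# Base-wise complement table (assumption: the sequence is not reversed)
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def collapse_pairs(signature):
    """Hypothetical helper: add to each pair the amount of its complementary pair."""
    collapsed = {}
    for (ref_triplet, alt_triplet), amount in signature.items():
        partner = (ref_triplet.translate(COMPLEMENT),
                   alt_triplet.translate(COMPLEMENT))
        # A missing complementary pair contributes nothing
        collapsed[(ref_triplet, alt_triplet)] = amount + signature.get(partner, 0)
    return collapsed
```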

oncodrivefml.signature.complementary_sequence(seq)[source]¶

Parameters:	seq (str) – sequence of bases
Returns:	complementary sequence
Return type:	str

oncodrivefml.signature.compute_regions_signature(elements, cores)[source]¶

Counts triplets in the elements

Parameters:

elements –
cores (int) – cores to use

Returns:	`collections.Counter`. Counts of the triplets in the elements

oncodrivefml.signature.compute_signature(signature_function, classifier, collapse=False, include_mnp=False)[source]¶

Gets the probability of each substitution that occurs for a certain signature_id.

Each substitution is identified by the pair (reference_triplet, altered_triplet).

The signature_id is taken from the mutations field corresponding to the classifier.

Parameters:

signature_function – function that yields one mutation each time
classifier (str) – passed to load_mutations() as parameter signature_classifier.
collapse (bool) – consider a substitution and its complementary one as the same. Defaults to False.
include_mnp (bool) – use MNP mutation in the signature computation or not

Returns:

probability of each substitution (measured by the triplets) grouped by the signature_classifier

{ signature_id:
    {
        (ref_triplet, alt_triplet): prob
    }
}

Return type:

dict

Warning

Only substitutions (MNP are optional) are taken into account

oncodrivefml.signature.correct_signature_by_triplets_frequencies(signature, triplets_frequencies)[source]¶

Normalizes the signature by the frequency of the triplets

Parameters:

signature (dict) – see signature
triplets_frequencies (dict) – {triplet: frequency}

Returns:	dict. Normalized signature

oncodrivefml.signature.count_valid_trinucleotides(trinucleotides_dict)[source]¶

Count how many trinucleotides are valid

Parameters:	trinucleotides_dict (dict) – trinucleotides counts
Returns:	int. Valid trinucleotides

oncodrivefml.signature.get_alternate_signature(line)[source]¶

Parameters:	line (dict) – contains the previous base, the alteration and the next base
Returns:	triplet with the central base replaced by the alteration indicated in the line
Return type:	str

oncodrivefml.signature.get_build()[source]¶

oncodrivefml.signature.get_normalized_frequencies(signature, triplets_frequencies)[source]¶

Divides the frequency of each triplet alteration by the frequency of the reference triplet to get the normalized signature

Parameters:

signature (dict) – {(ref_triplet, alt_triplet): counts}
triplets_frequencies (dict) – {triplet: frequency}

Returns:	dict. Normalized signature
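
The division described for get_normalized_frequencies() can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that every reference triplet has a non-zero frequency in `triplets_frequencies`.

```python
def normalize_by_triplet_frequency(signature, triplets_frequencies):
    """Hypothetical helper: divide each count by the frequency of its reference triplet."""
    return {
        (ref_triplet, alt_triplet): count / triplets_frequencies[ref_triplet]
        for (ref_triplet, alt_triplet), count in signature.items()
    }
```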

oncodrivefml.signature.get_ref(chromosome, start, size=1)[source]¶

Gets a sequence from the reference genome

Parameters:

chromosome (str) – chromosome
start (int) – start position where to look
size (int) – number of bases to retrieve

Returns:	str. Sequence from the reference genome

oncodrivefml.signature.get_ref_triplet(chromosome, start)[source]¶

Parameters:

chromosome (str) – chromosome identifier
start (int) – starting position

Returns:	3 bases from the reference genome
Return type:	str

oncodrivefml.signature.get_reference_signature(line)[source]¶

Parameters:	line (dict) – contains the chromosome and the position
Returns:	triplet around certain positions
Return type:	str

oncodrivefml.signature.is_valid_trinucleotides(trinucleotide)[source]¶

Check if a trinucleotide has a nucleotide other than A, C, G, T

Parameters:	trinucleotide (str) – triplet
Returns:	bool.

oncodrivefml.signature.load_signature(signature_config, signature_function, trinucleotides_counts=None, load_pickle=None, save_pickle=False)[source]¶

Computes the probability that certain mutation occurs.

Parameters:

signature_config (dict) – information of the signature (see configuration)
signature_function – function that yields one mutation each time
trinucleotides_counts (dict, optional) – counts of trinucleotides used to correct the signature
load_pickle (str, optional) – path to the pickle file
save_pickle (str, optional) – path to pickle file

Returns:

probability of each substitution (measured by the triplets) grouped by the signature_id

{ signature_id:
    {
        (ref_triplet, alt_triplet): prob
    }
}

Return type:

dict

Before computing the signature, it is checked whether a pickle file with the signature already exists or not.

oncodrivefml.signature.load_trinucleotides_counts(region)[source]¶

Get the trinucleotides counts for a precomputed region: whole exome or whole genome

Parameters:	region (str) – whole genome or whole exome
Returns:	dict. Counts of the different trinucleotides

oncodrivefml.signature.ref_build = 'hg19'¶: Build of the Reference Genome

oncodrivefml.signature.sum2one_dict(signature_counts)[source]¶

Associates to each key (tuple(reference_triplet, altered_triplet)) the value divided by the total amount

Parameters:	signature_counts (dict) – pair key-amount {(ref_triplet, alt_triplet): value}
Returns:	pair key-(amount/total_amount)
Return type:	dict
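
As a short illustration of this normalization, the sketch below divides every amount by the total so that the values sum to one. The function name is hypothetical, and an empty or all-zero input would raise `ZeroDivisionError` in this sketch.

```python
def sum_to_one(signature_counts):
    """Hypothetical helper: turn amounts into fractions of the total amount."""
    total = sum(signature_counts.values())
    return {key: value / total for key, value in signature_counts.items()}
```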

oncodrivefml.signature.triplet_counter_executor(elements)[source]¶

For a list of regions, get all the triplets present in all the segments

Parameters:	elements (`list` of `list`) – list of lists of segments
Returns:	`collections.Counter`. Count of each triplet in the regions

oncodrivefml.signature.triplets(sequence)[source]¶

Parameters:	sequence (str) – sequence of nucleotides
Yields:	str. Triplet
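
A sliding window of width 3 is one straightforward way to yield the triplets of a sequence. The sketch below assumes overlapping windows that advance one base at a time; the function name is hypothetical.

```python
def sliding_triplets(sequence):
    """Hypothetical generator: every run of 3 consecutive bases in the sequence."""
    for i in range(len(sequence) - 2):
        yield sequence[i:i + 3]
```

Feeding the output to `collections.Counter` gives the count of each triplet in a sequence.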

oncodrivefml.signature.yield_mutations(mutations)[source]¶

Yields one mutation each time from a list of mutations

Parameters:	mutations (dict) – mutations
Yields:	Mutation

oncodrivefml.stats module¶

This module contains different statistical methods used to compare the observed and the simulated scores

class oncodrivefml.stats.ArithmeticMean[source]¶

Bases: object

static calc(values)[source]¶

Computes the arithmetic mean

Parameters:	values (`list`, `array`) – array of values
Returns:	mean value
Return type:	float

static calc_observed(values, observed)[source]¶

Measure how many times the mean of the values is higher than the mean of the observed values

Parameters:

values (array) – m x n matrix with scores (m: number of randomizations; n: number of mutations)
observed (list, array) – n size vector with the observed scores (n: number of mutations)

Returns:

the number of times that the mean value of a randomization is greater than or equal to the mean observed value (as int) and the number of times that the mean value of a randomization is lower than or equal to the mean observed value (as int).

Return type:

tuple
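
The two counts can be sketched in plain Python as follows. The package presumably works on arrays; this version uses lists of lists and the standard library only, and the function name is hypothetical.

```python
from statistics import mean


def count_extremes(values, observed):
    """Hypothetical helper returning (count >= observed mean, count <= observed mean).

    values   -- m x n list of lists (m randomizations, n mutations)
    observed -- list with the n observed scores
    """
    observed_mean = mean(observed)
    random_means = [mean(row) for row in values]
    greater_or_equal = sum(1 for m in random_means if m >= observed_mean)
    lower_or_equal = sum(1 for m in random_means if m <= observed_mean)
    return greater_or_equal, lower_or_equal
```

A randomization whose mean equals the observed mean is counted in both numbers, so the two counts can add up to more than m.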

class oncodrivefml.stats.ArithmeticMeanHeteroscedasticScores[source]¶

Bases: object

static calc_observed(values, observed)[source]¶

class oncodrivefml.stats.GeometricMean[source]¶

Bases: object

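The geometric mean used is not the standard one: 1 is added to each value before computing the product, and 1 is subtracted from the result.

$\left(\prod \limits_{i=1}^n (x_i+1)\right)^{1/n} - 1 = \sqrt[n]{(x_1+1)(x_2+1) \cdots (x_n+1)} - 1$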
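
A numerically convenient way to evaluate this shifted geometric mean is to work in log space. The sketch below uses only the standard library and handles a flat list of values; the function name is hypothetical, and every value is assumed to be greater than -1.

```python
import math


def shifted_geometric_mean(values):
    """Hypothetical helper: (prod(x_i + 1)) ** (1 / n) - 1, computed via logarithms."""
    if not values:
        raise ValueError('at least one value is required')
    # log1p(x) = log(x + 1); averaging the logs gives the log of the n-th root
    log_sum = sum(math.log1p(x) for x in values)
    # expm1(y) = exp(y) - 1, which applies the final "- 1" of the formula
    return math.expm1(log_sum / len(values))
```

For the values 0 and 3, the product of the shifted values is 1 × 4 = 4, its square root is 2, and the result is 2 − 1 = 1.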

static calc(values)[source]¶

Computes the geometric mean of a set of values.

Parameters:	values (`list`, `array`) – set of values
Returns:	geometric mean (float); if the input is a matrix, the geometric mean by columns (array)
Return type:	float or array

static calc_observed(values, observed)[source]¶

Measure how many times the geometric mean of the values is higher than the geometric mean of the observed values

Parameters:

values (array) – m x n matrix with scores (m: number of randomizations; n: number of mutations)
observed (list, array) – n size vector with the observed scores (n: number of mutations)

Returns:

the number of times that the mean value of a randomization is greater than or equal to the mean observed value (as int) and the number of times that the mean value of a randomization is lower than or equal to the mean observed value (as int).

Return type:

tuple

class oncodrivefml.stats.Maximum[source]¶

Bases: object

static calc(values)[source]¶

static calc_observed(values, observed)[source]¶

oncodrivefml.store module¶

This module contains the methods used to store the results.

3 different types of output are available:

tsv file

png graph: uses the tsv file and matplotlib

html graph: uses the tsv file and bokeh

class oncodrivefml.store.QQPlot(input_file, cutoff=True, rename_fields=None, extra_fields=None)[source]¶

Bases: object

Parameters:

input_file – tsv file with the data
cutoff (bool) – add cutoffs to the figure
rename_fields (dict) – column names from the input file can be renamed by providing a dictionary {old_name : new_name}
extra_fields (list) – list of column names to be passed to the figure data. Needed, for example, to search by them.

add_search_widget(fields)[source]¶

Add text input for each field.

Parameters:	fields (`str` or `list`) – list of fields to do a search.

add_tooltip()[source]¶: Adds tooltip to show the parameters of each glyph in the figure

add_tooltip_enhanced()[source]¶: The tooltip is shown via JavaScript to avoid being blocked in areas with a high density of points

show(output_path, showit=True, notebook=False)[source]¶

Show the figure

Parameters:

output_path – file where to store the figure
showit (bool) – the figure is displayed (widgets and the like are not shown) or is fully saved. Defaults to True.
notebook (bool) – whether it is called from a notebook or not. Defaults to False.

oncodrivefml.store.add_symbol(df)[source]¶

oncodrivefml.store.eliminate_duplicates(df)[source]¶

oncodrivefml.store.store_html(input_file, output_path)[source]¶

Create the QQPlot and save it.

Parameters:

input_file – tsv file with the data
output_path – file where to store the graph
showit (bool) – defaults to False. See `show()`.

oncodrivefml.store.store_png(input_file, output_file, showit=False)[source]¶

Creates a figure from the results.

Parameters:

input_file – tsv file with the results
output_file – file where to store the figure
showit (bool) – calls `show()` before returning. Defaults to False.

oncodrivefml.store.store_tsv(results, result_file)[source]¶

Saves the results in a tsv file sorted by pvalue

Parameters:

results (`DataFrame`) – results of the analysis
result_file – file where to store the results

oncodrivefml.utils module¶

This module contains some useful methods

oncodrivefml.utils.defaultdict_list()[source]¶

Shortcut

Returns:	`defaultdict` of `list`

oncodrivefml.utils.executor_run(executor)[source]¶

Method to call the run method

Parameters:	executor (`ElementExecutor`) –
Returns:	`run()`

oncodrivefml.utils.exists_path(path)[source]¶

oncodrivefml.utils.loop_logging(iterable, size=None, step=1)[source]¶

Loop through an iterable object displaying messages using info()

Parameters:

iterable –
size (int) – Defaults to None.
step (int) – Defaults to 1.

Yields:	The iterable element

oncodrivefml.walker module¶

oncodrivefml.walker_cython module¶

Module contents¶