oncodrivefml package¶
Submodules¶
oncodrivefml.compute module¶
oncodrivefml.config module¶
This module contains code related to the configuration file (see Configuration).
Additionally, it includes other file-related code, especially from bgconfig.
-
oncodrivefml.config.
load_configuration
(config_file, override=None)[source]¶ Load the configuration file and check its format.
Parameters: config_file – configuration file path Returns: configuration as a dict
Return type: bgconfig.BGConfig
-
oncodrivefml.config.
possible_extensions
= ['.gz', '.xz', '.bz2', '.tsv', '.txt']¶ Some expected extensions
-
oncodrivefml.config.
remove_extension_and_replace_special_characters
(file_path)[source]¶ Modifies the name of a file by removing any extension in possible_extensions and replacing any character in special_characters with -.
Parameters: file_path – path to a file Returns: file name modified Return type: str
-
oncodrivefml.config.
special_characters
= ['.', '_']¶ Some special characters
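The renaming described above can be sketched as follows. This is an illustrative re-implementation written from the description, not the package source. Taking the base name of the path and stripping stacked extensions repeatedly (so that `.tsv.gz` loses both suffixes) are assumptions.

```python
import os

# Values copied from the module attributes documented above.
possible_extensions = ['.gz', '.xz', '.bz2', '.tsv', '.txt']
special_characters = ['.', '_']


def remove_extension_and_replace_special_characters(file_path):
    """Illustrative sketch only; the real function may differ in detail."""
    # Assumption: only the file name, not the directory part, is modified.
    file_name = os.path.basename(file_path)
    # Assumption: extensions are stripped repeatedly, so 'x.tsv.gz' -> 'x'.
    stripped = True
    while stripped:
        stripped = False
        for ext in possible_extensions:
            if file_name.endswith(ext):
                file_name = file_name[:-len(ext)]
                stripped = True
    # Replace every special character with a dash.
    for char in special_characters:
        file_name = file_name.replace(char, '-')
    return file_name


example_name = remove_extension_and_replace_special_characters('data/my_muts.v2.tsv.gz')
```

Under these assumptions, `example_name` is `my-muts-v2`.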
oncodrivefml.indels module¶
This module contains all utilities to process insertions and deletions.
Currently 3 methods have been implemented to compute the impact of the indels.
As a set of substitutions (‘max’):
The indel is treated as a set of substitutions. It is used for non-coding regions.
The functional impact of the observed mutation is the maximum over all the substitutions. The background is simulated as substitutions are.
As a stop (‘stop’):
The indel is expected to produce a stop in the genome, unless it is a frame-shift indel. It is used for coding regions.
The functional impact is derived from the functional impact of the stops of the gene. The background is also simulated as stops.
-
class
oncodrivefml.indels.
Indel
(scores)[source]¶ Bases:
object
Methods to compute the impact of indels for the observed and the background
Parameters:
- scores (Scores) – functional impact per position
- signature (dict) – see signature
- signature_id (str) – classifier for the signatures
- method (str) – identifies which method to use to compute the functional impact (see methods)
- strand (str) – if the element being analysed has positive, negative or unknown strand (+,-,.)
-
compute_scores
(reference, alternation, initial_position, size)[source]¶ Compute the scores of all substitutions between the reference and altered sequences
Returns: Scores of the substitutions in the indel. nan when it is not possible to compute a value.
-
get_background_indel_scores_as_stops
()[source]¶ Returns: Values of the stop scores of the gene Return type: list
-
get_background_indel_scores_as_substitutions_without_signature
()[source]¶ Return the score values of all possible substitutions. Returns: list.
-
get_indel_score_from_stop
(mutation)[source]¶ Compute the indel score as a stop
A function is applied to the values of the scores in the gene
Parameters: mutation (dict) – a mutation object as in here Returns: Score value. nan if it is not possible to compute it Return type: float
-
get_indel_score_max_of_subs
(mutation)[source]¶ Compute the score of an indel by treating each alteration as a substitution.
Parameters: mutation (dict) – a mutation object as in here Returns: Maximum value of all substitutions Return type: float
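The ‘max’ method can be illustrated with a minimal sketch. It assumes that scores are stored per position as `ScoreValue` tuples (as in the scores module) and that the indel has already been turned into aligned reference and altered sequences. The function name, its arguments and the example values are hypothetical.

```python
import math
from collections import namedtuple

ScoreValue = namedtuple('ScoreValue', ['ref', 'alt', 'value', 'change'])


def max_substitution_score(scores_by_pos, initial_position, reference, altered):
    """Return the highest score among the substitutions implied by an indel.

    Positions where both sequences agree, or where no score is stored,
    contribute nothing. nan is returned when nothing can be scored.
    """
    values = []
    for offset, (ref_base, alt_base) in enumerate(zip(reference, altered)):
        if ref_base == alt_base:
            continue  # not a substitution at this position
        for score in scores_by_pos.get(initial_position + offset, []):
            if score.ref == ref_base and score.alt == alt_base:
                values.append(score.value)
    return max(values) if values else math.nan


# Invented example scores for two consecutive positions.
example_scores = {
    100: [ScoreValue('A', 'C', 1.5, None),
          ScoreValue('A', 'G', 2.0, None),
          ScoreValue('A', 'T', 0.3, None)],
    101: [ScoreValue('C', 'A', 4.0, None),
          ScoreValue('C', 'G', 0.7, None),
          ScoreValue('C', 'T', 1.1, None)],
}
example_max = max_substitution_score(example_scores, 100, 'AC', 'GA')
```

Here the two substitutions are A>G (2.0) and C>A (4.0), so `example_max` is 4.0.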
-
get_mutation_sequences
(mutation, size)[source]¶ Get the reference and altered sequences of the indel along the window size
Returns: Reference and altered sequences
-
static
is_frameshift
(size)[source]¶ Parameters: size (int) – length of the indel Returns: bool. Whether the size is a multiple of 3 (if the frames have been enabled in the configuration)
-
is_in_repetitive_region
(mutation)[source]¶ Check if an indel falls in a repetitive region
Looking in a window centred on the indel, check whether the sequence of the indel appears at least the number of times specified in the configuration. The window is twice the size of the indel multiplied by that number of times.
Parameters: mutation (dict) – a mutation object as in here Returns: Whether the indel falls in a repetitive region or not Return type: bool
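The repetitive-region check can be sketched on a plain string. This is a sketch under stated assumptions, not the package code: the reference is passed as a 0-based string, overlapping occurrences are counted, and the exact centring of the window is a guess. All names are hypothetical.

```python
def falls_in_repetitive_region(sequence, indel_start, indel_seq, min_repeats):
    """Check whether indel_seq occurs at least min_repeats times around the indel.

    The window spans size * min_repeats bases on each side of the indel
    centre, i.e. twice the indel size multiplied by min_repeats in total.
    """
    size = len(indel_seq)
    if size == 0 or min_repeats < 1:
        return False
    half_window = size * min_repeats
    centre = indel_start + size // 2
    start = max(0, centre - half_window)
    window = sequence[start:centre + half_window]
    # Count occurrences, allowing overlaps (an assumption).
    count = 0
    pos = window.find(indel_seq)
    while pos != -1:
        count += 1
        pos = window.find(indel_seq, pos + 1)
    return count >= min_repeats


repetitive = 'GGG' + 'AT' * 5 + 'CCC'
in_repeat = falls_in_repetitive_region(repetitive, 7, 'AT', 3)
not_in_repeat = falls_in_repetitive_region('ACGTACGGTCAGTTCA', 6, 'GG', 3)
```

In the first call the window contains five copies of `AT`, so the indel is flagged as repetitive. In the second, `GG` appears only once.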
oncodrivefml.load module¶
This module contains the methods used to load and parse the input files: elements and mutations
- elements (dict) contains all the segments related to one element. The information is taken from the elements_file. Basic structure:
    {
        element_id: [
            {
                'CHROMOSOME': chromosome,
                'START': start_position_of_the_segment,
                'END': end_position_of_the_segment,
                'STRAND': strand (+ -> positive | - -> negative),
                'ELEMENT': element_id,
                'SEGMENT': segment_id,
                'SYMBOL': symbol_id
            }
        ]
    }
- mutations (dict) contains all the mutations for each element. Most of the information is taken from the mutations_file, except the element_id and the segment, which are taken from the elements. More information is added during the execution. Basic structure:
    {
        element_id: [
            {
                'CHROMOSOME': chromosome,
                'POSITION': position_where_the_mutation_occurs,
                'REF': reference_sequence,
                'ALT': alteration_sequence,
                'SAMPLE': sample_id,
                'ALT_TYPE': type_of_the_mutation,
                'CANCER_TYPE': group to which the mutation belongs,
                'SIGNATURE': a different grouping category,
            }
        ]
    }
- mutations_data (dict) contains the mutations dict and some metadata about the mutations. Currently, the number of substitutions and indels. Basic structure:
    {
        'data': { mutations dict },
        'metadata': {
            'snp': amount of SNP mutations,
            'mnp': amount of MNP mutations,
            'mnp_length': total length of the MNP mutations,
            'indel': amount of indels
        }
    }
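As a rough illustration of how the metadata relates to the mutations dict, the sketch below wraps a mutations dict into the mutations_data shape. The ALT_TYPE labels (`'snp'`, `'mnp'`, `'indel'`) and the use of the REF length as the MNP length are assumptions made for this example. The function name is hypothetical, and in the package the metadata is filled while the file is parsed.

```python
def summarise_mutations(mutations):
    """Wrap a mutations dict into a mutations_data-like structure."""
    metadata = {'snp': 0, 'mnp': 0, 'mnp_length': 0, 'indel': 0}
    for element_mutations in mutations.values():
        for mut in element_mutations:
            alt_type = mut['ALT_TYPE']  # assumed labels: snp, mnp, indel
            if alt_type == 'snp':
                metadata['snp'] += 1
            elif alt_type == 'mnp':
                metadata['mnp'] += 1
                # Assumption: the MNP length is the length of REF.
                metadata['mnp_length'] += len(mut['REF'])
            elif alt_type == 'indel':
                metadata['indel'] += 1
    return {'data': mutations, 'metadata': metadata}


# Invented example input following the mutations structure above.
example_mutations = {
    'GENE1': [
        {'REF': 'A', 'ALT': 'T', 'ALT_TYPE': 'snp'},
        {'REF': 'AC', 'ALT': 'TG', 'ALT_TYPE': 'mnp'},
        {'REF': 'A', 'ALT': '-', 'ALT_TYPE': 'indel'},
    ],
    'GENE2': [
        {'REF': 'C', 'ALT': 'G', 'ALT_TYPE': 'snp'},
    ],
}
example_data = summarise_mutations(example_mutations)
```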
-
oncodrivefml.load.
build_regions_tree
(regions)[source]¶ Generates a binary tree with the intervals of the regions
Parameters: regions (dict) – segments grouped by elements. Returns: for each chromosome, one IntervalTree, which is a binary tree. The leaves are intervals [low limit, high limit) and the value associated with each interval is the tuple (element, segment). It can be interpreted as:
    { chromosome: (start_position, end_position + 1): (element, segment) }
Return type: dict of IntervalTree
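The half-open interval convention can be shown with a small stand-in. To stay within the standard library, this sketch replaces IntervalTree with a sorted list that is scanned linearly. It illustrates the data layout only and is not how the package performs its lookups. The function names are hypothetical.

```python
from collections import defaultdict


def build_regions_index(regions):
    """Map each chromosome to a list of (low, high, (element, segment))."""
    index = defaultdict(list)
    for element, segments in regions.items():
        for seg in segments:
            # END is inclusive, so the half-open upper limit is END + 1.
            interval = (seg['START'], seg['END'] + 1, (element, seg['SEGMENT']))
            index[seg['CHROMOSOME']].append(interval)
    for intervals in index.values():
        intervals.sort(key=lambda iv: iv[:2])
    return dict(index)


def find_overlaps(index, chromosome, position):
    """Return the (element, segment) tuples whose interval contains position."""
    return [data for low, high, data in index.get(chromosome, [])
            if low <= position < high]


# Invented example following the elements structure above.
example_regions = {
    'GENE1': [{'CHROMOSOME': '1', 'START': 100, 'END': 200, 'SEGMENT': 's1'}],
    'GENE2': [{'CHROMOSOME': '1', 'START': 150, 'END': 300, 'SEGMENT': 's2'}],
}
example_index = build_regions_index(example_regions)
hits_at_200 = find_overlaps(example_index, '1', 200)
hits_at_201 = find_overlaps(example_index, '1', 201)
```

Position 200 is the last base of GENE1, so it matches both elements, while position 201 matches only GENE2.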
-
oncodrivefml.load.
mutations
(file, blacklist=None, metadata_dict=None)[source]¶ Parse the mutations file
Parameters:
- file – mutations file (see OncodriveFML)
- metadata_dict (dict) – dict that the function will fill with useful information
- blacklist (optional) – file with blacklisted samples (see OncodriveFML). Defaults to None.
Yields: One line from the mutations file as a dictionary, that is, one of the inner elements of mutations
-
oncodrivefml.load.
mutations_and_elements
(variants_file, elements_file, blacklist=None)[source]¶ From the elements and variants files, get dictionaries with the segments grouped by element ID and the mutations grouped in the same way, as well as some information related to the mutations.
Parameters:
- variants_file – mutations file (see OncodriveFML)
- elements_file – elements file (see OncodriveFML)
- blacklist (optional) – file with blacklisted samples (see OncodriveFML). Defaults to None. If the blacklist option is passed, the mutations are not loaded from a pickle file.
Returns: mutations and elements
Elements: elements dict
Mutations: mutations data dict
The process is done in 3 steps:
- load_regions()
- build_regions_tree()
- each mutation (mutations()) is associated with the right element ID
oncodrivefml.main module¶
oncodrivefml.mtc module¶
Module containing functions related to multiple test correction
oncodrivefml.oncodrivefml module¶
oncodrivefml.reference module¶
This module contains information related to the reference genome.
-
oncodrivefml.reference.
change_build
(build)[source]¶ Modify the default build of the reference genome
Parameters: build (str) – genome reference build
-
oncodrivefml.reference.
count_valid_trinucleotides
(trinucleotides_dict)[source]¶ Count how many trinucleotides are valid
Parameters: trinucleotides_dict (dict) – trinucleotides counts Returns: int. Valid trinucleotides
-
oncodrivefml.reference.
get_ref
(chromosome, start, size=1)[source]¶ Gets a sequence from the reference genome
Returns: str. Sequence from the reference genome
-
oncodrivefml.reference.
get_ref_triplet
(chromosome, start)[source]¶ Returns: 3 bases from the reference genome
-
oncodrivefml.reference.
is_valid_trinucleotides
(trinucleotide)[source]¶ Check whether a trinucleotide contains a nucleotide other than A, C, G or T
Parameters: trinucleotide (str) – triplet Returns: bool.
-
oncodrivefml.reference.
ref_build
= 'hg38'¶ Build of the Reference Genome
-
oncodrivefml.reference.
triplet_counter_executor
(elements)[source]¶ For a list of regions, get all the triplets present in all the segments
Parameters: elements (list of list) – list of lists of segments Returns: collections.Counter. Count of each triplet in the regions
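Triplet counting and validity checking can be sketched on a plain sequence string. The helpers below are hypothetical and do not read a reference genome. Treating a triplet as valid when it contains only A, C, G and T, and summing the counts of the valid triplets, are assumptions about the intended behaviour.

```python
from collections import Counter

VALID_BASES = set('ACGT')


def triplet_is_valid(triplet):
    """True when the triplet has 3 bases and all of them are A, C, G or T."""
    return len(triplet) == 3 and set(triplet) <= VALID_BASES


def count_triplets(sequence):
    """Count every overlapping triplet in a sequence."""
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))


def total_valid(triplet_counts):
    """Sum the counts of the triplets that contain only valid bases."""
    return sum(n for triplet, n in triplet_counts.items()
               if triplet_is_valid(triplet))


example_counts = count_triplets('ACGTN')
example_valid = total_valid(example_counts)
```

The sequence `ACGTN` yields the triplets ACG, CGT and GTN. The last one contains an N, so two of the three are valid.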
oncodrivefml.scores module¶
This module contains the methods associated with the scores that are assigned to the mutations.
The scores are read from a file.
Information about the stop scores.
As of December 2016, we have only measured the stops using CADD1.0.
The stops of a gene are retrieved only if there are at least 3 stops in the regions being analysed. If not, a formula is applied to derive the value of the stops from the rest of the values.
Note
This formula was obtained using the CADD scores of the coding regions. Using different regions or scores files will make the function return meaningless values.
-
class
oncodrivefml.scores.
PackScoresReader
(conf)[source]¶ Bases:
object
-
BIT_TO_REF
= {(0, 0, 0): '?', (0, 0, 1): 'T', (0, 1, 0): 'A', (0, 1, 1): 'C', (1, 0, 0): 'G'}¶
-
SCORE_ALT
= {'A': 'CGT', 'C': 'AGT', 'G': 'ACT', 'T': 'ACG'}¶
-
SCORE_ORDER
= {'A': {'C': 0, 'G': 1, 'T': 2}, 'C': {'A': 0, 'G': 1, 'T': 2}, 'G': {'A': 0, 'C': 1, 'T': 2}, 'T': {'A': 0, 'C': 1, 'G': 2}}¶
-
STRUCT_SIZE
= 6¶
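The two lookup tables above are consistent with each other: SCORE_ORDER gives, for each reference base, the index of every alternate base within the SCORE_ALT string. The sketch below checks that relation. The `pick_score` helper is hypothetical and assumes that a record stores its three scores in SCORE_ALT order, which this page does not state.

```python
# Values copied from the class attributes documented above.
SCORE_ALT = {'A': 'CGT', 'C': 'AGT', 'G': 'ACT', 'T': 'ACG'}
SCORE_ORDER = {
    'A': {'C': 0, 'G': 1, 'T': 2},
    'C': {'A': 0, 'G': 1, 'T': 2},
    'G': {'A': 0, 'C': 1, 'T': 2},
    'T': {'A': 0, 'C': 1, 'G': 2},
}

# Derive the index table from the alternate-base strings.
derived_order = {
    ref: {alt: i for i, alt in enumerate(alts)}
    for ref, alts in SCORE_ALT.items()
}


def pick_score(ref, alt, three_scores):
    """Hypothetical helper: select the score of ref>alt from a 3-value record."""
    return three_scores[SCORE_ORDER[ref][alt]]


tables_agree = derived_order == SCORE_ORDER
example_pick = pick_score('G', 'C', (0.1, 0.2, 0.3))
```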
-
-
class
oncodrivefml.scores.
ScoreValue
(ref, alt, value, change)¶ Bases:
tuple
Tuple that contains the reference, the alteration, the score value and the triplets
Parameters: -
alt
¶ Alias for field number 1
-
change
¶ Alias for field number 3
-
ref
¶ Alias for field number 0
-
value
¶ Alias for field number 2
-
-
class
oncodrivefml.scores.
Scores
(element: str, segments: list, config: dict)[source]¶ Bases:
object
Parameters: -
scores_by_pos
¶ For each position, get all possible changes and, for each change, the triplets
    {
        position: [
            ScoreValue(ref, alt_1, value, change),
            ScoreValue(ref, alt_2, value, change),
            ScoreValue(ref, alt_3, value, change)
        ]
    }
Type: dict
-
get_all_positions
() → List[int][source]¶ Get all positions in the element
Returns: list of positions Return type: list of int
-
get_score_by_position
(position: int) → List[oncodrivefml.scores.ScoreValue][source]¶ Get all ScoreValue objects that are associated with that position
Parameters: position (int) – position Returns: list of all ScoreValue related to that position Return type: list of ScoreValue
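The field aliases listed for ScoreValue are consistent with a `collections.namedtuple`. The sketch below builds a small scores_by_pos mapping by hand and wraps it in a toy class offering the two accessors described above. `ToyScores`, the example values, and the choices to sort positions and return an empty list for unknown positions are all assumptions. The real Scores class reads its values from a scores file.

```python
from collections import namedtuple

ScoreValue = namedtuple('ScoreValue', ['ref', 'alt', 'value', 'change'])


class ToyScores:
    """Minimal stand-in exposing scores_by_pos and its two accessors."""

    def __init__(self, scores_by_pos):
        self.scores_by_pos = scores_by_pos

    def get_all_positions(self):
        # Assumption: positions are returned in ascending order.
        return sorted(self.scores_by_pos)

    def get_score_by_position(self, position):
        # Assumption: unknown positions yield an empty list.
        return self.scores_by_pos.get(position, [])


# Invented values: reference base A at position 100, triplet context CAG.
toy = ToyScores({
    100: [
        ScoreValue('A', 'C', 1.5, ('CAG', 'CCG')),
        ScoreValue('A', 'G', 2.0, ('CAG', 'CGG')),
        ScoreValue('A', 'T', 0.3, ('CAG', 'CTG')),
    ],
})
first = toy.get_score_by_position(100)[0]
```

Fields can be read by name (`first.value`) or by index (`first[2]`), matching the aliases listed for ScoreValue.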
-
-
oncodrivefml.scores.
stop_function
(x)¶
oncodrivefml.signature module¶
This module contains information related to the signature.
The signature is a way of assigning probabilities to certain mutations that have some relation amongst them (e.g. cancer type, sample…). This relation is identified by the signature_id.
The classifier parameter in the configuration of the signature specifies which column of the mutations file (MUTATIONS_HEADER) is used as the identifier for the different signature groups.
If not provided, all mutations contribute to one global signature.
The probabilities are taken only from substitutions. For them, the two bases that surround the mutated one are taken into account; together they form the triplet. For a mutation at position x, the reference triplet consists of the reference genome bases at positions x-1, x and x+1. The altered triplet of the same mutation has the same bases at x-1 and x+1, but the base at x is the one observed in the mutation.
signature (dict)
    { signature_id: { (ref_triplet, alt_triplet): prob } }
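The triplet construction and the signature lookup can be sketched as follows. The reference is a plain 0-based string here instead of a genome query. The helper names, the example probability and the fallback of 0.0 for unseen triplet pairs are assumptions.

```python
def triplets_for(reference, position, alt):
    """Return (ref_triplet, alt_triplet) for a substitution at position."""
    if position < 1 or position > len(reference) - 2:
        raise ValueError('position needs one flanking base on each side')
    ref_triplet = reference[position - 1:position + 2]
    # Same flanking bases, observed base in the middle.
    alt_triplet = ref_triplet[0] + alt + ref_triplet[2]
    return ref_triplet, alt_triplet


def mutation_probability(signature, signature_id, ref_triplet, alt_triplet):
    """Look up the probability of a triplet change for one signature group."""
    # Assumption: unseen groups or triplet pairs get probability 0.0.
    return signature.get(signature_id, {}).get((ref_triplet, alt_triplet), 0.0)


# Invented signature with a single entry.
example_signature = {'sample_A': {('ACG', 'ATG'): 0.02}}

ref_t, alt_t = triplets_for('TTACGTT', 3, 'T')
example_prob = mutation_probability(example_signature, 'sample_A', ref_t, alt_t)
```

The base at index 3 is C, flanked by A and G, so the C>T substitution maps to the pair (`ACG`, `ATG`).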
oncodrivefml.stats module¶
This module contains different statistical methods used to compare the observed and the simulated scores
-
class
oncodrivefml.stats.
ArithmeticMean
[source]¶ Bases:
object
-
static
calc
(values)[source]¶ Computes the arithmetic mean
Parameters: values (list, array) – array of values Returns: mean value Return type: float
-
-
class
oncodrivefml.stats.
GeometricMean
[source]¶ Bases:
object
The geometric mean used is not the standard.
-
static
calc
(values)[source]¶ Computes the geometric mean of a set of values.
Parameters: values (list, array) – set of values Returns: geometric mean; if the input is a matrix, an array with the geometric mean by columns Return type: float or array
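For orientation, the textbook versions of both statistics are sketched below in pure Python. The package states that its geometric mean is not the standard one, and this page does not say how it differs, so the code shows only the standard definition and should not be read as the package's formula. The function names are hypothetical.

```python
import math


def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def standard_geometric_mean(values):
    """Textbook geometric mean of strictly positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def standard_geometric_mean_by_columns(matrix):
    """Apply the textbook geometric mean to each column of a list of rows."""
    return [standard_geometric_mean(column) for column in zip(*matrix)]


example_mean = arithmetic_mean([1, 2, 3])
example_gmean = standard_geometric_mean([1, 4, 16])
example_columns = standard_geometric_mean_by_columns([[1, 2], [4, 8]])
```

The textbook definition fails on zeros or negative values because of the logarithm.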
-
oncodrivefml.store module¶
This module contains the methods used to store the results.
3 different types of output are available:
- tsv file
- png graph: uses the tsv file and matplotlib
- html graph: uses the tsv file and bokeh
-
class
oncodrivefml.store.
QQPlot
(input_file, cutoff=True, rename_fields=None, extra_fields=None)[source]¶ Bases:
object
Parameters: - input_file – tsv file with the data
- cutoff (bool) – add cutoffs to the figure
- rename_fields (dict) – column names from the input file can be renamed providing a dictionary {old_name : new_name}
- extra_fields (list) – list of column names to be passed to the figure data. Needed, for example, to search by them.
-
add_search_widget
(fields)[source]¶ Add text input for each field.
Parameters: fields (str or list) – list of fields to search by.
-
add_tooltip_enhanced
()[source]¶ The tooltip is shown via JavaScript to avoid being blocked in areas with a high density of points
-
oncodrivefml.store.
store_html
(input_file, output_path)[source]¶ Create the QQPlot and save it.
Parameters:
oncodrivefml.utils module¶
This module contains some useful methods
-
oncodrivefml.utils.
defaultdict_list
()[source]¶ Shortcut
Returns: defaultdict of list
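A shortcut of this kind is typically used as the default factory of an outer defaultdict, which gives a two-level mapping whose leaves are lists. The sketch below re-creates the function from its description. The nested usage and the example keys are assumptions about how it might be used, not taken from the package.

```python
from collections import defaultdict


def defaultdict_list():
    """Return a defaultdict whose missing keys start as empty lists."""
    return defaultdict(list)


# Two-level grouping: element -> sample -> list of positions.
by_element = defaultdict(defaultdict_list)
by_element['GENE1']['sample_A'].append(12345)
by_element['GENE1']['sample_A'].append(12400)
by_element['GENE1']['sample_B'].append(500)
```

A named module-level function, unlike a lambda, can also be pickled by reference, which is useful if such a structure has to be serialised.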