Configuration¶
The method behaviour can be modified through a configuration file.
Warning
Using the command line interface overrides some settings in the configuration file. Check the command line interface section to see how it changes the configuration.
Check the oncodrivefml_v2.conf.template that is included in the package to find an example of the configuration file.
This section will explain each of the parameters in the configuration file:
Genome¶
[genome]
# Build of the reference genome
# Currently supported: hg19, hg38 and hg18
build = 'hg19'
The genome section specifies the reference genome used by OncodriveFML.
The reference genome has been obtained from http://hgdownload.cse.ucsc.edu/downloads.html.
Currently, only HG19 is fully supported. Use build = 'hg19' to use it.
There is partial support for HG18 and HG38. The support is only partial because the values for the position and alterations of the stops in these genomes have not been computed yet. If you want to run OncodriveFML with any of these genomes, make sure you do not use the stop method for the indels (see the Indels section).
Warning
If you decide to use a reference genome other than HG19, make sure that the scores file you use is compatible with it.
Signature¶
[signature]
# "full" : Use a 192 matrix with all the possible signatures
method = 'full'
# Choose the classifier (categorical value for the signature:
# The classifier is a column in the dataset and must be one of these:
# classifier = 'SIGNATURE'
# classifier = 'SAMPLE'
# classifier = 'CANCER_TYPE'
# if the column is missing, all mutations contribute to the signature
# Include/exclude MNP mutations in the signature computation
include_mnp = True
# Choose if the signature must be computed using the whole cohort or
# only the elements that fall into the regions you are analysing:
only_mapped_mutations = False
# The frequency of trinucleotides can be normalized by the frequency of sites
# None: do not correct (comment the option)
# normalize_by_sites = ''
The signature represents the probability of a certain nucleotide to mutate taking into account its context [1].
You can choose one of the following options for the signature:
- To not use any signature, which is equivalent to assuming that all changes are equally likely to happen, use method = 'none'. This approach is recommended for small datasets.
- OncodriveFML can also compute the signatures using the provided dataset. This option has a set of parameters that you can use to decide how this computation is done.
  - Select one of the methods to compute the signatures from the dataset: method = 'full' to count each mutation once and method = 'complement' to collapse complementary mutations.
    Note
    The option method = 'bysample' is equivalent to method = 'complement' but forces the classifier (see below) to be SAMPLE.
  - The classifier parameter indicates which column from the mutations file is used to group the mutations when computing the signatures. E.g. grouping by SAMPLE generates one signature for each sample. Only SAMPLE, CANCER_TYPE and SIGNATURE columns can be used.
  - You can decide to use only SNP (include_mnp = False) or also use MNP mutations (include_mnp = True).
  - You can choose between using only the mutations that are mapped to the regions under analysis (only_mapped_mutations = True) or using all the mutations (SNPs and optionally MNPs) in the dataset.
  - The signatures can be corrected by the frequencies of sites. If you do not specify anything, OncodriveFML will not correct the signatures. Use normalize_by_sites = 'whole_genome' or 'wgs' to correct by the frequencies in the whole genome. Use normalize_by_sites = 'whole_exome' or 'wes' or 'wxs' to correct by the frequencies in the exome. If you have specified only_mapped_mutations = True, then the correction will be done by the frequencies of trinucleotides found in the regions under analysis, as long as you indicate one of the above mentioned values.
    Note
    The frequencies have been computed for genome build HG19. If you want to check the values, use the bgdata package.
- The recommended approach is to use your own signatures. OncodriveFML has the option method = 'file' to load precomputed signatures from a file. This option requires a few additional parameters:
  - path: path to the file containing the signature
  - column_ref: column that contains the reference triplet
  - column_alt: column that contains the alternate triplet
  - column_probability: column that contains the probability
  Warning
  Probabilities must sum to one.
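As a rough illustration of what computing a signature from a dataset involves, the sketch below counts trinucleotide changes and normalises the counts so that the probabilities sum to one. It is not OncodriveFML's implementation: the function names, the input shape (pairs of reference and alternate triplets) and the way complementary mutations are collapsed (onto the strand where the central reference base is a pyrimidine) are all assumptions made for this example.

```python
from collections import Counter

COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def reverse_complement(triplet):
    """Return the reverse complement of a nucleotide triplet."""
    return ''.join(COMPLEMENT[base] for base in reversed(triplet))


def compute_signature(changes, collapse=False):
    """Turn (ref_triplet, alt_triplet) pairs into probabilities.

    collapse=False counts each change as it is given.
    collapse=True is one possible reading of "collapse complementary
    mutations" (an assumption): changes whose central reference base is
    a purine are mapped to their reverse complement before counting.
    """
    counts = Counter()
    for ref, alt in changes:
        if collapse and ref[1] in 'AG':
            ref, alt = reverse_complement(ref), reverse_complement(alt)
        counts[(ref, alt)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalising by the total guarantees the probabilities sum to one
    return {change: n / total for change, n in counts.items()}


changes = [('ACG', 'ATG'), ('CGT', 'CAT'), ('TCA', 'TGA')]
full = compute_signature(changes)
collapsed = compute_signature(changes, collapse=True)
```

A table built this way, with one column each for the reference triplet, the alternate triplet and the probability, is the kind of information that the path, column_ref, column_alt and column_probability parameters describe. The exact file layout OncodriveFML expects is not covered by this sketch.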
Score¶
The score section defines which scores are used.
[score]
# Path to score file
file = "%(bgdata://genomicscores/caddpack/1.0)"
# Format of the file
format = 'pack'
# Column that has the chromosome
chr = 0
# If the chromosome has a prefix like 'chr'. Example: chrX chr1 ...
chr_prefix = ''
# Column that has the position
pos = 1
# Column that has the reference allele
ref = 2
# Column that has the alternative allele
alt = 3
# Column that has the score value
score = 5
# Minimum number of stops per element to infer a value for the stops using the mean of all scores
minimum_number_of_stops = 3
# Function to infer the value of the stops in an element using the mean (x is the mean value of the scores)
mean_to_stop_function = '8.9168668946147314*np.exp(0.082688007694096191*x)'
The scores should be a file that for a given position, in a given chromosome, gives a value to every possible alteration.
Some of the parameters in this section are optional, while others are mandatory.
- file: a string that represents the path to the scores file.
- format: format = 'tabix' indicates that the file is a tab separated file compressed with bgzip. This means that a .tbi index file should be present in the same location. The other option currently supported is format = 'pack', which is a binary format we have implemented to reduce the file size. Thus, if you want to use your own file, use the tabix format.
- chr: column in the file where the chromosome is indicated.
- chr_prefix: when querying the tabix file for a specific chromosome, OncodriveFML only uses the number of the chromosome or 'X' or 'Y'. If the tabix file requires a prefix before the chromosome, use this option. For instance, if the chromosomes in the tabix file are labeled as chr1, chr2, ..., chrY, set this option to chr_prefix = 'chr'. If this is not the case, use an empty string: chr_prefix = ''.
- pos: column that indicates the position of the scored alteration in the chromosome.
- ref: column that contains the reference allele. It is optional.
- alt: column that contains the alternate allele. It is optional. If it is not specified, it is assumed that the 3 possible changes have the same score.
- score: column that contains the score.
- element: column that contains the element identifier. It is optional. If it is provided and the value does not match the one from the regions, these scores are discarded.
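To make the role of the column parameters concrete, here is a small sketch that reads one line of a tab separated scores file using column indexes like the ones above. The function name, its argument names and the returned dictionary are hypothetical and chosen for illustration. It strips the prefix to recover the bare chromosome name, which is the reverse of what happens when the prefix is added for a tabix query.

```python
def parse_score_line(line, chr_col=0, pos_col=1, score_col=5,
                     ref_col=None, alt_col=None, chr_prefix=''):
    """Parse one tab separated line of a scores file (illustrative only)."""
    fields = line.rstrip('\n').split('\t')
    chrom = fields[chr_col]
    # Drop the prefix (e.g. 'chr') so that only '1', '2', ..., 'X', 'Y' remain
    if chr_prefix and chrom.startswith(chr_prefix):
        chrom = chrom[len(chr_prefix):]
    return {
        'chr': chrom,
        'pos': int(fields[pos_col]),
        # ref and alt are optional columns; None means "not provided".
        # Without an alt column the score applies to all 3 possible changes.
        'ref': fields[ref_col] if ref_col is not None else None,
        'alt': fields[alt_col] if alt_col is not None else None,
        'score': float(fields[score_col]),
    }


record = parse_score_line('chr1\t100\tA\tC\t0.1\t12.5\n',
                          ref_col=2, alt_col=3, chr_prefix='chr')
```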
OncodriveFML uses two additional parameters, which are related only to the stop method for computing the indels:
- mean_to_stop_function: When analysing a certain gene, OncodriveFML might need to score an indel according to the value of the stops in the gene. It might happen that the number of stops is 0 or is below a certain threshold. In such cases, OncodriveFML uses the function specified in this parameter to infer a value for the stops from the mean value of all the scores in the gene. An IPython notebook with the functions computed for CADD1.0 and CADD1.3 is available to download or view.
- minimum_number_of_stops: When analysing a certain gene, OncodriveFML gets all the scores associated with the mutations that produce a stop in that gene. minimum_number_of_stops indicates the minimum number of stops that a gene is required to have in order to avoid using the function above.
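The interplay between these two parameters can be sketched as follows. The formula is the default mean_to_stop_function from the example configuration, written with math.exp instead of np.exp so that the snippet only needs the standard library. The function names and the choice to return a single inferred value are assumptions for illustration, not OncodriveFML's internal code.

```python
import math
from statistics import mean


def mean_to_stop(x):
    """Default function from the example configuration (x is a mean score)."""
    return 8.9168668946147314 * math.exp(0.082688007694096191 * x)


def stop_scores_for_gene(stop_scores, all_scores, minimum_number_of_stops=3):
    """Return the stop scores to use for a gene.

    If the gene has enough stops, their scores are used directly.
    Otherwise a single value is inferred from the mean of all scores.
    """
    if len(stop_scores) >= minimum_number_of_stops:
        return list(stop_scores)
    return [mean_to_stop(mean(all_scores))]


enough = stop_scores_for_gene([30.0, 35.0, 40.0], [1.0, 2.0, 3.0])
inferred = stop_scores_for_gene([], [0.0, 0.0])
```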
Statistic¶
The statistic section configures the analysis.
[statistic]
# Mathematical method to use to compare observed and simulated values
method = 'amean'
# Do not use/use MNP mutations in the analysis
discard_mnp = False
# Minimum sampling
sampling = 100000
# Maximum sampling
sampling_max = 1000000
# Sampling chunk (in millions)
sampling_chunk = 100
# Minimum number of observed (if not reached, keeps computing)
sampling_min_obs = 10
There are different parameters you can configure:
- method represents the type of operation that is applied to observed and simulated scores before comparing them. The arithmetic mean (method = 'amean') and the geometric mean (method = 'gmean') are supported. The recommended one is the arithmetic mean.
- In some cases, you might be interested in performing the analysis per sample. This means that all the mutations that come from the same sample are reduced to a single score. This score can be the maximum (per_sample_analysis = 'max'), the arithmetic mean (per_sample_analysis = 'amean') or the geometric mean (per_sample_analysis = 'gmean') of the scores of all the mutations that come from the same sample. Comment this option if you are not interested in this type of analysis.
- MNP mutations can optionally be included in the analysis. Use discard_mnp = False to include them and discard_mnp = True to discard them.
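The per sample reduction described above can be pictured with a short sketch. The input shape (pairs of sample identifier and score) and the function names are assumptions made for the example. The geometric mean as written here requires strictly positive scores.

```python
import math
from collections import defaultdict
from statistics import mean


def gmean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Maps the values of per_sample_analysis to a reducing function
REDUCERS = {'max': max, 'amean': mean, 'gmean': gmean}


def reduce_per_sample(mutations, per_sample_analysis):
    """Collapse all mutation scores of each sample into a single score."""
    by_sample = defaultdict(list)
    for sample, score in mutations:
        by_sample[sample].append(score)
    reducer = REDUCERS[per_sample_analysis]
    return {sample: reducer(scores) for sample, scores in by_sample.items()}


mutations = [('s1', 2.0), ('s1', 8.0), ('s2', 5.0)]
by_max = reduce_per_sample(mutations, 'max')
by_amean = reduce_per_sample(mutations, 'amean')
by_gmean = reduce_per_sample(mutations, 'gmean')
```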
OncodriveFML includes a few more parameters that are related to how many simulations are performed:
- sampling represents the minimum number of simulations to be performed.
- sampling_max represents the maximum number of simulations to be performed.
- sampling_chunk represents the maximum size (in millions) that a single process can handle. This value is used to keep the memory usage within certain limits.
  Note
  With a value of 100, each process takes less than 4 GB of RAM. We have not considered the memory taken by the main process.
- sampling_min_obs represents the minimum number of observations [2]. When it is reached, no more simulations are performed.
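One way to picture how sampling, sampling_max and sampling_min_obs interact is the loop below. It is a simplified sketch under several assumptions: simulations run in rounds of sampling draws, positions are drawn uniformly (no signature), the arithmetic mean is the comparison function, and sampling_chunk is not modelled. The function name and its return value are hypothetical.

```python
import random


def run_simulations(observed, background, sampling=1000, sampling_max=10000,
                    sampling_min_obs=10, rng=None):
    """Return (observations, simulations) for an element.

    An observation is counted when the mean of a simulated set of scores
    is higher than the mean of the observed scores.
    """
    rng = rng or random.Random(0)
    observed_value = sum(observed) / len(observed)
    n_sims = 0
    obs = 0
    while n_sims < sampling_max:
        # Never exceed the maximum number of simulations
        batch = min(sampling, sampling_max - n_sims)
        for _ in range(batch):
            simulated = rng.choices(background, k=len(observed))
            if sum(simulated) / len(simulated) > observed_value:
                obs += 1
        n_sims += batch
        # Stop as soon as enough observations have been seen
        if obs >= sampling_min_obs:
            break
    return obs, n_sims


few = run_simulations([10, 10, 10], [1, 2, 3], sampling=100, sampling_max=300)
many = run_simulations([0, 0], [1, 2, 3], sampling=100, sampling_max=300)
```

In the first call no simulated mean can exceed the observed one, so the loop runs until sampling_max. In the second call every simulation counts as an observation, so the loop stops after the first round.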
Indels¶
The indels subsection of statistic contains the configuration for the analysis of indels.
[[indels]]
# Include/exclude indels from your analysis
include = True
# Method used to simulate indels
# Treat them as a set of substitutions and take the maximum
# method = 'max'
# Number of consecutive times the indel appears to consider it falls in a repetitive region
# Looking from the indel position and in the direction of the strand
max_consecutive = 7
# Indels simulated as substitutions take into account signature or not
simulate_with_signature = True
# Use exomic probabilities of frameshift indels in the dataset for the simulation
gene_exomic_frameshift_ratio = False
# Function applied to the scores of the stops in the gene to compute the observed score
# Arithmetic mean
stops_function = 'mean'
OncodriveFML accepts various parameters related to the indels:
- The main option is include, which indicates whether to include indels in the analysis or not. Use include = True to include indels and include = False to exclude them.
- OncodriveFML can simulate indels in two ways. method = 'max' simulates indels as a set of substitutions. method = 'stop' simulates indels as stops. This option is recommended for simulating indels in coding regions. Check the analysis of indels section to find more details.
- OncodriveFML discards indels that fall in repetitive regions. OncodriveFML considers that an indel is in a repetitive region when the same sequence of the indel appears consecutively in a genomic element a certain number of times (or even more) following the direction of the strand. The maximum number of consecutive repetitions can be set with the max_consecutive option. OncodriveFML will not discard any indel due to repetitive regions if you set max_consecutive = 0.
- Indels that are simulated as substitutions [3] can be simulated assigning to all the positions of the genomic element under analysis the same probability to be mutated. Alternatively, the probability of each position to be mutated can depend on the mutational signature. For instance, if the signature is represented by the cancer type, indels coming from a breast cancer dataset will be simulated with the signature of that cancer type. Indels do not contribute to the signature of a cancer type, therefore through this option you can decide whether indels should be simulated following the mutational signature or not. Use simulate_with_signature = True to use the signature or simulate_with_signature = False to simulate indels with the same probabilities.
- gene_exomic_frameshift_ratio is a flag that tells OncodriveFML which mutations influence the probabilities for frameshift indels and substitutions. When gene_exomic_frameshift_ratio = False, the probabilities are taken from the mapped mutations discarding those whose length is a multiple of 3. Note that in order to work properly, this option should be set when the regions file corresponds to coding regions. If gene_exomic_frameshift_ratio = True, the probabilities are taken from the observed mutation rate in each region. This option is harmless when method = 'max'.
- The observed score of an indel that is computed with the method = 'stop' option is related to the score of the stops in its gene. You can decide what this relation is by choosing a function that is applied to all stop scores in the gene. E.g. stops_function = 'mean' associates the indel with a value that is equal to the mean of all stop scores in the gene. The options you can choose are:
  - 'mean' for the arithmetic mean
  - 'median' for the median
  - 'random' for a random value between the maximum and the minimum
  - 'random_choice' for choosing a random value among all the possible ones
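The four stops_function options can be summarised in a few lines of Python. This is a sketch of what each option means according to the description above, with a hypothetical function name. It does not reproduce how OncodriveFML selects or filters the stop scores.

```python
import random
from statistics import mean, median


def observed_indel_score(stop_scores, stops_function='mean', rng=random):
    """Combine the stop scores of a gene into one observed indel score."""
    if stops_function == 'mean':
        return mean(stop_scores)
    if stops_function == 'median':
        return median(stop_scores)
    if stops_function == 'random':
        # Any value between the minimum and the maximum stop score
        return rng.uniform(min(stop_scores), max(stop_scores))
    if stops_function == 'random_choice':
        # One of the actual stop scores, picked at random
        return rng.choice(stop_scores)
    raise ValueError('unknown stops_function: %r' % stops_function)


scores = [10.0, 20.0, 60.0]
as_mean = observed_indel_score(scores, 'mean')
as_median = observed_indel_score(scores, 'median')
```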
Settings¶
To configure the system where the analysis is performed, OncodriveFML includes the settings section:
[settings]
# Number of cores to use in the analysis
# Comment this option to use all available cores
# cores = 6
Use the cores option to indicate how many cores to use. You can comment this option in order to use all the available cores.
Note
OncodriveFML works on shared memory systems using the multiprocessing module.
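The behaviour of the cores option can be sketched with the standard library alone. The helper name and the idea of passing the settings as a dictionary are assumptions for illustration. The only library calls used are os.cpu_count() and multiprocessing.Pool.

```python
import multiprocessing
import os


def resolve_cores(settings):
    """Return the number of worker processes to start.

    A missing 'cores' entry (the option is commented out) means
    "use all available cores".
    """
    cores = settings.get('cores')
    if cores is None:
        return os.cpu_count() or 1
    return int(cores)


if __name__ == '__main__':
    # Start a pool with an explicit number of processes
    with multiprocessing.Pool(resolve_cores({'cores': 2})) as pool:
        print(pool.map(abs, [-1, -2, 3]))
```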
Logging¶
The logging section is used to configure the logging system of OncodriveFML.
# Configuration for the logging system
[logging]
version = 1
disable_existing_loggers = False
# Configuration for the handlers
[[handlers]]
# Log to stdout
[[[console]]]
class = 'logging.StreamHandler'
formatter = 'bgformat'
level = 'INFO'
stream = 'ext://sys.stdout'
# log to a file
[[[file]]]
class = 'logging.FileHandler'
formatter = 'bgformat'
filename = 'log.txt'
mode = 'w'
# Configuration for the formatters
[[formatters]]
[[[bgformat]]]
format ='%(asctime)s %(levelname)s: %(message)s'
datefmt ='%H:%M:%S'
# Configuration for the loggers
[[loggers]]
# OncodriveFML logger
[[[oncodrivefml]]]
handlers = ['console', 'file']
level = 'DEBUG'
propagate = 0
OncodriveFML uses the logging module. In particular, it loads the configuration file into a dictionary and passes this section to dictConfig(). You can change this section to other compatible configurations to fit your needs.
All the logs are done using a logger named oncodrivefml.
The logging system can be configured through the logging section of the configuration file.
Warning
If OncodriveFML detects that the run has already been calculated, the warning informing the user uses the root logger.
OncodriveFML overrides this configuration in two ways:
- If the debug flag is set, the console logger level is set to DEBUG. Otherwise, it is set to INFO.
- If one of the handlers is named file, its filename is set to <mutations file name>__log.txt and saved in the same folder as the OncodriveFML output.
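The following sketch shows the logging section written as the dictionary that logging.config.dictConfig() accepts, together with a helper that mimics the two overrides. The helper name, its arguments and the decision to drop the extension of the mutations file when building the log file name are assumptions, since the exact naming rule is not spelled out here.

```python
import copy
import logging
import logging.config
import os

# Dictionary equivalent of the [logging] section shown above
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler', 'formatter': 'bgformat',
                    'level': 'INFO', 'stream': 'ext://sys.stdout'},
        'file': {'class': 'logging.FileHandler', 'formatter': 'bgformat',
                 'filename': 'log.txt', 'mode': 'w'},
    },
    'formatters': {
        'bgformat': {'format': '%(asctime)s %(levelname)s: %(message)s',
                     'datefmt': '%H:%M:%S'},
    },
    'loggers': {
        'oncodrivefml': {'handlers': ['console', 'file'],
                         'level': 'DEBUG', 'propagate': 0},
    },
}


def apply_overrides(conf, mutations_file, output_dir, debug=False):
    """Return a copy of conf with the two overrides applied (hypothetical)."""
    conf = copy.deepcopy(conf)
    handlers = conf.get('handlers', {})
    if 'console' in handlers:
        handlers['console']['level'] = 'DEBUG' if debug else 'INFO'
    if 'file' in handlers:
        # Assumption: the extension of the mutations file is dropped
        name = os.path.splitext(os.path.basename(mutations_file))[0]
        handlers['file']['filename'] = os.path.join(output_dir, name + '__log.txt')
    return conf


if __name__ == '__main__':
    import tempfile
    conf = apply_overrides(LOGGING, 'mutations.tsv', tempfile.mkdtemp(), debug=True)
    logging.config.dictConfig(conf)
    logging.getLogger('oncodrivefml').info('logging configured')
```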
[1] Previous and posterior nucleotides.
[2] An observation is counted when a simulated value, after applying the function in method to the simulated scores, is higher than the result of applying the same function to the observed scores.
[3] All indels are simulated as substitutions when method = 'max'. Indels that are in-frame are also simulated as substitutions when method = 'stop'.