Readme

OncodriveFML

Recent years saw the development of methods to detect signals of positive selection in the pattern of somatic mutations in genes across cohorts of tumors, and the discovery of hundreds of driver genes. The next major challenge in tumor genomics is the identification of non-coding regions which may also drive tumorigenesis. We present OncodriveFML, a method that estimates the accumulated functional impact bias of somatic mutations in any genomic region of interest based on a local simulation of the mutational process affecting it. It may be applied to all genomic elements to detect likely drivers amongst them. OncodriveFML can discover signals of positive selection when only a small fraction of the genome, like a panel of genes, has been sequenced.

License

OncodriveFML is made available to the general public subject to certain conditions described in its LICENSE. For the avoidance of doubt, you may use the software and any data accessed through UPF software for academic, non-commercial and personal use only, and you may not copy, distribute, transmit, duplicate, reduce or alter in any way for commercial purposes, or for the purpose of redistribution, without a license from the Universitat Pompeu Fabra (UPF). Requests for information regarding a license for commercial use or redistribution of OncodriveFML may be sent via e-mail to innovacio@upf.edu.

Usage

OncodriveFML is meant to be used through the command line.

By default, OncodriveFML is prepared to analyse mutations using the HG19 reference genome. For other genomes, update the configuration accordingly.

Installation

OncodriveFML depends on Python 3.5 and some external libraries. The easiest way to install the whole software stack is with the well-known Anaconda Python distribution:

$ conda install -c bbglab oncodrivefml

OncodriveFML can also be installed using pip:

$ pip install oncodrivefml

Finally, you can get the latest code from the repository and install with pip:

$ git clone git@bitbucket.org:bbglab/oncodrivefml.git
$ cd oncodrivefml
$ pip install .

Note

OncodriveFML has a setup dependency on Cython, which is required to compile the *.pyx files.

The first time you run OncodriveFML, it downloads the genome reference from our servers. By default, the downloaded datasets go to ~/.bgdata. To store them in another folder, define the system environment variable BGDATA_LOCAL with an export command (for example, export BGDATA_LOCAL=/path/to/folder).
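When OncodriveFML is launched from a Python script rather than from an interactive shell, the same variable can be set on the environment of the child process. The following is a minimal sketch: the helper name env_with_bgdata and the folder /data/bgdata are hypothetical, and only the variable name BGDATA_LOCAL comes from the note above.

```python
import os


def env_with_bgdata(folder, base_env=None):
    """Return a copy of the environment with BGDATA_LOCAL set to folder."""
    env = dict(os.environ if base_env is None else base_env)
    env["BGDATA_LOCAL"] = str(folder)
    return env


# "/data/bgdata" is a placeholder path, not a location OncodriveFML requires.
child_env = env_with_bgdata("/data/bgdata")
print(child_env["BGDATA_LOCAL"])
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the oncodrivefml command, so the current process environment stays untouched.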

The following command will show you the help:

$ oncodrivefml --help

Run the example

Download and extract the example files (if you cloned the repository skip this step):

$ wget https://bitbucket.org/bbglab/oncodrivefml/downloads/oncodrivefml-examples_v2.2.tar.gz
$ tar xvzf oncodrivefml-examples_v2.2.tar.gz

To run this example, OncodriveFML needs the precomputed CADD scores, a 17 GB file. The file is downloaded automatically the first time you run OncodriveFML, but to speed up the process it is better to download it first with our data package management tool (BgData), which is installed together with OncodriveFML.

Run this command to download the CADD scores file to the default bgdata folder ~/.bgdata:

$ bg-data genomicscores caddpack 1.0

Warning

CADD scores are originally from http://cadd.gs.washington.edu/ and are freely available for all non-commercial applications. If you are planning on using them in a commercial application, please contact them at http://cadd.gs.washington.edu/contact.

Additionally, if you want to speed up the download of the genome reference that is also needed, run this command:

$ bg-data datasets genomereference hg19

To run the example, we have included a bash script (run.sh) that will execute OncodriveFML. The script should be executed in the folder where the files have been extracted:

$ ./run.sh

The results will be saved in a folder named cds.

Documentation

Find OncodriveFML documentation in ReadTheDocs.

You can also compile the documentation yourself using Sphinx, if you have cloned the repository. To do so, install the optional packages in optional-requirements.txt and build the documentation in the docs folder:

$ cd docs
$ make html

License

OncodriveFML is the property of the Universitat Pompeu Fabra (UPF), which holds the copyright thereto.
Copyright (C) 2016  Universitat Pompeu Fabra

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>


Notices/Contact

  UPF Unitat d’Innovació,
  Edifici Mercè Rodoreda,
  C/. Ramon Trias Fargas, 25-27,
  08005 Barcelona, Spain

Att.:

Telephone: +34 93 542 15 67
Email:     innovacio@upf.edu

Configuration template specifications

[genome]
build = string(default='hg19')

[signature]
method = option('none', 'full', 'complement', 'bysample', 'file', default='complement')
classifier = option('CANCER_TYPE', 'SAMPLE', 'SIGNATURE', default=None)
normalize_by_sites = option('whole_genome', 'wgs', 'whole_exome', 'wxs', 'wes', default=None)
path = string(default=None)

[score]
file = string
format = option('tabix', 'pack')
chr = integer
chr_prefix = string
pos = integer
ref = integer(default=None)
alt = integer(default=None)
score = integer
element = integer(default=None)
extra = integer(default=None)

minimum_number_of_stops = integer(default=3)
mean_to_stop_function = string(default=None)


[statistic]
method = option('amean', 'gmean', default='amean')
discard_mnp = boolean(default=False)

sampling = integer(default=100000)
sampling_max = integer(default=1000000)
sampling_chunk = integer(default=100)
sampling_min_obs = integer(default=10)

per_sample_analysis = option('amean', 'gmean', 'max', default=None)

    [[indels]]
        include = boolean(default=True)
        method = option('stop', 'max', default='max')
        max_consecutive = integer(default=0)

        gene_exomic_frameshift_ratio = boolean(default=False)
        stops_function = option('mean', 'median', 'random', 'random_choice', default='mean')



[settings]
cores = integer(default=None)
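The sampling options in [statistic] (sampling, sampling_max and sampling_min_obs) suggest an adaptive scheme: draw at least sampling simulated values, then keep drawing until enough of them reach the observed value or sampling_max is hit. The sketch below illustrates that reading only. It is an assumption about how the options interact, not OncodriveFML's implementation. The function name adaptive_pvalue, the simulate callback, the doubling schedule and the (obs + 1) / (n + 1) estimate are all illustrative choices, and sampling_chunk is left out because its exact role is not described here.

```python
import random


def adaptive_pvalue(observed, simulate, sampling=100000, sampling_max=1000000,
                    sampling_min_obs=10, rng=None):
    """Empirical p-value with an adaptive number of simulations.

    simulate(rng) must return one simulated value of the statistic.
    Returns (p_value, number_of_simulations).
    """
    rng = rng or random.Random()
    n = 0    # simulations drawn so far
    obs = 0  # simulated values at least as large as the observed one
    target = sampling
    while True:
        while n < target:
            if simulate(rng) >= observed:
                obs += 1
            n += 1
        # Stop once enough extreme values were seen or the cap is reached.
        if obs >= sampling_min_obs or n >= sampling_max:
            break
        # Doubling the budget is an arbitrary choice for this sketch.
        target = min(target * 2, sampling_max)
    return (obs + 1) / (n + 1), n


# Toy run with small numbers: simulated values are uniform in [0, 1).
p, n = adaptive_pvalue(0.999, lambda r: r.random(), sampling=1000,
                       sampling_max=10000, rng=random.Random(0))
print(p, n)
```

Under this reading, elements with a strong signal use more simulations because few simulated values reach the observed one, while the rest stop at the minimum.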

Configuration template

[genome]
# Build of the reference genome
# Currently human genomes supported: hg19 and hg38
build = 'hg19'
# It might work with hg18 and mouse genomes: c3h and mm10
# if indels are not computed and signature not corrected



[signature]
# Choose the method to calculate the trinucleotide signature:

# "full" : Use a 192 matrix with all the possible signatures
# method = 'full'

# "complement" : Use a 96 matrix with the signatures complemented
method = 'complement'

# "none": Don't use signature
# method = 'none'

# "bysample": Compute a 96 matrix signature for each sample
# method = 'bysample'

# "file": Provide a file with the signature to use
# The file should be created using bgsignatures package
# method = 'file'


# Choose the classifier (categorical value for the signature):
# The classifier is a column in the dataset and must be one of these:
# classifier = 'SIGNATURE'
# classifier = 'SAMPLE'
# classifier = 'CANCER_TYPE'
# by default, all mutations contribute to the signature
# If the signature is loaded from a file, the same classifier must have been used.


# The frequency of trinucleotides can be normalized by the frequency of sites

# whole_genome/wgs: correct the signature for the whole genome frequencies
# normalize_by_sites = 'whole_genome'

# whole_exome/wxs/wes: correct the signature for frequencies in coding regions
# normalize_by_sites = 'whole_exome'

# None: do not correct (comment the option)
# normalize_by_sites = ''



[score]
# Path to score file
file = "%(bgdata://genomicscores/caddpack/1.0)"
# WARNING: The %(bgdata:...) will download (the first time that you use it) a score file from
# our servers and install it into the ~/.bgdata folder.

# WARNING: CADD 1.0 scores are originally from http://cadd.gs.washington.edu/ and are freely
# available for all non-commercial applications. If you are planning on using them in a
# commercial application, please contact them at http://cadd.gs.washington.edu/contact.

# Format of the file
# 'pack': binary format
format = 'pack'

# Column that has the chromosome
chr = 0

# If the chromosome has a prefix like 'chr'. Example: chrX chr1 ...
chr_prefix = ''

# Column that has the position
pos = 1

# Column that has the reference allele
ref = 2

# Column that has the alternative allele
alt = 3

# Column that has the score value
score = 5

# If you have different scores at the same position, and each score applies to a
# different region element, then uncomment this line and set the value to the column
# that has the element id to match.
# element = 6

# Minimum number of stops per element to infer a value for the stops using the mean of all scores
minimum_number_of_stops = 3

# Function to infer the value of the stops in an element using the mean (x is the mean value of the scores)
mean_to_stop_function = '8.9168668946147314*np.exp(0.082688007694096191*x)'



[statistic]

# Mathematical method to use to compare observed and simulated values
# Arithmetic mean
method = 'amean'

# Geometric mean
# method = 'gmean'


# Do not use/use MNP mutations in the analysis
discard_mnp = False
# discard_mnp = True


# Compute the observed values using only 1 mutation per sample
#per_sample_analysis = 'max'
#per_sample_analysis = 'amean'
#per_sample_analysis = 'gmean'


# Minimum sampling
sampling = 100000

# Maximum sampling
sampling_max = 1000000

# Sampling chunk (in millions)
sampling_chunk = 100

# Minimum number of observed (if not reached, keeps computing)
sampling_min_obs = 10


[[indels]]
# Include/exclude indels from your analysis
include = True
# include = False


# Method used to simulate indels

# Treat them as stops (for coding regions)
# method = 'stop'

# Treat them as a set of substitutions and take the maximum
method = 'max'


# Number of consecutive times the indel appears to consider it falls in a repetitive region
max_consecutive = 7

# Do not discard indels that fall in repetitive regions
# max_consecutive = 0


# Use exomic probabilities of frameshift indels in the dataset for the simulation
gene_exomic_frameshift_ratio = False
# or probabilities of each gene
# gene_exomic_frameshift_ratio = True


# Function applied to the scores of the stops in the gene to compute the observed score

# Arithmetic mean
stops_function = 'mean'

# Median
# stops_function = 'median'

# Random value between the max and the minimum
# stops_function = 'random'

# Random choice amongst the values
# stops_function = 'random_choice'



[settings]
# Number of cores to use in the analysis
# Comment this option to use all available cores
# cores = 6
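
The mean_to_stop_function value in the [score] section of the template is an exponential in x, the mean of the scores in an element. The sketch below evaluates that same expression for a few example means to show how the inferred stop score grows. It uses math.exp in place of np.exp to stay within the standard library. The function name mean_to_stop and the sample means are illustrative, and how OncodriveFML itself evaluates the configured string is not shown here.

```python
import math

# Coefficients copied from the mean_to_stop_function value in the template.
A = 8.9168668946147314
B = 0.082688007694096191


def mean_to_stop(x):
    """Infer a stop score from x, the mean of the scores in an element."""
    return A * math.exp(B * x)


# Example means chosen only for illustration.
for mean_score in (0.0, 5.0, 10.0, 20.0):
    print(f"mean={mean_score:5.1f} -> inferred stop score={mean_to_stop(mean_score):.2f}")
```

With a mean of 0 the function returns the first coefficient unchanged, and the result increases monotonically with the mean.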

Scores function notebook