How it works

This section gives an overview of how OncodriveFML carries out the analysis.

The command line interface

Typing oncodrivefml -h prints a brief description of how to use OncodriveFML:

Options:
-i, --input MUTATIONS_FILE

Variants file [required] (see format)

-e, --elements ELEMENTS_FILE

Genomic elements to analyse [required] (see format)

-o, --output OUTPUT_FOLDER

Output folder. Defaults to the regions file name without extensions.

-c, --configuration CONFIG_FILE

Configuration file. Defaults to ‘oncodrivefml_v2.conf’ in the current folder if it exists, or to ~/.config/bbglab/oncodrivefml_v2.conf otherwise.

--samples-blacklist SAMPLES_BLACKLIST

Remove these samples when loading the input file.

--signature SIGNATURE

File with the signatures to use

See details about the command line interface to find more information about this option.

--signature-correction [wg|wx]

Correct the computed signatures by genomic or exomic signatures. Only valid for human genomes (hg19 and hg38).

  • wg: correction using whole genome counts

  • wx: correction using whole exome counts

See details about the command line interface to find more information about this option.

--no-indels

Discard indels in your analysis

--cores INTEGER

Cores to use. Default: all

--seed INTEGER

Set up an initial random seed to have reproducible results

--debug

Show more progress details

--version

Show the version and exit.

-h, --help

Show this message and exit.
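
The defaults described for -o and -c can be sketched in a few lines of Python. This is only an illustration of the rules stated in the help text, not OncodriveFML's own code: the function names are hypothetical, and stripping every extension (so that cds.tsv.gz becomes cds) is an assumed reading of "without extensions".

```python
from pathlib import Path

# File name taken from the help text above.
CONFIG_NAME = "oncodrivefml_v2.conf"


def default_config_path(cwd: Path, home: Path) -> Path:
    """Pick the configuration file used when -c is not given (hypothetical helper)."""
    local = cwd / CONFIG_NAME
    if local.exists():
        # A configuration file in the current folder takes precedence.
        return local
    # Otherwise fall back to the per-user location.
    return home / ".config" / "bbglab" / CONFIG_NAME


def default_output_name(elements_file: str) -> str:
    """Name of the output folder when -o is not given (hypothetical helper).

    Assumption: every extension is stripped, not only the last one.
    """
    return Path(elements_file).name.split(".")[0]
```

For example, default_output_name("/data/cds.tsv.gz") gives "cds" under that assumption.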

The files

Input files

OncodriveFML makes use of three files:

Variants

Also referred to as the input file. It contains the observed mutations for the analysis.

Regions

File containing the regions for the analysis. Only mutations that fall in these regions are analysed and only the genomic positions defined in this file are used for the simulation.

You can define your own regions file based on your criteria. To see what a regions file looks like, download our example.

Warning

It is not recommended to mix coding and non-coding regions in your regions file. Doing so will likely produce artifacts in the results, because coding and non-coding regions of the genome have very different functional impact scores. A good set of genomic regions should include elements that share biological functions (e.g. CDS, UTRs, promoters, enhancers, etc.).

Check the formats for the input files.
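
To make the role of the regions file concrete, here is a minimal Python sketch of keeping only the mutations that fall inside the regions. It is not OncodriveFML's implementation: it uses a sorted list with binary search as a simpler stand-in for an interval lookup, and it assumes inclusive coordinates and non-overlapping segments within a chromosome. The tuple layout, the function names and the example data are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group segments by chromosome and sort them by start position.

    regions: iterable of (chromosome, start, end, element_id), inclusive coordinates.
    """
    index = defaultdict(list)
    for chrom, start, end, element in regions:
        index[chrom].append((start, end, element))
    for segments in index.values():
        segments.sort()
    return index


def element_at(index, chrom, position):
    """Return the element whose segment contains position, or None."""
    segments = index.get(chrom, [])
    # Last segment whose start is <= position (assumes non-overlapping segments).
    i = bisect_right(segments, (position, float("inf"), "")) - 1
    if i >= 0 and segments[i][0] <= position <= segments[i][1]:
        return segments[i][2]
    return None


# Made-up example data: one element may span several segments.
regions = [("1", 100, 200, "GENE_A"), ("1", 300, 400, "GENE_A"), ("2", 50, 80, "GENE_B")]
mutations = [("1", 150), ("1", 250), ("2", 80), ("3", 10)]

index = build_index(regions)
kept = [(c, p, element_at(index, c, p)) for c, p in mutations
        if element_at(index, c, p) is not None]
```

Here the mutations at 1:250 and 3:10 are dropped because they lie outside every segment.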

Configuration

The configuration file is also a key part of the run, and understanding how to adapt it to your needs is important. Check this section to find more details about it.

Output files

Find information about the output in the output files section.

Workflow

  1. OncodriveFML first loads the configuration file.

  2. The output is checked. The default behaviour is that OncodriveFML creates an output folder in the current directory with the same name as the elements file (without extension).

    If an output is provided and it exists and is a folder, OncodriveFML checks whether a file with the expected output name exists and, if so, it does not run. Otherwise, it assumes the value is a path name and uses that as the output.

    Note

    If the output does not exist, OncodriveFML only computes the tsv file with the results and skips the plots.

  3. The regions file is loaded, and a tree with the intervals is created. This tree is used to find which mutations fall in the regions being analysed.

  4. Loads the mutations file and keeps only the mutations that fall into the regions being analysed.

  5. Computes the signature (see the signature section), if not provided as an external file.

  6. Analyses each region separately (only the ones that have mutations). In each region the analysis is as follows:

    1. Computes the score of each of the observed mutations.

    2. Simulates the same number of mutations in the segments of the region under analysis. Saves the scores of each of the simulated mutations. The simulation is done several times.

    3. Applies a predefined function to the observed scores and to each of the simulated groups of scores. Counts how many times the simulated value is higher than, or equal to, the observed.

    4. From these counts, computes a P-value by dividing the counts by the number of simulations performed.

    You can find more details in the analysis section.

  7. Joins the results and performs a multiple test correction. The multiple test correction is only done for regions with mutations from at least two samples.

  8. Creates the output files.

  9. Checks that the output file does not contain missing or repeated genomic regions.
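
The per-region analysis in step 6 can be sketched in Python as follows. This is a simplified illustration, not the tool's actual code: the function name and parameters are hypothetical, the arithmetic mean stands in for the "predefined function", and simulated mutations are drawn uniformly from the scores of the region's positions unless weights are given (the optional weights argument only hints at where signature probabilities could enter).

```python
import random
from statistics import mean


def empirical_pvalue(observed_scores, position_scores, n_simulations=1000,
                     statistic=mean, weights=None, seed=None):
    """Empirical P-value for one region (hypothetical helper).

    observed_scores: scores of the observed mutations in the region.
    position_scores: scores of the candidate mutations in the region's segments.
    weights: optional sampling weights, one per entry of position_scores.
    """
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    observed = statistic(observed_scores)
    n_mutations = len(observed_scores)

    count = 0
    for _ in range(n_simulations):
        # Simulate as many mutations as were observed.
        simulated = rng.choices(position_scores, weights=weights, k=n_mutations)
        # Count simulations at least as extreme as the observation.
        if statistic(simulated) >= observed:
            count += 1

    # P-value: counts divided by the number of simulations.
    return count / n_simulations
```

A region whose observed mutations score higher than anything the simulation can produce gets a P-value of 0 with this formula, and one whose observed scores are lower than every possible simulated value gets 1.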