How it works

This section gives an overview of how OncodriveFML carries out the analysis.

The command line interface

Typing oncodrivefml -h prints a brief description of how to use OncodriveFML:

Options:
-i, --input MUTATIONS_FILE
 Variants file [required] (see format)
-e, --elements ELEMENTS_FILE
 Genomic elements to analyse [required] (see format)
-s, --sequencing
 Type of sequencing [required]:

   • wgs: whole genome sequencing
   • wes: whole exome sequencing
   • targeted: targeted sequencing

 See details about the command line interface to find more information about this option.

-o, --output OUTPUT_FOLDER
 Output folder. Defaults to the regions file name without extension.
-c, --configuration CONFIG_FILE
 Configuration file. Defaults to ‘oncodrivefml_v2.conf’ in the current folder if it exists, or to ~/.config/bbglab/oncodrivefml_v2.conf otherwise.
--samples-blacklist SAMPLES_BLACKLIST
 Remove these samples when loading the input file.
--signature SIGNATURE
 File with the signatures to use.

 See details about the command line interface to find more information about this option.

--no-indels Discard indels in your analysis
--debug Show more progress details
--version Show the version and exit.
-h, --help Show this message and exit.
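
As an illustration of how the required options fit together, the following is a minimal sketch that assembles an OncodriveFML command line from Python. It uses only flags listed above. The helper name build_command and the file names mutations.tsv, elements.tsv and results are hypothetical and chosen for the example only.

```python
import shlex

# Values accepted by -s/--sequencing, as listed in the help text above.
SEQUENCING_TYPES = ("wgs", "wes", "targeted")


def build_command(mutations_file, elements_file, sequencing, output_folder=None):
    """Assemble the argument list for an oncodrivefml run (hypothetical helper)."""
    if sequencing not in SEQUENCING_TYPES:
        raise ValueError(f"sequencing must be one of {SEQUENCING_TYPES}")
    command = [
        "oncodrivefml",
        "-i", mutations_file,   # variants file [required]
        "-e", elements_file,    # genomic elements to analyse [required]
        "-s", sequencing,       # type of sequencing [required]
    ]
    if output_folder is not None:
        command += ["-o", output_folder]  # optional output folder
    return command


# File names below are placeholders, not files shipped with the tool.
command = build_command("mutations.tsv", "elements.tsv", "wgs", "results")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to launch the run, assuming oncodrivefml is installed and available on the PATH.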

If you prefer to call OncodriveFML from a Python script, you can download the source code, install it and call the main() function.

Note

You might have noticed that the main() function accepts fewer parameters than the command line interface. This is because the command line interface modifies some parameters in the configuration, while calling the Python code directly does not. Check what is modified by the command line interface.

This implies that you should adapt the configuration file to your needs.

The files

Input files

OncodriveFML makes use of three files:

Variants
Also referred to as input. This file contains the observed mutations for the analysis.
Regions

File containing the regions for the analysis. Only mutations that fall in these regions are analysed and only the genomic positions defined in this file are used for the simulation.

You can define your own regions file based on your criteria. To see what a regions file looks like, download our example.

Warning

It is not recommended to mix coding and non-coding regions in your regions file. Doing so will likely produce artifacts in the results, as coding and non-coding regions of the genome have very different functional impact scores. A good set of genomic regions should include elements that share biological functions (e.g. CDS, UTRs, promoters, enhancers).

Check the formats for the input files.

Configuration
The configuration file is also a key part of the run, and understanding how to adapt it to your needs is important. Check this section to find more details about it.

Output files

Find information about the output in the output files section.

Workflow

  1. The first thing OncodriveFML does is load the configuration file and create the output folder if it does not exist.

    Note

    If you have not provided any output folder, OncodriveFML will create one in the current directory with the same name as the elements file (without extension).

    If the output folder exists, OncodriveFML checks whether a file with the expected output name exists and, if so, it does not run.

  2. The regions file is loaded, and a tree with the intervals is created. This tree is used to find which mutations fall in the regions being analysed.

  3. Loads the mutations file and keeps only the ones that fall into the regions being analysed.

  4. Computes the signature (see the signature section).

  5. Analyses each region separately (only the ones that have mutations). In each region the analysis is as follows:

    1. Computes the score of each of the observed mutations.
    2. Simulates the same number of mutations in the segments of the region under analysis and saves the scores of each of the simulated mutations. The simulation is repeated several times.
    3. Applies a predefined function to the observed scores and to each of the simulated groups of scores. Counts how many times the simulated value is higher than, or equal to, the observed.
    4. From these counts, computes a P-value by dividing the counts by the number of simulations performed.

    You can find more details in the analysis section.

  6. Joins the results and performs a multiple test correction. The multiple test correction is only done for regions with mutations from at least two samples.

  7. Creates the output files.

  8. Checks that the output file does not contain missing or repeated genomic regions.
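
The per-region analysis in step 5 can be sketched as follows. This is a simplified illustration, not the actual OncodriveFML implementation. It assumes that the scores of all possible mutations in the region are available as a list (candidate_scores, a hypothetical input), that simulated mutations are drawn uniformly at random (the real analysis takes the signature into account), and that the predefined function is the arithmetic mean. The name empirical_p_value is made up for the example.

```python
import random
from statistics import mean


def empirical_p_value(observed_scores, candidate_scores, n_simulations=1000,
                      statistic=mean, seed=0):
    """Return (count, p_value) for a single region.

    observed_scores: scores of the observed mutations in the region.
    candidate_scores: scores of every mutation that could occur in the region
        (assumption: all candidates are equally likely to be drawn).
    """
    rng = random.Random(seed)
    # Apply the predefined function to the observed scores.
    observed_value = statistic(observed_scores)
    n_mutations = len(observed_scores)

    count = 0
    for _ in range(n_simulations):
        # Simulate the same number of mutations as were observed.
        simulated = rng.choices(candidate_scores, k=n_mutations)
        # Count simulations whose value is higher than, or equal to, the observed.
        if statistic(simulated) >= observed_value:
            count += 1

    # P-value: counts divided by the number of simulations performed.
    return count, count / n_simulations


# Toy scores, invented for illustration only.
count, p_value = empirical_p_value(
    observed_scores=[0.9, 0.8, 0.95],
    candidate_scores=[0.1, 0.2, 0.3, 0.5, 0.7, 0.9],
    n_simulations=1000,
)
print(count, p_value)
```

With this definition, a region whose observed mutations score higher than almost all simulated sets gets a small P-value. The P-values of all regions would then be joined and corrected for multiple testing as described in step 6.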