Output¶

OncodriveFML generates 3 output files:

A .tsv file with the analysis results.
A .png image with the most significant genes labeled.
A .html interactive plot that can be used to search for specific genes.

Naming¶

All three files generated by OncodriveFML share the same name and differ only in their extension. The name is that of the mutations file followed by -oncodrivefml and the extension.
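The naming rule can be sketched as follows. This is an illustrative helper, not part of OncodriveFML: the function name `output_names` is hypothetical, and the sketch assumes that the mutations file's own extension is dropped before the suffix is added, which the description above does not state explicitly.

```python
from pathlib import Path


def output_names(mutations_file):
    """Hypothetical helper: build the three output file names."""
    # Assumption: the last extension of the mutations file is removed first.
    stem = Path(mutations_file).stem
    base = f"{stem}-oncodrivefml"
    return {ext: f"{base}.{ext}" for ext in ("tsv", "png", "html")}


names = output_names("data/cohort.txt")
for ext, name in names.items():
    print(ext, name)
```

Only the last extension is removed here, so a compressed input such as `cohort.txt.gz` would need extra handling under this assumption.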

The `.tsv` file¶

This tab-separated file is the most important output, as the other two files are just plots of the data it contains. It holds the results of the analysis.

The file contains the following columns:

index: Gene ID from Ensembl
MUTS: number of mutations found in the dataset for that gene
MUTS_RECURRENCE: number of mutations that do not occur in the same position
SAMPLES: number of mutated samples in the gene
P_VALUE: number of randomizations in which the simulated value is higher than or equal to the observed value, divided by the number of randomizations
Q_VALUE: P_VALUE corrected using the Benjamini/Hochberg correction (only for elements with at least 2 mutated samples, `samples_mut`)
P_VALUE_NEG: number of randomizations in which the simulated value is lower than or equal to the observed value, divided by the number of randomizations
Q_VALUE_NEG: P_VALUE_NEG corrected using the Benjamini/Hochberg correction (only for elements with at least 2 mutated samples, `samples_mut`)
SNP: number of mutations that are Single Nucleotide Polymorphisms
MNP: number of mutations that are Multi Nucleotide Polymorphisms (two or more)
INDELS: number of mutations that are insertions or deletions
SYMBOL: HGNC Symbol
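As an illustration of how these columns might be consumed, the sketch below parses a tab-separated table with the standard `csv` module and keeps the genes whose Q_VALUE is below a threshold. The rows are made up for the example and are not real OncodriveFML results; the helper name `significant_genes` and the assumption that an uncorrected gene has an empty Q_VALUE field are also hypothetical.

```python
import csv
import io

# Made-up rows for illustration only; not real OncodriveFML output.
SAMPLE_TSV = (
    "index\tMUTS\tMUTS_RECURRENCE\tSAMPLES\tP_VALUE\tQ_VALUE\t"
    "P_VALUE_NEG\tQ_VALUE_NEG\tSNP\tMNP\tINDELS\tSYMBOL\n"
    "ENSG_EXAMPLE_1\t12\t9\t10\t0.0001\t0.002\t1.0\t1.0\t11\t0\t1\tGENE_A\n"
    "ENSG_EXAMPLE_2\t5\t5\t4\t0.03\t0.2\t0.98\t1.0\t5\t0\t0\tGENE_B\n"
    "ENSG_EXAMPLE_3\t1\t1\t1\t0.4\t\t0.7\t\t1\t0\t0\tGENE_C\n"
)


def significant_genes(handle, q_threshold=0.1):
    """Return (SYMBOL, Q_VALUE) pairs with Q_VALUE below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        q_text = row["Q_VALUE"]
        # Assumption: genes without a corrected value have an empty field.
        if q_text == "":
            continue
        q_value = float(q_text)
        if q_value < q_threshold:
            hits.append((row["SYMBOL"], q_value))
    # Most significant genes first.
    return sorted(hits, key=lambda item: item[1])


hits = significant_genes(io.StringIO(SAMPLE_TSV))
print(hits)
```

To read a real results file, the `io.StringIO` object would be replaced by an open file handle.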

The plots¶

Both plots (.png and .html) represent the same data. They are similar to Q-Q plots: the Y axis shows the sorted $-\log_{10}$ of the computed P-values, and the X axis shows the sorted $-\log_{10}$ of the expected P-values.

The expected P-values represent the null distribution: $-\log_{10}(i/N)$ where $i \in \{1, \dots, N\}$ and $N$ is the number of computed P-values.
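The coordinates of such a plot can be sketched in a few lines. This is a minimal illustration of the formula above, not the plotting code used by OncodriveFML; the function names are hypothetical, and the sketch relies on all P-values being greater than 0.

```python
import math


def expected_neg_log10(n):
    """Return -log10(i/N) for i = 1..N (largest value first)."""
    return [-math.log10(i / n) for i in range(1, n + 1)]


def qq_points(p_values):
    """Pair sorted expected and observed -log10 P-values as (x, y)."""
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(expected_neg_log10(len(p_values)))
    # Both lists are ascending, so the smallest P-value is paired
    # with the smallest expected P-value (i = 1).
    return list(zip(expected, observed))


points = qq_points([0.5, 0.01, 0.2])
for x, y in points:
    print(round(x, 3), round(y, 3))
```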

Note

OncodriveFML P-values are always greater than 0, even when all the simulated functional impact scores are lower than the observed functional impact score. In that case, a pseudocount is added.
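A minimal sketch of an empirical P-value with this behavior is shown below. The exact form of the pseudocount is not given here, so the sketch assumes that a count of zero is replaced by one; the function name `empirical_p_value` is hypothetical.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated scores that reach the observed score."""
    n = len(simulated)
    if n == 0:
        raise ValueError("at least one simulated score is required")
    hits = sum(1 for score in simulated if score >= observed)
    # Assumption: the pseudocount replaces a count of zero with one,
    # so the smallest possible P-value is 1/n rather than 0.
    return max(hits, 1) / n


print(empirical_p_value(2.5, [1.0, 2.0, 3.0, 4.0]))
print(empirical_p_value(5.0, [1.0, 2.0, 3.0, 4.0]))
```

Under this assumption, the lower bound of the P-value depends only on the number of randomizations.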

The genomic elements shown in a lighter color in the plot are those for which the number of mutated samples does not reach the minimum required to perform the multiple test correction.

Genomic elements above the red line in the plot have a Q-value below 0.1. Those between the green line and the red line have a Q-value between 0.1 and 0.25.
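The Benjamini/Hochberg correction, restricted to elements with enough mutated samples, can be sketched as follows. This is a plain-Python illustration of the standard procedure, not OncodriveFML's own implementation; the function names are hypothetical, and the default minimum of 2 is taken from the column descriptions above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini/Hochberg adjusted values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


def q_values_with_minimum(p_values, samples, min_samples=2):
    """Correct only elements with at least min_samples mutated samples."""
    eligible = [i for i, s in enumerate(samples) if s >= min_samples]
    corrected = benjamini_hochberg([p_values[i] for i in eligible])
    result = [None] * len(p_values)  # None marks uncorrected elements
    for i, q in zip(eligible, corrected):
        result[i] = q
    return result


q_values = q_values_with_minimum([0.01, 0.5, 0.02], [3, 1, 2])
print(q_values)
```

Elements left as `None` correspond to those drawn in a lighter color, since they were excluded from the correction.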
