Installation¶
The IntOGen pipeline requires Nextflow and Singularity in order to run.
Beside them, a number of different datasets need to be downloaded, and other pieces of software installed (as Singularity containers).
For information on how to download and build all these requirements, check the README file in the build folder.
Usage¶
Once all the prerequisites are available, running the pipeline
only requires to execute the intogen.nf
file with the appropiate
parameters. E.g.:
nextflow run intogen.nf --input <input>
There are a number of parameters and options that can be added.
- -resume
Nextflow feature to allow for resumable executions.
- –input <path>
Path of the input. See below for more details.
- –output <path>
Path where to store the output. Default:
intogen_analysis
.- –containers <path>
Path to the folder containing the singularity images. Default:
containers
.- –datasets <path>
Path to the folder the datasets from Default:
containers
.- –annotations <file>
Path to the default annotations file. Default:
config/annotations.txt
. See the input section for more details.- –debug true
Ask methods for a more verbose output if the option is available.
Input & output¶
Input¶
Although the pipeline does most of its computations at the cohort level, the pipeline is prepared to work with multiple cohorts at the same time.
Each cohort must contain, at least, the chromosome, position, ref, alt and sample. Files are expected to be TSV files with a header line.
Important
All mutations should be mapped to the positive strand. The strand value is ignored.
In addition, each cohort must be associated with:
cohort ID (
DATASET
): a unique identifier for each cohort.a cancer type (
CANCER
): although any acronym can be used here, we recommend to restrict to the acronyms that can be found inextra/data/dictionary_long_name.json
.a sequencing platform (
PLATFORM
):WXS
for whole exome sequencing andWGS
for whole genome sequencinga reference genome (
GENOMEREF
): onlyHG38
andHG19
are supported
Cohort file names, as well as the fields mentioned above must not contain dots.
The way to provide those values is through the use of annotations and datasource files supported by the bgparsers package.
In the case that you are interesting in analysing a single cohort,
the cohort filename until the first dot will be used as cohort ID.
The rest of the fields can be easily provided with an annotations file.
See config/annotations.txt
for an example.
When you plan to run multiple cohorts, it is recommended to
combine the simple annotations that can be achieved with
an annotations file with the power of bginfo
files
(with .bginfo
extension).
One such file (which can include others), allows to create
smart annotations and input certain files from your directory structure.
The pipeline can receive the bginfo
file or the folder where it is.
In this case, the annotations file (passed using --annotations
)
allows to make some general definitions of variables that
are used in all bginfo
files.
See examples of bginfo
in the
bgparsers documentation.
Output¶
By default this pipeline outputs 4 files:
cohorts.tsv
: summary of the cohorts that have been analyzeddrivers.tsv
: summary of the results of the driver discovery by cohortmutations.tsv
: summary of all the mutations analyzed by cohortunique_drivers.tsv
: information on the genes reported as drivers (in any cohort)
Those files can be found in the path indicated with the
--output
options.
Moreover, the --debug true
options will generate a
debug
folder under the output folder, in which
all the input and output files of the different methods are
linked.