What is an NG6Workflow

The NG6 interface is built upon 3 main objects:

  • The Analysis - which presents the execution of specific analysis software on input data files.
  • The Run - which can be viewed as a container of analyses. It describes the analyzed input data.
  • The Project - which is a container of Runs; it can also contain analyses.

An NG6Workflow creates a single Run object from input data files and populates this run with analyses of those data. It is an extension of the ng6.ng6workflow.NG6Workflow class and can be viewed as a collection of analyses. It also lists all the inputs and parameters that should be requested from the final user, and builds the execution process by adding analyses and linking them to each other.
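
The containment relationships described above can be pictured with a few plain Python classes. This is only an illustrative sketch: the classes, field names and sample values below are hypothetical stand-ins and are not the actual ng6 classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Analysis:
    """Hypothetical stand-in: the results of running a tool on input files."""
    name: str
    software: str


@dataclass
class Run:
    """Hypothetical stand-in: describes the input data and holds analyses."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Project:
    """Hypothetical stand-in: holds runs and, optionally, its own analyses."""
    name: str
    runs: List[Run] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)


# A workflow creates one run and populates it with analyses.
run = Run("lane_1")
run.analyses.append(Analysis("read quality", "fastqc"))
run.analyses.append(Analysis("alignment", "bwa"))
project = Project("demo", runs=[run])
```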

There are two classes that can be used to create a workflow in NG6:

  • ng6.ng6workflow.NG6Workflow - This class adds required parameters for the description of a Run and a Sample.
  • ng6.ng6workflow.CasavaNG6Workflow - An extension of ng6.ng6workflow.NG6Workflow which adds support for Illumina CASAVA output directories for the description of Samples.

Where to add a new NG6Workflow

A new workflow must be added as a new Python package in the workflows package. The implementation of a workflow must be written in the package __init__.py file. The developer can also create:

  • a components package, where all the workflow specific components and analysis can be stored,
  • a lib package to import specific libraries within its workflow,
  • a bin folder with the binaries used in the workflow.
nG6/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── myworkflow/       [ the new ng6workflow package ]
│   │   ├── components/   [ specific components and analyses ]
│   │   ├── lib/          [ specific libraries ]
│   │   ├── bin/          [ specific binaries ]
│   │   └── __init__.py   [ the ng6workflow implementation ]
│   ├── components/
│   ├── extparsers/
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README
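
The layout above can be created by hand or with a few lines of Python. The helper below is a minimal sketch, not part of nG6: the function name scaffold_workflow is hypothetical, and the assumption that components and lib each need their own __init__.py comes from them being described as packages.

```python
import tempfile
from pathlib import Path


def scaffold_workflow(workflows_dir, name):
    """Create the package layout for a new workflow and return its path."""
    package = Path(workflows_dir) / name
    # Sub-packages need their own __init__.py to be importable.
    for sub_package in ("components", "lib"):
        (package / sub_package).mkdir(parents=True, exist_ok=True)
        (package / sub_package / "__init__.py").touch()
    # bin/ only holds executables, so it is a plain folder.
    (package / "bin").mkdir(exist_ok=True)
    # The workflow class itself lives in the package __init__.py.
    (package / "__init__.py").touch()
    return package


# Demonstration in a throwaway directory instead of a real nG6 checkout.
demo_root = tempfile.mkdtemp()
package = scaffold_workflow(demo_root, "myworkflow")
```

In a real checkout, the first argument would be the path to the workflows folder shown in the tree.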

NG6Workflow classes

NG6Workflow class

An NG6Workflow is a class defined in the __init__.py file. To add a new one, the developer has to:

  • implement a class inheriting from the ng6.ng6workflow.NG6Workflow class,
  • overload the get_description() method to provide to the final user a description of the workflow,
  • overload the define_parameters() method to add the workflow inputs and parameters,
  • overload the process() method by adding analyses or components and setting their arguments,
  • link the components inputs and outputs.

The class skeleton is as follows:

from ng6.ng6workflow import NG6Workflow

class MyWorkflow(NG6Workflow):

    def get_description(self):
        return "a description"

    def define_parameters(self, function="process"):
        # define the parameters
        pass

    def process(self):
        # add and link the components
        pass

By inheriting from the ng6.ng6workflow.NG6Workflow class, default parameters describing the run and the samples are added automatically and will be requested from the final user.

Run parameters

Flag | Help | Required | Type
--admin-login | Who is the project administrator | true | adminlogin
--project-name | The project name the run belongs to | true | existingproject
--name | Give a name to your run | true | string
--description | Give a description to your run | true | string
--date | When were the data produced | true | date
--data-nature | Are sequences cDNA, genomic, RNA, ... | true | string
--sequencer | Which sequencer produced the data | true | string
--species | Which species has been sequenced | true | string
--type | What type of data is it (1 lane, 1 region) | true | string

Sample parameters

The sample parameter is a multiple parameter list. It must be set as --sample [subparam=value ...]. The subparameters are:

Subparameter | Help | Required | Type
sample_id | The unique identifier of the sample. | false | string
sample_name | A descriptive name for the sample. | false | string
sample_description | A brief description of the sample. | false | string
type | Read orientation and type. Choose a value from: pe (paired end), se (single end), ose (oriented single end), ope (oriented paired end), mp (mate pair). | false | string
insert_size | Insert size for paired end reads. | false | integer
metadata | Add metadata to the sample. A sample metadata must be set as --sample metadata=key:value. | false | samplemetadata
read1 | Read 1 data file path. | true | inputfile list
read2 | Read 2 data file path. | false | inputfile list
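
To make the subparam=value syntax concrete, the sketch below turns the tokens that follow --sample into a dictionary. It is an illustration only: parse_sample is a hypothetical helper, not the ng6 parser, and treating read1, read2 and metadata as accumulating lists is an assumption based on their types in the table.

```python
def parse_sample(tokens):
    """Turn ["sample_id=S1", "read1=a.fastq", ...] into a dictionary.

    Hypothetical helper: list-like subparameters (read1, read2, metadata)
    accumulate every value given, the others keep the last value seen.
    """
    list_keys = {"read1", "read2", "metadata"}
    sample = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError("expected subparam=value, got %r" % token)
        if key in list_keys:
            sample.setdefault(key, []).append(value)
        else:
            sample[key] = value
    # read1 is the only required subparameter.
    if "read1" not in sample:
        raise ValueError("read1 is required for every sample")
    return sample


# Made-up values, as they could follow --sample on the command line.
sample = parse_sample([
    "sample_id=S1", "type=pe", "insert_size=300",
    "read1=S1_R1.fastq", "read2=S1_R2.fastq", "metadata=tissue:leaf",
])
```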

CasavaNG6Workflow class

CasavaNG6Workflow is an extension of NG6Workflow. It has exactly the same requirements for the run description, but it overloads the sample definition parameter to parse an Illumina CASAVA output directory. With this parameter, the final user does not have to define the samples directly on the command line.

Sample parameters

Flag | Help | Required | Type
--casava-directory | Path to the CASAVA directory to use | true | string
--casava-lane | The lane number to be retrieved from the CASAVA directory | true | integer
--mismatch-index | Set this value to true if the index sequence in the sample fastq files allows at least 1 mismatch | false | boolean

NG6Workflow.define_parameters()

The define_parameters() method is used to add workflow parameters and inputs. To do so, several methods are available. Once defined, the new parameters are available as object attributes and are accessible through self.parameter_name.

Several types of parameters can be added, all described in the following sections. All have two required positional arguments: name and help. The other arguments are optional and can be given to the method by using their keywords.

Parameters

Parameters can be added to handle a single element or a list of elements. The add_parameter() method forces the final user to provide one and only one value, whereas the add_parameter_list() method allows the final user to give as many values as wanted.

add_parameter()

Example

In the following example, a parameter named sequencer is added to the workflow. It has a list of choices and the default value is "HiSeq2000".

self.add_parameter("sequencer",
                   "The sequencer type.",
                   choices=["HiSeq2000", "ILLUMINA", "SLX", "SOLEXA", "454", "UNKNOWN"],
                   default="HiSeq2000")

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name. NB: "-" in the parameter name is automatically replaced by "_", so --name-param becomes self.name_param in the code.
help | string | true | None | The parameter help message.
default | - | false | None | The default parameter value. Its type depends on the parameter type.
type | string | false | "str" | The parameter type. The value provided by the final user is cast to this type and checked against it. All built-in Python types are available: "int", "str", "float", "bool", "date", ... To create customized types, refer to the Add a data type documentation.
choices | list | false | [] | A list of the allowed values.
required | boolean | false | false | Whether or not the parameter is mandatory.
flag | string | false | None | The command line flag (if the value is None, the flag will be --name).
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.
add_to | string | false | None | If this parameter is part of a multiple parameter, add_to defines the "parent" parameter it should be linked to.
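
The name, type and choices options can be illustrated with a short stand-alone sketch. It does not reproduce the ng6 code: attribute_name, cast_value and the CASTERS registry are hypothetical, and the day/month/year date format is an assumption made for the example only.

```python
from datetime import datetime


def attribute_name(flag):
    """Map a command line flag such as --name-param to name_param."""
    return flag.lstrip("-").replace("-", "_")


def parse_date(value):
    # Assumed format for this sketch only: day/month/year.
    return datetime.strptime(value, "%d/%m/%Y").date()


def parse_bool(value):
    # bool("false") would be True, so booleans need explicit parsing.
    return value.lower() in ("true", "1", "yes")


# Hypothetical registry mapping type names to casting functions.
CASTERS = {"str": str, "int": int, "float": float,
           "bool": parse_bool, "date": parse_date}


def cast_value(raw, type_name="str", choices=()):
    """Cast a raw string, then check it against the allowed choices."""
    value = CASTERS[type_name](raw)
    if choices and value not in choices:
        raise ValueError("%r is not one of %r" % (value, list(choices)))
    return value


produced = cast_value("31/12/2013", "date")
sequencer = cast_value("454", choices=["HiSeq2000", "454", "UNKNOWN"])
```

A value that cannot be cast, or that is not among the choices, raises an error before the workflow starts, which is the behaviour the type and choices options describe.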

add_parameter_list()

The add_parameter_list() method takes the same arguments as add_parameter(). However, with this method the final user is allowed to enter multiple values for the parameter, and the object attribute self.parameter_name is set to a Python list.
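
The difference between the two methods resembles the difference between a plain option and an option declared with nargs in the standard argparse module. The sketch below is an analogy built on the standard library only. It is not the ng6 implementation, and the option names are chosen for the example.

```python
import argparse

parser = argparse.ArgumentParser()
# One and only one value, comparable to add_parameter().
parser.add_argument("--sequencer", default="HiSeq2000",
                    choices=["HiSeq2000", "ILLUMINA", "UNKNOWN"])
# Any number of values, comparable to add_parameter_list().
parser.add_argument("--species", nargs="+", default=[])

# Made-up command line for the demonstration.
args = parser.parse_args(["--species", "A.thaliana", "Z.mays"])
```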

Inputs

Just like parameters, inputs can be added to handle a single file or a list of files. The add_input_file() method forces the final user to provide one and only one file, whereas the add_input_file_list() method allows the final user to give as many files as wanted.

add_input_file()

Example

In the following example, an input named reads is added to the workflow. The provided file is required and should be in fastq format. No file size limitation is specified.

self.add_input_file("reads",
                    "Which read file should be used",
                    file_format="fastq",
                    required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help | string | true | None | The parameter help message.
default | string | false | None | The default path value.
file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq" and "sff". To create a customized format, refer to the Add a file format documentation.
type | string | false | "inputfile" | The type can be "inputfile", "localfile", "urlfile" or "browsefile". An "inputfile" allows the final user to provide a "localfile", an "urlfile" or a "browsefile". A "localfile" restricts the final user to a path to a file visible by ng6. An "urlfile" only permits the final user to give a URL as input, whereas a "browsefile" forces the final user to upload a file from their own computer. This last option is only available from the GUI and is considered as a "localfile" from the command line. The whole uploading process is handled by ng6.
required | boolean | false | false | Whether or not the parameter is mandatory.
flag | string | false | None | The command line flag (if the value is None, the flag will be --name).
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.
add_to | string | false | None | If this parameter is part of a multiple parameter, add_to defines the "parent" parameter it should be linked to.
size_limit | string | false | "0" | The maximum file size allowed. If the value is "0", the file size is unlimited. The given value should also provide the file size unit among "bytes", "Kb", "Mb", "Gb", "Tb", "Pb", "Eb" and "Zb". A value of 10Mb restricts the user to uploading a file of at most 10 megabytes.
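
The size_limit syntax can be read as a number followed by a unit. The sketch below shows one possible interpretation. The function names are hypothetical, and the factor of 1024 between consecutive units is an assumption, since the documentation does not state whether ng6 uses 1000 or 1024.

```python
import re

# Assumption for this sketch: each unit is 1024 times the previous one.
UNITS = ["bytes", "Kb", "Mb", "Gb", "Tb", "Pb", "Eb", "Zb"]


def size_limit_in_bytes(limit):
    """Convert a limit such as "10Mb" to bytes; "0" means no limit."""
    if limit == "0":
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", limit)
    if match is None or match.group(2) not in UNITS:
        raise ValueError("invalid size limit: %r" % limit)
    number, unit = match.groups()
    return int(float(number) * 1024 ** UNITS.index(unit))


def is_allowed(file_size, limit):
    """Tell whether a file of file_size bytes respects the limit."""
    max_size = size_limit_in_bytes(limit)
    return max_size is None or file_size <= max_size
```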

add_input_file_list()

This method takes the same arguments as add_input_file(). However, with this method the final user is allowed to provide multiple files, and the object attribute self.parameter_name is set to a Python list.

add_input_directory()

The add_input_directory() method allows the user to select files from a specific directory. This kind of input can be useful for tools that output an organized directory rather than individual files. The get_files_fn parameter specifies the function that will be used to retrieve the files. This function can take as many arguments as required, but its first argument has to be a string representing the folder path. By default, all files are selected. From the workflow process() method, the files can be retrieved by using the get_files() method.

Example

In the following example, the add_input_directory() method is used to parse a directory and retrieve only fasta files inside this directory. get_files() will browse the directory and get all fasta files.

import os
from ng6.ng6workflow import NG6Workflow

def fasta_files(folder):
    # return the full path of every fasta file found in the folder
    res = []
    for name in os.listdir(folder):
        if name.endswith(".fasta"):
            res.append(os.path.join(folder, name))
    return res

class WF(NG6Workflow):
    def define_parameters(self, function="process"):
        self.add_input_directory("fastadir", "Path to folder with fasta files",
            get_files_fn=fasta_files)

    def process(self):
        # to retrieve the files
        for fastafile in self.fastadir.get_files():
            # do something
            pass

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help | string | true | None | The parameter help message.
default | string | false | None | The default path value.
get_files_fn | function | false | - | The function called when executing param.get_files(). All arguments given to get_files() are passed on to get_files_fn.
required | boolean | false | false | Whether or not the parameter is mandatory.
flag | string | false | None | The command line flag (if the value is None, the flag will be --name).
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.
add_to | string | false | None | If this parameter is part of a multiple parameter, add_to defines the "parent" parameter it should be linked to.
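
The way get_files() passes its arguments on to get_files_fn can be pictured with a small stand-in class. This is a sketch under stated assumptions: InputDirectorySketch and files_with_suffix are hypothetical names, and the real ng6 object may behave differently in details such as the ordering of the returned files.

```python
import os
import tempfile


def files_with_suffix(folder, suffix=".fasta"):
    """Selection function: the folder path comes first, extras follow."""
    return sorted(os.path.join(folder, name)
                  for name in os.listdir(folder) if name.endswith(suffix))


class InputDirectorySketch:
    """Hypothetical stand-in for the object stored in self.parameter_name."""

    def __init__(self, path, get_files_fn):
        self.path = path
        self.get_files_fn = get_files_fn

    def get_files(self, *args, **kwargs):
        # Arguments given to get_files() are passed on after the folder path.
        return self.get_files_fn(self.path, *args, **kwargs)


# Demonstration on a throwaway directory holding three empty files.
folder = tempfile.mkdtemp()
for name in ("a.fasta", "b.fastq", "c.fasta"):
    open(os.path.join(folder, name), "w").close()

fastadir = InputDirectorySketch(folder, files_with_suffix)
fasta_names = [os.path.basename(p) for p in fastadir.get_files()]
fastq_names = [os.path.basename(p) for p in fastadir.get_files(".fastq")]
```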

Multiple parameters

The developer can structure the input data by using multiple parameters. A multiple parameter is a collection of parameters linked together. Just like parameters and inputs, it can be added to handle a single collection or a list of collections. The add_multiple_parameter() method forces the final user to provide one and only one collection, whereas the add_multiple_parameter_list() method allows the final user to give as many collections as wanted. To add a parameter to the multiple parameter, set the add_to option of any of the methods previously described. The object attribute self.multi_parameter_name is then a Python dictionary gathering the values of the different parameters, in the form {"sub_parameter1": value}.

add_multiple_parameter()

Example

The following example creates a multiple parameter named library which contains two input files R1 (which is mandatory) and R2 and a sequencer parameter. The parameter R1 is required only if a library is defined.

self.add_multiple_parameter("library", "Library.", required=False)
self.add_input_file("R1", "Path to R1 file.", required=True, add_to="library")
self.add_input_file("R2", "Path to R2 file.", add_to="library")
self.add_parameter("sequencer", "The sequencer type.", choices=["HiSeq2000", 
    "ILLUMINA", "UNKNOWN"], default="HiSeq2000", add_to="library")

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the multiple parameter. The parameter value is accessible within the workflow object through the attribute named self.multi_parameter_name, and its sub parameters through self.multi_parameter_name["sub_parameter_name"].
help | string | true | None | The parameter help message.
required | boolean | false | false | Whether or not the parameter is mandatory.
flag | string | false | None | The command line flag (if the value is None, the flag will be --name). The sub parameters can be set as follows: --name sub1=... sub2=...
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.

add_multiple_parameter_list()

This method takes the same arguments as add_multiple_parameter(). However, with this method the final user is allowed to provide multiple collections, and the object attribute self.multi_parameter_name is set to a Python list of Python dictionaries.
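
The shape of the values can be sketched with plain Python data. The dictionaries below only mimic what the text describes: the file names are made up, and representing a sub parameter that was not provided by None is an assumption.

```python
# Shape assumed from the description above; file names are made up.
library = {"R1": "lib1_R1.fastq", "R2": "lib1_R2.fastq",
           "sequencer": "HiSeq2000"}

# With add_multiple_parameter_list(), there is one dictionary per collection.
libraries = [
    library,
    {"R1": "lib2_R1.fastq", "R2": None, "sequencer": "ILLUMINA"},
]


def paired_libraries(collections):
    """Keep only the collections for which both R1 and R2 were given."""
    return [c for c in collections if c.get("R1") and c.get("R2")]


paired = paired_libraries(libraries)
```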

Exclusion rules

Some parameters can be declared mutually exclusive. To do so, the method add_exclusion_rule() is available. It only works with simple parameters.

add_exclusion_rule()

Example

In the following example, the final user will not be allowed to provide both fasta_file and fastq_file parameters.

self.add_input_file("fasta_file", "Path to the fasta file.", file_format="fasta")
self.add_input_file("fastq_file", "Path to the fastq file.", file_format="fastq")
self.add_exclusion_rule("fasta_file", "fastq_file")

Options

The method accepts the following options:

Name | Type | Required | Default value | Description
*args2exclude | string | true | None | The names of the parameters that exclude each other.
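
The effect of an exclusion rule can be pictured as a check on the values given by the final user. The function below is a hypothetical sketch of such a check, not the ng6 code. It assumes that a parameter left unset has the value None.

```python
def check_exclusion(provided, *args2exclude):
    """Raise if more than one of the mutually exclusive names has a value.

    provided is a hypothetical dictionary mapping parameter names to the
    values given by the final user (None when a parameter was not given).
    """
    given = [name for name in args2exclude if provided.get(name) is not None]
    if len(given) > 1:
        raise ValueError("parameters %s cannot be used together"
                         % ", ".join(given))
    return given


# Only one of the two files is given, so the check passes.
only_fasta = check_exclusion(
    {"fasta_file": "reads.fasta", "fastq_file": None},
    "fasta_file", "fastq_file")
```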

NG6Workflow.process()

The process() method is in charge of building the workflow by adding analyses and components (using the method add_component()) and linking their inputs and outputs. An analysis and a component are classes representing a workflow step. See the analyses documentation for more information.

add_component()

The add_component() method adds an analysis or a component to the workflow by building, respectively, a ng6.analysis.Analysis or a jflow.component.Component object and returning it. All attributes defined within this object, such as the outputs, are then available from the workflow and can be used as inputs of other components.

Example

In the following example, the first component BWAIndex is built and returned in the bwaindex object. The output bwaindex.databank is accessible as an object attribute and can be used as input of the BWAmem component.

def process(self):
    # index the reference genome
    bwaindex = self.add_component("BWAIndex", [self.reference_genome])
    # align reads against the indexed genome
    bwamem = self.add_component("BWAmem", [bwaindex.databank, self.reads])

Options

There is one positional argument: component_name. All other options are keyword options.

Name | Type | Required | Default value | Description
component_name | string | true | None | The component class name to add to the workflow.
args | list | false | [] | The component's arguments (see here for more details).
kwargs | dict | false | {} | The component's keyword arguments (see here for more details).
component_prefix | string | false | "default" | The prefix is used to name the component at execution. It allows adding multiple components of the same class within the same workflow.

Choose either the args or the kwargs option, not both. The kwargs dictionary lets you set only a subset of the component options, whereas the args list expects all of them.
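
The choice between args and kwargs mirrors positional and keyword arguments in a plain Python call. The sketch below illustrates the idea with stand-in code only: BWAmemSketch, its option names and add_component_sketch are hypothetical and do not reproduce the jflow implementation.

```python
class BWAmemSketch:
    """Hypothetical component; the option names are made up for the demo."""

    def define_parameters(self, databank, reads, threads=1):
        self.databank, self.reads, self.threads = databank, reads, threads


def add_component_sketch(component_class, args=None, kwargs=None):
    """Build a component from a positional list or a keyword dictionary."""
    if args and kwargs:
        raise ValueError("use either args or kwargs, not both")
    component = component_class()
    component.define_parameters(*(args or []), **(kwargs or {}))
    return component


# Positional: values must follow the order of the component's options.
by_position = add_component_sketch(
    BWAmemSketch, args=["genome.fa", ["r1.fastq"], 4])
# Keyword: only the needed options are named, the others keep defaults.
by_keyword = add_component_sketch(
    BWAmemSketch, kwargs={"reads": ["r1.fastq"], "databank": "genome.fa"})
```

With keywords, the order no longer matters and options with a default value can be left out, which is why kwargs is convenient for components with many options.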

get_resource()

The get_resource() method returns the value defined for a given resource in the resource section of the jflow configuration file: application.properties.

Options

There is one required argument: resource.

Name | Type | Required | Default value | Description
resource | string | true | None | The name of the resource whose configured value is requested.
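
Looking up a value in a section of a properties-style file can be sketched with the standard configparser module. This is an assumption-laden illustration: the section name "resources", the key "adapters" and the function get_resource_sketch are hypothetical, and the actual jflow configuration may be organised differently.

```python
import configparser

# Made-up configuration content; the section and key names are assumptions.
CONFIG_TEXT = """
[resources]
adapters = /path/to/adapters.fasta
"""


def get_resource_sketch(config_text, resource):
    """Return the value configured for resource, or None when it is unset."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    return config.get("resources", resource, fallback=None)


adapters = get_resource_sketch(CONFIG_TEXT, "adapters")
```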