What is an Analysis

An analysis presents the results of running one or more pieces of software (external scripts or Python code) on NGS data. It is represented by two main objects:

  • An analysis component, a Python class inheriting from ng6.analysis.Analysis, which lists all the inputs, outputs and parameters required to run the command line(s) and defines their structure.
  • A template file, written in Smarty, which presents the results of the analysis to the user.

Where to add a new Analysis

The new analysis must be added in a Python package. Two locations allow it to be imported by ng6:

  • workflows.components: the analysis will be visible by all workflows,
  • workflows.myWorkflow.components: the analysis will only be available for myWorkflow.

The template file presenting the results of the analysis must be added in ui/nG6/pi1/analyzes/. The template file must have the same name as the Python class of the Analysis. The following tree shows the structure of the source and where to add each specific file.

nG6/
├── bin/
├── docs/
├── src/
├── ui/
│   └── nG6/
│       └── pi1/
│           └── analyzes/
│               ├── MyAnalysis.tpl   [ the analysis template file ]
│               └── MyAnalysis.js    [ the analysis specific javascript file ]
├── workflows/
│   ├── myWorkflow/
│   │   ├── components/              [ workflow specific analysis and components ]
│   │   │   └── MyAnalysis.py        [ the analysis code ]
│   │   └── __init__.py
│   ├── components/                  [ general analyzes and components ]
│   │   ├── __init__.py
│   │   └── MyAnalysis.py            [ the analysis code ]
│   ├── extparsers/
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README
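
The naming rule above can be checked mechanically. The following is a minimal sketch and not part of ng6: the helper expected_ui_files and its ui_root default are hypothetical, and only encode the convention that the template and javascript files carry the name of the Python class.

```python
import posixpath

def expected_ui_files(analysis_class, ui_root="ui/nG6/pi1/analyzes"):
    """Return the (template, javascript) paths expected for an analysis class.

    Hypothetical helper: it only applies the naming convention described
    in this section, it does not check that the files exist.
    """
    name = analysis_class.__name__
    return (posixpath.join(ui_root, name + ".tpl"),
            posixpath.join(ui_root, name + ".js"))

class MyAnalysis(object):
    """Stand-in for a class inheriting from ng6.analysis.Analysis."""

tpl, js = expected_ui_files(MyAnalysis)
print(tpl)
print(js)
```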

The Analysis class

An analysis is a class defined in the MyAnalysis.py file. In order to add a new analysis, the developer has to:

  • implement a class inheriting from the ng6.analysis.Analysis class,
  • overload the define_parameters() method to add the inputs, outputs and parameters,
  • overload the define_analysis() method to describe the analysis (name, software, options ...),
  • overload the get_version() method to get the version of the analysis,
  • overload the process() method to define the command line(s) structure,
  • overload the post_process() method to define the post-processing treatment and update the database with analysis data.

The class skeleton is given by:

from ng6.analysis import Analysis

class MyComponent (Analysis):

    def define_parameters(self, param1, param2):
        # define the parameters (the argument list depends on the analysis)
        pass

    def process(self):
        # define the command line(s) structure
        pass

    def get_version(self):
        # return a string with the version of the analysis
        return "-"

    def define_analysis(self):
        # define the analysis
        self.name = "-"
        self.description = "-"
        self.software = "-"
        self.options = "-"

    def post_process(self):
        # database operations
        pass

Analysis.define_parameters()

The define_parameters() method is used to add component parameters, inputs and outputs. To do so, several methods are available. Once defined, the new parameters are available as object attributes and are accessible through self.parameter_name.

Several types of parameters can be added, all described in the following sections. All have two required positional arguments: name and help. The other arguments are optional and can be given to the method by using their keywords. The cmd_format and argpos arguments may simplify the process() function (see Analysis.process()).

Parameters

Parameters can be added to handle a single element or a list of elements. The add_parameter() method forces the final user to provide one and only one value, whereas the add_parameter_list() method allows the final user to give as many values as they want.

add_parameter()

Example

In the following example, a parameter named sequencer is added to the workflow. It has a list of choices and the default value is "HiSeq2000".

self.add_parameter("sequencer",
                   "The sequencer type.", 
                   choices = ["HiSeq2000", "ILLUMINA","SLX","SOLEXA","454","UNKNOWN"], 
                   default="HiSeq2000")

Options

There are two positional arguments: name and help. All other options are keyword options.

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name. |
| help | string | true | None | The parameter help message. |
| default | - | false | None | The default parameter value. Its type depends on the parameter type. |
| type | string | false | "str" | The parameter type. The value provided by the final user is cast to and checked against this type. All built-in Python types are available: "int", "str", "float", "bool", "date", ... To create customized types, refer to the Add a data type documentation. |
| choices | list | false | [] | A list of the allowed values. |
| required | boolean | false | false | Whether or not the parameter is required (a required parameter cannot be omitted). |
| flag | string | false | None | The command line flag (if the value is None, the flag will be --name). |
| group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
| display_name | string | false | None | The parameter name that should be displayed on the final form. |
| cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
| argpos | integer | false | -1 | The parameter position in the command line. |
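
To make the interplay between type, choices, default and required more concrete, here is a minimal sketch of the kind of check the options describe. It is not the ng6 implementation: check_parameter is a hypothetical function, and only a small subset of types is handled.

```python
def check_parameter(value, type_name="str", choices=None, default=None, required=False):
    """Cast a raw user value and validate it, mimicking the documented options."""
    casters = {"str": str, "int": int, "float": float}  # subset, for illustration only
    if value is None:
        if required and default is None:
            raise ValueError("a value is required")
        value = default
        if value is None:
            return None
    cast_value = casters[type_name](value)
    if choices and cast_value not in choices:
        raise ValueError("%r is not one of %r" % (cast_value, choices))
    return cast_value

sequencers = ["HiSeq2000", "ILLUMINA", "SLX", "SOLEXA", "454", "UNKNOWN"]
print(check_parameter(None, choices=sequencers, default="HiSeq2000"))
print(check_parameter("454", choices=sequencers))
print(check_parameter("8", type_name="int"))
```

A value outside choices, such as "MiSeq" in this sketch, raises a ValueError instead of being silently accepted.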

add_parameter_list()

The add_parameter_list() method takes the same arguments as add_parameter(). However, when a parameter is added this way, the final user is allowed to enter multiple values for it and the object attribute self.parameter_name is set to a Python list.

Inputs

Just like parameters, inputs can be added to handle a single file or a list of files. The add_input_file() method forces the component to take one and only one file, whereas the add_input_file_list() method allows the component to take as many files as needed.

add_input_file()

Example

In the following example, an input named reads is added to the workflow. The provided file is required and should be in fastq format. No file size limitation is specified.

self.add_input_file("reads",
                    "Which read files should be used",
                    file_format="fastq",
                    required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name. |
| help | string | true | None | The parameter help message. |
| default | string | false | None | The default path value. |
| file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create customized formats, refer to the Add a file format documentation. |
| type | string | false | "inputfile" | The type can be "inputfile", "localfile", "urlfile" or "browsefile". An "inputfile" allows the final user to provide a "localfile", an "urlfile" or a "browsefile". A "localfile" restricts the final user to a path to a file visible by ng6. An "urlfile" only permits the final user to give a URL as input, whereas a "browsefile" forces the final user to upload a file from their own computer. This last option is only available from the GUI and is considered as a "localfile" from the command line. The whole uploading process is handled by ng6. |
| required | boolean | false | false | Whether or not the parameter is required (a required parameter cannot be omitted). |
| flag | string | false | None | The command line flag (if the value is None, the flag will be --name). |
| group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
| display_name | string | false | None | The parameter name that should be displayed on the final form. |
| cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
| argpos | integer | false | -1 | The parameter position in the command line. |

add_input_file_list()

This method takes the same arguments as add_input_file(). However, when an input is added this way, the component can take a list of files and the object attribute self.parameter_name is set to a Python list.

Outputs

Just like inputs, outputs can be added to handle a single file or a list of files. The add_output_file() method forces the component to produce one and only one file, whereas the add_output_file_list() method allows the component to produce as many files as needed. There are also two specific methods: add_output_file_endswith() and add_output_file_pattern().

add_output_file()

Example

In the following example, an output named databank is defined. The process has to produce the file, otherwise the workflow will fail. The file written on the disk will have the same name as the one stored in the variable input_fasta.

self.add_output_file("databank", 
                     "The indexed databank", 
                     filename=os.path.basename(input_fasta))

Options

The two positional arguments name and help are always present.

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name. |
| help | string | true | None | The parameter help message. |
| filename | string | false | None | The expected name of the output file. |
| file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create customized formats, refer to the Add a file format documentation. |
| group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
| display_name | string | false | None | The parameter name that should be displayed on the final form. |
| cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
| argpos | integer | false | -1 | The parameter position in the command line. |

add_output_file_list()

Example

In the following example, an output named sam_files is defined. The files written on the disk all follow the pattern {basename_woext}.sam applied to the self.reads variable. The resulting list gathers files with the same basename as those in self.reads, but with the extension replaced by ".sam".

self.add_output_file_list("sam_files", 
                          "The BWA output files", 
                          pattern='{basename_woext}.sam', 
                          items=self.reads)

Options

The two positional arguments name and help are always present.

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name. |
| help | string | true | None | The parameter help message. |
| items | list | false | None | A list of elements through which add_output_file_list() should iterate to produce the output list. |
| pattern | string | false | '{basename_woext}.out' | The pattern is used to produce the output list. add_output_file_list() maps the pattern on each item to produce the final value. The pattern accepts these predefined fields: {fullpath} or {FULL} for the full input file path; {basename} or {BASE} for the base input file name; {fullpath_woext} or {FULLWE} for the full input file path without extension; {basename_woext} or {BASEWE} for the base input file name without extension. Note that ".gz" is not considered as an extension. |
| file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create customized formats, refer to the Add a file format documentation. |
| group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
| display_name | string | false | None | The parameter name that should be displayed on the final form. |
| cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
| argpos | integer | false | -1 | The parameter position in the command line. |
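
The pattern mechanism can be illustrated with a short sketch. This is not the ng6 code: expand_pattern and _without_ext are hypothetical names, the output directory is ignored, and the handling of ".gz" (skipped before the real extension is removed) is only one reading of the note on extensions.

```python
import os

def _without_ext(path):
    # Assumption: ".gz" is skipped, so that "a.fastq.gz" yields "a".
    if path.endswith(".gz"):
        path = path[:-len(".gz")]
    return os.path.splitext(path)[0]

def expand_pattern(pattern, items):
    """Map `pattern` on each item to build the list of output names."""
    outputs = []
    for item in items:
        base = os.path.basename(item)
        fields = {
            "fullpath": item, "FULL": item,
            "basename": base, "BASE": base,
            "fullpath_woext": _without_ext(item), "FULLWE": _without_ext(item),
            "basename_woext": _without_ext(base), "BASEWE": _without_ext(base),
        }
        outputs.append(pattern.format(**fields))
    return outputs

reads = ["/data/splA.fastq", "/data/splB.fastq.gz"]
print(expand_pattern("{basename_woext}.sam", reads))
```

Each entry of the result corresponds to the item at the same position in the input list, which is what later allows inputs and outputs to be paired one by one.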

add_output_file_endswith()

Example

In the following example, an output named sam_files is defined. The file list will be made of all the files written on the disk with an extension equal to .sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.

self.add_output_file_endswith("sam_files", 
                              "The BWA output files", 
                              pattern='.sam')

Options

This method is very different from add_output_file_list() because it should only be used when the number of output files returned by the component is unknown. Three options are required: name, help and pattern.

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name. |
| help | string | true | None | The parameter help message. |
| pattern | string | true | None | The extension of the files to return. |
| behaviour | string | false | "include" | How to process selected files. Any value other than "include" means that all files not ending with the pattern will be selected. |
| file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create customized formats, refer to the Add a file format documentation. |
| group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
| display_name | string | false | None | The parameter name that should be displayed on the final form. |
| cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
| argpos | integer | false | -1 | The parameter position in the command line. |
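
The effect of the behaviour option can be sketched on a plain list of file names. The function select_endswith below is hypothetical and works on names only, whereas ng6 inspects the files actually written by the component.

```python
def select_endswith(filenames, pattern, behaviour="include"):
    """Select file names by suffix.

    "include" keeps the names ending with `pattern`; any other value
    keeps the names that do not end with it.
    """
    if behaviour == "include":
        return [f for f in filenames if f.endswith(pattern)]
    return [f for f in filenames if not f.endswith(pattern)]

produced = ["splA.sam", "splA.log", "splB.sam"]
print(select_endswith(produced, ".sam"))
print(select_endswith(produced, ".sam", behaviour="exclude"))
```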

add_output_file_pattern()

Example

In the following example, an output named sam_files is defined. The file list will be made of all the files written on the disk matching the pattern *_R1.sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.

self.add_output_file_pattern("sam_files", 
                             "The BWA output files", 
                             pattern='*_R1.sam')

Options

This method is slightly different from add_output_file_endswith() because it returns all the files matching the given pattern instead of matching only the extension. All the options are the same, only pattern differs.

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| pattern | string | true | None | The regexp string used to retrieve the output files. |
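
As an illustration of pattern-based selection, the sketch below filters a list of names with shell-style wildcards, as the *_R1.sam example suggests. This matching style is an assumption, not a statement about how ng6 interprets the pattern, and select_pattern is a hypothetical function.

```python
import fnmatch

def select_pattern(filenames, pattern):
    """Keep the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [f for f in filenames if fnmatch.fnmatchcase(f, pattern)]

produced = ["splA_R1.sam", "splA_R2.sam", "splB_R1.sam", "splB_R1.log"]
print(select_pattern(produced, "*_R1.sam"))
```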

Analysis.define_analysis()

This method is in charge of describing the analysis: its name, its description, its software, its options and so on.

Example

Here is an example describing a fastqc analysis. It defines the name, description, software and options used:

def define_analysis(self):
    self.name = "ReadsStats"
    self.description = "Statistics on reads and their qualities."
    self.software = "fastqc"
    self.options = " --nogroup --casava"

Options

The define_analysis() method does not take any option. This table presents the attributes that can be created in define_analysis(). All attributes must be strings.

| Name | Description |
| --- | --- |
| self.name | Give a name to your analysis. |
| self.description | Give a description to your analysis. |
| self.software | Which software was used for this analysis. |
| self.options | Which software options were used for this analysis. |

Analysis.get_version()

The get_version() method must return a string representing the version of this analysis.

Example

This example presents the get_version() method of the fastqc analysis:

from subprocess import Popen, PIPE

def get_version(self):
    # run "fastqc --version" and keep the second word of its output
    cmd = [self.get_exec_path("fastqc"), "--version"]
    p = Popen(cmd, stdout=PIPE, stderr=PIPE)
    stdout, stderr = p.communicate()
    return stdout.split()[1]
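
The parsing step of this example can be tried without running fastqc. The sketch below is hypothetical: parse_version is not an ng6 function, and the banner "FastQC v0.11.9" is only an assumed sample of what the tool prints. It also decodes bytes, since the output captured from a process is bytes under Python 3.

```python
def parse_version(tool_output):
    """Return the second whitespace-separated token of a version banner."""
    if isinstance(tool_output, bytes):
        tool_output = tool_output.decode("utf-8")
    tokens = tool_output.split()
    if len(tokens) < 2:
        raise ValueError("unexpected version banner: %r" % tool_output)
    return tokens[1]

print(parse_version(b"FastQC v0.11.9\n"))
```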

Analysis.process()

The process() method specifies the executables used to process the data (a command line or a Python function) and defines the pattern of execution that determines how the functions are applied to the data, called an abstraction below. To build the process, ng6 provides two main functions, ShellFunction and PythonFunction, and two main abstractions, Map and MultiMap.

Functions

The two provided functions allow the developer to specify the executables used to process the data.

ShellFunction

The ShellFunction can be used when the workflow needs to run an external command line. It defines the command line structure so that ng6 can build and run it automatically on the final user inputs.

Example

Consider the following blastall command line:

blastall -p [program_name] -i [query_file] -d [database] -o [file_out]

When using an ng6 function, the command format has to be given in order to set the order of the inputs, outputs and arguments. Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. With this format, ng6 considers the inputs and outputs in the following order: query_file, database and then file_out, resulting in the following command structure:

blastall -p [program_name] -i [$1] -d [$2] -o [$3]

The ShellFunction can then be defined as follows:

blast = ShellFunction("blastall -p blastn -i $1 -d $2 -o $3", cmd_format="{EXE} {IN} {OUT}")

It can then be executed by calling the newly created function:

blast( inputs=[query_file, database], outputs=[file_out] )

Options

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| source | string | true | None | The command line structure defining inputs, outputs and arguments positions. |
| shell | string | false | "sh" | Which shell should be used to interpret the command line. The value can be "sh", "ksh", "bash", "csh" or "tcsh". |
| cmd_format | string | false | '{EXE} {ARG} {IN} > {OUT}' | The cmd_format supports the following fields: {executable} or {EXE} for the path to the executable; {inputs} or {IN} for the input files; {outputs} or {OUT} for the output files; {arguments} or {ARG} for the arguments. |
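
The positional convention of the blastall example (inputs first, then outputs, numbered from $1) can be mimicked with a few lines of plain Python. This is only a toy model of cmd_format="{EXE} {IN} {OUT}": render_command is a hypothetical function, the file names are made up, and nothing is executed.

```python
import re

def render_command(source, inputs, outputs):
    """Replace $1, $2, ... in `source` with the inputs followed by the outputs."""
    files = list(inputs) + list(outputs)

    def substitute(match):
        index = int(match.group(1)) - 1  # $1 refers to files[0]
        return files[index]

    return re.sub(r"\$(\d+)", substitute, source)

print(render_command("blastall -p blastn -i $1 -d $2 -o $3",
                     inputs=["query.fasta", "nt_db"],
                     outputs=["hits.txt"]))
```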

PythonFunction

The PythonFunction can be used when the workflow needs to run internal Python code defined in a Python function. It defines the way the function should be called so that ng6 can call and run it automatically on the final user inputs.

Example

Consider a function named fastq2fasta defined by:

def fastq2fasta(fastq_file, fasta_file):
    # python code goes here
    pass

When using an ng6 function, the command format has to be given in order to set the order of the inputs, outputs and arguments. Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. With this format, ng6 passes the variables to the function in the following order: fastq_file and then fasta_file. The PythonFunction can be defined as follows:

fastq2a = PythonFunction(fastq2fasta, cmd_format="{EXE} {IN} {OUT}")

It can then be executed by calling the newly created function:

fastq2a( inputs=[fastq_file], outputs=[fasta_file] )

Options

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| function | function | true | None | The Python function used to process the data. |
| add_path | string | false | None | A path to a Python library required to run the function. This is useful in case the library is not in the path and not visible by ng6. |
| cmd_format | string | false | '{EXE} {ARG} {IN} > {OUT}' | The cmd_format supports the following fields: {executable} or {EXE} for the path to the executable; {inputs} or {IN} for the input files; {outputs} or {OUT} for the output files; {arguments} or {ARG} for the arguments. |

Abstractions

An abstraction defines the pattern of execution that determines how the functions (ShellFunction or PythonFunction) are applied to the data.

__call__

This first abstraction is executed when calling the ShellFunction or the PythonFunction as a basic Python function.

Example

fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
fasta_trim( inputs="splA.fasta", outputs="splA_trim.fasta" )

This abstraction will lead to the execution of the following command line:

fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta

Options

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| inputs | list or string | true | None | The input files list required to run the function. |
| outputs | list or string | false | None | The output files list created by the function. |
| arguments | string | false | None | The arguments to provide to the function. |
| includes | list or string | false | None | Files to include for this task. |
| collect | boolean | false | false | Whether or not to mark files for garbage collection. |
| local | boolean | false | false | Whether or not to force local execution. |

Map

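The Map abstraction maps one input file list to one output file list: the function is run once per input file, each run producing the output file at the same position in the output list.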

Example

fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_list = ["splA_trim.fasta", "splB_trim.fasta", "splC_trim.fasta"]

fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
Map( fasta_trim, inputs=fasta_list, outputs=out_list )

This abstraction will lead to the execution of the following command lines:

fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta
fastaTrim.pl --length 50 splB.fasta > splB_trim.fasta
fastaTrim.pl --length 50 splC.fasta > splC_trim.fasta

Options

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| function | function | true | None | The ShellFunction or the PythonFunction to use to process the data. |
| inputs | list or string | true | None | The input files list required to run the function. |
| outputs | list or string | false | None | The output files list created by the function. |
| includes | list or string | false | None | Files to include for each task. |
| collect | boolean | false | false | Whether or not to mark files for garbage collection. |
| local | boolean | false | false | Whether or not to force local execution. |
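
The pairing performed by Map can be reproduced with a small stand-alone sketch. The function map_commands is hypothetical and only builds the command strings; it does not schedule or run anything, unlike the real abstraction.

```python
def map_commands(source, inputs, outputs):
    """Build one command per (input, output) pair, in list order."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    return [source.replace("$1", in_file).replace("$2", out_file)
            for in_file, out_file in zip(inputs, outputs)]

fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_list = ["splA_trim.fasta", "splB_trim.fasta", "splC_trim.fasta"]

for command in map_commands("fastaTrim.pl --length 50 $1 > $2", fasta_list, out_list):
    print(command)
```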

MultiMap

The MultiMap abstraction maps n input file lists to n output file lists, all of the same length: each run takes the files found at the same position in every list.

Example

fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_list = [ ["splA.fasta", "splB.fasta","splC.fasta"],
             [ "splA.qual", "splB.qual","splC.qual"  ] ]

fastq2fasta = ShellFunction( "fastq2fasta.py --input $1 --fasta $2 --qual $3", 
                             cmd_format="{EXE} {IN} {OUT}" )
MultiMap( fastq2fasta, inputs=fastq_list, outputs=out_list )
or
fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_fasta_list = ["splA.fasta", "splB.fasta","splC.fasta"]
out_qual_list =  [ "splA.qual", "splB.qual","splC.qual"  ] 

fastq2fasta = ShellFunction( "fastq2fasta.py --input $1 --fasta $2 --qual $3", 
                             cmd_format="{EXE} {IN} {OUT}" )
MultiMap( fastq2fasta, inputs=fastq_list, outputs=[out_fasta_list,out_qual_list] )

This abstraction will lead to the execution of the following command lines:

fastq2fasta.py --input splA.fastq --fasta splA.fasta --qual splA.qual
fastq2fasta.py --input splB.fastq --fasta splB.fasta --qual splB.qual
fastq2fasta.py --input splC.fastq --fasta splC.fasta --qual splC.qual

Options

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| function | function | true | None | The ShellFunction or the PythonFunction to use to process the data. |
| inputs | list or string | true | None | The input files list required to run the function. |
| outputs | list or string | false | None | The output files list created by the function. |
| includes | list or string | false | None | Files to include for each task. |
| collect | boolean | false | false | Whether or not to mark files for garbage collection. |
| local | boolean | false | false | Whether or not to force local execution. |
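
The way MultiMap walks through several lists in parallel can be sketched as follows. The function multimap_commands is hypothetical, handles a single input list with several output lists as in the example, and only builds command strings.

```python
def multimap_commands(source, inputs, outputs):
    """Build one command per position, taking one file from each list.

    `inputs` is a list of files and `outputs` a list of lists; all lists
    are expected to have the same length.
    """
    commands = []
    for files in zip(inputs, *outputs):
        command = source
        # $1 is the input, $2, $3, ... are the outputs, in list order
        for position, name in enumerate(files, start=1):
            command = command.replace("$%d" % position, name)
        commands.append(command)
    return commands

fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_qual_list = ["splA.qual", "splB.qual", "splC.qual"]

for command in multimap_commands("fastq2fasta.py --input $1 --fasta $2 --qual $3",
                                 fastq_list, [out_fasta_list, out_qual_list]):
    print(command)
```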

Analysis.post_process()

The post_process() method is executed once process() is done, i.e. when the output files have been generated. Information can be added to the database by calling _add_result_element() inside post_process(); the added information then becomes available for display. If the analysis generates files of interest that have to be used in the interface, the _save_file() method can be used to save a specific file in the storage directory.

_add_result_element()

This method is used to add a result entry in the database.

Example

The following example presents the define_parameters() and post_process() methods of an analysis. The analysis creates outputs, a list of fastq files generated by sampling the input fastq files, each input fastq file corresponding to a sample. In post_process(), the number of sequences in each input fastq file and in its sampled file is computed and stored in the database.

import os

def define_parameters(self, input_fastq):
    self.add_input_file_list("input_fastq", "input_fastqs", default=input_fastq)
    self.add_output_file_list("outputs", "outputs", pattern='{basename_woext}.sample',
        items=input_fastq)

def get_nbseq(self, fastq):
    # method which returns the number of sequences in a fastq file
    pass

def post_process(self):
    # store, for each sample, the sequence counts of the input and sampled files
    for i, infastq in enumerate(self.input_fastq):
        outfastq = self.outputs[i]
        name = os.path.splitext(os.path.basename(infastq))[0]
        self._add_result_element(name, "nb_sequence", self.get_nbseq(infastq))
        self._add_result_element(name, "extracted_sequence", self.get_nbseq(outfastq))
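
The body of get_nbseq() is left out in the example. One possible way to write it is sketched below as a stand-alone function, assuming well-formed FASTQ files with four lines per record; count_fastq_sequences is a hypothetical name and the temporary file only serves as a small check.

```python
import gzip
import os
import tempfile

def count_fastq_sequences(path):
    """Count the records of a FASTQ file, assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        nb_lines = sum(1 for _ in handle)
    return nb_lines // 4

# Small self-contained check on a temporary file holding two records.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
nb_sequences = count_fastq_sequences(tmp.name)
os.remove(tmp.name)
print(nb_sequences)
```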

Options

| Name | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| file | string | true | None | The file name, sample or any entry the result is linked to. |
| result_key | string | true | None | The result key. |
| result_value | string | true | None | The result value associated to the key. |
| result_group | string | false | "default" | The result group it belongs to. |

The analysis display

The analysis results, added with the _add_result_element() method in the post_process() function, can be displayed using a template file. The template is an HTML file written with the Smarty template engine (PHP) and must be stored in the ui/nG6/pi1/analyzes/ folder with the same name as your Analysis Python class. To add interaction with the user, a javascript file can also be written and associated to the template (same folder, same name).

Template file

The template file is an HTML file written in Smarty. NG6 provides a global template named AnalysisTemplate.tpl that should be extended. This template defines multiple blocks that can be redefined by any template inheriting from the NG6 parent template. The blocks that can be overloaded are:

  • description_update: can be used to extend the description in the top panel of the analysis,
  • results: results are displayed in this block. It is empty by default and has to be overloaded,
  • params_content: contains the parameters used in the analysis,
  • results_title: allows changing the title of the results tab,
  • params_title: allows changing the title of the parameters tab.

All analysis results added with _add_result_element() are available within the template in the {$analyse_results} variable. The result added by _add_result_element("myelement", "nbseq", "120", "default") can be accessed within the template with {$analyse_results["myelement"]["default"].nbseq}.
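
The access path used in the template (entry, then group, then key) suggests a nested structure. The sketch below mimics it with an in-memory class; ResultStore is a hypothetical stand-in, not the ng6 storage layer, which saves the results in the database.

```python
from collections import defaultdict

class ResultStore(object):
    """Hypothetical in-memory stand-in for the stored analysis results."""

    def __init__(self):
        # results[file][result_group][result_key] = result_value
        self.results = defaultdict(lambda: defaultdict(dict))

    def add_result_element(self, file, result_key, result_value, result_group="default"):
        self.results[file][result_group][result_key] = result_value

store = ResultStore()
store.add_result_element("myelement", "nbseq", "120", "default")
print(store.results["myelement"]["default"]["nbseq"])
```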

Example

This example takes the post_process() function defined previously and writes a template file that creates a table with the number of sequences and the number of extracted sequences for each sample. The template overrides the results and results_title blocks. The NG6 interface uses Bootstrap, so all Bootstrap classes are available from a template. The jQuery DataTables plugin can also be used; the dataTable CSS class is used to format the results table.

def post_process(self):
    for i, infastq in enumerate(self.input_fastq):
        outfastq = self.outputs[i]
        name = os.path.splitext(os.path.basename(infastq))[0]
        self._add_result_element(name, "nb_sequence", self.get_nbseq(infastq))
        self._add_result_element(name, "extracted_sequence", self.get_nbseq(outfastq))

{extends file="AnalysisTemplate.tpl"}
{block name=results_title} Fastq results {/block}
{block name=results}
    <table class="table table-striped table-bordered dataTable analysis-result-table">
        <thead>
            <tr>
                <th class="string-sort" >Sample</th>
                <th class="string-sort" >Number of sequences</th>
                <th class="string-sort" >Number of extracted sequences</th>
            </tr>
        </thead>
        <tbody>
            {foreach from=$analyse_results key=sample item=sample_results}
                <tr>
                    <td> {$sample|get_description:$descriptions}</td>
                    <td>{$sample_results["default"].nb_sequence}</td>
                    <td>{$sample_results["default"].extracted_sequence}</td>
                </tr>
            {/foreach}
        </tbody>
    </table>
{/block}

Javascript file

It is also possible to add a javascript file to make the template more dynamic (for example to add button actions). To do so, the developer can write a javascript file with the same name as the Analysis class.

Example

The previous template file has been updated to add a checkbox for each sample and a button in the table footer. In the javascript file, jQuery is used to retrieve the selected lines.

{extends file="AnalysisTemplate.tpl"}
{block name=results_title} Fastq results {/block}
{block name=results}
    <table class="table table-striped table-bordered dataTable analysis-result-table">
        <thead>
            <tr>
                <th></th>
                <th class="string-sort" >Sample</th>
                <th class="string-sort" >Number of sequences</th>
                <th class="string-sort" >Number of extracted sequences</th>
            </tr>
        </thead>
        <tbody>
            {foreach from=$analyse_results key=sample item=sample_results name=samples}
                <tr>
                    <td><center><input type="checkbox" id="chk_sample_{$smarty.foreach.samples.index}"
                        value="sample"/></center></td>
                    <td> {$sample|get_description:$descriptions}</td>
                    <td>{$sample_results["default"].nb_sequence}</td>
                    <td>{$sample_results["default"].extracted_sequence}</td>
                </tr>
            {/foreach}
        </tbody>
        <tfoot>
            <tr>
                <th align="left" colspan="4">
                    <button type="button" class="btn btn-default" 
                        id="my_button">click</button>
                </th>
            </tr>
        </tfoot>
    </table>
{/block}

$(function () {
    $("#my_button").click(function () {
        var samples = [];
        $("input[id^='chk_sample_']:checked").each(function () {
            // the sample index is the last part of the checkbox id
            samples.push(parseInt($(this).attr("id").split('_')[2], 10));
        });
        // Do something with selected samples
    });
});