An analysis presents the results of running one or more pieces of software, which can be external scripts or Python code, on NGS data. Two main objects are involved: ng6.analysis.Analysis, which lists all the inputs, outputs and parameters required to run the command line(s) and defines their structure, and jflow.component.Component. It is recommended to use Component when the result of the execution does not have to be presented to the final user (for example, indexing a reference file) and Analysis when the results must be presented to the user, like a fastqc report. The difference is that an Analysis additionally requires the define_analysis() and post_process() functions and an HTML template file.
The new analysis must be added to a Python package. Two locations are possible so that ng6 can import it:
- workflows.components: the analysis will be visible to all workflows,
- workflows.myWorkflow.components: the analysis will only be available to myWorkflow.

The template file presenting the results of the analysis must be added in ui/nG6/pi1/analyzes/. The template file must have the same name as the Python class of the Analysis. The following tree shows the structure of the source and where to add the specific files.
nG6/
├── bin/
├── docs/
├── src/
├── ui/
│   └── nG6/
│       └── pi1/
│           └── analyzes/
│               ├── MyAnalysis.tpl  [ the analysis template file ]
│               └── MyAnalysis.js   [ the analysis specific javascript file ]
├── workflows/
│   ├── myWorkflow/
│   │   ├── components/             [ workflow specific analysis and components ]
│   │   │   └── MyAnalysis.py       [ the analysis code ]
│   │   └── __init__.py
│   ├── components/                 [ general analyses and components ]
│   │   ├── __init__.py
│   │   └── MyAnalysis.py           [ the analysis code ]
│   ├── extparsers/
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README
An analysis is a class defined in the MyAnalysis.py file. To add a new analysis, the developer has to:
- implement a class inheriting from the ng6.analysis.Analysis class,
- overload the define_parameters() method to add the inputs, outputs and parameters,
- overload the define_analysis() method to describe the analysis (name, software, options ...),
- overload the get_version() method to get the version of the analysis,
- overload the process() method to define the command line(s) structure,
- overload the post_process() method to define the post-processing treatment and update the database with the analysis data.

The class skeleton is given by:
from ng6.analysis import Analysis

class MyAnalysis (Analysis):

    def define_parameters(self, param1, param2):
        # define the parameters (add as many arguments as needed)
        pass

    def process(self):
        # define the command line(s) structure
        pass

    def get_version(self):
        # return a string with the version of the analysis
        pass

    def define_analysis(self):
        # define the analysis
        self.name = "-"
        self.description = "-"
        self.software = "-"
        self.options = "-"

    def post_process(self):
        # database operations
        pass
The define_parameters() method is used to add component parameters, inputs and outputs. To do so, several methods are available. Once defined, the new parameters are available as object attributes and are accessible through self.parameter_name.
Several types of parameters can be added, all described in the following sections. All have two required positional arguments: name and help. The other arguments are optional and can be given to the method by using their keywords. The cmd_format and argpos arguments may simplify the process() function (see Analysis.process()).
Parameters can be added to handle a single element or a list of elements. The add_parameter() method forces the final user to provide one and only one value, whereas the add_parameter_list() method allows the final user to give as many values as they want.
In the following example, a parameter named sequencer
is added to the workflow. It has a list of choices and the default value is "HiSeq2000".
self.add_parameter("sequencer",
"The sequencer type.",
choices = ["HiSeq2000", "ILLUMINA","SLX","SOLEXA","454","UNKNOWN"],
default="HiSeq2000")
There are two positional arguments: name and help. All other options are keyword options.
Name | Type | Required | Default value | Description |
---|---|---|---|---|
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name . |
help | string | true | None | The parameter help message. |
default | - | false | None | The default parameter value. Its type depends on the parameter type. |
type | string | false | "str" | The parameter type. The value provided by the final user will be cast and checked against this type. All built-in Python types are available: "int", "str", "float", "bool", "date", ... To create customized types, refer to the Add a data type documentation. |
choices | list | false | [] | A list of the allowed values. |
required | boolean | false | false | Whether or not the parameter is required. |
flag | string | false | None | The command line flag (if the value is None, the flag will be --name ). |
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
display_name | string | false | None | The parameter name that should be displayed on the final form. |
cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
argpos | integer | false | -1 | The parameter position in the command line. |
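To illustrate how the type, choices, default and required options described above can interact, here is a minimal, self-contained sketch. It does not use the ng6 or jflow code: check_parameter and its CASTS table are hypothetical stand-ins, and the real casting rules and error handling may differ.

```python
# Hypothetical stand-in for the checks described in the table above;
# this is NOT the ng6/jflow implementation.
CASTS = {"str": str, "int": int, "float": float}

def check_parameter(value, type_name="str", choices=(), default=None, required=False):
    """Return the cast value, or raise ValueError if a check fails."""
    if value is None:
        if required and default is None:
            raise ValueError("a value is required")
        value = default
    if value is None:
        return None
    cast_value = CASTS[type_name](value)
    if choices and cast_value not in choices:
        raise ValueError("%r is not one of %r" % (cast_value, list(choices)))
    return cast_value

sequencers = ["HiSeq2000", "ILLUMINA", "SLX", "SOLEXA", "454", "UNKNOWN"]
# No value given by the user: the default is used
print(check_parameter(None, choices=sequencers, default="HiSeq2000"))
# A string value cast to the requested type
print(check_parameter("42", type_name="int"))
```

A value outside the choices list, such as "MiSeq" here, makes the sketch raise a ValueError.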
The add_parameter_list() method takes the same arguments as add_parameter(). However, with this method the final user is allowed to enter multiple values for the parameter, and the object attribute self.parameter_name will be set as a Python list.
Just like parameters, inputs can be added to handle a single file or a list of files. The add_input_file() method forces the component to take one and only one file, whereas the add_input_file_list() method allows the component to take as many files as needed.
In the following example, an input list named reads is added to the workflow. The provided files are required and should be in fastq format. No file size limitation is specified.
self.add_input_file_list("reads",
"Which read files should be used",
file_format="fastq",
required=True)
There are two positional arguments: name and help. All other options are keyword options.
Name | Type | Required | Default value | Description |
---|---|---|---|---|
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name . |
help | string | true | None | The parameter help message. |
default | string | false | None | The default path value. |
file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create a customized format, refer to the Add a file format documentation. |
type | string | false | "inputfile" | The type can be "inputfile", "localfile", "urlfile" or "browsefile". An "inputfile" lets the final user provide a "localfile", a "urlfile" or a "browsefile". A "localfile" restricts the final user to a path to a file visible by ng6. A "urlfile" only lets the final user give a URL as input, whereas a "browsefile" forces the final user to upload a file from their own computer. This last option is only available from the GUI and is treated as a "localfile" from the command line. The whole upload process is handled by ng6. |
required | boolean | false | false | Whether or not the parameter is required. |
flag | string | false | None | The command line flag (if the value is None, the flag will be --name ). |
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
display_name | string | false | None | The parameter name that should be displayed on the final form. |
cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
argpos | integer | false | -1 | The parameter position in the command line. |
The add_input_file_list() method takes the same arguments as add_input_file(). However, with this method the component can take a list of files, and the object attribute self.parameter_name will be set as a Python list.
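The table above states that the file format is checked before the workflow runs. As a rough idea of what such a check could look like for a list of fastq inputs, here is a self-contained sketch. The functions looks_like_fastq and check_inputs are hypothetical and only inspect the first record; the checks actually performed by ng6 are not described here and may be stricter.

```python
import os
import tempfile

def looks_like_fastq(path):
    """Very rough check: the first record has an '@' header and a '+' separator."""
    with open(path) as handle:
        lines = [handle.readline() for _ in range(4)]
    return lines[0].startswith("@") and lines[2].startswith("+")

def check_inputs(paths, file_format="any"):
    """Return the paths as a list, raising if a file is missing or malformed."""
    for path in paths:
        if not os.path.isfile(path):
            raise IOError("missing input file: %s" % path)
        if file_format == "fastq" and not looks_like_fastq(path):
            raise ValueError("not a fastq file: %s" % path)
    return list(paths)

# Demo on a temporary file holding a single fastq record
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\nIIII\n")
reads = check_inputs([tmp.name], file_format="fastq")
os.remove(tmp.name)
print(len(reads))
```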
Just like inputs, outputs can be added to handle a single file or a list of files. The add_output_file() method forces the component to produce one and only one file, whereas the add_output_file_list() method allows the component to produce as many files as needed.
There are also two specific functions: add_output_file_endswith() and add_output_file_pattern().
In the following example, an output named databank is defined. The process has to produce the file, otherwise the workflow will fail. The file written to disk will have the same name as the one stored in the variable input_fasta.
self.add_output_file("databank",
"The indexed databank",
filename=os.path.basename(input_fasta))
The two positional arguments name
and help
are always present.
Name | Type | Required | Default value | Description |
---|---|---|---|---|
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name . |
help | string | true | None | The parameter help message. |
filename | string | false | None | The expected name of the output file. |
file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create a customized format, refer to the Add a file format documentation. |
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
display_name | string | false | None | The parameter name that should be displayed on the final form. |
cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
argpos | integer | false | -1 | The parameter position in the command line. |
In the following example, an output named sam_files is defined. The files written to disk all follow the pattern {basename_woext}.sam applied to the self.reads variable: the resulting list gathers files with the same basename as those in self.reads, but with the extension replaced by ".sam".
self.add_output_file_list("sam_files",
"The BWA output files",
pattern='{basename_woext}.sam',
items=self.reads)
The two positional arguments name
and help
are always present.
Name | Type | Required | Default value | Description |
---|---|---|---|---|
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name . |
help | string | true | None | The parameter help message. |
items | list | false | None | A list of elements through which add_output_file_list() should iterate to produce the output list. |
pattern | string | false | '{basename_woext}.out' | The pattern is used to produce the output list. add_output_file_list() maps the pattern on each item to produce the final value. The pattern accepts predefined values such as {basename_woext}. |
file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create a customized format, refer to the Add a file format documentation. |
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
display_name | string | false | None | The parameter name that should be displayed on the final form. |
cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
argpos | integer | false | -1 | The parameter position in the command line. |
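The way a pattern is mapped onto each item can be sketched in a few lines of plain Python. The map_pattern function below is a hypothetical illustration, not the jflow implementation: it only supports the {basename_woext} value shown in the example above and returns bare file names, whereas the real method presumably places the files in the component output directory.

```python
import os

def map_pattern(pattern, items):
    """Apply a pattern such as '{basename_woext}.sam' to each item."""
    outputs = []
    for item in items:
        basename = os.path.basename(item)
        # Drop the last extension only: 'splA.fastq' becomes 'splA'
        basename_woext = os.path.splitext(basename)[0]
        outputs.append(pattern.format(basename_woext=basename_woext))
    return outputs

# Hypothetical input paths
reads = ["/data/splA.fastq", "/data/splB.fastq"]
print(map_pattern("{basename_woext}.sam", reads))
# ['splA.sam', 'splB.sam']
```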
In the following example, an output named sam_files is defined. The list will be made of all the files written to disk with an extension equal to .sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.
self.add_output_file_endswith("sam_files",
"The BWA output files",
pattern='.sam')
The add_output_file_endswith() method is very different from add_output_file_list() because it should only be used when the number of output files returned by the component is unknown. Three options are required: name, help and pattern.
Name | Type | Required | Default value | Description |
---|---|---|---|---|
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name . |
help | string | true | None | The parameter help message. |
pattern | string | true | None | The extension of the files to return. |
behaviour | string | false | "include" | How to process the selected files. Any value other than "include" means that all files not ending with the pattern will be selected. |
file_format | string | false | "any" | The file format is checked before running the workflow. Available formats are "any", "bam", "fasta", "fastq", and "sff". To create a customized format, refer to the Add a file format documentation. |
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both command line and GUI. |
display_name | string | false | None | The parameter name that should be displayed on the final form. |
cmd_format | string | false | "" | The command format is the parameter skeleton required to build the final command line. |
argpos | integer | false | -1 | The parameter position in the command line. |
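The effect of the pattern and behaviour options can be pictured with the following sketch. The select_endswith function is a hypothetical helper working on a plain list of file names; in practice the selection is made by ng6 on the files found on disk once the execution is over.

```python
def select_endswith(filenames, pattern, behaviour="include"):
    """Keep the names ending with pattern ("include") or all the other ones."""
    if behaviour == "include":
        return [name for name in filenames if name.endswith(pattern)]
    # Any other behaviour value selects the files NOT ending with the pattern
    return [name for name in filenames if not name.endswith(pattern)]

# Hypothetical files produced by a component
produced = ["splA.sam", "splA.log", "splB.sam"]
print(select_endswith(produced, ".sam"))
# ['splA.sam', 'splB.sam']
print(select_endswith(produced, ".sam", behaviour="exclude"))
# ['splA.log']
```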
In the following example, an output named sam_files is defined. The list will be made of all the files written to disk matching the pattern *_R1.sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.
self.add_output_file_pattern("sam_files",
"The BWA output files",
pattern='*_R1.sam')
The add_output_file_pattern() method is slightly different from add_output_file_endswith() because it returns all the files matching the given pattern instead of only those with a given extension. All the options are the same; only pattern differs.
Name | Type | Required | Default value | Description |
---|---|---|---|---|
pattern | string | true | None | The regexp string used to retrieve the output files. |
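The table describes the pattern as a regexp, while the example value *_R1.sam reads like a shell-style wildcard. The sketch below assumes wildcard matching and uses the standard fnmatch module; this is an assumption about the matching rules, and select_pattern is a hypothetical helper, not the ng6 code.

```python
import fnmatch

def select_pattern(filenames, pattern):
    """Return, sorted, the names matching a shell-style wildcard pattern."""
    return sorted(fnmatch.filter(filenames, pattern))

# Hypothetical files produced by a component
produced = ["splA_R1.sam", "splA_R2.sam", "splB_R1.sam", "run.log"]
print(select_pattern(produced, "*_R1.sam"))
# ['splA_R1.sam', 'splB_R1.sam']
```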
The define_analysis() method is in charge of describing the analysis: its name, its description, its options and so on.
Here is an example describing a fastqc analysis; it defines the name, description, software and options used:
def define_analysis(self):
    self.name = "ReadsStats"
    self.description = "Statistics on reads and their qualities."
    self.software = "fastqc"
    self.options = " --nogroup --casava"
The define_analysis()
method does not take any option.
This table presents the attributes that can be created in define_analysis()
. All attributes must be strings.
Name | Description |
---|---|
self.name | Give a name to your analysis. |
self.description | Give a description to your analysis. |
self.software | Which software was used for this analysis. |
self.options | Which software options were used for this analysis. |
The get_version() method must return a string representing the version of this analysis.
This example presents the get_version() method of the fastqc analysis:
from subprocess import Popen, PIPE

def get_version(self):
    # run "fastqc --version" and keep the second word of its output
    cmd = [self.get_exec_path("fastqc"), "--version"]
    p = Popen(cmd, stdout=PIPE, stderr=PIPE)
    stdout, stderr = p.communicate()
    return stdout.split()[1]
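Since get_version() must return a string, the raw output of the command may need some care: under Python 3, Popen.communicate() returns bytes unless a text mode is requested. The helper below is a hypothetical sketch of a more defensive parsing step; the sample output "FastQC v0.11.9" is an assumed example, not taken from this documentation.

```python
def parse_version(raw_output):
    """Return the second whitespace-separated token of a version banner as text."""
    if isinstance(raw_output, bytes):
        raw_output = raw_output.decode("utf-8")
    fields = raw_output.split()
    if len(fields) < 2:
        raise ValueError("unexpected version output: %r" % raw_output)
    return fields[1]

# Assumed sample output of a "--version" call
print(parse_version(b"FastQC v0.11.9\n"))
# v0.11.9
```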
The process() method specifies the executables used to process the data (a command line or a Python function) and defines the pattern of execution that determines how the functions are applied to the data, called an abstraction below. To build the process, ng6 provides two main functions, ShellFunction and PythonFunction, and two main abstractions: Map and MultiMap.
process() can be omitted. For analyses and components with simple command lines, NG6 can build the process() method automatically.
In this case, the argpos and cmd_format options must be provided by the developer for each parameter.
Two other methods must also be overloaded: get_command(), which must return the execution path, and get_abstraction(), which returns the name of the abstraction to use.
The two provided functions allow the developer to specify the executables used to process the data.
The ShellFunction can be called when the workflow needs to run an external command line. This function defines the command line structure so that ng6 can build and run it automatically on the final user inputs.
Consider the following blastall command line:
blastall -p [program_name] -i [query_file] -d [database] -o [file_out]
When using an ng6 function, the command format has to be given in order to set the order of the inputs, outputs and arguments.
Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. jflow will then consider the following order of inputs and outputs: query_file, database and then file_out, resulting in the following command structure:
blastall -p [program_name] -i [$1] -d [$2] -o [$3]
The ShellFunction can then be applied as follows:
blast = ShellFunction("blastall -p blastn -i $1 -d $2 -o $3", cmd_format="{EXE} {IN} {OUT}")
It can be executed by calling the newly created function:
blast( inputs=[query_file, database], outputs=[file_out] )
Name | Type | Required | Default value | Description |
---|---|---|---|---|
source | string | true | None | The command line structure defining inputs, outputs and arguments positions. |
shell | string | false | "sh" | Which shell should be used to interpret the command line. The value can be "sh", "ksh", "bash", "csh" or "tcsh". |
cmd_format | string | false | '{EXE} {ARG} {IN} > {OUT}' | The cmd_format supports fields such as {EXE}, {ARG}, {IN} and {OUT}, as used in the default value. |
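To make the positional substitution of the blastall example more concrete, the sketch below rebuilds the final command line in plain Python. It assumes the "{EXE} {IN} {OUT}" ordering, i.e. inputs are numbered first and outputs next. The build_command function and the file names are hypothetical; jflow performs this step internally and its implementation may differ.

```python
import re

def build_command(source, inputs, outputs):
    """Replace $1, $2, ... by the inputs followed by the outputs."""
    values = list(inputs) + list(outputs)

    def substitute(match):
        # $1 refers to the first value, $2 to the second one, and so on
        return values[int(match.group(1)) - 1]

    return re.sub(r"\$(\d+)", substitute, source)

cmd = build_command("blastall -p blastn -i $1 -d $2 -o $3",
                    inputs=["query.fasta", "nt"], outputs=["hits.txt"])
print(cmd)
# blastall -p blastn -i query.fasta -d nt -o hits.txt
```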
The PythonFunction can be called when the workflow needs to run internal Python code defined in a Python function. This function defines how the Python function should be called so that ng6 can run it automatically on the final user inputs.
Consider a function named fastq2fasta defined by:
def fastq2fasta(fastq_file, fasta_file):
    # python code goes here
    pass
When using a jflow function, the command format has to be given in order to set the order of the inputs, outputs and arguments.
Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. ng6 will then pass the variables to the function in the following order: fastq_file and then fasta_file. The PythonFunction can be used as follows:
fastq2a = PythonFunction(fastq2fasta, cmd_format="{EXE} {IN} {OUT}")
It can be executed by calling the newly created function:
fastq2a( inputs=[fastq_file], outputs=[fasta_file] )
Name | Type | Required | Default value | Description |
---|---|---|---|---|
function | function | true | None | The Python function used to process the data. |
add_path | string | false | None | A path to a Python library required to run the function. This is useful when the library is not in the path and not visible by jflow. |
cmd_format | string | false | '{EXE} {ARG} {IN} > {OUT}' | The cmd_format supports fields such as {EXE}, {ARG}, {IN} and {OUT}, as used in the default value. |
The abstraction defines the pattern of execution that determines how the functions (ShellFunction or PythonFunction) are applied to the data.
This first abstraction is executed when calling the ShellFunction or the PythonFunction as a basic Python function.
fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
fasta_trim( inputs="splA.fasta", outputs="splA_trim.fasta" )
This abstraction will lead to the execution of the following command line:
fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta
Name | Type | Required | Default value | Description |
---|---|---|---|---|
inputs | list or string | true | None | The input files list required to run the function. |
outputs | list or string | false | None | The output files list created by the function. |
arguments | string | false | None | The arguments to provide to the function. |
includes | list or string | false | None | Files to include for this task. |
collect | boolean | false | false | Whether or not to mark files for garbage collection. |
local | boolean | false | false | Whether or not to force local execution. |
The Map abstraction maps one input file list to one output file list.
fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_list = ["splA_trim.fasta", "splB_trim.fasta", "splC_trim.fasta"]
fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
Map( fasta_trim, inputs=fasta_list, outputs=out_list )
This abstraction will lead to the execution of the following command lines:
fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta
fastaTrim.pl --length 50 splB.fasta > splB_trim.fasta
fastaTrim.pl --length 50 splC.fasta > splC_trim.fasta
Name | Type | Required | Default value | Description |
---|---|---|---|---|
function | function | true | None | The ShellFunction or the PythonFunction to use to process the data. |
inputs | list or string | true | None | The input files list required to run the function. |
outputs | list or string | false | None | The output files list created by the function. |
includes | list or string | false | None | Files to include for each task. |
collect | boolean | false | false | Whether or not to mark files for garbage collection. |
local | boolean | false | false | Whether or not to force local execution. |
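The pairing done by Map can be pictured as a simple loop over the two lists. The expand_map function below is a hypothetical, standalone sketch that only rebuilds the command lines shown in the example above; the real abstraction also schedules and runs the resulting tasks.

```python
def expand_map(source, inputs, outputs):
    """Build one command line per (input, output) pair."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    return [source.replace("$1", in_file).replace("$2", out_file)
            for in_file, out_file in zip(inputs, outputs)]

fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_list = ["splA_trim.fasta", "splB_trim.fasta", "splC_trim.fasta"]
commands = expand_map("fastaTrim.pl --length 50 $1 > $2", fasta_list, out_list)
for command in commands:
    print(command)
```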
The MultiMap abstraction maps one or more input file lists to one or more output file lists, all of the same length.
fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_list = [ ["splA.fasta", "splB.fasta","splC.fasta"],
[ "splA.qual", "splB.qual","splC.qual" ] ]
fastq2fasta = ShellFunction( "fastq2fasta.py --input $1 --fasta $2 --qual $3",
cmd_format="{EXE} {IN} {OUT}" )
MultiMap( fastq2fasta, inputs=fastq_list, outputs=out_list )
or
fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_fasta_list = ["splA.fasta", "splB.fasta","splC.fasta"]
out_qual_list = [ "splA.qual", "splB.qual","splC.qual" ]
fastq2fasta = ShellFunction( "fastq2fasta.py --input $1 --fasta $2 --qual $3",
cmd_format="{EXE} {IN} {OUT}" )
MultiMap( fastq2fasta, inputs=fastq_list, outputs=[out_fasta_list,out_qual_list] )
This abstraction will lead to the execution of the following command lines:
fastq2fasta.py --input splA.fastq --fasta splA.fasta --qual splA.qual
fastq2fasta.py --input splB.fastq --fasta splB.fasta --qual splB.qual
fastq2fasta.py --input splC.fastq --fasta splC.fasta --qual splC.qual
Name | Type | Required | Default value | Description |
---|---|---|---|---|
function | function | true | None | The ShellFunction or the PythonFunction to use to process the data. |
inputs | list or string | true | None | The input files list required to run the function. |
outputs | list or string | false | None | The output files list created by the function. |
includes | list or string | false | None | Files to include for each task. |
collect | boolean | false | false | Whether or not to mark files for garbage collection. |
local | boolean | false | false | Whether or not to force local execution. |
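With MultiMap, the i-th element of every list ends up in the same command line. The sketch below reproduces this grouping for the case shown above (one input list and several output lists). The expand_multimap function is a hypothetical illustration, not the jflow code.

```python
def expand_multimap(source, inputs, outputs):
    """Build one command per index from one input list and several output lists."""
    if any(len(out_list) != len(inputs) for out_list in outputs):
        raise ValueError("all lists must have the same length")
    commands = []
    # zip(*outputs) groups the i-th file of every output list together
    for in_file, out_files in zip(inputs, zip(*outputs)):
        values = [in_file] + list(out_files)
        command = source
        # Substitute from the highest position down so "$1" never clobbers "$10"
        for position in range(len(values), 0, -1):
            command = command.replace("$%d" % position, values[position - 1])
        commands.append(command)
    return commands

fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_list = [["splA.fasta", "splB.fasta", "splC.fasta"],
            ["splA.qual", "splB.qual", "splC.qual"]]
commands = expand_multimap("fastq2fasta.py --input $1 --fasta $2 --qual $3",
                           fastq_list, out_list)
for command in commands:
    print(command)
```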
The post_process() method is executed when process() is done, i.e. when the output files have been generated.
Information can be added to the database using _add_result_element() inside post_process().
If the analysis generated files of interest that have to be used in the interface, the _save_file() method can be used to save a specific file in the storage directory.
Information added using _add_result_element() becomes available for display.
The _add_result_element() method is used to add a result entry in the database.
The following example presents the define_parameters() and post_process() methods of an analysis.
The analysis creates outputs, which contains a list of fastq files generated by sampling the input fastq files.
Each input fastq file corresponds to a sample. In post_process(), the number of sequences in each input fastq file and in the corresponding sampled file is computed and stored in the database.
def define_parameters(self, input_fastq):
    self.add_input_file_list( "input_fastq", "input_fastqs", default=input_fastq)
    self.add_output_file_list( "outputs", "outputs", pattern='{basename_woext}.sample',
                               items=input_fastq)

def get_nbseq(self, fastq):
    # method which returns the number of sequences in a fastq file
    raise NotImplementedError

def post_process(self):
    # requires "import os" at the top of the module
    # the input list is stored under its parameter name: self.input_fastq
    for i, infastq in enumerate(self.input_fastq):
        outfastq = self.outputs[i]
        name = os.path.splitext(os.path.basename(infastq))[0]
        self._add_result_element(name, "nb_sequence", self.get_nbseq(infastq))
        self._add_result_element(name, "extracted_sequence", self.get_nbseq(outfastq))
Name | Type | Required | Default value | Description |
---|---|---|---|---|
file | string | true | None | The file name, sample or any entry the result is linked to. |
result_key | string | true | None | The result key. |
result_value | string | true | None | the result value associated to the key. |
result_group | string | false | default | the result group it belongs to. |
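The get_nbseq() helper is left unimplemented in the example above. One possible way to write it is sketched below as a standalone function. count_fastq_sequences is a hypothetical name, and the sketch assumes uncompressed fastq files with exactly four lines per record (no wrapped sequences).

```python
import os
import tempfile

def count_fastq_sequences(path):
    """Count the records of a fastq file, assuming four lines per record."""
    with open(path) as handle:
        nb_lines = sum(1 for _ in handle)
    return nb_lines // 4

# Demo on a temporary file holding two records
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
nb_sequences = count_fastq_sequences(tmp.name)
os.remove(tmp.name)
print(nb_sequences)
# 2
```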
The analysis results, added with the _add_result_element() method in the post_process() function, can be displayed using an HTML template file.
The template is an HTML file written with the PHP Smarty template engine and must be stored in the ui/nG6/pi1/analyzes/ folder with the same name as your Analysis Python file.
To add interaction with the user, a JavaScript file can also be written and associated with the template (same folder, same name).
The template file is an HTML file written in Smarty. NG6 provides a global template named AnalysisTemplate.tpl that should be extended. This template defines multiple blocks that can be redefined by any template inheriting from the NG6 parent template. The blocks that can be overloaded are:
- description_update: can be used to extend the description in the top panel of the analysis,
- results: results are displayed in this block. It is empty by default and has to be overloaded,
- params_content: contains the parameters used in the analysis,
- results_title: allows changing the title of the results tab,
- params_title: allows changing the title of the parameters tab.
All analysis results added with _add_result_element()
are available within the template in the {$analyse_results}
variable. The result added by _add_result_element("myelement", "nbseq", "120", "default")
, can be accessed within the
template by {$analyse_results["myelement"]["default"].nbseq}
.
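The access path used in the template suggests that results are organised by file, then by group, then by key. The sketch below mimics that layout with nested dictionaries. ResultStore is a hypothetical class written for illustration only; it does not reflect how ng6 actually stores results in its database.

```python
from collections import defaultdict

class ResultStore:
    """Hypothetical in-memory stand-in for the results seen by the template."""

    def __init__(self):
        # results[file][result_group][result_key] = result_value
        self.results = defaultdict(lambda: defaultdict(dict))

    def add_result_element(self, file, result_key, result_value, result_group="default"):
        self.results[file][result_group][result_key] = str(result_value)

store = ResultStore()
store.add_result_element("myelement", "nbseq", "120", "default")
# Same path as {$analyse_results["myelement"]["default"].nbseq} in the template
print(store.results["myelement"]["default"]["nbseq"])
# 120
```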
This example takes the post_process() function defined previously and writes a template file that creates a table with the number of sequences and the number of extracted sequences for each sample. The template overrides the results and results_title blocks. The NG6 interface uses Bootstrap, so all Bootstrap classes are available from a template.
The jQuery DataTables plugin can also be used. The dataTable CSS class is used to format the results table.
def post_process(self):
    # the input list is stored under its parameter name: self.input_fastq
    for i, infastq in enumerate(self.input_fastq):
        outfastq = self.outputs[i]
        name = os.path.splitext(os.path.basename(infastq))[0]
        self._add_result_element(name, "nb_sequence", self.get_nbseq(infastq))
        self._add_result_element(name, "extracted_sequence", self.get_nbseq(outfastq))
{extends file="AnalysisTemplate.tpl"}
{block name=results_title} Fastq results {/block}
{block name=results}
<table class="table table-striped table-bordered dataTable analysis-result-table">
<thead>
<tr>
<th class="string-sort" >Sample</th>
<th class="string-sort" >Number of sequences</th>
<th class="string-sort" >Number of extracted sequences</th>
</tr>
</thead>
<tbody>
{foreach from=$analyse_results key=sample item=sample_results}
<tr>
<td> {$sample|get_description:$descriptions}</td>
<td>{$sample_results["default"].nb_sequence}</td>
<td>{$sample_results["default"].extracted_sequence}</td>
</tr>
{/foreach}
</tbody>
</table>
{/block}
It is also possible to add a JavaScript file to make the template more dynamic (for example, adding button actions). To do so, the developer can write a JavaScript file with the same name as the Analysis class.
The previous template file has been updated to add a checkbox for each sample and a button in the table footer. In the JavaScript file, jQuery is used to retrieve the selected lines.
{extends file="AnalysisTemplate.tpl"}
{block name=results_title} Fastq results {/block}
{block name=results}
<table class="table table-striped table-bordered dataTable analysis-result-table">
<thead>
<tr>
<th></th>
<th class="string-sort" >Sample</th>
<th class="string-sort" >Number of sequences</th>
<th class="string-sort" >Number of extracted sequences</th>
</tr>
</thead>
<tbody>
{foreach from=$analyse_results key=sample item=sample_results name=samples}
<tr>
<td><center><input type="checkbox" id="chk_sample_{$smarty.foreach.samples.index}"
value="sample"/></center></td>
<td> {$sample|get_description:$descriptions}</td>
<td>{$sample_results["default"].nb_sequence}</td>
<td>{$sample_results["default"].extracted_sequence}</td>
</tr>
{/foreach}
</tbody>
<tfoot>
<tr>
<th align="left" colspan="4">
<button type="button" class="btn btn-default"
id="my_button">click</button>
</th>
</tr>
</tfoot>
</table>
{/block}
$(function () {
$("#my_button").click(function() {
var samples = new Array() ;
$("input[id^='chk_sample_']:checked").each( function() {
samples.push( parseInt($(this).attr("id").split('_')[2]) ) ;
}) ;
// Do something with selected samples
}) ;
});