Step 3: Run GEMmaker¶
How to Launch GEMmaker¶
To demonstrate how to use GEMmaker, the Arabidopsis thaliana reference genome available from Ensembl Plants was prepared in Step 2. As an example, we will specify three SRA runs for automatic retrieval and processing by listing them in a file named SRAs.txt:
SRR1058270
SRR1058271
SRR1058272
If you followed the example in the previous step you should have the reference genome already indexed.
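One way to create the SRAs.txt file from the command line is sketched below. It uses only the three run IDs listed above; the line-count check at the end is an optional convenience and not something GEMmaker requires.

```shell
# Write the three example run IDs to SRAs.txt, one per line.
cat > SRAs.txt <<'EOF'
SRR1058270
SRR1058271
SRR1058272
EOF

# Count the lines as a quick sanity check before launching the workflow.
run_count=$(wc -l < SRAs.txt | tr -d ' ')
echo "SRAs.txt lists $run_count runs"
```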
Note
For the examples on this page, Singularity will be used. Singularity will automatically retrieve the GEMmaker Docker images and by default will store them in the work folder that Nextflow creates. However, Nextflow may warn that a cache directory is not set. If you intend to run GEMmaker multiple times, you may wish to designate a permanent cache directory by setting the NXF_SINGULARITY_CACHEDIR environment variable prior to running GEMmaker. You can learn more at the nf-core tools page.
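A minimal sketch of setting the cache directory in the current shell session follows. The directory name singularity_cache under your home directory is an assumption made for illustration; any location with enough free space should work.

```shell
# Hypothetical cache location; substitute a directory that suits your system.
cache_dir="$HOME/singularity_cache"
mkdir -p "$cache_dir"

# Nextflow reads this variable when it pulls Singularity images.
export NXF_SINGULARITY_CACHEDIR="$cache_dir"
echo "Singularity images will be cached in $NXF_SINGULARITY_CACHEDIR"
```

The export only lasts for the current session. To make it permanent you could add the export line to your shell startup file.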
Use Kallisto¶
To run Kallisto you need to specify:
- The path to the genome reference indexed file
- A file containing a set of SRA run IDs you want to download or the path where FASTQ files are stored on the local system.
For example:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--sras SRAs.txt
Use Salmon¶
To run Salmon you need to specify:
- The path to the directory containing the genome reference index files.
- A file containing a set of SRA run IDs you want to download or the path where FASTQ files are stored on the local system.
For example:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline salmon \
--salmon_index_path Arabidopsis_thaliana.TAIR10.salmon.indexed \
--sras SRAs.txt
Use Hisat2¶
To run Hisat2 you need to specify:
- The path to the directory containing the Hisat2 genome reference indexed files.
- The base name of the whole genome. All Hisat2 index files use this base name. For this example, the base name used is Arabidopsis_thaliana.TAIR10.
- The GTF file containing the gene annotations.
- A file containing a set of SRA run IDs you want to download or the path where FASTQ files are stored on the local system.
For example:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline hisat2 \
--sras SRAs.txt \
--hisat2_base_name Arabidopsis_thaliana.TAIR10 \
--hisat2_index_dir hisat2_indexes \
--hisat2_gtf_file Arabidopsis_thaliana.TAIR10.gtf
Additionally, you can control the Trimmomatic trimming step by adding any of the following parameters:
--trimmomatic_clip_file
: the location of a custom file of sequences to clip. GEMmaker provides a default version, so you only need to set this if you have custom sequences.
--trimmomatic_MINLEN
: corresponds to the MINLEN argument of Trimmomatic. Defaults to 0.7.
--trimmomatic_SLIDINGWINDOW
: corresponds to the SLIDINGWINDOW argument of Trimmomatic. Defaults to “4:15”.
--trimmomatic_LEADING
: corresponds to the LEADING argument of Trimmomatic. Defaults to 3.
--trimmomatic_TRAILING
: corresponds to the TRAILING argument of Trimmomatic. Defaults to 6.
Use STAR¶
To run STAR you need to specify:
- The path to the directory containing the STAR genome reference indexed files.
- The GTF file containing the gene annotations.
- A file containing a set of SRA run IDs you want to download or the path where FASTQ files are stored on the local system.
For example:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline star \
--sras SRAs.txt \
--star_index_dir Arabidopsis_thaliana.TAIR10.STAR.indexed \
--star_gtf_file Arabidopsis_thaliana.TAIR10.gtf
Additionally, you can control the Trimmomatic trimming step by adding any of the following parameters:
--trimmomatic_clip_file
: the location of a custom file of sequences to clip. GEMmaker provides a default version, so you only need to set this if you have custom sequences.
--trimmomatic_MINLEN
: corresponds to the MINLEN argument of Trimmomatic. Defaults to 0.7.
--trimmomatic_SLIDINGWINDOW
: corresponds to the SLIDINGWINDOW argument of Trimmomatic. Defaults to “4:15”.
--trimmomatic_LEADING
: corresponds to the LEADING argument of Trimmomatic. Defaults to 3.
--trimmomatic_TRAILING
: corresponds to the TRAILING argument of Trimmomatic. Defaults to 6.
Use Local FASTQ Files¶
If your FASTQ files are local to your computer you must provide the --input
argument when launching Nextflow and indicate the GLOB pattern that is needed to find the files:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--input "../../01-input_data/RNA-seq/fastq/*{1,2}.fastq"
In the example above the --input
argument indicates that FASTQ files are found in the ../../01-input_data/RNA-seq/fastq/
directory and GEMmaker should use all files that match the GLOB pattern *{1,2}.fastq
.
Note
GEMmaker currently expects that all FASTQ files have a 1 or 2 suffix. Paired files must be two files with the same name, one with each suffix.
Use Both Local and SRA Files¶
You can combine data from the NCBI SRA with local files in a single run of GEMmaker by providing both the --sras
and --input
arguments.
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--input "../../01-input_data/RNA-seq/fastq/*{1,2}.fastq" \
--sras SRAs.txt
Using Paired-End Local Data¶
If your data is paired-end you must provide a GLOB pattern for the --input
argument that can distinguish between the sample name and the suffix that indicates the pair. Usually, paired-files have a 1.fastq
or 2.fastq
suffix on all file names. Therefore, the GLOB pattern given in the example above is appropriate: *{1,2}.fastq. The {1,2} indicates where the 1 and 2 appear in the file name. However, if your files are named differently, be sure to use a GLOB pattern that can differentiate the pairs.
Warning
If the GLOB you provide cannot distinguish between pairs then GEMmaker will treat them as non-paired.
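Before launching a run, it can help to confirm that every 1.fastq file has a matching 2.fastq mate. The sketch below is only a local pre-flight check under the naming convention described above. It does not reproduce how GEMmaker pairs files internally, and the demo directory and sample file names are hypothetical placeholders. Files that have only a 2.fastq mate are not reported by this check.

```shell
# Hypothetical demo directory holding empty placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_1.fastq" "$demo_dir/sampleA_2.fastq" "$demo_dir/sampleB_1.fastq"

paired=0
unpaired=0
for r1 in "$demo_dir"/*1.fastq; do
  # Strip the trailing "1.fastq" to recover the shared file name prefix.
  prefix=${r1%1.fastq}
  if [ -e "${prefix}2.fastq" ]; then
    echo "paired: $(basename "$prefix")"
    paired=$((paired + 1))
  else
    echo "no mate found: $(basename "$r1")"
    unpaired=$((unpaired + 1))
  fi
done
echo "$paired paired sample(s), $unpaired without a mate"
```

To check your own data, point demo_dir at your FASTQ directory and drop the touch line.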
Using Non Paired-End Local Data¶
If your data is not paired-end then the GLOB pattern for the --input
argument simply needs to find all of the FASTQ files. For example, if your FASTQ files have a .fastq
suffix, the following GLOB would be appropriate: "*.fastq".
Using Both Paired-End and Non Paired Local Data¶
GEMmaker can work with both paired and non-paired data in the same data set. The only stipulation is that the non-paired data must follow the same naming convention as the paired data. See the section Using Paired-End Local data. For example, if your paired files have a 1.fastq
and 2.fastq
extension, then the non-paired files should have a 1.fastq
suffix as well.
Resuming After Failure¶
If GEMmaker fails to complete and Nextflow reports an error, you can resume execution of the workflow, after correcting any problems, by passing the -resume
flag to GEMmaker. For example, to resume a failed Kallisto run:
nextflow run systemsgenetics/gemmaker -profile singularity \
-resume \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--sras SRAs.txt
GEMmaker should resume processing of samples without starting over.
Skipping Samples¶
You may find that a sample is problematic. It may be corrupt, fail to align, or have other problems that cause GEMmaker to fail. For such samples, you have two options. You can either remove the bad samples and restart GEMmaker, or you can resume as described in the previous section. Before resuming, add the sample names to a new file, one per line. Then use the --skip_samples
argument to tell GEMmaker about this file. For example:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--sras SRAs.txt \
--skip_samples samples2skip.txt
In the example above any samples that should be skipped should be added to the samples2skip.txt
file.
Warning
Note that when you provide SRA IDs to GEMmaker you provide the run IDs, but multiple run IDs can be contained in a single sample. To skip a sample, you must provide the sample ID. For SRA, these begin with the prefix SRX, DRX or ERX, whereas run IDs begin with SRR, DRR or ERR.
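Because run IDs and sample IDs are easy to confuse, a quick check of the skip file can catch mistakes before a long run. The sketch below applies only to samples retrieved from SRA, since local samples may have arbitrary names. The ID SRX0000001 is a made-up placeholder, and the run ID is included deliberately so that the check has something to flag.

```shell
# Demo content: one placeholder sample ID and one run ID that should be flagged.
cat > samples2skip.txt <<'EOF'
SRX0000001
SRR1058270
EOF

# Collect lines that do not look like SRX, DRX or ERX sample IDs.
bad_ids=$(grep -Ev '^(SRX|DRX|ERX)[0-9]+$' samples2skip.txt || true)
if [ -n "$bad_ids" ]; then
  echo "These entries are not SRA sample IDs:"
  echo "$bad_ids"
fi
```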
Running on a Cluster¶
If you want to run GEMmaker on a local High Performance Computing Cluster (HPC) that uses a scheduler such as SLURM or PBS, you must first create a configuration file to help GEMmaker know how to submit jobs. The file should be named nextflow.config
and be placed in the same directory where you are running GEMmaker. Below is an example nextflow.config
file for executing GEMmaker on a cluster that uses the SLURM scheduler.
profiles {
    my_cluster {
        process {
            executor = "slurm"
            queue = "<queue name>"
            clusterOptions = ""
        }
        executor {
            queueSize = 120
        }
    }
}
In the example above we created a new profile named my_cluster. The executor setting specifies the type of cluster we will be running on, in this case a SLURM cluster. Additional executor options for other HPC cluster types can be found in the Nextflow executor documentation.
Within the stanza, the placeholder text <queue name> should be replaced with the name of the queue to which you are allowed to submit jobs. maxRetries, which is not set in the example above, indicates how many times Nextflow will attempt to perform a process before giving an error. It is an important option if you are running on the backfill queue of an HPC where you have the potential to be pre-empted by other jobs. If you need to provide specific options that you would normally provide in a SLURM submission script (such as an account or other node targeting settings), you can use the clusterOptions setting.
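As a rough sketch of how maxRetries could be added to such a profile, the commands below write a nextflow.config that also sets errorStrategy to "retry". Both are standard Nextflow process directives, but the retry count of 3 is an arbitrary assumption and the queue placeholder still has to be filled in for your cluster. Note that this overwrites any nextflow.config already present in the current directory.

```shell
# Write an example nextflow.config; edit the queue name before using it.
cat > nextflow.config <<'EOF'
profiles {
    my_cluster {
        process {
            executor = "slurm"
            queue = "<queue name>"
            clusterOptions = ""
            // Assumed values: resubmit a failed task up to 3 times.
            errorStrategy = "retry"
            maxRetries = 3
        }
        executor {
            queueSize = 120
        }
    }
}
EOF
echo "Wrote nextflow.config"
```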
Next is an example SLURM submission script for submitting a job to run GEMmaker. Please note that this is just an example and your specific cluster may require slightly different configuration or usage. The script assumes your cluster uses the lmod system for loading software.
#!/bin/sh
#SBATCH --partition=<queue_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=10:00:00
#SBATCH --job-name=GEMmaker
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
module add java nextflow singularity
nextflow run systemsgenetics/gemmaker \
-profile my_cluster,singularity \
-resume \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--sras SRA_IDs.txt \
--max_cpus 120
Notice that in the call to nextflow the profile my_cluster has been added along with singularity. Also, the --max_cpus argument has been set to the same value as the queueSize value in the config file. The default value of --max_cpus is 4, and the workflow will not expand beyond 4 CPUs unless it is increased to match the config file.
Intermediate Files¶
GEMmaker was designed to limit storage requirements so that large numbers of FASTQ files can be processed without overrunning available storage. By default it removes all large intermediate files to keep space usage to a minimum. However, you can indicate which intermediate files you would like to keep by providing any of the following arguments and setting them to true. For example, to keep the downloaded SRA files, the --keep_sra argument would be provided and set to true:
nextflow run systemsgenetics/gemmaker -profile singularity \
--pipeline salmon \
--salmon_index_path Arabidopsis_thaliana.TAIR10.salmon.indexed \
--sras SRAs.txt \
--keep_sra true
The following is a listing of all arguments that can control which intermediate files are kept.
SRA Files¶
The following arguments can be used if the --sras
option is used.
--keep_sra
: Set to true to keep all downloaded SRA files.
--keep_retrieved_fastq
: Set to true to keep the FASTQ files that are derived from downloaded SRA files.
Kallisto Files¶
The following arguments can be used if the --pipeline kallisto
option is used.
--kallisto_keep_data
: Set to true to keep the intermediate files created by Kallisto.
Salmon Files¶
The following arguments can be used if the --pipeline salmon
option is used.
--salmon_keep_data
: Set to true to keep the intermediate files created by Salmon.
Hisat2 Files¶
The following arguments can be used if the --pipeline hisat2
option is used.
--hisat2_keep_data
: Set to true to keep the stringtie output.
--hisat2_keep_sam
: Set to true to keep the SAM files created by Hisat2.
--hisat2_keep_bam
: Set to true to keep the BAM files created by Hisat2.
--trimmomatic_keep_trimmed_fastq
: Set to true to keep the trimmed FASTQ files after Trimmomatic is run.
Configuration¶
The instructions above provide details for running GEMmaker using Singularity. In most cases you probably won’t need to customize the workflow configuration. However, should you need to, GEMmaker is an nf-core compatible workflow and therefore follows the general approach for workflow configuration described at the nf-core Pipeline Configuration page. Please see those instructions for the various platforms and settings you can configure. Below are some quick tips for tweaking GEMmaker.
In all cases, if you need to set some customizations you must first create a configuration file. The file should be named nextflow.config
and be placed in the same directory where you are running GEMmaker.
Configuration for a Cluster¶
To run GEMmaker on a computational cluster you will need to create a custom configuration. Instructions and examples are provided in the Running on a Cluster section.
Increasing Resources¶
You may find that default resources are not adequate for the size of your data set. You can alter resources requested for each step of the GEMmaker workflow by using the withLabel
scope selector in a custom nextflow.config
file.
For example, if you have thousands of SRA data sets to process, you may need more memory allocated to the retrieve_sra_metadata
step of the workflow. All steps in the workflow have a “label” that you can use to indicate the step whose resources should be changed. Below is an example nextflow.config file in which a new profile named custom increases the memory for the retrieve_sra_metadata step.
profiles {
    custom {
        process {
            withLabel:retrieve_sra_metadata {
                memory = "10.GB"
            }
        }
    }
}
This new custom
profile can be used when calling GEMmaker. The following is an example Kallisto run of GEMmaker using the custom and singularity profiles:
nextflow run systemsgenetics/gemmaker -profile custom,singularity \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--sras SRAs.txt
Nextflow provides many “directives”, such as memory
that you can use to alter or customize the resources of any step (or process) in the workflow. You can find more about these in the Nextflow documentation. Some useful directives are:
- memory: change the amount of memory allocated to the step.
- time: change the amount of time allocated to the step.
- disk: defines how much local storage is required.
- cpus: defines how many threads (or CPUs) the task can use.
The “labels” that GEMmaker provides, and for which you can set custom directives, include:
retrieve_sra_metadata
: For the step that retrieves metadata from the NCBI web services for the SRR run IDs that were provided. This step can require more memory than the defaults if there are huge numbers of samples.
download_runs
: For the step that downloads SRA files from NCBI.
fastq_dump
: For the step that converts downloaded SRA files to FASTQ files.
fastqc
: For the step that runs the FastQC program, which generates quality reports on FASTQ files.
kallisto
: For the step that runs the Kallisto tool.
salmon
: For the step that runs the Salmon tool.
trimmomatic
: For the step that runs Trimmomatic, which only runs when hisat2 is the desired pipeline.
hisat2
: For the step that runs the Hisat2 tool.
samtools
: For the step that runs the samtools tool after Hisat2. This step only runs when the hisat2 pipeline is used.
stringtie
: For the step that runs the stringtie tool, which only runs when the hisat2 pipeline is used.
multiqc
: For the step that generates the MultiQC results summary report.
create_gem
: For the step that creates the final GEM files.
multithreaded
: For all of the tools that support multithreading, you can use this label to set a default number of CPUs using the cpus directive. These tools include Salmon, Kallisto, Trimmomatic, Hisat2 and Stringtie. By using this label you set the same number of cpus for all multithreaded steps at once.
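To illustrate the multithreaded label, the sketch below writes a nextflow.config whose custom profile sets the cpus directive for every multithreaded step at once. The value of 8 is an arbitrary assumption. Since --max_cpus defaults to 4, it may also be necessary to raise that argument when launching the run. This overwrites any nextflow.config already present in the current directory, so merge the stanza by hand if you already have one.

```shell
# Write an example nextflow.config with one setting for all multithreaded steps.
cat > nextflow.config <<'EOF'
profiles {
    custom {
        process {
            // Applies to every step that carries the multithreaded label.
            withLabel:multithreaded {
                cpus = 8
            }
        }
    }
}
EOF
echo "Wrote nextflow.config"
```

The profile would then be selected in the same way as in the earlier example, with -profile custom,singularity.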
Using the Development Version¶
New updates to GEMmaker, prior to issuing a formal release, are held in the dev
branch of the GEMmaker GitHub repository. It is recommended to always use a formal release of GEMmaker. However, you can test the most recent improvements prior to release. To do so, use the -r dev
argument when running GEMmaker. For example:
nextflow run systemsgenetics/gemmaker -r dev -profile singularity \
--pipeline kallisto \
--kallisto_index_path Arabidopsis_thaliana.TAIR10.kallisto.indexed \
--sras SRAs.txt
The -r dev
argument forces Nextflow to use the development version of GEMmaker rather than the most recent stable version.
Note
You can find the most recent documentation for the dev
branch at https://gemmaker.readthedocs.io/en/dev/