A number of features have been added to Seal Demux.
- Support for substitution errors when matching tags;
- Support for “project” directories, like newer versions of CASAVA;
- “Separate reads” feature that writes reads to separate data sets according to read number, so you can feed the files directly into a standard pipeline that expects reads 1 and 2 to be in separate files.
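To make the idea of tolerating substitution errors concrete, here is a minimal sketch of tag matching by Hamming distance. It is not Seal Demux's actual implementation: the function names, the `max_mismatches` parameter and the example tags are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_tag(observed, known_tags, max_mismatches=1):
    """Return the single known tag within max_mismatches substitutions
    of the observed tag, or None if there is no unambiguous match."""
    hits = [t for t in known_tags if hamming(observed, t) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample tags, not taken from any real sample sheet.
tags = ["ACGTAC", "TTGCAA", "GGATCC"]
print(match_tag("ACGTAC", tags))  # exact match
print(match_tag("ACGTAA", tags))  # one substitution away from ACGTAC
print(match_tag("CCCCCC", tags))  # nothing close enough
```

Rejecting ambiguous matches (more than one tag within the allowed distance) is a design choice of this sketch; a real demultiplexer may resolve them differently.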
A new tool, bcl2qseq, has been introduced. It drives Illumina’s bclToqseq to reduce the time required to convert your bcl files into reads.
Most Seal tools, where appropriate, now support Fastq, Qseq and Sam in input and/or output.
- Illumina-style Fastq tags with meta data are also supported.
- Formats are supported in input and output
- For text-based formats (i.e., fastq, qseq and sam), transparent compression and decompression are supported with a number of codecs (gzip, bzip2, snappy).
The changes made also lay the groundwork to integrate BAM input and output support.
The new I/O format support is made available thanks to the Hadoop-BAM library.
The added dependency on this library entails a few things of which you need to be aware.
The properties that configured fastq and qseq input or output have been renamed. Here is a full list:
Old property | Replacement |
seal.fastq-input.base-quality-encoding | seal.input.base-quality-encoding |
seal.qseq-input.base-quality-encoding | seal.input.base-quality-encoding |
seal.qseq-output.base-quality-encoding | not available |
Warning
Note that the old property names are no longer supported and Seal will not warn you if you try to use them.
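Since Seal stays silent about the old names, you may want to check your own scripts before upgrading. The following is a minimal sketch of such a check, assuming you can collect the property names you pass to Seal into a list; the `find_old_properties` helper is hypothetical and not part of Seal.

```python
# Old property names and their replacements, as listed in these notes.
# None marks a property with no replacement.
RENAMED = {
    "seal.fastq-input.base-quality-encoding": "seal.input.base-quality-encoding",
    "seal.qseq-input.base-quality-encoding": "seal.input.base-quality-encoding",
    "seal.qseq-output.base-quality-encoding": None,
}


def find_old_properties(property_names):
    """Return (old_name, replacement) pairs for every old name found."""
    return [(name, RENAMED[name]) for name in property_names if name in RENAMED]


used = [
    "seal.qseq-input.base-quality-encoding",
    "seal.seqal.nthreads",
]
for old, new in find_old_properties(used):
    if new is None:
        print("%s is no longer available" % old)
    else:
        print("%s should be replaced with %s" % (old, new))
```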
You will need to make the Hadoop-BAM jars available to build Seal.
This version repackages Seal in a more conventional way and partly automates installation with PyPI. As a result, the names of all the Seal commands have changed. There is now a single command, seal, which in turn has many subcommands.
Old name | New name |
bwa_index_to_mmap | seal bwa_index_to_mmap |
demux | seal demux |
distcp_files | seal distcp_files |
merge_alignments | seal merge_alignments |
prq | seal prq |
read_sort | seal read_sort |
recab_table | seal recab_table |
recab_table_fetch | seal recab_table_fetch |
seqal | seal seqal |
tsvsort | seal tsvsort |
version | seal version |
Also, the names of the Python packages in Seal have changed (the root was bl but is now seal). This change will not affect you unless you were using Seal modules from your own scripts or you want to remove Seal, in which case you’ll now have to remove the seal directory instead of the bl directory.
In addition, the way to run the Seal tools if you don’t install them to the system (e.g., you build Seal but don’t install it) has changed slightly. In that case, you now have to set PYTHONPATH to include the Seal build directory. This setting is not necessary if you install Seal to one of the standard system locations.
We’ve made installing Seal easier.
The Seal build process is now driven by Python distutils. You can now perform all build and installation related actions by invoking python setup.py <command>. The old Makefile is deprecated.
We’ve also put Seal on PyPI so, after you install all the dependencies and Python pip with your package manager (see the installation page), you can now install Seal with a simple command:
pip install seal
Seal now works with newer versions of Hadoop, including the Hadoop 1.x and 2.x lines.
This release includes several bug fixes and adds more automated tests.
A new program has been added to the Seal suite: RecabTable. RecabTable computes a result equivalent to the GATK CountCovariatesWalker, but does so in a scalable way by taking advantage of your Hadoop cluster.
See the RecabTable page for all the details.
In particular, PairReadsQSeq can read the meta-information in the fastq files produced by the new version of CASAVA, and it should also be able to cope with generic fastq files as long as the trailing “/1” or “/2” is present in the read id to indicate the read number.
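As an illustration of the “/1” and “/2” convention, here is a minimal sketch that splits a read id into its base name and read number. This is not PairReadsQSeq's own parsing code; the `split_read_id` function and the example id are made up for this example.

```python
def split_read_id(read_id):
    """Split an id such as 'name/1' into ('name', 1).

    Raises ValueError if the id does not end in /1 or /2.
    """
    base, sep, suffix = read_id.rpartition("/")
    if sep and base and suffix in ("1", "2"):
        return base, int(suffix)
    raise ValueError("read id %r does not end in /1 or /2" % read_id)


# Hypothetical read id in the generic fastq style.
print(split_read_id("machine_1:3:1101:1150:2228/2"))
```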
A number of property names have been changed. We have migrated from the old “bl.”-style naming, which we had for historic reasons, and have moved to a new “seal.” naming scheme.
Deprecated property names should still work with this version, but will be removed in future releases. You are urged to update your configuration files and/or scripts, especially since Seal ignores properties it does not recognize, so misnamed properties do not normally result in an error message.
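For the properties covered in this section, the new name is the old one with the “bl.” prefix swapped for “seal.”. The sketch below applies that rule to a dictionary of settings. The `modernize_properties` helper is hypothetical, and it deliberately rewrites only the names known to be deprecated, so that other properties starting with “bl.” are left untouched.

```python
# Deprecated names documented in these release notes.
DEPRECATED = {
    "bl.prq.min-bases-per-read",
    "bl.prq.drop-failed-filter",
    "bl.prq.warning-only-if-unpaired",
    "bl.seqal.log.level",
    "bl.seqal.alignment.max.isize",
    "bl.seqal.pairing.batch.size",
    "bl.seqal.fastq-subformat",
    "bl.seqal.min_hit_quality",
    "bl.seqal.remove_unmapped",
    "bl.seqal.discard_duplicates",
    "bl.seqal.nthreads",
    "bl.seqal.trim.qual",
}


def modernize_properties(settings):
    """Return a copy of settings with deprecated names replaced."""
    updated = {}
    for name, value in settings.items():
        if name in DEPRECATED:
            name = "seal." + name[len("bl."):]
        updated[name] = value
    return updated


old = {"bl.seqal.trim.qual": "15", "mapred.reduce.tasks": "4"}
print(modernize_properties(old))
```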
Here is a list of the deprecated property names.
Deprecated property | Replacement |
bl.prq.min-bases-per-read | seal.prq.min-bases-per-read |
bl.prq.drop-failed-filter | seal.prq.drop-failed-filter |
bl.prq.warning-only-if-unpaired | seal.prq.warning-only-if-unpaired |
bl.seqal.log.level | seal.seqal.log.level |
bl.seqal.alignment.max.isize | seal.seqal.alignment.max.isize |
bl.seqal.pairing.batch.size | seal.seqal.pairing.batch.size |
bl.seqal.fastq-subformat | seal.seqal.fastq-subformat |
bl.seqal.min_hit_quality | seal.seqal.min_hit_quality |
bl.seqal.remove_unmapped | seal.seqal.remove_unmapped |
bl.seqal.discard_duplicates | seal.seqal.discard_duplicates |
bl.seqal.nthreads | seal.seqal.nthreads |
bl.seqal.trim.qual | seal.seqal.trim.qual |
The prq file format now has been defined as using the Sanger Phred+33 quality encoding. Therefore, PairReadsQSeq now produces Sanger qualities and Seqal by default expects Sanger qualities.
Since, as just mentioned, prq files now contain base qualities in Sanger Phred+33 encoding, we’ve changed the default base quality encoding expected by Seqal from Illumina Phred+64 to Sanger Phred+33.
You can get the old behaviour by setting -D seal.seqal.fastq-subformat=fastq-illumina when you call seqal.
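The two encodings differ only in the ASCII offset added to each Phred score (64 versus 33). If you need to convert existing Illumina-encoded quality strings yourself, a minimal sketch looks like this; the `illumina_to_sanger` function is an illustration and is not part of Seal.

```python
def illumina_to_sanger(qualities):
    """Convert a Phred+64 quality string to Phred+33."""
    converted = []
    for ch in qualities:
        score = ord(ch) - 64          # recover the Phred score
        if score < 0:
            raise ValueError("%r is not a valid Phred+64 character" % ch)
        converted.append(chr(score + 33))
    return "".join(converted)


# 'h' is Q40 and '@' is Q0 in Phred+64; they map to 'I' and '!' in Phred+33.
print(illumina_to_sanger("hhB@"))  # II#!
```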
More than a simple utility, TsvSort is a Hadoop program for sorting text files based on the Terasort algorithm. It is a scalable, fast, distributed sorting application. Like the Unix sort utility, it lets you specify a field delimiter and which fields to use as keys.
See the TsvSort page for details.
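To illustrate what sorting by delimiter and key fields means, here is a small single-machine sketch. It only mimics the usage pattern: it is not TsvSort's code, it does nothing distributed, and the zero-based field indices and plain string comparison are assumptions of this example rather than documented TsvSort behaviour.

```python
def sort_delimited(lines, delimiter="\t", key_fields=(0,)):
    """Sort lines by the given fields, compared as strings."""
    def key(line):
        fields = line.rstrip("\n").split(delimiter)
        return tuple(fields[i] for i in key_fields)
    return sorted(lines, key=key)


rows = ["chr2\t100\tb", "chr1\t200\ta", "chr1\t100\tc"]
for row in sort_delimited(rows, key_fields=(0, 1)):
    print(row)
```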
A few bug fixes and usability improvements are also introduced by this release.
The MergeAlignments utility provided to merge multi-part output from Seal tools now has a couple of additional features:
See the MergeAlignments documentation for details.
We updated the Seqal distributed alignment tool to include the alignment code from BWA 0.5.9.
You can now store your usual Seal run configuration in a separate config file (by default, $HOME/.sealrc). All programs in the Seal suite will now read that file if it exists. You can also specify your own configuration file name, allowing you to easily have a number of preset run configurations. In addition, you can now specify all options directly on the command line (overriding default and file settings).
For more details, see the section Seal Config File.
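The precedence described above (defaults, then the configuration file, then the command line) can be sketched as successive dictionary updates. This is only a model of the behaviour, not Seal's configuration code; the `resolve_settings` function and all the values shown are made up.

```python
def resolve_settings(defaults, file_settings, cli_settings):
    """Merge settings so that later sources override earlier ones."""
    merged = dict(defaults)
    merged.update(file_settings)   # config file overrides defaults
    merged.update(cli_settings)    # command line overrides everything
    return merged


# Hypothetical values, chosen only to show the override order.
defaults = {"seal.seqal.nthreads": "1", "seal.seqal.trim.qual": "0"}
from_file = {"seal.seqal.nthreads": "8"}
from_cli = {"seal.seqal.trim.qual": "15"}
print(resolve_settings(defaults, from_file, from_cli))
```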
Old name | New name |
bin/run_prq.sh | bin/prq |
bin/run_seqal.sh | bin/seqal |
bin/merge_sorted_alignments | bin/merge_alignments |
All Seal Hadoop commands except Seqal now accept multiple input paths. The generic command line is:
tool [ options ] <input 1> <input 2>...<input N> <output>
Seqal unfortunately can only take a single input path for now. This is due to a limitation in the Hadoop pipes command line interface.
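In the generic command line above, the last positional argument is the output path and everything before it is an input path. A minimal sketch of that convention follows; the `split_paths` function and the example paths are hypothetical and are not taken from Seal's argument parser.

```python
def split_paths(positional_args):
    """Split positional arguments into (input paths, output path)."""
    if len(positional_args) < 2:
        raise ValueError("need at least one input path and one output path")
    *inputs, output = positional_args
    return inputs, output


inputs, output = split_paths(["run1/qseq", "run2/qseq", "merged_prq"])
print(inputs, output)
```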
We have made the command line interface of the Seal tools more consistent. This change mainly affects PairReadsQSeq and Seqal. We describe this new command line interface in the Program Usage section.
In addition to changing the name of the command from run_prq.sh to prq, we have also changed the arguments prq accepts.
Old:
./bin/run_prq.sh input output 54
where 54 was an optional argument to override the minimum number of bases a read must have to avoid being filtered out.
New:
./bin/prq -D bl.prq.min-bases-per-read=54 input output
Now the parameter is a configuration property that can be specified on the command line or in the new Seal configuration file. PairReadsQSeq configuration properties are documented in the section PairReadsQSeq.
In addition to changing the name of the command from run_seqal.sh to seqal, we have also changed the arguments seqal accepts.
Old:
./bin/run_seqal.sh input output reference 15
where 15 was an optional argument to control read trimming.
New:
./bin/seqal -D seal.seqal.trim.qual=15 input output
or:
./bin/seqal --trimq 15 input output
Now the trim quality parameter is the configuration property seal.seqal.trim.qual that can be specified on the command line or the new Seal configuration file. In addition, Seqal provides a shortcut --trimq argument. Seqal configuration properties are documented in the section Seqal Properties.
Note the changes to the default values of these Seqal options. They may affect your workflow.
Parameter | Old value | New value |
seal.seqal.min_hit_quality | 1 | 0 |
seal.seqal.remove_unmapped | True | False |
PRQ used to stop with a (rather cryptic) error if it encountered an unpaired read in the input data. By default it still does that, although we think we’ve somewhat improved the error message. However, if you prefer you can tell it to discard the unpaired reads with a warning:
./bin/prq -D bl.prq.warning-only-if-unpaired=true input output
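The two behaviours can be modelled as follows. This sketch is not PRQ's implementation: the `check_pairs` function, its input structure (a mapping from read name to the set of read numbers seen) and the messages are assumptions made for illustration.

```python
import warnings


def check_pairs(reads_by_name, warning_only_if_unpaired=False):
    """Return the names of properly paired reads.

    An unpaired read raises ValueError by default; with
    warning_only_if_unpaired=True it is discarded with a warning.
    """
    paired = []
    for name, numbers in sorted(reads_by_name.items()):
        if numbers == {1, 2}:
            paired.append(name)
        elif warning_only_if_unpaired:
            warnings.warn("discarding unpaired read %s" % name)
        else:
            raise ValueError("unpaired read %s" % name)
    return paired


# Hypothetical input: read_b is missing its first read.
seen = {"read_a": {1, 2}, "read_b": {2}}
print(check_pairs(seen, warning_only_if_unpaired=True))
```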