A new program has been added to the Seal suite: RecabTable. RecabTable computes a result equivalent to the GATK CountCovariatesWalker, but does so in a scalable way by taking advantage of your Hadoop cluster.
See the RecabTable page for all the details.
In particular, PairReadsQSeq can read the meta-information in the fastq files produced by the new version of CASAVA, and it should also be able to cope with generic fastq files as long as the trailing “/1” or “/2” is present in the read id to indicate the read number.
A number of property names have been changed. We have migrated from the old “bl.”-style naming, which we had kept for historical reasons, to a new “seal.” naming scheme.
Deprecated property names should still work with this version, but will be removed in future releases. You are urged to update your configuration files and/or scripts, especially since Seal ignores properties it does not recognize, so misnamed properties do not normally result in an error message.
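Every deprecated name in the table that follows differs from its replacement only in the prefix, so updating scripts can be largely mechanical. The following is a minimal sketch, not part of Seal: the function name `migrate_seal_properties` is hypothetical, and it assumes the strings `bl.prq.` and `bl.seqal.` do not occur in your files for any other reason.

```shell
# Rewrite deprecated "bl." property prefixes to "seal." (stdin -> stdout).
# Only the two prefixes listed in the deprecation table are touched.
# Caveat: this is a plain text substitution, so review the result before
# replacing your original file.
migrate_seal_properties() {
    sed -e 's/bl\.prq\./seal.prq./g' \
        -e 's/bl\.seqal\./seal.seqal./g'
}

# Example: one deprecated name and one already-migrated name.
printf '%s\n' \
    './bin/prq -D bl.prq.min-bases-per-read=54 input output' \
    './bin/seqal -D seal.seqal.trim.qual=15 input output' \
    | migrate_seal_properties
```

Names that already use the “seal.” prefix pass through unchanged, so running the filter twice is harmless.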
Here is a list of the deprecated property names.
| Deprecated property | Replacement |
| bl.prq.min-bases-per-read | seal.prq.min-bases-per-read |
| bl.prq.drop-failed-filter | seal.prq.drop-failed-filter |
| bl.prq.warning-only-if-unpaired | seal.prq.warning-only-if-unpaired |
| bl.seqal.log.level | seal.seqal.log.level |
| bl.seqal.alignment.max.isize | seal.seqal.alignment.max.isize |
| bl.seqal.pairing.batch.size | seal.seqal.pairing.batch.size |
| bl.seqal.fastq-subformat | seal.seqal.fastq-subformat |
| bl.seqal.min_hit_quality | seal.seqal.min_hit_quality |
| bl.seqal.remove_unmapped | seal.seqal.remove_unmapped |
| bl.seqal.discard_duplicates | seal.seqal.discard_duplicates |
| bl.seqal.nthreads | seal.seqal.nthreads |
| bl.seqal.trim.qual | seal.seqal.trim.qual |
The prq file format is now defined as using the Sanger Phred+33 quality encoding. Therefore, PairReadsQSeq now produces Sanger qualities and Seqal by default expects Sanger qualities.
Since prq files now contain base qualities in Sanger Phred+33 encoding, we’ve changed the default base quality encoding expected by Seqal from Illumina Phred+64 to Sanger Phred+33.
You can get the old behaviour by setting -D seal.seqal.fastq-subformat=fastq-illumina when you call seqal.
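To make the difference between the two encodings concrete: Phred+64 stores quality Q as the ASCII character with code Q+64, while Phred+33 uses Q+33, so converting from the former to the latter shifts every character down by 31. The sketch below illustrates that arithmetic on a single quality string. The function name is hypothetical, it is not a Seal tool, and it assumes the input contains only valid Phred+64 characters ('@' and above).

```shell
# Convert one quality string from Illumina Phred+64 to Sanger Phred+33.
# '@' (ASCII 64) encodes Q0 in Phred+64 and '!' (ASCII 33) encodes Q0 in
# Phred+33, so the range '@'..'~' maps onto '!'..'_'.
illumina_to_sanger() {
    printf '%s\n' "$1" | LC_ALL=C tr '@-~' '!-_'
}

# 'h' is Q40 in Phred+64 and becomes 'I'; 'B' (Q2) becomes '#'; '@' (Q0) becomes '!'.
illumina_to_sanger 'hhhhB@'   # prints: IIII#!
```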
More than a simple utility, TsvSort is a Hadoop program for sorting text files based on the Terasort algorithm. It is a scalable, fast, distributed sorting application. Its usage pattern is similar to that of the Unix sort utility: you can specify a field delimiter and which fields to use as keys.
See the TsvSort page for details.
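For readers less familiar with the Unix pattern that TsvSort mirrors, here is a small local example using the standard sort utility with a field delimiter and key fields. It only illustrates the Unix side; the TsvSort options themselves are described on the TsvSort page and are not shown here. The sample records and the function name are made up for illustration.

```shell
# Sort tab-separated records by the second field, then by the first.
tab=$(printf '\t')

sort_by_second_field() {
    # -t sets the field delimiter; each -k selects one key field.
    LC_ALL=C sort -t "$tab" -k2,2 -k1,1
}

printf 'read3\tchr2\nread1\tchr1\nread2\tchr1\n' | sort_by_second_field
```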
A few bug fixes and usability improvements are also introduced by this release.
The MergeAlignments utility provided to merge multi-part output from Seal tools now has a couple of additional features:
See the MergeAlignments documentation for details.
We updated the Seqal distributed alignment tool to include the alignment code from BWA 0.5.9.
You can now store your usual Seal run configuration in a separate config file (by default, $HOME/.sealrc). All programs in the Seal suite will now read that file if it exists. You can also specify your own configuration file name, allowing you to easily have a number of preset run configurations. In addition, you can now specify all options directly on the command line (overriding default and file settings).
For more details, see the section Seal Config File.
| Old name | New name |
| bin/run_prq.sh | bin/prq |
| bin/run_seqal.sh | bin/seqal |
| bin/merge_sorted_alignments | bin/merge_alignments |
All Seal Hadoop commands except Seqal now accept multiple input paths. The generic command line is:
tool [ options ] <input 1> <input 2>...<input N> <output>
Seqal unfortunately can only take a single input path for now. This is due to a limitation in the Hadoop pipes command line interface.
We have made the command line interface of the Seal tools more consistent. This change mainly affects PairReadsQSeq and Seqal. We describe this new command line interface in the Program Usage section.
In addition to changing the name of the command from run_prq.sh to prq, we have also changed the arguments prq accepts.
Old:
./bin/run_prq.sh input output 54
where 54 was an optional argument to override the minimum number of bases a read must have to avoid being filtered out.
New:
./bin/prq -D seal.prq.min-bases-per-read=54 input output
Now the parameter is a configuration property that can be specified on the command line or in the new Seal configuration file. PairReadsQSeq configuration properties are documented in the section PairReadsQSeq.
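If you have scripts that still build the old positional arguments, a small translation step can ease the transition. The sketch below is not shipped with Seal: the function name `old_prq_to_new` is hypothetical, and it only prints the equivalent new-style command rather than running it. It assumes paths without spaces.

```shell
# Translate old run_prq.sh arguments (input output [min_bases]) into the
# equivalent new-style prq command line, printed on stdout.
old_prq_to_new() {
    if [ "$#" -lt 2 ] || [ "$#" -gt 3 ]; then
        echo "usage: old_prq_to_new input output [min_bases]" >&2
        return 1
    fi
    if [ "$#" -eq 3 ]; then
        # The optional third argument becomes a configuration property.
        echo "./bin/prq -D seal.prq.min-bases-per-read=$3 $1 $2"
    else
        echo "./bin/prq $1 $2"
    fi
}

old_prq_to_new input output 54
# prints: ./bin/prq -D seal.prq.min-bases-per-read=54 input output
```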
In addition to changing the name of the command from run_seqal.sh to seqal, we have also changed the arguments seqal accepts.
Old:
./bin/run_seqal.sh input output reference 15
where 15 was an optional argument to control read trimming.
New:
./bin/seqal -D seal.seqal.trim.qual=15 input output
or:
./bin/seqal --trimq 15 input output
Now the trim quality parameter is the configuration property seal.seqal.trim.qual, which can be specified on the command line or in the new Seal configuration file. In addition, Seqal provides a shortcut --trimq argument. Seqal configuration properties are documented in the section Seqal Properties.
Note the changes to the default values of these Seqal options. They may affect your workflow.
| Parameter | Old value | New value |
| seal.seqal.min_hit_quality | 1 | 0 |
| seal.seqal.remove_unmapped | True | False |
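If your workflow relied on the previous defaults, you can set both properties explicitly. The following sketch prints a command line that does so. It rests on two assumptions not confirmed in these notes: that -D may be given more than once, as is usual for Hadoop-style tools, and that the boolean can be written as lowercase `true` (the table above shows `True`). The helper name is hypothetical.

```shell
# Print the -D flags that restore the previous Seqal defaults
# (min_hit_quality=1, remove_unmapped=true).
seqal_old_default_flags() {
    printf '%s' '-D seal.seqal.min_hit_quality=1 -D seal.seqal.remove_unmapped=true'
}

# Show the resulting command line without running it.
echo "./bin/seqal $(seqal_old_default_flags) input output"
```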
PRQ used to stop with a (rather cryptic) error if it encountered an unpaired read in the input data. By default it still does that, although we think we’ve somewhat improved the error message. However, if you prefer you can tell it to discard the unpaired reads with a warning:
./bin/prq -D seal.prq.warning-only-if-unpaired=true input output