News¶

New in 0.4.0¶

New features in Demux¶

A number of features have been added to Seal Demux.

Support for substitution errors when matching tags;

Support for “project” directories, like newer versions of CASAVA;

A “separate reads” feature that writes reads to separate data sets according to read number, so you can feed the files directly into a standard pipeline that expects reads 1 and 2 to be in separate files.
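To give a rough idea of what tolerating substitution errors means when matching tags, here is a minimal sketch that assigns an observed tag to a known sample tag by Hamming distance. It is not the code Seal Demux uses: the function names, the `max_mismatches` parameter and the sample tags are all assumptions made for this example.

```python
def hamming(a, b):
    """Count the positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def match_tag(observed, known_tags, max_mismatches=1):
    """Return the known tag within max_mismatches substitutions of the
    observed tag, or None if no tag or more than one tag qualifies."""
    candidates = [t for t in known_tags if hamming(observed, t) <= max_mismatches]
    if len(candidates) == 1:
        return candidates[0]
    return None


# Hypothetical sample tags, for illustration only.
tags = ["ATCACG", "CGATGT", "TTAGGC"]
print(match_tag("ATCACG", tags))  # exact match
print(match_tag("ATCACC", tags))  # one substitution away from ATCACG
print(match_tag("GGGGGG", tags))  # no tag is close enough
```

Returning None when two tags are equally plausible is a deliberately conservative choice for the sketch; a real tool may resolve such cases differently.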

Hadoop-based bcl to qseq conversion¶

A new tool, bcl2qseq, has been introduced. It drives Illumina’s bclToqseq to reduce the time required to convert your bcl files into reads.

New I/O format support¶

Most Seal tools, where appropriate, now support Fastq, Qseq and Sam in input and/or output.

Illumina-style Fastq tags with meta data are also supported.

These formats are supported for both input and output.

For text-based formats (i.e., fastq, qseq and sam), transparent compression and decompression are supported with a number of codecs (gzip, bzip2, snappy).
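If you want to handle the same compressed text files outside of Hadoop, the idea of picking a codec transparently can be sketched as below. This is only an illustration using the Python standard library, not how Seal or Hadoop implement it. The helper names are made up, the codec is chosen by file extension as an assumption, and snappy is left out because it is not in the standard library.

```python
import bz2
import gzip
import os
import tempfile

# Map file extensions to opener functions; anything else is plain text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path, mode="rt"):
    """Open a possibly compressed text file, choosing the codec by extension."""
    ext = os.path.splitext(path)[1]
    opener = OPENERS.get(ext, open)
    return opener(path, mode)


def round_trip(filename, text):
    """Write text to a temporary file called filename and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, filename)
        with open_text(path, "wt") as f:
            f.write(text)
        with open_text(path, "rt") as f:
            return f.read()


record = "@read1/1\nACGT\n+\nIIII\n"
for name in ("reads.fastq", "reads.fastq.gz", "reads.fastq.bz2"):
    # The caller never mentions the codec; only the file name changes.
    print(name, round_trip(name, record) == record)
```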

The changes made also lay the groundwork to integrate BAM input and output support.

Hadoop-BAM¶

The new I/O format support is made available thanks to the Hadoop-BAM library.

The added dependency on this library entails a few things of which you need to be aware.

Changes in property names¶

The properties that configure fastq and qseq input or output have been renamed. Here is the full list:

Old property	Replacement
seal.fastq-input.base-quality-encoding	seal.input.base-quality-encoding
seal.qseq-input.base-quality-encoding	seal.input.base-quality-encoding
seal.qseq-output.base-quality-encoding	not available

Warning

Note that the old property names are no longer supported and Seal will not warn you if you try to use them.
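Since Seal will not warn you, you may want to scan your own configuration files and launch scripts for the old names. The sketch below shows one way to do that. The mapping comes from the table above; the function name and the sample text are hypothetical.

```python
# Renamed properties, taken from the table above. None marks a property
# for which no replacement is available.
RENAMED = {
    "seal.fastq-input.base-quality-encoding": "seal.input.base-quality-encoding",
    "seal.qseq-input.base-quality-encoding": "seal.input.base-quality-encoding",
    "seal.qseq-output.base-quality-encoding": None,
}


def find_old_properties(text):
    """Return (line number, old name, replacement) for each old name found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for old, new in RENAMED.items():
            if old in line:
                hits.append((lineno, old, new))
    return hits


# Made-up file contents with placeholder values, for illustration only.
sample = """\
seal.qseq-input.base-quality-encoding=some-value
seal.seqal.nthreads=some-value
"""
for lineno, old, new in find_old_properties(sample):
    print("line %d: %s -> %s" % (lineno, old, new or "no replacement"))
```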

Hadoop-BAM jars¶

You will need to make the Hadoop-BAM jars available to build Seal.

Repackaging¶

This version repackages Seal in a more conventional way and partly automates installation through PyPI. As a result, the names of all the Seal commands have changed. There is now a single command, seal, which in turn has many subcommands.

Old name	New name
bwa_index_to_mmap	seal bwa_index_to_mmap
demux	seal demux
distcp_files	seal distcp_files
merge_alignments	seal merge_alignments
prq	seal prq
read_sort	seal read_sort
recab_table	seal recab_table
recab_table_fetch	seal recab_table_fetch
seqal	seal seqal
tsvsort	seal tsvsort
version	seal version

Also, the names of the Python packages in Seal have changed (the root was bl but is now seal). This change will not affect you unless you use Seal modules from your own scripts or want to remove Seal; to remove it, you now have to delete the seal directory instead of the bl directory.

In addition, the way to run the Seal tools if you don’t install them to the system (e.g., you build Seal but don’t install it) has changed slightly. In that case, you now have to set PYTHONPATH to include the Seal build directory. This setting is not necessary if you install Seal to one of the standard system locations.
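If you launch Seal from your own Python scripts, the PYTHONPATH setting can be prepared roughly as follows. This is only a sketch: the helper name and the build directory path are placeholders, so substitute the directory your build actually produces.

```python
import os


def env_for_build(build_dir, environ=None):
    """Return a copy of the environment with build_dir placed first on
    PYTHONPATH, keeping any entries that were already there."""
    env = dict(os.environ if environ is None else environ)
    existing = env.get("PYTHONPATH")
    if existing:
        env["PYTHONPATH"] = build_dir + os.pathsep + existing
    else:
        env["PYTHONPATH"] = build_dir
    return env


# "/path/to/seal/build" is a placeholder, not a path Seal defines.
env = env_for_build("/path/to/seal/build")
print(env["PYTHONPATH"])
```

The returned dictionary can then be passed as the `env` argument when starting the seal command with the `subprocess` module.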

Easier installation¶

We’ve made installing Seal easier.

The Seal build process is now driven by Python distutils. You can now perform all build and installation related actions by invoking python setup.py <command>. The old Makefile is deprecated.

We’ve also put Seal on PyPI so, after you install all the dependencies and pip with your package manager (see the installation page), you can install Seal with a simple command:

pip install seal

Newer versions of Hadoop¶

Seal now works with newer versions of Hadoop, including the Hadoop 1.x and 2.x lines.

Bug fixes¶

This release includes several bug fixes and adds more automated tests.

New in 0.3.0¶

RecabTable program¶

A new program has been added to the Seal suite: RecabTable. RecabTable computes a result equivalent to the GATK CountCovariatesWalker, but does so in a scalable way by taking advantage of your Hadoop cluster.

See the RecabTable page for all the details.

PairReadsQSeq can now also read fastq¶

In particular, PairReadsQSeq can read the meta-information in the fastq files produced by the new version of CASAVA. It should also be able to cope with generic fastq files, as long as the read id ends with “/1” or “/2” to indicate the read number.
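To make the two id styles concrete, here is a small sketch that extracts the read number from a fastq header line. It is not PairReadsQSeq's parser. It assumes the newer CASAVA layout puts the meta-information after a space in the form read:filtered:control:index, and the function name and sample headers are made up for the example.

```python
def read_number(header):
    """Return the read number (1 or 2) encoded in a fastq header line."""
    header = header.lstrip("@").strip()
    # Newer CASAVA style: "<id> <read>:<filtered>:<control>:<index>"
    if " " in header:
        meta = header.split(" ", 1)[1].split(":")
        if meta[0] in ("1", "2"):
            return int(meta[0])
    # Generic style: the read id ends in "/1" or "/2"
    read_id = header.split(" ", 1)[0]
    if read_id.endswith(("/1", "/2")):
        return int(read_id[-1])
    raise ValueError("cannot determine read number from %r" % header)


print(read_number("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
print(read_number("@read42/2"))
```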

Changes in property names¶

A number of property names have been changed. We have migrated from the old “bl.”-style naming, which we had for historic reasons, and have moved to a new “seal.” naming scheme.

Deprecated property names should still work with this version, but will be removed in future ones. You are urged to update your configuration files and/or scripts, especially since Seal ignores properties it does not recognize, so misnamed properties do not normally result in an error message.

Here is a list of the deprecated property names.

Deprecated property	Replacement
bl.prq.min-bases-per-read	seal.prq.min-bases-per-read
bl.prq.drop-failed-filter	seal.prq.drop-failed-filter
bl.prq.warning-only-if-unpaired	seal.prq.warning-only-if-unpaired
bl.seqal.log.level	seal.seqal.log.level
bl.seqal.alignment.max.isize	seal.seqal.alignment.max.isize
bl.seqal.pairing.batch.size	seal.seqal.pairing.batch.size
bl.seqal.fastq-subformat	seal.seqal.fastq-subformat
bl.seqal.min_hit_quality	seal.seqal.min_hit_quality
bl.seqal.remove_unmapped	seal.seqal.remove_unmapped
bl.seqal.discard_duplicates	seal.seqal.discard_duplicates
bl.seqal.nthreads	seal.seqal.nthreads
bl.seqal.trim.qual	seal.seqal.trim.qual
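Updating scripts by hand is error-prone, so you may prefer to automate the renaming. The following sketch rewrites the deprecated names listed in the table above and leaves everything else untouched. The function name is hypothetical, and the script is an illustration rather than a tool shipped with Seal.

```python
import re

# Deprecated names from the table above. Each one maps to the same name
# with the "bl." prefix replaced by "seal.".
DEPRECATED = [
    "bl.prq.min-bases-per-read",
    "bl.prq.drop-failed-filter",
    "bl.prq.warning-only-if-unpaired",
    "bl.seqal.log.level",
    "bl.seqal.alignment.max.isize",
    "bl.seqal.pairing.batch.size",
    "bl.seqal.fastq-subformat",
    "bl.seqal.min_hit_quality",
    "bl.seqal.remove_unmapped",
    "bl.seqal.discard_duplicates",
    "bl.seqal.nthreads",
    "bl.seqal.trim.qual",
]

# Match a listed name only when it is not part of a longer identifier.
_PATTERN = re.compile(
    r"(?<![\w.-])(?:%s)(?![\w-])" % "|".join(re.escape(name) for name in DEPRECATED)
)


def migrate(text):
    """Replace each deprecated "bl." property name with its "seal." form."""
    return _PATTERN.sub(lambda m: "seal." + m.group(0)[len("bl."):], text)


print(migrate("./bin/prq -D bl.prq.min-bases-per-read=54 input output"))
```

Names that are not in the list, including other strings that merely start with "bl.", are deliberately left alone.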

prq files now always use sanger quality encoding¶

The prq file format is now defined as using the Sanger Phred+33 quality encoding. Therefore, PairReadsQSeq now produces Sanger qualities and Seqal by default expects Sanger qualities.

Seqal default quality encoding is now sanger¶

Since, as just mentioned, prq files now contain base qualities in Sanger Phred+33 encoding, we’ve changed the default base quality encoding expected by Seqal from Illumina Phred+64 to Sanger Phred+33.

You can get the old behaviour by setting -D seal.seqal.fastq-subformat=fastq-illumina when you call seqal.
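The difference between the two encodings is only the ASCII offset added to each Phred score: 64 for Illumina and 33 for Sanger. As an illustration, the sketch below re-encodes a quality string from one to the other. The function name is made up, and this is not code taken from Seal.

```python
def illumina_to_sanger(qualities):
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for ch in qualities:
        score = ord(ch) - 64  # recover the Phred score
        if score < 0:
            raise ValueError("character %r is below the Phred+64 range" % ch)
        converted.append(chr(score + 33))  # apply the Sanger offset
    return "".join(converted)


# 'h' encodes score 40, 'B' score 2 and '@' score 0 in Phred+64.
print(illumina_to_sanger("hhB@"))  # prints II#!
```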

TsvSort utility¶

More than a simple utility, TsvSort is a Hadoop program for sorting text files based on the Terasort algorithm. It is a scalable, fast, distributed sorting application. Its usage is similar to that of the Unix sort utility: you can specify a field delimiter and which fields to use as keys.

See the TsvSort page for details.
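To show what sorting by delimiter and key fields means, here is a tiny in-memory sketch. It runs locally and has nothing to do with the distributed Terasort-based implementation; the function name, its parameters and the sample rows are assumptions, and keys are compared as plain strings.

```python
def sort_lines(lines, delimiter="\t", key_fields=(0,)):
    """Sort text lines by the given zero-based field indexes (string order)."""
    def key(line):
        fields = line.rstrip("\n").split(delimiter)
        return tuple(fields[i] for i in key_fields)
    return sorted(lines, key=key)


# Hypothetical tab-separated rows.
rows = ["chr2\t100\tb", "chr1\t200\ta", "chr1\t150\tc"]
for line in sort_lines(rows, key_fields=(0, 1)):
    print(line)
```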

Bug fixes and usability¶

A few bug fixes and usability improvements are also introduced by this release.

When an error in the input file format is encountered, the tools now try to tell you exactly in which file and on which line the problem occurred.
Seqal logging and error reporting have been fixed. In particular, when a usage error occurred, Seqal blurted out a rather unhelpful message such as Error running seqal, because a problem was causing the actual error message to be lost. That should be fixed now.

New in 0.2.3¶

Improved MergeAlignments¶

The MergeAlignments utility provided to merge multi-part output from Seal tools now has a couple of additional features:

Reference checksums
Additional SAM header tags

See the MergeAlignments documentation for details.

New in 0.2.2¶

Seqal now integrates BWA 0.5.9¶

We updated the Seqal distributed alignment tool to include the alignment code from BWA 0.5.9.

New configuration system¶

You can now store your usual Seal run configuration in a separate config file (by default, $HOME/.sealrc). All programs in the Seal suite will now read that file if it exists. You can also specify your own configuration file name, allowing you to easily have a number of preset run configurations. In addition, you can now specify all options directly on the command line (overriding default and file settings).

For more details, see the section Seal Config File.
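The order of precedence described above (defaults, then the configuration file, then the command line) can be pictured as a simple layered merge. The sketch below uses plain dictionaries and does not read a real Seal configuration file; the function name and all the values are hypothetical.

```python
def resolve_options(defaults, file_settings, command_line):
    """Merge option sources; later sources override earlier ones."""
    resolved = dict(defaults)
    resolved.update(file_settings)
    resolved.update(command_line)
    return resolved


# Hypothetical values, for illustration only.
defaults = {"seal.seqal.trim.qual": "0", "seal.seqal.nthreads": "1"}
from_file = {"seal.seqal.nthreads": "8"}
from_cli = {"seal.seqal.trim.qual": "15"}

options = resolve_options(defaults, from_file, from_cli)
for name in sorted(options):
    print(name, "=", options[name])
```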

Changed names of executables¶

Old name	New name
bin/run_prq.sh	bin/prq
bin/run_seqal.sh	bin/seqal
bin/merge_sorted_alignments	bin/merge_alignments

Multiple inputs¶

All Seal Hadoop commands except Seqal now accept multiple input paths. The generic command line is:

tool [ options ] <input 1> <input 2>...<input N> <output>

Seqal unfortunately can only take a single input path for now. This is due to a limitation in the Hadoop pipes command line interface.
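If you wrap the Seal tools in your own scripts, the generic command line above amounts to treating the last path as the output and everything before it as inputs. A minimal sketch, with a made-up function name and made-up paths:

```python
def split_paths(paths):
    """Split positional paths into (inputs, output); the last path is the output."""
    if len(paths) < 2:
        raise ValueError("need at least one input path and one output path")
    return list(paths[:-1]), paths[-1]


inputs, output = split_paths(["run1", "run2", "run3", "merged"])
print(inputs, output)
```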

Changes in command line tool usage¶

We have made the command line interface of the Seal tools more consistent. This change mainly affects PairReadsQSeq and Seqal. We describe the new command line interface in the Program Usage section.

Prq¶

In addition to changing the name of the command from run_prq.sh to prq, we have also changed the arguments prq accepts.

Old:

./bin/run_prq.sh input output 54

where 54 was an optional argument to override the minimum number of required bases for a read to avoid filtering.

New:

./bin/prq -D bl.prq.min-bases-per-read=54 input output

Now the parameter is a configuration property that can be specified on the command line or in the new Seal configuration file. PairReadsQSeq configuration properties are documented in the section PairReadsQSeq.

Seqal¶

In addition to changing the name of the command from run_seqal.sh to seqal, we have also changed the arguments seqal accepts.

Old:

./bin/run_seqal.sh input output reference 15

where 15 was an optional argument to control read trimming.

New:

./bin/seqal -D seal.seqal.trim.qual=15 input output

or:

./bin/seqal --trimq 15 input output

Now the trim quality parameter is the configuration property seal.seqal.trim.qual, which can be specified on the command line or in the new Seal configuration file. In addition, Seqal provides a shortcut --trimq argument. Seqal configuration properties are documented in the section Seqal Properties.

Changes to default values¶

Note the changes to the default values of these Seqal options. They may affect your workflow.

Parameter	Old value	New value
seal.seqal.min_hit_quality	1	0
seal.seqal.remove_unmapped	True	False

Let PRQ discard unpaired reads¶

PRQ used to stop with a (rather cryptic) error if it encountered an unpaired read in the input data. By default it still does that, although we think we’ve somewhat improved the error message. However, if you prefer you can tell it to discard the unpaired reads with a warning:

./bin/prq -D bl.prq.warning-only-if-unpaired=true input output
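The two behaviours can be sketched as follows. This is only an illustration of the stop-or-discard choice, not PRQ's actual code: the record layout, the function name and the `warning_only_if_unpaired` parameter (named after the property above) are assumptions.

```python
import logging
from collections import OrderedDict

log = logging.getLogger("pairing_sketch")


def pair_reads(records, warning_only_if_unpaired=False):
    """Pair (read_id, read_number, sequence) records by read id.

    Unpaired reads raise an error by default; when warning_only_if_unpaired
    is true they are logged and discarded instead.
    """
    by_id = OrderedDict()
    for read_id, number, sequence in records:
        by_id.setdefault(read_id, {})[number] = sequence
    pairs = []
    for read_id, reads in by_id.items():
        if 1 in reads and 2 in reads:
            pairs.append((read_id, reads[1], reads[2]))
        elif warning_only_if_unpaired:
            log.warning("discarding unpaired read %s", read_id)
        else:
            raise ValueError("unpaired read %s" % read_id)
    return pairs


# Hypothetical records: r2 has no mate.
records = [("r1", 1, "ACGT"), ("r1", 2, "TTGA"), ("r2", 1, "GGCC")]
print(pair_reads(records, warning_only_if_unpaired=True))
```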
