Seqal is a distributed short read mapping and duplicate removal tool. It implements a distributed version of the BWA aligner, and adds a duplicate read identification feature using the same criteria as the Picard MarkDuplicates command (see below for details).
Like BWA, Seqal takes as input read pairs to be aligned, and an indexed reference sequence. The input read pairs need to be in prq format (PairReadsQSeq may help you format them quickly). The indexed reference is a slightly modified version of the one generated by BWA. See the following section for instructions.
Seqal needs a modified version of the BWA index files to run. The original index must be generated with a compatible version of BWA. To created the modified index follow these steps:
Generate the standard bwa index, using whatever bwa options you need:
bwa index [OPTIONS] ref.fa
seal bwa_index_to_mmap ref.fa
This generates two new files: ref.fa.sax and ref.fa.rsax.
Delete the original .sa and .rsa files:
rm -f ref.fa.*sa
These two files are not required by Seqal.
Build an archive containing the index files at the top level:
tar cf ref.tar ref.fa.*
Hadoop’s distributed cache mechanism will be used to distribute and unpack the archive on all the cluster nodes. We recommend not including the original .sa and .rsa files since for a full human genome reference they are large enough to slow the distribution process by a few minutes. Also, we recommend not compressing the archive since we have found that decompressing it adds several minutes to the start-up time, even if copying the file may be faster. Using a light compression (e.g. gzip --fast) may be the optimal solution but we have yet to test it.
|Seal version||BWA version|
|0.1 – 0.2.1||0.5.9|
To run seqal, use the seal seqal command:
seal seqal INPUT OUTPUT REF.tar
The command takes the following arguments:
|-q Q||–trimq Q||trim quality, like BWA’s -q argument (default: 0).|
|-a||–align-only||Only perform alignment and skip duplicates detection (default: false).|
Seqal also provides a number of properties to control its behaviour. For a full description see the Seqal Properties page.
In addition to the mandatory arguments and options listed here, Seqal supports the usual Seal command line options. See the Program Usage section for details.
The criteria applied by Seqal (and by Picard at the time of this writing) to identify duplicate reads roughly equates to aligned finding reads (or pairs) whose mappings start at the same reference position.
These are the steps that explain the criteria in more detail.
Each read can be mapped on the forward or reverse strand.
For each read, we’re look for the reference position of the first base from each read (in the example below, S1 and S2).
Take a single fragment that we’re going to sequence:
S1 S2 sequence AAACCCGGGTTTAAAGTTCAAGCAATTCTCACCTCCACCTTCCAGAACCGGTTAACCGGT |-------------| |-------------| Read 1 Read 2
Suppose the last 3 bases from Read 1 are trimmed:
| trimmed S1 v AAACCCGGGTTTaaa |-------------| Read 1
If Read 1 is mapped on the forward strand, the reference position M of S1 is then simply the mapping position reported by the aligner for this read:
M S1 AAACCCGGGTTTaaa |-------------| Read 1
If instead Read 1 is mapped to the reverse strand, its mapping position will refer to its last bases, since the read is reversed (and complemented):
M S1 aaaTTTGGGCCCAAA |-------------| Read 1
Therefore, to find our “start” position S1 we’ll have to look at the alignment (through the CIGAR string) and find the reference position of the right-most read (note that for simplicity we didn’t complement the bases in the example above). We’ll use the reference position of S1 when deciding whether this read has duplicates.
Dealing with Read 2 with trimming can be a little more complicated, since the trimming happens at the S2 side of the read. Consider Read 2 with 4 bases trimmed and mapped to the forward strand:
S2 M aaccGGTTAACCGGT |-------------| Read 2
In this case the alignment reports the reference position of the first G, which is the 5th base in the read. To find the reference position of S2 we have to count backwards. The number of position to back up is indicated by the soft and/or hard clipping operations in the CIGAR—for the example above it could be 4S11M, so we would need to subtract 4 positions from our alignment coordinate.
The final case is Read 2 on the reversed strand. Again, in this case the read is reversed and complemented so S2 is on the “right” and the mapping position M is on the other end:
M S2 TGGCCAATTGGccaa |-------------| Read 2
We therefore have to count forward, including any trimmed bases to find our canonical position S2 which will be used to evaluate duplicates.
In the end, for any given read we will have its corresponding start coordinate S.
Match the pairs by whether they align on the same strand and by the reference coordinates of start of each read, S1 or S2, from step (2). With this criteria we identify sets of equivalent reads.
To calculate the equivalency classes of reads we form a key (S1, orientation read 1, S2). All pairs which result in identical instances of this tuple will be considered duplicates.
Given a set of pairs, leave the pair with the highest base qualities as is, while we label the rest as duplicates.
To decide which pair has the best quality, we sum all base qualities >= 15. The pair with the highest sum “wins” (we implicitly assume reads have the same length).
For unpaired reads (or reads whose mate is unmapped), if the read’s S coordinate (as in step 2) and mapping orientation falls on a paired read, it will be marked as a duplicate—i.e. paired reads are given precedence.
If instead for a particular coordinate and orientation we only find unpaired reads, then we apply the same base quality-based criteria that we used for pairs: the one with the highest sum( base qualities >= 15 ) is left as is, while the rest are marked as duplicates.
Unmapped reads cannot be marked as duplicates, since our criteria for identifying duplicates is based on mapping coordinates. Seqal does not try to match reads by identical nucleotide sequence.