RecabTable¶

RecabTable is a Hadoop program to calculate a table of base qualities for all values of a given set of factors. It computes a result equivalent to the GATK CountCovariatesWalker in the Seal framework.

Usage¶

To run RecabTable, launch seal recab_table. You will need to provide input and output paths, and a database of known variation sites in ROD or VCF format. For example,

seal recab_table --vcf-file dbsnp.vcf sam_directory recab_table_output

Input files must be in SAM format without a header (like the ones produced by Seqal and ReadSort). Output files are all CSV, without a header. The known variants file must be on HDFS.

To assemble a single CSV from the RecabTable output that can be used by GATK TableRecalibration use:

seal recab_table_fetch recab_table_output recab_table.csv

or:

seal recab_table_fetch recab_table_output > recab_table.csv

The utility seal recab_table_fetch simply copies the data generated by RecabTable into one local file while inserting the appropriate data header so that GATK will accept it.

In its default configuration RecabTable produces output that is almost identical to what is produced by this GATK CountCovariates command:

java -jar GATK-1.2-24-g6478681.jar \
-T CountCovariates \
-cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate \
-R ref.fasta -recalFile output.csv -knownSites dbsnp.vcf -I data.bam

Regarding the “almost” part, see the section on the Differences from GATK CountCovariates.

seal recab_table follows the normal Seal usage convention. See the section Program Usage for details.

Configurable Properties¶

Name	Meaning
seal.recab.rg-covariate.default-rg	Read group to assign to mappings without an RG tag. This value is mandatory if your data includes mappings that do not have a read group tag (RG).
seal.recab.smoothing	Smoothing parameter for empirical quality calculation (default: 0).
seal.recab.max-qscore	Upper limit for the empirical quality scores (default: 40).
seal.recab.skip-known-variant-sites	Set to false to ignore known variants DB (for testing purposes).

In addition, all the general Seal and Hadoop configuration properties apply.

Note

Config File Section Title: RecabTable

Counters¶

RecabTable has a number of counters to help you monitor what it’s doing. Here’s an explanation of what they mean.

Read Counters¶

Counter name	Explanation
Processed	Reads seen by RecabTable.
FilteredTotal	Reads seen and discarded.
FilteredUnmapped	Reads seen and discarded because they were unmapped.
FilteredMapQ	Reads seen and discarded because they had a mapq of 0 or 255
FilteredDuplicate	Reads seen and discarded because they were marked as duplicates.
FilteredQC	Reads seen and discarded because they were marked as having failed QC.
FilteredSecondaryAlignment	Reads seen and discarded because they were marked as secondary alignments.

Base Counters¶

Counter name	Explanation
All	Number of bases in the input.
Used	Bases used in table calculation.
BadBases	Bases skipped because they were unreadable (N) or base quality was 0.
VariantMismatches	Reference mismatches at known variant locations.
VariantBases	Bases at known variant locations.
NonVariantMismatches	Reference mismatches at a regular (not known variant) locations.
AdaptorBasesTrimmed	Adaptor bases trimmed (when insert is shorter than the read).

What RecabTable does¶

RecabTable looks at all the read mappings in the input data set. It ignores records that are:

unmapped
have a mapq of 0 or 255 (i.e. bad mapping quality or mapping quality unavailable)
are marked as duplicates, having failed QC, or as secondary alignments.

For the remaining reads, RecabTable looks at all the bases that have been matched to a reference coordinate (CIGAR ‘M’ operator). For these bases, it skips the ones that are aligned to a known variation site, that have a base quality score of 0 or that are not determined (N values). For the rest, it computes the values of the selected covariates and, for each combination of values, notes whether the base is a match or a mismatch to the reference.

For each identical combination of covariate values, RecabTable counts the number of times it was observed in the input data set, and how many of those times the base was a match or mismatch to the reference.

Covariates¶

The following covariates are used by RecabTable:

Read group
Base quality score
Sequencing cycle
Dinucleotide

An explanation of the covariates follows.

Read group¶

The value of this covariate is simply the value of the mapping’s RG tag. If the mapping does not have an RG tag the value specified in the seal.recab.rg-covariate.default-rg property is used.

Base Quality Score¶

The Phred-scaled quality score for each base.

Sequencing cycle¶

The run cycle during which the base was read. Bases from the second read in a pair are given negative cycle numbers.

Dinucleotide¶

Given a base in a read, its dinucleotide value is the pair of bases formed by the previous base and the base itself, where the previous base is the one that was sequenced immediately before the one in question. Since the first base in a read doesn’t have a previous base the first dinucleotide is NN. For instance, given the read

GAAGAAGGTGTGTGACC

dinucleotide values would be:

NN, GA, AA, AG, GA, ...

Note that the dinucleotides are given in the order they were read by the sequencer, so reads that are aligned to the reverse strand are complemented and reversed before extracting the nucleotide pairs (in the SAM input format all reads are normalized to the forward strand).

Output¶

RecabTable produces a CSV table with the following columns:

Read group
Base quality score
Cycle
Dinucleotide
Number of observations
Number of reference mismatches
Empirical quality score.

Number of observations¶

The number of times that combination of covariate values was seen.

Number of reference mismatches¶

The number of times that combination of covariate values was seen and the resulting base did not match the reference.

Empirical quality score¶

The empirical quality score is calculated according the following formula:

                   (mismatches + smoothing)
round( -10*log10 ( -------------------------- + eps )  )
                   (observations + smoothing)

The quality value is bounded between 1 and max-qscore.

Constants

eps	Min reasonable error: 1e-4
smoothing	seal.recab.smoothing (default: 0)
max-qscore	seal.recab.max-qscore (default: 40)

Differences from GATK CountCovariates¶

RecabTable produces results almost identical to GATK CountCovariates, but there are some small differences.

Read adapter clipping¶

While unusual, it can happen that a sequenced template is shorted than the read itself. In this case, the sequencer ends up reading part of the read adapter. Both GATK and Seal RecabTable take this into account, but GATK as of version 1.2-24-g6478681 has a small bug which causes it to clip the wrong bases in some circumstances. The GATK developers know about this issue and will surely address it quickly. However, as of the version tested (1.2-24-g6478681) this causes small differences in the covariates produced and the number of bases used by the two tools given the same input. In any case, this effect should be negligible for most sequencing runs.

Limitations¶

Hard-coded covariates¶

Currently, the set of covariates used by RecabTable is hard-coded and thus cannot be altered without editing the code and recompiling Seal. If you would like this feature to be added soon please let the Seal developers know by filing a feature request through the Seal web site.

List of known variants isn’t shared in memory¶

The list of known variants currently isn’t shared in memory by the map tasks. For the human genome, it’s relatively big and takes a while to load; in fact, RecabTable map tasks on average currently spend just as long loading the table as they spend doing useful work. We will make an effort to address this problem in the short-term, probably by using a memory-mapped binary index file, similar to what Seqal does with the reference sequence.

Table Of Contents

Previous topic

Next topic

Get Seal

Contributors

RecabTable¶

Usage¶

Configurable Properties¶

Counters¶

Read Counters¶

Base Counters¶

What RecabTable does¶

Covariates¶

Read group¶

Base Quality Score¶

Sequencing cycle¶

Dinucleotide¶

Output¶

Number of observations¶

Number of reference mismatches¶

Empirical quality score¶

Differences from GATK CountCovariates¶

Read adapter clipping¶

Limitations¶

Hard-coded covariates¶

List of known variants isn’t shared in memory¶

Navigation

Table Of Contents

Previous topic

Next topic

Get Seal

Contributors

Quick search

RecabTable¶

Usage¶

Configurable Properties¶

Counters¶

Read Counters¶

Base Counters¶

What RecabTable does¶

Covariates¶

Read group¶

Base Quality Score¶

Sequencing cycle¶

Dinucleotide¶

Output¶

Number of observations¶

Number of reference mismatches¶

Empirical quality score¶

Differences from GATK CountCovariates¶

Read adapter clipping¶

Limitations¶

Hard-coded covariates¶

List of known variants isn’t shared in memory¶

Navigation