Table Of Contents

Previous topic

MergeAlignments

Next topic

Seal Utilities

Get Seal

Contributors

Seal is developed by: crs4 logo

And generously hosted by Get SEAL at SourceForge.net. Fast, secure and Free Open Source software downloads
and GitHub

RecabTable

RecabTable is a Hadoop program to calculate a table of base qualities for all values of a given set of factors. It computes a result equivalent to the GATK CountCovariatesWalker in the Seal framework.

Usage

To run RecabTable, launch seal recab_table. You will need to provide input and output paths, and a database of known variation sites in ROD or VCF format. For example,

seal recab_table --vcf-file dbsnp.vcf sam_directory recab_table_output

Input files must be in SAM format without a header (like the ones produced by Seqal and ReadSort). Output files are all CSV, without a header. The known variants file must be on HDFS.

To assemble a single CSV from the RecabTable output that can be used by GATK TableRecalibration use:

seal recab_table_fetch recab_table_output recab_table.csv

or:

seal recab_table_fetch recab_table_output > recab_table.csv

The utility seal recab_table_fetch simply copies the data generated by RecabTable into one local file while inserting the appropriate data header so that GATK will accept it.

In its default configuration RecabTable produces output that is almost identical to what is produced by this GATK CountCovariates command:

java -jar GATK-1.2-24-g6478681.jar \
-T CountCovariates \
-cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate \
-R ref.fasta -recalFile output.csv -knownSites dbsnp.vcf -I data.bam

Regarding the “almost” part, see the section on the Differences from GATK CountCovariates.

seal recab_table follows the normal Seal usage convention. See the section Program Usage for details.

Configurable Properties

Name Meaning
seal.recab.rg-covariate.default-rg Read group to assign to mappings without an RG tag. This value is mandatory if your data includes mappings that do not have a read group tag (RG).
seal.recab.smoothing Smoothing parameter for empirical quality calculation (default: 0).
seal.recab.max-qscore Upper limit for the empirical quality scores (default: 40).
seal.recab.skip-known-variant-sites Set to false to ignore known variants DB (for testing purposes).

In addition, all the general Seal and Hadoop configuration properties apply.

Note

Config File Section Title: RecabTable

Counters

RecabTable has a number of counters to help you monitor what it’s doing. Here’s an explanation of what they mean.

Read Counters

Counter name Explanation
Processed Reads seen by RecabTable.
FilteredTotal Reads seen and discarded.
FilteredUnmapped Reads seen and discarded because they were unmapped.
FilteredMapQ Reads seen and discarded because they had a mapq of 0 or 255
FilteredDuplicate Reads seen and discarded because they were marked as duplicates.
FilteredQC Reads seen and discarded because they were marked as having failed QC.
FilteredSecondaryAlignment Reads seen and discarded because they were marked as secondary alignments.

Base Counters

Counter name Explanation
All Number of bases in the input.
Used Bases used in table calculation.
BadBases Bases skipped because they were unreadable (N) or base quality was 0.
VariantMismatches Reference mismatches at known variant locations.
VariantBases Bases at known variant locations.
NonVariantMismatches Reference mismatches at a regular (not known variant) locations.
AdaptorBasesTrimmed Adaptor bases trimmed (when insert is shorter than the read).

What RecabTable does

RecabTable looks at all the read mappings in the input data set. It ignores records that are:

  • unmapped
  • have a mapq of 0 or 255 (i.e. bad mapping quality or mapping quality unavailable)
  • are marked as duplicates, having failed QC, or as secondary alignments.

For the remaining reads, RecabTable looks at all the bases that have been matched to a reference coordinate (CIGAR ‘M’ operator). For these bases, it skips the ones that are aligned to a known variation site, that have a base quality score of 0 or that are not determined (N values). For the rest, it computes the values of the selected covariates and, for each combination of values, notes whether the base is a match or a mismatch to the reference.

For each identical combination of covariate values, RecabTable counts the number of times it was observed in the input data set, and how many of those times the base was a match or mismatch to the reference.

Covariates

The following covariates are used by RecabTable:

  • Read group
  • Base quality score
  • Sequencing cycle
  • Dinucleotide

An explanation of the covariates follows.

Read group

The value of this covariate is simply the value of the mapping’s RG tag. If the mapping does not have an RG tag the value specified in the seal.recab.rg-covariate.default-rg property is used.

Base Quality Score

The Phred-scaled quality score for each base.

Sequencing cycle

The run cycle during which the base was read. Bases from the second read in a pair are given negative cycle numbers.

Dinucleotide

Given a base in a read, its dinucleotide value is the pair of bases formed by the previous base and the base itself, where the previous base is the one that was sequenced immediately before the one in question. Since the first base in a read doesn’t have a previous base the first dinucleotide is NN. For instance, given the read

GAAGAAGGTGTGTGACC

dinucleotide values would be:

NN, GA, AA, AG, GA, ...

Note that the dinucleotides are given in the order they were read by the sequencer, so reads that are aligned to the reverse strand are complemented and reversed before extracting the nucleotide pairs (in the SAM input format all reads are normalized to the forward strand).

Output

RecabTable produces a CSV table with the following columns:

  1. Read group
  2. Base quality score
  3. Cycle
  4. Dinucleotide
  5. Number of observations
  6. Number of reference mismatches
  7. Empirical quality score.

Number of observations

The number of times that combination of covariate values was seen.

Number of reference mismatches

The number of times that combination of covariate values was seen and the resulting base did not match the reference.

Empirical quality score

The empirical quality score is calculated according the following formula:

                   (mismatches + smoothing)
round( -10*log10 ( -------------------------- + eps )  )
                   (observations + smoothing)

The quality value is bounded between 1 and max-qscore.

Constants

eps Min reasonable error: 1e-4
smoothing seal.recab.smoothing (default: 0)
max-qscore seal.recab.max-qscore (default: 40)

Differences from GATK CountCovariates

RecabTable produces results almost identical to GATK CountCovariates, but there are some small differences.

Read adapter clipping

While unusual, it can happen that a sequenced template is shorted than the read itself. In this case, the sequencer ends up reading part of the read adapter. Both GATK and Seal RecabTable take this into account, but GATK as of version 1.2-24-g6478681 has a small bug which causes it to clip the wrong bases in some circumstances. The GATK developers know about this issue and will surely address it quickly. However, as of the version tested (1.2-24-g6478681) this causes small differences in the covariates produced and the number of bases used by the two tools given the same input. In any case, this effect should be negligible for most sequencing runs.

Limitations

Hard-coded covariates

Currently, the set of covariates used by RecabTable is hard-coded and thus cannot be altered without editing the code and recompiling Seal. If you would like this feature to be added soon please let the Seal developers know by filing a feature request through the Seal web site.

List of known variants isn’t shared in memory

The list of known variants currently isn’t shared in memory by the map tasks. For the human genome, it’s relatively big and takes a while to load; in fact, RecabTable map tasks on average currently spend just as long loading the table as they spend doing useful work. We will make an effort to address this problem in the short-term, probably by using a memory-mapped binary index file, similar to what Seqal does with the reference sequence.