RecabTable is a Hadoop program to calculate a table of base qualities for all values of a given set of factors. It computes a result equivalent to the GATK CountCovariatesWalker in the Seal framework.
To run RecabTable, launch seal recab_table. You will need to provide input and output paths, and a database of known variation sites in ROD or VCF format. For example,
seal recab_table --vcf-file dbsnp.vcf sam_directory recab_table_output
Input files must be in SAM format without a header (like the ones produced by Seqal and ReadSort). Output files are all CSV, without a header. The known variants file must be on HDFS.
To assemble a single CSV from the RecabTable output that can be used by GATK TableRecalibration use:
seal recab_table_fetch recab_table_output recab_table.csv
or:
seal recab_table_fetch recab_table_output > recab_table.csv
The utility seal recab_table_fetch simply copies the data generated by RecabTable into one local file while inserting the appropriate data header so that GATK will accept it.
In its default configuration RecabTable produces output that is almost identical to what is produced by this GATK CountCovariates command:
java -jar GATK-1.2-24-g6478681.jar \
-T CountCovariates \
-cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate \
-R ref.fasta -recalFile output.csv -knownSites dbsnp.vcf -I data.bam
Regarding the “almost” part, see the section on the Differences from GATK CountCovariates.
seal recab_table follows the normal Seal usage convention. See the section Program Usage for details.
Name | Meaning |
seal.recab.rg-covariate.default-rg | Read group to assign to mappings without an RG tag. This value is mandatory if your data includes mappings that do not have a read group tag (RG). |
seal.recab.smoothing | Smoothing parameter for empirical quality calculation (default: 0). |
seal.recab.max-qscore | Upper limit for the empirical quality scores (default: 40). |
seal.recab.skip-known-variant-sites | Set to false to ignore known variants DB (for testing purposes). |
In addition, all the general Seal and Hadoop configuration properties apply.
Note
Config File Section Title: RecabTable
RecabTable has a number of counters to help you monitor what it’s doing. Here’s an explanation of what they mean.
Counter name | Explanation |
Processed | Reads seen by RecabTable. |
FilteredTotal | Reads seen and discarded. |
FilteredUnmapped | Reads seen and discarded because they were unmapped. |
FilteredMapQ | Reads seen and discarded because they had a mapq of 0 or 255 |
FilteredDuplicate | Reads seen and discarded because they were marked as duplicates. |
FilteredQC | Reads seen and discarded because they were marked as having failed QC. |
FilteredSecondaryAlignment | Reads seen and discarded because they were marked as secondary alignments. |
Counter name | Explanation |
All | Number of bases in the input. |
Used | Bases used in table calculation. |
BadBases | Bases skipped because they were unreadable (N) or base quality was 0. |
VariantMismatches | Reference mismatches at known variant locations. |
VariantBases | Bases at known variant locations. |
NonVariantMismatches | Reference mismatches at a regular (not known variant) locations. |
AdaptorBasesTrimmed | Adaptor bases trimmed (when insert is shorter than the read). |
RecabTable looks at all the read mappings in the input data set. It ignores records that are:
For the remaining reads, RecabTable looks at all the bases that have been matched to a reference coordinate (CIGAR ‘M’ operator). For these bases, it skips the ones that are aligned to a known variation site, that have a base quality score of 0 or that are not determined (N values). For the rest, it computes the values of the selected covariates and, for each combination of values, notes whether the base is a match or a mismatch to the reference.
For each identical combination of covariate values, RecabTable counts the number of times it was observed in the input data set, and how many of those times the base was a match or mismatch to the reference.
The following covariates are used by RecabTable:
An explanation of the covariates follows.
The value of this covariate is simply the value of the mapping’s RG tag. If the mapping does not have an RG tag the value specified in the seal.recab.rg-covariate.default-rg property is used.
The Phred-scaled quality score for each base.
The run cycle during which the base was read. Bases from the second read in a pair are given negative cycle numbers.
Given a base in a read, its dinucleotide value is the pair of bases formed by the previous base and the base itself, where the previous base is the one that was sequenced immediately before the one in question. Since the first base in a read doesn’t have a previous base the first dinucleotide is NN. For instance, given the read
GAAGAAGGTGTGTGACC
dinucleotide values would be:
NN, GA, AA, AG, GA, ...
Note that the dinucleotides are given in the order they were read by the sequencer, so reads that are aligned to the reverse strand are complemented and reversed before extracting the nucleotide pairs (in the SAM input format all reads are normalized to the forward strand).
RecabTable produces a CSV table with the following columns:
The number of times that combination of covariate values was seen.
The number of times that combination of covariate values was seen and the resulting base did not match the reference.
The empirical quality score is calculated according the following formula:
(mismatches + smoothing)
round( -10*log10 ( -------------------------- + eps ) )
(observations + smoothing)
The quality value is bounded between 1 and max-qscore.
Constants
eps | Min reasonable error: 1e-4 |
smoothing | seal.recab.smoothing (default: 0) |
max-qscore | seal.recab.max-qscore (default: 40) |
RecabTable produces results almost identical to GATK CountCovariates, but there are some small differences.
While unusual, it can happen that a sequenced template is shorted than the read itself. In this case, the sequencer ends up reading part of the read adapter. Both GATK and Seal RecabTable take this into account, but GATK as of version 1.2-24-g6478681 has a small bug which causes it to clip the wrong bases in some circumstances. The GATK developers know about this issue and will surely address it quickly. However, as of the version tested (1.2-24-g6478681) this causes small differences in the covariates produced and the number of bases used by the two tools given the same input. In any case, this effect should be negligible for most sequencing runs.
Currently, the set of covariates used by RecabTable is hard-coded and thus cannot be altered without editing the code and recompiling Seal. If you would like this feature to be added soon please let the Seal developers know by filing a feature request through the Seal web site.