ERROR CORRECTION METHOD FOR SEQUENCING DATA WITH INSERTIONS AND DELETIONS
For citation: Alexandrov A.V., Shalyto A.A. Error correction method for sequencing data with insertions and deletions. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2016, vol. 16, no. 1, pp. 108–114.
Subject of Research. A method was developed for correcting errors, including insertions and deletions, in sequencing reads of a haploid organism. It was tested on two libraries: a synthetic dataset for the bacterium Escherichia coli and a real dataset of reads for Pseudomonas stutzeri. Method. The method uses k-mers, but only to find reads that are close to each other. For each group of close reads a consensus string is built, which is then used to correct errors in the original reads. Main Results. The algorithm is implemented as a standalone program, which has been tested on both real and synthetic data. The method outperforms the other known methods; N50, total contig length and maximum contig length were used as the comparison metrics. Practical Relevance. The method can be used together with existing genome assembly methods that are not suited to reads containing insertion and deletion errors.
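The three steps named in the abstract (finding close reads through shared k-mers, building a consensus string, correcting the reads) can be illustrated with a minimal Python sketch. It is not the authors' program: the abstract does not say how closeness is measured or how the consensus is built, so the shared k-mer threshold, the edit-distance alignment, the majority vote, and all function and parameter names (find_close_reads, align_slots, consensus, correct_reads, k, min_shared) are assumptions made for illustration. The sketch also assumes that close reads cover the same region of the genome, which real reads generally do not.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Return the set of k-mers occurring in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def find_close_reads(reads, k, min_shared):
    """Map each read index to the reads sharing at least min_shared k-mers.

    Assumption: "close" means "shares enough k-mers"; the abstract does not
    specify the actual criterion.
    """
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for km in kmers(read, k):
            index[km].add(rid)
    close = {}
    for rid, read in enumerate(reads):
        shared = Counter()
        for km in kmers(read, k):
            for other in index[km]:
                if other != rid:
                    shared[other] += 1
        close[rid] = [o for o, c in shared.items() if c >= min_shared]
    return close


def align_slots(target, other):
    """Globally align other to target with unit-cost edit distance.

    Returns len(target) + 1 strings: slot 0 holds bases of other inserted
    before the target, slot i holds what other has in place of target[i - 1]
    plus any bases inserted right after it ("" means a deletion).
    """
    n, m = len(target), len(other)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (target[i - 1] != other[j - 1]),
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)
    slots = [""] * (n + 1)
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (target[i - 1] != other[j - 1]):
            slots[i] = other[j - 1] + slots[i]   # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                               # target base absent in other
        else:
            slots[i] = other[j - 1] + slots[i]   # extra base in other
            j -= 1
    return slots


def consensus(target, neighbours):
    """Majority vote per slot; the target read casts the first vote."""
    votes = [Counter() for _ in range(len(target) + 1)]
    votes[0][""] += 1
    for i, base in enumerate(target):
        votes[i + 1][base] += 1
    for other in neighbours:
        for i, slot in enumerate(align_slots(target, other)):
            votes[i][slot] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


def correct_reads(reads, k=4, min_shared=3):
    """Replace every read by the consensus of itself and its close reads."""
    close = find_close_reads(reads, k, min_shared)
    return [consensus(read, [reads[o] for o in close[rid]])
            for rid, read in enumerate(reads)]


if __name__ == "__main__":
    reads = ["ACGTTGCAAGGCTA",    # error-free
             "ACGTTGCAAGGCTA",    # error-free
             "ACGTTGAAGGCTA",     # one deleted base
             "ACGTTGCAAGTGCTA",   # one inserted base
             "ACGTTGCAAGGCTA"]    # error-free
    for before, after in zip(reads, correct_reads(reads)):
        print(before, "->", after)
```

In this toy input three of the five reads are error-free, so the vote restores the deleted base in the third read and drops the inserted base in the fourth. Because k-mers are used only to select candidate reads, an insertion or deletion that shifts every downstream position does not prevent the reads from being grouped; the alignment step then handles the shift.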