ANALYSIS AND ESTIMATION OF THE TRIE MINIMUM LEVEL IN NON-HASH
DEDUPLICATION SYSTEM

Zhukov Maxim A., Afanasiev Dmitry B.

doi:10.17586/2226-1494-2015-15-3-470-475

2015 , VOLUME 15, NUMBER 3 ( May-June )

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

Publications

Editor-in-Chief

Nikiforov
Vladimir O.
D.Sc., Prof.

Partners

doi: 10.17586/2226-1494-2015-15-3-470-475

ANALYSIS AND ESTIMATION OF THE TRIE MINIMUM LEVEL IN NON-HASH DEDUPLICATION SYSTEM

M. A. Zhukov, D. B. Afanasiev

Read the full article

Article in русский

For citation: Zhukov M.A., Afanasiev D.B. Analysis and estimation of the trie minimum level in non-hash deduplication system. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2015, vol.15, no. 3, pp. 470–475.

Abstract

Subject of research. The paper deals with a method of restriction for the trie minimum level in non-hash data deduplication system. Method. The subject matter of the method lies in forcibly completing the trie to a specific minimum level. The proposed method makes it possible to increase performance of the process by reducing the number of collisions at the lower levels ofthe trie. The maximum theoretical performance growth corresponds to the share of collisions in the total number of data read operations from the storage medium. Proposed method application increases the metadata size to the amount of new structures containing one element. Main results. The results of the work have been proved by the data of computational experiment with non-has deduplication on 528 GB data set. The process analysis has shown that 99% of the execution time is taken to head positioning of hard-drives. The reason is a random distribution of the blocks on the storage medium. Application of the method of minimum level restriction for the trie in non-hash data deduplication system on the experimental data set gives the possibility to increase performance maximum by 16% and the increase of metadata size is 49%. The total amount of metadata is 34% less than with hash-based deduplication using the MD5 algorithm, and is 17% less than using Tiger192 algorithm. These results confirm the effectiveness of the proposed method. Practical relevance. The proposed method increases the performance of deduplication process by reducing the number of collisions in the trie construction. The results are of practical importance for professionals involved in the development of non-hash data deduplication methods.

Keywords: non-hash-based deduplication, deduplication performance increase, trie, prefix tree, I/O operations, metadata size, non-hash deduplication experiment.

References

1. Shcherbinin A. Resheniya po deduplikatsii dannykh. Storage News, 2008, no. 2 (35), pp. 2–7.

2. Chernyak L. Prosto o slozhnostyakh deduplikatsii [Easy about difficulties of deduplication]. Otkrytye Sistemy. SUBD, 2013, no. 3, pp. 54–55.

3. Meyer D.T, Bolosky W.J. A study of practical deduplication. ACM Transactions on Storage, 2012, vol. 7, no. 4, art. 14. doi: 10.1145/2078861.2078864

4. Meister D. Advanced Data Deduplication Techniques and their Application, PhD dissertation. Mainz, Johannes Gutenberg University, 2013, 230 p.

5. Bhagwat D., Eshghi K., Long D.D.E., Lillibridge M. Extreme binning: scalable, parallel deduplication for chunk-based file backup. Proc. 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS. London, UK, 2009, pp. 237–245. doi: 10.1109/MASCOT.2009.5366623

6. Kazakov V.G., Fedosin S.A., Plotnikova N.P. Sposob adaptivnoi deduplikatsii s primeneniem mnogourovnevogo indeksa razmeshcheniya kopiruemykh blokov dannykh [Method of adaptive deduplication with multilevel block indexing]. Fundamental'nye Issledovaniya, 2013, no. 8–6, pp. 1322– 1325.

7. Maddodi S., Attigeri G.V., Karunakar A.K. Data deduplication techniques and analysis. Proc. 3rd Int. Conf. on Emerging Trends in Engineering and Technology, ICETET 2010. Goa, India, 2010, pp. 664–668. doi: 10.1109/ICETET.2010.42

8. Orlando K., Bautista M.M., Mejia J.R.M., Langnor R.G. IBM ProtecTIER Implementation and Best Practices Guide. IBM Redbooks, 2014, 578 p.

9. Biplob D., Sudipta S., Jin L. ChunkStash: speeding up inline storage deduplication using flash memory. Proc. 2010 USENIX Annual Technical Conference, 2010, p. 16.

10. Osuna A., Balogh E., Galante de Carvalho R.A., Javier R.F., Mann Z. Implementing IBM Storage Data Deduplication Solutions. IBM Redbooks, 2011, 322 p.

11. Sheu R.-K., Yuan S.-M., Lo W.-T., Ku C.-I. Design and implementation of file deduplication framework on HDFS. International Journal of Distributed Sensor Networks, 2014, vol. 2014, art. 561340. doi: 10.1155/2014/561340

12. Srinivasan K., Bisson T., Goodson G., Voruganti K. iDedup: latency-aware, inline data deduplication for primary storage. Proc. 10th USENIX Conference on File and Storage Technologies, FAST 2012. San Jose, USA, 2012, pp. 299–312.

13. Knuth D.E. The Art of Computer Programming. Vol. 3. Storing and Searching. Addison-Wesley, 1998, 780 p.

14. Aho A., Hopcroft J., Ullman J. Data Structures and Algorithms. Addison-Wesley, 1983, 427 p.

15. Zhukov M.A., Afanasiev D.B. Verifikatsiya blokov dannykh v sisteme beskheshevoi deduplikatsii [Verification of data blocks in the system of non-hesh deduplication]. Sbornik Tezisov Dokladov II Kongressa Molodykh Uchenykh, vyp. 1 [Proc. II Congress of Young Scientists, vol. 1]. St. Petersburg, NRU ITMO Publ., 2014, p. 78.

16. Zhukov M.A., Afanasiev D.B. Khranenie metadannykh blokov v strukture dannykh prefiksnogo dereva [Metadata blocks storage in the data structure of Trie]. Sbornik Trudov Molodykh Uchenykh i Sotrudnikov Kafedry VT. Vypusk 5 [Proc. of Young Scientists and Staff of the Department VT. Issue 5]. St. Petersburg, NIU ITMO, 2014, pp. 12–15.

17. Zhukov M.A., Afanasiev D.B. Poryadok postroeniya prefiksnogo dereva v sisteme beskheshevoi deduplikatsii [The procedure for constructing Trie in non-hash deduplication system]. Trudy XXI Vserossiiskoi NauchnoMetodicheskoi Konferentsii “Telematika'2014” [Proc. of All-Russion Scientific and Methodical Conferences “Telematika’2014"]. St.Petersburg, 2014, p. 89.

18. Zhukov M.A. Nastroika parametrov deduplikatsii [Configuring deduplication parameters]. Sbornik Tezisov Dokladov Kongressa Molodykh

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License