doi: 10.17586/2226-1494-2025-25-2-339-344


Vector search using method of clustering using ensemble of oblivious trees 

N. A. Tomilov, V. P. Turov, A. A. Babayants, A. V. Platonov


Article in English

For citation:
Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. Vector search using method of clustering using ensemble of oblivious trees. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 2, pp. 339–344. doi: 10.17586/2226-1494-2025-25-2-339-344


Abstract
Information retrieval with machine learning algorithms is based on transforming the original multimodal documents into vector representations. These vectors are then indexed, and the search is performed within this index. A popular indexing approach is vector clustering, for example with k-nearest neighbors. We propose a clustering method based on an ensemble of oblivious decision trees and introduce a vector search algorithm built on this method. The proposed clustering method is deterministic and supports serialization of the ensemble parameters. The essence of the method is to train an ensemble of binary or ternary oblivious trees; this ensemble is then used to compute a hash for each of the original vectors, and vectors with the same hash are considered to belong to the same cluster. For searching, several clusters whose centroids are closest to the vector representation of the search query are selected, followed by a full search over the vector representations within the selected clusters. The proposed method demonstrates search quality comparable to widely used industry-standard vector search libraries such as Faiss, Annoy, and HNSWlib. For datasets with an angular distance metric, the proposed method achieves accuracy equal to or better than existing solutions; for datasets with a Euclidean distance metric, the search quality is on par with them. The developed method can be applied to improve search quality in the development of multimodal search systems. The ability to serialize the ensemble makes it possible to cluster data on one computational node and transfer the ensemble parameters to another, allowing the proposed algorithm to be used in distributed systems.
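To make the two stages described in the abstract concrete (hashing vectors with an ensemble of oblivious trees to form clusters, then restricting the full search to the clusters with the nearest centroids), the sketch below illustrates the idea in plain NumPy. It is not the authors' implementation: the class names (ObliviousTree, OTEnsembleIndex), the split rule (a random dimension with a median threshold at each level), the use of binary trees only, Euclidean distance, and all parameter values are illustrative assumptions.

```python
import numpy as np

class ObliviousTree:
    """Binary oblivious tree: every node of a level shares one (dimension, threshold)
    split, so a vector maps to a `depth`-bit code."""
    def __init__(self, data, depth, rng):
        # Assumed split rule: random dimensions, median thresholds.
        dims = rng.choice(data.shape[1], size=depth, replace=False)
        self.splits = [(int(d), float(np.median(data[:, d]))) for d in dims]

    def code(self, v):
        bits = 0
        for dim, thr in self.splits:
            bits = (bits << 1) | int(v[dim] > thr)
        return bits

class OTEnsembleIndex:
    def __init__(self, data, n_trees=4, depth=3, seed=0):
        rng = np.random.default_rng(seed)          # fixed seed -> deterministic clustering
        self.data = data
        self.depth = depth
        self.trees = [ObliviousTree(data, depth, rng) for _ in range(n_trees)]
        # Hash = concatenation of per-tree codes; equal hash -> same cluster.
        hashes = np.array([self._hash(v) for v in data])
        self.clusters = {int(h): np.where(hashes == h)[0] for h in np.unique(hashes)}
        self.centroids = {h: data[idx].mean(axis=0) for h, idx in self.clusters.items()}

    def _hash(self, v):
        h = 0
        for tree in self.trees:
            h = (h << self.depth) | tree.code(v)
        return h

    def search(self, query, k=5, n_probe=3):
        # Step 1: select the n_probe clusters whose centroids are closest to the query.
        nearest = sorted(self.centroids,
                         key=lambda h: np.linalg.norm(self.centroids[h] - query))[:n_probe]
        # Step 2: full (exhaustive) search restricted to the selected clusters.
        candidates = np.concatenate([self.clusters[h] for h in nearest])
        dists = np.linalg.norm(self.data[candidates] - query, axis=1)
        return candidates[np.argsort(dists)[:k]]

# Usage: index 10,000 random 64-dimensional vectors and query with one of them.
data = np.random.default_rng(1).normal(size=(10_000, 64)).astype(np.float32)
index = OTEnsembleIndex(data, n_trees=4, depth=3)
print(index.search(data[0], k=5))   # indices of the 5 nearest vectors found
```

Because the trees and their thresholds are plain arrays of numbers, the ensemble parameters in such a scheme can be serialized on one node and reused on another, which is the property the abstract highlights for distributed use.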

Keywords: vector representations, vector search, embeddings, oblivious decision tree

References
1. Grbovic M., Cheng H. Real-time personalization using embeddings for search ranking at Airbnb. Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 311–320. https://doi.org/10.1145/3219819.3219885
2. Berry M.W., Drmac Z., Jessup E.R. Matrices, vector spaces, and information retrieval. SIAM Review, 1999, vol. 41, no. 2, pp. 335–362. https://doi.org/10.1137/S0036144598347035
3. Korn F., Sidiropoulos N., Faloutsos C., Siegel E., Protopapas Z. Fast nearest neighbor search in medical image databases. Proc. of the 22nd International Conference on Very Large Data Bases (VLDB '96), 1996, pp. 215–226. https://doi.org/10.5555/645922.673493
4. Seidl T., Kriegel H.-P. Optimal multi-step k-nearest neighbor search. ACM SIGMOD Record, 1998, vol. 27, no. 2, pp. 154–165. https://doi.org/10.1145/276305.276319
5. Lou Y., Obukhov M. BDT: Gradient Boosted Decision Tables for high accuracy and scoring efficiency. Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 2017, pp. 1893–1901. https://doi.org/10.1145/3097983.3098175
6. Chakrabarti S. Efficient image retrieval using multi neural hash codes and bloom filters. Proc. of the IEEE International Conference for Innovation in Technology (INOCON), 2020, pp. 1–6. https://doi.org/10.1109/INOCON50539.2020.9298228
7. Rokach L., Maimon O. Data Mining with Decision Trees: Theory and Applications. 2nd ed. World Scientific Publishing, 2014. 328 p.
8. Fumero J., Papadimitriou M., Zakkak F.S., Xekalaki M., Clarkson J., Kotselidis C. Dynamic application reconfiguration on heterogeneous hardware. Proc. of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2019, pp. 165–178. https://doi.org/10.1145/3313808.3313819
9. Xiao H., Rasul K., Vollgraf R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017, arXiv:1708.07747. https://doi.org/10.48550/arXiv.1708.07747
10. Aumüller M., Bernhardsson E., Faithfull A. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Lecture Notes in Computer Science, 2017, vol. 10609, pp. 34–49. https://doi.org/10.1007/978-3-319-68474-1_3
11. Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. A method of storing vector data in compressed form using clustering. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 1, pp. 112–117. (in Russian). https://doi.org/10.17586/2226-1494-2024-24-1-112-117
12. Douze M., Guzhva A., Deng C., Johnson J., Szilvasy G., Mazaré P.-E., Lomeli M., Hosseini L., Jégou H. The Faiss library. arXiv, 2024, arXiv:2401.08281. https://doi.org/10.48550/arXiv.2401.08281
13. Malkov Y., Yashunin D. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, vol. 42, no. 4, pp. 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
14. Tomilov N.A., Turov V.P., Babayants A.A. Developing a tool to compare vector search algorithms. Proc. of the 11th Congress of Young Scientists, 2022, vol. 1, pp. 446–450. (in Russian).
15. Laaber C., Leitner P. An evaluation of open-source software microbenchmark suites for continuous performance assessment. Proc. of the 15th International Conference on Mining Software Repositories (MSR '18), 2018, pp. 119–130. https://doi.org/10.1145/3196398.3196407


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
