Menu
Publications
2025
2024
2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2001
Editor-in-Chief

Nikiforov
Vladimir O.
D.Sc., Prof.
Partners
doi: 10.17586/2226-1494-2025-25-2-339-344
Vector search using method of clustering using ensemble of oblivious trees
Read the full article

Article in English
For citation:
Abstract
For citation:
Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. Vector search using method of clustering using ensemble of oblivious trees. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 2, pp. 339–344. doi: 10.17586/2226-1494-2025-25-2-339-344
Abstract
Information retrieval using machine learning algorithms is based on transforming the original multimodal documents into vector representations. These vectors are then indexed, and the search is performed within this index. A popular method for indexing is vector clustering such as with k-nearest neighbors. We propose a clustering method based on an ensemble of Oblivious Decision Trees and introduce a vector search algorithm built on this method. The proposed clustering method is deterministic and supports parameter serialization for the ensemble. The essence of the method involves training an ensemble of binary or ternary Oblivious Trees. This ensemble is then used to compute a hash for each of the original vectors. Vectors with the same hash are considered to belong to the same cluster. For searching, several clusters are selected whose centroids are closest to the vector representation of the search query followed by a full search of the vector representations within the selected clusters. The proposed method demonstrates search quality comparable to widely used industry-standard vector search libraries, such as Faiss, Annoy, and HNSWlib. For datasets with an angular distance metric, the proposed search method achieves accuracy equal to or better than existing solutions. For datasets with a Euclidean distance metric, the search quality is on par with existing solutions. The developed method can be applied to improve search quality in the development of multimodal search systems. The ability to serialize enables clustering data on one computational node and transferring ensemble parameters to another, allowing the proposed algorithm to be utilized in distributed systems.
Keywords: vector representations, vector search, embeddings, oblivious decision tree
References
References
1. Grbovic M., Cheng H. Real-time personalization using embeddings for search ranking at Airbnb. Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 311–320. https://doi.org/10.1145/3219819.3219885
2. Berry M.W., Drmac Z., Jessup E.R. Matrices, vector spaces, and information retrieval. SIAM Review, 1999, vol. 41, no. 2, pp. 335–362. https://doi.org/10.1137/S0036144598347035
3. Korn F., Sidiropoulos N., Faloutsos C., Siegel E., Protopapas Z Fast nearest neighbor search in medical image databases. Proc. of the 22th International Conference on Very Large Data Bases (VLDB '96), 1996, pp. 215–226. https://doi.org/10.5555/645922.673493
4. Seidl T., Kriegel H.-P. Optimal multi-step k-nearest neighbor search. ACM SIGMOD Record, 1998, vol. 27, no. 2, pp. 154–165. https://doi.org/10.1145/276305.276319
5. Lou Y., Obukhov M. BDT: Gradient Boosted Decision Tables for high accuracy and scoring efficiency. Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 2017, pp. 1893–1901. https://doi.org/10.1145/3097983.3098175
6. Chakrabarti S. Efficient image retrieval using multi neural hash codes and bloom filters. Proc. of the IEEE International Conference for Innovation in Technology (INOCON), 2020, pp. 1–6. https://doi.org/10.1109/INOCON50539.2020.9298228
7. Rokach L., Maimon O. Data Mining with Decision Trees: Theory and Applications. 2nd ed. World Scientific Publishing, 2014. 328 p.
8. Fumero J., Papadimitriou M., Zakkak F.S., Xekalaki M., Clarkson J., Kotselidis C. Dynamic application reconfiguration on heterogeneous hardware. Proc. of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2019, pp. 165–178. https://doi.org/10.1145/3313808.3313819
9. Xiao H., Rasul K., Vollgraf R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017, arXiv:1708.07747. https://doi.org/10.48550/arXiv.1708.07747
10. Aumüller M., Bernhardsson E., Faithfull A. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Lecture Notes in Computer Science, 2017, vol. 10609, pp. 34–49. https://doi.org/10.1007/978-3-319-68474-1_3
11. Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. A method of storing vector data in compressed form using clustering. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 1, pp. 112–117. (in Russian). https://doi.org/10.17586/2226-1494-2024-24-1-112-117
12. Douze M., Guzhva A., Deng C., Johnson J., Szilvasy G., Mazaré P.-E., Lomeli M., Hosseini L., Jégou H. The Faiss library. arXiv, 2024, arXiv.2401.08281. https://doi.org/10.48550/arXiv.2401.08281
13. Malkov Y., Yashunin D. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, vol. 42, no. 4. pp. 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
15. Laaber C., Leitner P. An evaluation of open-source software microbenchmark suites for continuous performance assessment. Proc. of the 15th International Conference on Mining Software Repositories (MSR '18), 2018, pp. 119–130. https://doi.org/10.1145/3196398.3196407