doi: 10.17586/2226-1494-2025-25-2-339-344


Vector search using method of clustering using ensemble of oblivious trees 

N. A. Tomilov, V. P. Turov, A. A. Babayants, A. V. Platonov


Article in English

For citation:
Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. Vector search using method of clustering using ensemble of oblivious trees. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 2, pp. 339–344. doi: 10.17586/2226-1494-2025-25-2-339-344


Abstract
Information retrieval with machine learning algorithms is based on transforming the original multimodal documents into vector representations. These vectors are then indexed, and the search is performed within this index. A popular indexing approach is vector clustering, for example with k-nearest neighbors. We propose a clustering method based on an ensemble of oblivious decision trees and introduce a vector search algorithm built on this method. The proposed clustering method is deterministic and supports serialization of the ensemble parameters. The essence of the method is to train an ensemble of binary or ternary oblivious trees; this ensemble is then used to compute a hash for each of the original vectors, and vectors with the same hash are considered to belong to the same cluster. For searching, several clusters whose centroids are closest to the vector representation of the search query are selected, followed by a full search over the vector representations within the selected clusters. The proposed method demonstrates search quality comparable to widely used industry-standard vector search libraries such as Faiss, Annoy, and HNSWlib. For datasets with an angular distance metric, the proposed method achieves accuracy equal to or better than existing solutions; for datasets with a Euclidean distance metric, the search quality is on par with them. The developed method can be applied to improve search quality in the development of multimodal search systems. The ability to serialize the ensemble makes it possible to cluster data on one computational node and transfer the ensemble parameters to another, allowing the proposed algorithm to be used in distributed systems.
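To make the two stages described in the abstract concrete (hashing vectors with an ensemble of oblivious trees to form clusters, then restricting the full search to the clusters with the nearest centroids), the sketch below illustrates the idea in plain NumPy. It is not the authors' implementation: the class names (ObliviousTree, OTEnsembleIndex), the split rule (a random dimension with a median threshold at each level), the use of binary trees only, Euclidean distance, and all parameter values are illustrative assumptions.

```python
import numpy as np

class ObliviousTree:
    """Binary oblivious tree: every node of a level shares one (dimension, threshold)
    split, so a vector maps to a `depth`-bit code."""
    def __init__(self, data, depth, rng):
        # Assumed split rule: random dimensions, median thresholds.
        dims = rng.choice(data.shape[1], size=depth, replace=False)
        self.splits = [(int(d), float(np.median(data[:, d]))) for d in dims]

    def code(self, v):
        bits = 0
        for dim, thr in self.splits:
            bits = (bits << 1) | int(v[dim] > thr)
        return bits

class OTEnsembleIndex:
    def __init__(self, data, n_trees=4, depth=3, seed=0):
        rng = np.random.default_rng(seed)          # fixed seed -> deterministic clustering
        self.data = data
        self.depth = depth
        self.trees = [ObliviousTree(data, depth, rng) for _ in range(n_trees)]
        # Hash = concatenation of per-tree codes; equal hash -> same cluster.
        hashes = np.array([self._hash(v) for v in data])
        self.clusters = {int(h): np.where(hashes == h)[0] for h in np.unique(hashes)}
        self.centroids = {h: data[idx].mean(axis=0) for h, idx in self.clusters.items()}

    def _hash(self, v):
        h = 0
        for tree in self.trees:
            h = (h << self.depth) | tree.code(v)
        return h

    def search(self, query, k=5, n_probe=3):
        # Step 1: select the n_probe clusters whose centroids are closest to the query.
        nearest = sorted(self.centroids,
                         key=lambda h: np.linalg.norm(self.centroids[h] - query))[:n_probe]
        # Step 2: full (exhaustive) search restricted to the selected clusters.
        candidates = np.concatenate([self.clusters[h] for h in nearest])
        dists = np.linalg.norm(self.data[candidates] - query, axis=1)
        return candidates[np.argsort(dists)[:k]]

# Usage: index 10,000 random 64-dimensional vectors and query with one of them.
data = np.random.default_rng(1).normal(size=(10_000, 64)).astype(np.float32)
index = OTEnsembleIndex(data, n_trees=4, depth=3)
print(index.search(data[0], k=5))   # indices of the 5 nearest vectors found
```

Because the trees and their thresholds are plain arrays of numbers, the ensemble parameters in such a scheme can be serialized on one node and reused on another, which is the property the abstract highlights for distributed use.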

Keywords: vector representations, vector search, embeddings, oblivious decision tree

References
1. Grbovic M., Cheng H. Real-time personalization using embeddings for search ranking at Airbnb. Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 311–320. https://doi.org/10.1145/3219819.3219885
2. Berry M.W., Drmac Z., Jessup E.R. Matrices, vector spaces, and information retrieval. SIAM Review, 1999, vol. 41, no. 2, pp. 335–362. https://doi.org/10.1137/S0036144598347035
3. Korn F., Sidiropoulos N., Faloutsos C., Siegel E., Protopapas Z. Fast nearest neighbor search in medical image databases. Proc. of the 22nd International Conference on Very Large Data Bases (VLDB '96), 1996, pp. 215–226. https://doi.org/10.5555/645922.673493
4. Seidl T., Kriegel H.-P. Optimal multi-step k-nearest neighbor search. ACM SIGMOD Record, 1998, vol. 27, no. 2, pp. 154–165. https://doi.org/10.1145/276305.276319
5. Lou Y., Obukhov M. BDT: Gradient Boosted Decision Tables for high accuracy and scoring efficiency. Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 2017, pp. 1893–1901. https://doi.org/10.1145/3097983.3098175
6. Chakrabarti S. Efficient image retrieval using multi neural hash codes and bloom filters. Proc. of the IEEE International Conference for Innovation in Technology (INOCON), 2020, pp. 1–6. https://doi.org/10.1109/INOCON50539.2020.9298228
7. Rokach L., Maimon O. Data Mining with Decision Trees: Theory and Applications. 2nd ed. World Scientific Publishing, 2014. 328 p.
8. Fumero J., Papadimitriou M., Zakkak F.S., Xekalaki M., Clarkson J., Kotselidis C. Dynamic application reconfiguration on heterogeneous hardware. Proc. of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2019, pp. 165–178. https://doi.org/10.1145/3313808.3313819
9. Xiao H., Rasul K., Vollgraf R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017, arXiv:1708.07747. https://doi.org/10.48550/arXiv.1708.07747
10. Aumüller M., Bernhardsson E., Faithfull A. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Lecture Notes in Computer Science, 2017, vol. 10609, pp. 34–49. https://doi.org/10.1007/978-3-319-68474-1_3
11. Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. A method of storing vector data in compressed form using clustering. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 1, pp. 112–117. (in Russian). https://doi.org/10.17586/2226-1494-2024-24-1-112-117
12. Douze M., Guzhva A., Deng C., Johnson J., Szilvasy G., Mazaré P.-E., Lomeli M., Hosseini L., Jégou H. The Faiss library. arXiv, 2024, arXiv:2401.08281. https://doi.org/10.48550/arXiv.2401.08281
13. Malkov Y., Yashunin D. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, vol. 42, no. 4, pp. 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
14. Tomilov N.A., Turov V.P., Babayants A.A. Developing a tool to compare vector search algorithms. Proc. of the 11th Congress of Young Scientists, 2022, vol. 1, pp. 446–450. (in Russian).
15. Laaber C., Leitner P. An evaluation of open-source software microbenchmark suites for continuous performance assessment. Proc. of the 15th International Conference on Mining Software Repositories (MSR '18), 2018, pp. 119–130. https://doi.org/10.1145/3196398.3196407


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
