doi: 10.17586/2226-1494-2025-25-1-61-67


Efficient sparse retrieval through embedding-based inverted index construction 

V. Y. Dobrynin, R. K. Abramovich, A. V. Platonov


Article in English

For citation:
Dobrynin V.Yu., Abramovich R.K., Platonov A.V. Efficient sparse retrieval through embedding-based inverted index construction. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 1, pp. 61–67. doi: 10.17586/2226-1494-2025-25-1-61-67


Abstract
Modern search engines use a two-stage architecture for efficient, high-quality search over large volumes of data. In the first stage, simple and fast algorithms such as BM25 are applied, while in the second stage more precise but resource-intensive methods, such as deep neural networks, are employed. Although this approach yields good results, its quality is fundamentally limited by the vocabulary mismatch problem inherent in the simple first-stage algorithms. To address this issue, we propose an algorithm for constructing an inverted index from vector representations that combines the advantages of both stages: the efficiency of the inverted index and the high search quality of vector models. We suggest creating a vector index that preserves the various semantic meanings of vocabulary tokens. For each token, we identify the documents in which it occurs and then cluster its contextualized embeddings. The centroids of the resulting clusters represent the different semantic meanings of the token. This process forms an extended vocabulary, which is used to build the inverted index. During index construction, similarity scores between each semantic meaning of a token and the documents are calculated; these scores are then used during search. This approach reduces the number of computations required for similarity estimation at query time. Searching the inverted index first requires finding keys in the vector index, which helps solve the vocabulary mismatch problem. The operation of the algorithm is demonstrated on a search task over the SciFact dataset. The proposed method achieves high search quality with low memory requirements, maintaining a compact vector index whose size remains constant and depends only on the size of the vocabulary. The main drawback of the algorithm is the need to run a deep neural network to generate vector representations of queries during search, which slows down this stage. Finding ways to address this issue and accelerate the search process is a direction for future research.
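As an illustration of the index-construction stage described above, the following is a minimal sketch in Python. It assumes a BERT-style encoder from the Hugging Face transformers library [4] and k-means++ clustering from scikit-learn [13]; the model name, the fixed cap on senses per token, the dot-product similarity, and the per-document max aggregation are illustrative assumptions, not details taken from the paper.

```python
# Sketch of embedding-based inverted index construction, following the
# abstract. Assumptions (not from the paper): bert-base-uncased as encoder,
# at most n_senses clusters per token, dot product as the similarity score.
import numpy as np
import torch
from collections import defaultdict
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # hypothetical choice of encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def contextual_embeddings(text):
    """Yield (token_id, contextualized embedding) pairs for one text."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    for tok_id, vec in zip(enc["input_ids"][0].tolist(), hidden):
        yield tok_id, vec.numpy()

def build_index(corpus, n_senses=4):
    """Cluster each token's contextualized embeddings; the cluster centroids
    become the 'sense' keys of the extended vocabulary, and precomputed
    similarity scores form the posting lists."""
    token_vecs = defaultdict(list)   # token_id -> [embedding, ...]
    token_docs = defaultdict(list)   # token_id -> [doc_id, ...] (same order)
    for doc_id, text in corpus.items():
        for tok_id, vec in contextual_embeddings(text):
            token_vecs[tok_id].append(vec)
            token_docs[tok_id].append(doc_id)

    centroids = {}                   # (token_id, sense) -> centroid vector
    postings = defaultdict(dict)     # (token_id, sense) -> {doc_id: score}
    for tok_id, vecs in token_vecs.items():
        X = np.stack(vecs)
        k = min(n_senses, len(vecs))
        km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(X)
        for sense, c in enumerate(km.cluster_centers_):
            centroids[(tok_id, sense)] = c
        for row, (doc_id, label) in enumerate(zip(token_docs[tok_id], km.labels_)):
            key = (tok_id, int(label))
            score = float(X[row] @ centroids[key])  # sense-to-occurrence similarity
            # Keep the best occurrence score per document (one of several
            # plausible aggregation choices).
            postings[key][doc_id] = max(postings[key].get(doc_id, 0.0), score)
    return centroids, postings
```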

Keywords: inverted index, vocabulary mismatch problem, neural networks, vector representations, clustering
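Continuing the sketch above, query-time retrieval could look as follows: the query is encoded with the same network (the costly step the abstract identifies as the main drawback), each query-token embedding is matched against its nearest sense keys, and the precomputed posting scores are merged. The brute-force key search, the n_probe parameter, and the multiplicative score combination are simplifications; a production system would likely use an approximate nearest-neighbor index such as HNSW or FAISS [7, 12] over the centroids.

```python
def search(query, centroids, postings, top_k=10, n_probe=3):
    """Retrieve documents for a query using the structures built above.
    n_probe nearest sense keys are probed per query token (an assumed
    parameter; the paper does not specify it)."""
    keys = list(centroids)
    key_matrix = np.stack([centroids[k] for k in keys])  # (n_keys, dim)

    doc_scores = defaultdict(float)
    for _, q_vec in contextual_embeddings(query):
        sims = key_matrix @ q_vec                  # similarity to every sense key
        for idx in np.argsort(-sims)[:n_probe]:    # nearest sense keys
            for doc_id, score in postings[keys[idx]].items():
                doc_scores[doc_id] += float(sims[idx]) * score
    return sorted(doc_scores.items(), key=lambda x: -x[1])[:top_k]

# Example usage on a toy corpus:
# centroids, postings = build_index({"d1": "the bank of the river",
#                                    "d2": "the bank approved the loan"})
# print(search("river bank erosion", centroids, postings))
```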

References
  1. Khattab O., Zaharia M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proc. of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48. https://doi.org/10.1145/3397271.3401075
  2. Zamani H., Dehghani M., Croft W.B., Learned-Miller E., Kamps J. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. Proc. of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 497–506. https://doi.org/10.1145/3269206.3271800 
  3. Bai Y., Li X., Wang G., Zhang C., Shang L., Xu J., Wang Z., Wang F., Liu Q. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv, 2020, arXiv:2010.00768. https://doi.org/10.48550/arXiv.2010.00768
  4. Devlin J., Chang M.W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018, arXiv:1810.04805v2. https://doi.org/10.48550/arXiv.1810.04805 
  5. Formal T., Piwowarski B., Clinchant S. SPLADE: Sparse lexical and expansion model for first stage ranking. Proc. of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2288–2292. https://doi.org/10.1145/3404835.3463098 
  6. Santhanam K., Khattab O., Saad-Falcon J., Potts C., Zaharia M. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. Proc. of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 3715–3734. https://doi.org/10.18653/v1/2022.naacl-main.272 
  7. Johnson J., Douze M., Jégou H. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2021, vol. 7, no. 3, pp. 535–547. https://doi.org/10.1109/tbdata.2019.2921572 
  8. Jégou H., Douze M., Schmid C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, vol. 33, no. 1, pp. 117–128. https://doi.org/10.1109/tpami.2010.57
  9. Kong W., Dudek J.M., Li C., Zhang M., Bendersky M. SparseEmbed: Learning sparse lexical representations with contextual embeddings for retrieval. Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 2399–2403. https://doi.org/10.1145/3539618.3592065
  10. Zhao T., Lu X., Lee K. SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval. Proc. of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 565–575. https://doi.org/10.18653/v1/2021.naacl-main.47
  11. Dobrynin V., Abramovich R., Platonov A. Building a full-text search index using "Transformer" neural network. Proc. of the 2023 IEEE 17th International Conference on Application of Information and Communication Technologies (AICT), 2023, pp. 1–5. https://doi.org/10.1109/aict59525.2023.10313200
  12. Malkov Y.A., Yashunin D.A. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, vol. 42, no. 4, pp. 824–836. https://doi.org/10.1109/tpami.2018.2889473
  13. Arthur D., Vassilvitskii S. k-means++: the advantages of careful seeding. Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
  14. Bajaj P., Campos D., Craswell N., Deng L., Gao J., Liu X., Majumder R., McNamara A., Mitra B., Nguyen T., Rosenberg M., Song X., Stoica A., Tiwary S., Wang T. MS MARCO: a human generated MAchine Reading COmprehension dataset. arXiv, 2016, arXiv:1611.09268. https://doi.org/10.48550/arXiv.1611.09268
  15. Thakur N., Reimers N., Rücklé A., Srivastava A., Gurevych I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv, 2021, arXiv:2104.08663. https://doi.org/10.48550/arXiv.2104.08663



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
