doi: 10.17586/2226-1494-2025-25-1-61-67
Efficient sparse retrieval through embedding-based inverted index construction
Article in English
For citation:
Dobrynin V.Yu., Abramovich R.K., Platonov A.V. Efficient sparse retrieval through embedding-based inverted index construction. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 1, pp. 61–67. doi: 10.17586/2226-1494-2025-25-1-61-67
Abstract
Modern search engines use a two-stage architecture for efficient, high-quality search over large volumes of data. In the first stage, simple and fast algorithms such as BM25 are applied, while in the second stage more precise but resource-intensive methods, such as deep neural networks, are employed. Although this approach yields good results, its quality is fundamentally limited by the vocabulary mismatch problem inherent in the simple first-stage algorithms. To address this issue, we propose an algorithm for constructing an inverted index from vector representations that combines the advantages of both stages: the efficiency of the inverted index and the high search quality of vector models. We build a vector index that preserves the various semantic meanings of vocabulary tokens. For each token, we identify the documents in which it occurs and then cluster its contextualized embeddings; the centroids of the resulting clusters represent the token's distinct semantic meanings. This process forms an extended vocabulary, which is used to build the inverted index. During index construction, similarity scores between each semantic meaning of a token and the documents are calculated and then reused during search, reducing the number of similarity computations required in real time. Searching the inverted index first requires finding keys in the vector index, which helps solve the vocabulary mismatch problem. The operation of the algorithm is demonstrated on a search task over the SciFact dataset. The proposed method achieves high search quality with low memory requirements, maintaining a compact vector index whose size remains constant and depends only on the size of the vocabulary. The main drawback of the algorithm is the need to run a deep neural network to generate vector representations of queries at search time, which slows down this stage. Finding ways to address this issue and accelerate the search process is a direction for future research.
Keywords: inverted index, vocabulary mismatch problem, neural networks, vector representations, clustering
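
The following is a minimal, self-contained Python sketch of the pipeline the abstract describes. It is illustrative rather than the authors' implementation: the toy embed_tokens encoder (a stand-in for a contextual model such as BERT), the fixed number of senses per token, dot-product scoring, and the additive query-scoring rule are all assumptions made for the sake of a runnable example.

```python
# Sketch of an embedding-based inverted index: per-token sense centroids
# obtained by clustering contextualized embeddings, plus precomputed
# (document, score) posting lists keyed by (token, sense).
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

def embed_tokens(doc_id, text):
    """Placeholder for a contextual encoder (e.g. BERT); returns
    (token, vector) pairs for one document. Purely illustrative:
    vectors are random but deterministic per (doc_id, text)."""
    rng = np.random.default_rng(hash((doc_id, text)) % 2**32)
    return [(tok, rng.standard_normal(8)) for tok in text.lower().split()]

def build_index(corpus, n_senses=2):
    # 1. Collect the contextualized embedding of every token occurrence.
    occurrences = defaultdict(list)           # token -> [(doc_id, vector)]
    for doc_id, text in corpus.items():
        for tok, vec in embed_tokens(doc_id, text):
            occurrences[tok].append((doc_id, vec))

    centroids = {}                            # (token, sense) -> centroid
    inverted = defaultdict(list)              # (token, sense) -> [(doc_id, score)]
    for tok, occ in occurrences.items():
        vecs = np.stack([v for _, v in occ])
        k = min(n_senses, len(occ))
        km = KMeans(n_clusters=k, n_init=10).fit(vecs)
        # 2. Each centroid stands for one semantic meaning of the token;
        #    together the (token, sense) keys form the extended vocabulary.
        for sense in range(k):
            centroids[(tok, sense)] = km.cluster_centers_[sense]
        # 3. Precompute a similarity score between each sense and the
        #    documents containing it (dot product here, an assumption).
        for (doc_id, vec), sense in zip(occ, km.labels_):
            score = float(vec @ km.cluster_centers_[sense])
            inverted[(tok, sense)].append((doc_id, score))
    return centroids, inverted

def search(query, centroids, inverted, top_k=5):
    # Query tokens are embedded with the same encoder, matched to the
    # nearest sense centroid (the vector-index lookup), and documents
    # are ranked by summing the precomputed posting-list scores.
    scores = defaultdict(float)
    for tok, qvec in embed_tokens("query", query):
        candidates = [key for key in centroids if key[0] == tok]
        if not candidates:
            continue
        best = max(candidates, key=lambda key: float(qvec @ centroids[key]))
        for doc_id, score in inverted[best]:
            scores[doc_id] += score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

In a production setting the centroid lookup would go through an approximate nearest-neighbor structure (e.g. HNSW or a FAISS index) rather than the linear scan above, and the vector index stays compact because only one set of sense centroids per vocabulary token is stored, independent of corpus size.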