doi: 10.17586/2226-1494-2025-25-1-61-67


Efficient sparse retrieval through embedding-based inverted index construction 

V. Y. Dobrynin, R. K. Abramovich, A. V. Platonov


Article in English

For citation:
Dobrynin V.Yu., Abramovich R.K., Platonov A.V. Efficient sparse retrieval through embedding-based inverted index construction. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 1, pp. 61–67. doi: 10.17586/2226-1494-2025-25-1-61-67


Abstract
Modern search engines use a two-stage architecture for efficient, high-quality search over large volumes of data. In the first stage, simple and fast algorithms such as BM25 are applied, while in the second stage more precise but resource-intensive methods, such as deep neural networks, are employed. Although this approach yields good results, its quality is fundamentally limited by the vocabulary mismatch problem inherent in the simple first-stage algorithms. To address this issue, we propose an algorithm for constructing an inverted index from vector representations that combines the advantages of both stages: the efficiency of the inverted index and the high search quality of vector models. We suggest creating a vector index that preserves the various semantic meanings of vocabulary tokens. For each token, we identify the documents in which it occurs and then cluster its contextualized embeddings. The centroids of the resulting clusters represent the different semantic meanings of the token. This process forms an extended vocabulary, which is used to build the inverted index. During index construction, similarity scores between each semantic meaning of a token and the documents are calculated; these scores are then used during search. This approach reduces the number of computations required for similarity estimation at query time. Searching the inverted index first requires finding keys in the vector index, which helps solve the vocabulary mismatch problem. The operation of the algorithm is demonstrated on a search task over the SciFact dataset. The proposed method achieves high search quality with low memory requirements, maintaining a compact vector index whose size remains constant and depends only on the size of the vocabulary. The main drawback of the algorithm is the need to run a deep neural network to generate vector representations of queries during search, which slows down this stage. Finding ways to address this issue and accelerate the search process is a direction for future research.
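As an illustration of the index-construction stage described above, the following is a minimal sketch in Python. It assumes a BERT-style encoder from the Hugging Face transformers library [4] and k-means++ clustering from scikit-learn [13]; the model name, the fixed cap on senses per token, the dot-product similarity, and the per-document max aggregation are illustrative assumptions, not details taken from the paper.

```python
# Sketch of embedding-based inverted index construction, following the
# abstract. Assumptions (not from the paper): bert-base-uncased as encoder,
# at most n_senses clusters per token, dot product as the similarity score.
import numpy as np
import torch
from collections import defaultdict
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # hypothetical choice of encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def contextual_embeddings(text):
    """Yield (token_id, contextualized embedding) pairs for one text."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    for tok_id, vec in zip(enc["input_ids"][0].tolist(), hidden):
        yield tok_id, vec.numpy()

def build_index(corpus, n_senses=4):
    """Cluster each token's contextualized embeddings; the cluster centroids
    become the 'sense' keys of the extended vocabulary, and precomputed
    similarity scores form the posting lists."""
    token_vecs = defaultdict(list)   # token_id -> [embedding, ...]
    token_docs = defaultdict(list)   # token_id -> [doc_id, ...] (same order)
    for doc_id, text in corpus.items():
        for tok_id, vec in contextual_embeddings(text):
            token_vecs[tok_id].append(vec)
            token_docs[tok_id].append(doc_id)

    centroids = {}                   # (token_id, sense) -> centroid vector
    postings = defaultdict(dict)     # (token_id, sense) -> {doc_id: score}
    for tok_id, vecs in token_vecs.items():
        X = np.stack(vecs)
        k = min(n_senses, len(vecs))
        km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(X)
        for sense, c in enumerate(km.cluster_centers_):
            centroids[(tok_id, sense)] = c
        for row, (doc_id, label) in enumerate(zip(token_docs[tok_id], km.labels_)):
            key = (tok_id, int(label))
            score = float(X[row] @ centroids[key])  # sense-to-occurrence similarity
            # Keep the best occurrence score per document (one of several
            # plausible aggregation choices).
            postings[key][doc_id] = max(postings[key].get(doc_id, 0.0), score)
    return centroids, postings
```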

Keywords: inverted index, vocabulary mismatch problem, neural networks, vector representations, clustering
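Continuing the sketch above, query-time retrieval could look as follows: the query is encoded with the same network (the costly step the abstract identifies as the main drawback), each query-token embedding is matched against its nearest sense keys, and the precomputed posting scores are merged. The brute-force key search, the n_probe parameter, and the multiplicative score combination are simplifications; a production system would likely use an approximate nearest-neighbor index such as HNSW or FAISS [7, 12] over the centroids.

```python
def search(query, centroids, postings, top_k=10, n_probe=3):
    """Retrieve documents for a query using the structures built above.
    n_probe nearest sense keys are probed per query token (an assumed
    parameter; the paper does not specify it)."""
    keys = list(centroids)
    key_matrix = np.stack([centroids[k] for k in keys])  # (n_keys, dim)

    doc_scores = defaultdict(float)
    for _, q_vec in contextual_embeddings(query):
        sims = key_matrix @ q_vec                  # similarity to every sense key
        for idx in np.argsort(-sims)[:n_probe]:    # nearest sense keys
            for doc_id, score in postings[keys[idx]].items():
                doc_scores[doc_id] += float(sims[idx]) * score
    return sorted(doc_scores.items(), key=lambda x: -x[1])[:top_k]

# Example usage on a toy corpus:
# centroids, postings = build_index({"d1": "the bank of the river",
#                                    "d2": "the bank approved the loan"})
# print(search("river bank erosion", centroids, postings))
```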

References
  1. Khattab O., Zaharia M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proc. of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48. https://doi.org/10.1145/3397271.3401075
  2. Zamani H., Dehghani M., Croft W.B., Learned-Miller E., Kamps J. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. Proc. of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 497–506. https://doi.org/10.1145/3269206.3271800 
  3. Bai Y., Li X., Wang G., Zhang C., Shang L., Xu J., Wang Z., Wang F., Liu Q. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv, 2020, arXiv:2010.00768. https://doi.org/10.48550/arXiv.2010.00768
  4. Devlin J., Chang M.W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018, arXiv:1810.04805v2. https://doi.org/10.48550/arXiv.1810.04805 
  5. Formal T., Piwowarski B., Clinchant S. SPLADE: Sparse lexical and expansion model for first stage ranking. Proc. of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2288–2292. https://doi.org/10.1145/3404835.3463098 
  6. Santhanam K., Khattab O., Saad-Falcon J., Potts C., Zaharia M. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. Proc. of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 3715–3734. https://doi.org/10.18653/v1/2022.naacl-main.272 
  7. Johnson J., Douze M., Jégou H. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2021, vol. 7, no. 3, pp. 535–547. https://doi.org/10.1109/tbdata.2019.2921572 
  8. Jégou H., Douze M., Schmid C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, vol. 33, no. 1, pp. 117–128. https://doi.org/10.1109/tpami.2010.57
  9. Kong W., Dudek J.M., Li C., Zhang M., Bendersky M. SparseEmbed: Learning sparse lexical representations with contextual embeddings for retrieval. Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 2399–2403. https://doi.org/10.1145/3539618.3592065
  10. Zhao T., Lu X., Lee K. SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval. Proc. of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 565–575. https://doi.org/10.18653/v1/2021.naacl-main.47
  11. Dobrynin V., Abramovich R., Platonov A. Building a full-text search index using "Transformer" neural network. Proc. of the 2023 IEEE 17th International Conference on Application of Information and Communication Technologies (AICT), 2023, pp. 1–5. https://doi.org/10.1109/aict59525.2023.10313200
  12. Malkov Y.A., Yashunin D.A. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, vol. 42, no. 4, pp. 824–836. https://doi.org/10.1109/tpami.2018.2889473
  13. Arthur D., Vassilvitskii S. k-means++: the advantages of careful seeding. Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
  14. Bajaj P., Campos D., Craswell N., Deng L., Gao J., Liu X., Majumder R., McNamara A., Mitra B., Nguyen T., Rosenberg M., Song X., Stoica A., Tiwary S., Wang T. MS MARCO: a human generated MAchine Reading COmprehension dataset. arXiv, 2016, arXiv:1611.09268. https://doi.org/10.48550/arXiv.1611.09268
  15. Thakur N., Reimers N., Rücklé A., Srivastava A., Gurevych I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv, 2021, arXiv:2104.08663. https://doi.org/10.48550/arXiv.2104.08663



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
