doi: 10.17586/2226-1494-2024-24-6-962-971
Application of Markov chain Monte Carlo and machine learning for identifying active modules in biological graphs
Article in Russian
For citation:
Usoltsev D.A., Molotkov I.I., Artomov M.N., Sergushichev A.A., Shalyto A.A. Application of Markov chain Monte Carlo and machine learning for identifying active modules in biological graphs. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 6, pp. 962–971 (in Russian). doi: 10.17586/2226-1494-2024-24-6-962-971
Abstract
In biology, information about interactions between the proteins or genes under study can be represented as a biological graph. A connected subgraph whose vertices perform a common biological function is called an active module. Markov chain Monte Carlo (MCMC) is an effective method for identifying active modules in biological graphs. In the context of protein-protein interactions, accurately identifying the active module makes it possible to determine which proteins' functional disruption leads to particular changes (for example, diseases) in a biological system (a cell or an organism). This study demonstrates that combining MCMC with machine learning models that take graph topology into account identifies the active module more accurately. Two networks were used independently, the InWebIM protein-protein interaction graph and the GeneMANIA functional association network, both to train the model and to compare it with the known MCMC-based method. The active module was searched for with a combination of MCMC and gradient boosting (xgboost), a machine learning method. On simulated data, the combination of the MCMC-based method and gradient boosting identified active modules more accurately than the MCMC-based method alone. Improving the accuracy of active module identification is crucial for studying the biological mechanisms of diseases and for discovering individual proteins functionally associated with disease development.
Keywords: graphs, machine learning, protein networks, MCMC, active module
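The MCMC step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph, the per-vertex scores, the target distribution (proportional to the exponent of the summed scores of a connected vertex set), and the function names `is_connected` and `sample_inclusion_frequencies` are all assumptions made for illustration. A Metropolis sampler toggles one vertex at a time, rejects any move that would leave the vertex set empty or disconnected, and reports how often each vertex was part of the sampled module. The gradient boosting stage is not shown here.

```python
import math
import random
from collections import deque


def is_connected(vertices, adj):
    """Return True if the induced subgraph on `vertices` is non-empty and connected."""
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)


def sample_inclusion_frequencies(adj, score, start, n_steps=20000, seed=0):
    """Metropolis sampling over connected vertex sets.

    Assumed target: P(S) proportional to exp(sum of score[v] for v in S).
    The proposal (pick a vertex uniformly, toggle its membership) is symmetric,
    so the acceptance probability is min(1, exp(score change)).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    current = {start}
    counts = {v: 0 for v in nodes}
    for _ in range(n_steps):
        v = rng.choice(nodes)
        if v in current:
            proposal = current - {v}
            delta = -score[v]
        else:
            proposal = current | {v}
            delta = score[v]
        # Moves to an empty or disconnected set are rejected outright.
        if is_connected(proposal, adj) and (delta >= 0 or rng.random() < math.exp(delta)):
            current = proposal
        for u in current:
            counts[u] += 1
    return {v: counts[v] / n_steps for v in nodes}


# Hypothetical toy path graph a-b-c-d-e-f; a, b, c play the role of "active" vertices.
adj = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
score = {"a": 2.0, "b": 2.0, "c": 2.0, "d": -2.0, "e": -2.0, "f": -2.0}

freq = sample_inclusion_frequencies(adj, score, start="a")
for v in sorted(freq):
    print(v, round(freq[v], 3))
```

In a pipeline like the one the abstract describes, per-vertex frequencies of this kind could plausibly serve as one input to a topology-aware model, but which features the published model actually uses is not stated in the abstract.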
References
- Huber W., Carey V.J., Long L., Falcon S., Gentleman R. Graphs in molecular biology. BMC Bioinformatics, 2007, vol. 8, suppl. 6, pp. S8. https://doi.org/10.1186/1471-2105-8-S6-S8
- Szczepanski A.P., Wang L. Emerging multifaceted roles of BAP1 complexes in biological processes. Cell Death Discovery, 2021, vol. 7, no. 1, pp. 20. https://doi.org/10.1038/s41420-021-00406-2
- Carbone M., Yang H., Pass H.I., Krausz T., Testa J.R., Gaudino G. BAP1 and cancer. Nature Reviews Cancer, 2013, vol. 13, no. 3, pp. 153–159. https://doi.org/10.1038/nrc3459
- Lin J.S., Lai E.M. Protein-protein interactions: Co-Immunoprecipitation. Methods in Molecular Biology, 2017, vol. 1615, pp. 211–219. https://doi.org/10.1007/978-1-4939-7033-9_17
- Tamara S., den Boer M.A., Heck A.J.R. High-resolution native mass spectrometry. Chemical Reviews, 2022, vol. 122, no. 8, pp. 7269–7326. https://doi.org/10.1021/acs.chemrev.1c00212
- Okpara M.O., Hermann C., van der Watt P.J., Garnett S., Blackburn J.M., Leaner V.D. A mass spectrometry-based approach for the identification of Kpnβ1 binding partners in cancer cells. Scientific Reports, 2022, vol. 12, no. 1, pp. 20171. https://doi.org/10.1038/s41598-022-24194-6
- Li T., Wernersson R., Hansen R.B., Horn H., Mercer J., Slodkowicz G., Workman C.T., Rigina O., Rapacki K., Stærfeldt H.H., Brunak S., Jensen T.S., Lage K. A scored human protein-protein interaction network to catalyze genomic interpretation. Nature Methods, 2017, vol. 14, no. 1, pp. 61–64. https://doi.org/10.1038/nmeth.4083
- Zhu Q.M., Hsu Y.H., Lassen F.H., MacDonald B.T., Stead S., Malolepsza E., Kim A., Li T., Mizoguchi T., Schenone M., Guzman G., Tanenbaum B., Fornelos N., Carr S.A., Gupta R.M., Ellinor P.T., Lage K. Protein interaction networks in the vasculature prioritize genes and pathways underlying coronary artery disease. Communications Biology, 2024, vol. 7, no. 1, pp. 87. https://doi.org/10.1038/s42003-023-05705-1
- Nehme R., Pietiläinen O., Artomov M., Tegtmeyer M., Valakh V., Lehtonen L., Bell C., Singh T., Trehan A., Sherwood J., Manning D., Peirent E., Malik R., Guss E.J., Hawes D., Beccard A., Bara A.M., Hazelbaker D.Z., Zuccaro E., Genovese G., Loboda A.A., Neumann A., Lilliehook C., Kuismin O., Hamalainen E., Kurki M., Hultman C.M., Kähler A.K., Paulo J.A., Ganna A., Madison J., Cohen B., McPhie D., Adolfsson R., Perlis R., Dolmetsch R., Farhi S., McCarroll S., Hyman S., Neale B., Barrett L.E., Harper W., Palotie A., Daly M., Eggan K. The 22q11.2 region regulates presynaptic gene-products linked to schizophrenia. Nature Communications, 2022, vol. 13, no. 1, pp. 3690. https://doi.org/10.1038/s41467-022-31436-8
- Nguyen H., Shrestha S., Tran D., Shafi A., Draghici S., Nguyen T. A comprehensive survey of tools and software for active subnetwork identification. Frontiers in Genetics, 2019, vol. 10, pp. 155. https://doi.org/10.3389/fgene.2019.00155
- Mitra K., Carvunis A.R., Ramesh S.K., Ideker T. Integrative approaches for finding modular structure in biological networks. Nature Reviews Genetics, 2013, vol. 14, no. 10, pp. 719–732. https://doi.org/10.1038/nrg3552
- Strauss B.S. Biochemical genetics and molecular biology: The contributions of George Beadle and Edward Tatum. Genetics, 2016, vol. 203, no. 1, pp. 13–20. https://doi.org/10.1534/genetics.116.188995
- Montecino-Rodriguez E., Casero D., Fice M., Le J., Dorshkind K. Differential expression of PU.1 and key T lineage transcription factors distinguishes fetal and adult T cell development. Journal of Immunology, 2018, vol. 200, no. 6, pp. 2046–2056. https://doi.org/10.4049/jimmunol.1701336
- Suzuki K., Hatzikotoulas K., Southam L., Taylor H.J., Yin X., Lorenz K.M. et al. Genetic drivers of heterogeneity in type 2 diabetes pathophysiology. Nature, 2024, vol. 627, pp. 347–357. https://doi.org/10.1038/s41586-024-07019-6
- Kim T.K., Park J.H. More about the basic assumptions of t-test: normality and sample size. Korean Journal of Anesthesiology, 2019, vol. 72, no. 4, pp. 331–335. https://doi.org/10.4097/kja.d.18.00292
- Barton S.J., Crozier S.R., Lillycrop K.A., Godfrey K.M., Inskip H.M. Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions. BMC Genomics, 2013, vol. 14, pp. 161. https://doi.org/10.1186/1471-2164-14-161
- Alexeev N., Isomurodov J., Sukhov V., Korotkevich G., Sergushichev A. Markov chain Monte Carlo for active module identification problem. BMC Bioinformatics, 2020, vol. 21, suppl. 6, pp. 261. https://doi.org/10.1186/s12859-020-03572-9
- Dittrich M.T., Klau G.W., Rosenwald A., Dandekar T., Müller T. Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics, 2008, vol. 24, no. 13, pp. i223–i231. https://doi.org/10.1093/bioinformatics/btn161
- Zhu Z., Zhang F., Hu H., Bakshi A., Robinson M.R., Powell J.E., Montgomery G.W., Goddard M.E., Wray N.R., Visscher P.M., Yang J. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nature Genetics, 2016, vol. 48, no. 5, pp. 481–487. https://doi.org/10.1038/ng.3538
- Chen T., Guestrin C. XGBoost: A scalable tree boosting system. Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794. https://doi.org/10.1145/2939672.2939785
- Warde-Farley D., Donaldson S.L., Comes O., Zuberi K., Badrawi R., Chao P., Franz M., Grouios C., Kazi F., Lopes C.T., Maitland A., Mostafavi S., Montojo J., Shao Q., Wright G., Bader G.D., Morris Q. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research, 2010, vol. 38, suppl. 2, pp. W214–W220. https://doi.org/10.1093/nar/gkq537