Application of Markov chain Monte Carlo and machine learning for identifying active modules in biological graphs

Dmitrii A. Usoltsev, Ivan I. Molotkov, Mykyta N. Artomov, Alexey A. Sergushichev, Anatoly A. Shalyto

2024, Volume 24, Number 6 (November-December)

ISSN 2226-1494 (print), ISSN 2500-0373 (online)

doi: 10.17586/2226-1494-2024-24-6-962-971

Article in Russian

For citation:

Usoltsev D.A., Molotkov I.I., Artomov M.N., Sergushichev A.A., Shalyto A.A. Application of Markov chain Monte Carlo and machine learning for identifying active modules in biological graphs. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 6, pp. 962–971 (in Russian). doi: 10.17586/2226-1494-2024-24-6-962-971

Abstract

In biology, interactions between the proteins or genes under study can be represented as a biological graph. A connected subgraph whose vertices perform a common biological function is called an active module. Markov chain Monte Carlo (MCMC) is an effective method for identifying active modules in biological graphs. In the context of protein-protein interactions, accurately identifying the active module makes it possible to determine which proteins' functional disruption leads to particular changes (e.g., diseases) in a biological system (cell or organism). This study demonstrates that combining MCMC with machine learning models that take graph topology into account identifies the active module more accurately. A protein-protein interaction graph (InWebIM) and the GeneMANIA functional association network were each used independently to train the model and to compare it with the known MCMC-based method. The active module was searched for with a combination of MCMC and gradient boosting (XGBoost). On simulated data, this combination improves the accuracy of active module identification compared with the MCMC-based method alone. More accurate active module identification is important for studying the biological mechanisms of diseases and for discovering individual proteins functionally associated with disease development.

Keywords: graphs, machine learning, protein networks, MCMC, active module
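
The MCMC part of the approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each vertex carries a numeric score and that a connected subgraph is sampled with probability proportional to the exponential of its summed vertex scores; the abstract does not state the target distribution, so this is an illustrative choice. The function names `is_connected` and `sample_modules`, the toy graph, and the scores are hypothetical. The gradient boosting step is omitted because the abstract does not specify the features used.

```python
import math
import random
from collections import deque


def is_connected(vertices, adjacency):
    """Return True if `vertices` is non-empty and induces a connected subgraph."""
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)


def sample_modules(adjacency, scores, start, n_steps=20000, seed=0):
    """Metropolis sampler over connected subgraphs.

    Assumed target: P(module) proportional to exp(sum of vertex scores).
    Returns the fraction of steps in which each vertex was in the module.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    module = {start}
    counts = dict.fromkeys(nodes, 0)
    for _ in range(n_steps):
        # Symmetric proposal: pick any vertex uniformly and toggle it.
        v = rng.choice(nodes)
        proposal = module ^ {v}
        # Proposals that are empty or disconnected are rejected outright.
        if is_connected(proposal, adjacency):
            delta = scores[v] if v not in module else -scores[v]
            if delta >= 0 or rng.random() < math.exp(delta):
                module = proposal
        for u in module:
            counts[u] += 1
    return {u: counts[u] / n_steps for u in nodes}


# Hypothetical toy graph: a path a-b-c-d-e in which a, b, c score high.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
scores = {"a": 2.0, "b": 2.0, "c": 2.0, "d": -2.0, "e": -2.0}

freq = sample_modules(adjacency, scores, start="c")
for vertex in sorted(freq):
    print(vertex, round(freq[vertex], 3))
```

In this toy run the high-scoring vertices a, b, and c are included far more often than d and e, so ranking vertices by inclusion frequency recovers the planted module. Choosing the toggled vertex uniformly from all vertices keeps the proposal symmetric, which allows the plain Metropolis acceptance rule, but it wastes many steps on large graphs; a practical sampler would restrict proposals to the module and its boundary and correct the acceptance ratio accordingly.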

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License