Charron-Lamoureux, V., Mannochio-Russo, H., Lamichhane, S., Xing, S., Patan, A., Portal Gomes, P. W., Rajkumar, P., Deleray, V., Caraballo-Rodríguez, A. M., Chua, K. V., Lee, L. S., Liu, Z., Ching, J., Wang, M., & Dorrestein, P. C. (2025). A guide to reverse metabolomics—a framework for big data discovery strategy. Nature Protocols. https://doi.org/10.1038/s41596-024-01136-2
@article{CharronLamoureux2025,
author = {Charron-Lamoureux, Vincent and Mannochio-Russo, Helena and Lamichhane, Santosh and Xing, Shipei and Patan, Abubaker and Portal Gomes, Paulo Wender and Rajkumar, Prajit and Deleray, Victoria and Caraballo-Rodr{\'i}guez, Andr{\'e}s Mauricio and Chua, Kee Voon and Lee, Lye Siang and Liu, Zhao and Ching, Jianhong and Wang, Mingxun and Dorrestein, Pieter C.},
title = {A guide to reverse metabolomics---a framework for big data discovery strategy},
journal = {Nature Protocols},
year = {2025},
month = feb,
day = {28},
issn = {1750-2799},
doi = {10.1038/s41596-024-01136-2},
url = {https://doi.org/10.1038/s41596-024-01136-2}
}
Untargeted metabolomics is evolving into a field of big data science. There is a growing interest within the metabolomics community in mining tandem mass spectrometry (MS/MS)-based data from public repositories. In traditional untargeted metabolomics, samples to address a predefined question are collected and liquid chromatography with MS/MS data are generated. We then identify metabolites associated with a phenotype (for example, disease versus healthy) and elucidate or validate their structural details (for example, molecular formula, structural classification, substructure or complete structural annotation or identification). In reverse metabolomics, we start with MS/MS spectra for known or unknown molecules. These spectra are used as search terms to query public data repositories and discover phenotype-relevant information such as organ/biofluid distribution, disease condition, intervention status (for example, pre- and postintervention), organisms (for example, mammals versus others), geography and any other biologically relevant associations. Here we guide the reader through a four-part process: (1) obtaining the MS/MS spectra of interest (Universal Spectrum Identifier), (2) running Mass Spectrometry Search Tool searches to find the files in available databases that contain those MS/MS spectra, (3) using the Reanalysis Data User Interface framework to link the files with their metadata and (4) validating the observations. Parts 1–3 could take from hours to days depending on the method used for collecting MS/MS spectra. As an example, we use MS/MS spectra from three small molecules: phenylalanine-cholic acid (a microbially conjugated bile acid), phenylalanine-C4:0 and histidine-C4:0 (two N-acyl amides). We leverage the Global Natural Products Social Molecular Networking-based framework to explore the microbial producers of these molecules and their associations with health conditions and organ distributions in humans and rodents.
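Part 3 of the process described in this abstract, linking matched files to their metadata, amounts to a lookup followed by a tally. The following is a minimal sketch of that idea only, not the protocol's actual tooling: the file names, the metadata fields (`body_part`, `disease`) and the function `summarize_matches` are all hypothetical, and real search results and metadata tables would come from the services named in the abstract.

```python
from collections import Counter

def summarize_matches(matched_files, metadata, field):
    """Count how often each value of `field` occurs among the matched files.

    matched_files: file names returned by a spectrum search (hypothetical).
    metadata: dict mapping file name -> dict of sample annotations (hypothetical).
    """
    counts = Counter()
    for name in matched_files:
        record = metadata.get(name)
        if record is None:
            # File was matched by the search but has no curated metadata.
            counts["no metadata"] += 1
        else:
            counts[record.get(field, "unknown")] += 1
    return counts

# Toy inputs, invented for illustration only.
matched_files = ["a.mzML", "b.mzML", "c.mzML", "d.mzML"]
metadata = {
    "a.mzML": {"body_part": "feces", "disease": "healthy"},
    "b.mzML": {"body_part": "feces", "disease": "diabetes"},
    "c.mzML": {"body_part": "blood"},
}

by_body_part = summarize_matches(matched_files, metadata, "body_part")
by_disease = summarize_matches(matched_files, metadata, "disease")
print(dict(by_body_part))
print(dict(by_disease))
```

Keeping unmatched files in a separate "no metadata" bucket, rather than dropping them, makes it visible how much of the search result cannot be interpreted, which matters for the validation step that follows.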
Krutkin, D. D., Thomas, S., Zuffa, S., Rajkumar, P., Knight, R., Dorrestein, P. C., & Kelley, S. T. (2025). To Impute or Not To Impute in Untargeted Metabolomics—That is the Compositional Question. Journal of the American Society for Mass Spectrometry. https://doi.org/10.1021/jasms.4c00434
@article{Krutkin2025,
author = {Krutkin, Dennis D. and Thomas, Sydney and Zuffa, Simone and Rajkumar, Prajit and Knight, Rob and Dorrestein, Pieter C. and Kelley, Scott T.},
title = {To Impute or Not To Impute in Untargeted Metabolomics---That is the Compositional Question},
journal = {Journal of the American Society for Mass Spectrometry},
year = {2025},
month = feb,
day = {25},
publisher = {American Chemical Society},
issn = {1044-0305},
doi = {10.1021/jasms.4c00434},
url = {https://doi.org/10.1021/jasms.4c00434}
}
Untargeted metabolomics often produces large datasets with missing values. These missing values are derived from biological or technical factors and can undermine statistical analyses and lead to biased biological interpretations. Imputation methods, such as k-Nearest Neighbors (kNN) and Random Forest (RF) regression, are commonly used, but their effects vary depending on the type of missing data, e.g., Missing Completely At Random (MCAR) and Missing Not At Random (MNAR). Here, we determined the impacts of degree and type of missing data on the accuracy of kNN and RF imputation using two datasets: a targeted metabolomic dataset with spiked-in standards and an untargeted metabolomic dataset. We also assessed the effect of compositional data approaches (CoDA), such as the centered log-ratio (CLR) transform, on data interpretation since these methods are increasingly being used in metabolomics. Overall, we found that kNN and RF performed more accurately when the proportion of missing data across samples for a metabolic feature was low. However, these imputations could not handle MNAR data and generated wildly inflated values or imputed values where none should exist. Furthermore, we show that the proportion of missing values had a strong impact on the accuracy of imputation, which affected the interpretation of the results. Our results suggest imputation should be used with extreme caution even with modest levels of missing data and especially when the type of missingness is unknown.
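The two missingness types and the CLR transform named in this abstract can be illustrated with a short sketch. This is not the authors' code: the function names, the detection-limit rule used to mimic MNAR and the toy intensities are assumptions made for illustration. The CLR itself follows its standard definition, the log of each value minus the mean of the logs within a sample.

```python
import math
import random

def clr(values):
    """Centered log-ratio: log of each value minus the mean log of the sample."""
    if any(v <= 0 for v in values):
        # log is undefined for zeros, which is why missing values must be
        # handled (imputed or replaced) before a CLR transform.
        raise ValueError("CLR requires strictly positive values")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

def mask_mcar(values, fraction, seed=0):
    """MCAR: each value goes missing with the same probability, whatever its size."""
    rng = random.Random(seed)
    return [None if rng.random() < fraction else v for v in values]

def mask_mnar(values, detection_limit):
    """MNAR (one common form): values below a detection limit go missing."""
    return [v if v >= detection_limit else None for v in values]

intensities = [100.0, 10.0, 1.0]       # toy feature intensities for one sample
transformed = clr(intensities)
print(transformed)                     # components sum to zero
print(mask_mnar(intensities, 5.0))     # only the lowest value is removed
print(mask_mcar(intensities, 0.5))     # removal does not depend on intensity
```

Under the MNAR rule the missing entries are systematically the low ones, so an imputer that borrows from observed neighbours will tend to fill them with values that are too high, which is consistent with the inflation the abstract reports.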
preprints
Zhao, H. N., Kvitne, K. E., Brungs, C., Mohan, S., Charron-Lamoureux, V., Bittremieux, W., Tang, R., Schmid, R., Lamichhane, S., El Abiead, Y., Andalibi, M. S., Mannochio-Russo, H., Ambre, M., Avalon, N. E., Bryant, M. K., Caraballo-Rodríguez, A. M., Maya, M. C., Chin, L., Ellis, R. J., … Dorrestein, P. C. (2024). Empirically establishing drug exposure records directly from untargeted metabolomics data. BioRxiv. https://www.biorxiv.org/content/early/2024/10/26/2024.10.07.617109
@article{Zhao2024,
author = {Zhao, Haoqi Nina and Kvitne, Kine Eide and Brungs, Corinna and Mohan, Siddharth and Charron-Lamoureux, Vincent and Bittremieux, Wout and Tang, Runbang and Schmid, Robin and Lamichhane, Santosh and El Abiead, Yasin and Andalibi, Mohammadsobhan S. and Mannochio-Russo, Helena and Ambre, Madison and Avalon, Nicole E. and Bryant, MacKenzie and Caraballo-Rodr{\'\i}guez, Andr{\'e}s Mauricio and Maya, Martin Casas and Chin, Loryn and Ellis, Ronald J. and Franklin, Donald and Girod, Sagan and Gomes, Paulo Wender P and Hansen, Lauren and Heaton, Robert and Iudicello, Jennifer E. and Jarmusch, Alan K. and Khatib, Lora and Letendre, Scott and Magyari, Sarolt and McDonald, Daniel and Mohanty, Ipsita and Cumsille, Andr{\'e}s and Moore, David J. and Rajkumar, Prajit and Ross, Dylan H. and Sapre, Harshada and Shahneh, Mohammad Reza Zare and Thomas, Sydney P. and Tribelhorn, Caitlin and Tubb, Helena M. and Walker, Corinn and Wang, Crystal X. and Xing, Shipei and Zemlin, Jasmine and Zuffa, Simone and Wishart, David S. and Kaddurah-Daouk, Rima and Wang, Mingxun and Raffatellu, Manuela and Zengler, Karsten and Pluskal, Tom{\'a}{\v s} and Xu, Libin and Knight, Rob and Tsunoda, Shirley M. and Dorrestein, Pieter C.},
title = {Empirically establishing drug exposure records directly from untargeted metabolomics data},
elocation-id = {2024.10.07.617109},
year = {2024},
doi = {10.1101/2024.10.07.617109},
publisher = {Cold Spring Harbor Laboratory},
url = {https://www.biorxiv.org/content/early/2024/10/26/2024.10.07.617109},
eprint = {https://www.biorxiv.org/content/early/2024/10/26/2024.10.07.617109.full.pdf},
journal = {bioRxiv}
}
Despite extensive efforts, extracting information on medication exposure from clinical records remains challenging. To complement this approach, we developed the tandem mass spectrometry (MS/MS) based GNPS Drug Library. This resource integrates MS/MS data for drugs and their metabolites/analogs with controlled vocabularies on exposure sources, pharmacologic classes, therapeutic indications, and mechanisms of action. It enables direct analysis of drug exposure and metabolism from untargeted metabolomics data independent of clinical records. Our library facilitates stratification of individuals in clinical studies based on the empirically detected medications, exemplified by drug-dependent microbiota-derived N-acyl lipid changes in a cohort with human immunodeficiency virus. The GNPS Drug Library holds potential for broader applications in drug discovery and precision medicine.
Competing Interest Statement: R.S.: R.S. is a co-founder of mzio GmbH. D.M.: D.M. is a consultant for BiomeSense, Inc., has equity and receives income. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. R.K.-D.: R.K.-D. is an inventor on a series of patents on use of metabolomics for the diagnosis and treatment of CNS diseases and holds equity in Metabolon Inc., Chymia LLC and PsyProtix. M.W.: M.W. is a co-founder of Ometa Labs LLC. T.P.: T.P. is a co-founder of mzio GmbH. R.K.: R.K. is a scientific advisory board member, and consultant for BiomeSense, Inc., has equity and receives income. He is a scientific advisory board member and has equity in GenCirq. He is a consultant for DayTwo, and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a cofounder of Micronoma, and has equity and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. S.M.T.: S.M.T. receives research funding from Veloxis Pharmaceuticals. P.C.D.: P.C.D. is a scientific advisor and holds equity in Cybele, and bileOmix, and is a Scientific Co-founder, and advisor and holds equity in Ometa, Arome, and Enveda with prior approval by UC-San Diego.
Mannochio-Russo, H., Charron-Lamoureux, V., van Faassen, M., Lamichhane, S., Gonçalves Nunes, W. D., Deleray, V., Patan, A., Vittali, K., Rajkumar, P., El Abiead, Y., Zhao, H. N., Portal Gomes, P. W., Mohanty, I., Lee, C., Sund, A., Sharma, M., Liu, Y., Pattynama, D., Walker, G. T., … Dorrestein, P. C. (2024). The microbiome diversifies N-acyl lipid pools - including short-chain fatty acid-derived compounds. BioRxiv. https://www.biorxiv.org/content/early/2024/11/03/2024.10.31.621412
@article{MannochioRusso2024,
author = {Mannochio-Russo, Helena and Charron-Lamoureux, Vincent and van Faassen, Martijn and Lamichhane, Santosh and Gon{\c c}alves Nunes, Wilhan D. and Deleray, Victoria and Patan, Abubaker and Vittali, Kyle and Rajkumar, Prajit and El Abiead, Yasin and Zhao, Haoqi Nina and Portal Gomes, Paulo Wender and Mohanty, Ipsita and Lee, Carlynda and Sund, Aidan and Sharma, Meera and Liu, Yuanhao and Pattynama, David and Walker, Gregory T. and Norton, Grant J. and Khatib, Lora and Andalibi, Mohammadsobhan S. and Wang, Crystal X. and Ellis, Ronald J. and Moore, David J. and Iudicello, Jennifer E. and Franklin, Donald and Letendre, Scott and Chin, Loryn and Walker, Corinn and Renwick, Simone and Zemlin, Jasmine and Meehan, Michael J. and Song, Xinyang and Kasper, Dennis and Burcham, Zachary and Kim, Jane J. and Kadakia, Sejal and Raffatellu, Manuela and Bode, Lars and Zengler, Karsten and Wang, Mingxun and Siegel, Dionicio and Knight, Rob and Dorrestein, Pieter C.},
title = {The microbiome diversifies N-acyl lipid pools - including short-chain fatty acid-derived compounds},
elocation-id = {2024.10.31.621412},
year = {2024},
doi = {10.1101/2024.10.31.621412},
publisher = {Cold Spring Harbor Laboratory},
url = {https://www.biorxiv.org/content/early/2024/11/03/2024.10.31.621412},
eprint = {https://www.biorxiv.org/content/early/2024/11/03/2024.10.31.621412.full.pdf},
journal = {bioRxiv}
}
N-acyl lipids are important mediators of several biological processes including immune function and stress response. To enhance the detection of N-acyl lipids with untargeted mass spectrometry-based metabolomics, we created a reference spectral library retrieving N-acyl lipid patterns from 2,700 public datasets, identifying 851 N-acyl lipids that were detected 356,542 times. Of these, 777 are not documented in lipid structural databases, with 18% of these derived from short-chain fatty acids and found in the digestive tract and other organs. Their levels varied with diet, microbial colonization, and in people living with diabetes. We used the library to link microbial N-acyl lipids, including histamine and polyamine conjugates, to HIV status and cognitive impairment. This resource will enhance the annotation of these compounds in future studies to further the understanding of their roles in health and disease and highlight the value of large-scale untargeted metabolomics data for metabolite discovery.
Competing Interest Statement: PCD: PCD is an advisor and holds equity in Cybele, BileOmix and Sirenas and is a Scientific co-founder of, advisor to and holds equity in Ometa, Enveda, and Arome with prior approval by UC-San Diego. PCD also consulted for DSM animal health in 2023. MW: MW is a co-founder of Ometa Labs LLC. RK: Rob Knight is a scientific advisory board member, and consultant for BiomeSense, Inc., has equity and receives income. He is a scientific advisory board member and has equity in GenCirq. He is a consultant for DayTwo, and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a cofounder of Micronoma, and has equity and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies.