publications

Prajit Rajkumar

accepted

Rajkumar, P., Tang, R., Sapre, H., Zemlin, J., Deleray, V., Seo, J. I., Mohan, S., Xing, S., Gouda, H., El Abiead, Y., Tsunoda, S. M., Zhao, H. N., & Dorrestein, P. C. (2026). Retrieval-Augmented Language Models Enable Scalable Chemical Source Classification in Metabolomics Workflows. Analytical Chemistry. https://doi.org/10.1021/acs.analchem.5c05301

@article{Rajk2026,
  title = {Retrieval-Augmented Language Models Enable Scalable Chemical
                 Source Classification in Metabolomics Workflows},
  author = {Rajkumar, Prajit and Tang, Runbang and Sapre, Harshada and Zemlin, Jasmine and Deleray, Victoria and Seo, Jeong In and Mohan, Siddharth and Xing, Shipei and Gouda, Harsha and El Abiead, Yasin and Tsunoda, Shirley M and Zhao, Haoqi Nina and Dorrestein, Pieter C},
  journal = {Analytical Chemistry},
  publisher = {American Chemical Society},
  month = jan,
  year = {2026},
  doi = {10.1021/acs.analchem.5c05301},
  url = {https://doi.org/10.1021/acs.analchem.5c05301}
}

There is a growing need for scalable chemical classification to support the interpretation of exposomics and metabolomics data. While structural categorization has been largely automated, functional and exposure-based labeling of chemicals remains a manual and time-consuming process. Here, we present chemsource, a flexible framework that integrates large language models (LLMs) with retrieval-augmented generation (RAG) to automate chemical classification. chemsource retrieves descriptive text from Wikipedia or PubMed abstracts based on chemical names and prompts LLMs to assign user-defined categories based on the retrieved content. We demonstrate classification into five exposure categories (endogenous metabolites, food molecules, drugs, personal care products, and industrial chemicals), as well as combinations of these. Benchmarking against manually curated labels for 4,953 compounds showed 75% overall agreement, with category-level recall exceeding 75% across all classes. Expert review indicated that most discrepancies could be attributed to prompt ambiguity and incomplete manual labels rather than model failure. To demonstrate the utility of chemsource in metabolomics workflows, we applied it to eight public untargeted metabolomics data sets, revealing distinct exposure patterns across human biospecimens, mouse tissues, environmental dust, and consumer product extracts. chemsource is customizable via prompt editing, enabling diverse classification tasks without requiring coding expertise. The tool is freely available as a Python package (https://pypi.org/project/chemsource/). Text retrieval is free; classification requires user-supplied LLM API access.

Zhao, H. N., Kvitne, K. E., Brungs, C., Mohan, S., Charron-Lamoureux, V., Bittremieux, W., Tang, R., Schmid, R., Lamichhane, S., Xing, S., El Abiead, Y., Andalibi, M. S., Mannochio-Russo, H., Ambre, M., Avalon, N. E., Bryant, M., Burnett, L. A., Caraballo-Rodríguez, A. M., Maya, M. C., … Dorrestein, P. C. (2025). A resource to empirically establish drug exposure records directly from untargeted metabolomics data. Nature Communications, 16(1), 10600. https://doi.org/10.1038/s41467-025-65993-5

@article{Zhao2025,
  title = {A resource to empirically establish drug exposure records
                directly from untargeted metabolomics data},
  author = {Zhao, Haoqi Nina and Kvitne, Kine Eide and Brungs, Corinna and Mohan, Siddharth and Charron-Lamoureux, Vincent and Bittremieux, Wout and Tang, Runbang and Schmid, Robin and Lamichhane, Santosh and Xing, Shipei and El Abiead, Yasin and Andalibi, Mohammadsobhan S and Mannochio-Russo, Helena and Ambre, Madison and Avalon, Nicole E and Bryant, Mackenzie and Burnett, Lindsey A and Caraballo-Rodr{\'\i}guez, Andr{\'e}s Mauricio and Maya, Martin Casas and Chin, Loryn and Corominas, Llu{\'\i}s and Ellis, Ronald J and Franklin, Donald and Girod, Sagan and Gomes, Paulo Wender P and Hansen, Lauren and Heaton, Robert K and Iudicello, Jennifer E and Jarmusch, Alan K and Khatib, Lora and Letendre, Scott and Magyari, Sarolt and McDonald, Daniel and Mohanty, Ipsita and Cumsille, Andr{\'e}s and Moore, David J and Rajkumar, Prajit and Ross, Dylan H and Sapre, Harshada and Shahneh, Mohammad Reza Zare and Gil-Solsona, Ruben and Thomas, Sydney P and Tribelhorn, Caitlin and Tubb, Helena M and Walker, Corinn and Wang, Crystal X and Zemlin, Jasmine and Zuffa, Simone and Wishart, David S and Gago-Ferrero, Pablo and Kaddurah-Daouk, Rima and Wang, Mingxun and Raffatellu, Manuela and Zengler, Karsten and Pluskal, Tom{\'a}{\v s} and Xu, Libin and Knight, Rob and Tsunoda, Shirley M and Dorrestein, Pieter C},
  journal = {Nature Communications},
  volume = {16},
  number = {1},
  pages = {10600},
  month = dec,
  year = {2025},
  url = {https://doi.org/10.1038/s41467-025-65993-5},
  doi = {10.1038/s41467-025-65993-5}
}

Despite extensive efforts, extracting medication exposure information from clinical records remains challenging. To complement this approach, here we show the Global Natural Products Social Molecular Networking (GNPS) Drug Library, a tandem mass spectrometry (MS/MS) based resource designed for drug screening with untargeted metabolomics. This resource integrates MS/MS references of drugs and their metabolites/analogs with standardized vocabularies on their exposure sources, pharmacologic classes, therapeutic indications, and mechanisms of action. It enables direct analysis of drug exposure and metabolism from untargeted metabolomics data, supporting flexible summarization at multiple ontology levels to align with different research goals. We demonstrate its application by stratifying participants in a human immunodeficiency virus (HIV) cohort based on detected drug exposures. We uncover drug-associated alterations in microbiota-derived N-acyl lipids that are not captured when stratifying by self-reported medication use. Overall, GNPS Drug Library provides a scalable resource for empirical drug screening in clinical, nutritional, environmental, and other research disciplines, facilitating insights into the ecological and health consequences of drug exposures. While not intended for immediate clinical decision-making, it supports data-driven exploration of drug exposures where traditional records are limited or unreliable.

Mannochio-Russo, H., Charron-Lamoureux, V., van Faassen, M., Lamichhane, S., Gonçalves Nunes, W. D., Deleray, V., Ayala, A. V., Tanaka, Y., Patan, A., Vittali, K., Rajkumar, P., El Abiead, Y., Zhao, H. N., Gomes, P. W. P., Mohanty, I., Lee, C., Sund, A., Sharma, M., Liu, Y., … Dorrestein, P. C. (2025). The microbiome diversifies long- to short-chain fatty acid-derived N-acyl lipids. Cell, 188(15), 4154–4169.e19. https://doi.org/10.1016/j.cell.2025.05.015

@article{MannochioRusso2025_1,
  author = {Mannochio-Russo, Helena and Charron-Lamoureux, Vincent and van Faassen, Martijn and Lamichhane, Santosh and Gon{\c{c}}alves Nunes, Wilhan D. and Deleray, Victoria and Ayala, Adriana V. and Tanaka, Yuichiro and Patan, Abubaker and Vittali, Kyle and Rajkumar, Prajit and El Abiead, Yasin and Zhao, Haoqi Nina and Gomes, Paulo Wender Portal and Mohanty, Ipsita and Lee, Carlynda and Sund, Aidan and Sharma, Meera and Liu, Yuanhao and Pattynama, David and Walker, Gregory T. and Norton, Grant J. and Khatib, Lora and Andalibi, Mohammadsobhan S. and Wang, Crystal X. and Ellis, Ronald J. and Moore, David J. and Iudicello, Jennifer E. and Franklin Jr., Donald and Letendre, Scott and Chin, Loryn and Walker, Corinn and Renwick, Simone and Zemlin, Jasmine and Meehan, Michael J. and Song, Xinyang and Kasper, Dennis and Burcham, Zachary and Kim, Jane J. and Kadakia, Sejal and Raffatellu, Manuela and Bode, Lars and Chu, Hiutung and Zengler, Karsten and Wang, Mingxun and Siegel, Dionicio and Knight, Rob and Dorrestein, Pieter C.},
  title = {The microbiome diversifies long- to short-chain fatty acid-derived N-acyl lipids},
  journal = {Cell},
  year = {2025},
  month = jul,
  day = {24},
  publisher = {Elsevier},
  volume = {188},
  number = {15},
  pages = {4154-4169.e19},
  issn = {0092-8674},
  doi = {10.1016/j.cell.2025.05.015},
  url = {https://doi.org/10.1016/j.cell.2025.05.015}
}

N-Acyl lipids are important mediators of several biological processes including immune function and stress response. To enhance the detection of N-acyl lipids with untargeted mass spectrometry-based metabolomics, we created a reference spectral library retrieving N-acyl lipid patterns from 2,700 public datasets, identifying 851 N-acyl lipids that were detected 356,542 times. 777 are not documented in lipid structural databases, with 18% of these derived from short-chain fatty acids and found in the digestive tract and other organs. Their levels varied with diet and microbial colonization and in people living with diabetes. We used the library to link microbial N-acyl lipids, including histamine and polyamine conjugates, to HIV status and cognitive impairment. This resource will enhance the annotation of these compounds in future studies to further the understanding of their roles in health and disease and to highlight the value of large-scale untargeted metabolomics data for metabolite discovery.

Charron-Lamoureux, V., Mannochio-Russo, H., Lamichhane, S., Xing, S., Patan, A., Portal Gomes, P. W., Rajkumar, P., Deleray, V., Caraballo-Rodríguez, A. M., Chua, K. V., Lee, L. S., Liu, Z., Ching, J., Wang, M., & Dorrestein, P. C. (2025). A guide to reverse metabolomics—a framework for big data discovery strategy. Nature Protocols. https://doi.org/10.1038/s41596-024-01136-2

@article{CharronLamoureux2025,
  author = {Charron-Lamoureux, Vincent and Mannochio-Russo, Helena and Lamichhane, Santosh and Xing, Shipei and Patan, Abubaker and Portal Gomes, Paulo Wender and Rajkumar, Prajit and Deleray, Victoria and Caraballo-Rodr{\'i}guez, Andr{\'e}s Mauricio and Chua, Kee Voon and Lee, Lye Siang and Liu, Zhao and Ching, Jianhong and Wang, Mingxun and Dorrestein, Pieter C.},
  title = {A guide to reverse metabolomics---a framework for big data discovery strategy},
  journal = {Nature Protocols},
  year = {2025},
  month = feb,
  day = {28},
  issn = {1750-2799},
  doi = {10.1038/s41596-024-01136-2},
  url = {https://doi.org/10.1038/s41596-024-01136-2}
}

Untargeted metabolomics is evolving into a field of big data science. There is a growing interest within the metabolomics community in mining tandem mass spectrometry (MS/MS)-based data from public repositories. In traditional untargeted metabolomics, samples to address a predefined question are collected and liquid chromatography with MS/MS data are generated. We then identify metabolites associated with a phenotype (for example, disease versus healthy) and elucidate or validate their structural details (for example, molecular formula, structural classification, substructure or complete structural annotation or identification). In reverse metabolomics, we start with MS/MS spectra for known or unknown molecules. These spectra are used as search terms to search public data repositories to discover phenotype-relevant information such as organ/biofluid distribution, disease condition, intervention status (for example, pre- and postintervention), organisms (for example, mammals versus others), geography and any other biologically relevant associations. Here we guide the reader through a four-part process: (1) obtaining the MS/MS spectra of interest (Universal Spectrum Identifier) and (2) Mass Spectrometry Search Tool searches to find the files associated with the MS/MS that are in available databases, (3) using the Reanalysis Data User Interface framework to link the files with their metadata and (4) validating the observations. Parts 1–3 could take from hours to days depending on the method used for collecting MS/MS spectra. For example, we use MS/MS spectra from three small molecules: phenylalanine-cholic acid (a microbially conjugated bile acid), phenylalanine-C4:0 and histidine-C4:0 (two N-acyl amides). We leverage the Global Natural Products Social Molecular Networking-based framework to explore the microbial producers of these molecules and their associations with health conditions and organ distributions in humans and rodents.

Krutkin, D. D., Thomas, S., Zuffa, S., Rajkumar, P., Knight, R., Dorrestein, P. C., & Kelley, S. T. (2025). To Impute or Not To Impute in Untargeted Metabolomics—That is the Compositional Question. Journal of the American Society for Mass Spectrometry. https://doi.org/10.1021/jasms.4c00434

@article{Krutkin2025,
  author = {Krutkin, Dennis D. and Thomas, Sydney and Zuffa, Simone and Rajkumar, Prajit and Knight, Rob and Dorrestein, Pieter C. and Kelley, Scott T.},
  title = {To Impute or Not To Impute in Untargeted Metabolomics---That is the Compositional Question},
  journal = {Journal of the American Society for Mass Spectrometry},
  year = {2025},
  month = feb,
  day = {25},
  publisher = {American Chemical Society},
  issn = {1044-0305},
  doi = {10.1021/jasms.4c00434},
  url = {https://doi.org/10.1021/jasms.4c00434}
}

Untargeted metabolomics often produces large datasets with missing values. These missing values are derived from biological or technical factors and can undermine statistical analyses and lead to biased biological interpretations. Imputation methods, such as k-Nearest Neighbors (kNN) and Random Forest (RF) regression, are commonly used, but their effects vary depending on the type of missing data, e.g., Missing Completely At Random (MCAR) and Missing Not At Random (MNAR). Here, we determined the impacts of degree and type of missing data on the accuracy of kNN and RF imputation using two datasets: a targeted metabolomic dataset with spiked-in standards and an untargeted metabolomic dataset. We also assessed the effect of compositional data approaches (CoDA), such as the centered log-ratio (CLR) transform, on data interpretation since these methods are increasingly being used in metabolomics. Overall, we found that kNN and RF performed more accurately when the proportion of missing data across samples for a metabolic feature was low. However, these imputations could not handle MNAR data and generated wildly inflated or imputed values where none should exist. Furthermore, we show that the proportion of missing values had a strong impact on the accuracy of imputation, which affected the interpretation of the results. Our results suggest imputation should be used with extreme caution even with modest levels of missing data and especially when the type of missingness is unknown.
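
As a rough illustration of the centered log-ratio (CLR) transform mentioned in the abstract above, the sketch below applies the standard textbook definition (each log value minus the mean of the log values) to a single sample. It is not taken from the paper: the function name `clr` and the intensity values are made up for illustration. Because the logarithm is undefined at zero, the sketch rejects non-positive values, which is one reason missing or zero entries have to be dealt with before a CLR step.

```python
import math


def clr(composition):
    """Centered log-ratio transform of one sample (hypothetical helper).

    Each value is replaced by its natural log minus the mean of all logs,
    so the transformed values always sum to zero.
    """
    if any(value <= 0 for value in composition):
        # log(0) is undefined, so zeros or missing values must be handled first
        raise ValueError("CLR is undefined for zero or negative values")
    logs = [math.log(value) for value in composition]
    mean_log = sum(logs) / len(logs)
    return [log_value - mean_log for log_value in logs]


# Made-up feature intensities for a single sample
sample = [10.0, 100.0, 1000.0]
transformed = clr(sample)
print(transformed)
```

The transformed values are centered on zero, so they describe each feature relative to the sample's geometric mean instead of on an absolute scale.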

preprints

Patan, A., Xing, S., Charron-Lamoureux, V., Hu, Z., Deleray, V., Agongo, J., El Abiead, Y., Mannochio-Russo, H., Mohanty, I., Gouda, H., Zemlin, J., Rajkumar, P., Lee, C., Leanos, D., Weimann, N., Tsuda, W., Giddings, S., Bui, T., Kvitne, K. E., … Dorrestein, P. C. (2025). Charting the undiscovered metabolome with synthetic multiplexing. bioRxiv. https://doi.org/10.1101/2025.11.18.689170

@article{Patan2025,
  title = {Charting the undiscovered metabolome with synthetic
                   multiplexing},
  author = {Patan, Abubaker and Xing, Shipei and Charron-Lamoureux, Vincent and Hu, Zhewen and Deleray, Victoria and Agongo, Julius and El Abiead, Yasin and Mannochio-Russo, Helena and Mohanty, Ipsita and Gouda, Harsha and Zemlin, Jasmine and Rajkumar, Prajit and Lee, Carlynda and Leanos, Daniel and Weimann, Noah and Tsuda, Wataru and Giddings, Sadie and Bui, Tammy and Kvitne, Kine Eide and Zhao, Haoqi Nina and Zuffa, Simone and Nguyen, Vivian and Andrade, Aileen and Gon{\c c}alves Nunes, Wilhan Donizete and Caraballo-Rodr{\'\i}guez, Andr{\'e}s M and Caetano David, Lurian and Carver, Jeremy and Bandeira, Nuno and Wang, Mingxun and Burnett, Lindsey A and Siegel, Dionicio and Dorrestein, Pieter C},
  journal = {bioRxiv},
  publisher = {Cold Spring Harbor Laboratory},
  month = nov,
  doi = {10.1101/2025.11.18.689170},
  year = {2025},
  copyright = {https://www.biorxiv.org/about/FAQ\#license}
}

In untargeted metabolomics, reference MS/MS libraries are essential for structural annotation, yet currently explain only 6.9% of the more than 1.7 billion MS/MS spectra in public repositories. We hypothesized that many unannotated features arise from simple, biologically plausible transformations of endogenous and exposure-derived compounds. To test this, we created a reference resource by synthesizing over 100,000 compounds using multiplexed reactions that mimic such biochemical transformations. 91% of the compounds synthesized are absent from existing structural databases. Through improvements in the construction of the computational infrastructure that enables pan repository-scale MS/MS comparisons, searching this biologically inspired MS/MS library increased the overall reference-based match rate by 17.4%, yielding over 60 million new matches and raising the global pan-repository MS/MS annotation rate to 8.1%. By facilitating structural hypotheses for previously uncharacterized MS/MS data, this framework expands the accessible detectable biochemical landscape across human, animal, plant, and microbial systems, revealing previously undescribed metabolites such as ibuprofen-carnitine and 5-ASA-phenylpropionic acid conjugates arising from drug–host and host–microbiome co-metabolism.
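
The three percentages in the abstract above fit together if the 17.4% figure is read as a relative increase over the 6.9% baseline. That reading is an assumption based only on the numbers quoted here, not a statement from the authors. A quick arithmetic check, with variable names chosen for illustration:

```python
# Figures quoted in the abstract
baseline_rate = 6.9   # percent of spectra annotated before the new library
relative_gain = 17.4  # percent increase in match rate, assumed to be relative

# Apply the relative gain to the baseline annotation rate
new_rate = baseline_rate * (1 + relative_gain / 100)
absolute_gain = new_rate - baseline_rate

print(round(new_rate, 1))       # 8.1
print(round(absolute_gain, 1))  # 1.2
```

Under this reading, the annotation rate rises by about 1.2 percentage points, from 6.9% to the reported 8.1%.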

Mannochio-Russo, H., Gonçalves Nunes, W. D., Zhao, H. N., Kvitne, K. E., Xing, S., Gouda, H., Agongo, J., Mohanty, I., Charron-Lamoureux, V., Rajkumar, P., Pakkir Shah, A. K., Walter, A., Krishnaraj, R., El Abiead, Y., Ferreira, P. C., Zuffa, S., Patan, A., Caraballo-Rodríguez, A. M., Bittremieux, W., … Dorrestein, P. (2025). Bridging complexity and accessibility in metabolomics with MetaboApps. ChemRxiv. https://chemrxiv.org/engage/chemrxiv/article-details/68e5680fdfd0d042d15c4900

@article{MannochioRusso2025_2,
  title = {Bridging complexity and accessibility in metabolomics with
                MetaboApps},
  author = {Mannochio-Russo, Helena and Gon{\c c}alves Nunes, Wilhan D and Zhao, Haoqi Nina and Kvitne, Kine Eide and Xing, Shipei and Gouda, Harsha and Agongo, Julius and Mohanty, Ipsita and Charron-Lamoureux, Vincent and Rajkumar, Prajit and Pakkir Shah, Abzer K and Walter, Axel and Krishnaraj, Rithi and El Abiead, Yasin and Ferreira, Patrick C and Zuffa, Simone and Patan, Abubaker and Caraballo-Rodr{\'\i}guez, Andr{\'e}s Mauricio and Bittremieux, Wout and Petras, Daniel and Wang, Mingxun and Dorrestein, Pieter},
  journal = {ChemRxiv},
  year = {2025},
  url = {https://chemrxiv.org/engage/chemrxiv/article-details/68e5680fdfd0d042d15c4900},
  doi = {10.26434/chemrxiv-2025-3nq29},
  publisher = {American Chemical Society}
}

Untargeted metabolomics is a powerful approach for exploring the chemical diversity and dynamics of biological systems. However, the types of questions that can be addressed depend not only on experimental design but also on the data processing and analysis workflows employed, many of which require advanced computational expertise. GNPS1, now transitioning to its second major implementation (GNPS2), has evolved into an expandable platform that supports the integration of modular web applications designed to simplify and enhance downstream analysis. These apps, named MetaboApps, facilitate the post-processing of outputs of several GNPS workflows and help make repository-scale metabolomics knowledge and other areas of metabolomics more accessible to a broader community.