Update paper.md

justinjjvanderhooft · web-flow · commit b59c4f7a054a · 2026-04-02T17:03:56.000+02:00
Removed unintended duplications of text in Statement of Need section.
diff --git a/joss/paper.md b/joss/paper.md
@@ -46,17 +46,17 @@ authors:
     corresponding: true
     affiliation: "3, 7"
 affiliations:
- - name: Netherlands eScience Center, Netherlands
+ - name: Netherlands eScience Center, the Netherlands
    index: 1
  - name: RECETOX, Faculty of Science, Masaryk University, Kotlářská 2, 60200, Brno, Czech Republic
    index: 2
- - name: Bioinformatics Group, Wageningen University & Research, Netherlands
+ - name: Bioinformatics Group, Wageningen University & Research, the Netherlands
    index: 3
  - name: Naicons Srl, Milan, Italy
    index: 4
  - name: Interfaculty Institute of Microbiology and Infection Medicine Tübingen (IMIT), University of Tübingen, Germany
    index: 5
- - name: Newcastle University, Biosciences Institute, Newcastle upon Tyne, UK
+ - name: Newcastle University, Biosciences Institute, Newcastle upon Tyne, United Kingdom
    index: 6
  - name: Department of Biochemistry, University of Johannesburg, 2006 Johannesburg, South Africa
    index: 7
@@ -70,8 +70,8 @@ Natural product discovery increasingly relies on the integration of multi-omics
 
 # Statement of need
 Omics datasets have become a key resource for natural products discovery, enabling the systematic exploration of specialized metabolites, the refinement of knowledge of known natural products, and the identification of novel bioactive compounds or metabolic enzymes. Paired omics analyses combine complementary genomics (e.g., biosynthetic gene clusters (BGCs)) and metabolomics (e.g., mass spectra and mass fragmentation or MS/MS spectra) datasets to elucidate gene-metabolite relationships, accelerating the discovery process [@goering_metabologenomics_2016; @leao_npomix_2022; @hooft_linking_2020].
-Several computational strategies have been developed to propose such gene cluster-mass spectral links, including i) feature-based approaches that match predicted structural or substructure information between genomes and tandem mass spectrometry (.e.g., Pep2Path, MetaRiPPquest, MetaMiner, DeepRiPP) [@medema2014pep2path; @mohimani2017metarippquest; @cao2019metaminer; @merwin2020deepripp], ii) correlation-based “metabologenomics” that infer co-occurrence patterns across strains or samples [@goering_metabologenomics_2016], and iii) hybrid frameworks such as NPLinker and the machine learning classifier-based NPOmix [@eldjarn_ranking_2021; @leao_npomix_2022], with recently proposed phylogeny-aware extensions to reduce false positive associations [@boldt2026phylogeny]. However, the connected omics data structures, preproccessing pipelines, resources, and annotation tools are constantly being improved: curated BGC repositories such as MIBiG are continually expanded with new entries and annotation fields [@zdouc_mibig_2025], while community MS/MS resources such as GNPS spectral libraries keep growing [@wang_sharing_2016]. Together with the constant expansion of experimental datasets, this puts a strain on downstream frameworks that integrate processed omics data and link results. Hence, natural products discovery would benefit from up-to-date and user-friendly software packages that parse processed omics data and connect it with algorithms returning ranked, queryable gene cluster - mass spectra links to prioritize links to further investigate manually. Here, we redesigned NPLinker to provide such an integrative omics tool that guides both users and developers in paired omics mining with its modular setup. For example, recent developments in omics processing, annotation tools, and ranking metrics could be added to the framework [@louwen_enhanced_2023; @louwen_ipresto_2023]. 
Moreover, several of such linking scores could then be used together with the currently implemented strain correlation score to further improve ranking results.    
-Omics datasets have become a key resource for natural products discovery, enabling the systematic exploration of specialized metabolites, the refinement of knowledge of known natural products, and the identification of novel bioactive compounds or metabolic enzymes. Paired omics analyses combine complementary genomics (e.g., biosynthetic gene clusters (BGCs)) and metabolomics (e.g., mass spectra) datasets to elucidate gene-metabolite relationships, accelerating the discovery process [@goering_metabologenomics_2016; @leao_npomix_2022; @hooft_linking_2020]. However, omics data structures, preproccessing pipelines, resources, and annotation tools are constantly being improved. For example, newer releases of MIBiG contain more validated BGCs and new annotation fields [@zdouc_mibig_2025], while mass spectral libraries are growing in size and information as well [@wang_sharing_2016]. Besides, newer versions of omics clustering tools have different output file formats. Together with the constant expansion of available experimental datasets, this puts a strain on downstream frameworks that integrate the data and results. Hence, researchers working in the natural products discovery field, or anyone with paired genomics and metabolomics data, would benefit from up-to-date and user-friendly software packages that parse processed omics data and connect it with algorithms returning ranked, queryable gene cluster - mass spectra links to prioritize links to further investigate manually. Here, we redesigned NPLinker to provide such an integrative omics tool that guides users in paired omics mining with its modular setup. The tool is also of interest to developers in this field, as recent developments in omics processing, annotation tools, and ranking metrics could be added to the framework [@louwen_enhanced_2023; @louwen_ipresto_2023]. 
Moreover, several of such linking scores could then be used together with the currently implemented strain correlation score to further improve ranking results.    
+Several computational strategies have been developed to propose such gene cluster-mass spectral links, including i) feature-based approaches that match predicted structural or substructure information between genomes and tandem mass spectrometry (e.g., Pep2Path, MetaRiPPquest, MetaMiner, DeepRiPP) [@medema2014pep2path; @mohimani2017metarippquest; @cao2019metaminer; @merwin2020deepripp], ii) correlation-based “metabologenomics” approaches that infer co-occurrence patterns across strains or samples [@goering_metabologenomics_2016], and iii) hybrid frameworks such as NPLinker and the machine learning classifier-based NPOmix [@eldjarn_ranking_2021; @leao_npomix_2022], with recently proposed phylogeny-aware extensions to reduce false positive associations [@boldt2026phylogeny]. However, the connected omics data structures, preprocessing pipelines, resources, and annotation tools are constantly being improved: curated BGC repositories such as MIBiG are expanded at every release with new entries and new annotation fields [@zdouc_mibig_2025], while community mass spectral (MS/MS) resources such as GNPS spectral libraries keep growing in size and information content [@wang_sharing_2016]. In addition, newer versions of omics clustering tools have different output file formats. Together with the constant expansion of experimental datasets, this puts a strain on downstream frameworks that integrate processed omics data and link results.  
+Hence, researchers working in the natural products discovery field, or anyone with paired genomics and metabolomics data, would benefit from up-to-date and user-friendly software packages that parse processed omics data and connect it with algorithms that return ranked, queryable gene cluster-mass spectra links, helping to prioritize links for further manual investigation. Here, we redesigned NPLinker to provide such an integrative omics tool that guides users in paired omics mining with its modular setup. The tool is also of interest to developers in this field, as recent developments in omics processing, annotation tools, and ranking metrics could be added to the framework [@louwen_enhanced_2023; @louwen_ipresto_2023]. Moreover, several such linking scores could then be used together with the currently implemented strain correlation score to further improve ranking results.    
 
 ![The NPLinker 2 framework. The current pipeline consists of five main components: 1. Initiating an analysis with an input block that includes configuration file and optional input data; 2. Preparing dataset by automatically downloading or generating data; 3. Loading and parsing data from data files; 4. Scoring and linking data; 5. Creating an output for analysis and visualization of results.\label{fig:1}](fig1.png)