SyM-GEM: A Pathway Builder for Genome-Scale Models
Hadi Nazem-Bokaee1, Jiun Y Yen1, Ahmad IM Athamneh1, Advait A Apte1, Michael J McAnulty1, Ryan S Senger1,2*
1Department of Biological Systems
Engineering, Virginia Tech, Blacksburg, VA, USA
2Department of Chemical Engineering, Virginia Tech, Blacksburg, VA, USA
*Corresponding author: Ryan S Senger, Department of Biological Systems Engineering and Chemical Engineering, Virginia Tech, Blacksburg, VA, USA. Tel: +15402319501; Email:senger@vt.edu
Received Date: 25 September, 2017; Accepted Date: 11 October, 2017; Published Date: 18 October, 2017
Citation: Nazem-Bokaee H, Yen JY, Athamneh AIM, Apte AA, McAnulty MJ, et al. (2017) SyM-GEM: A Pathway Builder for Genome-Scale Models. Adv Biochem Biotehcnol 2: 141. DOI: 10.29011/2574-7258.000041
1. Abstract
The BioSynthetic andMetabolic Pathway Builder and Genome-Scale
Model Database (SyM-GEM) is a new web application capable of adding
user-generated de
novo biosynthetic pathways to genome-scale metabolic flux models
relevant to biotechnology.SyM-GEM is available freely to academic users as a
web application (http://www.mesb.bse.vt.edu/SyM-GEM),
and all source code is available through GitHub (https://github.com/SengerLab/SyM-GEM) under an MIT
license agreement.The rapid growth in genome-scale model generation
from different research groups and sources has producedincompatible systems of model
nomenclature.In response, SyM-GEM incorporates (i) a novel probabilistic
compound grouping algorithm, based on the PubChem Compound database, to
synchronize existing genome-scale models and (ii) an application to install
user-generatedmetabolic pathway(s) to one or multiple models.Modified or
unmodified models can be downloaded from SyM-GEM in Systems Biology Markup Language
(SBML) format and are compatible with the popular COnstraint-Based
Reconstruction and Analysis (COBRA) Toolbox.
Genome-scale
metabolic flux models enable the study of metabolism in silico and have clinical and
metabolic engineering applications [1-3]. Previously
reviewed applications of genome-scale models include (i) analyzing the
properties of complicated cellular metabolic networks [4-6],
(ii) predicting phenotypes and evaluating experimental data [7-9], and (iii) metabolic engineering [10-12] with industrial, medical, and environmental
applications. Often, metabolic engineering applications involve the
installation of a de novobiosynthetic pathway. We have identified challenges in
synthesizing and installing these pathways in existing genome-scale metabolic
flux models so their cell-wide metabolic demands can be studied. In response,
we have created the BioSynthetic and Metabolic Pathway Builder and Genome-Scale
Model Database (SyM-GEM). This is a web application available freely to
academic users (subject to licensing agreement available on the site) at
http://www.mesb.bse.vt.edu/SyM-GEM. All source code is available to academic
users through GitHub at https://github.com/SengerLab/SyM-GEM(subject to an
MIT licensing agreement). In the following sections, we discuss the
relationship of SyM-GEM to other existing tools, why adding a de novo
biosynthetic pathway to an existing genome-scale model is non-trivial, and what
algorithms and automations are available through SyM-GEM to simplify the
process.
Several repositories for biochemical data and metabolic modeling tools now exist. Many of these are described further in the Supplementary Appendix. Among the databases of metabolic network information, a wide variety of systems of nomenclature for representing biochemical reactions and compounds are used. Over the past two decades, “genome-scale” metabolic network reconstructions have been built using different representations of compounds and reactions, which are often inconsistent with other formats. This creates a challenge in synchronizing available networks and models. This is illustrated in Table 1 for the simple case of the ATP synthase catalyzed reaction. Here, the same reaction was extracted from nine different genome-scale models. Not only do these models differ in notation of reactions and compounds, they also differ in proton balancing. Moreover, inconsistencies, redundancies, and ambiguities inside individual models and databases result in difficulties developing a universal standardized approach that enables direct incorporation of metabolic information from these models. Developments such as BiGG[13], MEMOSyS [14], MetRxn[15], KBase[16], and others have attempted to address these critical issues. The integration of genome-scale models into a unified format is a logical step for the progression of model building and systems biology in general. Ultimately, standardized formats will enable the simulation of multiple models simultaneously. This will be extremely useful and necessary for studying microbial interactions. In the current research, the focus is placed on compound synchronization among models (while maintaining original nomenclatures) and enabling the addition of de novo biosynthetic and metabolic pathways in multiple genome-scale models for the production of new fuels, chemicals, and pharmaceuticals.
SyM-GEM was constructed with genome-scale models relevant to metabolic engineering in biotechnology and a novel algorithm that synchronizes the notation of compounds and reactions between models prepared in different formats. Overall, SyM-GEM incorporates: (i) deriving standardization between genome-scale models of different formats; (ii) storing synchronized models in an on-line database; (iii) adding user-derived de novo synthetic metabolic pathways to one, multiple, or all models in the database; and (iv) retrieving the original and/or modified model(s) in Systems Biology Markup Language (SBML) [17] format. To develop this application, a novel algorithm based on probabilistic associations was created to pinpoint identical compounds across models of differing formats and naming conventions. The method of indexing compound data across multiple genome-scale models used herein makes use of the well-curated data available publicly through the PubChem Compound database [18,19]. From there, the online genome-scale model database portion of SyM-GEM was created and contains the standardized models. This database serves as the platform for de novo biosynthetic metabolic pathway addition to genome-scale models through an interactive web application. Currently, the database contains 12 models (listed in Table 2), and all have relevance to biotechnology.
1. Systems and Methods
The SyM-GEM online database and web application for metabolic pathway addition was developed using MySQL and PHP and has been made freely available to academic users under a licensing agreement available on the site (http://www.mesb.bse.vt.edu/SyM-GEM). The source-code for the compound grouping algorithm is also available to academic users under a licensing agreement through GitHub (https://github.com/SengerLab/SyM-GEM). Genome-scale models were simulated using MATLAB (R2015b) (MathWorks, Natick, MA) with the Constraint-Based Reconstruction and Analysis (COBRA) Toolbox v2.0.5 [20] and the open-source GLPK (GNU Linear Programming Kit) software (http://www.gnu.org/software/glpk). To read and write models as SBML, the SBML Toolbox v4.1.0 and SBML library v5.8.0 were used [17].
2. Results
The overall methodology developed for SyM-GEM is shown in Figure 1 and is described in detail in the
Supplementary Appendix. The major components of SyM-GEM are described in the
following 6 steps. Step 1: Read and convert genome-scale models for use with
COBRA and SyM-GEM. Genome-scale models encoded in SBML format can be read
directly by the COBRA Toolbox in MATLAB. However, to our knowledge, no
universal script exists to read models prepared in other formats. In response,
a MATLAB script was created (and made available through GitHub) that uses regular
expressions to parse genome-scale models encoded in virtually any format
including those used in the BiGG[13], KEGG [21], MetaCyc [22], Model
SEED [23], and MetRxn
[15] databases and makes them compatible with the COBRA Toolbox in
MATLAB. Steps 2-5: Synchronization of compounds using a novel Compounds
Grouping Algorithm (CGA). Here, compound names are interfaced with the PubChem
Compound database [18] to identify identical
compounds with different names and identifiers used by different models. The CGA
is separated into distinct steps for clarity, and the CGA MATLAB script is also
available through GitHub. Step 2: Find identical compounds across all models by
querying against PubChem. This step uses direct search function to identify
compound names in multiple models and in PubChem. Step 3: Find additional
identical compounds by calculating the probability of a compound being
identical or similar to others across all models and the PubChem Compound
database. This probabilistic model for grouping identical and similar compounds
across all models is novel to SyM-GEM and is described in further detail and
demonstrated in the Supplementary Appendix. Step 4: Generate a repository of
identical compounds. Step 5: Evaluate the groups in the repository, detect anomalies,
and remove redundancies. Step 6: Build the SyM-GEM genome-scale model database
and web application.
The initial version of SyM-GEM was constructed using 11 available models
and contains 7,098 unique compounds. SyM-GEM has since been updated to contain
12 models (Table 2). The initial CGA resulted in
construction of 2,510 non-redundant compound groups. A comprehensive step-by-step
tutorial of how to use SyM-GEM to add a de novo biosynthetic pathway to an existing
genome-scale model and download a COBRA-compatible SBML version of the new
model is also presented in the Supplementary Appendix.
3. Discussion
SyM-GEM is a novel database and web application that allows systems biologists and metabolic engineers to access synchronized genome-scale metabolic flux models through a user-friendly on-line platform. Users can further modify available models by constructing and adding customized synthetic metabolic pathways. The original and/or modified models are downloadable as SBML files that can be readily used by common software packages such as MATLAB with the popular COBRA Toolbox. The underlying approach that achieved synchronization among differently formatted genome-scale models, the CGA, uses a series of data-mining steps and probabilistic modeling that collectively provide a fast and reliable method for finding and grouping identical compounds. This algorithm does not require large storage space for pooling all available biochemical data. In Step 2 of compound synchronization between models, the CGA directly queries information from the on-line well-curated PubChem Compound database. Thus, the problem of redundancies and inconsistencies can be bypassed due to its immense size and inclusion of all known names for each compound. Moreover, the PubChem Compound database has cross-linked information on compounds from over more than 200 sources and databases. The probabilistic association methodology (Step 3) along with the final compound treatment and evaluation procedure of the CGA resolve compound associations for those compounds that could not be found in the PubChem Compound database or lack specificity in their names. These inconsistencies arise mainly because (i) there are compounds in models with poor annotations (e.g., “salcn,” “Unknown product,” and “Acceptor”), too specific annotations (e.g., “Cell wall of B. subtilis” and “(O16 antigen)x4 core oligosaccharide lipid A”), or wrong annotations (e.g., “3',5'-Cyclic AMP, Adenosine 3[-,5[-cyclic monophosphate, Cyclic adenylic acid”), and (ii) the PubChem Compound database does not contain all macromolecules (e.g., tRNAs and lipids). The CGA allowed for finding identical compounds from a pool of 7,098 compounds collected from the original 11 genome-scale models and combining them into 2,510 unique groups. For example, the CGA was capable of grouping “L-Glutamate 5-Phosphate,” “L-Glutamyl 5-phosphate,” “L-Glutamyl_5-phosphate,” “glu5p,” and “cpd02097” into the same compound group. All of these, in fact, represent the same compound. The algorithm has the potential of improving its performance as more information is fed by adding additional genome-scale models to the database. For example, in the initial version of SyM-GEM, compounds “Triphosphate” and “Inorganic triphosphate” could not be linked together under the same group because none of the models in the database support this relationship. In addition, the CGA has the potential, in future implementations, to incorporate user knowledge to improve and/or assign associations; thus, further curation and improvements through “Crowd sourcing” are possible in future versions of SyM-GEM.
The SyM-GEM web application offers an effective method for incorporating
new biosynthetic metabolic pathways into one or multiple genome-scale models.
These modified models enable comprehensive analysis and comparison among
various metabolic engineering strategies for optimal production of a specific
target. SyM-GEM allows construction of a de novo biosynthetic pathway and its simultaneous
addition to multiple models without user regard for the different model formats
or specific nomenclatures used by individual model. In the current version of
the SyM-GEM, there are both highly-curated models and draft reconstructions.
The approach designed in this research has the potential to be complemented
with automated genome-scale model reconstruction pipelines or even other
databases of models to enable: (i) synchronization among newly published models
and those residing in SyM-GEM, (ii) addition of biosynthetic metabolic pathways
to an expanded number of models, and (iii) producing modified SBML files of
models for widespread distribution. Ultimately, the goal of SyM-GEM is to
support systems biology and metabolic engineering where one or more biosynthetic
pathways are added and evaluated in multiple hosts simultaneously.This will aid
in choosing a proper host for expression of a desired pathway by allowing
exploration of factors such as cofactor availability, biosynthesis of competing
products, substrate utilization, and bioprocess considerations.
Figure 1:Flowchart of SyM-GEM development: (i) custom reading of differently formatted genome-scale models, (ii) synchronization of models through
a novel Compound Grouping Algorithm (CGA), (iii)
development of an on-line database of genome-scale models, and (iv) a web application to add synthetic
metabolic pathway(s) to one or multiple models.
Organism Name (Model ID)
|
ATP Synthase Reaction |
Reference |
Aspergillus nidulans(iHD666) |
ADP[m] + PI[m] + 3.88 H+_PO -> ATP[m] + H2O[m] + 3.88 H+_PO[m] |
|
Bacillus subtilis(iBsu1103) |
cpd00008 + cpd00009 + (4) cpd00067[e] <=> cpd00001 + cpd00002 + (3) cpd00067 |
|
Clostridium acetobutylicum (iCAC490) |
ADP[c] + Orthophosphate[c] + 3 H+[e] <=> H2O[c] + ATP[c] + 2.48 H+[c] |
|
Corynebacterium glutamicum |
ADP + Pi + 4 PROTON[e] <=> ATP + WATER + PROTON |
|
Escherichia coli (iAF1260) |
ADP[c] + (4) H[p] + pi[c] <==>ATP[c] + (3) H[c] + H2O[c] |
|
Geobactersulfurreducens |
ADP[c] + (4) H[e] + pi[c] -->ATP[c] + (3) H[c] + H2O[c] |
|
Methanosarcinabarkeri (iMG746) |
ADP[c] + 4 H[e] + pi[c]<=>ATP[c] + H2O[c] + 3 H[c] |
|
Pseudomonas putida (iJP815) |
IC00008 + IC00009 + 4 EC00065[e] <==> IC00001 + IC00002 + 3 IC00065 |
|
Saccharomyces cerevisiae (iND750) |
ADP[g] + (3) H[c] + pi[g] -->ATP[g] + (2) H[g] + H2O[g] |
Table 1: Representation of the ATP Synthase Reaction in Several Genome-Scale Models.
Organism Name (Model ID)
|
Reference |
Bacillus subtilis(iBsu1103) |
|
Clostridium acetobutylicum ATCC 824 (iCAC498) |
|
Clostridium beijerinkiiNCIMB 8052(iCM925) |
|
Corynebacterium glutamicumATCC 13032 (Seed196627.4) |
|
Escherichia coli K-12 MG1655 (iAF1260) |
|
Geobactersulfurreducens |
|
Lactobacillus plantarum WCFS1 |
|
Methanosarcinabarkeri(iMG746) |
|
Pseudomonas putidaKT2440 (Seed160488.1) |
|
Saccharomyces cerevisiae (iND750) |
|
Synechocystissp. strain PCC6803 (iJN678) |
|
Yarrowialipolytica (iYL619_PCP) |
|
*The iCAC498 model has replaced the originally published iCAC490 model. |
Table 2: Genome-Scale Models Currently Available in SyM-GEM.
© by the Authors & Gavin Publishers. This is an Open Access Journal Article Published Under Attribution-Share Alike CC BY-SA: Creative Commons Attribution-Share Alike 4.0 International License. Read More About Open Access Policy.