WO2024016407A1 - Heterogeneity-based cell metabolism network modeling method and its application - Google Patents

Info

Publication number
WO2024016407A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
cell
modeling method
model
network modeling
Prior art date
Application number
PCT/CN2022/112025
Other languages
English (en)
French (fr)
Inventor
陶飞
孟宣霖
许平
Original Assignee
上海交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海交通大学
Publication of WO2024016407A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Definitions

  • The invention relates to the field of biology, and in particular to a heterogeneity-based cell metabolism network modeling method and its application.
  • Synthetic biology is an emerging field of the biological sciences that has progressed rapidly in recent years. Traditional biology dissects living organisms to study their internal structures; synthetic biology takes the opposite strategy, building biological systems step by step from the most basic elements in order to reveal the underlying laws by which organisms work.
  • Heterogeneity is a ubiquitous biological phenomenon. Multicellular organisms are composed of a variety of cells with different shapes and functions. Various types of cells are combined in an orderly manner to form tissues and organs. In the case of disease, abnormal cells often hide among normal cells. Similarly, for microorganisms, there is also heterogeneity between microbial cells in a culture (colony or biofilm) formed by propagation of the same ancestral cells. This heterogeneous differentiation of cells within a microbial population has a variety of causes: not only epigenetic differences, but also genetic differences that arise at the population level through the exchange of genetic material between cells and spontaneous mutations within cells.
  • The heterogeneity of microorganisms gives microbial populations a greater chance of survival under environmental stress and is an important means by which bacteria adapt to their environment. It is worth noting that cell heterogeneity can affect overall macroscopic performance by affecting population stability. For example, heterogeneity affects the yield of biomanufacturing, and in the medical field, heterogeneity affects drug resistance. Heterogeneity is essentially caused by the internal differences of cells, so it is reflected in different dimensions such as genes, transcription, proteins, and metabolism, and can be characterized at different levels.
  • the present invention provides a heterogeneity-based cell metabolism network modeling method, which includes the following steps:
  • the single cell data is collected based on different dimensions of the heterogeneity.
  • one or more of a single cell transcriptome, a single cell proteome, and a single cell metabolome is selected to collect the single cell data.
  • a single cell data collection platform was used to collect the data of the single cell transcriptome.
  • cell wall digestion enzymes are used to lyse the single cells, and then the single cell data collection platform is used to collect data on the cell contents after reverse transcription.
  • a single cell data collection platform or mass spectrometry imaging equipment is used to collect data on the single cell proteome.
  • using the mass spectrometry imaging device to collect data on the single cell proteome includes: diluting the single cells, placing them on a conductive glass slide, taking microscopic photos, and using the mass spectrometry imaging device to collect data.
  • a single cell data collection platform or mass spectrometry imaging equipment is used to collect the data of the single cell metabolome.
  • the step of using the mass spectrometry imaging device to collect the data of the single cell proteome includes: diluting the single cells, placing them on a conductive glass slide, taking microscopic photos, and using the mass spectrometry imaging device to collect data.
  • steps of processing the single cell data include:
  • the preset conditions include: cells expressing more than 100 transcript/protein/metabolite features, and/or transcripts/proteins/metabolites shared by more than 1/5 of the cells.
  • steps to construct a cell metabolism model based on artificial intelligence include:
  • An artificial intelligence algorithm is selected to build a model for the target object.
  • the step of establishing a model for the target object includes:
  • the first model is tested using the reserved single cell data.
  • steps to establish an optimized metabolic model include:
  • the optimized metabolic model of the target substance is established based on the results of the visualization processing.
  • each piece of data used for prediction is fluctuated within a preset interval while keeping other data values unchanged.
  • the single prediction data and the random prediction data are visualized using polar coordinates and pictures reflecting the objective distribution of metabolites.
  • the above-mentioned cell metabolism network modeling method provided by the present invention can be applied in cell physiological response prediction, including the following steps:
  • the feature vector is input into the cell metabolism model established using the above cell metabolism network modeling method.
  • the above-mentioned cell metabolism network modeling method provided by the present invention can be applied in cell design, including the following steps:
  • the invention overcomes the shortcomings of the current synthetic biology technology route based on the design-build-test-learn cycle (DBTL), which has a small amount of test data, cannot effectively learn the internal correlations of complex metabolic networks, and further lacks a rational basis for the design part.
  • the method provided by the present invention features a large amount of data and high collection throughput, and can directly analyze and learn complex metabolic networks based on data and AI, thereby establishing a computable and predictive cell model that supports physiological response prediction and rational design of cells.
  • Figure 1 is a schematic diagram of the Uniform Manifold Approximation and Projection (UMAP) of yeast single-cell transcriptomics data
  • Figure 2 is a schematic diagram of the data distribution of yeast single cell transcriptome data
  • Figure 3 is a schematic diagram of deep learning model training based on yeast single cell transcriptome data for high transcript expression of ethanol synthase
  • Figure 4 is a schematic diagram of deep learning model training based on yeast single cell transcriptome data for high protein expression of methanol synthase
  • Figure 5 is a schematic diagram of deep learning model training for high propylene glycol production based on yeast single cell transcriptome data
  • FIG. 6 is a schematic diagram of the Uniform Manifold Approximation and Projection (UMAP) of Chlamydomonas reinhardtii single-cell transcriptomic data
  • Figure 7 is a schematic diagram of the data distribution of single-cell transcriptome data of Chlamydomonas reinhardtii;
  • Figure 8 is a schematic diagram of deep learning model training based on single-cell transcriptome data of Chlamydomonas reinhardtii for high transcript expression of glycerol synthase;
  • Figure 9 is a schematic diagram of deep learning model training based on single-cell transcriptome data of Chlamydomonas reinhardtii for high protein expression of glycerol synthase;
  • Figure 10 is a schematic diagram of deep learning model training for high-yield triglycerides based on Chlamydomonas reinhardtii single-cell transcriptome data;
  • Figure 11 is a flow chart of the heterogeneity-based cell metabolism network modeling method of the present invention.
  • Synthetic biology involves multiple iterations of the design-build-test-learn (DBTL) cycle.
  • heterogeneity is essentially caused by the internal differences of cells. It is reflected in different dimensions such as genes, transcription, proteins, metabolism, etc., and can be characterized at different levels.
  • any macroscopic biological system such as a colony, a tissue or a culture, contains a large number of heterogeneous single cells. For example, in a typical bacterial colony, the number of microbial cells in it is at the level of 1 billion. Therefore, using single-cell technology to collect information from heterogeneous cells can obtain massive levels of single-cell information, that is, single-cell big data, and these data imply the stress mechanism of the metabolic network.
  • This heterogeneity-based single-cell data collection can provide big data suitable for machine learning.
  • modern artificial intelligence methods can be introduced to establish a cell metabolism model, which will fundamentally change the face of the DBTL cycle and promote revolutionary progress in the field of synthetic biology.
  • the present invention provides a cell metabolism network modeling method based on heterogeneity, which uses the characteristics of cell heterogeneity in various dimensions to collect massive single cell data, then processes the single cell data, and then uses Artificial intelligence algorithms build cell metabolism models.
  • the heterogeneity-based cell metabolism network modeling method provided by the present invention includes the following steps:
  • S1. Single cell data collection. Single-cell data collection is based on the manifestation of cell heterogeneity in different dimensions. For example, heterogeneity is reflected in dimensions such as genes, transcription, proteins, and metabolism; one or several dimensions can be selected for data collection.
  • S2. Single cell data processing. Process the data collected in step S1: apply the processing appropriate to each type of single cell data, extract the corresponding data matrix, perform correction, and then perform cell screening and functional analysis to determine the data that are finally retained.
  • step S1 includes:
  • Single cell transcriptome data collection Commercial or non-commercial single cell data collection platforms can be used to collect single cell transcriptome data.
  • data collection platforms include but are not limited to 10X genomics, BD Rhapsody, Fluidigm C1, Bio-Rad, etc.; single-cell transcriptome technology collection known in the existing technology can also be used, such as Smart-seq, CEL-Seq, Quartz-Seq, Drop-seq, InDrop-seq, Smart-seq2, etc.
  • S1.2. Single-cell proteome data collection. Commercial or non-commercial single-cell data collection platforms can be used to collect single-cell proteome data, or mass spectrometry imagers can be used for data collection.
  • Single-cell metabolome data collection Commercial or non-commercial single-cell data collection platforms can be used to collect single-cell metabolome data, or mass spectrometry imagers can be used for data collection.
  • steps S1.1-S1.3 can be selectively deleted according to actual needs, or data collection steps in other dimensions can be added.
  • step S2 includes:
  • Matrix generation: perform matrix extraction on the single cell data. For example, for transcriptomic data, build a database, characterize the single cell transcripts, and use the Seurat package for matrix extraction; for single cell proteome characterization, use the Seurat package for matrix extraction; for single-cell metabolome characterization, use SCiLS Lab software for matrix extraction. Organize the above data and establish dense/sparse data matrices respectively.
  • S2.3. Functional analysis Perform cell population analysis and interest index screening on the data preprocessed matrix.
  • transcripts specifically expressed in some cell populations are used as indicators of interest to distinguish this cell population from other cells; cells that meet the preset conditions are retained.
  • the preset conditions can be set according to actual needs. For example, retain cells that express more than 100 transcript/protein/metabolite features, and retain transcripts/proteins/metabolites shared by more than 1/5 of the cells.
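The example preset conditions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: the function name `filter_cells_and_features`, the cells-by-features matrix layout, the rule that a value greater than zero counts as "expressed", and the order of the two filters (cells first, then features) are all assumptions.

```python
import numpy as np

def filter_cells_and_features(matrix, min_features=100, min_cell_fraction=0.2):
    """Hypothetical helper: apply the two example preset conditions.

    matrix: cells x features array; a value > 0 is treated as "expressed"
    (an assumption, the source does not define detection).
    """
    matrix = np.asarray(matrix, dtype=float)
    detected = matrix > 0
    # Condition 1: keep cells expressing more than `min_features` features.
    cell_mask = detected.sum(axis=1) > min_features
    kept = detected[cell_mask]
    # Condition 2: keep features shared by more than `min_cell_fraction`
    # of the retained cells (1/5 in the source's example).
    if kept.shape[0] == 0:
        feature_mask = np.zeros(matrix.shape[1], dtype=bool)
    else:
        feature_mask = kept.mean(axis=0) > min_cell_fraction
    return matrix[cell_mask][:, feature_mask], cell_mask, feature_mask

# Toy data with lowered thresholds so the effect is visible on a 4 x 4 matrix.
toy = np.array([[1, 0, 2, 0],
                [3, 1, 0, 0],
                [0, 0, 0, 5],
                [2, 2, 2, 0]])
filtered, cells, feats = filter_cells_and_features(
    toy, min_features=1, min_cell_fraction=0.5)
```

With real data the defaults (`min_features=100`, `min_cell_fraction=0.2`) correspond to the thresholds quoted in the text; both are meant to be adjusted to actual needs.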
  • step S3 includes:
  • Target selection You can select one/one type/multiple/multiple types of targets for prediction. For example, select one/one class/multiple/many classes of transcripts/proteins/metabolites.
  • the transcript/protein/metabolite matrix data should be normalized. When transcripts/proteins/metabolites are selected for prediction, the corresponding transcript/protein/metabolite matrix data are summed and then normalized; for example, when a certain metabolite is the prediction target, the matrix values of all metabolites other than that metabolite should be summed and then normalized.
  • Normalization method select maximum and minimum value normalization or select formula normalization.
  • the normalized interval is [-1, 1] or [0, 1], or any interval that can reasonably scale the data.
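Maximum and minimum value normalization onto a chosen interval can be sketched as follows. This is an illustrative NumPy version, assuming a cells-by-features matrix scaled per feature (column); the function name `minmax_scale` and the handling of constant columns are assumptions, and the sketch is not a reimplementation of any particular toolbox function.

```python
import numpy as np

def minmax_scale(matrix, lower=0.0, upper=1.0):
    """Hypothetical helper: scale each column linearly onto [lower, upper]."""
    matrix = np.asarray(matrix, dtype=float)
    col_min = matrix.min(axis=0)
    col_max = matrix.max(axis=0)
    span = col_max - col_min
    # Constant columns have zero span; divide by 1 so they map to `lower`.
    safe_span = np.where(span == 0, 1.0, span)
    unit = (matrix - col_min) / safe_span
    return lower + unit * (upper - lower)

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
scaled_01 = minmax_scale(data)               # interval [0, 1]
scaled_pm1 = minmax_scale(data, -1.0, 1.0)   # interval [-1, 1]
```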
  • Model establishment Select artificial intelligence algorithm for model establishment. You can choose an appropriate artificial intelligence algorithm, such as neural network, Bayesian, decision tree, linear classifier, cluster analysis and any other artificial intelligence algorithm. Use artificial intelligence algorithms to build models, and through training and testing, finally obtain the optimal model for the target object. You can choose Matlab, Python, Perl, R and other common programming languages or commercial software for model establishment, training, testing and optimization.
  • the above describes the metabolic network modeling method based on cell heterogeneity provided by the present invention. After the model is established through this method, it can be applied in different technical scenarios.
  • the above-mentioned model can be used to predict the physiological response of cells.
  • the prediction method includes: given a set of feature vectors that represent metabolic characteristic data, input them directly into the above-mentioned model for calculation to obtain the corresponding parameters, physiological states and target parameters.
  • the above model can be used for cell design, and the cell design method includes:
  • Functional or non-functional forms can be used for data normalization. Taking the functional form as an example, use the sigmoid function for data normalization; taking the non-functional form as an example, use the mapminmax function in MATLAB for data normalization.
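As a small illustration of the functional form, the logistic sigmoid maps any real value into the open interval (0, 1). The helper name `sigmoid_normalize` is an assumption, and the sketch applies the function to raw values without any prior centering, which the source does not specify.

```python
import numpy as np

def sigmoid_normalize(values):
    """Hypothetical helper: squash values into (0, 1) with the logistic sigmoid."""
    values = np.asarray(values, dtype=float)
    return 1.0 / (1.0 + np.exp(-values))

example = sigmoid_normalize([-2.0, 0.0, 2.0])
```

Note that non-negative inputs (such as raw counts) land in [0.5, 1), so in practice the data may need to be centered first; whether and how to do so is a design choice left open here.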
  • S4.2. Data prediction: use the trained model established in step S3.3 to make predictions on the generated data.
  • Example 1 Yeast data collection based on single-cell technology
  • Yeast data collection based on single-cell technology mainly includes three aspects. That is, collecting transcription, protein, and metabolomics data from single cells. It mainly includes the following aspects:
  • Yeast single-cell transcriptomic data collection Use zymolyase (a cell wall digesting enzyme) to lyse cells before cDNA library construction.
  • the 10x Genomics platform was used to collect data on the cellular contents after reverse transcription; the acquisition results are shown in Figures 1 and 2.
  • Figure 1 is a display of the collected single cell data using t-SNE diagrams after dimensionality reduction;
  • Figure 2 is a statistical histogram of the collected single-cell transcriptomic data after scaling with the Matlab mapminmax function. The transcript response values are distributed between 0 and 0.5.
  • Yeast single cell proteomics data collection: dilute the yeast single cells to 100 cells/μl, spot 0.5 μl of them on a conductive glass slide and take microscopic photos, and further use a mass spectrometer imager to collect data.
  • Yeast single cell metabolomics data collection: dilute the yeast single cells to 100 cells/μl, then spot 0.5 μl of them on a conductive glass slide and take microscopic photos, and further use a mass spectrometer imager to collect data.
  • Example 2 Yeast data processing based on single cell technology
  • Matrix generation: input the raw sequencing data and use STAR to align it to the yeast reference genome to obtain the transcript matrix; input the raw data and use Proteome Discoverer or Mascot for automated protein characterization to obtain the protein matrix; input the raw data and use Compound Discoverer or QI to annotate it automatically to obtain the metabolite matrix. Organize the above data to create dense/sparse data matrices respectively;
  • Batch correction: use open-source batch correction packages such as Harmony or MetNormalizer (the code is freely available on GitHub) to correct between different data collection batches, eliminating the systematic differences between batches;
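To make the idea of batch correction concrete, the sketch below shifts every batch so that its per-feature mean matches the global mean. This is a deliberately simple stand-in chosen for illustration; it is not the Harmony or MetNormalizer algorithm, and the function name `center_batches` is an assumption.

```python
import numpy as np

def center_batches(matrix, batch_labels):
    """Hypothetical stand-in: align per-batch feature means to the global mean.

    matrix: cells x features; batch_labels: one label per cell.
    """
    matrix = np.asarray(matrix, dtype=float)
    batch_labels = np.asarray(batch_labels)
    corrected = matrix.copy()
    global_mean = matrix.mean(axis=0)
    for batch in np.unique(batch_labels):
        rows = batch_labels == batch
        # Remove the additive offset of this batch relative to the global mean.
        corrected[rows] += global_mean - matrix[rows].mean(axis=0)
    return corrected

raw = np.array([[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]])
labels = np.array([0, 0, 1, 1])
corrected = center_batches(raw, labels)
```

Mean centering only removes additive offsets between batches; dedicated packages model batch effects in a more sophisticated way and should be preferred for real data.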
  • Example 3 Method for establishing a high transcription expression model of ethanol synthase based on yeast cell heterogeneity
  • This embodiment consists of three parts: single-cell transcriptomic data collection, target-based deep learning, and optimal metabolic model establishment. Single-cell transcriptomic data collection uses the latest 10x Genomics Chromium™ platform and includes reverse transcription, cDNA library construction, cell counting and sequencing.
  • Target-based deep learning includes the following steps:
  • Target selection Select ethanol synthase as the target transcript.
  • Deep learning training As shown in Figure 3, the optimizable neural network for ethanol synthase Y and other features X is established through the regression learner of MATLAB 2021b.
  • the neural network is built from the inputs Y and X and is trained automatically by the regression learner. The best neural network model is obtained through hyperparameter selection.
  • the regression value R represents the correlation between the predicted output and the target output. The closer R is to 1, the closer the relationship between the predictions and the target data; the closer R is to 0, the more random that relationship.
  • the mean square error (MSE) measures the difference between the predicted values and the target values, denoted (y) and (y_), over n samples.
  • Model testing Verify the final result accuracy of the optimized neural network model by using the reserved 10% data as a test.
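The evaluation described above (a reserved 10% test set, the regression value R, and the MSE) can be sketched as follows. The data are synthetic, and an ordinary least-squares fit stands in for the neural network purely to keep the example self-contained; the helper names, the random seeds, and the use of the Pearson correlation for R are assumptions for illustration.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.1, seed=0):
    """Hypothetical helper: reserve a random fraction of samples for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = max(1, int(round(test_fraction * len(y))))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def regression_r(y_true, y_pred):
    """Pearson correlation between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Synthetic stand-in data: 200 "cells", 5 features X, one target Y.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 5))
true_w = np.array([0.5, -0.2, 0.0, 0.3, 0.1])
y = X @ true_w + 0.01 * rng.normal(size=200)

X_train, y_train, X_test, y_test = holdout_split(X, y)

# Least-squares linear model with an intercept (NOT the patent's neural network).
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([X_test, np.ones(len(X_test))]) @ coef

test_mse = mse(y_test, y_pred)
test_r = regression_r(y_test, y_pred)
```

Any regressor with a predict step can replace the least-squares fit; the split and the two metrics stay the same.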
  • the establishment of an optimal metabolic model includes the following steps:
  • a single data can be a single transcript/protein/metabolite label data.
  • Each piece of label data used for prediction fluctuates within a preset interval while keeping other data values unchanged. Fluctuations can be both uniform and non-uniform.
  • Random prediction data generation Random prediction data can be multi-transcript/protein/metabolite label data, which is used to randomly generate prediction data within a certain data interval.
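The two kinds of prediction data described above can be sketched as follows: a single-label sweep that varies one feature across a preset interval while holding the others fixed, and random multi-label data drawn within an interval. The function names, the uniform spacing of the sweep, and the uniform random distribution are assumptions; the text also allows non-uniform fluctuation.

```python
import numpy as np

def single_feature_sweep(baseline, feature_index, low, high, steps=11):
    """Hypothetical helper: vary one feature uniformly, keep the rest unchanged."""
    baseline = np.asarray(baseline, dtype=float)
    sweep = np.tile(baseline, (steps, 1))
    sweep[:, feature_index] = np.linspace(low, high, steps)
    return sweep

def random_prediction_data(n_samples, n_features, low=0.0, high=1.0, seed=0):
    """Hypothetical helper: draw all features uniformly within [low, high)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(n_samples, n_features))

baseline = np.array([0.2, 0.4, 0.6])
sweep = single_feature_sweep(baseline, feature_index=1, low=0.0, high=1.0, steps=5)
random_data = random_prediction_data(n_samples=100, n_features=3)
```

Each row of `sweep` or `random_data` would then be passed to the trained model to obtain a predicted target value.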
  • Data visualization Use polar coordinate charts and any picture display method that can reflect the objective distribution of metabolites to visualize the prediction data.
  • Target product optimization model generation Confirm the up- and down-regulation ratios of other transcripts at the maximum transcription expression level of ethanol synthase.
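One possible reading of this step is sketched below: among the generated candidate profiles, take the one with the highest predicted target value and express each of its features as a ratio to a baseline profile. The function name `regulation_ratios`, the choice of baseline (for example the mean measured profile), and the ratio definition are assumptions, since the source does not state how the ratios are computed.

```python
import numpy as np

def regulation_ratios(candidates, predictions, baseline, eps=1e-12):
    """Hypothetical helper: feature ratios of the best candidate vs. a baseline.

    Ratios above 1 suggest up-regulation, below 1 down-regulation.
    """
    candidates = np.asarray(candidates, dtype=float)
    best = int(np.argmax(predictions))
    # eps guards against division by zero for features absent in the baseline.
    ratios = candidates[best] / (np.asarray(baseline, dtype=float) + eps)
    return best, ratios

candidates = np.array([[0.2, 0.8],
                       [0.6, 0.1],
                       [0.4, 0.4]])
predictions = np.array([0.3, 0.9, 0.5])   # model outputs for each candidate
baseline = np.array([0.3, 0.2])           # assumed reference profile
best, ratios = regulation_ratios(candidates, predictions, baseline)
```

Here the second candidate gives the highest prediction, so the first feature would be doubled and the second halved relative to the assumed baseline.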
  • Example 4 Method for establishing a high expression model of methanol synthase based on yeast cell heterogeneity
  • This embodiment consists of three parts. That is, single-cell proteomics data collection, target-based deep learning and establishment of optimal metabolic models. Single-cell proteomic data acquisition was performed using MALDI2-TIMSTOF.
  • Target-based deep learning includes the following steps:
  • Target selection Select methanol synthase as the target protein.
  • Deep learning training: as shown in Figure 4, the optimizable neural network for methanol synthase Y and other features X is established through the regression learner of MATLAB 2021b. The neural network is built from the inputs Y and X and is trained automatically by the regression learner. The best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output. The closer R is to 1, the closer the relationship between the predictions and the target data; the closer R is to 0, the more random that relationship.
  • the mean square error (MSE) measures the difference between the predicted values and the target values, denoted (y) and (y_), over n samples.
  • Model testing Verify the final result accuracy of the optimized neural network model by using the reserved 10% data as a test.
  • the establishment of an optimal metabolic model includes the following steps:
  • Multi-transcript/protein/metabolite label data generation Randomly generate prediction data within a certain data interval.
  • Data visualization: use polar coordinate diagrams and any image display method that can reflect the objective distribution of the proteins to visualize the prediction data, showing the expression patterns of the other proteins that have a higher weight/contribution rate/density under high protein expression of methanol synthase.
  • Target product optimization model generation Confirm the up- and down-regulation ratios of other proteins under the maximum protein expression of methanol synthase.
  • Example 5 Method for establishing the most productive metabolic model of propylene glycol based on yeast cell heterogeneity
  • This embodiment consists of three parts. That is, single-cell metabolomics data collection, target-based deep learning and establishment of optimal metabolic models. Single-cell metabolomics data acquisition using MALDI2-TIMSTOF.
  • Target-based deep learning includes the following steps:
  • Target selection: select propylene glycol as the target metabolite.
  • Deep learning training As shown in Figure 5, the optimizable neural network for propylene glycol Y and other features X is established through the regression learner of MATLAB 2021b.
  • the neural network is built from inputs Y and X and is automatically trained by a regression learner. Obtain the best neural network model through hyperparameter selection.
  • the regression value R represents the correlation between the predicted output and the target output. The closer R is to 1, the closer the relationship between the predictions and the target data; the closer R is to 0, the more random that relationship.
  • the mean square error (MSE) measures the difference between the predicted values and the target values, denoted (y) and (y_), over n samples.
  • Model testing Verify the final result accuracy of the optimized neural network model by using the reserved 10% data as a test.
  • the establishment of an optimal metabolic model includes the following steps:
  • Multi-transcript/protein/metabolite label data generation Randomly generate prediction data within a certain data interval.
  • Data visualization Use polar coordinate diagrams and any picture display method that can reflect the objective distribution of metabolites to visualize the prediction data, and provide a metabolic model with higher weight/contribution rate/density that reflects the high abundance of propylene glycol.
  • Target product optimization model generation Confirm the up- and down-regulation ratios of other metabolites under the maximum accumulation of propylene glycol.
  • Example 6 Data collection of Chlamydomonas reinhardtii based on single-cell technology
  • Chlamydomonas reinhardtii data collection based on single-cell technology mainly includes three aspects. That is, collecting transcription, protein, and metabolomics data from single cells. It mainly includes the following aspects:
  • Chlamydomonas reinhardtii single-cell transcriptomic data collection Use the 10x Genomics platform to collect data on the cell contents after reverse transcription; the collection results are shown in Figures 6 and 7.
  • Figure 6 displays the collected single cell data after dimensionality reduction using t-SNE diagrams;
  • Figure 7 is a statistical histogram after data scaling using the Matlab mapminmax function on the collected single cell transcriptomic data. The transcript response value is distributed between 0-0.5.
  • Collection of single-cell metabolomics data of Chlamydomonas reinhardtii: dilute the single cells of Chlamydomonas reinhardtii to 100 cells/microliter, then spot 0.5 microliters on a conductive glass slide and take a microscopic photo, and further use a mass spectrometer imager for data collection.
  • Matrix generation: use a transcriptome library to characterize single-cell transcripts; use Proteome Discoverer or Mascot to characterize the single-cell proteome; use Compound Discoverer or QI to characterize the single-cell metabolome. Organize the above data to create dense/sparse data matrices respectively;
  • Batch calibration Use batch calibration software such as Harmony, MetNormalizer, etc. to perform calibration between different collection batches;
  • Functional analysis Use commercial/non-commercial software such as Seurat for cell screening and functional analysis. Cells with more than 100 features of transcript/protein/metabolite expression were retained, and transcripts/proteins/metabolites shared by more than 1/5 cells were retained.
  • Example 8 Method for establishing a high transcription expression model of glycerol synthase based on cell heterogeneity of Chlamydomonas reinhardtii
  • This embodiment consists of three parts. That is, single-cell transcriptomic data collection, target-based deep learning, and optimal metabolic model establishment. Single-cell transcriptomic data acquisition was performed using 10X genomics standard procedures.
  • Target-based deep learning includes the following steps:
  • Target selection Select glycerol synthase as the target transcript.
  • Model testing Verify the final result accuracy of the optimized neural network model by using the reserved 10% data as a test.
  • the establishment of an optimal metabolic model includes the following steps:
  • Random prediction data generation Randomly generate prediction data within a certain data interval.
  • Data visualization: use polar coordinate diagrams and any picture display method that can reflect the objective distribution of metabolites to visualize the prediction data, showing the expression patterns of the other transcripts that have a higher weight/contribution rate/density under high transcription expression of glycerol dehydrogenase.
  • Target product optimization model generation Establish a data-based target product optimization metabolic model based on the visualization results. The up- and down-regulation ratios of other transcripts under high transcription expression of glycerol synthase were obtained.
  • Example 9 Method for establishing a high expression model of glycerol synthase based on cell heterogeneity of Chlamydomonas reinhardtii
  • This embodiment consists of three parts. That is, single-cell proteomics data collection, target-based deep learning and establishment of optimal metabolic models. Single-cell proteomics data acquisition using MALDI2-TIMSTOF.
  • Target-based deep learning includes the following steps:
  • Target selection Select glycerol synthase as the target protein.
  • Deep learning training As shown in Figure 9, the neural network is automatically trained by inputting Y and X through a regression learner. Obtain the best neural network model through hyperparameter selection.
  • the regression value R represents the correlation between the prediction output and the target output. The closer the R value is to 1, the closer the relationship between the prediction and the output data. The closer the R value is to 0, the greater the randomness of the relationship between the prediction and the output data.
  • the mean square error (MSE) measures the difference between the predicted values and the target values, denoted (y) and (y_), over n samples.
  • Model testing Verify the final result accuracy of the optimized neural network model by using the reserved 10% data as a test.
  • the establishment of an optimal metabolic model includes the following steps:
  • Multi-transcript/protein/metabolite label data generation Randomly generate prediction data within a certain data interval.
  • Data visualization: use polar coordinate diagrams and any picture display method that can reflect the objective distribution of metabolites to visualize the prediction data, showing the distribution patterns of the other proteins that have a higher weight/contribution rate/density under high protein expression of glycerol synthase.
  • Target product optimization model generation Confirm the up- and down-regulation ratios of other proteins under the maximum accumulation of glycerol synthase.
  • Example 10 Method for establishing a metabolic model with the highest yield of triglycerides based on cell heterogeneity of Chlamydomonas reinhardtii
  • This embodiment consists of three parts. That is, single-cell metabolomics data collection, target-based deep learning and establishment of optimal metabolic models. Single-cell metabolomics data acquisition using MALDI2-TIMSTOF.
  • Target-based deep learning includes the following steps:
  • Target selection: select triglycerides as the target metabolite.
  • Deep learning training: as shown in Figure 10, the neural network is trained automatically by a regression learner from the inputs Y and X. The best neural network model is obtained through hyperparameter selection.
  • The regression value R represents the correlation between the predicted output and the target output. The closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship.
  • The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples.
  • Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
  • The establishment of the optimal metabolic model includes the following steps:
  • Single transcript/protein/metabolite label data generation: while keeping all other data values unchanged, vary each piece of data used for prediction within a given interval. The variation can be uniform or non-uniform.
  • Multi-transcript/protein/metabolite label data generation: randomly generate prediction data within a given data interval.
  • Data visualization: use polar plots or any other graphical display that reflects the objective distribution of metabolites to visualize the prediction data, providing the metabolic pattern, with higher weight/contribution rate/density, that reflects a high abundance of triglycerides.
  • Target product optimization model generation: establish a data-based optimal metabolic model for the target product from the visualization results, and obtain the up- and down-regulation ratios of the other metabolites at the maximum accumulation of triglycerides.
  • Example 11 Prediction of triglyceride metabolism levels based on measured data
  • This embodiment consists of three parts: single-cell metabolomics data acquisition, target-based deep learning, and prediction of triglyceride metabolism levels from measured data. Single-cell metabolomics data are acquired using MALDI2-TIMSTOF.
  • Target-based deep learning includes the following steps:
  • Target selection: select triglycerides as the target metabolite.
  • Deep learning training: the neural network is trained automatically by a regression learner from the inputs Y and X. The best neural network model is obtained through hyperparameter selection.
  • The regression value R represents the correlation between the predicted output and the target output. The closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship.
  • The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples.
  • Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
  • The prediction of triglyceride metabolism levels from measured data includes two parts: acquiring the single-cell metabolomics data other than triglycerides, and entering the acquired data as required by the MATLAB regression learner to obtain the triglyceride metabolism level.
  • The invention overcomes the shortcomings of the current synthetic biology technology route based on the design-build-test-learn (DBTL) cycle, in which the amount of test data is small, the internal correlations of complex metabolic networks cannot be learned effectively, and the design step therefore lacks a rational basis.
  • The method provided by the present invention offers a large data volume and high acquisition throughput, and enables data- and AI-based analysis and learning of complex metabolic networks directly, thereby establishing a computable and predictable cell model that can be used for physiological response prediction and rational design of cells.

Abstract

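A heterogeneity-based cell metabolic network modeling method, and its application in predicting cellular physiological responses and in cell design. The method comprises: acquiring single-cell data based on heterogeneity; processing the single-cell data; and constructing a cell metabolic model based on artificial intelligence. The method offers a large data volume and high acquisition throughput, and enables data- and AI-based analysis and learning of complex metabolic networks directly, thereby establishing a computable and predictable cell model that can be used for physiological response prediction and rational design of cells.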

Description

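Heterogeneity-based cell metabolic network modeling method and application thereof
Technical Field
The present invention relates to the field of biology, and in particular to a heterogeneity-based cell metabolic network modeling method and its application.
Background
Synthetic biology is an emerging field of the life sciences that has advanced rapidly in recent years. Unlike traditional biology, which studies the internal structure of living organisms by dissecting them, synthetic biology takes the opposite approach: it builds biological systems step by step from the most basic elements, thereby revealing the internal operating principles of living organisms.
As a scientific discipline, synthetic biology can also redesign organisms for specific purposes and endow them with new capabilities. Synthetic biology researchers and companies around the world are working to solve problems in medicine, manufacturing, and agriculture. At present, the development of applied synthetic biology technology is still largely a trial-and-error process involving many iterations of the design-build-test-learn (DBTL) cycle. The cycle is long and iterates slowly, which severely restricts the application of synthetic biology. An important reason why the DBTL cycle is slow and inefficient is the low efficiency of its learning step: the amount of data available for learning is small, and data acquisition is costly and low-throughput, so the data cannot fully and accurately reflect the characteristics of complex metabolic networks. This also fundamentally limits the application of advanced artificial intelligence methods.
Heterogeneity is a ubiquitous biological phenomenon. A multicellular organism is composed of many cells with different forms and functions, and multiple cell types combine in an orderly manner to form tissues and organs. When disease occurs, abnormal cells are often hidden among normal cells. Likewise, for microorganisms, the microbial cells in a culture (a colony or biofilm) derived from the same ancestral cell are also heterogeneous. This heterogeneous differentiation of cells within a microbial population has multiple causes, including not only epigenetic differences but also population-level genetic differences arising from the exchange of genetic material between cells and from spontaneous mutations within cells. Such heterogeneity gives a microbial population a greater chance of survival when it faces environmental stress, and is an important means by which bacteria adapt to their environment. Notably, cellular heterogeneity can affect overall macroscopic behavior by influencing population stability. For example, heterogeneity affects yield in biomanufacturing and drug resistance in medicine. Heterogeneity is essentially caused by internal differences between cells; it is therefore manifested in different dimensions, such as genes, transcription, proteins, and metabolism, and can be characterized at different levels.
Therefore, those skilled in the art are committed to developing a heterogeneity-based cell metabolic network modeling method and its application, which offers a large data volume and high acquisition throughput and enables data- and AI-based analysis and learning of complex metabolic networks directly, thereby establishing a computable and predictable cell model that can be used for physiological response prediction and rational design of cells.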
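Summary of the Invention
To achieve the above object, the present invention provides a heterogeneity-based cell metabolic network modeling method, comprising the following steps:
acquiring single-cell data based on heterogeneity;
processing the single-cell data;
constructing a cell metabolic model based on artificial intelligence.
Further, when the single-cell data are acquired, they are acquired based on different dimensions of the heterogeneity.
Further, one or more of the single-cell transcriptome, the single-cell proteome, and the single-cell metabolome are selected for acquiring the single-cell data.
Further, the data of the single-cell transcriptome are acquired using a single-cell data acquisition platform.
Further, single cells are lysed using a cell-wall-digesting enzyme, and the single-cell data acquisition platform is then used to acquire data from the reverse-transcribed cell contents.
Further, the data of the single-cell proteome are acquired using a single-cell data acquisition platform or a mass spectrometry imaging device.
Further, acquiring the data of the single-cell proteome using the mass spectrometry imaging device comprises: diluting the single cells, placing them on a conductive glass slide, taking photomicrographs, and acquiring data using the mass spectrometry imaging device.
Further, the data of the single-cell metabolome are acquired using a single-cell data acquisition platform or a mass spectrometry imaging device.
Further, the step of acquiring the data of the single-cell proteome using the mass spectrometry imaging device comprises: diluting the single cells, placing them on a conductive glass slide, taking photomicrographs, and acquiring data using the mass spectrometry imaging device.
Further, the step of processing the single-cell data comprises:
generating a dense/sparse data matrix of the single-cell data;
performing batch correction on the dense/sparse data matrix;
performing cell population analysis and indicator-of-interest screening on the corrected dense/sparse data matrix.
Further, during the cell population analysis and indicator-of-interest screening, single cells that meet preset conditions are retained.
Further, the preset conditions include: cells expressing more than 100 transcript/protein/metabolite features, and/or transcripts/proteins/metabolites shared by more than 1/5 of the cells.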
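Further, the step of constructing a cell metabolic model based on artificial intelligence comprises:
selecting the target to be predicted;
performing normalization;
selecting an artificial intelligence algorithm to build a model for the target.
Further, the step of building a model for the target comprises:
building a first model using the artificial intelligence algorithm and performing deep learning training on the first model;
establishing an optimal metabolic model.
Further, after the deep learning training, the first model is tested using reserved single-cell data.
Further, the step of establishing the optimal metabolic model comprises:
generating single prediction data;
generating random prediction data;
visualizing the single prediction data and the random prediction data;
establishing the optimal metabolic model of the target from the results of the visualization.
Further, when the single prediction data are generated, each piece of data used for prediction is varied within a preset interval while all other data values are kept unchanged.
Further, the single prediction data and the random prediction data are visualized using polar coordinates and images that reflect the objective distribution of metabolites.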
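The cell metabolic network modeling method provided by the present invention can be applied to the prediction of cellular physiological responses, comprising the following steps:
providing a feature vector composed of data representing metabolic features;
inputting the feature vector into a cell metabolic model built with the above cell metabolic network modeling method.
The cell metabolic network modeling method provided by the present invention can also be applied to cell design, comprising the following steps:
generating data and normalizing them;
predicting on the generated data using a cell metabolic model built with the above cell metabolic network modeling method;
obtaining the optimal metabolic model.
The present invention overcomes the shortcomings of the current synthetic biology technology route based on the design-build-test-learn (DBTL) cycle, in which the amount of test data is small, the internal correlations of complex metabolic networks cannot be learned effectively, and the design step therefore lacks a rational basis. The method provided by the present invention offers a large data volume and high acquisition throughput, and enables data- and AI-based analysis and learning of complex metabolic networks directly, thereby establishing a computable and predictable cell model that can be used for physiological response prediction and rational design of cells.
The concept, specific structure, and technical effects of the present invention are further described below with reference to the accompanying drawings, so that its objects, features, and effects can be fully understood.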
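Brief Description of the Drawings
Figure 1 is a Uniform Manifold Approximation and Projection (UMAP) plot of yeast single-cell transcriptomics data;
Figure 2 is a schematic diagram of the data distribution of yeast single-cell transcriptome data;
Figure 3 is a schematic diagram of deep learning model training for high transcriptional expression of ethanol synthase, based on yeast single-cell transcriptome data;
Figure 4 is a schematic diagram of deep learning model training for high protein expression of methanol synthase, based on yeast single-cell transcriptome data;
Figure 5 is a schematic diagram of deep learning model training for high production of propanediol, based on yeast single-cell transcriptome data;
Figure 6 is a Uniform Manifold Approximation and Projection (UMAP) plot of Chlamydomonas reinhardtii single-cell transcriptomics data;
Figure 7 is a schematic diagram of the data distribution of Chlamydomonas reinhardtii single-cell transcriptome data;
Figure 8 is a schematic diagram of deep learning model training for high transcriptional expression of glycerol synthase, based on Chlamydomonas reinhardtii single-cell transcriptome data;
Figure 9 is a schematic diagram of deep learning model training for high protein expression of glycerol synthase, based on Chlamydomonas reinhardtii single-cell transcriptome data;
Figure 10 is a schematic diagram of deep learning model training for high production of triglycerides, based on Chlamydomonas reinhardtii single-cell transcriptome data;
Figure 11 is a flow chart of the heterogeneity-based cell metabolic network modeling method of the present invention.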
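Detailed Description
Several preferred embodiments of the present invention are described below with reference to the accompanying drawings, to make the technical content clearer and easier to understand. The present invention can be embodied in many different forms, and its scope of protection is not limited to the embodiments mentioned herein.
In the drawings, components with the same structure are denoted by the same reference numerals, and components with similar structures or functions are denoted by similar reference numerals. The size and thickness of each component shown in the drawings are arbitrary, and the present invention does not limit them. To make the illustrations clearer, the thickness of components is exaggerated in some places.
Synthetic biology involves many iterations of the design-build-test-learn (DBTL) cycle. However, because the amount of data available for learning is small and data acquisition is costly and low-throughput, the data cannot fully and accurately reflect the characteristics of complex metabolic networks, which limits the application of artificial intelligence methods. If large amounts of single-cell data could be acquired as learning data by low-cost, high-throughput methods, artificial intelligence methods such as machine learning could be applied to synthetic biology.
Heterogeneity, as a ubiquitous biological phenomenon, is essentially caused by internal differences between cells; it is manifested in different dimensions such as genes, transcription, proteins, and metabolism, and can be characterized at different levels. Given the microscopic scale of cells, any macroscopic biological system, such as a colony, a tissue, or a culture, contains a vast number of heterogeneous single cells. For example, a typical bacterial colony contains on the order of one billion microbial cells. Using single-cell technology to collect information from heterogeneous cells can therefore yield information on a massive number of single cells, i.e. single-cell big data, and these data implicitly contain the response mechanisms of the metabolic network. Such heterogeneity-based single-cell data acquisition can provide big data suitable for machine learning. On this basis, modern artificial intelligence methods can be introduced to establish cell metabolic models, which could fundamentally change the DBTL cycle and drive revolutionary progress in synthetic biology. Accordingly, the present invention provides a heterogeneity-based cell metabolic network modeling method that exploits the features of cellular heterogeneity in various dimensions to acquire massive single-cell data, processes the single-cell data, and then uses artificial intelligence algorithms to construct a cell metabolic model.
The heterogeneity-based cell metabolic network modeling method provided by the present invention comprises the following steps:
S1. Single-cell data acquisition. Single-cell data are acquired according to how cellular heterogeneity is manifested in different dimensions. For example, heterogeneity is manifested in dimensions such as genes, transcription, proteins, and metabolism, and one or several of these dimensions can be selected for data acquisition.
S2. Single-cell data processing. The data acquired in step S1 are processed. Different kinds of single-cell data are processed differently: the corresponding data matrix is extracted and then corrected, and cell screening and functional analysis are performed to determine which data are finally retained.
S3. Construction of an artificial-intelligence-based cell metabolic model. An artificial intelligence algorithm is adopted and a target is selected; the acquired data are used for deep learning training and for building the algorithmic model, which is then tested, and finally an optimal model for the target is established.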
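In some embodiments, step S1 comprises:
S1.1 Single-cell transcriptome data acquisition: a commercial or non-commercial single-cell data acquisition platform can be used, including but not limited to 10X genomics, BD Rhapsody, Fluidigm C1, and Bio-Rad. Single-cell transcriptome techniques known in the prior art can also be used, such as Smart-seq, CEL-Seq, Quartz-Seq, Drop-seq, InDrop-seq, and Smart-seq2.
S1.2 Single-cell proteome data acquisition: a commercial or non-commercial single-cell data acquisition platform can be used to acquire single-cell proteome data, or a mass spectrometry imager can be used.
S1.3 Single-cell metabolome data acquisition: a commercial or non-commercial single-cell data acquisition platform can be used to acquire single-cell metabolome data, or a mass spectrometry imager can be used.
It should be understood that, depending on which dimensions of cellular heterogeneity are selected, steps S1.1-S1.3 can be selectively omitted as required, and data acquisition steps for other dimensions can be added.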
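In some embodiments, step S2 comprises:
S2.1 Matrix generation: extract matrices from the single-cell data. For example, build libraries for the transcriptomics data, identify the single-cell transcripts, and extract the matrix with the Seurat package; identify the single-cell proteome and extract the matrix with the Seurat package; identify the single-cell metabolome and extract the matrix with SCiLS Lab software. The above data are organized separately to build dense/sparse data matrices.
S2.2 Batch correction: for single-cell transcriptome and proteome data, the Seurat and Harmony packages are used for batch correction of the single-cell matrices; for single-cell metabolome data, the MetNormalizer package is used. Correction removes differences introduced between data acquisition batches, i.e. it avoids batch effects.
S2.3 Functional analysis: perform cell population analysis and indicator-of-interest screening on the preprocessed matrices. For cell population analysis, taking single-cell transcriptomics data as an example, transcripts that are specifically expressed in certain cell populations are used as indicators of interest to distinguish those populations from other cells. Cells that meet preset conditions are retained; the preset conditions can be set as required, for example retaining cells that express more than 100 transcript/protein/metabolite features and retaining transcripts/proteins/metabolites shared by more than 1/5 of the cells.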
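The retention rule in S2.3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `filter_matrix`, the rows-are-cells layout, the "value greater than 0 means detected" criterion, and the order of the two filters (cells first, then features) are not specified in the source, which performs this step with Seurat or similar software.

```python
def filter_matrix(matrix, min_features=100, min_cell_fraction=0.2):
    """Filter a dense cell-by-feature matrix (hypothetical helper).

    matrix: list of rows, one row per cell, one column per
    transcript/protein/metabolite; a value > 0 counts as "detected".
    Returns the filtered matrix and the indices of the kept columns.
    """
    # Keep cells in which more than `min_features` features are detected.
    kept_cells = [
        row for row in matrix
        if sum(1 for v in row if v > 0) > min_features
    ]
    if not kept_cells:
        return [], []

    n_cells = len(kept_cells)
    n_features = len(kept_cells[0])

    # Keep features detected in more than `min_cell_fraction` of kept cells.
    kept_cols = [
        j for j in range(n_features)
        if sum(1 for row in kept_cells if row[j] > 0) > min_cell_fraction * n_cells
    ]
    filtered = [[row[j] for j in kept_cols] for row in kept_cells]
    return filtered, kept_cols
```

With the defaults, `min_features=100` and `min_cell_fraction=0.2` correspond to the "more than 100 features" and "more than 1/5 of the cells" thresholds given in the text.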
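In some embodiments, step S3 comprises:
S3.1 Target selection: one target, one class of targets, multiple targets, or multiple classes of targets can be selected for prediction, for example one, one class, several, or several classes of transcripts/proteins/metabolites. When transcripts/proteins/metabolites are selected for prediction, the transcript/protein/metabolite matrix data should be normalized; where the matrix data are summed, they should be summed before normalization. For example, when a given metabolite is the prediction target, the matrix values of all metabolites other than that metabolite should be summed and then normalized.
S3.2 Normalization method: select min-max normalization or formula-based normalization. The normalization interval may be (-1, 1), (0, 1), or any other interval that scales the data reasonably.
S3.3 Model building: select an artificial intelligence algorithm to build the model. Any suitable artificial intelligence algorithm can be chosen, such as a neural network, a Bayesian method, a decision tree, a linear classifier, or cluster analysis. The model is built with the chosen algorithm and, through training and testing, an optimal model for the target is finally obtained. Common programming languages such as Matlab, Python, Perl, and R, or commercial software, can be used for model building, training, testing, and optimization.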
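The min-max normalization named in S3.2 can be sketched as below. The function name `min_max_scale` and the handling of a constant input (mapped to the midpoint of the interval) are assumptions made for this illustration; the source only states that the data are scaled to an interval such as (0, 1) or (-1, 1).

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale `values` so that min -> low and max -> high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Assumption: a constant feature carries no information,
        # so map it to the midpoint of the target interval.
        return [(low + high) / 2 for _ in values]
    scale = (high - low) / (vmax - vmin)
    return [low + (v - vmin) * scale for v in values]
```

For example, `min_max_scale([2, 4, 6])` gives `[0.0, 0.5, 1.0]`, and passing `low=-1, high=1` rescales the same values to `[-1.0, 0.0, 1.0]`.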
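The above describes the cell-heterogeneity-based metabolic network modeling method provided by the present invention. Once a model has been built with this method, it can be applied in different technical scenarios.
In some embodiments, the model can be used to predict cellular physiological responses. The prediction method comprises: providing a feature vector composed of data representing metabolic features and inputting it directly into the model for computation, to obtain the corresponding parameters, physiological state, and corresponding target parameters.
In some embodiments, the model can be used for cell design. The cell design method comprises:
S4.1 Data generation: use Python, MATLAB, or Excel to generate matrix data corresponding to transcripts/proteins/metabolites. The data can be normalized in functional or non-functional form. As an example of the functional form, a sigmoid function is used for normalization; as an example of the non-functional form, the mapminmax function in MATLAB is used.
S4.2 Data prediction: use the model built in step S3.3 to predict on the generated data; an already trained model can be used for this prediction.
S4.3 Obtaining the optimal metabolic model: select the data whose predicted values rank highest as candidates, compute the distance between each candidate feature vector and a reference vector, and select the candidate with the shorter distance as the optimal metabolic pattern. For example, a polar plot or any similar form of data presentation can be used to visualize the predictions for the generated data, and any existing distance measure can be used to compute the distance between the feature vector and the reference vector.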
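Steps S4.1 and S4.3 can be illustrated with a short standard-library sketch. The names `sigmoid`, `euclidean`, and `select_optimal`, the choice of Euclidean distance, and the `top_k` parameter are assumptions; the source allows any distance measure and does not say how many top-ranked candidates are kept or how the reference vector is chosen.

```python
import math


def sigmoid(x):
    """Functional-form normalization mentioned in S4.1: maps x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_optimal(candidates, predictions, reference, top_k=3):
    """Return the index of the chosen candidate (hypothetical helper).

    1. Rank candidates by predicted target value, highest first.
    2. Among the top_k, pick the one closest to the reference vector.
    """
    ranked = sorted(
        range(len(candidates)), key=lambda i: predictions[i], reverse=True
    )[:top_k]
    return min(ranked, key=lambda i: euclidean(candidates[i], reference))
```

In use, `candidates` would be the generated feature vectors, `predictions` the trained model's outputs for them, and `reference` a measured profile such as the population mean (again an assumption, since the source leaves the reference vector undefined).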
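The implementation of the present invention and the technical effects achieved are further described below through several examples.
Example 1 Yeast data acquisition based on single-cell technology
Yeast data acquisition based on single-cell technology covers three aspects, namely transcriptomics, proteomics, and metabolomics data acquisition from single cells, as follows:
1. Yeast single-cell transcriptomics data acquisition: zymolyase (a cell-wall-digesting enzyme) is used to lyse cells before cDNA library construction. The 10x Genomics platform is used to acquire data from the reverse-transcribed cell contents. The results are shown in Figures 1 and 2: Figure 1 shows the acquired single-cell data after dimensionality reduction, displayed as a t-SNE plot; Figure 2 is a histogram of the acquired single-cell transcriptomics data after scaling with the Matlab mapminmax function, showing the distribution of transcript response values between 0 and 0.5.
2. Yeast single-cell proteomics data acquisition: yeast single cells are diluted to 100 cells/µL, 0.5 µL is spotted onto a conductive glass slide and photomicrographed, and data are then acquired using a mass spectrometry imager.
3. Yeast single-cell metabolomics data acquisition: yeast single cells are diluted to 100 cells/µL, 0.5 µL is spotted onto a conductive glass slide and photomicrographed, and data are then acquired using a mass spectrometry imager.
Example 2 Yeast data processing based on single-cell technology
After yeast data acquisition based on single-cell technology, the different kinds of single-cell data should be processed as follows:
1. Matrix generation: input the raw sequencing data and use STAR to align them to the yeast reference genome to obtain the transcript matrix; input the raw data and use Protein discover or Mascot for automated protein identification to obtain the protein matrix; input the raw data and use Compound discover or QI for automated annotation to obtain the metabolite matrix. The above data are organized separately to build dense/sparse data matrices.
2. Batch correction: use open-source batch correction packages such as Harmony and MetNormalizer, following their code (freely available from the Github website), to correct between data acquisition batches. Batch correction eliminates the internal differences between data acquisition batches.
3. Functional analysis: following the software manuals, use commercial or non-commercial software such as Seurat for cell screening and functional analysis. Cells expressing more than 100 transcript/protein/metabolite features are retained, as are transcripts/proteins/metabolites shared by more than 1/5 of the cells.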
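Example 3 Method for establishing a model of high transcriptional expression of ethanol synthase based on yeast cell heterogeneity
This example consists of three parts: single-cell transcriptomics data acquisition, target-based deep learning, and establishment of the optimal metabolic model. Single-cell transcriptomics data acquisition uses the latest 10X genomics Chromium™ and includes reverse transcription with cDNA library construction, cell counting, and loading onto the instrument.
Target-based deep learning includes the following steps:
1. Target selection: select ethanol synthase as the target transcript.
2. Deep learning training: as shown in Figure 3, the regression learner in MATLAB 2021b is used to build an optimizable neural network with ethanol synthase as Y and the other features as X. The neural network is trained automatically by the regression learner from the inputs Y and X, and the best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output: the closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship. The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples. During training, all parameters of the neural network are adjusted continually so that the loss function keeps decreasing, yielding a neural network model with higher accuracy. The training results give R = 0.8591 and MSE = 0.00078563 for the model.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.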
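The two evaluation quantities and the 10% hold-out described above can be sketched as follows. This is not the MATLAB Regression Learner itself; the function names, the random shuffle with a fixed seed, and the reading of R as the Pearson correlation between target and predicted values are assumptions made for illustration.

```python
import math
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle rows and reserve `test_fraction` of them for testing."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = max(1, round(len(rows) * test_fraction))
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test


def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y_true_i - y_pred_i)^2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation between target and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (sd_t * sd_p)
```

A trained model would be evaluated by calling `mse` and `pearson_r` on the reserved test rows only, so that the reported accuracy is not inflated by data seen during training.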
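Establishment of the optimal metabolic model includes the following steps:
1. Single data generation: the single data can be single transcript/protein/metabolite label data. While keeping all other data values unchanged, each piece of label data used for prediction is varied within a preset interval. The variation can be uniform or non-uniform.
2. Random prediction data generation: the random prediction data can be multi-transcript/protein/metabolite label data, generated randomly within a given data interval for use in prediction.
3. Data visualization: use polar plots or any other graphical display that reflects the objective distribution of metabolites to visualize the prediction data.
4. Target product optimization model generation: confirm the up- and down-regulation ratios of the other transcripts at the maximum transcriptional expression of ethanol synthase.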
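The two data-generation steps above can be sketched in plain Python. The function names, the uniform grid used for the single-feature scan, and the uniform random distribution are assumptions; the source only requires that one value is varied within an interval while the others stay fixed, and that the multi-label data are random within an interval.

```python
import random


def single_feature_scan(baseline, low=0.0, high=1.0, steps=5):
    """Vary one feature at a time on a uniform grid over [low, high].

    All other features keep their value from `baseline`.
    Returns len(baseline) * steps rows.
    """
    rows = []
    for j in range(len(baseline)):
        for k in range(steps):
            row = list(baseline)
            row[j] = low + (high - low) * k / (steps - 1)
            rows.append(row)
    return rows


def random_profiles(n_rows, n_features, low=0.0, high=1.0, seed=0):
    """Draw every feature independently and uniformly from [low, high]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(low, high) for _ in range(n_features)]
        for _ in range(n_rows)
    ]
```

Both outputs are lists of normalized feature vectors that can be passed to the trained model for prediction before visualization.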
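Example 4 Method for establishing a model of high expression of methanol synthase based on yeast cell heterogeneity
This example consists of three parts: single-cell proteomics data acquisition, target-based deep learning, and establishment of the optimal metabolic model. Single-cell proteomics data are acquired using MALDI2-TIMSTOF.
Target-based deep learning includes the following steps:
1. Target selection: select methanol synthase as the target protein.
2. Deep learning training: as shown in Figure 4, the regression learner in MATLAB 2021b is used to build an optimizable neural network with methanol synthase as Y and the other features as X. The neural network is trained automatically by the regression learner from the inputs Y and X, and the best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output: the closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship. The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples. During training, all parameters of the neural network are adjusted continually so that the loss function keeps decreasing, yielding a neural network model with higher accuracy. The training results give R = 0.8668 and MSE = 0.00075214 for the model.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
Establishment of the optimal metabolic model includes the following steps:
1. Single transcript/protein/metabolite label data generation: while keeping all other data values unchanged, each piece of data used for prediction is varied within a given interval. The variation can be uniform or non-uniform.
2. Multi-transcript/protein/metabolite label data generation: prediction data are generated randomly within a given data interval.
3. Data visualization: use polar plots or any other graphical display that reflects the objective distribution of the various proteins to visualize the prediction data, providing the expression pattern, with higher weight/contribution rate/density, of the other proteins that corresponds to high protein expression of methanol synthase.
4. Target product optimization model generation: confirm the up- and down-regulation ratios of the other proteins at the maximum protein expression of methanol synthase.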
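The polar-plot visualization in step 3 amounts to giving each feature its own angle and using its (normalized) value as the radius. A minimal sketch of that coordinate mapping is shown below; the function names and the equal angular spacing are assumptions, and an actual figure would be drawn with a plotting tool of choice.

```python
import math


def polar_points(profile):
    """Map a feature vector to (theta, r) pairs.

    Feature j of n is placed at angle 2*pi*j/n; its value is the radius.
    """
    n = len(profile)
    return [(2 * math.pi * j / n, value) for j, value in enumerate(profile)]


def to_cartesian(points):
    """Convert (theta, r) pairs to (x, y) for plotting on ordinary axes."""
    return [(r * math.cos(theta), r * math.sin(theta)) for theta, r in points]
```

Overlaying the points of many high-scoring predicted profiles in this way makes it easy to see which features sit at consistently large or small radii, which is one reading of the "weight/contribution rate/density" pattern the text refers to.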
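Example 5 Method for establishing a metabolic model for the highest yield of propanediol based on yeast cell heterogeneity
This example consists of three parts: single-cell metabolomics data acquisition, target-based deep learning, and establishment of the optimal metabolic model. Single-cell metabolomics data are acquired using MALDI2-TIMSTOF.
Target-based deep learning includes the following steps:
1. Target selection: select propanediol as the target metabolite.
2. Deep learning training: as shown in Figure 5, the regression learner in MATLAB 2021b is used to build an optimizable neural network with propanediol as Y and the other features as X. The neural network is trained automatically by the regression learner from the inputs Y and X, and the best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output: the closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship. The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples. During training, all parameters of the neural network are adjusted continually so that the loss function keeps decreasing, yielding a neural network model with higher accuracy. The training results give R = 0.8592 and MSE = 0.00078902.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
Establishment of the optimal metabolic model includes the following steps:
1. Single transcript/protein/metabolite label data generation: while keeping all other data values unchanged, each piece of data used for prediction is varied within a given interval. The variation can be uniform or non-uniform.
2. Multi-transcript/protein/metabolite label data generation: prediction data are generated randomly within a given data interval.
3. Data visualization: use polar plots or any other graphical display that reflects the objective distribution of metabolites to visualize the prediction data, providing the metabolic pattern, with higher weight/contribution rate/density, that reflects a high abundance of propanediol.
4. Target product optimization model generation: confirm the up- and down-regulation ratios of the other metabolites at the maximum accumulation of propanediol.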
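The source does not say how the up- and down-regulation ratios in step 4 are computed. One plausible reading, sketched here purely as an assumption, is a per-feature log2 fold change of the selected optimal profile relative to a baseline profile such as the population mean; the names `regulation_ratios` and `eps` are hypothetical.

```python
import math


def regulation_ratios(optimal, baseline, eps=1e-9):
    """Per-feature log2 fold change of `optimal` relative to `baseline`.

    Positive values mean up-regulation, negative values down-regulation.
    `eps` avoids division by zero for features that are absent.
    """
    return [
        math.log2((o + eps) / (b + eps))
        for o, b in zip(optimal, baseline)
    ]
```

For instance, an optimal value of 0.4 against a baseline of 0.2 yields approximately +1 (a two-fold increase), while 0.1 against 0.2 yields approximately -1.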
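Example 6 Chlamydomonas reinhardtii data acquisition based on single-cell technology
Chlamydomonas reinhardtii data acquisition based on single-cell technology covers three aspects, namely transcriptomics, proteomics, and metabolomics data acquisition from single cells, as follows:
1. Chlamydomonas reinhardtii single-cell transcriptomics data acquisition: the 10x Genomics platform is used to acquire data from the reverse-transcribed cell contents. The results are shown in Figures 6 and 7: Figure 6 shows the acquired single-cell data after dimensionality reduction, displayed as a t-SNE plot; Figure 7 is a histogram of the acquired single-cell transcriptomics data after scaling with the Matlab mapminmax function, showing the distribution of transcript response values between 0 and 0.5.
2. Chlamydomonas reinhardtii single-cell proteomics data acquisition: Chlamydomonas reinhardtii single cells are diluted to 100 cells/µL, 0.5 µL is spotted onto a conductive glass slide and photomicrographed, and data are then acquired using a mass spectrometry imager.
3. Chlamydomonas reinhardtii single-cell metabolomics data acquisition: Chlamydomonas reinhardtii single cells are diluted to 100 cells/µL, 0.5 µL is spotted onto a conductive glass slide and photomicrographed, and data are then acquired using a mass spectrometry imager.
Example 7 Chlamydomonas reinhardtii data processing based on single-cell technology
After Chlamydomonas reinhardtii data acquisition based on single-cell technology, the different kinds of single-cell data should be processed as follows:
1. Matrix generation: identify single-cell transcripts after transcriptome library construction; use Protein discover or Mascot to identify the single-cell proteome; use Compound discover or QI to identify the single-cell metabolome. The above data are organized separately to build dense/sparse data matrices.
2. Batch correction: use batch correction software such as Harmony and MetNormalizer to correct between acquisition batches.
3. Functional analysis: use commercial or non-commercial software such as Seurat for cell screening and functional analysis. Cells expressing more than 100 transcript/protein/metabolite features are retained, as are transcripts/proteins/metabolites shared by more than 1/5 of the cells.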
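Example 8 Method for establishing a model of high transcriptional expression of glycerol synthase based on Chlamydomonas reinhardtii cell heterogeneity
This example consists of three parts: single-cell transcriptomics data acquisition, target-based deep learning, and establishment of the optimal metabolic model. Single-cell transcriptomics data are acquired following the standard 10X genomics procedure.
Target-based deep learning includes the following steps:
1. Target selection: select glycerol synthase as the target transcript.
2. Deep learning training: as shown in Figure 8, the regression learner in MATLAB 2021b is used to build an optimizable neural network with glycerol synthase as Y and the other features as X. The best neural network model is obtained through hyperparameter selection, with R = 0.8352 and MSE = 0.00090754.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
Establishment of the optimal metabolic model includes the following steps:
1. Single prediction data generation: while keeping all other data values unchanged, each piece of data used for prediction is varied within a given interval. The variation can be uniform or non-uniform.
2. Random prediction data generation: prediction data are generated randomly within a given data interval.
3. Data visualization: use polar plots or any other graphical display that reflects the objective distribution of metabolites to visualize the prediction data, providing the expression pattern, with higher weight/contribution rate/density, of the other transcripts that corresponds to high transcriptional expression of glycerol synthase.
4. Target product optimization model generation: establish a data-based optimal metabolic model for the target product from the visualization results, and obtain the up- and down-regulation ratios of the other transcripts at high transcriptional expression of glycerol synthase.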
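Example 9 Method for establishing a model of high expression of glycerol synthase based on Chlamydomonas reinhardtii cell heterogeneity
This example consists of three parts: single-cell proteomics data acquisition, target-based deep learning, and establishment of the optimal metabolic model. Single-cell proteomics data are acquired using MALDI2-TIMSTOF.
Target-based deep learning includes the following steps:
1. Target selection: select glycerol synthase as the target protein.
2. Deep learning training: as shown in Figure 9, the neural network is trained automatically by the regression learner from the inputs Y and X, and the best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output: the closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship. The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples. During training, all parameters of the neural network are adjusted continually so that the loss function keeps decreasing, yielding a neural network model with higher accuracy. The training results give R = 0.8589 and MSE = 0.00078724.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
Establishment of the optimal metabolic model includes the following steps:
1. Single transcript/protein/metabolite label data generation: while keeping all other data values unchanged, each piece of data used for prediction is varied within a given interval. The variation can be uniform or non-uniform.
2. Multi-transcript/protein/metabolite label data generation: prediction data are generated randomly within a given data interval.
3. Data visualization: use polar plots or any other graphical display that reflects the objective distribution of metabolites to visualize the prediction data, providing the distribution pattern, with higher weight/contribution rate/density, of the other proteins that corresponds to high protein expression of glycerol synthase.
4. Target product optimization model generation: confirm the up- and down-regulation ratios of the other proteins at the maximum accumulation of glycerol synthase.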
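Example 10 Method for establishing a metabolic model for the highest yield of triglycerides based on Chlamydomonas reinhardtii cell heterogeneity
This example consists of three parts: single-cell metabolomics data acquisition, target-based deep learning, and establishment of the optimal metabolic model. Single-cell metabolomics data are acquired using MALDI2-TIMSTOF.
Target-based deep learning includes the following steps:
1. Target selection: select triglycerides as the target metabolite.
2. Deep learning training: as shown in Figure 10, the neural network is trained automatically by the regression learner from the inputs Y and X, and the best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output: the closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship. The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples. During training, all parameters of the neural network are adjusted continually so that the loss function keeps decreasing, yielding a neural network model with higher accuracy. The training results give R = 0.8664 and MSE = 0.00076168.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
Establishment of the optimal metabolic model includes the following steps:
1. Single transcript/protein/metabolite label data generation: while keeping all other data values unchanged, each piece of data used for prediction is varied within a given interval. The variation can be uniform or non-uniform.
2. Multi-transcript/protein/metabolite label data generation: prediction data are generated randomly within a given data interval.
3. Data visualization: use polar plots or any other graphical display that reflects the objective distribution of metabolites to visualize the prediction data, providing the metabolic pattern, with higher weight/contribution rate/density, that reflects a high abundance of triglycerides.
4. Target product optimization model generation: establish a data-based optimal metabolic model for the target product from the visualization results, and obtain the up- and down-regulation ratios of the other metabolites at the maximum accumulation of triglycerides.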
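Example 11 Prediction of triglyceride metabolism levels from measured data
This example consists of three parts: single-cell metabolomics data acquisition, target-based deep learning, and prediction of triglyceride metabolism levels from measured data. Single-cell metabolomics data are acquired using MALDI2-TIMSTOF.
Target-based deep learning includes the following steps:
1. Target selection: select triglycerides as the target metabolite.
2. Deep learning training: the neural network is trained automatically by the regression learner from the inputs Y and X, and the best neural network model is obtained through hyperparameter selection. The regression value R represents the correlation between the predicted output and the target output: the closer R is to 1, the closer the relationship between the predicted and target data; the closer R is to 0, the more random that relationship. The mean squared error (MSE) represents the gap between the predicted values (y) and the target values (y_) over n samples. During training, all parameters of the neural network are adjusted continually so that the loss function keeps decreasing, yielding a neural network model with higher accuracy. The training results give R = 0.8664 and MSE = 0.00076168.
3. Model testing: the reserved 10% of the data is used as a test set to verify the accuracy of the final optimizable neural network model.
Prediction of triglyceride metabolism levels from measured data includes two parts:
1. Acquire the single-cell metabolomics data other than triglycerides.
2. Enter the acquired data as required by the Matlab regression learner to obtain the triglyceride metabolism level.
The present invention overcomes the shortcomings of the current synthetic biology technology route based on the design-build-test-learn (DBTL) cycle, in which the amount of test data is small, the internal correlations of complex metabolic networks cannot be learned effectively, and the design step therefore lacks a rational basis. The method provided by the present invention offers a large data volume and high acquisition throughput, and enables data- and AI-based analysis and learning of complex metabolic networks directly, thereby establishing a computable and predictable cell model that can be used for physiological response prediction and rational design of cells.
The preferred specific embodiments of the present invention are described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and changes according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning, or limited experimentation in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.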

Claims (20)

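  1. A heterogeneity-based cell metabolic network modeling method, characterized by comprising the following steps:
    acquiring single-cell data based on heterogeneity;
    processing the single-cell data;
    constructing a cell metabolic model based on artificial intelligence.
  2. The cell metabolic network modeling method according to claim 1, characterized in that, when the single-cell data are acquired, they are acquired based on different dimensions of the heterogeneity.
  3. The cell metabolic network modeling method according to claim 2, characterized in that one or more of the single-cell transcriptome, the single-cell proteome, and the single-cell metabolome are selected for acquiring the single-cell data.
  4. The cell metabolic network modeling method according to claim 3, characterized in that the data of the single-cell transcriptome are acquired using a single-cell data acquisition platform.
  5. The cell metabolic network modeling method according to claim 4, characterized in that single cells are lysed using a cell-wall-digesting enzyme, and the single-cell data acquisition platform is then used to acquire data from the reverse-transcribed cell contents.
  6. The cell metabolic network modeling method according to claim 3, characterized in that the data of the single-cell proteome are acquired using a single-cell data acquisition platform or a mass spectrometry imaging device.
  7. The cell metabolic network modeling method according to claim 6, characterized in that acquiring the data of the single-cell proteome using the mass spectrometry imaging device comprises: diluting the single cells, placing them on a conductive glass slide, taking photomicrographs, and acquiring data using the mass spectrometry imaging device.
  8. The cell metabolic network modeling method according to claim 3, characterized in that the data of the single-cell metabolome are acquired using a single-cell data acquisition platform or a mass spectrometry imaging device.
  9. The cell metabolic network modeling method according to claim 8, characterized in that the step of acquiring the data of the single-cell proteome using the mass spectrometry imaging device comprises: diluting the single cells, placing them on a conductive glass slide, taking photomicrographs, and acquiring data using the mass spectrometry imaging device.
  10. The cell metabolic network modeling method according to claim 1, characterized in that the step of processing the single-cell data comprises:
    generating a dense/sparse data matrix of the single-cell data;
    performing batch correction on the dense/sparse data matrix;
    performing cell population analysis and indicator-of-interest screening on the corrected dense/sparse data matrix.
  11. The cell metabolic network modeling method according to claim 10, characterized in that, during the cell population analysis and indicator-of-interest screening, single cells that meet preset conditions are retained.
  12. The cell metabolic network modeling method according to claim 11, characterized in that the preset conditions include: cells expressing more than 100 transcript/protein/metabolite features, and/or transcripts/proteins/metabolites shared by more than 1/5 of the cells.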
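  13. The cell metabolic network modeling method according to claim 1, characterized in that the step of constructing a cell metabolic model based on artificial intelligence comprises:
    selecting the target to be predicted;
    performing normalization;
    selecting an artificial intelligence algorithm to build a model for the target.
  14. The cell metabolic network modeling method according to claim 13, characterized in that the step of building a model for the target comprises:
    building a first model using the artificial intelligence algorithm and performing deep learning training on the first model;
    establishing an optimal metabolic model.
  15. The cell metabolic network modeling method according to claim 14, characterized in that, after the deep learning training, the first model is tested using reserved single-cell data.
  16. The cell metabolic network modeling method according to claim 14, characterized in that the step of establishing the optimal metabolic model comprises:
    generating single prediction data;
    generating random prediction data;
    visualizing the single prediction data and the random prediction data;
    establishing the optimal metabolic model of the target from the results of the visualization.
  17. The cell metabolic network modeling method according to claim 16, characterized in that, when the single prediction data are generated, each piece of data used for prediction is varied within a preset interval while all other data values are kept unchanged.
  18. The cell metabolic network modeling method according to claim 16, characterized in that the single prediction data and the random prediction data are visualized using polar coordinates and images that reflect the objective distribution of metabolites.
  19. Use of the cell metabolic network modeling method according to any one of claims 1-18 in the prediction of cellular physiological responses, characterized by comprising the following steps:
    providing a feature vector composed of data representing metabolic features;
    inputting the feature vector into the cell metabolic model.
  20. Use of the cell metabolic network modeling method according to any one of claims 1-18 in cell design, characterized by comprising the following steps:
    generating data and normalizing them;
    predicting on the generated data using the cell metabolic model;
    obtaining the optimal metabolic model.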

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210861942.3A CN117476092A (zh) 2022-07-21 2022-07-21 Heterogeneity-based cell metabolic network modeling method and application thereof
CN202210861942.3 2022-07-21

Publications (1)

Publication Number Publication Date
WO2024016407A1 true WO2024016407A1 (zh) 2024-01-25

Family

ID=89616870

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/112025 WO2024016407A1 (zh) 2022-07-21 2022-08-12 Heterogeneity-based cell metabolic network modeling method and application thereof

Country Status (2)

Country Link
CN (1) CN117476092A (zh)
WO (1) WO2024016407A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314580A1 (en) * 2013-11-06 2016-10-27 H. Lee Moffitt Cancer Center And Research Institute, Inc. Pathology case review, analysis and prediction
CN111341382A (zh) * 2020-02-20 2020-06-26 江南大学 赖氨酸生物制造中宏观动力学与细胞代谢通量耦合建模方法
CN112466401A (zh) * 2019-09-09 2021-03-09 华为技术有限公司 利用人工智能ai模型组分析多类数据的方法及装置
CN113160986A (zh) * 2021-04-23 2021-07-23 桥恩(北京)生物科技有限公司 用于预测全身炎症反应综合征发展的模型构建方法及系统
CN113989294A (zh) * 2021-12-29 2022-01-28 北京航空航天大学 基于机器学习的细胞分割和分型方法、装置、设备及介质
CN114019010A (zh) * 2021-11-04 2022-02-08 上海交通大学 一种微生物单细胞代谢组学分析方法

Also Published As

Publication number Publication date
CN117476092A (zh) 2024-01-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22951674

Country of ref document: EP

Kind code of ref document: A1