CN116230087B - Method and device for optimizing culture medium components
- Publication number
- CN116230087B, CN202211538905.5A
- Authority
- CN
- China
- Prior art keywords
- components
- negative
- response
- culture medium
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention provides a method and a device for optimizing culture medium components. The method comprises: taking each component of a culture medium as an input feature and the daily response as a target value, establishing a machine learning model, and calculating correlation coefficients; calculating the feature importance of each input feature and placing the top k features by importance score in a first set; marking all components whose correlation coefficients with the daily response are negative as negative factors and placing them in a second set; taking the intersection of the first set and the second set to obtain a negative factor set; and removing one or more components of the negative factor set from the culture medium component set to obtain optimized culture medium components. The invention calculates the importance of each culture medium component by machine learning, which takes into account the interactions among different components in the medium, and combines this with correlation analysis to remove negatively correlated components. This improves the accuracy of component screening, enables optimization with small sample sizes, shortens the research and development cycle, and reduces repeated tests.
Description
Technical Field
The invention relates to the field of screening effective components of culture media, and in particular to a method and a device for optimizing culture medium components.
Background
Orthogonal test design is a method for designing tests that study multiple factors at multiple levels. Based on orthogonality, a subset of representative points is selected from the full factorial test; these points are uniformly dispersed and neatly comparable. The main tool of orthogonal test design is the orthogonal table. The tester looks up an orthogonal table that matches the number of factors, the number of levels of each factor, the interactions to be studied and so on, and selects representative points from the full factorial test accordingly, so that a result equivalent to a large number of full factorial tests can be achieved with a minimum number of tests.
Existing methods for screening culture medium components rely on the experience of biological researchers and on the related literature to add components useful for cell growth and expression to the medium. The final composition is then determined by orthogonal experimental design and single-factor experimental analysis. However, in orthogonal test design, when 3 or more factors are involved and the factors interact, the test workload becomes large or even impractical.
Machine learning is a branch of artificial intelligence. Machine learning theory is mainly concerned with designing and analyzing algorithms that enable a computer to learn automatically, including methods such as support vector regression (SVR), decision trees, gradient boosting decision trees (GBDT, a boosting-type algorithm), random forests, and multi-layer perceptrons (MLP). Machine learning takes the individual components of the medium into account and fits the data by building different models. Its greatest advantage is that a model can be built from existing test results without expert experience and can give accurate predictions. After training, the optimal medium formulation and the importance of each feature (component) can be calculated by machine learning. However, prediction by machine learning depends on a large amount of test data (the number of samples must be at least 10 times the number of features, and in deep learning the number of samples is usually in the tens of thousands), and collecting this data in actual production incurs high time and economic costs.
Therefore, existing methods for screening the effective components of culture media suffer from high cost and low efficiency, either because they neglect the interactions among the factors in the medium or because they require a large number of culture tests.
Disclosure of Invention
The main object of the invention is to provide a method and a device for optimizing culture medium components, so as to solve the problems of long optimization cycles and high cost in the prior art.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method for optimizing culture medium components, the method comprising: taking each component of the culture medium as an input feature and the daily response as a target value, establishing a machine learning model, and calculating correlation coefficients; calculating the feature importance of each input feature and placing the top k features by importance score in a first set; marking all components whose correlation coefficients with the daily response are negative as negative factors and placing all the negative factors in a second set; taking the intersection of the first set and the second set to obtain a negative factor set; and removing one or more components of the negative factor set from the culture medium component set to obtain optimized culture medium components.
Further, the daily response comprises at least one of: cell expression level, cell density and cell viability; preferably, the machine learning model is a regression analysis model; preferably, the method for calculating the correlation coefficient includes the partial least squares method or the Pearson correlation coefficient.
Further, marking all components having negative correlation coefficients for the daily response as negative factors and classifying all negative factors into a second set comprises: selecting the negative factors whose negative correlation coefficients decrease monotonically over time and classifying them into the second set.
Further, the sample data set is divided into a first response data set and a second response data set, wherein the daily response of the first response data set is higher than the corresponding daily response of the second response data set; a correlation coefficient matrix between each input feature and the daily response is calculated in each of the first and second response data sets; according to the correlation coefficient matrices, all components whose correlation coefficients are less than 0 and decrease monotonically over time are screened out and recorded as the second set; preferably, the sample data whose response ranks in the top 20%-30% are assigned to the first response data set, and the rest to the second response data set.
Further, after obtaining the optimized culture medium components, the method further comprises a step of experimental verification; preferably, the negative factor set is removed from the culture medium component set in at least one of the following ways: 1) deleting all components of the negative factor set from the culture medium component set; 2) deleting the components in the negative factor set from the culture medium components one by one; 3) removing known essential components from the negative factor set according to known information to obtain an updated negative factor set, and deleting all components in the updated negative factor set from the culture medium component set.
In order to achieve the above object, according to a second aspect of the present invention, there is provided a device for optimizing culture medium components, comprising: a model building module configured to establish a machine learning model with the culture medium component set as input features and the daily response as a target value, and to calculate correlation coefficients; an important feature selection module configured to calculate the feature importance of each input feature and place the top k features by importance score in a first set; a negative factor selection module configured to mark all components whose correlation coefficients with the daily response are negative as negative factors and place all the negative factors in a second set; an intersection module configured to take the intersection of the first set and the second set to obtain a negative factor set; and a culling module configured to remove one or more components of the negative factor set from the culture medium component set to obtain optimized culture medium components.
Further, the daily response comprises at least one of: cell expression level, cell density and cell viability; preferably, the machine learning model is a regression analysis model; preferably, the method for calculating the correlation coefficient includes the partial least squares method or the Pearson correlation coefficient.
Further, the negative factor selection module includes: a data set dividing module configured to divide the sample data set into a first response data set and a second response data set, wherein the daily response of the first response data set is higher than the corresponding daily response of the second response data set; a correlation coefficient matrix calculation module configured to calculate, in each of the first and second response data sets, a correlation coefficient matrix between each input feature and the daily response; and a screening module configured to screen out, according to the correlation coefficient matrices, all components whose correlation coefficients are less than 0 and decrease monotonically over time, and to record them as the second set; preferably, the sample data whose daily response ranks in the top 20%-30% are assigned to the first response data set, and the rest to the second response data set.
Further, the apparatus further comprises an experiment verification module configured to perform a biological experiment on the optimized medium composition after the negative factor set is removed from the medium composition set, and to measure a daily response of the optimized medium composition, thereby determining a final negative factor.
Further, the culling module comprises at least one of: a culling sub-module 1 configured to delete all components of the negative factor set from the culture medium component set; a culling sub-module 2 configured to delete the components of the negative factor set from the culture medium components one by one; and a culling sub-module 3 configured to remove known essential components from the negative factor set according to known information to obtain an updated negative factor set, and to delete all components in the updated negative factor set from the culture medium component set.
According to a third aspect of the present invention, there is provided a computer-readable storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform any of the above methods for optimizing culture medium components.
According to a fourth aspect of the present invention, there is provided a processor for running a program, wherein the program, when run, performs any of the above methods for optimizing culture medium components.
By applying the technical scheme of the application, the feature importance of each culture medium component is calculated by machine learning, and components that are negatively correlated with the culture effect are removed by combining this with correlation analysis. In addition, screening by feature importance in the machine learning model takes into account the interactions among different components in the medium, so the accuracy of component screening can be improved and better results can be obtained in small-sample scenarios. Modeling and analyzing the data with machine learning shortens the research and development cycle of culture medium optimization and reduces repeated tests.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 shows a schematic flow diagram of a method for medium optimization in a preferred embodiment according to the invention;
FIG. 2 is a schematic view showing the construction of a culture medium optimizing apparatus according to a preferred embodiment of the present invention;
FIG. 3 is a block diagram showing the hardware structure of a method for optimizing a medium in a preferred embodiment according to the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The present application will be described in detail with reference to examples.
As mentioned in the background, the prior art suffers from long cycles and high cost when optimizing culture medium components. To solve these problems, the inventors sought to improve existing machine learning methods and found that doing so can shorten the research and development cycle of culture medium optimization and reduce repeated experiments, and on that basis propose the series of protection schemes of the present application.
In a first exemplary embodiment of the present application, a method for optimizing a culture medium is provided, the method comprising the steps of:
S1, taking the culture medium component set as input features and the daily response as a target value, establishing a machine learning model, and calculating correlation coefficients;
S2, calculating the feature importance of each input feature and placing the top k features by importance score in a first set;
S3, marking all components whose correlation coefficients with the daily response are negative as negative factors, and placing all the negative factors in a second set;
S4, taking the intersection of the first set and the second set to obtain a negative factor set;
S5, removing one or more components of the negative factor set from the culture medium component set to obtain optimized culture medium components.
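The set logic of steps S2 to S5 can be sketched as follows. This is only an illustrative sketch: the function names, the component labels and the scores are hypothetical, and it assumes the feature importance scores and one correlation coefficient per component have already been obtained from the model of step S1.

```python
def negative_factor_set(importance, correlation, k):
    """Return (first_set, second_set, negative_factors) for steps S2 to S4.

    importance:  dict mapping component name -> feature importance score
    correlation: dict mapping component name -> correlation with the response
    """
    # S2: the k components with the highest feature importance scores.
    ranked = sorted(importance, key=importance.get, reverse=True)
    first_set = set(ranked[:k])
    # S3: components whose correlation with the response is negative.
    second_set = {name for name, r in correlation.items() if r < 0}
    # S4: components that are both important and negatively correlated.
    return first_set, second_set, first_set & second_set


def optimized_medium(components, to_remove):
    """S5: drop the selected negative factors, keeping the original order."""
    return [c for c in components if c not in to_remove]


# Hypothetical scores, for illustration only (not data from the patent).
importance = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
correlation = {"A": 0.55, "B": -0.42, "C": -0.10, "D": -0.30}

first, second, negative = negative_factor_set(importance, correlation, k=3)
print(sorted(negative))
print(optimized_medium(list(importance), negative))
```

In this toy input, component D is negatively correlated but falls outside the top k by importance, so it is kept; only components that satisfy both conditions are candidates for removal.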
In this method for optimizing the culture medium, the feature importance of each culture medium component is calculated by machine learning, and components that are negatively correlated with the culture effect are removed by combining this with correlation analysis. In addition, screening by feature importance takes into account the interactions among different components in the medium: during modeling, the model accounts for the interactions among all features, specifically by means such as conditional probability and feature column sampling. The accuracy of component screening can therefore be improved, and better results can be obtained in small-sample scenarios. Modeling and analyzing the data with machine learning shortens the research and development cycle of culture medium optimization and reduces repeated tests.
In the above method, the daily response varies with the specific microorganism species or cell type being cultured; specifically, the daily response includes, but is not limited to, the cell expression level and/or the cell density. The machine learning model can be reasonably selected from existing machine learning methods, such as support vector regression (SVR), decision trees, gradient boosting decision trees (GBDT, a boosting-type algorithm), random forests, and multi-layer perceptrons (MLP). In the present application, a regression analysis model is preferred, and the method for calculating the correlation coefficient preferably includes the partial least squares method or the Pearson correlation coefficient.
In step S3 above, the negative factors, that is, the components whose correlation coefficients with the response are negative, are selected so that they can be removed from the culture medium components to obtain an optimized medium. There are several ways to select negative factors according to the negative correlation coefficient. To increase the stability of the method, in a preferred embodiment of the present application, marking all components having negative correlation coefficients for the daily response as negative factors and classifying all negative factors into the second set comprises: selecting the negative factors whose negative correlation coefficients decrease monotonically over time and classifying them into the second set.
Monotonicity describes how a function value changes as x increases over an interval. If a function f(x) is monotonically decreasing on an interval D, then intuitively its value keeps decreasing as x increases across D, rather than alternately rising and falling. By selecting such monotonically decreasing negative factors, components that contribute little to the culture effect can be picked out relatively directly and quickly for subsequent biological experimental verification. That is, the candidate medium formulation can be optimized quickly.
In order to screen negative factors more accurately, in a more preferred embodiment, step S3 comprises: dividing the sample data set into a first response data set and a second response data set, wherein the response of the first response data set is higher than the corresponding response of the second response data set; calculating, in each of the first and second response data sets, a correlation coefficient matrix between each input feature and the daily response; and, according to the correlation coefficient matrices, screening out all components whose correlation coefficients are less than 0 and decrease monotonically over time as the second set. Negative factors that are screened out from both the high-response and the low-response data sets are relatively more accurate, and their predicted effect on the daily response is subject to less error.
The first response data set is the data set with high response, and the second response data set is the data set with low response; the response threshold separating them can be chosen as needed. Preferably, the sample data whose response ranks in the top 20%-30% are assigned to the first response data set, and the rest to the second response data set.
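The division into a high-response and a low-response data set might be sketched as below. The helper name `split_by_response`, the sample layout and the density values are hypothetical, and the sketch assumes each sample is ranked by a single summary response (for example the final-day value), which the text does not specify.

```python
def split_by_response(samples, response_key, top_fraction=0.25):
    """Split samples into (high_response, low_response) data sets.

    samples:      list of sample records
    response_key: function returning the response used for ranking
    top_fraction: share of samples assigned to the high-response set
                  (the text prefers 20%-30%)
    """
    if not 0 < top_fraction < 1:
        raise ValueError("top_fraction must be between 0 and 1")
    ranked = sorted(samples, key=response_key, reverse=True)
    # Keep at least one sample in the high-response set.
    n_high = max(1, round(len(ranked) * top_fraction))
    return ranked[:n_high], ranked[n_high:]


# Hypothetical cell densities, for illustration only.
densities = [3.1, 8.4, 5.0, 9.2, 4.7, 6.3, 2.8, 7.5]
samples = [{"id": i, "density": d} for i, d in enumerate(densities)]

high, low = split_by_response(samples, lambda s: s["density"], 0.25)
print([s["id"] for s in high], len(low))
```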
This method combines machine learning with correlation analysis to remove negative factors, and then verifies by biological experiment the culture effect of the medium formula obtained after their removal, so that the experiment cycle can be shortened. Specific ways of removing negative factors for biological experiment verification include, but are not limited to, at least one of the following: 1) deleting all components in the negative factor set from the culture medium component set; 2) deleting the components in the negative factor set from the culture medium components one by one; 3) removing known essential components from the negative factor set according to known information to obtain an updated negative factor set, and deleting all components in the updated negative factor set from the culture medium component set.
The three ways of removing negative factors can be selected reasonably according to actual needs. Biological experiment verification can also be carried out with each of the three ways in turn.
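The three ways of removing negative factors can be turned into candidate formulations for verification, as in the sketch below. The function name and component labels are hypothetical. The sketch also assumes that "one by one" means removing a single negative factor per candidate formulation; a cumulative reading is equally possible from the text.

```python
def removal_schemes(components, negative_factors, essential=frozenset()):
    """Build candidate formulations for the three removal ways.

    Returns a dict mapping a way name to a list of candidate formulations,
    each formulation being the list of components that remain.
    """
    negative = [c for c in components if c in negative_factors]

    # Way 1: delete every component of the negative factor set at once.
    delete_all = [c for c in components if c not in negative]

    # Way 2 (assumed reading): delete one negative factor per formulation.
    one_by_one = [[c for c in components if c != n] for n in negative]

    # Way 3: take known essential components out of the negative factor set
    # first, then delete everything left in the updated set.
    updated = [n for n in negative if n not in essential]
    delete_updated = [c for c in components if c not in updated]

    return {
        "delete_all": [delete_all],
        "delete_one_by_one": one_by_one,
        "delete_non_essential": [delete_updated],
    }


# Hypothetical component names, for illustration only.
components = ["A", "B", "C", "D"]
schemes = removal_schemes(components, {"B", "C"}, essential={"C"})
for name, formulations in schemes.items():
    print(name, formulations)
```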
Example 2
This embodiment provides an improved culture medium component screening method based on a machine learning model. As shown in FIG. 1, the method comprises the following steps:
1) Establish a sample formula database (that is, a database of specific formulas formed by combining multiple components). The components include L-tryptophan, L-cysteine, L-glycine, L-alanine, manganese sulfate monohydrate, cobalt chloride hexahydrate, pyridoxal hydrochloride, ethanolamine, sodium bicarbonate, poloxamer 188, and the like. (Poloxamer 188 is a nonionic linear copolymer with surfactant properties; it exhibits antithrombotic, anti-inflammatory and cytoprotective activity in various tissue injury models.) The machine learning model is trained on the existing formula database. The input feature X is the set of components of all formulas: for example, if the database has 200 formulas and each formula consists of 100 components, X is a 200 × 100 matrix (200 rows, 100 columns). The target value is Y (final cell density or final cell expression level).
2) The R² coefficient is calculated for the machine learning model trained on about 1600 samples (that is, 1600 specific formulas) from the culture medium formula database. In statistics, R² measures the proportion of the variability of the dependent variable that is explained by the independent variables, and is used to judge the explanatory power of the regression model. The calculation formula is as follows:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $y_i$ is the observed response of formula sample i, $\hat{y}_i$ is the model prediction for that formula sample, and $\bar{y}$ is the mean response of the formula samples. When $R^2 > 0.80$, the machine learning model is considered to predict the formula response accurately. The machine learning model in this embodiment is GBDT, with $R^2 = 0.81$.
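The R² check of this step can be sketched as follows, assuming the standard coefficient of determination, that is, one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample values are hypothetical; only the 0.80 acceptance threshold comes from the text.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean = sum(observed) / len(observed)
    # Residual sum of squares: observed value minus model prediction.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: observed value minus mean observed value.
    ss_tot = sum((y - mean) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical responses, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

score = r_squared(observed, predicted)
model_is_acceptable = score > 0.80  # threshold used in this embodiment
print(round(score, 2), model_is_acceptable)
```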
3) Calculate the feature importance of each feature and pick out the top k features by importance score. Feature importance is a way of scoring input features according to how useful they are in predicting the target variable. The relative scores highlight which features are most relevant to the target and, conversely, which are least relevant. The method for calculating feature importance differs between machine learning models. In this embodiment, a GBDT model is used, and the feature importance is calculated from the average gain of the splits on each feature.
4) The data set is divided into a high-response data set and a low-response data set. In this embodiment, the samples whose response values rank in the top 25% are assigned to the high-response data set, and the remaining samples to the low-response data set. The following steps are performed in each of the two data sets:
Calculate the Pearson correlation coefficient matrix between each input feature and the daily response. The correlation coefficient measures the degree of linear correlation between variables; the larger its absolute value, the stronger the correlation between the two variables. The calculation formula is as follows:
$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $r(X, Y)$ is the correlation coefficient between variable X and variable Y, $\mathrm{Cov}(X, Y)$ is the covariance between variable X and variable Y, and $\sigma_X$, $\sigma_Y$ are the standard deviations of variable X and variable Y, respectively. Taking daily cell density as the daily response as an example, the correlation coefficient between each input feature $x_1, x_2, x_3, \ldots, x_m$ and each daily response $y_1, y_2, y_3, \ldots, y_n$ is calculated, where m is the total number of culture medium components and n is the number of days of cell growth. This gives the correlation coefficient matrix $R_{m,n}$:
$$R_{m,n} = \begin{pmatrix} r(x_1, y_1) & r(x_1, y_2) & \cdots & r(x_1, y_n) \\ r(x_2, y_1) & r(x_2, y_2) & \cdots & r(x_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ r(x_m, y_1) & r(x_m, y_2) & \cdots & r(x_m, y_n) \end{pmatrix}$$
For row i, if the elements in the row are all less than 0, this means that the component has a negative effect on cell density.
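A minimal sketch of how such an m × n correlation matrix could be computed, assuming each feature and each daily response is given as a list of values over the same samples. The function names `pearson` and `correlation_matrix` are hypothetical; the 1/N factors of the covariance and standard deviations cancel, so they are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


def correlation_matrix(features, responses):
    """Rows are components (features), columns are days (responses)."""
    return [[pearson(x, y) for y in responses] for x in features]
```

Row i of the returned matrix then holds the correlation of component i with the response on each day, which is the form the row-wise checks in this step operate on.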
If r(x_i, y_1) > r(x_i, y_2) > … > r(x_i, y_n), i.e., the correlation coefficient of component i decreases monotonically with time, and component i is among the top k features by feature importance score, then the component is determined to be a negative factor. Taking the intersection of the negative factors found in the low-response and high-response data sets gives the components that can be deleted in subsequent experiments. All negative factors meeting the above conditions form the negative factor set. The daily response in this example is the cell density in the medium on day 3, day 5 and day 7.
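The screening logic of step 4 can be sketched as below. The sketch rests on several assumptions: the samples are ranked by a single response value per sample, each data set's correlation matrix is given as a dictionary mapping a component name to its coefficients ordered by day, and "monotonically decreasing" is read as strictly decreasing. All function names are hypothetical.

```python
def split_by_response(response, top_fraction=0.25):
    """Return (high, low) sample indices; 'high' is the top fraction by response."""
    order = sorted(range(len(response)), key=lambda i: response[i], reverse=True)
    n_high = max(1, round(len(order) * top_fraction))
    return sorted(order[:n_high]), sorted(order[n_high:])


def screen_negative(corr_rows):
    """Components whose coefficients are all below 0 and decrease over time."""
    selected = set()
    for name, row in corr_rows.items():
        all_negative = all(r < 0 for r in row)
        decreasing = all(a > b for a, b in zip(row, row[1:]))
        if all_negative and decreasing:
            selected.add(name)
    return selected


def negative_factor_set(high_corr, low_corr, top_k):
    """Intersect the screens of both data sets with the top-k important features."""
    return screen_negative(high_corr) & screen_negative(low_corr) & set(top_k)
```

Requiring a component to pass the screen in both the high-response and low-response data sets is the intersection step described in the text; it makes the selection stricter than screening a single pooled data set.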
5) Designing an experimental scheme for eliminating negative factors, wherein the experimental scheme for eliminating the negative factors comprises three types:
Scheme 1: all components of the negative factor collection were deleted from the media formulation.
Scheme 2: the components in the negative factor set are deleted from the medium formulation one by one.
Scheme 3: if, according to known experience, the negative factors include components that cannot be deleted, such as essential amino acids, these components are removed from the negative factor set and the negative factor set is updated. All components in the updated negative factor set are then deleted from the culture medium formulation.
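The three elimination schemes can be sketched as simple list operations. The sketch assumes a formulation is a list of component names and reads Scheme 2 as "one experiment per deleted component"; the function names are hypothetical.

```python
def scheme_1(formulation, negative_factors):
    """Scheme 1: delete all negative factors at once (one candidate formulation)."""
    return [c for c in formulation if c not in negative_factors]


def scheme_2(formulation, negative_factors):
    """Scheme 2: delete negative factors one by one (one candidate per factor)."""
    return {
        factor: [c for c in formulation if c != factor]
        for factor in sorted(negative_factors)
        if factor in formulation
    }


def scheme_3(formulation, negative_factors, essential):
    """Scheme 3: keep known essential components, then delete the rest at once."""
    updated = set(negative_factors) - set(essential)
    return scheme_1(formulation, updated)
```

Scheme 2 yields as many candidate formulations as there are negative factors, so it costs more experiments than Scheme 1 but attributes any change in response to a single component.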
6) A biological experiment is carried out again on the culture medium from which the negative factor components have been removed, its response value is measured, the effect of the removal is checked, and the components that can be removed are finally determined. In this example, 7 negative factors were screened out; after verification by biological experiment, removing 4 of these components gave an approximately 10% increase in cell density in the medium.
Table 1 shows the results of the component deletion experiments.
In Table 1, experiment 8 is the control group, i.e., a medium with no component deleted. One negative factor was deleted in each of experiments 1, 2, 3 and 4. It can be seen that the average culture effect of the medium after deleting components X1, X2, X3 and X4 improved by about 13%.
Experiments 5-7 in the above table also show that machine learning combined with correlation analysis cannot guarantee with 100% certainty that every selected negative factor is in fact negative. Experimental verification is therefore still required. In addition, since the cell density of experiment 7 was within 10% of that of experiment 8 (the control group), the component was considered deletable, because the cell density did not decrease significantly after its deletion.
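The verification rule applied to experiment 7 can be written as a small check against the control group. This is only a sketch of one possible reading: a component is treated as deletable when the cell density without it is not more than 10% below the control. The function names and the `tolerance` parameter are hypothetical.

```python
def relative_change(density, control_density):
    """Relative change in cell density versus the control medium."""
    if control_density <= 0:
        raise ValueError("control density must be positive")
    return (density - control_density) / control_density


def can_delete(density, control_density, tolerance=0.10):
    """True when deleting the component does not lower density by 10% or more."""
    return relative_change(density, control_density) > -tolerance
```

Under this reading an increase in density also counts as deletable, which matches the treatment of the components whose removal improved the culture effect.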
Table 2 shows a correlation coefficient matrix
Note: the correlations with the earlier daily responses are omitted from the table; only the correlation coefficients with the response on the last day are shown.
As can be seen from Table 2 above, each negative factor is negatively correlated with Y (cell density).
Further description is provided below in connection with alternative embodiments.
Example 3
This embodiment provides a device for optimizing the composition of a culture medium, as shown in fig. 2, the device comprising: a model building module 10, an important feature selection module 20, a negative factor selection module 30, an intersection module 40, and a culling module 50, wherein,
A model building module 10 configured to build a machine learning model with the set of medium components as input features and the daily response as a target value, and calculate a correlation coefficient, wherein the response includes a cell expression amount and/or a cell density;
An important feature selection module 20 configured to calculate feature importance of each input feature and pick the top k terms of the feature importance score, categorized as a first set;
A negative factor selection module 30 configured to record all components whose correlation coefficient with the daily response is negative as negative factors, and to classify all negative factors into a second set;
An intersection module 40 arranged to take the intersection of the first set and the second set to obtain a negative factor set;
A culling module 50 configured to cull one or more components of the negative factor set from the set of medium components to obtain optimized medium components.
It should be noted that the machine learning model may be reasonably selected according to various known machine learning algorithms, and is preferably a regression analysis model in the present application. Preferably, the calculation method of the correlation coefficient includes, but is not limited to, a partial least square method or a Pearson correlation coefficient.
Optionally, the negative factor selection module includes: a data set dividing module configured to divide the sample data set into a first response volume data set and a second response volume data set, wherein the response volume of the first response volume data set is higher than the corresponding response volume in the second response volume data set; a correlation coefficient matrix calculation module configured to calculate a correlation coefficient matrix of each input feature and the daily response in the first response data set and the second response data set; and the screening module is arranged for screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing along with time according to the correlation coefficient matrix, and recording the components as a second set.
Optionally, the sample data whose response ranks in the top 20%-30% are divided into the first response data set, and the rest are divided into the second response data set.
Optionally, the apparatus further comprises an experiment verification module configured to perform a biological experiment on the optimized medium composition after the negative factor set is removed from the medium composition set, and measure the response of the optimized medium composition, thereby determining the final negative factor.
Optionally, the foregoing culling module includes at least one of: a culling sub-module 1 configured to delete all components of the negative factor set from the medium component set; a culling sub-module 2 configured to delete components of the negative factor set from the medium components one by one; and a culling sub-module 3 configured to remove known essential components from the negative factor set according to known information to obtain an updated negative factor set, and to delete all components in the updated negative factor set from the medium component set.
Example 4
This embodiment provides a computer readable storage medium. The storage medium includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any one of the above methods for optimizing a culture medium.
A processor is also provided for running a program, wherein the program, when run, performs any one of the above methods for optimizing a culture medium.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required for the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application may be implemented by means of software together with hardware devices such as detection instruments. With this understanding, the data processing portions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods of the various embodiments, or of portions of the embodiments, of the application.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The method provided by the application can be executed in a terminal, a computer terminal or a similar computing device. Taking running on a terminal as an example, FIG. 3 is a block diagram of the hardware structure of a terminal for the method for optimizing culture medium components according to an embodiment of the present application. As shown in fig. 3, the terminal may include one or more processors 102 (only one is shown in fig. 3) (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 3, or have a different configuration than shown in fig. 3.
The memory 104 may be used to store computer programs, such as software programs and modules of application software, for example the computer programs corresponding to the methods in the embodiments of the present invention. The processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, i.e., to implement the methods described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The specific example of the network described above may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
It will be apparent to those skilled in the art that some of the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by a computing device, so that they may be stored in a memory device for execution by the computing device, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
From the above description, it can be seen that the above embodiments of the present application achieve the following technical effects: the feature importance computed by machine learning for each feature is combined with correlation analysis. At the same time, interactions among the different components of the culture medium can be taken into account and the accuracy of component screening can be improved, so that better results can be obtained in small-sample scenarios. By using machine learning to model and analyze the data, the research and development cycle of culture medium optimization is shortened and repeated tests are reduced.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (16)
1. A method of medium composition optimization, the method comprising:
Taking each component of the culture medium as an input characteristic, taking the daily response as a target value, establishing a machine learning model, and calculating a correlation coefficient;
Calculating the feature importance of each input feature, picking out the first k items of the feature importance scores, and classifying the first k items into a first set;
All components with negative correlation coefficients for the daily response are marked as negative factors, and all the negative factors are classified into a second set;
taking the intersection of the first set and the second set to obtain a negative factor set;
Removing one or more components in the negative factor set from the culture medium component set to obtain optimized culture medium components;
Recording all components of negative correlation coefficients for the daily response as negative factors and classifying all negative factors into a second set comprises: selecting the negative factors of which the negative correlation coefficients monotonically decrease along with time, and classifying the negative factors into the second set;
Wherein the selecting of the negative factors includes:
Dividing a sample data set into a first response volume data set and a second response volume data set, wherein the daily response volume of the first response volume data set is higher than the corresponding daily response volume of the second response volume data set;
Calculating a correlation coefficient matrix of each of the input features and the daily response in the first response data set and the second response data set;
and screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing with time according to the correlation coefficient matrix, and marking the components as the second set.
2. The method of claim 1, wherein the daily response comprises at least one of: cell expression level, cell density and cell viability.
3. The method of claim 1, wherein the machine learning model is a regression analysis model.
4. The method of claim 1, wherein the correlation coefficient calculation method includes a partial least squares method or a Pearson correlation coefficient.
5. The method of claim 1, wherein the sample data whose response ranks in the top 20%-30% are divided into the first response data set, and the rest are divided into the second response data set.
6. The method of claim 1, wherein after obtaining the optimized medium composition, the method further comprises the step of experimental validation.
7. The method of claim 6, wherein the negative factor set is eliminated from the collection of media components in at least one of the following,
1) Deleting all components in the negative factor set from the culture medium component set;
2) Deleting components in the negative factor set from the culture medium components one by one;
3) And removing known essential components from the negative factor set according to known information to obtain an updated negative factor set, and deleting all components in the updated negative factor set from the culture medium component set.
8. An apparatus for optimizing the composition of a culture medium, the apparatus comprising:
The model building module is used for building a machine learning model by taking the culture medium component set as an input characteristic and the daily response as a target value and calculating a correlation coefficient;
The important feature selection module is used for calculating the feature importance of each input feature, picking out the first k items of the feature importance scores and classifying the first k items into a first set;
a negative factor selection module configured to record all components having negative correlation coefficients for the daily responses as negative factors, and to classify all the negative factors into a second set;
An intersection module configured to take an intersection of the first set and the second set to obtain a negative factor set;
A rejecting module configured to reject one or more components of the negative factor set from the culture medium component set to obtain an optimized culture medium component;
wherein, the negative factor selection module includes:
A data set dividing module arranged to divide a sample data set into a first and a second response volume data set, wherein the daily response volume of the first response volume data set is higher than the corresponding daily response volume in the second response volume data set;
A correlation coefficient matrix calculation module configured to calculate a correlation coefficient matrix for each of the input features and the daily response in the first response data set and the second response data set;
and the screening module is arranged for screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing along with time according to the correlation coefficient matrix, and recording the components as the second set.
9. The apparatus of claim 8, wherein the daily response comprises at least one of: cell expression level, cell density and cell viability.
10. The apparatus of claim 8, wherein the machine learning model is a regression analysis model.
11. The apparatus of claim 8, wherein the correlation coefficient calculation method includes a partial least squares method or a Pearson correlation coefficient.
12. The apparatus of claim 8, wherein the sample data whose daily response ranks in the top 20%-30% are divided into the first response data set, and the rest are divided into the second response data set.
13. The apparatus of claim 8, further comprising an experiment verification module configured to determine a final negative factor by performing a biological experiment on the optimized medium composition after the negative factor set is removed from the medium composition set, and measuring the daily response of the optimized medium composition.
14. The apparatus of claim 8, wherein the culling module comprises at least one of: a culling sub-module 1 configured to delete all components of the negative factor set from the medium component set;
a culling sub-module 2 configured to delete components of the negative factor set from the medium components one by one;
and a culling sub-module 3 configured to remove known essential components from the negative factor set according to known information to obtain an updated negative factor set, and to delete all components in the updated negative factor set from the culture medium component set.
15. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run controls a device in which the storage medium is located to perform the method of optimizing the composition of a medium according to any one of claims 1 to 7.
16. A processor for running a program, wherein the program, when run, performs the method of optimizing the composition of a medium according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211538905.5A CN116230087B (en) | 2022-12-02 | 2022-12-02 | Method and device for optimizing culture medium components |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116230087A (en) | 2023-06-06 |
CN116230087B (en) | 2024-05-14 |
Family
ID=86589924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211538905.5A Active CN116230087B (en) | 2022-12-02 | 2022-12-02 | Method and device for optimizing culture medium components |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116230087B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108694991A (en) * | 2018-05-14 | 2018-10-23 | 武汉大学中南医院 | It is a kind of to integrate the reorientation drug discovery method with drug targets information based on multiple transcription group data sets |
CN109490393A (en) * | 2018-11-06 | 2019-03-19 | 四川理工学院 | Physical and chemical index Eigenvalue Extraction Method material Quality Analysis Methods and system in yeast |
CN111178377A (en) * | 2019-10-12 | 2020-05-19 | 未鲲(上海)科技服务有限公司 | Visual feature screening method, server and storage medium |
WO2021056116A1 (en) * | 2019-09-26 | 2021-04-01 | Terramera, Inc. | Systems and methods for synergistic pesticide screening |
CN113962819A (en) * | 2021-10-08 | 2022-01-21 | 无锡学院 | Method for predicting dissolved oxygen in industrial aquaculture based on extreme learning machine |
CN114360652A (en) * | 2022-01-28 | 2022-04-15 | 深圳太力生物技术有限责任公司 | Cell strain similarity evaluation method and similar cell strain culture medium formula recommendation method |
CN114373503A (en) * | 2022-01-10 | 2022-04-19 | 江苏护理职业学院 | Microbial culture medium optimization method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409416B (en) * | 2018-09-29 | 2021-06-18 | 上海联影智能医疗科技有限公司 | Feature vector dimension reduction method, medical image identification method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
Optimization of γ-polyglutamic acid fermentation medium by the Plackett-Burman method; Shu Xiulin et al.; Biotechnology Bulletin; 2007-08-26 (No. 04); pp. 173-177 * |
Also Published As
Publication number | Publication date |
---|---|
CN116230087A (en) | 2023-06-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||