Disclosure of Invention
An object of the invention is to provide a method for identifying the generation number of a stem cell population.
Another object of the invention is to provide a marker set for identifying the generation number of a stem cell population.
A further object of the invention is to provide a method of screening for markers useful for identifying stem cell populations.
To solve the above technical problem, according to a first aspect of the present invention, there is provided a method for identifying a stem cell population, the method comprising the steps of:
determining, based on a preset generation identification model, the probability that the stem cell population to be identified is at each of N known generations;
comparing the characteristic quantities of the stem cell population to be identified with the characteristic quantities of the stem cell populations of the N known generations, respectively, to obtain the generation range to which the stem cell population to be identified belongs; and
obtaining the generation number of the stem cell population to be identified according to that generation range and the probability that the stem cell population to be identified is at each generation.
In some preferred embodiments, the value of N is an integer not less than 3, such as 3 or 4.
In some preferred embodiments, the preset generation identification model is constructed by:
constructing a training data set, wherein the characteristic quantities of the stem cell populations of the N known generations and the corresponding generation values are obtained, and the characteristic quantities of a stem cell population comprise the peak values of a plurality of characteristic peaks and the corresponding shift values obtained by SR-FTIR analysis of the stem cells;
constructing an initial identification model, wherein the initial identification model comprises an independent variable and a dependent variable, the independent variable is a characteristic quantity of the stem cell population, and the dependent variable is the generation number of the stem cell population;
and training the initial identification model based on the training data set to obtain a final generation identification model.
In some preferred embodiments, the initial identification model is trained using the GBDT algorithm to obtain the final generation identification model.
In some preferred embodiments, the characteristic quantity of the stem cells comprises the peak values and corresponding shift values of the following characteristic peaks obtained by SR-FTIR analysis of the stem cells: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.
In some preferred embodiments, the characteristic quantity of the stem cells further comprises the peak values and corresponding shift values of the following characteristic peaks obtained by SR-FTIR analysis of the stem cells: νas(CH₂), νs(CH₃), νs(CH₂), β-pleated sheet of Amide I, β-pleated sheet of Amide II, "Tyrosine" band, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations of DNA and C-O stretching vibrations of RNA.
In some preferred embodiments, the characteristic quantities of the stem cell population to be identified and the characteristic quantities of the stem cell populations of the N known generations are subjected to dimension reduction and are then compared with each other to obtain the generation range of the stem cell population to be identified.
In some preferred embodiments, the dimension reduction brings the characteristic quantities of the stem cell population to be identified and those of the stem cell populations of the N known generations into the same dimension.
In some preferred embodiments, cluster analysis is performed on the dimension-reduced characteristic quantities of the stem cell population to be identified to obtain the non-outlier points of the stem cell population to be identified;
cluster analysis is performed on the dimension-reduced characteristic quantities of the N known-generation stem cell populations to obtain the cluster centers of the N known-generation stem cell populations; and
the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations are compared to obtain the generation range of the stem cell population to be identified.
In some preferred embodiments, by comparing the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations, at least two target cluster centers are selected from among the cluster centers of the N known-generation stem cell populations, and the generation range is formed from the generations of the known-generation stem cell populations to which the target cluster centers belong.
In some preferred embodiments, a first target cluster center closest to the non-outlier point of the stem cell population to be identified is selected among the cluster centers of the N known generations of stem cell populations;
selecting cluster centers of two stem cell groups of known generation adjacent to the generation to which the first target cluster center belongs, comparing the distances between the cluster centers of the two adjacent stem cell groups of known generation and the non-outlier of the stem cell group to be identified, and selecting the cluster center of the stem cell group of known generation, which is closer to the non-outlier of the stem cell group to be identified, as a second target cluster center;
and forming the generation range from the generations of the known-generation stem cell populations to which the first target cluster center and the second target cluster center belong.
In some preferred embodiments, when the distances between the cluster centers of the two known generation stem cell populations adjacent to the first target cluster center and the non-outlier point of the stem cell population to be identified are equal, the generation range is formed by using the first target cluster center and the generation of the known generation stem cell population to which the cluster centers of the two known generation stem cell populations adjacent to the first target cluster center belong.
In some preferred embodiments, a first target cluster center closest to the non-outlier points of the stem cell population to be identified and a second target cluster center second closest to those non-outlier points are selected from among the cluster centers of the N known-generation stem cell populations; and the generation range is formed from the generations of the known-generation stem cell populations to which the first target cluster center and the second target cluster center belong.
In some preferred embodiments, the cluster analysis is either or a combination of DBSCAN cluster analysis and K-means cluster analysis.
In some preferred embodiments, the feature values of the N known generations of stem cell populations subjected to dimension reduction treatment and/or the feature values of the stem cell populations to be identified are subjected to K-means and/or DBSCAN cluster analysis.
In some preferred embodiments, the dimension reduction process is PCA dimension reduction process.
In some preferred embodiments, a weighted average formula is applied to the generation range to which the stem cell population to be identified belongs and the probability that the stem cell population to be identified is at each generation, so as to obtain the generation number of the stem cell population to be identified.
In some preferred embodiments, the weighted average formula is as follows:
$$\text{generation number of the stem cell population to be identified} = \frac{\mathrm{Prob}_{P_{X1}} \times X1 + \mathrm{Prob}_{P_{X2}} \times X2 + \cdots + \mathrm{Prob}_{P_{Xn}} \times Xn}{\mathrm{Prob}_{P_{X1}} + \mathrm{Prob}_{P_{X2}} + \cdots + \mathrm{Prob}_{P_{Xn}}}$$
wherein $\mathrm{Prob}_{P_{X1}}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,
$\mathrm{Prob}_{P_{X2}}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$, and
$\mathrm{Prob}_{P_{Xn}}$ is the probability that the stem cell population to be identified is at generation $P_{Xn}$.
In some preferred embodiments, the probability that the generation number of the stem cell population to be identified is the generation number obtained by the method is given by:
$$\text{probability} = 1 - \mathrm{Prob}_{P_{Xn}} - \mathrm{Prob}_{P_{Xn-1}} - \cdots - \mathrm{Prob}_{P_{X1}} \times \mathrm{Prob}_{P_{X2}} / 2$$
wherein $\mathrm{Prob}_{P_{X1}}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,
$\mathrm{Prob}_{P_{X2}}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$, and
$\mathrm{Prob}_{P_{Xn}}$ is the probability that the stem cell population to be identified is at generation $P_{Xn}$.
In some preferred embodiments, the weighted average formula is as follows:
$$\text{generation number of the stem cell population to be identified} = \frac{\mathrm{Prob}_{P_{X1}} \times X1 + \mathrm{Prob}_{P_{X2}} \times X2}{\mathrm{Prob}_{P_{X1}} + \mathrm{Prob}_{P_{X2}}}$$
wherein $\mathrm{Prob}_{P_{X1}}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,
$\mathrm{Prob}_{P_{X2}}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$,
$P_{X1}$ is the X1-th generation,
$P_{X2}$ is the X2-th generation, and
the generation range of the stem cell population to be identified is $P_{X1}$ to $P_{X2}$.
In some preferred embodiments, the probability formula for the generation identification of the stem cell population is as follows:
the probability that the generation number of the stem cell population to be identified is the generation number identified by the method is
$$\text{probability} = 1 - \mathrm{Prob}_{P_{Xn}} - \mathrm{Prob}_{P_{Xn-1}} - \cdots - \mathrm{Prob}_{P_{X1}} \times \mathrm{Prob}_{P_{X2}} / 2$$
wherein $\mathrm{Prob}_{P_{X1}}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,
$\mathrm{Prob}_{P_{X2}}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$,
$P_{X1}$ is the X1-th generation,
$P_{X2}$ is the X2-th generation, and
the generation range of the stem cell population to be identified is $P_{X1}$ to $P_{X2}$.
In a second aspect of the invention, there is provided a marker panel for identifying a population of stem cells, the marker panel comprising the following characteristic peaks: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.
In some preferred embodiments, the marker panel further comprises any one of the characteristic peaks selected from the group consisting of: νas(CH₂), νs(CH₃), νs(CH₂), β-pleated sheet of Amide I, β-pleated sheet of Amide II, "Tyrosine" band, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations of DNA and C-O stretching vibrations of RNA.
Compared with the prior art, the invention has at least the following advantages:
(1) The method for identifying the generation number of a stem cell population provided by the invention first uses SR-FTIR technology and then, by building a corresponding mathematical model, establishes the correlation between the infrared spectrum of stem cells and their physiological state, so that standardized information on stem cell quality can be obtained rapidly, providing the quality evaluation means that clinical application of stem cell therapy for related diseases requires;
(2) The SR-FTIR technology has high spatial resolution of diffraction limit and high sensitivity for identifying intercellular variation, and the stem cell population generation identification method of the invention overcomes the defects of the traditional biological technology and can identify the relative physiological generation of stem cells with higher accuracy;
(3) The invention also discovers the correlation between the SR-FTIR spectrum of the corresponding characteristic region of the cell protein and lipid and the physiological state of the stem cells, and provides basis and guarantee for the subsequent development of a cell characteristic spectrometer for rapidly acquiring the SR-FTIR for evaluating the physiological state of the stem cells;
(4) The invention also establishes that a corresponding generation identification model needs to be built for stem cells from each different source.
It is understood that within the scope of the present invention, the above-described technical features of the present invention and the technical features specifically described below (e.g., in the examples) may be combined with each other to constitute new or preferred technical solutions. For reasons of space, these combinations are not described one by one herein.
Detailed Description
The invention researches and develops a method for identifying the generation times of stem cells based on a machine learning method, and the method can identify the generation times of a given batch of stem cells and can be used for assisting in controlling the quality of the stem cells.
Model and method for identifying stem cell population generation times
Embodiments of the present invention relate to a method of identifying a population of stem cells, in particular comprising steps S1-S3.
S1, judging the probability that the stem cell population to be identified is in each generation of N known generations based on a preset generation identification model.
The method for constructing the preset generation identification model used in step S1 includes the following sub-steps S1.1-S1.3.
S1.1, constructing a training data set, the training data set being characteristic quantity information, measured in advance, of cell populations with known passage numbers. The characteristic quantity information consists of the peak values of the characteristic peaks and the corresponding shift values obtained by SR-FTIR analysis of cell populations of N known generations, where N is not less than 2, preferably not less than 3, and may be, for example, 3, 4, 5 or 6. For example, the training data set may be composed of the cell population characteristic quantities of four preset generations P5, P10, P15 and P20, or of six preset generations P5, P10, P15, P20, P25 and P30. It should be appreciated that the number of cell populations and the generations selected for the training data set are not limited thereto.
In some embodiments, the characteristic quantity of the stem cell population includes at least the peak values and corresponding shift values of the following characteristic peaks obtained by SR-FTIR analysis of the stem cell population: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.
S1.2, constructing an initial identification model, wherein the initial identification model takes the characteristic quantities of the stem cell populations, such as the peak values of their characteristic peaks and the corresponding shift values, as independent variables and the corresponding passage numbers of the cell populations as dependent variables.
S1.3, training the initial identification model based on the training data set to obtain a final generation identification model.
In step S1.3, the initial identification model is trained using the GBDT algorithm to obtain the final generation identification model. GBDT (Gradient Boosting Decision Tree) is a gradient boosting decision tree algorithm. Gradient boosting is one of the boosting algorithms among ensemble methods, in which each iteration fits a new learner along the direction of gradient descent. GBDT is thus an iterative algorithm composed of multiple decision trees, and the outputs of all trees are accumulated to give the final answer.
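The additive, residual-fitting idea behind GBDT can be sketched in plain Python. This is only a toy under stated assumptions: it performs squared-loss regression on a single feature with depth-1 trees (stumps), whereas the identification model described here outputs a probability for each known generation. The data values and the names `fit_stump`, `fit_gbdt` and `predict_gbdt` are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the one-feature threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_gbdt(xs, ys, n_trees=100, learning_rate=0.3):
    """Squared-loss gradient boosting: each new stump fits the current residuals."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict_gbdt(model, x):
    """Accumulate the (shrunken) outputs of all trees on top of the base value."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Hypothetical single feature (e.g. one peak shift value) against generation number.
xs = [0.10, 0.12, 0.30, 0.33, 0.52, 0.55, 0.71, 0.74]
ys = [5, 5, 10, 10, 15, 15, 20, 20]
model = fit_gbdt(xs, ys)
```

A practical model would more likely rely on an established gradient boosting library and many spectral features; the sketch only shows how successive trees are accumulated.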
Cell populations can also be identified by the generation-number identification model described above alone. However, the inventor finds that the generation identification model obtained by the single GBDT algorithm has redundant information interference on generation identification, so that the model accuracy is low. For example, when a cell population of unknown generation is obtained, the probability that the cell population to be identified is located in each generation is obtained by using a single GBDT algorithm, and the probabilities are weighted and averaged, so that the probabilities of a plurality of preset generations having poor correlation with the cell population to be identified are incorporated into the algorithm, and the accuracy is reduced.
For example, when the model is trained using cell population data of the four preset generations P5, P10, P15 and P20 to obtain a generation identification model A1, and model A1 is used for identification, the probabilities that the cell population to be identified is at the four preset generations P5, P10, P15 and P20 are denoted ProbP5, ProbP10, ProbP15 and ProbP20, respectively. When a weighted average is taken directly over these probabilities, the identified generation number of the cells is obtained by the following formula I:
$$\text{generation number of the stem cell population to be identified} = \frac{\mathrm{ProbP5} \times 5 + \mathrm{ProbP10} \times 10 + \mathrm{ProbP15} \times 15 + \mathrm{ProbP20} \times 20}{5 + 10 + 15 + 20} \qquad \text{(formula I)}$$
However, among the four preset generations P5, P10, P15 and P20, the cell populations of some preset generations correlate only weakly with the cell population to be identified. If the probability data of these weakly correlated preset generations are included in the weighted average, the result inevitably deviates from the true passage number of the cell population to be identified. The inventors therefore optimized the model by adding a step that identifies the generations $P_{x1}$ and $P_{x2}$ of the two preset-generation cell populations most relevant to the actual characteristics of the cell population to be identified; the true passage number of the cells to be identified should lie within the generation range $P_{x1}$ to $P_{x2}$. Only the two most relevant generations $P_{x1}$ and $P_{x2}$ and their corresponding probabilities then need to be used in the calculation, and the other, less relevant terms are removed from the formula, which improves the identification accuracy. The generations $P_{x1}$ and $P_{x2}$ of the two preset-generation cell populations most relevant to the characteristics of the cell population to be identified, that is, the generation range containing the true passage number of the cells to be identified, can be obtained by the following step S2.
S2, comparing the characteristic quantity of the stem cell group to be identified with the characteristic quantity of the stem cell group of N known generations respectively to obtain the generation range of the stem cell group to be identified.
In step S2, preferably, in order to remove data of little influence and to facilitate comparison, the characteristic quantities of the stem cell population to be identified and those of the N known-generation stem cell populations may be subjected to dimension reduction so that the two are in the same dimension, and are then compared to obtain the generation range to which the stem cell population to be identified belongs. Preferably, the dimension reduction is PCA dimension reduction.
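The requirement that both sets of characteristic quantities end up "in the same dimension" can be illustrated by fitting the projection on the known-generation data and reusing it, unchanged, for the population to be identified. The sketch below is a simplification under stated assumptions: it keeps only the first principal axis, finds it by power iteration, and uses hypothetical two-feature data; `fit_first_component` and `project` are illustrative names.

```python
import math

def fit_first_component(rows, n_iter=200):
    """Return (feature means, unit first principal axis) via power iteration."""
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
    centered = [[r[j] - means[j] for j in range(n_features)] for r in rows]
    # Sample covariance matrix of the centered reference data.
    cov = [[sum(c[i] * c[j] for c in centered) / (len(rows) - 1)
            for j in range(n_features)] for i in range(n_features)]
    axis = [1.0] * n_features
    for _ in range(n_iter):
        # Repeated multiplication by the covariance matrix converges to the
        # eigenvector with the largest eigenvalue.
        axis = [sum(cov[i][j] * axis[j] for j in range(n_features))
                for i in range(n_features)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    return means, axis

def project(row, means, axis):
    """Score of one sample on the axis fitted from the reference data."""
    return sum((x - m) * a for x, m, a in zip(row, means, axis))

# Hypothetical characteristic quantities of known-generation cells (2 features).
known = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
means, axis = fit_first_component(known)

# The cell to be identified is projected with the SAME means and axis.
unknown_score = project([2.5, 5.0], means, axis)
```

Fitting the projection once and applying it to both data sets is what keeps the reduced coordinates comparable; fitting a separate PCA per data set would not.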
The characteristic quantities of the stem cell population to be identified may optionally be compared with those of the N known-generation stem cell populations as follows:
performing K-means and/or DBSCAN cluster analysis on the dimension-reduced characteristic quantities of the N known-generation stem cell populations; and obtaining the generation range $P_{x1}$ to $P_{x2}$ to which the stem cell population to be identified belongs by comparing the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations.
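One way to obtain non-outlier points is to discard whatever a density-based clustering labels as noise. The following is a minimal DBSCAN sketch in plain Python, not the implementation used in the invention; the `eps` and `min_samples` values, the sample coordinates, and the choice to summarize the retained points by their centroid are all assumptions made for illustration.

```python
import math

def dbscan_labels(points, eps, min_samples):
    """Minimal DBSCAN: one cluster id per point, -1 for noise (outliers)."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

def centroid(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

# Hypothetical dimension-reduced single-cell points; the last one is an outlier.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
labels = dbscan_labels(cells, eps=0.2, min_samples=3)
inliers = [p for p, label in zip(cells, labels) if label != -1]
center = centroid(inliers)
```

Here the four tightly grouped points form one cluster and the distant point is labeled noise, so it does not pull the centroid away from the bulk of the cells.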
In a preferred embodiment, the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations are compared, and at least two cluster centers closest to those non-outlier points are selected from among the N cluster centers as target cluster centers. In one example, the distances between the cluster centers of the 1st to Nth known-generation stem cell populations and the non-outlier points of the stem cell population to be identified are denoted d1, d2, ..., dn (where no two of these d values are equal). The cluster center with the smallest d value is selected as the first target cluster center, and the cluster center with the second smallest d value is selected as the second target cluster center; the generation range of the stem cell population to be identified is then determined from the generations of the known-generation stem cell populations to which the first and second target cluster centers belong. For instance, if the smallest d value is d1, the cluster center of the 1st known-generation stem cell population is the first target cluster center; if the second smallest d value is d2, the cluster center of the 2nd known-generation stem cell population is the second target cluster center; and the generation range of the stem cell population to be identified lies between the generation corresponding to the cluster center of the 1st known-generation stem cell population and the generation corresponding to the cluster center of the 2nd known-generation stem cell population.
In the present invention, "next" and "second" (as in "second closest" or "second smallest") refer to the value ranked immediately after the extreme value.
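For the case in which all distances differ, the selection of the first and second target cluster centers amounts to a sort by distance. The sketch below assumes that the stem cell population to be identified is represented by a single point in the reduced space and that each known generation has one cluster center; the coordinates and the name `generation_range` are hypothetical.

```python
import math

def generation_range(query, centers):
    """Return the generations of the closest and second-closest cluster centers.

    centers maps a known generation number to its cluster-center coordinates.
    """
    ranked = sorted(centers, key=lambda g: math.dist(query, centers[g]))
    first, second = ranked[0], ranked[1]
    return min(first, second), max(first, second)

centers = {5: (0.0, 0.0), 10: (1.0, 0.2), 15: (2.1, 0.3), 20: (3.0, 0.1)}
print(generation_range((1.8, 0.3), centers))  # prints (10, 15)
```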
In another example, the distances between the cluster centers of the 1st to Nth known-generation stem cell populations and the non-outlier points of the stem cell population to be identified are denoted d1, d2, ..., dn (where some d values may be equal). In that case, the cluster center corresponding to the smallest d value may not be unique, for example when the smallest d values are d1, d2 and d3 with d1 = d2 = d3. Alternatively, the cluster center corresponding to the smallest d value may be unique while the cluster center corresponding to the second smallest d value is not, for example when the smallest d value is d1 and the second smallest d values are d2 and d3 with d2 = d3. In either case, the smallest generation range that covers the generations of the two or three known-generation stem cell populations concerned is taken as the generation range to which the stem cell population to be identified belongs.
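The tie rule can be sketched as follows, assuming the distances have already been computed per generation. The tolerance `tol` used to decide that two distances are equal, the function name and the example values are assumptions.

```python
def generation_range_with_ties(distances, tol=1e-9):
    """distances maps generation number -> distance to the query point.

    Returns the smallest generation range covering every center tied for
    closest, plus every center tied for second closest when the closest
    center is unique. Requires at least two generations.
    """
    smallest = min(distances.values())
    chosen = [g for g, d in distances.items() if d - smallest <= tol]
    if len(chosen) == 1:
        rest = {g: d for g, d in distances.items() if g not in chosen}
        second = min(rest.values())
        chosen += [g for g, d in rest.items() if d - second <= tol]
    return min(chosen), max(chosen)

# Unique closest center (P10), two centers tied for second closest (P15, P20).
print(generation_range_with_ties({5: 2.0, 10: 0.4, 15: 0.9, 20: 0.9}))  # prints (10, 20)
# Three centers tied for closest.
print(generation_range_with_ties({5: 0.5, 10: 0.5, 15: 0.5, 20: 2.0}))  # prints (5, 15)
```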
Once the generations $P_{x1}$ and $P_{x2}$ of the two known-generation cell populations most relevant to the characteristics of the cell population to be identified have been obtained, the generation number of the stem cell population can be identified by the following step S3.
S3, obtaining the generation number of the stem cell group to be identified according to the generation number range of the stem cell group to be identified and the probability that the stem cell group to be identified is in each generation number.
Specifically, the generation number of the stem cell population to be identified can be obtained by the following formula II:
$$\text{generation number of the stem cell population to be identified} = \frac{\mathrm{Prob}_{P_{X1}} \times X1 + \mathrm{Prob}_{P_{X2}} \times X2 + \cdots + \mathrm{Prob}_{P_{Xn}} \times Xn}{\mathrm{Prob}_{P_{X1}} + \mathrm{Prob}_{P_{X2}} + \cdots + \mathrm{Prob}_{P_{Xn}}} \qquad \text{(formula II)}$$
wherein $\mathrm{Prob}_{P_{X1}}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,
$\mathrm{Prob}_{P_{X2}}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$, and
$\mathrm{Prob}_{P_{Xn}}$ is the probability that the stem cell population to be identified is at generation $P_{Xn}$.
The generation values of the at least two cluster centers of known-generation stem cell populations closest to the non-outlier points of the stem cell population to be identified, obtained in step S2, are substituted into formula II to obtain the generation number of the stem cell population to be identified.
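As a minimal sketch of this substitution, the weighted average can be computed over only the generations selected in step S2, with the probabilities taken from step S1. The probability values and the name `weighted_generation` are hypothetical.

```python
def weighted_generation(probabilities, selected):
    """Probability-weighted average generation, restricted to `selected`.

    probabilities maps generation number -> probability from the
    identification model; selected lists the generations kept by step S2.
    """
    numerator = sum(probabilities[g] * g for g in selected)
    denominator = sum(probabilities[g] for g in selected)
    return numerator / denominator

# Hypothetical model output for the preset generations P5, P10, P15 and P20.
probs = {5: 0.05, 10: 0.30, 15: 0.60, 20: 0.05}

# Step S2 is assumed to have returned the range P10 to P15.
estimate = weighted_generation(probs, selected=[10, 15])
print(round(estimate, 2))  # prints 13.33
```

Dividing by the sum of the selected probabilities renormalizes them, so the estimate always falls inside the selected generation range.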
In general, the cluster center second closest to the non-outlier points of the stem cell population to be identified is one of the cluster centers of the two generations adjacent to the generation of the closest cluster center. For example, when the model is trained using cell data of the four preset generations P5, P10, P15 and P20 and the stem cell population to be identified is closest to the cluster center of the P15 stem cell population, the second closest cluster center belongs to at least one of P10 and P20, and possibly to both. The generation range to which the stem cell population to be identified belongs can then be determined directly by comparing which of the P10 and P20 cluster centers is closer to the non-outlier points of the stem cell population to be identified. In other words, the generation range is determined by comparing the distances from the non-outlier points of the stem cell population to be identified to the cluster centers of the two generations adjacent to the generation of the closest cluster center.
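The adjacent-generation shortcut can be sketched as below: find the closest center, then compare only its immediate neighbors in generation order. Treating exactly equal distances as a tie (which widens the range to both neighbors) is an assumption about how equality would be tested, and the distances are hypothetical.

```python
def adjacent_generation_range(distances):
    """distances maps generation number -> distance to the query point."""
    generations = sorted(distances)
    first = min(generations, key=distances.get)
    i = generations.index(first)
    # Neighbors of the closest generation; a boundary generation has only one.
    neighbors = generations[max(i - 1, 0):i] + generations[i + 1:i + 2]
    best = min(distances[g] for g in neighbors)
    chosen = [first] + [g for g in neighbors if distances[g] == best]
    return min(chosen), max(chosen)

# Closest to P15; P20 is nearer than P10, so the range is P15 to P20.
print(adjacent_generation_range({5: 2.4, 10: 1.1, 15: 0.3, 20: 0.8}))  # prints (15, 20)
# P10 and P20 are equidistant, so the range spans both neighbors.
print(adjacent_generation_range({5: 2.4, 10: 0.8, 15: 0.3, 20: 0.8}))  # prints (10, 20)
```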
In some examples, when more than one (for example 2 or 3) cluster center of a known-generation stem cell population is closest to the non-outlier points of the stem cell population to be identified, the two generation values P bounding the smallest generation range that covers those 2 or 3 known-generation stem cell populations, together with the probabilities ProbP that the stem cell population to be identified is at those two generations, are substituted into formula II to obtain the generation number of the stem cell population to be identified and the corresponding identification probability.
In some examples, when exactly one cluster center of a known-generation stem cell population is closest to the non-outlier points of the stem cell population to be identified, but more than one (for example 2 or 3) is second closest, the generation value P of the closest cluster center and the probability Prob that the stem cell population to be identified is at that generation, together with the generation values P of the second closest cluster centers and the probabilities Prob that the stem cell population to be identified is at those generations, are substituted into formula II to obtain the generation number of the stem cell population to be identified.
In some examples, exactly one cluster center of a known-generation stem cell population is closest to the cluster center of the stem cell population to be identified and exactly one is second closest; in this case, the generation number of the stem cell population to be identified can be calculated by the following formula III.
$$\text{generation number of the stem cell population to be identified} = \frac{\mathrm{Prob}_{P_{X1}} \times P_{X1} + \mathrm{Prob}_{P_{X2}} \times P_{X2}}{\mathrm{Prob}_{P_{X1}} + \mathrm{Prob}_{P_{X2}}} \qquad \text{(formula III)}$$
In formula III, $\mathrm{Prob}_{P_{X1}}$ is the probability that the stem cell population to be identified is at the X1-th generation,
$\mathrm{Prob}_{P_{X2}}$ is the probability that the stem cell population to be identified is at the X2-th generation,
$P_{X1}$ is the X1-th generation,
$P_{X2}$ is the X2-th generation,
X1 ≠ X2, and
the generation range of the stem cell population to be identified is $P_{X1}$ to $P_{X2}$.
On the other hand, the probability that the generation number of the stem cell population to be identified equals the generation number calculated by formula II can be obtained by the following formula III:
$$\text{probability} = 1 - \mathrm{Prob}_{P_{Xn}} - \mathrm{Prob}_{P_{Xn-1}} - \cdots - \mathrm{Prob}_{P_{X1}} \times \mathrm{Prob}_{P_{X2}} / 2 \qquad \text{(formula III)}$$
Marker set
Embodiments of the invention also relate to a marker panel for identifying stem cell populations, the marker panel including the following characteristic peaks: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.
In the invention, the generation number of a stem cell population can be identified by collecting, by SR-FTIR spectral analysis, the peak values and shifts of the marker characteristic peaks of a control stem cell population and of the stem cell population whose generation number is to be identified, and comparing them; wherein the marker characteristic peaks at least comprise νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II. Preferably, the marker panel further comprises at least one characteristic peak selected from the group consisting of: νas(CH₂), νs(CH₃), νs(CH₂), β-pleated sheet of Amide I, β-pleated sheet of Amide II, "Tyrosine" band, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations of DNA and C-O stretching vibrations of RNA.
Source-specific model building
Embodiments of the invention also relate to mutual verification of the generation identification models established for stem cells from different sources. The inventors found that, to identify the generation number of stem cells from any given source, SR-FTIR spectra of that stem cell population at the corresponding generations need to be acquired, and a generation identification model specific to that stem cell population needs to be established using the generation identification model construction method described above.
The present invention will be further described with reference to specific embodiments in order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. The experimental methods, in which specific conditions are not noted in the following examples, are generally conducted under conventional conditions or under conditions recommended by the manufacturer. Percentages and parts are weight percentages and parts unless otherwise indicated. The experimental materials and reagents used in the following examples were obtained from commercial sources unless otherwise specified.
Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is to be noted that the terms used herein are used merely to describe specific embodiments and are not intended to limit exemplary embodiments of the application.
Example 1
In this example, a set of reliable SR-FTIR spectral data was collected for different cell generations of stem cells, demonstrating that SR-FTIR is able to identify differences in biochemical composition or structure of stem cells of different generations of the same source. The specific implementation steps are as follows:
Two stem cell lines derived from healthy donors (SIAISi011-A and TUSMi002) were selected, and different generations of the continuously passaged stem cells (P5±2, P10±2, P15±2, P20±2, P25±2, P30±2, P35±2) were sampled from each line. Several hundred stem cells of each generation were seeded on calcium fluoride slides coated with Matrigel, cultured for 24 hours, fixed, and dried completely at room temperature. SR-FTIR spectra were then acquired at the BL01B beamline of the Shanghai Synchrotron Radiation Facility. For each group of samples, single-cell spectra of at least 60 randomly chosen cells were collected, with the aperture set to approximately the diameter of a single cell, a spectral range of 4000-900 cm⁻¹, 64 scans, and a spectral resolution of 4 cm⁻¹. All SR-FTIR data were subjected to automatic baseline correction and 15-point Savitzky-Golay (SG) smoothing, then corrected for Mie scattering using RMieS-EMSC, and finally converted to second-derivative spectra with the Savitzky-Golay algorithm. The data processing flow chart is shown in FIG. 1.
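As an illustration of the last processing step, the sketch below computes a second derivative by Savitzky-Golay convolution in plain Python. The 5-point quadratic window is an assumption chosen for brevity: the text specifies 15-point SG smoothing but does not state the window used for differentiation. The function name and the toy spectrum are hypothetical, and baseline correction and RMieS-EMSC are not reproduced here.

```python
def savgol_second_derivative(y, spacing):
    """Second derivative of evenly spaced data using the 5-point
    quadratic Savitzky-Golay convolution (window size is an assumption)."""
    coeffs = (2.0, -1.0, -2.0, -1.0, 2.0)   # standard 5-point SG weights
    norm = 7.0 * spacing ** 2               # normalisation for spacing h
    half = len(coeffs) // 2
    out = []
    # Edge points without a full window are dropped.
    for i in range(half, len(y) - half):
        window = y[i - half:i + half + 1]
        out.append(sum(c * v for c, v in zip(coeffs, window)) / norm)
    return out


# Toy "spectrum": a parabola whose true second derivative is 1 everywhere.
wavenumbers = [900.0 + 2.0 * k for k in range(11)]
absorbance = [0.5 * (w - 910.0) ** 2 for w in wavenumbers]
second_deriv = savgol_second_derivative(absorbance, spacing=2.0)
```

On real spectra a wider window would normally be used; the short window here only keeps the coefficients easy to check by hand.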
Different wavenumber ranges of the second-derivative spectrum characterize the chemical bonds of different classes of substances: the lipid region (3000-2830 cm⁻¹), the protein region (1770-1475 cm⁻¹) and the nucleic acid region (1300-1000 cm⁻¹). Within these regions there is a known series of characteristic peaks of specific chemical bonds; the range of each representative characteristic peak is shown in Table 1.
In Table 1 above, νas represents asymmetric stretching vibration and νs represents symmetric stretching vibration;
νas(CH₃) represents the asymmetric stretching vibration of methyl;
νas(CH₂) represents the asymmetric stretching vibration of methylene;
νs(CH₃) represents the symmetric stretching vibration of methyl;
νs(CH₂) represents the symmetric stretching vibration of methylene;
N-H bending of Amide I represents the N-H bending vibration of the amide I band;
α-helical of Amide I represents the α-helix of the amide I band;
β-pleated sheet of Amide I represents the β-sheet of the amide I band;
α-helical of Amide II represents the α-helix of the amide II band;
β-pleated sheet of Amide II represents the β-sheet of the amide II band;
"Tyrosine" band represents the tyrosine band;
νas(PO₂⁻) represents the asymmetric stretching vibration of PO₂⁻;
νs(PO₂⁻) represents the symmetric stretching vibration of PO₂⁻;
C-O stretching vibrations: DNA represents the C-O bond stretching vibration of DNA;
C-O stretching vibrations: RNA represents the C-O bond stretching vibration of RNA.
The peak value and displacement of a characteristic peak reflect, for each single cell, the content and the strength of the associated chemical bond: a higher peak value indicates a higher content of that bond in the cell, and a displacement toward higher wavenumbers indicates increased bond strength. The displacement and peak value data of the characteristic peaks in each single-cell SR-FTIR spectrum (νas(CH₃), νas(CH₂), νs(CH₃), νs(CH₂), N-H bending of Amide I, α-helical of Amide I, β-pleated sheet of Amide I, α-helical of Amide II, β-pleated sheet of Amide II, "Tyrosine" band, C4-C5 and C=N stretching in imidazole ring of RNA, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations: DNA, and C-O stretching vibrations: RNA) were extracted and subjected to analysis of variance (Tables 2-1 to 2-4). The results demonstrate that SR-FTIR is able to identify differences in the biochemical composition or structure of stem cells of the same source but different generations.
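A minimal sketch of how a peak value and displacement might be read from a second-derivative spectrum follows. It assumes that a band appears as a minimum of the second derivative and that the displacement is measured against a reference position; the window limits, the reference wavenumber and the function name are hypothetical placeholders, not values from Table 1.

```python
def peak_in_window(wavenumbers, second_deriv, low, high):
    """Return (position, value) of the most negative second-derivative
    point whose wavenumber lies in [low, high]."""
    candidates = [(d, w) for w, d in zip(wavenumbers, second_deriv)
                  if low <= w <= high]
    if not candidates:
        raise ValueError("no data points inside the window")
    value, position = min(candidates)   # deepest minimum in the window
    return position, value


# Toy second-derivative values around a protein band (illustrative only).
wn = [1640.0, 1644.0, 1648.0, 1652.0, 1656.0, 1660.0, 1664.0]
d2 = [-0.01, -0.03, -0.08, -0.12, -0.09, -0.02, -0.01]

position, value = peak_in_window(wn, d2, 1645.0, 1662.0)
shift = position - 1654.0   # hypothetical reference position
```

The pair (value, shift) for each characteristic peak would then form one cell's feature vector for the analysis of variance and for the models of Example 2.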
Example 2
In this embodiment, a data classification model is constructed, an identification system for evaluating the relative physiological state of stem cells is established, and the relative cell generation times and the identification probability of single cells are confirmed through data weighted analysis.
(1) Construction of cell generation identification model and cell generation identification method
Constructing a cell generation identification model first requires collecting several sets of cell data for known generations to serve as references. These data are the characteristic peak values and displacements collected by the method of Example 1, and they constitute the training data set. A GBDT (gradient boosting decision tree) classification model is used as the initial generation identification model, which is trained on the training data set to obtain the final cell generation identification model.
To improve the accuracy of the model, the inventors established an additional method that supplies extra information to the classification, referred to below as "feature extraction comparison".
The specific steps of the feature extraction comparison method are as follows. For the cells to be identified, redundant information is removed by PCA dimension reduction, the data are clustered with DBSCAN, and outlier cell data are removed. For the stem cells of known generations in the training data set, the data are likewise reduced by PCA to the same dimensionality as the cells to be identified and clustered with k-means to find the core features (cluster centers) of the cell data. The features of the cell data to be identified are then compared with those of the known-generation cell data to infer, in a fuzzy manner, the generation range of the current cell data. Finally, the fuzzy inference result is combined with the GBDT determination by a weighted average, so that the generation of the batch of cells and its probability are inferred relatively accurately (see FIG. 2 for the specific process).
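The final combination step can be sketched as follows. The text does not give the exact weighting scheme, so this example makes an assumption: the GBDT class probabilities are restricted to the known generations inside the inferred range, and the generation estimate is their probability-weighted average. The function name, the returned "support" value and the numbers are all hypothetical.

```python
def estimate_generation(class_probs, generation_range):
    """Combine per-generation probabilities with an inferred range.

    class_probs: {known generation: probability from the classifier}
    generation_range: (low, high) from the feature comparison step
    Returns (weighted generation estimate, probability mass in range).
    """
    low, high = generation_range
    kept = {g: p for g, p in class_probs.items()
            if low <= g <= high and p > 0}
    if not kept:
        raise ValueError("no known generation with support in the range")
    support = sum(kept.values())
    estimate = sum(g * p for g, p in kept.items()) / support
    return estimate, support


# Hypothetical classifier output for one batch of cells.
probs = {5: 0.05, 15: 0.60, 25: 0.30, 35: 0.05}
estimate, support = estimate_generation(probs, (15, 25))
```

With these toy inputs the estimate falls between P15 and P25, closer to P15, which is how an intermediate generation could be reported even though only four generations were used for training.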
A model was established with the SIAISi011-A data. The P5, P15, P25 and P35 data were selected as the pre-input reference data sets of known generations, and two types of model were obtained according to the number of pre-input data sets: a four-point data set model (pre-input P5, P15, P25, P35; Model A) and three-point data set models (pre-input P5, P15, P25, Model B; and pre-input P15, P25, P35, Model C). When new P15 cell data (not used in modeling) are input, the GBDT classification model calculates the similarity between the input P15 data and each group of data in the known modeling data set; combined with the generation range given by the feature extraction comparison method, the generation of the new P15 cells is identified with high accuracy. Other cell data (such as P5, P25 and P35) are input in the same way to obtain accurate generation information. The results are shown in FIG. 3 and Table 3.
(2) Accuracy verification
To verify the accuracy of Model A, Model B and Model C, the models were validated using other SIAISi011-A generations as the re-input cell data sets. The results are shown in FIG. 4 and Table 4.
From the above results, when other generations are used as validation sets, the four-point data set Model A can accurately determine the relative cell generations of SIAISi011-A P and P20; that is, the generation determined by the model is close to the actual generation. In addition, the three-point data set models, Model B and Model C, accurately determine the validation-set generations that fall within their modeling ranges: Model B accurately determines the SIAISi011-A P generation, and Model C accurately determines the SIAISi011-A P generation. The model can therefore accurately determine a re-input cell generation as long as that generation lies within the range of the pre-input generations. In other words, provided the pre-input generation range is wide enough, the cell generation identification method can accurately determine any cell generation of the same source.
In addition, the accuracy of the four-point data set Model A was analyzed by comparing the generation identified by the model with the actual generation of each cell. A determination was counted as correct if it fell within ±2 generations of the actual generation and as incorrect otherwise, and the accuracy was calculated as: accuracy = number of correctly determined cells / total number of cells × 100%. The accuracy of the model was 97.03% on the modeling set and 83.63% on the validation set of other generations of the same source, giving an overall accuracy of 91.40%.
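The accuracy criterion can be written out as a short sketch. The function name and the example predictions are hypothetical; only the ±2-generation tolerance and the formula come from the text.

```python
def accuracy_percent(predicted, actual, tolerance=2.0):
    """Percentage of cells whose predicted generation lies within
    +/- tolerance generations of the actual generation."""
    if not predicted:
        raise ValueError("no predictions supplied")
    correct = sum(1 for p in predicted if abs(p - actual) <= tolerance)
    return 100.0 * correct / len(predicted)


# Hypothetical per-cell predictions for a batch whose true generation is P10.
predicted = [9.2, 10.5, 13.1, 8.0, 12.0]
acc = accuracy_percent(predicted, actual=10)
```

In this toy batch four of the five cells fall within the tolerance, so the accuracy is 80%.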
Similarly, to verify the generality of the method, the inventors used a cell line of another source, TUSMi002 (P5±2, P10±2, P15±2, P20±2, P25±2, P30±2, P35±2), for the same model construction and validation. The TUSMi002 P5, P15, P25 and P35 data were pre-input as the modeling data sets of known generations, and two types of model were obtained according to the number of pre-input data sets: a four-point data set model (pre-input P5-P15-P25-P35, Model A-T) and three-point data set models (pre-input P5-P15-P25, Model B-T; and Model C-T, built in the same way as Model C above). The P5, P15, P25 and P35 data were then input in turn into the cell generation determination model, and the similarity to each group of modeling data obtained by the GBDT classification model was combined with the generation inference given by the feature extraction comparison method. The conclusion was that the model accurately determines the cell generations of the input modeling groups and can serve as a cell generation determination model for TUSMi002 (FIG. 5, Table 5).
Other generations of the same source (TUSMi002 P10, P20 and P30) were used as the cell data sets to be identified, to test the accuracy of Model A-T, Model B-T and Model C-T in determining cell generations. The results are shown in FIG. 6 and Table 6.
From the above results, Model A-T accurately determines the cell generations of P10, P20 and P30, whereas Model B-T accurately infers the generations of P10 and P20 and Model C-T those of P20 and P30; that is, Model B-T and Model C-T each accurately determine the cell generations within their own modeling range. This again confirms the earlier conclusion: as long as the range of pre-input cell generations is wide enough, the "cell generation determination model" of the invention can accurately determine any cell generation of the same source. The accuracy of the "large-range model" Model A-T was calculated as 97.88% on the modeling set and 64.60% on the validation set of other generations of the same source, giving a total accuracy of 83.57%.
In summary, a "cell generation identification model" based on SR-FTIR spectral characteristic peak information was established and verified with a classification model algorithm. For a cell line of any given source, collecting the SR-FTIR characteristic peak information of stem cells over a sufficiently wide range of generations allows a model to be built that determines, with high accuracy, the relative physiological state of stem cells of other generations of the same source, thereby enabling rapid, panoramic and standardized quality evaluation.
Example 3
In this embodiment, the model is simplified by analyzing in depth the characteristic peaks involved in the model and selecting the characteristic peak regions with high weights.
The similarity of the characterization information among the characteristic peaks of the three regions (lipid, protein and nucleic acid) was calculated by KL divergence, a measure of how well two distributions match. First, the characteristic peak data were normalized. Second, the KL divergence was calculated pairwise between characteristic peaks, and a pair with KL < 0.1 was judged to be highly similar. In this way the minimum characteristic peak data set (hereinafter, the minimum characteristic peak set) required by the cell generation determination model was screened out, so as to obtain a simplified cell generation determination model. The inventors found that the combination of the characteristic peaks α-helical of Amide I with N-H bending of Amide I, or α-helical of Amide I with β-pleated sheet of Amide I, can characterize the cell information of SIAISi011-A to a high degree, while the combination of the four characteristic peaks νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II can characterize the cell information of TUSMi002 to a high degree. Combining the characteristic peaks screened in SIAISi011-A and TUSMi002 finally gives νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II as the minimum characteristic peak set that characterizes stem cell information to the greatest extent. To confirm whether this minimum characteristic peak set can be used to build a simplified "cell generation determination model", the inventors applied this set of characteristic peaks, performed cell generation determination based on the "cell generation determination model", and plotted, in the same way as before, histograms of the model-determined cell generation distribution and scatter diagrams of the model-determined cell generation and probability (FIGS. 7 and 8; Tables 7 to 10).
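The pairwise KL screening can be illustrated with a small sketch. It assumes that each characteristic peak is summarized as a histogram over shared bins, that normalization means scaling the histogram to sum to 1, and that the divergence is evaluated in one direction only (KL divergence is not symmetric, and the text does not say which direction was used). The peak names and counts are hypothetical.

```python
import math
from itertools import combinations


def to_distribution(counts, eps=1e-9):
    """Scale histogram counts to probabilities; eps avoids log(0)."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def similar_pairs(histograms, threshold=0.1):
    """Return the pairs of peaks whose KL divergence is below threshold."""
    dists = {name: to_distribution(c) for name, c in histograms.items()}
    return [(a, b) for a, b in combinations(sorted(dists), 2)
            if kl_divergence(dists[a], dists[b]) < threshold]


# Hypothetical histograms of three characteristic peaks.
histograms = {
    "peak_a": [10, 20, 40, 20, 10],
    "peak_b": [12, 18, 40, 20, 10],
    "peak_c": [40, 30, 15, 10, 5],
}
pairs = similar_pairs(histograms)
```

Peaks that form a highly similar pair carry largely redundant information, so under this reading one member of each such pair could be dropped when assembling the minimum characteristic peak set.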
Table 7: Cell generation determination results of the simplified model for the SIAISi011-A modeling group
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| SIAISi011-A P5 | 5.33-5.78 (5.56) | 0.83-0.91 (0.89) |
| SIAISi011-A P15 | 15.44-16.06 (15.87) | 0.67-0.71 (0.69) |
| SIAISi011-A P25 | 23.71-24.51 (24.33) | 0.82-0.84 (0.83) |
| SIAISi011-A P35 | 34.67-34.72 (34.68) | 0.88-0.91 (0.90) |
Table 8: Cell generation determination results of the simplified model for the TUSMi002 modeling group
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| TUSMi002 P5 | 5.23-5.34 (5.25) | 0.93-0.94 (0.94) |
| TUSMi002 P15 | 15.31-15.44 (15.31) | 0.68-0.73 (0.71) |
| TUSMi002 P25 | 24.30-24.68 (24.54) | 0.86-0.90 (0.88) |
| TUSMi002 P35 | 34.62-34.70 (34.68) | 0.93-0.95 (0.94) |
Table 9: Cell generation determination results of the simplified model for the SIAISi011-A group to be identified
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| SIAISi011-A P10 | 9.01-9.89 (9.25) | 0.57-0.70 (0.61) |
| SIAISi011-A P20 | 19.89-20.31 (20.09) | 0.31-0.41 (0.34) |
| SIAISi011-A P30 | 30.00-30.55 (30.53) | 0.37-0.42 (0.41) |
Table 10: Cell generation determination results of the simplified model for the TUSMi002 group to be identified
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| TUSMi002 P10 | 6.03-6.82 (6.36) | 0.66-0.77 (0.71) |
| TUSMi002 P20 | 20.09-20.52 (20.35) | 0.31-0.49 (0.40) |
| TUSMi002 P30 | 27.34-28.16 (27.77) | 0.58-0.73 (0.68) |
A simplified "cell generation determination model" was established with the screened minimum characteristic peak set. Its determinations for both the modeling group and the group to be identified were close to the actual cell generations, and no completely wrong conclusion was obtained, so the simplified modeling approach is feasible. The determination results were then evaluated according to the criterion set out above and the accuracy of the model was calculated. The inventors found that the simplified "cell generation determination model" still had a relatively high accuracy: for SIAISi011-A, the accuracy was 84.76% for the modeling group and 77.78% for the group to be identified, with a total accuracy of 81.82%; for TUSMi002, the accuracy was 92.37% for the modeling group and 56.75% for the group to be identified, with a total accuracy of 77.05%.
This result shows that the combination of the four characteristic peaks νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II is the minimum characteristic peak set that can be used to establish a cell generation determination model. It further supports the use of the minimum single-cell SR-FTIR characteristic peak set obtained by the model in a subsequently developed spectrometer for rapid SR-FTIR acquisition, and the determined physiological state of the cells can provide a basis for clinical stem cell therapy.
Example 4
In this example, cross-validation was used to test the ability of the stem cell generation identification model system to identify the generations of stem cells from different sources.
The inventors have established that, for a cell line of any given source, the cell generation identification model can accurately identify any cell generation within the modeling range, and that the "large-range models" Model A and Model A-T, established from cells of two different sources (SIAISi011-A and TUSMi002), accurately determine the other generations of stem cells from their own source. However, the cell generation identification model had not yet been validated for identifying the generations of stem cells from other sources. In other words, for stem cells of any source, it remained to be determined whether a "large-range model" of that source must be built, or whether a known stem cell model of another source, such as Model A or Model A-T, can be used.
To answer this question, the inventors first cross-validated the modeling-group cell data of Model A and Model A-T: Model A was used to identify the relative cell generations of TUSMi002 P5, P15, P25 and P35, and Model A-T was used to identify the relative cell generations of SIAISi011-A P5, P15, P25 and P35. The results are shown in FIG. 9, Table 11 and Table 12.
Table 11: Model A identification results for the cell generations of the TUSMi002 modeling group
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| TUSMi002 P5 | 29.41-30.01 (29.78) | 0.36-0.44 (0.39) |
| TUSMi002 P15 | 23.25-23.86 (23.66) | 0.69-0.77 (0.75) |
| TUSMi002 P25 | 22.00-24.24 (23.05) | 0.68-0.76 (0.73) |
| TUSMi002 P35 | 34.09-34.64 (34.55) | 0.82-0.92 (0.89) |
Table 12: Model A-T identification results for the cell generations of the SIAISi011-A modeling group
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| SIAISi011-A P5 | 19.60-19.96 (19.84) | 0.38-0.51 (0.48) |
| SIAISi011-A P15 | 15.34-16.57 (15.66) | 0.42-0.49 (0.46) |
| SIAISi011-A P25 | 15.06-15.98 (15.09) | 0.42-0.54 (0.47) |
| SIAISi011-A P35 | 31.10-32.20 (31.48) | 0.75-0.79 (0.78) |
The large-range Model A established from SIAISi011-A cannot accurately identify the cell generations of the TUSMi002 modeling group, and likewise the large-range Model A-T established from TUSMi002 cannot accurately identify the cell generations of the SIAISi011-A modeling group. The cell generation identification model therefore cannot accurately identify the cell generations of other sources.
To confirm this result, the inventors used Model A and Model A-T to cross-validate cell data outside the modeling groups: Model A was used to identify the relative cell generations of TUSMi002 P10, P20 and P30, and Model A-T was used to identify the relative cell generations of SIAISi011-A P10, P20 and P30. The results are shown in FIG. 10, Table 13 and Table 14.
Table 13: Model A identification results for the cell generations of the TUSMi002 test group
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| TUSMi002 P10 | 31.14-31.98 (31.59) | 0.50-0.63 (0.58) |
| TUSMi002 P20 | 6.31-6.59 (6.45) | 0.72-0.81 (0.77) |
| TUSMi002 P30 | 34.56-34.70 (34.65) | 0.86-0.90 (0.87) |
Table 14: Model A-T identification results for the cell generations of the SIAISi011-A test group
| Sample | Cell passage number, 95% confidence interval (median) | Probability, 95% confidence interval (median) |
| --- | --- | --- |
| SIAISi011-A P10 | 19.87-20.65 (20.12) | 0.20-0.32 (0.25) |
| SIAISi011-A P20 | 24.79-25.02 (24.92) | 0.27-0.57 (0.34) |
| SIAISi011-A P30 | 18.74-19.63 (19.12) | 0.22-0.42 (0.31) |
Again, Model A could not accurately identify the cell generations of TUSMi002 P10, P20 and P30, and Model A-T could not identify the cell generations of the SIAISi011-A test group.
In summary, the cell generation identification model accurately identifies any stem cell generation within the modeling range of its own source, but cannot identify the generations of stem cells from other sources. It is therefore concluded that, to identify the generations of stem cells from any source, SR-FTIR spectra covering a wide range of generations of that stem cell line must be acquired, and the cell generation identification model, with characteristic peak selection, must be applied to establish a large-range model for that line.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.