CN116312813B - Method and marker for identifying stem cell population generation times

Info

Publication number: CN116312813B
Application number: CN202310572331.1A
Authority: CN
Inventors: 丁燕飞; 何成章; 赵简; 涂君武; 邓钰蕾; 王瑛; 周晓洁; 曲悦荧
Original assignee: Shanghai Advanced Research Institute of CAS; Ruinjin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd; ShanghaiTech University
Current assignee: Shanghai Advanced Research Institute of CAS; Ruinjin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd; ShanghaiTech University
Priority date: 2023-05-22
Filing date: 2023-05-22
Publication date: 2023-08-22
Anticipated expiration: 2043-05-22
Also published as: CN116312813A

Abstract

The application discloses a method and a marker for identifying the generation (passage number) of a stem cell population in any given batch. The identification method uses SR-FTIR technology and a mathematical model that relates the infrared spectrum of stem cells to their physiological state, so that standardized information on stem cell quality can be obtained rapidly, providing the quality evaluation means that clinical application requires for stem cell treatment of related diseases. The application also identifies a correlation between the SR-FTIR spectra of the characteristic regions of cellular proteins and lipids and the physiological state of stem cells, which provides a basis for the later development of a cell characteristic spectrometer that rapidly acquires SR-FTIR spectra for evaluating the physiological state of stem cells and for identifying and confirming their generation.

Description

Method and marker for identifying stem cell population generation times

Technical Field

The application relates to the field of biological medicine, in particular to a method and a marker for identifying the generation times of stem cells.

Background

Stem cells are cells with self-renewal ability and differentiation potential that exist in embryonic and adult tissues. According to their developmental potential they are classified into three types: totipotent stem cells, pluripotent stem cells, and multipotent stem cells. Totipotent stem cells have the highest differentiation potential and can develop into a complete individual. Pluripotent stem cells, including embryonic stem cells and induced pluripotent stem cells, can differentiate into cells of all germ layers but not into extraembryonic structures. Pluripotent stem cells can be used to analyze the key regulatory mechanisms of normal differentiation and development, and are increasingly applied to research into the mechanisms of various complex diseases. Pluripotent stem cells can further differentiate into multipotent stem cells, which are committed to differentiating into only one particular cell type or a few cell types. The application of pluripotent and multipotent stem cells in the biomedical field is continuously expanding.

The use of stem cell-based disease models for screening effective compounds and verifying drug toxicity has greatly changed how drugs are developed and how safety experiments are performed. The proliferation and differentiation potential specific to stem cells makes it possible to repair damaged cells or tissues, and cell preparations made from stem cells have the potential to treat numerous diseases and disabilities caused by cell dysfunction or tissue destruction, including Parkinson's disease, Alzheimer's disease, spinal cord injury, burns, heart disease, diabetes, osteoarthritis, and the like.

When a stem cell preparation is used in clinical cell therapy, the number of passages during preparation is an important factor affecting the activity, clinical effectiveness and safety of the stem cells, because the biological characteristics of stem cells, such as cell morphology, cell markers, chromosome karyotype and cytokines, can change over successive passages. Traditional stem cell quality control methods analyze these biological characteristics one by one. For example, stem cell morphology is examined with an optical microscope; cell activity, cell markers and the like are analyzed by flow cytometry, immunofluorescence staining and similar methods; and the levels of specific genes or proteins in stem cell populations are studied by PCR, ELISA and the like. These existing methods not only depend heavily on the experience of the experimenter, but some of the analyses are also highly invasive: they readily alter the normal growth environment and physiological functions of the cells, or even destroy cell structures and cause irreversible cell damage and necrosis, and they are not suited to the rapid testing of large numbers of cell samples.

Existing high-throughput sequencing technology makes comprehensive characterization of cells possible, but the technology is complex and has many sub-types with differing analytical emphases; detection takes a long time, and data acquisition and analysis are complicated. It also remains a cross-sectional analysis and cannot provide information on the passage history or aging trend of a cell culture. There is therefore still a need for a method that objectively identifies the generation of the stem cells in any given batch of a stem cell preparation, to assist in quality control of the preparation.

Disclosure of Invention

The invention aims to provide a method for identifying the generation times of stem cell populations.

It is another object of the invention to provide a marker set for identifying the generation of stem cell populations.

Another object of the invention is to provide a method of screening for markers useful for identifying the generation of stem cell populations.

To solve the above technical problem, according to a first aspect of the present invention, there is provided a method for identifying the generation of a stem cell population, the method comprising the steps of:

determining, based on a preset generation identification model, the probability that the stem cell population to be identified is at each of N known generations;

comparing the feature quantity of the stem cell population to be identified with the feature quantities of the stem cell populations of the N known generations, respectively, to obtain the generation range to which the stem cell population to be identified belongs;

and obtaining the generation of the stem cell population to be identified from that generation range and from the probability that the stem cell population to be identified is at each generation.

In some preferred embodiments, the value of N is an integer not less than 3, such as 3 or 4.

In some preferred embodiments, the preset generation identification model is constructed by:

constructing a training data set, wherein feature quantities of stem cell populations of the N known generations and the corresponding generation values are obtained, and the feature quantities of a stem cell population comprise the peak values of a plurality of characteristic peaks, and their shift values, obtained by SR-FTIR analysis of the stem cells;

constructing an initial identification model, wherein the initial identification model comprises an independent variable and a dependent variable, the independent variable is a characteristic quantity of the stem cell population, and the dependent variable is the generation number of the stem cell population;

and training the initial identification model based on the training data set to obtain a final generation identification model.

In some preferred embodiments, the initial identification model is trained using the GBDT algorithm to obtain the final generation identification model. In some preferred embodiments, the feature quantity of the stem cells comprises the peak values and corresponding shift values of the following characteristic peaks obtained by SR-FTIR analysis of the stem cells: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.

In some preferred embodiments, the feature quantity of the stem cells further comprises the peak values and corresponding shift values of the following characteristic peaks obtained by SR-FTIR analysis of the stem cells: νas(CH₂), νs(CH₃), νs(CH₂), β-pleated sheet of Amide I, β-pleated sheet of Amide II, "Tyrosine" band, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations of DNA and C-O stretching vibrations of RNA.

In some preferred embodiments, the feature quantity of the stem cell population to be identified and the feature quantities of the stem cell populations of the N known generations are subjected to dimension reduction, and the two are then compared to obtain the generation range of the stem cell population to be identified.

In some preferred embodiments, the dimension reduction process causes the characteristic of the stem cell population to be identified and the characteristic of the N known generations of stem cell populations to be in the same dimension.

In some preferred embodiments, cluster analysis is performed on the dimension-reduced feature quantities of the stem cell population to be identified to obtain the non-outlier points of the stem cell population to be identified; and

cluster analysis is performed on the dimension-reduced feature quantities of the N known-generation stem cell populations to obtain the cluster centers of the N known-generation stem cell populations;

and the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations are compared to obtain the generation range of the stem cell population to be identified.

In some preferred embodiments, the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations are compared, at least two target cluster centers are selected from the cluster centers of the N known-generation stem cell populations, and the generation range is formed from the generations of the known-generation stem cell populations to which the target cluster centers belong.

In some preferred embodiments, a first target cluster center closest to the non-outlier point of the stem cell population to be identified is selected among the cluster centers of the N known generations of stem cell populations;

selecting cluster centers of two stem cell groups of known generation adjacent to the generation to which the first target cluster center belongs, comparing the distances between the cluster centers of the two adjacent stem cell groups of known generation and the non-outlier of the stem cell group to be identified, and selecting the cluster center of the stem cell group of known generation, which is closer to the non-outlier of the stem cell group to be identified, as a second target cluster center;

and forming the generation range from the generations of the known-generation stem cell populations to which the first target cluster center and the second target cluster center belong.

In some preferred embodiments, when the cluster centers of the two known-generation stem cell populations adjacent to the first target cluster center are equally distant from the non-outlier point of the stem cell population to be identified, the generation range is formed from the generations of the known-generation stem cell populations to which the first target cluster center and both adjacent cluster centers belong.
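As an illustration only, the adjacent-generation selection described above can be sketched in Python as follows. The function name `generation_range_adjacent`, the dictionary input (known generation mapped to distance), and the tolerance `tol` used to decide whether two distances are equal are assumptions made for this sketch and are not part of the disclosure. If several cluster centers are equally nearest, this sketch simply takes the lowest generation as the first target.

```python
def generation_range_adjacent(distances, tol=1e-9):
    """Return the known generations that form the generation range.

    distances: dict mapping a known generation (e.g. 5, 10, 15) to the
    distance between its cluster center and the non-outlier point of the
    population to be identified (hypothetical input format).
    """
    generations = sorted(distances)
    # First target: the known generation whose cluster center is nearest.
    first = min(generations, key=lambda g: distances[g])
    i = generations.index(first)
    # Known generations adjacent to the first target (one at the ends, else two).
    neighbours = [generations[j] for j in (i - 1, i + 1)
                  if 0 <= j < len(generations)]
    if not neighbours:
        return [first]
    best = min(distances[g] for g in neighbours)
    # Second target: the nearer neighbour; equal distances keep both neighbours.
    chosen = [g for g in neighbours if abs(distances[g] - best) <= tol]
    return sorted([first] + chosen)


if __name__ == "__main__":
    print(generation_range_adjacent({5: 4.0, 10: 1.0, 15: 2.5, 20: 6.0}))
    print(generation_range_adjacent({5: 2.0, 10: 1.0, 15: 2.0}))
```

The first call has a single nearer neighbour, so the range is formed by two generations; the second call has equally distant neighbours, so all three generations are retained.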

In some preferred embodiments, a first target cluster center closest to the non-outlier point of the stem cell population to be identified and a second target cluster center second closest to that non-outlier point are selected from the cluster centers of the N known-generation stem cell populations; and the generation range is formed from the generations of the known-generation stem cell populations to which the first target cluster center and the second target cluster center belong.

In some preferred embodiments, the cluster analysis is DBSCAN cluster analysis, K-means cluster analysis, or a combination of the two.

In some preferred embodiments, the feature values of the N known generations of stem cell populations subjected to dimension reduction treatment and/or the feature values of the stem cell populations to be identified are subjected to K-means and/or DBSCAN cluster analysis.

In some preferred embodiments, the dimension reduction process is PCA dimension reduction process.

In some preferred embodiments, a weighted average formula is applied to the generation range to which the stem cell population to be identified belongs and to the probability that the stem cell population to be identified is at each generation, so as to obtain the generation of the stem cell population to be identified.

In some preferred embodiments, the weighted average formula is as follows:

$$\text{generation of the stem cell population to be identified} = \frac{\mathrm{ProbP}_{X1} \times X1 + \mathrm{ProbP}_{X2} \times X2 + \cdots + \mathrm{ProbP}_{Xn} \times Xn}{\mathrm{ProbP}_{X1} + \mathrm{ProbP}_{X2} + \cdots + \mathrm{ProbP}_{Xn}};$$

wherein $\mathrm{ProbP}_{X1}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,

$\mathrm{ProbP}_{X2}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$,

$\mathrm{ProbP}_{Xn}$ is the probability that the stem cell population to be identified is at generation $P_{Xn}$.

In some preferred embodiments, the probability that the generation of the stem cell population to be identified is the generation identified by the method is

$$1 - \mathrm{ProbP}_{Xn} - \mathrm{ProbP}_{Xn-1} - \cdots - \mathrm{ProbP}_{X1} \times \mathrm{ProbP}_{X2} / 2.$$

In some preferred embodiments, the weighted average formula is as follows:

$$\text{generation of the stem cell population to be identified} = \frac{\mathrm{ProbP}_{X1} \times X1 + \mathrm{ProbP}_{X2} \times X2}{\mathrm{ProbP}_{X1} + \mathrm{ProbP}_{X2}};$$

wherein $\mathrm{ProbP}_{X1}$ is the probability that the stem cell population to be identified is at generation $P_{X1}$,

$\mathrm{ProbP}_{X2}$ is the probability that the stem cell population to be identified is at generation $P_{X2}$,

$P_{X1}$ is the X1-th generation,

$P_{X2}$ is the X2-th generation, and

the generation range of the stem cell population to be identified is $P_{X1}$-$P_{X2}$.

In some preferred embodiments, the probability that the generation of the stem cell population to be identified is the generation identified by the method is given by the following formula:

$$1 - \mathrm{ProbP}_{Xn} - \mathrm{ProbP}_{Xn-1} - \cdots - \mathrm{ProbP}_{X1} \times \mathrm{ProbP}_{X2} / 2;$$

wherein $P_{X1}$ is the X1-th generation and $P_{X2}$ is the X2-th generation.

In a second aspect of the invention, there is provided a marker panel for identifying the generation of a stem cell population, the marker panel comprising the following characteristic peaks: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.

In some preferred embodiments, the marker panel further comprises any one of the characteristic peaks selected from the group consisting of: νas(CH₂), νs(CH₃), νs(CH₂), β-pleated sheet of Amide I, β-pleated sheet of Amide II, "Tyrosine" band, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations of DNA and C-O stretching vibrations of RNA.

Compared with the prior art, the invention has at least the following advantages:

(1) The method for identifying the generation of a stem cell population provided by the invention adopts SR-FTIR technology for the first time, and further relates the infrared spectrum of stem cells to their physiological state by establishing a mathematical model, so that standardized information on stem cell quality can be obtained rapidly, providing the quality evaluation means that clinical application requires for stem cell treatment of related diseases;

(2) SR-FTIR technology offers diffraction-limited high spatial resolution and high sensitivity in detecting cell-to-cell variation; the identification method of the invention overcomes the shortcomings of traditional biological techniques and can identify the relative physiological generation of stem cells with higher accuracy;

(3) The invention also discovers the correlation between the SR-FTIR spectrum of the corresponding characteristic region of the cell protein and lipid and the physiological state of the stem cells, and provides basis and guarantee for the subsequent development of a cell characteristic spectrometer for rapidly acquiring the SR-FTIR for evaluating the physiological state of the stem cells;

(4) The invention also establishes that a corresponding generation identification model needs to be built for stem cells from each different source.

It is understood that within the scope of the present invention, the above-described technical features of the present invention and the technical features specifically described below (e.g., in the examples) may be combined with each other to constitute new or preferred technical solutions. For reasons of space, these combinations are not described one by one here.

Drawings

One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.

FIG. 1 is a schematic diagram of a pretreatment process of single cell SR-FTIR spectrum data in an embodiment according to the invention;

FIG. 2 is a schematic diagram of the cell generation identification model process in an embodiment according to the invention;

FIG. 3 shows histograms and scatter plots of the modeling results of Model A, Model B and Model C; A: Model A modeling results; B: Model B modeling results; C: Model C modeling results. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 4 shows histograms and scatter plots of the model verification results of Model A, Model B and Model C; A: Model A verification results, SIAISI011-A P10, P20 and P30 from left to right; B: Model B verification results, SIAISI011-A P10, P20, P30 and P35 from left to right; C: Model C verification results, SIAISI011-A P, P10, P20 and P30 from left to right. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 5 shows histograms and scatter plots of the modeling results of Model A-T, Model B-T and Model C-T; A: Model A-T modeling results; B: Model B-T modeling results; C: Model C-T modeling results. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 6 shows histograms and scatter plots of the model verification results of Model A-T, Model B-T and Model C-T; A: Model A-T verification results, TUSMi002 P10, P20 and P30 from left to right; B: Model B-T verification results, TUSMi002 P10, P20, P30 and P35 from left to right; C: Model C-T verification results, TUSMi002 P5, P10, P20 and P30 from left to right. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 7 shows the modeling results of the simplified model; A: SIAISI011-A stem cell modeling results; B: TUSMi002 stem cell modeling results. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 8 shows the determinations of the simplified model for the populations to be identified in an embodiment according to the invention; A: determination results of the simplified model for the SIAISI011-A stem cell populations to be identified, P10, P20 and P30 from left to right; B: determination results of the simplified model for the TUSMi002 stem cell populations to be identified, P10, P20 and P30 from left to right. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 9 shows the model determinations for cross-validation on the modeling sets using Model A and Model A-T in an embodiment according to the invention; A-D: histograms and scatter plots of the results determined by Model A for TUSMi002 stem cells P5, P15, P25 and P35, in that order; E-F: histograms and scatter plots of the results determined for SIAISI011-A stem cells P5, P15, P25 and P35, in that order. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval;

FIG. 10 shows the model determinations for cross-validation on the test sets using Model A and Model A-T in an embodiment according to the invention; A-D: histograms and scatter plots of the results determined by Model A for TUSMi002 stem cells P5, P15, P25 and P35, in that order; E-F: histograms and scatter plots of the results determined by Model A-T for SIAISI011-A stem cells P5, P15, P25 and P35, in that order. The histograms of the generation distribution determined by the model are fitted with Gaussian curves; the scatter plots of model-determined generation and accuracy for each single cell are marked with the median and 95% confidence interval.

Detailed Description

The invention provides a machine-learning-based method for identifying the generation of stem cells. The method can identify the generation of a given batch of stem cells and can be used to assist in stem cell quality control.

Model and method for identifying stem cell population generation times

Embodiments of the present invention relate to a method of identifying the generation of a stem cell population, which specifically comprises steps S1-S3.

S1, determining, based on a preset generation identification model, the probability that the stem cell population to be identified is at each of N known generations.

The preset generation identification model used in step S1 is constructed by the following sub-steps S1.1-S1.3.

S1.1, constructing a training data set, wherein the training data set is feature quantity information, measured in advance, of cell populations with known passage numbers. The feature quantity information consists of the peak values of the characteristic peaks and the corresponding shift values obtained by SR-FTIR analysis of cell populations of N known passage numbers, where N is not less than 2, preferably not less than 3, and may be, for example, 3, 4, 5 or 6. For example, the training data set may be composed of the feature quantities of cell populations at four preset generations P5, P10, P15 and P20, or at six preset generations P5, P10, P15, P20, P25 and P30. It should be appreciated that the generations and the number of cell populations selected for the training data set are not limited thereto.

In some embodiments, the feature quantity of the stem cell population includes at least the peak values and corresponding shift values of the following characteristic peaks obtained by SR-FTIR analysis of the stem cell population: νas(CH₃), N-H bending of Amide I, α-helix of Amide I and α-helix of Amide II.

S1.2, constructing an initial identification model, wherein the initial identification model uses the feature quantities of the stem cell populations, such as the peak values of their characteristic peaks and the corresponding shift values, as independent variables and the corresponding passage numbers of the cell populations as the dependent variable.

S1.3, training the initial identification model based on the training data set to obtain a final generation identification model.

In step S1.3, the initial identification model is trained with the GBDT algorithm to obtain the final generation identification model. "GBDT" (Gradient Boosting Decision Tree) refers to a gradient boosting decision tree. Gradient boosting is an algorithm in the boosting family of ensemble methods that adds each new learner along the direction of gradient descent; GBDT is an iterative decision tree algorithm composed of multiple decision trees, whose outputs are accumulated to give the final answer.
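The accumulation of tree outputs described above can be illustrated with a deliberately small sketch. It is not the generation identification model itself, which outputs a probability for each known generation and whose implementation and hyperparameters are not given here. Instead it shows gradient boosting for a squared-error regression on a single feature, with one-split trees (stumps) fitted to the current residuals, which are the negative gradient of that loss. All function names, the learning rate and the number of trees are assumptions made for illustration, and the input must contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv


def predict(model, x):
    """Initial value plus the shrunken sum of all stump outputs."""
    base, rate, stumps = model
    return base + rate * sum(lv if x <= t else rv for t, lv, rv in stumps)


def fit_boosting(xs, ys, n_trees=50, rate=0.3):
    """Each round fits a stump to the residuals left by the previous rounds."""
    model = (sum(ys) / len(ys), rate, [])
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # toy feature values
    ys = [5.0, 5.0, 10.0, 10.0, 15.0, 15.0]   # toy generation labels
    model = fit_boosting(xs, ys)
    print([round(predict(model, x), 2) for x in xs])
```

Each added stump corrects part of the error left by the earlier ones, so the training error falls as trees are added; a practical classifier would use a multi-class loss and deeper trees over all spectral features.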

The generation of a cell population can also be identified using the above generation identification model alone. However, the inventors found that with a generation identification model obtained from the GBDT algorithm alone, redundant information interferes with generation identification and lowers the accuracy of the model. For example, when a cell population of unknown generation is obtained, the GBDT algorithm alone yields the probability that the cell population to be identified is at each generation, and these probabilities are then averaged with weights; the probabilities of several preset generations that correlate poorly with the cell population to be identified are thereby carried into the calculation, and accuracy is reduced.

For example, suppose the model is trained on cell population data for four preset generations P5, P10, P15 and P20 to obtain generation identification model A1. When model A1 is used for identification, the probabilities that the cell population to be identified is at the four preset generations P5, P10, P15 and P20 are denoted ProbP5, ProbP10, ProbP15 and ProbP20, respectively. If these probabilities are used directly in a weighted average, the identified generation of the cells is obtained by the following Formula I:

$$\text{generation of the stem cell population to be identified} = \frac{\mathrm{ProbP5} \times 5 + \mathrm{ProbP10} \times 10 + \mathrm{ProbP15} \times 15 + \mathrm{ProbP20} \times 20}{5 + 10 + 15 + 20} \quad \text{(Formula I)}$$

However, among the four preset generations P5, P10, P15 and P20, the cell populations of some preset generations correlate only weakly with the cell population to be identified. If the probabilities for those weakly correlated preset generations are included in the weighted average, the result will inevitably deviate from the true passage number of the cell population to be identified. The inventors therefore optimized the model by adding a step that identifies the generations $P_{X1}$ and $P_{X2}$ of the two preset-generation cell populations most relevant to the true characteristics of the cell population to be identified; the true passage number of the cells to be identified should lie within the generation range $P_{X1}$-$P_{X2}$. Only the two most relevant generations $P_{X1}$ and $P_{X2}$ and their corresponding probabilities are then used in the calculation, and the other, less relevant terms are removed from the formula, which improves identification accuracy. The generations $P_{X1}$ and $P_{X2}$ of the two preset-generation cell populations most relevant to the characteristics of the cell population to be identified, that is, the generation range containing the true passage number of the cells to be identified, can be obtained by the following step S2.

S2, comparing the characteristic quantity of the stem cell group to be identified with the characteristic quantity of the stem cell group of N known generations respectively to obtain the generation range of the stem cell group to be identified.

In step S2, in order to remove data with little influence and to make comparison easier, the feature quantity of the stem cell population to be identified and the feature quantities of the stem cell populations of the N known generations are preferably subjected to dimension reduction so that they are in the same dimension, and the two are then compared to obtain the generation range to which the stem cell population to be identified belongs. Preferably, the dimension reduction is PCA dimension reduction.

The feature quantity of the stem cell population to be identified may optionally be compared with the feature quantities of the stem cell populations of the N known generations as follows:

K-means and/or DBSCAN cluster analysis is performed on the dimension-reduced feature quantities of the stem cell populations of the N known generations; and the generation range $P_{X1}$-$P_{X2}$ to which the stem cell population to be identified belongs is obtained by comparing the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N known-generation stem cell populations.
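A minimal sketch of the distance comparison is given below, assuming the spectra have already been reduced to low-dimensional coordinates (for example, two PCA scores per cell). Two simplifications are made and should be read as assumptions rather than as the disclosed procedure: the cluster center of each known generation is taken as the mean of its points, and outliers in the population to be identified are removed with a density filter that mimics the DBSCAN core-point criterion (at least `min_pts` points within radius `eps`) without performing a full DBSCAN clustering. Summarizing the retained points by their mean is likewise an assumption, and all function and parameter names are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def non_outliers(points, eps, min_pts):
    """Keep points having at least min_pts points (itself included) within eps."""
    return [p for p in points
            if sum(math.dist(p, q) <= eps for q in points) >= min_pts]


def distances_to_known(query_points, known, eps=1.0, min_pts=3):
    """Distance from the non-outlier mean of the query population to each
    known-generation cluster center. `known` maps generation -> points."""
    kept = non_outliers(query_points, eps, min_pts)
    if not kept:
        raise ValueError("no non-outlier points; adjust eps or min_pts")
    q = centroid(kept)
    return {gen: math.dist(q, centroid(pts)) for gen, pts in known.items()}


if __name__ == "__main__":
    known = {
        10: [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2)],
        20: [(5.0, 5.0), (5.2, 5.0), (5.0, 5.2)],
    }
    query = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (9.0, 9.0)]  # last point is isolated
    print(distances_to_known(query, known, eps=0.5, min_pts=3))
```

The resulting dictionary of distances per known generation is the input that the selection of target cluster centers operates on.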

In a preferred embodiment, the distances of the non-outlier points of the stem cell population to be identified and the cluster centers of the stem cell populations of N known generations are compared, and the cluster center of at least two stem cell populations of known generations close to the non-outlier points of the stem cell population to be identified is selected as the target cluster center among the cluster centers of the stem cell populations of N known generations. In one example, the distances between the cluster centers of 1 to N stem cell populations of known generation and the non-outlier point of the stem cell population to be identified are denoted as d1, d2 … … dn (where the d values in d1, d2 … … dn are not equal), the cluster center of the stem cell population of known generation with the smallest d value is selected as the first target cluster center, the cluster center of the stem cell population of known generation with the small d value (the second small) is selected as the second target cluster center, and the generation ranges of the stem cell population to be identified can be determined by using the generation numbers of the stem cell populations of known generation to which the first target cluster center and the second target cluster center belong. Assuming that the minimum d value is d1, the clustering center of the stem cell group of the 1 st known generation is the first target clustering center, assuming that the d value of the second known generation is d2, the clustering center of the stem cell group of the 2 nd known generation is the second target clustering center, and the generation range of the stem cell group to be identified is between the generation corresponding to the clustering center of the stem cell group of the 1 st known generation and the generation corresponding to the clustering center of the stem cell group of the 2 nd known generation. 
In the present invention, "next" and "second" refer to the value ranked immediately after the extreme value; for the distances d above, this is the second-smallest value.

In one example, the distances between the cluster centers of the 1st to Nth known-generation stem cell populations and the non-outlier points of the stem cell population to be identified are denoted d1, d2, …, dn (where some of the d values may be equal). In this case the cluster center corresponding to the smallest d value may not be unique, for example when the smallest d values are d1, d2 and d3 with d1 = d2 = d3. Alternatively, the cluster center corresponding to the smallest d value may be unique while the cluster center corresponding to the second-smallest d value is not, for example when the smallest d value is d1 and the second-smallest d values are d2 and d3 with d2 = d3. In either case, the minimum generation range that covers the generations of the two or three known-generation stem cell populations concerned is taken as the range to which the generation of the stem cell population to be identified belongs.
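The selection rule described above, including the handling of equal distances, can be illustrated with a minimal sketch. The function name `generation_range` and the numeric distances are hypothetical and are not taken from the application; ties are detected by exact equality, which is an assumption made for simplicity.

```python
def generation_range(distances, generations):
    """Return (low, high): the smallest generation range covering every
    cluster center tied for the smallest distance and, when that center is
    unique, every center tied for the second-smallest distance."""
    if len(distances) != len(generations) or len(distances) < 2:
        raise ValueError("need matching lists with at least two known generations")
    ranked = sorted(set(distances))
    smallest = ranked[0]
    selected = [g for d, g in zip(distances, generations) if d == smallest]
    # Only look at the second-smallest distance when the nearest center is unique.
    if len(selected) == 1 and len(ranked) > 1:
        second = ranked[1]
        selected += [g for d, g in zip(distances, generations) if d == second]
    return min(selected), max(selected)


# Hypothetical distances to the P5, P10, P15 and P20 cluster centers:
# P15 is nearest, P10 and P20 are tied for second nearest.
print(generation_range([4.0, 2.5, 1.0, 2.5], [5, 10, 15, 20]))
```

With measured data, exact ties are unlikely, so a practical implementation would probably compare distances within a small tolerance instead.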

Once the generations P_x1 and P_x2 corresponding to the two known-generation cell populations most closely related to the characteristics of the cell population to be identified are obtained, the generation number of the stem cell population can be identified by the following step S3.

S3, obtaining the generation number of the stem cell group to be identified according to the generation number range of the stem cell group to be identified and the probability that the stem cell group to be identified is in each generation number.

Specifically, the generation number of stem cell populations to be identified can be obtained by the following formula II:

Generation number of the stem cell population to be identified = (ProbP_X1 × X1 + ProbP_X2 × X2 + … + ProbP_Xn × Xn) / (ProbP_X1 + ProbP_X2 + … + ProbP_Xn); Formula II.

where ProbP_X1 is the probability that the stem cell population to be identified is at generation P_X1,

ProbP_X2 is the probability that the stem cell population to be identified is at generation P_X2,

ProbP_Xn is the probability that the stem cell population to be identified is at generation P_Xn.

The generation values of the cluster centers of the at least two known-generation stem cell populations closest to the non-outlier points of the stem cell population to be identified, obtained in step S2, are substituted into formula II to obtain the generation number of the stem cell population to be identified.
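Formula II is a probability-weighted average of the selected generation values. The sketch below shows the arithmetic only; the function name `weighted_generation` and the probabilities 0.3 and 0.6 are hypothetical illustration values, not results from the application.

```python
def weighted_generation(probabilities):
    """Formula II: probabilities maps a known generation number X to ProbP_X."""
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(x * p for x, p in probabilities.items()) / total


# Hypothetical case: the population lies between P10 and P15.
estimate = weighted_generation({10: 0.3, 15: 0.6})
print(round(estimate, 2))
```

Dividing by the sum of the probabilities means the selected probabilities do not have to add up to 1, which matters when only two or three of the N generations are substituted.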

In general, the cluster center of the known-generation stem cell population that is second closest to the non-outlier points of the stem cell population to be identified is one of the two cluster centers whose generations are adjacent to the generation of the closest cluster center. For example, when the model is trained with the cell data of the four preset generations P5, P10, P15 and P20, and the stem cell population to be identified is closest to the cluster center of the P15 stem cell population, the second-closest cluster center belongs to at least one of the P10 and P20 generations, and possibly to both. The generation range to which the stem cell population to be identified belongs can then be determined directly by comparing which of the P10 and P20 cluster centers is closer to the non-outlier points of the stem cell population to be identified. That is, the generation range of the stem cell population to be identified is determined by comparing the distances from its non-outlier points to the cluster centers of the two generations adjacent to the generation of the closest cluster center.

In some examples, when more than one cluster center of a known-generation stem cell population (for example 2 or 3) is closest to the non-outlier points of the stem cell population to be identified, the two generation values P that bound the minimum generation range covering these 2 or 3 known-generation stem cell populations, and the probabilities ProbP that the stem cell population to be identified is at those two generations, are substituted into formula II to obtain the generation number and the identification probability of the stem cell population to be identified.

In some examples, when exactly one cluster center of a known-generation stem cell population is closest to the non-outlier points of the stem cell population to be identified, but more than one cluster center (for example 2 or 3) is second closest, the generation value P corresponding to the closest cluster center and the probability Prob that the stem cell population to be identified is at that generation, together with the generation values P corresponding to the second-closest cluster centers and the probabilities Prob that the stem cell population to be identified is at those generations, are substituted into formula II to obtain the generation number of the stem cell population to be identified.

In some examples, exactly one cluster center of a known-generation stem cell population is closest to the cluster center of the stem cell population to be identified, and exactly one is second closest. In this case, the generation number of the stem cell population to be identified can be calculated by the following formula III.

Generation of the stem cell population to be identified = (Prob P_X1 × P_X1 + Prob P_X2 × P_X2) / (Prob P_X1 + Prob P_X2); Formula III.

In formula III, Prob P_X1 is the probability that the stem cell population to be identified is at generation X1,

Prob P_X2 is the probability that the stem cell population to be identified is at generation X2,

P_X1 is the generation number of generation X1,

P_X2 is the generation number of generation X2,

X1 ≠ X2,

and the generation range of the stem cell population to be identified is P_X1 to P_X2.

On the other hand, the probability that the generation number of the stem cell population to be identified equals the generation number calculated by formula II can be obtained by the following formula III:

probability = 1-ProbP _Xn -ProbP _Xn-1 -……-ProbP _X1 ×ProbP _X2 2; equation III.

Marker set

Embodiments of the invention also relate to a marker panel for identifying the generation number of stem cell populations, which includes the following characteristic peaks: νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II.

In the invention, the generation number of a stem cell population can be identified by using SR-FTIR spectral analysis to collect the peak values and shifts of the marker characteristic peaks of a control stem cell population and of the stem cell population to be identified, and comparing them; the marker characteristic peaks at least comprise νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II. Preferably, the marker panel further comprises at least one characteristic peak selected from the group consisting of: νas(CH₂), νs(CH₃), νs(CH₂), β-pleated sheet of Amide I, β-pleated sheet of Amide II, "Tyrosine" band, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations: DNA and C-O stretching vibrations: RNA.

Self model building group

Embodiments of the invention also relate to cross-validation of the cell generation identification models established for stem cells from different sources. The inventors found that, to identify the generation of stem cells from any source, SR-FTIR spectra of the corresponding generations of that stem cell population need to be acquired, and a generation identification model specific to that stem cell population needs to be established by applying the method for constructing the generation identification model described above.

The present invention will be further described with reference to specific embodiments in order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. The experimental methods, in which specific conditions are not noted in the following examples, are generally conducted under conventional conditions or under conditions recommended by the manufacturer. Percentages and parts are weight percentages and parts unless otherwise indicated. The experimental materials and reagents used in the following examples were obtained from commercial sources unless otherwise specified.

Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is to be noted that the terms used herein are used merely to describe specific embodiments and are not intended to limit exemplary embodiments of the application.

Example 1

In this example, a set of reliable SR-FTIR spectral data was collected for different cell generations of stem cells, demonstrating that SR-FTIR is able to identify differences in biochemical composition or structure of stem cells of different generations of the same source. The specific implementation steps are as follows:

Two stem cell lines derived from healthy people (SIAISI011-A, TUSMi002) are selected, and different cell generations of the continuously passaged stem cells (P5±2, P10±2, P15±2, P20±2, P25±2, P30±2, P35±2) are chosen for each. Several hundred stem cells of each generation are seeded on a calcium fluoride slide coated with Matrigel, fixed after 24 hours of culture and dried completely at room temperature, and SR-FTIR spectra are acquired at the BL01B beamline of the Shanghai Synchrotron Radiation Facility. For each group of samples, single-cell spectra of at least 60 cells are collected at random, with the aperture set to approximately the diameter of a single cell, a spectral range of 4000-900 cm⁻¹, 64 scans and a spectral resolution of 4 cm⁻¹. All SR-FTIR data are subjected to automatic baseline correction and 15-point Savitzky-Golay (SG) smoothing, RMieS-EMSC is then used for Mie scattering correction, and finally the second derivative spectrum is obtained with the Savitzky-Golay algorithm. The data processing flow chart is shown in FIG. 1.
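The smoothing and second-derivative steps of this workflow can be sketched with SciPy's `savgol_filter`. This is an illustrative sketch only: the baseline correction and the RMieS-EMSC scattering correction are omitted, the polynomial order of 3 and the 2 cm⁻¹ grid spacing are assumptions not stated in the application, and the spectrum is a synthetic Gaussian band rather than measured data.

```python
import numpy as np
from scipy.signal import savgol_filter


def second_derivative_spectrum(absorbance, window=15, polyorder=3, spacing=1.0):
    """15-point Savitzky-Golay smoothing followed by an SG second derivative.

    polyorder and spacing are assumed values; baseline and Mie scattering
    corrections are deliberately left out of this sketch.
    """
    smoothed = savgol_filter(absorbance, window_length=window, polyorder=polyorder)
    return savgol_filter(smoothed, window_length=window, polyorder=polyorder,
                         deriv=2, delta=spacing)


# Synthetic single band centered at 1654 cm^-1 on a 900-4000 cm^-1 grid.
wavenumbers = np.arange(900.0, 4000.0, 2.0)
synthetic = np.exp(-((wavenumbers - 1654.0) / 15.0) ** 2)

d2 = second_derivative_spectrum(synthetic, spacing=2.0)
# An absorption band appears as a minimum in the second derivative spectrum.
peak_position = wavenumbers[np.argmin(d2)]
print(peak_position)
```

In a second derivative spectrum, the position of the minimum gives the peak shift and its depth relates to the peak value, which is how the characteristic peak data used later could be read off.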

Different wavenumber ranges of the second derivative spectrum characterize different types of chemical bonds: the lipid region (3000-2830 cm⁻¹), the protein region (1770-1475 cm⁻¹) and the nucleic acid region (1300-1000 cm⁻¹). Each region contains a series of known characteristic peaks of specific chemical bonds; the range of each representative characteristic peak is shown in Table 1.

In Table 1 above, νas represents asymmetric stretching vibration and νs represents symmetric stretching vibration;

νas(CH₃) represents the asymmetric stretching vibration of methyl;

νas(CH₂) represents the asymmetric stretching vibration of methylene;

νs(CH₃) represents the symmetric stretching vibration of methyl;

νs(CH₂) represents the symmetric stretching vibration of methylene;

N-H bending of Amide I represents the N-H bending vibration of the amide I band;

α-helical of Amide I represents the α-helix of the amide I band;

β-pleated sheet of Amide I represents the β-sheet of the amide I band;

α-helical of Amide II represents the α-helix of the amide II band;

β-pleated sheet of Amide II represents the β-sheet of the amide II band;

"Tyrosine" band represents the tyrosine band;

νas(PO₂⁻) represents the asymmetric stretching vibration of PO₂⁻;

νs(PO₂⁻) represents the symmetric stretching vibration of PO₂⁻;

C-O stretching vibrations: DNA represents the C-O bond stretching vibration of DNA;

C-O stretching vibrations: RNA represents the C-O bond stretching vibration of RNA.

The peak value and shift of a characteristic peak represent the content and strength of the corresponding chemical bond in each single cell: a higher peak value indicates a higher content of the chemical bond in the cell, and a shift to higher wavenumber indicates an increase in the strength of the chemical bond. The shift and peak value data of the characteristic peaks in each single-cell SR-FTIR spectrum (νas(CH₃), νas(CH₂), νs(CH₃), νs(CH₂), N-H bending of Amide I, α-helical of Amide I, β-pleated sheet of Amide I, α-helical of Amide II, β-pleated sheet of Amide II, "Tyrosine" band, C4-C5 and C=N stretching in imidazole ring of RNA, νas(PO₂⁻), νs(PO₂⁻), C-O stretching vibrations: DNA, C-O stretching vibrations: RNA) were taken and subjected to analysis of variance (Tables 2-1 to 2-4), demonstrating that SR-FTIR can identify differences in the biochemical composition or structure of stem cells of the same source but different generations.
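An analysis of variance of one characteristic peak across generations could look like the following sketch, which uses a one-way ANOVA from SciPy. The application does not state which ANOVA design was used, and the peak values below are invented illustration numbers, not data from Tables 2-1 to 2-4.

```python
from scipy.stats import f_oneway

# Hypothetical peak values of one characteristic peak, grouped by generation.
peak_by_generation = {
    "P5": [0.82, 0.79, 0.85, 0.81],
    "P15": [0.91, 0.94, 0.90, 0.93],
    "P25": [1.02, 1.05, 0.99, 1.03],
}

# One-way ANOVA: does the mean peak value differ between generations?
result = f_oneway(*peak_by_generation.values())
print(f"F = {result.statistic:.1f}, p = {result.pvalue:.2e}")
```

A small p-value for a given peak would indicate that its value differs between at least two generations, which is the kind of evidence the paragraph above refers to.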

Example 2

In this embodiment, a data classification model is constructed, an identification system for evaluating the relative physiological state of stem cells is established, and the relative cell generation times and the identification probability of single cells are confirmed through data weighted analysis.

(1) Construction of cell generation identification model and cell generation identification method

Constructing a cell generation identification model first requires collecting several sets of cell data of known generation for comparison. These are the characteristic peak value and shift data collected by the method of Example 1, and they constitute the training data set. A GBDT classification model is used as the initial generation identification model, and it is trained with the training data set to obtain the final cell generation identification model.
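As a rough illustration of this training step, the sketch below fits scikit-learn's `GradientBoostingClassifier` (one common GBDT implementation, chosen here as an assumption) on synthetic features standing in for the characteristic peak value and shift data. The feature values, the four features per cell and the default hyperparameters are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
generations = np.array([5, 15, 25, 35])

# Synthetic stand-in for the training data set: 60 cells per known
# generation, 4 features per cell (e.g. peak values and shifts).
X = np.vstack([rng.normal(loc=g / 10.0, scale=0.2, size=(60, 4))
               for g in generations])
y = np.repeat(generations, 60)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y)

# Per-cell probability of belonging to each known generation;
# columns follow the order of model.classes_.
new_cells = rng.normal(loc=1.5, scale=0.2, size=(10, 4))
probabilities = model.predict_proba(new_cells)
mean_probability = dict(zip(model.classes_.tolist(), probabilities.mean(axis=0)))
print(mean_probability)
```

The per-generation probabilities returned by `predict_proba` play the role of the Prob values that are later substituted into formula II.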

To improve the accuracy of the model, the inventors established an additional method that provides further information for the classification; this method is referred to as "feature extraction comparison".

The specific steps of the feature extraction comparison method are as follows: redundant information in the data of the cells to be identified is removed by PCA dimension reduction, the data are clustered with DBSCAN, and outlier cell data are removed; the stem cell information of known generations in the training data set is reduced by PCA to the same dimension as the cells to be identified and clustered with K-means to find the core features (cluster centers) of the cell data; the features of the cell data to be identified are then compared with the known-generation cell data to infer, in a fuzzy manner, the generation of the current cell data. The fuzzy inference result is combined with the GBDT determination, and a weighted average is applied to infer relatively accurately the generation of the batch of cells and its probability (see FIG. 2 for the specific process).
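The feature extraction comparison steps can be sketched with scikit-learn as follows. Several choices here are assumptions rather than details from the application: a single PCA fitted on the known-generation data is reused for the cells to be identified, two components are kept, DBSCAN uses `eps=0.3` and `min_samples=5`, one K-means cluster is fitted per known generation, and the mean of the non-outlier points is compared with the cluster centers. All data are synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic known-generation data: 60 cells x 6 features per generation.
known = {g: rng.normal(loc=g / 10.0, scale=0.1, size=(60, 6))
         for g in (5, 15, 25, 35)}
# Synthetic batch to identify, lying between P15 and P25.
unknown = rng.normal(loc=2.0, scale=0.1, size=(60, 6))
unknown[0] += 10.0  # one artificial outlier cell

# Dimension reduction (assumption: one PCA shared by both data sets).
pca = PCA(n_components=2).fit(np.vstack(list(known.values())))

# DBSCAN labels outliers as -1; keep only the non-outlier cells.
projected_unknown = pca.transform(unknown)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(projected_unknown)
inliers = projected_unknown[labels != -1]

# K-means cluster center of each known generation in the same reduced space.
centers = {
    g: KMeans(n_clusters=1, n_init=10, random_state=0)
    .fit(pca.transform(x)).cluster_centers_[0]
    for g, x in known.items()
}

# Distance from the non-outlier cells (their mean) to each cluster center.
distances = {g: float(np.linalg.norm(inliers.mean(axis=0) - c))
             for g, c in centers.items()}
nearest_two = sorted(distances, key=distances.get)[:2]
print(sorted(nearest_two))
```

The two nearest generations bound the generation range, which can then be combined with the GBDT probabilities in the weighted average.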

A model is established with the SIAISI011-A data. The data of P5, P15, P25 and P35 are selected as the pre-input cell data sets of known generation for comparison, and two kinds of model are obtained according to the number of pre-input data sets: a four-point data set (pre-input P5, P15, P25, P35; Model A for short) and three-point data sets (pre-input P5, P15, P25, Model B for short; pre-input P15, P25, P35, Model C for short). When new P15 cell data (not used in modeling) are then input, the GBDT classification model calculates the similarity between the input P15 cell data and each group of data in the known modeling data set; combined with the generation range of the cell data given by the feature extraction comparison method, the generation of the new P15 cells is identified with high accuracy. Other cell data (such as P5, P25 and P35) are input in the same way, and accurate generation information is obtained. The results are shown in FIG. 3 and Table 3.

(2) Accuracy verification

To verify the accuracy of Model A, Model B and Model C described above, the models were validated using other SIAISI011-A generations as the re-entered cell data sets; the results are shown in FIG. 4 and Table 4.

From the above results, it can be seen that when other generations are used as the validation set, the four-point data set Model A can accurately determine the relative cell generations of SIAISI011-A P and P20, i.e. the cell generation determined by the model is close to the actual cell generation. In addition, the three-point data set Model B and Model C can accurately determine cell generations within their modeling ranges: Model B can accurately determine the SIAISI011-A P generation, and Model C can accurately determine the SIAISI011-A P generation. Thus the model can accurately determine a re-entered cell generation as long as it lies within the range of the pre-input cell generations; in other words, the cell generation identification method can accurately determine any cell generation of the same source as long as the pre-input cell generation range is wide enough.

In addition, the accuracy of the determination results of the four-point data set Model A was analyzed. The cell generation identified by the model is compared with the actual cell generation: a determination within ±2 generations of the actual cell generation is counted as correct, and otherwise as incorrect. Accuracy = number of correctly determined cells / total number of cells × 100%. The accuracy of the model on the modeling set is 97.03%, the accuracy on validation with other generations of stem cells from the same source reaches 83.63%, and the total accuracy is 91.40%.
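The accuracy criterion above (a determination counts as correct when it is within ±2 generations of the actual generation) reduces to a short calculation. The function name `tolerance_accuracy` and the example values are hypothetical.

```python
def tolerance_accuracy(predicted, actual, tolerance=2.0):
    """Percentage of cells whose predicted generation is within
    +/- tolerance generations of the actual generation."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return 100.0 * correct / len(predicted)


# Hypothetical per-cell determinations against their actual generations.
accuracy = tolerance_accuracy([9.3, 13.1, 20.1, 26.5], [10, 10, 20, 30])
print(accuracy)  # 50.0: two of the four determinations fall within +/-2
```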

Similarly, to verify the universality of the method, the inventors used a cell line from another source, TUSMi002 (P5±2, P10±2, P15±2, P20±2, P25±2, P30±2, P35±2), for the same model establishment and verification. The TUSMi002 P5, P15, P25 and P35 data are pre-input as the modeling data sets of known generation for comparison, and two kinds of model are obtained according to the number of pre-input data sets: a four-point data set (pre-input P5-P15-P25-P35; Model A-T for short) and three-point data sets (pre-input P5-P15-P25, Model B-T for short; pre-input P15-P25-P35, Model C-T for short). The P5, P15, P25 and P35 data are input in turn into the cell generation determination model, and the data similarity to each modeling data set obtained by the GBDT classification model is combined with the cell generation inference given by the feature extraction comparison model. The conclusion is that the model can accurately determine the cell generation of the input modeling group and can be used as a TUSMi002 cell generation determination model (FIG. 5, Table 5).

Other generations of the same source, TUSMi002 P10, P20 and P30, were used as the cell data sets to be identified to test the accuracy of Model A-T, Model B-T and Model C-T in cell generation determination; the results are shown in FIG. 6 and Table 6.

From the above results, it can be seen that Model A-T can accurately determine the cell generations of P10, P20 and P30, while Model B-T accurately infers the cell generations of P10 and P20 and Model C-T accurately infers the cell generations of P20 and P30. That is, Model B-T and Model C-T can each accurately determine the cell generations within their own modeling range. This result again confirms the previous conclusion: as long as the range of pre-input cell generations is wide enough, the "cell generation determination model" of the invention can accurately determine any cell generation of the same source. The accuracy of the "wide-range model" Model A-T was obtained by analysis and calculation: the accuracy on the modeling set is 97.88%, the accuracy on validation with other generations of stem cells from the same source reaches 64.60%, and the total accuracy is 83.57%.

In summary, a "cell generation identification model" based on SR-FTIR spectral characteristic peak information has been established and verified with a classification model algorithm. Given a cell line of any one source, a model is established by collecting SR-FTIR spectral characteristic peak information of stem cells of different generations over a sufficient range, so that the relative physiological state of stem cells of other generations from the same source can be determined with high accuracy, realizing rapid, panoramic, standardized quality evaluation.

Example 3

In this embodiment, the model is simplified by deep analysis of the characteristic peaks involved in the model, and selecting the characteristic peak regions with high weights.

The similarity of the characterization information among the characteristic peaks of the three regions (lipid, protein and nucleic acid) is calculated by KL divergence, a measure of the degree of matching between two distributions. First, the characteristic peak data are normalized; second, the KL divergence is calculated pairwise for the characteristic peaks, and two characteristic peaks are judged to be highly similar when KL < 0.1. In this way, the minimum characteristic peak data set required by the cell generation determination model (hereinafter, the minimum characteristic peak set) is screened out to obtain a simplified cell generation determination model. The inventors found that the characteristic peak combination α-helical of Amide I with N-H bending of Amide I, or α-helical of Amide I with β-pleated sheet of Amide I, can characterize the cell information of SIAISI011-A to a high degree, and that the combination of the four characteristic peaks νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II can characterize the cell information of TUSMi002 to a high degree. Finally, taking the union of the characteristic peaks screened in SIAISI011-A and TUSMi002 gives νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II as the minimum characteristic peak set that characterizes stem cell information to the greatest extent. To confirm whether this minimum characteristic peak set can be used to build a simplified "cell generation determination model", we applied this set of characteristic peaks, performed cell generation determination with the "cell generation determination model", and drew the histogram of the model-determined cell generation distribution and the scatter diagram of model-determined cell generation and probability in the same way (FIGS. 7 and 8; Tables 7 to 10).
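A pairwise KL divergence screen of this kind could be sketched as follows. The application does not say how the distributions are built from the normalized peak data, so this sketch assumes min-max normalization, a shared 20-bin histogram and a small smoothing constant; `scipy.stats.entropy(p, q)` then returns the KL divergence of p from q. The feature data are synthetic.

```python
import numpy as np
from scipy.stats import entropy


def kl_between_features(a, b, bins=20, eps=1e-9):
    """KL divergence between histograms of two min-max normalized features.

    The binning and the smoothing constant eps are assumptions of this sketch.
    """
    def normalize(values):
        values = np.asarray(values, dtype=float)
        span = values.max() - values.min()
        return (values - values.min()) / span if span > 0 else np.zeros_like(values)

    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(normalize(a), bins=edges)
    q, _ = np.histogram(normalize(b), bins=edges)
    # entropy() normalizes both count vectors and returns sum(p * log(p / q)).
    return float(entropy(p + eps, q + eps))


rng = np.random.default_rng(2)
peak_a = rng.normal(size=500)
peak_b = 3.0 * peak_a + 1.0          # same shape as peak_a after normalization
peak_c = rng.exponential(size=500)   # clearly different distribution

print(kl_between_features(peak_a, peak_b) < 0.1)  # redundant pair
print(kl_between_features(peak_a, peak_c) < 0.1)  # informative pair
```

Because KL divergence is not symmetric, a practical screen might evaluate both directions for each pair before treating one of the two peaks as redundant.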

Table 7: SIAISI011-A simplified model, cell generation determination results for the modeling group

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
SIAISi011-A P5	5.33-5.78 (5.56) / 0.83-0.91 (0.89)
SIAISi011-A P15	15.44-16.06 (15.87) / 0.67-0.71 (0.69)
SIAISi011-A P25	23.71-24.51 (24.33) / 0.82-0.84 (0.83)
SIAISi011-A P35	34.67-34.72 (34.68) / 0.88-0.91 (0.90)

Table 8: TUSMi002 simplified model, cell generation determination results for the modeling group

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
TUSMi002 P5	5.23-5.34 (5.25) / 0.93-0.94 (0.94)
TUSMi002 P15	15.31-15.44 (15.31) / 0.68-0.73 (0.71)
TUSMi002 P25	24.30-24.68 (24.54) / 0.86-0.90 (0.88)
TUSMi002 P35	34.62-34.70 (34.68) / 0.93-0.95 (0.94)

Table 9: SIAISI011-A simplified model, cell generation determination results for the group to be identified

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
SIAISi011 P10	9.01-9.89 (9.25) / 0.57-0.70 (0.61)
SIAISi011 P20	19.89-20.31 (20.09) / 0.31-0.41 (0.34)
SIAISi011 P30	30.00-30.55 (30.53) / 0.37-0.42 (0.41)

Table 10: TUSMi002 simplified model, cell generation determination results for the group to be identified

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
TUSMi002 P10	6.03-6.82 (6.36) / 0.66-0.77 (0.71)
TUSMi002 P20	20.09-20.52 (20.35) / 0.31-0.49 (0.40)
TUSMi002 P30	27.34-28.16 (27.77) / 0.58-0.73 (0.68)

A simplified "cell generation determination model" was established with the screened minimum characteristic peak set. The determination results of the simplified model for the modeling group and the group to be identified are close to the actual cell generations, and no completely wrong conclusion was obtained, so the simplified modeling approach is feasible. The determination results were then evaluated according to the evaluation criteria set out above, and the accuracy of the model was calculated. The inventors found that the simplified "cell generation determination model" still had a high accuracy: for SIAISi011-A, the modeling group accuracy is 84.76%, the accuracy for the group to be identified is 77.78%, and the total accuracy is 81.82%; for TUSMi002, the modeling group accuracy is 92.37%, the accuracy for the group to be identified is 56.75%, and the total accuracy is 77.05%.

This result demonstrates that the combination of the four characteristic peaks νas(CH₃), N-H bending of Amide I, α-helical of Amide I and α-helical of Amide II is the minimum characteristic peak set that can be used to establish a cell generation determination model. It further supports the use of the minimum single-cell SR-FTIR characteristic peak set obtained from the model in a subsequently developed spectrometer for rapidly acquiring SR-FTIR spectra, which can provide a basis for clinical stem cell treatment according to the determined physiological state of the cells.

Example 4

In this example, cross-validation was used to verify the ability of the stem cell generation identification model system to identify the generation of stem cells from different sources.

The inventors have established that, given a cell line of any one source, the cell generation identification model can accurately identify any cell generation within the modeling range, and that the "wide-range models" Model A and Model A-T, established from cells of two different sources (SIAISI011-A and TUSMi002), can accurately determine the generation of other-generation stem cells from their own source. However, the cell generation identification model has not yet been validated for identifying the generation of stem cells from other sources. In other words, it remains open whether identifying the generation of stem cells of any source requires building a "wide-range model" from that source, or whether an existing stem cell model from another source, such as Model A or Model A-T, can be used.

To address this question, the inventors first cross-validated Model A and Model A-T with each other's modeling-group cell data: Model A was used to identify the relative cell generation of TUSMi002 P5, P15, P25 and P35, and Model A-T was used to identify the relative cell generation of SIAISI011-A P5, P15, P25 and P35. The results are shown in FIG. 9, Table 11 and Table 12.

Table 11: Model A cell generation identification results for the TUSMi002 modeling group

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
TUSMi002 P5	29.41-30.01 (29.78) / 0.36-0.44 (0.39)
TUSMi002 P15	23.25-23.86 (23.66) / 0.69-0.77 (0.75)
TUSMi002 P25	22.00-24.24 (23.05) / 0.68-0.76 (0.73)
TUSMi002 P35	34.09-34.64 (34.55) / 0.82-0.92 (0.89)

Table 12: Model A-T cell generation identification results for the SIAISI011-A modeling group

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
SIAISi011-A P5	19.60-19.96 (19.84) / 0.38-0.51 (0.48)
SIAISi011-A P15	15.34-16.57 (15.66) / 0.42-0.49 (0.46)
SIAISi011-A P25	15.06-15.98 (15.09) / 0.42-0.54 (0.47)
SIAISi011-A P35	31.10-32.20 (31.48) / 0.75-0.79 (0.78)

The wide-range Model A established from SIAISI011-A cannot accurately identify the cell generation of the TUSMi002 modeling group; likewise, the wide-range Model A-T established from TUSMi002 cannot accurately identify the cell generation of the SIAISI011-A modeling group. The cell generation identification model therefore cannot accurately identify the cell generation of stem cells from other sources.

To verify this result again, the inventors used Model A and Model A-T to cross-validate cell data outside the modeling groups: Model A was used to identify the relative cell generation of TUSMi002 P10, P20 and P30, and Model A-T was used to identify the relative cell generation of SIAISI011-A P10, P20 and P30. The results are shown in FIG. 10, Table 13 and Table 14.

Table 13: Model A cell generation identification results for the TUSMi002 test group

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
TUSMi002 P10	31.14-31.98 (31.59) / 0.50-0.63 (0.58)
TUSMi002 P20	6.31-6.59 (6.45) / 0.72-0.81 (0.77)
TUSMi002 P30	34.56-34.70 (34.65) / 0.86-0.90 (0.87)

Table 14: Model A-T cell generation identification results for the SIAISI011-A test group

Sample	Cell passage number, 95% confidence interval (median) / probability, 95% confidence interval (median)
SIAISi011-A P10	19.87-20.65 (20.12) / 0.20-0.32 (0.25)
SIAISi011-A P20	24.79-25.02 (24.92) / 0.27-0.57 (0.34)
SIAISi011-A P30	18.74-19.63 (19.12) / 0.22-0.42 (0.31)

It was likewise found that Model A could not accurately identify the cell generations of TUSMi002 P10, P20 and P30, and that Model A-T could not identify the cell generations of the SIAISI011-A test group.

In summary, it was confirmed that the cell generation identification model can accurately identify any stem cell generation within the modeling range of its own source, but cannot identify the generation of stem cells from other sources. It is concluded that, to identify the generation of stem cells from any source, wide-range SR-FTIR spectra covering the cell generations of that stem cell source need to be acquired, and the cell generation identification model applied, with characteristic peak selection, to build a wide-range model specific to that source.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A method for identifying the generation number of a stem cell population, the method comprising the steps of:

determining, based on a preset generation identification model, the probability that the stem cell population to be identified is at each of N known generation numbers;

obtaining the generation number of the stem cell group to be identified according to the generation number range of the stem cell group to be identified and the probability that the stem cell group to be identified is in each generation number;

the preset generation identification model is constructed by the following steps:

constructing a training data set, wherein the training data set comprises characteristic quantities of N stem cell groups with known passage times and corresponding passage times, and the characteristic quantities of the stem cell groups comprise a plurality of wavelength values obtained by SR-FTIR analysis of the stem cell groups;

constructing an initial identification model, wherein the initial identification model comprises an independent variable and a dependent variable, the independent variable is a characteristic quantity of the stem cell population, and the dependent variable is probability of each generation of the stem cell population;

2. The method of claim 1, wherein the characteristic quantities of the stem cell population comprise νas(CH₃), N-H bending of Amide I, α-helix of Amide I, and α-helix of Amide II, obtained by SR-FTIR analysis of the stem cells.

3. The method according to claim 1, wherein the feature quantity of the stem cell population to be identified and the feature quantity of the stem cell population of the N known generations are subjected to dimension reduction treatment, and the two are compared to obtain the generation range to which the stem cell population to be identified belongs.

4. The method according to claim 3, wherein the feature quantity of the stem cell population to be identified, after the dimension reduction treatment, is subjected to cluster analysis to obtain non-outlier points of the stem cell population to be identified; and

the generation range of the stem cell population to be identified is obtained by comparing the distances between the non-outlier points of the stem cell population to be identified and the cluster centers of the N stem cell populations of known generation numbers.

5. The method according to claim 4, wherein at least two target cluster centers are selected from among the cluster centers of the N stem cell populations of known generation numbers by comparing the distances between the non-outlier points of the stem cell population to be identified and those cluster centers, and the generation range is formed from the generation numbers of the stem cell populations of known generation numbers to which the target cluster centers belong.

6. The method of claim 4, wherein a first target cluster center closest to the non-outlier points of the stem cell population to be identified and a second target cluster center next closest to the non-outlier points of the stem cell population to be identified are selected from among the cluster centers of the N stem cell populations of known generation numbers; and the generation range is formed from the generation numbers of the stem cell populations of known generation numbers to which the first target cluster center and the second target cluster center belong.

7. The method according to claim 4, wherein a first target cluster center closest to the non-outlier point of the stem cell population to be identified is selected among the cluster centers of the N known generations of stem cell populations;

selecting the cluster centers of two stem cells of known generation adjacent to the generation to which the first target cluster center belongs, comparing the distances between the cluster centers of the two adjacent stem cells of known generation and the non-outlier of the stem cell group to be identified, and selecting the cluster center of the stem cell of known generation with the closer distance to the non-outlier of the stem cell group to be identified as a second target cluster center;

and the generation range is formed from the generation numbers of the stem cell populations of known generation numbers to which the first target cluster center and the second target cluster center belong.

8. The method according to claim 1, wherein the generation number of the stem cell group to be identified is obtained by calculating the generation number range to which the stem cell group to be identified belongs and the probability that the stem cell group to be identified is in each generation number using a weighted average formula.
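By way of non-limiting illustration only, the selection of target cluster centers recited in claims 6 and 7 and the weighted average recited in claim 8 might be sketched as follows. The sketch rests on several assumptions that are not stated in the claims: the non-outlier points are summarized as a single point in the reduced-dimension space, distances are Euclidean, and the weights of the weighted average are the model probabilities renormalized over the generation range. All function and variable names are hypothetical.

```python
import math


def nearest_two_range(sample_point, cluster_centers):
    """Claim 6 style: use the nearest and next-nearest cluster centers.

    cluster_centers maps a known generation number to the coordinates of its
    cluster center in the reduced-dimension space (assumed representation).
    """
    if len(cluster_centers) < 2:
        raise ValueError("at least two cluster centers are required")
    ranked = sorted(
        cluster_centers,
        key=lambda g: math.dist(sample_point, cluster_centers[g]),
    )
    return tuple(sorted(ranked[:2]))


def adjacent_range(sample_point, cluster_centers):
    """Claim 7 style: nearest center, then the closer generation-adjacent center."""
    if len(cluster_centers) < 2:
        raise ValueError("at least two cluster centers are required")
    gens = sorted(cluster_centers)
    first = min(gens, key=lambda g: math.dist(sample_point, cluster_centers[g]))
    i = gens.index(first)
    # Known generations immediately below and above the first target, if present.
    neighbors = [gens[j] for j in (i - 1, i + 1) if 0 <= j < len(gens)]
    second = min(
        neighbors, key=lambda g: math.dist(sample_point, cluster_centers[g])
    )
    return tuple(sorted((first, second)))


def weighted_generation(generation_range, probabilities):
    """Claim 8 style: probability-weighted average over the generation range.

    Assumption: probabilities are renormalized over the selected generations.
    """
    total = sum(probabilities[g] for g in generation_range)
    if total == 0:
        raise ValueError("probabilities over the generation range sum to zero")
    return sum(g * probabilities[g] for g in generation_range) / total


# Made-up example values, for illustration only (not measured data).
centers = {10: (0.0, 0.0), 20: (4.0, 0.0), 30: (8.0, 0.0)}
sample = (3.0, 0.0)
probs = {10: 0.25, 20: 0.75, 30: 0.0}

rng = nearest_two_range(sample, centers)   # (10, 20)
print(rng, weighted_generation(rng, probs))  # (10, 20) 17.5
```

In this made-up example both selection strategies yield the range 10 to 20, and the weighted average places the population at generation 17.5; the two strategies can differ when the next-nearest center is not generation-adjacent to the nearest one.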