CN116741267A - Single cell clustering method and system based on consistency matrix scoring - Google Patents
- Publication number
- CN116741267A
- Authority
- CN
- China
- Prior art keywords
- matrix
- consistency
- clustering
- distance
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention belongs to the field of single-cell clustering methods and provides a single-cell clustering method and system based on consistency matrix scoring. After combined dimension reduction is performed on gene expression data, a plurality of consistency matrices are obtained, and each consistency matrix is clustered to obtain a corresponding clustering result. Combining each consistency matrix with its clustering result, an f-value is calculated for each consistency matrix by a scoring method, and the consistency matrix with the highest f-value is the optimal consistency matrix. Based on the optimal consistency matrix, a distance matrix among cells is constructed, and hierarchical clustering is applied to this distance matrix to obtain the final clustering result. The indirect distance between cells is fully utilized, and the clustering effect is improved.
Description
Technical Field
The invention belongs to the field of single-cell clustering methods, and particularly relates to a single-cell clustering method and system based on consistency matrix scoring.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Single-cell RNA sequencing (scRNA-seq) technology acquires transcriptome information at single-cell resolution, allowing cells to be observed with higher precision and rare cell types to be better identified. It plays an important role in research on tumors, neurological diseases, immune diseases and other diseases involving intercellular heterogeneity, and helps researchers explore the characteristics, fate and functional structure of cells in depth. Clustering is one of the most commonly used basic analysis methods in single-cell RNA sequencing data analysis. It distinguishes single-cell categories, which is extremely important in single-cell research on complex organs and tissues and in the diagnosis and treatment of clinical diseases. Accurate clustering of single-cell data therefore has important research significance in bioinformatics. Clustering is an unsupervised algorithm: with the true labels of the data unknown, it extracts effective features from the data, judges the similarity between samples, and assigns samples with similar features to the same cluster, thereby classifying the samples.
A review of existing single-cell clustering algorithms shows that most of them have the following problems:
1. Currently, most single-cell clustering algorithms apply a fixed, single set of key operations, such as data preprocessing and dimension reduction, to all data sets. In practice, however, different data sets differ greatly in their sensitivity to the preprocessing method and the dimension reduction mode.
2. A reasonable index is lacking, so no reference is available for choosing a data-specific combination of preprocessing and dimension reduction methods for each data set.
3. The calculation of the distance between cells is not reasonable. SC3 calculates the distance between cells in the consistency matrix using the Euclidean distance. However, this distance calculation is too coarse to capture all the patterns of difference between cells, and therefore inevitably loses effective distance information.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a single-cell clustering method and system based on consistency matrix scoring, which identify the optimal combination of preprocessing and dimension reduction through score values, so as to select a data-specific optimal combination. The indirect distance between cells is fully utilized, an optimal data-specific distance measure is obtained from the topological structure information among cells, and the clustering accuracy is improved.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A first aspect of the present invention provides a single-cell clustering method based on consistency matrix scoring, comprising the following steps:
obtaining gene expression data;
performing combined dimension reduction based on the gene expression data to obtain a plurality of consistency matrices, and clustering each consistency matrix to obtain a corresponding clustering result;
combining each consistency matrix with its corresponding clustering result, and calculating the f-value of each consistency matrix by a scoring method, wherein the consistency matrix with the highest f-value is the optimal consistency matrix; calculating the f-value of each consistency matrix specifically comprises:
calculating the inter-class distance and the intra-class distance of each row in the consistency matrix; obtaining the f-value of each row from its inter-class distance and intra-class distance, and integrating the f-values of all rows to obtain the f-value of the consistency matrix;
constructing a distance matrix among cells based on the optimal consistency matrix, and applying hierarchical clustering to the distance matrix to obtain a final clustering result.
A second aspect of the invention provides a single cell clustering system based on a consistency matrix score, comprising:
a data acquisition module for acquiring gene expression data;
a module for performing combined dimension reduction based on the gene expression data to obtain a plurality of consistency matrices, and for clustering each consistency matrix to obtain a corresponding clustering result;
a consistency matrix scoring module for combining each consistency matrix with its corresponding clustering result and calculating the f-value of each consistency matrix by a scoring method, wherein the consistency matrix with the highest f-value is the optimal consistency matrix; calculating the f-value of each consistency matrix specifically comprises:
calculating the inter-class distance and the intra-class distance of each row in the consistency matrix; obtaining the f-value of each row from its inter-class distance and intra-class distance, and integrating the f-values of all rows to obtain the f-value of the consistency matrix;
a clustering module for constructing a distance matrix among cells based on the optimal consistency matrix, and applying hierarchical clustering to the distance matrix to obtain a final clustering result.
A third aspect of the present invention provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps in a single cell clustering method based on a consistency matrix score as described above.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a single cell clustering method based on a consistency matrix score as described above when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention designs an f-value scoring mechanism based on the consistency matrix, calculates an f-value for each combination, and identifies the optimal combination of preprocessing and dimension reduction through the score values, thereby selecting a data-specific optimal combination. This solves the problem that most current single-cell clustering algorithms apply a single fixed data preprocessing and dimension reduction operation to all data.
2. The invention designs a new distance measure, the SCM-tom distance, based on the optimal consistency matrix. It replaces the common Euclidean distance, fully utilizes the indirect distance between cells, and obtains an optimal data-specific distance measure from the topological structure information among cells.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a single cell clustering method based on consistency matrix scoring provided by an embodiment of the present invention;
FIG. 2 is a histogram of accuracy of f-value at different e and p values provided by an embodiment of the present invention;
FIG. 3 is a graph showing the ARI values of SCM-tom and SCM-eu provided by the embodiment of the invention;
FIG. 4 is a graph comparing the ARI values of the algorithm of the present invention with those of other popular algorithms on different data sets, provided by an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
As shown in FIG. 1, this embodiment provides a single-cell clustering method based on consistency matrix scoring, which includes the following steps:
Step 1: acquiring an original gene expression matrix;
Step 2: preprocessing the original gene expression matrix;
In step 2, the preprocessing of the original gene expression matrix includes:
Before data processing, SCM filters the genes in the gene expression matrix to reduce data redundancy. Genes with very low expression carry little information, and genes with very high expression are ubiquitous; neither type helps clustering much, so genes whose expression rate lies outside a set range are removed.
In this embodiment, genes with an expression rate lower than 6% or higher than 94% are removed.
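The gene filter can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that the "expression rate" of a gene means the fraction of cells in which the gene has a nonzero count, and the function name `filter_genes` and the toy matrix are hypothetical.

```python
import numpy as np

def filter_genes(X, low=0.06, high=0.94):
    """Keep genes whose expression rate lies within [low, high].

    X is a genes-by-cells matrix. The expression rate of a gene is taken
    here (an assumption) to be the fraction of cells with a nonzero value.
    """
    X = np.asarray(X, dtype=float)
    rate = (X > 0).mean(axis=1)            # fraction of expressing cells per gene
    keep = (rate >= low) & (rate <= high)  # boolean mask over genes
    return X[keep], keep

# Toy matrix: 3 genes x 20 cells.
X = np.zeros((3, 20))
X[0, :] = 5.0     # expressed in 100% of cells: ubiquitous, removed
X[1, :10] = 2.0   # expressed in 50% of cells: kept
X[2, :1] = 1.0    # expressed in 5% of cells: rare, removed
X_filtered, keep = filter_genes(X)
```

The thresholds are exposed as parameters because the 6% and 94% limits are stated only for this embodiment.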
The filtered gene expression matrix is then preprocessed as follows.
Let $X_{G\times N}=\{x_{ij}\}$ be a gene expression matrix comprising G genes and N cells, where $x_{ij}$ is the expression level of the ith gene in the jth cell.
(1) Log transformation:
The gene expression matrix is log-transformed after adding a pseudocount of 1:
$$x'_{ij}=\log_2(x_{ij}+1)$$
(2) No transformation:
The original data are not preprocessed, and the original gene expression matrix is used as input for downstream analysis:
$$x'_{ij}=x_{ij}$$
(3) Z-score transformation:
Also called standard deviation standardization. After Z-score transformation, the expression values of each gene have mean 0 and standard deviation 1:
$$x'_{ij}=\frac{x_{ij}-\mu_i}{\sigma_i}$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of row i of the gene expression matrix, respectively.
The three preprocessing modes yield three processed gene expression matrices. Then, for each preprocessed expression matrix, the Euclidean distance, Pearson distance and Spearman distance between cells are calculated, so that three distance matrices are obtained for each expression matrix.
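The three preprocessing modes and three cell-to-cell distances can be sketched with NumPy as below. Two points are assumptions rather than statements from the text: the Pearson and Spearman distances are taken as one minus the corresponding correlation coefficient, and constant genes are given a standard deviation of 1 to avoid division by zero. All function names and the toy matrix are hypothetical.

```python
import numpy as np

def log_transform(X):
    return np.log2(X + 1.0)                      # pseudocount of 1

def zscore_transform(X):
    mu = X.mean(axis=1, keepdims=True)           # per-gene (row) mean
    sigma = X.std(axis=1, keepdims=True)         # per-gene (row) standard deviation
    sigma[sigma == 0] = 1.0                      # guard for constant genes (assumption)
    return (X - mu) / sigma

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def euclidean_distance(X):
    cells = X.T                                  # one row per cell
    sq = (cells ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * cells @ cells.T
    return np.sqrt(np.maximum(d2, 0.0))          # clip tiny negative round-off

def pearson_distance(X):
    return 1.0 - np.corrcoef(X.T)                # assumed form: 1 - r

def spearman_distance(X):
    ranked = np.apply_along_axis(average_ranks, 0, X)  # rank genes within each cell
    return 1.0 - np.corrcoef(ranked.T)           # assumed form: 1 - rho

# Toy matrix: 4 genes x 3 cells.
X = np.array([[0, 1, 3], [2, 2, 6], [4, 3, 9], [1, 0, 2]], dtype=float)
matrices = {"log": log_transform(X), "none": X, "zscore": zscore_transform(X)}
metrics = {"euclidean": euclidean_distance,
           "pearson": pearson_distance,
           "spearman": spearman_distance}
distance_matrices = {(p, m): f(M)
                     for p, M in matrices.items()
                     for m, f in metrics.items()}
```

Each preprocessed matrix yields three N×N distance matrices, so the dictionary holds nine matrices in total for the toy input.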
Step 3: performing dimension reduction on the preprocessed gene expression matrix;
Like data preprocessing, dimension reduction is an indispensable step in the downstream analysis of single-cell transcriptomes.
In this embodiment, the user's choice of dimension reduction mode is expanded, and three dimension reduction combinations are provided, namely the pairwise combinations of Principal Component Analysis (PCA), UMAP and Laplacian Eigenmaps (LE): PCA+UMAP, LE+UMAP and LE+PCA.
Principal Component Analysis (PCA) finds the r dimensions that best represent the variation in the data.
The UMAP method is a recently proposed algorithm based on Riemannian geometry and algebraic topology for dimension reduction and data visualization.
The Laplacian Eigenmaps (LE) algorithm is a manifold dimension reduction algorithm that preserves local characteristics of the data. Its main idea is to keep the structure among local sample points as unchanged as possible in the low-dimensional space.
Each dimension reduction combination (one combination comprises two dimension reduction methods) is applied to the 3 distance matrices obtained above.
For each distance matrix, $d_1, d_2, \ldots, d_D$ dimensions are retained after each dimension reduction (4%-7% of the original dimension). One dimension reduction combination therefore yields $2\times d_D$ dimension reduction results for each distance matrix.
Since this embodiment involves 3 distance matrices, one dimension reduction combination yields $3\times 2\times d_D$ dimension reduction results in total.
Subsequently, each dimension reduction result is clustered by k-means to obtain $3\times 2\times d_D$ clustering results, and a consistency clustering method is used to integrate all k-means clustering results into one consistency matrix.
For a single clustering result, consistency clustering converts it into a 0-1 binary matrix of dimension N×N (N is the total number of cells).
Assume a clustering result $R=\{r_1,r_2,\ldots,r_N\}$. If the ith cell and the jth cell belong to the same class, the value in row i and column j of the binary matrix $W_{N\times N}=\{w_{ij}\}$ corresponding to R is 1; if the two cells do not belong to the same class, the corresponding value is 0:
$$w_{ij}=\begin{cases}1, & r_i=r_j\\ 0, & r_i\neq r_j\end{cases}$$
where $r_i$ and $r_j$ are the class labels of the ith cell and the jth cell in the clustering result R, respectively.
In this manner, a single clustering result is converted into a 0-1 matrix.
Thereafter, all $3\times 2\times d_D$ k-means clustering results are converted into their corresponding 0-1 matrices, and the final consistency matrix Y is obtained by averaging these matrices:
$$Y=\frac{1}{3\times 2\times d_D}\sum_{i=1}^{3\times 2\times d_D}W_i$$
where $W_i$ is the binary matrix corresponding to the ith clustering result.
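The conversion of clustering results into 0-1 matrices and their averaging into a consistency matrix can be sketched as follows. The function names and the three toy label vectors are hypothetical; in the method described above the label vectors would come from the k-means runs.

```python
import numpy as np

def binary_matrix(labels):
    """N x N matrix with 1 where two cells share a cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus_matrix(clusterings):
    """Average the binary matrices of all clustering results."""
    return np.mean([binary_matrix(r) for r in clusterings], axis=0)

# Three toy clustering results for 4 cells (label values are arbitrary).
clusterings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first, with labels swapped
    [0, 0, 0, 1],
]
Y = consensus_matrix(clusterings)
```

Because only label equality is compared, relabelling the clusters of one run does not change its binary matrix, which is why results from independent k-means runs can be averaged directly.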
In summary, one expression matrix yields one consistency matrix under one dimension reduction combination. Since the method comprises 3 dimension reduction combinations, each expression matrix yields 3 consistency matrices in total. And because SCM obtains three expression matrices after preprocessing, the invention finally obtains 9 consistency matrices in total. For each of the 9 consistency matrices, the Euclidean distance between every two cells is calculated and hierarchical clustering is applied, yielding 9 corresponding clustering results.
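The bookkeeping behind these counts can be sketched as below. The enumeration of 3 preprocessing modes and 3 dimension reduction combinations follows the text; the helper `retained_dims`, which takes every integer dimension between 4% and 7% of the number of cells, and the assumption of one k-means run per distance matrix, method and retained dimension are illustrative guesses, and all names are hypothetical.

```python
from itertools import product

preprocessing = ["log", "none", "zscore"]          # three preprocessing modes
distances = ["euclidean", "pearson", "spearman"]   # three distance matrices per mode
combinations = ["PCA+UMAP", "LE+UMAP", "LE+PCA"]   # each pairs two reduction methods

def retained_dims(n_cells):
    """Integer dimensions between 4% and 7% of n_cells (assumed rule)."""
    low = -(-4 * n_cells // 100)    # ceil(0.04 * n_cells) in integer arithmetic
    high = 7 * n_cells // 100       # floor(0.07 * n_cells)
    return list(range(low, high + 1))

def kmeans_runs(combination, n_cells):
    """One (distance, method, dimension) triple per assumed k-means run."""
    methods = combination.split("+")
    return list(product(distances, methods, retained_dims(n_cells)))

# One consistency matrix per (preprocessing mode, reduction combination).
consistency_matrices = list(product(preprocessing, combinations))
runs = kmeans_runs("PCA+UMAP", n_cells=100)
```

For 100 cells this gives four retained dimensions, so each consistency matrix in the sketch averages 3 × 2 × 4 clustering results.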
Step 4: selecting the optimal consistency matrix
A good clustering result often requires a smaller intra-class distance and a larger inter-class distance, while the value of the consistency matrix may reflect the similarity between cells to some extent. Similarly, a good consistency matrix generally corresponds to smaller intra-class differences and larger inter-class differences, which also means that better clustering results can be output.
In this embodiment, 9 consistency matrices and 9 clustering results corresponding to the consistency matrices are obtained in total.
To some extent, each value $y_{ij}\in[0,1]$ in the consistency matrix $Y_{N\times N}=\{y_{ij}\}$ represents the similarity between cells i and j; the larger the value, the greater the probability that the two cells belong to the same cluster.
For this purpose, a scoring method is designed whose score, the f-value, measures the quality of a consistency matrix. Given a consistency matrix $Y_{N\times N}=\{y_{ij}\}$ and its corresponding clustering result R, the f-value of the consistency matrix is calculated in the following steps.
Step 1: calculating the f-value of each row in the consistency matrix Y
All rows in the consistency matrix act together to affect the final clustering result. Therefore, the invention calculates the f-value of each row in the consistency matrix.
First, the inter-class distance of the ith row in the consistency matrix Y is calculated:
$$\mathrm{var\_b}(i)=\sum_{j=1}^{k}n_j\left(\bar{y}_i^{(j)}-\bar{y}_i\right)^2$$
where $n_j$ is the number of cells in the jth cluster of the clustering result R; k is the number of clusters in R; $\bar{y}_i$ is the mean of the ith row of the consistency matrix Y; and $\bar{y}_i^{(j)}$ is the mean of the ith row of Y over the cells in the jth cluster of R.
Then, the intra-class distance of the ith row in the consistency matrix Y is calculated:
$$\mathrm{var\_a}(i)=\sum_{j=1}^{N}\left(y_{ij}-\bar{y}_i\right)^2$$
$$\mathrm{var\_i}(i)=\mathrm{var\_a}(i)-\mathrm{var\_b}(i)$$
where N is the number of rows of the consistency matrix Y, i.e. the total number of cells; $y_{ij}$ is the value in the ith row and jth column of Y; and var_a(i) is the total sum of distances of row i.
Finally, the f-value of the ith row in the consistency matrix Y is obtained by:
$$f\text{-}value(i)=\frac{\mathrm{var\_b}(i)/df_1}{\mathrm{var\_i}(i)/df_2}$$
$$df_1=k-1,\quad df_2=N-k$$
where $df_1$ and $df_2$ are the degrees of freedom of var_b(i) and var_i(i), respectively.
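A minimal sketch of the row-wise f-value follows. It assumes the standard one-way ANOVA form, in which var_b(i) is the between-cluster sum of squares of row i, var_a(i) is the total sum of squares of row i, and the f-value is the ratio of var_b(i) and var_i(i) after each is divided by its degrees of freedom. The function name and the toy consistency matrix are hypothetical.

```python
import numpy as np

def row_f_values(Y, labels):
    """f-value of every row of a consistency matrix (assumed ANOVA form)."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    N = Y.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    row_mean = Y.mean(axis=1)                              # mean of each row
    var_a = ((Y - row_mean[:, None]) ** 2).sum(axis=1)     # total sum of squares
    var_b = np.zeros(N)
    for c in clusters:
        cols = labels == c                                 # cells in cluster c
        n_j = cols.sum()
        cluster_mean = Y[:, cols].mean(axis=1)             # row mean within cluster c
        var_b += n_j * (cluster_mean - row_mean) ** 2      # between-cluster part
    var_i = var_a - var_b                                  # within-cluster part
    df1, df2 = k - 1, N - k
    return (var_b / df1) / (var_i / df2)

# Toy consistency matrix with two blocks of two cells each.
Y = np.array([[1.0, 0.8, 0.2, 0.0],
              [0.8, 1.0, 0.0, 0.2],
              [0.2, 0.0, 1.0, 0.8],
              [0.0, 0.2, 0.8, 1.0]])
f_good = row_f_values(Y, [0, 0, 1, 1])   # labels that match the block structure
f_bad = row_f_values(Y, [0, 1, 0, 1])    # labels that cut across the blocks
```

On this toy matrix the labelling that matches the block structure scores far higher in every row than the one that cuts across it. A perfectly block-diagonal 0-1 matrix would make var_i(i) zero, so a practical implementation would need to guard that division.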
Step 2: calculating the f-value of the consistency matrix
The above steps give the f-value of each row in the consistency matrix. Considering that the f-values of individual rows may be widely dispersed, the invention integrates the f-values of all rows in the following manner to obtain the f-value of the consistency matrix:
where $\lambda$ defaults to 0.5 and $\alpha$ defaults to 5.
The above calculation shows that the higher the f-value of the consistency matrix, the smaller the intra-class distance and the larger the inter-class distance of the cells in the consistency matrix; that is, most of the variation comes from differences between classes rather than from differences within a class. In short, the higher the f-value of the consistency matrix, the better its clustering effect. Thus, the consistency matrix with the highest f-value corresponds to the optimal combination of preprocessing and dimension reduction methods.
Step 5: reconstructing SCM-tom distance matrix
Conventional methods calculate the distance between cells in the consistency matrix using the Euclidean distance. However, this distance calculation is too coarse to capture all the patterns of difference between cells, so effective distance information is lost. The method therefore designs a new distance measure based on the topological structure among nodes (cells) to reconstruct the distance between cells. When the distance between two cells is measured, considering only the direct distance between them neglects global distance information, so the indirect distance through other cells should also be taken into account. Based on these considerations, a new distance measure is constructed for the consistency matrix, calculated as follows.
The values of the consistency matrix $Y_{N\times N}$ represent the similarity between the ith cell and the jth cell and range from 0 to 1, so the consistency matrix is used as the initial correlation coefficient matrix S between cells:
$$S_{ij}=y_{ij}$$
where $y_{ij}$ indicates the probability that the ith cell and the jth cell belong to the same cluster.
Then, the differences between correlation coefficients are amplified by introducing an exponent $\beta$, thereby constructing the adjacency matrix $A=\{\alpha_{ij}\}$:
$$\alpha_{ij}=S_{ij}^{\beta}$$
where $\alpha_{ij}$ is the $\beta$-association between the ith cell and the jth cell; $\beta$ defaults to 8 in this embodiment. The connectivity $k_j$ of each cell can then be obtained:
$$k_j=\sum_{i\neq j}\alpha_{ij}$$
where $k_j$ is the sum of the $\beta$-associations between the jth cell and the other cells.
Two cells are connected not only directly; their indirect connections through other cells also affect the relationship between them. Indirect connectivity must therefore also be considered, so that spurious direct connections do not distort the measure of connectivity between the two cells. Based on the adjacency matrix, a topological overlap matrix $\Omega=\{\omega_{ij}\}$ is then generated.
The topological overlap matrix accounts for both first-order and second-order connectivity and describes the topological association between two cells more accurately. To a certain extent, $\omega_{ij}$ represents the topological correlation between the ith cell and the jth cell. Finally, a distance matrix $D=\{d_{ij}\}$ is obtained:
$$d_{ij}=1-\omega_{ij}$$
From the above calculation, the larger the value of $\omega_{ij}$, the closer the two corresponding cells. $d_{ij}$ represents the distance between the ith cell and the jth cell and ranges between 0 and 1; the larger the value, the greater the topological difference between the two cells.
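A minimal sketch of this distance follows. The text does not spell out how the topological overlap is computed, so the sketch assumes the standard topological overlap measure from weighted network analysis, $\omega_{ij}=(l_{ij}+\alpha_{ij})/(\min(k_i,k_j)+1-\alpha_{ij})$ with $l_{ij}=\sum_{u\neq i,j}\alpha_{iu}\alpha_{uj}$; that formula, the function name `tom_distance` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def tom_distance(Y, beta=8):
    """Distance 1 - omega from a consistency matrix (assumed standard TOM form)."""
    S = np.asarray(Y, dtype=float)           # initial correlation matrix S = Y
    A = S ** beta                            # adjacency: alpha_ij = S_ij ** beta
    np.fill_diagonal(A, 0.0)                 # exclude self-connections
    k = A.sum(axis=1)                        # connectivity of each cell
    L = A @ A                                # shared-neighbour term l_ij; the zero
                                             # diagonal removes the u = i, j terms
    min_k = np.minimum(k[:, None], k[None, :])
    omega = (L + A) / (min_k + 1.0 - A)      # topological overlap (assumption)
    np.fill_diagonal(omega, 1.0)             # each cell fully overlaps itself
    return 1.0 - omega

# Toy consistency matrix with two blocks of two cells each.
Y = np.array([[1.0, 0.8, 0.2, 0.0],
              [0.8, 1.0, 0.0, 0.2],
              [0.2, 0.0, 1.0, 0.8],
              [0.0, 0.2, 0.8, 1.0]])
D = tom_distance(Y)
```

With entries of the adjacency matrix in [0, 1], the denominator is at least 1 and the overlap stays within [0, 1], so the resulting distances also lie in [0, 1]. The matrix D could then be passed to any hierarchical clustering routine that accepts precomputed distances.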
The distance matrix among cells is thus constructed from the selected optimal consistency matrix, and hierarchical clustering is applied to this distance matrix to obtain the final clustering result.
To evaluate the effectiveness of the new method, 10 commonly used public scRNA-seq data sets with known true cell type labels were collected: Biase, Deng, Darmanis, Muraro, Usoskin, Romanov, Zeisel, Lake, Buettner and Baron-mouse. In addition, the invention adopts the ARI (Adjusted Rand Index) as the criterion for judging the clustering effect. The ARI takes values in [-1, 1]; the closer the ARI is to 1, the closer the clustering result is to the true labels.
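For reference, the ARI can be computed from the contingency counts of two label vectors. The sketch below uses the standard pair-counting definition of the Adjusted Rand Index; the function name and the toy labels are hypothetical, and a library routine would normally be used in practice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index from pair counts of the contingency table."""
    n = len(true_labels)
    pair_counts = Counter(zip(true_labels, pred_labels))  # contingency table cells
    true_counts = Counter(true_labels)                     # row sums
    pred_counts = Counter(pred_labels)                     # column sums
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in true_counts.values())
    sum_b = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_a * sum_b / comb(n, 2)                  # chance agreement
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                              # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, relabelled
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partition cutting across
```

The first call shows that the ARI ignores the label names, and the second shows that a clustering worse than chance agreement gives a negative value.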
(1) Accuracy assessment of f-value
The invention sets an accuracy index to check the validity of the f-value, mainly examining whether the combination of preprocessing and dimension reduction methods corresponding to the maximum f-value is among the optimal combinations.
First, the ARI valid interval is set. Given a data set, 9 consistency matrices and their corresponding clustering results are obtained under the different combinations of preprocessing and dimension reduction. The ARI values of the 9 clustering results are then calculated, and their maximum, ari_max, is found. Considering that some ARI values are relatively close, the invention sets a reasonable error value e.
Finally, the ARI valid interval is set to [ari_max - e, ari_max]. In addition, although the ARI values of some combinations fall outside the valid interval, they still rank near the top among the 9 combinations, which shows that these combinations still cluster well to a certain extent. Based on these two considerations, this embodiment counts as optimal any of the 9 combinations whose ARI value is in the valid interval or whose ARI ranks in the top p (p is an optional parameter).
The f-values of the consistency matrices corresponding to the 9 combinations are calculated, and each data set is run 50 times. In each run, it is checked whether the combination with the maximum f-value is among the optimal combinations; if so, the count is increased by one, and otherwise it is left unchanged. With the count observed over n = 50 runs, the accuracy ACC of the f-value is obtained by:
$$ACC=\frac{count}{n}$$
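The accuracy check can be sketched as follows, reading ACC as the fraction of runs in which the combination with the maximum f-value is among the optimal combinations. The function names and the toy runs (three combinations instead of nine, three runs instead of fifty) are hypothetical and the numbers are made up purely for illustration.

```python
def optimal_combinations(ari_values, e=0.01, p=2):
    """Indices whose ARI is in [ari_max - e, ari_max] or ranks in the top p."""
    ari_max = max(ari_values)
    in_interval = {i for i, v in enumerate(ari_values) if v >= ari_max - e}
    ranked = sorted(range(len(ari_values)),
                    key=lambda i: ari_values[i], reverse=True)
    return in_interval | set(ranked[:p])

def f_value_accuracy(runs, e=0.01, p=2):
    """runs: list of (ari_values, f_values) pairs, one pair per run."""
    count = 0
    for ari_values, f_values in runs:
        chosen = max(range(len(f_values)), key=lambda i: f_values[i])
        if chosen in optimal_combinations(ari_values, e, p):
            count += 1                    # hit: the f-value picked an optimal combination
    return count / len(runs)

# Toy data: illustrative values only.
toy_runs = [
    ([0.90, 0.85, 0.60], [10.0, 8.0, 3.0]),  # picks combination 0: best ARI
    ([0.90, 0.85, 0.60], [5.0, 9.0, 3.0]),   # picks combination 1: second-best ARI
    ([0.90, 0.85, 0.60], [1.0, 2.0, 9.0]),   # picks combination 2: worst ARI
]
acc_p2 = f_value_accuracy(toy_runs, e=0.01, p=2)
acc_p1 = f_value_accuracy(toy_runs, e=0.01, p=1)
```

Loosening p from 1 to 2 turns the second run into a hit, which mirrors how the reported accuracy changes with the ranking threshold.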
the present invention tested the accuracy of f-value at different values of e (0.05,0.03,0.01) and p (3, 2) over 10 data sets (see fig. 2). It can be found that when the e-error value of ARI is set to 0.01 and the optimal combination ranking threshold p is set to 2, the accuracy of the Darmanis, buettner, baron-mouse dataset is still 1, while the Usoskin, biase, lake dataset is also up to 0.9 or more, despite the severe optimal combination conditions. At this time, the accuracy of Usoskin, biase, lake and Muraro datasets also increased to 1 when p was adjusted to 3. Wherein the Muraro dataset is raised from 0.64 to 1, demonstrating that although the ARI value of the combination in the Muraro dataset pointing to the highest f-value differs from the highest ARI by more than 0.01, the ranking among the 9 combinations is also top three, laterally corroborating the strong effectiveness of f-value in choosing the combination of optimal preprocessing and dimension reduction.
In summary, the f-value shows significantly superior performance on most data sets, and can well assist the user in completing the selection of the combination of the optimal preprocessing mode and the dimension reduction mode in a smaller error range.
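The accuracy check described in part (1) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the function names `optimal_set` and `f_value_accuracy` and the input layout (one pair of ARI values and f-values per run) are hypothetical, and ACC is assumed to be the hit count divided by the number of runs.

```python
import numpy as np

def optimal_set(ari, e=0.01, p=2):
    """Indices of the combinations treated as optimal: ARI inside
    [ari_max - e, ari_max], or ARI ranked within the top p."""
    ari = np.asarray(ari, dtype=float)
    in_interval = np.flatnonzero(ari >= ari.max() - e)
    top_p = np.argsort(-ari, kind="stable")[:p]
    return set(in_interval.tolist()) | set(top_p.tolist())

def f_value_accuracy(runs, e=0.01, p=2):
    """runs: iterable of (ari_values, f_values) pairs, one pair per run.
    Assumption: ACC = hits / n, where a run is a hit when the combination
    with the largest f-value belongs to the optimal set of that run."""
    hits = 0
    n = 0
    for ari, f in runs:
        n += 1
        if int(np.argmax(f)) in optimal_set(ari, e, p):
            hits += 1
    return hits / n
```

For the evaluation described above, `runs` would hold 50 pairs per data set, each pair containing 9 ARI values and 9 f-values.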
(2) Validity assessment of reconstruction distance
Based on the optimal consistency matrix, the invention compares, under hierarchical clustering, the distance matrix obtained with the new distance measurement (SCM-tom) against the one obtained with the Euclidean distance (SCM-eu) commonly used in traditional methods (see Fig. 3). It can be seen that the clustering effect of SCM-tom is better than that of SCM-eu on most data sets. SCM-tom shows a significant improvement over SCM-eu on the Muraro, Baron-mouse, Deng and Romanov data sets, with improvements ranging from 11.79% to 27.89%; on the Muraro data set the improvement exceeds 20%. There is also a significant improvement on the Lake, Zeisel and Darmanis data sets.
(3) Clustering effect comparison with other popular algorithms
Compared with other clustering algorithms (see Fig. 4), the ARI values obtained by the present invention rank first on almost all data sets. Specifically, the ARI values of SCM-tom on the Baron-mouse, Darmanis, Muraro, Deng and Romanov data sets are significantly higher than those of all other clustering algorithms, with improvements of 21.95%-145.40%, 10.17%-171.05%, 27.79%-325.79%, 8.02%-129.16% and 37.63%-99.35%, respectively; the clustering effect on the Biase, Buettner, Lake, Zeisel and Usoskin data sets is also clearly better than that of all other algorithms.
In particular, compared with the SC3 clustering algorithm, the method shows clear clustering superiority on most data sets. For example, on the Buettner data set, the ARI value obtained with the SC3 clustering algorithm is below 0.01, while the ARI values of SCM-tom all reach 0.89; on the Muraro data set, the ARI value obtained by SC3 is 0.73, which SCM-tom improves to 0.94; on the Romanov data set, the ARI values of SCM-tom all exceed 0.6, whereas the ARI value of SC3 is only 0.46 and the ARI values of the other clustering algorithms are mostly around 0.3.
In summary, the invention not only takes into account that algorithms are sensitive to different data sets and selects the optimal combination of preprocessing and dimension reduction for each data set, but also provides an effective distance measurement for the corresponding optimal consistency matrix and makes full use of the distance information between cells, thereby greatly improving the clustering effect.
Example II
The present embodiment provides a single cell clustering system based on consistency matrix scoring, comprising:
a data acquisition module for acquiring gene expression data;
a module for performing combined dimension reduction on the gene expression data to obtain a plurality of consistency matrices, and clustering each consistency matrix to obtain a corresponding clustering result;
a consistency matrix scoring module for calculating, by a scoring method, the f-value of each consistency matrix from the consistency matrix and its corresponding clustering result, wherein the consistency matrix with the highest f-value is the optimal consistency matrix; calculating the f-value of each consistency matrix specifically comprises:
calculating the inter-class distance and the intra-class distance of each row in the consistency matrix; obtaining the f-value of each row from its inter-class distance and intra-class distance; and integrating the f-values of all rows to obtain the f-value of the consistency matrix;
a clustering module for constructing a distance matrix between cells based on the obtained optimal consistency matrix, and applying hierarchical clustering to the distance matrix between cells to obtain a final clustering result.
Example III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the single-cell clustering method based on consistency matrix scoring described above.
Example IV
The present embodiment provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the single-cell clustering method based on consistency matrix scoring described above.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A single-cell clustering method based on consistency matrix scoring, characterized by comprising the following steps:
obtaining gene expression data;
performing combined dimension reduction on the gene expression data to obtain a plurality of consistency matrices, and clustering each consistency matrix to obtain a corresponding clustering result;
calculating, by a scoring method, the f-value of each consistency matrix from the consistency matrix and its corresponding clustering result, wherein the consistency matrix with the highest f-value is the optimal consistency matrix; calculating the f-value of each consistency matrix specifically comprises:
calculating the inter-class distance and the intra-class distance of each row in the consistency matrix; obtaining the f-value of each row from its inter-class distance and intra-class distance; and integrating the f-values of all rows to obtain the f-value of the consistency matrix;
constructing a distance matrix between cells based on the obtained optimal consistency matrix, and applying hierarchical clustering to the distance matrix between cells to obtain a final clustering result.
2. The single-cell clustering method based on consistency matrix scoring according to claim 1, wherein constructing the distance matrix between cells based on the obtained optimal consistency matrix comprises:
taking the optimal consistency matrix as an initial correlation coefficient matrix between cells;
introducing an exponent β to increase the differences between correlation coefficients, thereby constructing an adjacency matrix;
obtaining the connectivity of each cell and a topological overlap matrix based on the adjacency matrix, and obtaining the distance matrix between cells from the topological overlap matrix.
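The steps of claim 2 can be sketched in Python as follows. This is only an illustrative sketch: the claim does not give the formulas, so the code assumes the topological overlap definition commonly used in weighted co-expression network analysis (adjacency a_ij = |s_ij|^β, connectivity k_i = Σ_u a_iu, overlap (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), distance 1 - overlap). The function names, the default `beta=2` and the average-linkage choice are likewise assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def tom_distance(consensus, beta=2):
    """Distance matrix 1 - TOM from a symmetric consistency matrix in [0, 1]."""
    # Adjacency: raising to the power beta widens the gaps between coefficients.
    a = np.abs(np.asarray(consensus, dtype=float)) ** beta
    np.fill_diagonal(a, 0.0)                 # no self-connections
    k = a.sum(axis=1)                        # connectivity of each cell
    shared = a @ a                           # sum_u a_iu * a_uj (u = i, j add 0)
    denom = np.minimum.outer(k, k) + 1.0 - a
    tom = (shared + a) / denom               # topological overlap matrix
    np.fill_diagonal(tom, 1.0)
    return 1.0 - tom

def cluster_cells(consensus, n_clusters, beta=2):
    """Hierarchical clustering on the TOM-based distance (average linkage assumed)."""
    d = tom_distance(consensus, beta)
    d = (d + d.T) / 2.0                      # guard against rounding asymmetry
    np.fill_diagonal(d, 0.0)
    z = linkage(squareform(d, checks=False), method="average")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```

With this definition, two cells that share many strongly connected neighbours obtain a small distance even when their direct coefficient is moderate, which is the motivation for using the overlap rather than the raw consistency value.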
3. The single-cell clustering method based on consistency matrix scoring according to claim 1, wherein data preprocessing is performed after the gene expression data are obtained, specifically comprising:
performing gene filtering on the gene expression data to remove genes whose expression rate is outside a set range;
after gene filtering, calculating the Euclidean distance, Pearson distance and Spearman distance between cells separately for each preprocessed expression matrix.
4. The single-cell clustering method based on consistency matrix scoring according to claim 1, wherein the inter-class distance and the intra-class distance of each row in the consistency matrix are expressed as follows:
the inter-class distance formula for each row in the consistency matrix is:
wherein $n_j$ is the number of cells in the j-th cluster of the clustering result R; k is the number of clusters in the clustering result R; $\bar{Y}_i$ is the mean of the i-th row of the consistency matrix Y; and $\bar{Y}_i^{(j)}$ is the mean of the i-th row of the consistency matrix Y within the j-th cluster of the clustering result R.
The intra-class distance formula for each row in the consistency matrix is:
var_i(i)=var_a(i)-var_b(i)
wherein N is the number of rows of the consistency matrix Y, i.e. the total number of cells; $Y_{ij}$ denotes the value in the i-th row and j-th column of the consistency matrix Y; var_a(i) represents the sum of the distances; and var_b(i) is the inter-class distance of the i-th row.
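The scoring in claim 4 can be illustrated with the following sketch. The formulas are not reproduced in this text, so everything beyond var_i(i) = var_a(i) - var_b(i) is an assumption: var_b(i) is taken as the sum over clusters of n_j times the squared difference between the cluster mean and the row mean, var_a(i) as the sum of squared deviations of row i from its mean, the per-row f-value as the ANOVA-style ratio (var_b / (k - 1)) / (var_i / (N - k)), and the matrix score as the mean of the row scores. The function name `consistency_f_value` and the `eps` guard are hypothetical.

```python
import numpy as np

def consistency_f_value(Y, labels, eps=1e-12):
    """Score a consistency matrix Y (N x N) against a clustering result.
    Assumes at least 2 clusters and more cells than clusters."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    N = Y.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    row_mean = Y.mean(axis=1)                            # mean of row i
    var_a = ((Y - row_mean[:, None]) ** 2).sum(axis=1)   # total distance (assumed form)
    var_b = np.zeros(N)
    for c in clusters:
        cols = labels == c
        n_j = cols.sum()                                 # cells in cluster j
        cluster_mean = Y[:, cols].mean(axis=1)           # mean of row i within cluster j
        var_b += n_j * (cluster_mean - row_mean) ** 2    # inter-class distance (assumed form)
    var_i = np.maximum(var_a - var_b, 0.0)               # intra-class distance, clipped at 0
    f_rows = (var_b / (k - 1)) / (var_i / (N - k) + eps) # assumed F-ratio per row
    return float(f_rows.mean())                          # assumed aggregation: mean over rows
```

Under these assumptions, a consistency matrix whose rows separate cleanly by cluster receives a large score, while a matrix with blurred block structure receives a small one, so the highest-scoring combination can be selected without reference labels.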
5. The single-cell clustering method based on consistency matrix scoring according to claim 1, wherein the greater the topological correlation between the i-th cell and the j-th cell, the smaller the distance between the two cells.
6. The single-cell clustering method based on consistency matrix scoring according to claim 1, wherein the combined dimension reduction of the gene expression data comprises performing dimension reduction with pairwise combinations of methods, using the three combinations PCA+UMAP, LE+UMAP and LE+PCA.
7. The single-cell clustering method based on consistency matrix scoring according to claim 1, wherein after the corresponding clustering result is obtained, each single clustering result is converted into a 0-1 binary matrix, and the conversion process is as follows:
assume that there is a clustering result $R = \{r_1, r_2, \ldots, r_N\}$; if the i-th cell and the j-th cell belong to the same class, then the entry in row i, column j of the binary matrix $W_{N \times N} = \{w_{ij}\}$ corresponding to the clustering result R has the value 1; if the two cells do not belong to the same class, the corresponding value in the binary matrix $W_{N \times N}$ is 0.
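The conversion in claim 7 can be sketched as follows. The function `binary_matrix` follows the claim directly; the function `consensus_matrix`, which averages several binary matrices, is only an assumption about how such matrices might be combined, since the claim does not specify that step. Both names are hypothetical.

```python
import numpy as np

def binary_matrix(labels):
    """W[i, j] = 1 if cells i and j carry the same cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def consensus_matrix(label_sets):
    """Assumed combination step: element-wise mean of the binary matrices
    of several clustering results over the same cells."""
    return np.mean([binary_matrix(r) for r in label_sets], axis=0)
```

For example, `binary_matrix([0, 0, 1])` yields a 3 x 3 matrix in which cells 0 and 1 share a block of ones and cell 2 is connected only to itself.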
8. A single-cell clustering system based on consistency matrix scoring, comprising:
a data acquisition module for acquiring gene expression data;
a module for performing combined dimension reduction on the gene expression data to obtain a plurality of consistency matrices, and clustering each consistency matrix to obtain a corresponding clustering result;
a consistency matrix scoring module for calculating, by a scoring method, the f-value of each consistency matrix from the consistency matrix and its corresponding clustering result, wherein the consistency matrix with the highest f-value is the optimal consistency matrix; calculating the f-value of each consistency matrix specifically comprises:
calculating the inter-class distance and the intra-class distance of each row in the consistency matrix; obtaining the f-value of each row from its inter-class distance and intra-class distance; and integrating the f-values of all rows to obtain the f-value of the consistency matrix;
a clustering module for constructing a distance matrix between cells based on the obtained optimal consistency matrix, and applying hierarchical clustering to the distance matrix between cells to obtain a final clustering result.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the single-cell clustering method based on consistency matrix scoring according to any one of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the single-cell clustering method based on consistency matrix scoring according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310713091.2A CN116741267A (en) | 2023-06-15 | 2023-06-15 | Single cell clustering method and system based on consistency matrix scoring |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116741267A (en) | 2023-09-12 |
Family
ID=87913019
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||