CN117036762A - Multi-mode data clustering method - Google Patents
- Publication number
- CN117036762A (application CN202310975304.9A)
- Authority
- CN
- China
- Prior art keywords
- sample
- matrix
- data
- samples
- pixel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
- G06V10/763—Non-hierarchical techniques, e.g. based on statistics of modelling distributions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a multi-mode data clustering method, which belongs to the technical field of data processing. The method comprises: obtaining a sample data set; extracting edge features of the image data; extracting differential features of the transcriptome data, the differential features including mRNA features and miRNA features; calculating a correlation coefficient matrix for each sample data; applying a nonlinear mapping to the correlation coefficient matrix using a soft threshold; calculating the connectivity between each sample and the remaining samples; calculating the discretized connectivity and the corresponding probability to obtain an inter-sample distance matrix; pre-clustering the sample data with the K-means++ clustering algorithm; converting the inter-sample distance matrix into an inter-sample similarity matrix; constructing a kernel matrix from the pre-clustering information; iterating the inter-sample similarity matrix according to the kernel matrix; integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix; and clustering the samples with a spectral clustering algorithm.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a multi-mode data clustering method.
Background
In clinical diagnosis, different patients with the same tumor often respond differently to clinical treatment. This variability is often caused by tumor heterogeneity, which is likely due to mutations arising during the proliferation and differentiation of tumor cells, as several studies have demonstrated. Tumor heterogeneity ultimately manifests as phenotypic differences: not only do different patients with the same tumor respond differently to the same drug treatment, but the various biomarkers in each patient's tumor microenvironment also differ.
On the one hand, transcriptome data is a key biomarker for distinguishing tumor subtypes. In 2015, Rikke Karlin Jepsen et al. found, from microRNA expression data of colorectal cancer samples, that microRNA-92a, microRNA-375 and microRNA-424 are expressed differently across colorectal cancer tumor subtypes. Transcriptome data reflects gene expression in cells and provides a large amount of gene expression information, including the expression levels of genes under different conditions, which can reveal differences in cell function, metabolic pathways, signaling pathways and the like. Transcriptome data typically has high-dimensional feature vectors, which allows more gene expression variation to be considered in cluster analysis and helps reveal subtle differences. However, because of this high dimensionality, clustering algorithms incur increased computational complexity when processing large-scale data.
On the other hand, histopathological images play an important role in the early identification and diagnosis of cancer, and the analysis of pathological images has been applied to cancer diagnosis for many years. Kowal et al. compared and tested different algorithms for cell nucleus segmentation and, by analyzing a data set of case cancer images to determine whether a patient's tumor was benign, achieved an accuracy above 96%. Histopathological images show the morphology and structure of tissue cells directly, helping doctors or pathologists quickly observe the characteristics of a sample and discover potential abnormalities or pathological changes. However, interpreting and clustering histopathological images typically requires the subjective judgment of a specialized pathologist and may be affected by individual differences and subjective experience. Moreover, obtaining high-quality histopathological images requires processing such as tissue sectioning and staining, which is costly and time-consuming.
In summary, transcriptome data and histopathological images each have advantages and disadvantages in clustering cancer samples.
Disclosure of Invention
The invention provides a multi-mode data clustering method that aims to solve the following technical problems in the prior art: transcriptome data is high-dimensional, so the clustering algorithm has high computational complexity; and histopathological images are easily influenced by subjective factors, so clustering accuracy is poor.
First aspect
The invention provides a multi-mode data clustering method, which comprises the following steps:
S101: obtaining a sample data set comprising a plurality of sample data, each sample data comprising image data and transcriptome data;
S102: filtering the image data with a bilateral filter;
S103: introducing a Sobel operator and calculating gradient information of the pixel points in the filtered image data, the gradient information comprising gradient strength and gradient direction;
S104: when multiple pieces of gradient information exist in the filtered image data, retaining the maximum-value pixel points and suppressing the non-maximum-value pixel points;
S105: denoising the sample data after non-maximum suppression to obtain the edge features of the image data;
S106: extracting differential features of the transcriptome data, the differential features including mRNA features and miRNA features;
S107: calculating a correlation coefficient matrix for each sample data from the edge features, the mRNA features and the miRNA features of the sample data;
S108: applying a nonlinear mapping to the correlation coefficient matrix using a soft threshold;
S109: calculating the connectivity between each sample and the remaining samples;
S110: discretizing the connectivity with a Histogram algorithm, and calculating the discretized connectivity and the corresponding probability to obtain an inter-sample distance matrix;
S111: pre-clustering the sample data under the mRNA data view, the microRNA data view and the Image data view with the K-means++ clustering algorithm to obtain pre-clustering information;
S112: converting the inter-sample distance matrix into an inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view;
S113: constructing, from the pre-clustering information, a kernel matrix under the mRNA data view, the microRNA data view and the Image data view;
S114: iterating the inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view according to the kernel matrix;
S115: integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix;
S116: clustering the samples with a spectral clustering algorithm according to the inter-sample similarity fusion matrix.
Compared with the prior art, the invention has at least the following beneficial technical effects:
The invention combines transcriptome data with histopathological images: it extracts the edge features of the histopathological images and the mRNA features and miRNA features of the transcriptome data, fuses these features across modalities to obtain an inter-sample similarity fusion matrix, and then clusters automatically according to that matrix. For transcriptome data, only the mRNA features and miRNA features need to be considered, which reduces the data dimension and the complexity of the clustering algorithm. Multi-modal analysis also reduces the influence of subjective factors in the disease evaluation process and improves the accuracy of the clustering evaluation.
Drawings
The above characteristics, technical features, advantages and implementations of the present invention are further described below, in a clear and easily understood manner, through preferred embodiments and with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a multi-mode data clustering method provided by the invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will explain the specific embodiments of the present invention with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
For simplicity, each drawing shows only the parts relevant to the invention schematically; these do not represent the actual structure of a product. In addition, to keep the drawings simple and easy to understand, where several components in a drawing have the same structure or function, only one of them is drawn or labeled. Herein, "a" covers not only "exactly one" but also "more than one".
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
In this context, it should be noted that, unless otherwise explicitly stated and defined, the terms "mounted" and "connected" are to be construed broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, in the description of the present invention, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Example 1
In one embodiment, referring to fig. 1 of the specification, a flow chart of a multi-modal data clustering method provided by the invention is shown.
The invention provides a multi-mode data clustering method, which comprises the following steps:
S101: Acquire a sample data set.
The sample data set comprises a plurality of sample data, each sample data comprising image data and transcriptome data.
In particular, the sample data may be gastrointestinal cancer sample data.
S102: Filter the image data with a bilateral filter.
In one possible implementation, S102 specifically includes:
S1021: Convert the sample data into a pixel matrix.
S1022: Nonlinearly fuse the current pixel point with the pixel points in its surrounding neighborhood of radius 1 pixel:
$$g(i,j)=\frac{\sum_{(k,l)\in S(i,j)} f(k,l)\,w(i,j,k,l)}{\sum_{(k,l)\in S(i,j)} w(i,j,k,l)}$$
where $g(i,j)$ represents the nonlinearly fused pixel value at the current pixel point $(i,j)$, $S(i,j)$ represents the set of pixel points within the neighborhood of radius 1 pixel around the current pixel point $(i,j)$, $(k,l)$ represents the coordinates of a pixel point around the current pixel point $(i,j)$, $f(k,l)$ represents the gray value at the pixel point $(k,l)$, and $w(i,j,k,l)$ represents the weight parameter between the current pixel point $(i,j)$ and the pixel point $(k,l)$.
The weight parameter $w(i,j,k,l)$ between the current pixel point $(i,j)$ and the pixel point $(k,l)$ is calculated as follows:
$$w(i,j,k,l)=d(i,j,k,l)\cdot r(i,j,k,l)$$
$$d(i,j,k,l)=\exp\left(-\frac{(i-k)^2+(j-l)^2}{2\sigma_d^2}\right)$$
$$r(i,j,k,l)=\exp\left(-\frac{\left(f(i,j)-f(k,l)\right)^2}{2\sigma_r^2}\right)$$
where $d(i,j,k,l)$ represents the spatial domain weight between the current pixel point $(i,j)$ and the pixel point $(k,l)$, $r(i,j,k,l)$ represents the pixel domain weight between the current pixel point $(i,j)$ and the pixel point $(k,l)$, $\sigma_d$ represents the spatial domain standard deviation, and $\sigma_r$ represents the pixel domain standard deviation.
In the invention, the bilateral filter is a nonlinear filter that considers both the gray value and the spatial position of each pixel point during filtering, so the edge features of the image are preserved and the image stays clear and sharp. Nonlinear fusion of the gray values of the surrounding pixel points means that the fusion is not a simple weighted average with fixed weights; instead, the weights are adjusted according to the similarity between pixel points. In this way, the detail and texture information of the image is better preserved.
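The filtering in S1022 can be sketched in NumPy as follows. This is a minimal illustration, assuming the conventional Gaussian forms for the spatial weight d and the pixel domain weight r; the function name `bilateral_filter`, the default values of `sigma_d` and `sigma_r`, and the edge-replicating border handling are assumptions made for the example, not part of the described method.

```python
import numpy as np

def bilateral_filter(f, sigma_d=1.0, sigma_r=25.0, radius=1):
    """Bilateral filter over the (2*radius+1)^2 neighborhood of each pixel."""
    f = np.asarray(f, dtype=float)
    rows, cols = f.shape
    # Assumption: border pixels are handled by replicating the image edge.
    padded = np.pad(f, radius, mode="edge")
    num = np.zeros_like(f)
    den = np.zeros_like(f)
    for dk in range(-radius, radius + 1):
        for dl in range(-radius, radius + 1):
            # Image of neighbors f(k, l) with k = i + dk and l = j + dl.
            shifted = padded[radius + dk: radius + dk + rows,
                             radius + dl: radius + dl + cols]
            # Spatial domain weight d and pixel domain weight r.
            d = np.exp(-(dk ** 2 + dl ** 2) / (2.0 * sigma_d ** 2))
            r = np.exp(-((f - shifted) ** 2) / (2.0 * sigma_r ** 2))
            w = d * r
            num += shifted * w
            den += w
    # The center pixel always contributes weight 1, so den is never zero.
    return num / den
```

With a small `sigma_r`, pixels on opposite sides of a sharp intensity step receive almost no weight from each other, which is why the step survives the smoothing.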
S103: Introduce a Sobel operator and calculate gradient information of the pixel points in the filtered image data, the gradient information comprising gradient strength and gradient direction.
The Sobel operator is a commonly used image processing operator for detecting edges in an image. It is a discrete difference operator that finds the edges of an image by calculating the gradient of the image's pixel points in the horizontal and vertical directions.
Gradient information is an important feature in image processing and computer vision that describes the degree of change in an image and its edge information. It characterizes how pixel values vary across the image, which makes it possible to locate and identify features such as edges, textures and shapes.
In one possible implementation, S103 specifically includes:
S1031: Introduce a Sobel operator and determine the horizontal feature matrix $S_x$ and the vertical feature matrix $S_y$ for the filtered sample data.
S1032: From the horizontal feature matrix $S_x$ and the vertical feature matrix $S_y$, calculate the horizontal gradient $G_x$ and the vertical gradient $G_y$:
$$G_x = S_x * I$$
$$G_y = S_y * I$$
where $I$ represents the gray value matrix of the filtered sample data and $*$ denotes two-dimensional convolution.
S1033: Calculate the gradient strength $G$ and the gradient direction $\theta$ of the pixel points:
$$G=\sqrt{G_x^2+G_y^2}$$
$$\theta=\arctan\left(\frac{G_y}{G_x}\right)$$
In the invention, introducing the Sobel operator and calculating the horizontal and vertical feature matrices of the filtered image data yields the gradient information of the image in the horizontal and vertical directions. The gradient strength and gradient direction of each pixel point are then calculated from these gradients, which realizes edge detection and feature extraction for the image. This gradient information is important for the subsequent image analysis and processing steps and provides richer information about the image data.
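Steps S1031 to S1033 can be sketched as below. The kernels `SX` and `SY` are the conventional 3x3 Sobel kernels and are an assumption here, as are the helper names and the edge-replicating border handling. The helper applies the kernel by correlation (no kernel flip); a true convolution would only flip the sign of each gradient component and leaves the gradient strength unchanged.

```python
import numpy as np

# Conventional 3x3 Sobel kernels (assumed; sign conventions vary).
SX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SY = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)

def filter3x3(image, kernel):
    """Apply a 3x3 kernel to a 2-D image with edge-replicated padding."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros_like(image)
    for a in range(3):
        for b in range(3):
            out += kernel[a, b] * padded[a:a + rows, b:b + cols]
    return out

def sobel_gradient(image):
    """Return gradient strength G and gradient direction theta per pixel."""
    gx = filter3x3(image, SX)
    gy = filter3x3(image, SY)
    strength = np.hypot(gx, gy)      # sqrt(gx^2 + gy^2)
    direction = np.arctan2(gy, gx)   # quadrant-aware arctan(gy / gx)
    return strength, direction
```

`np.arctan2` is used instead of a plain arctangent so that pixels with a zero horizontal gradient do not cause a division by zero.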
S104: When multiple pieces of gradient information exist in the filtered image data, retain the maximum-value pixel points and suppress the non-maximum-value pixel points.
In one possible implementation, S104 specifically includes:
S1041: Select four preset edge angles: $0$, $\pi/4$, $\pi/2$ and $3\pi/4$.
S1042: When the gradient strength of the current pixel point is greater than the gradient at the preset edge angle, the current pixel point is determined to be a maximum-value pixel point and is retained; otherwise, it is determined to be a non-maximum-value pixel point and is suppressed.
In the invention, retaining the maximum-value pixel points and suppressing the non-maximum-value pixel points improves image edge detection, enhances image features and reduces the influence of noise, yielding more accurate and clearer image edge information and a better basis for subsequent image processing and analysis tasks.
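One common way to realize this kind of suppression is the Canny-style scheme sketched below, which is an assumption rather than the exact comparison of S1042: each gradient direction is quantized to one of four angles (0, 45, 90 and 135 degrees), and a pixel is kept only if its gradient strength is at least as large as that of its two neighbors along that direction. The function name, the offset table and the convention that angles are measured with the y axis pointing up are illustrative.

```python
import numpy as np

# (row, col) step toward the neighbor for angles 0, 45, 90, 135 degrees,
# assuming the angle is measured with the y axis pointing up.
OFFSETS = {0: (0, 1), 1: (-1, 1), 2: (-1, 0), 3: (-1, -1)}

def non_max_suppression(strength, direction):
    """Zero out pixels that are not local maxima along their gradient direction."""
    strength = np.asarray(strength, dtype=float)
    direction = np.asarray(direction, dtype=float)
    rows, cols = strength.shape
    # Quantize each direction to the nearest multiple of 45 degrees (mod 180).
    bins = np.rint(direction / (np.pi / 4)).astype(int) % 4
    padded = np.pad(strength, 1, mode="constant")
    out = np.zeros_like(strength)
    for i in range(rows):
        for j in range(cols):
            di, dj = OFFSETS[int(bins[i, j])]
            ahead = padded[i + 1 + di, j + 1 + dj]
            behind = padded[i + 1 - di, j + 1 - dj]
            if strength[i, j] >= ahead and strength[i, j] >= behind:
                out[i, j] = strength[i, j]
    return out
```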
S105: Denoise the sample data after non-maximum suppression to obtain the edge features of the image data.
In one possible implementation, S105 specifically includes:
S1051: Set a high threshold and a low threshold.
S1052: When the edge pixel gradient value of a pixel point is greater than the high threshold, determine the pixel point to be a strong edge pixel point.
S1053: When the edge pixel gradient value of a pixel point is between the low threshold and the high threshold, determine the pixel point to be a weak edge pixel point.
S1054: When the edge pixel gradient value of a pixel point is smaller than the low threshold, suppress the pixel point.
The high threshold $TH$ and the low threshold $TL$ are determined as follows:
$$TH = 0.5\,\max(H^*)$$
$$TL = 0.1\,\max(H^*)$$
where $H^*$ denotes the pixel matrix after non-maximum suppression, and $\max(H^*)$ denotes the maximum value in that matrix.
Denoising the sample data after non-maximum suppression and extracting the edge features of the image data extracts the important visual information from the image for use in image processing and computer vision tasks, which improves the performance and accuracy of the algorithm.
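The double-threshold rule of S1051 to S1054 can be sketched as follows. The function name `double_threshold` is illustrative, and assigning values exactly equal to a threshold to the weak class is an assumption, since the text does not say how ties are handled.

```python
import numpy as np

def double_threshold(h):
    """Classify pixels of the non-maximum-suppressed matrix H*."""
    h = np.asarray(h, dtype=float)
    th = 0.5 * h.max()   # high threshold TH
    tl = 0.1 * h.max()   # low threshold TL
    strong = h > th
    weak = (h >= tl) & (h <= th)   # ties go to the weak class (assumption)
    suppressed = h < tl
    return strong, weak, suppressed
```

For example, with a maximum gradient value of 10 the thresholds are TH = 5 and TL = 1, so a value of 4 is a weak edge and a value of 0.5 is suppressed.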
S106: Extract differential features of the transcriptome data, the differential features including mRNA features and miRNA features.
mRNA (messenger RNA) is a class of RNA molecules involved in gene expression in cells: it carries the genetic information transcribed from DNA and serves as the template for translating that information into the amino acid sequences of proteins. mRNA features generally refer to features that describe the expression levels of mRNA in transcriptomic studies.
miRNA (microRNA) is a class of short non-coding RNA molecules involved in the regulation of gene expression in cells; an miRNA inhibits the expression of a target gene by binding to its mRNA target. miRNA features generally refer to features that describe the expression and function of miRNAs in miRNA-omics studies.
S107: Calculate a correlation coefficient matrix for each sample data from the edge features, the mRNA features and the miRNA features of the sample data.
Specifically, Pearson correlation coefficients may be calculated to construct a Pearson correlation coefficient matrix.
S108: Apply a nonlinear mapping to the correlation coefficient matrix using a soft threshold.
In one possible implementation, S108 is specifically:
the correlation coefficient matrix is mapped nonlinearly by the following formulas:
$$a_{ij} = |a_{ij}|^{\beta}$$
$$S_{m\times m} = [a_{ij}]_{m\times m}$$
where $S$ represents the correlation coefficient matrix, $a_{ij}$ represents the correlation coefficient between the $i$-th sample and the $j$-th sample, $\beta$ represents the soft threshold, and $m$ represents the number of samples.
The soft threshold takes values in the range $\beta \in [2, 20]$.
In the present invention, soft threshold mapping sharpens the contrast between correlated and uncorrelated samples. Because $|a_{ij}| \le 1$, raising it to the power $\beta$ shrinks weak correlations much more than strong ones, so the high correlation coefficients between correlated samples stand out more clearly and the samples group into more distinct classes or clusters. Applying a nonlinear mapping to the correlation coefficient matrix with a soft threshold therefore helps improve the accuracy and stability of the clustering algorithm and better reveals the inherent relations between samples and the latent structure of the data.
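S107 and S108 can be sketched together in a few lines. The function name and the default `beta=6` are illustrative assumptions; the text only states that the soft threshold lies in [2, 20].

```python
import numpy as np

def soft_threshold_adjacency(features, beta=6):
    """Pearson correlation between samples, mapped to |a_ij| ** beta.

    features: array of shape (m samples, n features).
    Returns an (m, m) matrix.
    """
    features = np.asarray(features, dtype=float)
    # np.corrcoef treats each row as one variable, here one sample.
    corr = np.corrcoef(features)
    return np.abs(corr) ** beta
```

As a small worked case, a correlation of 0.8 becomes 0.64 for beta = 2 while a correlation of 0.2 becomes 0.04, so the ratio between the two grows from 4 to 16.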
S109: Calculate the connectivity between each sample and the remaining samples.
Connectivity measures the degree of association between samples. Calculating the connectivity between a sample and the remaining samples shows how similar each sample is to the others, and thus which samples in the data are strongly related and which are independent or uncorrelated.
In one possible implementation, S109 is specifically:
the connectivity between the current sample and the remaining samples is calculated by the following formula:
$$k_i = \sum_{j=1,\, j \neq i}^{m} a_{ij}$$
where $k_i$ indicates the connectivity between the $i$-th sample and the remaining samples.
In the invention, calculating the connectivity between a sample and the other samples helps characterize the degree of association between samples and provides important information and a basis for the clustering algorithm, thereby improving the accuracy and interpretability of the clustering result.
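A minimal sketch of the connectivity computation, assuming that connectivity is the sum of a sample's soft-thresholded correlations with all other samples (the self-correlation on the diagonal is excluded). The function name is illustrative.

```python
import numpy as np

def connectivity(adjacency):
    """Sum each row of the adjacency matrix, excluding the diagonal entry."""
    a = np.asarray(adjacency, dtype=float)
    return a.sum(axis=1) - np.diag(a)
```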
S110: Discretize the connectivity with a Histogram algorithm, and calculate the discretized connectivity and the corresponding probability to obtain an inter-sample distance matrix.
The Histogram algorithm is a method for discretizing data: it maps continuous data into discrete intervals, thereby quantizing the data into a finite set of values.
In one possible implementation, S110 is specifically:
the probability under the discretized connectivity is calculated by the following formula:
$$\log_{10}\big(p(k_i)\big) \propto \frac{1}{\log_{10}(k_i)}$$
where $p(k_i)$ represents the probability at connectivity $k_i$.
In the invention, discretizing the connectivity with the Histogram algorithm and calculating the probability converts the original continuous connectivity data into a discretized feature representation and provides the information needed to construct the inter-sample distance matrix, supporting subsequent data processing and clustering tasks.
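The discretization step alone can be sketched with `numpy.histogram`, as below. The function name and the default number of bins are assumptions, and the sketch stops at the per-bin probabilities: how the distance matrix is then derived from them is not detailed in the text, so it is not shown.

```python
import numpy as np

def connectivity_histogram(k, bins=10):
    """Discretize connectivity values and estimate the probability of each bin."""
    k = np.asarray(k, dtype=float)
    counts, edges = np.histogram(k, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])   # discretized connectivity values
    prob = counts / counts.sum()               # empirical probability per bin
    return centers, prob
```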
S111: Pre-cluster the sample data under the mRNA data view, the microRNA data view and the Image data view with the K-means++ clustering algorithm to obtain pre-clustering information.
Optionally, the number of cluster classes $K$ of the K-means++ clustering algorithm takes values in the range $K \in [2, 10]$.
In the invention, pre-clustering the sample data under the different data views with the K-means++ clustering algorithm converts high-dimensional data into low-dimensional cluster labels, reveals structures and patterns in the data, and provides useful information and a basis for subsequent clustering and data fusion.
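For illustration, K-means++ seeding followed by standard Lloyd iterations can be written in plain NumPy as below. The function name, the iteration cap and the fixed random seed are assumptions for the example; in practice a library implementation would normally be used, and the sketch does not handle the degenerate case where all points coincide.

```python
import numpy as np

def kmeans_pp(x, k, n_iter=50, seed=0):
    """Cluster rows of x into k groups; returns (labels, centers)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # K-means++ seeding: first center uniformly at random, each further
    # center with probability proportional to the squared distance to
    # the nearest center chosen so far.
    centers = [x[rng.integers(len(x))]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    centers = np.array(centers)
    # Lloyd iterations: assign points, then recompute centers.
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The returned labels play the role of the pre-clustering information; one such label vector would be produced per data view.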
S112: and converting the distance matrix between samples into a similarity matrix between samples under the mRNA data view, the microRNA data view and the Image data view.
In one possible implementation, S112 is specifically:
s1121: the inter-sample distance is converted to an inter-sample similarity by the following equation:
wherein w is ij Represents the similarity between the ith sample and the jth sample, d ij Represents the distance, ε, between the ith sample and the jth sample ij Representing the ith sample and the jth sampleAdaptive parameters between mean (d i ,N i ) Representing the ith sample and other N i Average value of distances of individual samples, mean (d j ,N j ) Representing the jth sample with other N j Average of the distances of the individual samples.
S1122: the inter-sample similarity matrix P is constructed according to the following formula:
where $P_{ij}$ denotes the element in the i-th row and j-th column of the inter-sample similarity matrix, and $w_{ik}$ denotes the similarity between the i-th sample and the k-th sample.
In the invention, the conversion of the inter-sample distance matrix into the inter-sample similarity matrix is helpful to better describe the similarity and correlation between samples, flexibly adjust the similarity calculation, and provide more meaningful input and basis for subsequent data analysis and fusion.
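The two formulas of S1121 and S1122 are not reproduced in this text, so the following sketch only illustrates one plausible reading of the symbol definitions. It assumes a Gaussian-type kernel scaled by $\varepsilon_{ij}$ taken as the mean of $\mathrm{mean}(d_i, N_i)$, $\mathrm{mean}(d_j, N_j)$ and $d_{ij}$, with $N_i$ read as the nearest neighbours of sample i, followed by plain row normalization. The names `similarity_from_distance` and `row_normalize` and the parameters `n_neighbors` and `mu` are hypothetical.

```python
import math

def similarity_from_distance(D, n_neighbors=2, mu=0.5):
    """Assumed form: w_ij = exp(-d_ij^2 / (mu * eps_ij)).

    Requires at least two samples and strictly positive off-diagonal distances.
    """
    n = len(D)
    # Mean distance from each sample to its nearest neighbours (self excluded).
    mean_nn = []
    for i in range(n):
        nearest = sorted(D[i][j] for j in range(n) if j != i)[:n_neighbors]
        mean_nn.append(sum(nearest) / len(nearest))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = (mean_nn[i] + mean_nn[j] + D[i][j]) / 3.0  # adaptive parameter
            W[i][j] = math.exp(-D[i][j] ** 2 / (mu * eps))
    return W

def row_normalize(W):
    """Assumed form: P_ij = w_ij / sum_k w_ik, so every row sums to 1."""
    return [[w / sum(row) for w in row] for row in W]

D = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
W = similarity_from_distance(D)
P = row_normalize(W)
```

With this reading, samples that are close under a view receive a similarity near 1 and distant samples a similarity near 0.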
S113: Constructing a kernel matrix under the mRNA data view, the microRNA data view and the Image data view according to the pre-clustering information.
In one possible implementation, S113 is specifically:
the kernel matrix S is constructed by the following formula:
where $S_{ij}$ denotes the element in the i-th row and j-th column of the kernel matrix, and $C_i$ denotes the category of the i-th sample.
In the invention, in multi-modal data clustering, each data view provides different types of characteristic information, and the construction of a kernel matrix can integrate the characteristic information, so that the relevance among samples is more comprehensively described. By constructing the kernel matrix, the sample similarity information under different data views can be integrated, the clustering accuracy and stability are improved, and the calculation complexity is reduced, so that the method plays an important role in multi-mode data clustering.
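Because the kernel formula is not reproduced in this text, the sketch below shows only one simple possibility: an indicator kernel in which $S_{ij}$ is 1 when samples i and j share a pre-clustering category and 0 otherwise. This form and the function name `kernel_from_labels` are assumptions made for illustration; the formula of the invention may weight the entries differently.

```python
def kernel_from_labels(labels):
    """Assumed indicator kernel: S_ij = 1 if C_i == C_j, else 0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]

# Samples 0 and 1 were pre-clustered together, sample 2 on its own.
S = kernel_from_labels([0, 0, 1])
```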
S114: Iterating the inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view according to the kernel matrix.
In one possible implementation, S114 is specifically:
iterating the sample-to-sample similarity matrix by the following formula:
where $P_v$ denotes the inter-sample similarity matrix under the v-th view, $P_k$ denotes the inter-sample similarity matrix under the k-th view, $S_v$ denotes the kernel matrix under the v-th view, and $(\cdot)^T$ denotes the matrix transpose.
According to the invention, the sample similarity information under different views can be fused by iteratively updating the sample similarity matrix, so that a more comprehensive sample similarity matrix is obtained. By doing so, the information provided by different views can be fully utilized, and the performance of a clustering algorithm is improved. In the iterative process, the sample similarity matrix continuously approaches the kernel matrix, which is equivalent to transmitting sample similarity information under different views to other views. By doing so, information loss among different views can be made up, and the accuracy of sample clustering can be improved.
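The iteration formula is not reproduced in this text. Judging from the symbols it defines, one plausible form is a cross-view diffusion update in which each view's matrix is replaced by $S_v \, \bar{P}_{-v} \, S_v^T$, where $\bar{P}_{-v}$ is the average of the similarity matrices of the other views. The sketch below implements that assumed form; `matmul`, `transpose` and `cross_view_update` are hypothetical helper names, and at least two views are required.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def cross_view_update(P, S):
    """One iteration over all views (assumed form).

    P: list of per-view similarity matrices, S: list of per-view kernel matrices.
    """
    m, n = len(P), len(P[0])
    new_P = []
    for v in range(m):
        # Average the similarity matrices of every view except v.
        avg = [[sum(P[k][i][j] for k in range(m) if k != v) / (m - 1)
                for j in range(n)] for i in range(n)]
        new_P.append(matmul(matmul(S[v], avg), transpose(S[v])))
    return new_P

identity = [[1.0, 0.0], [0.0, 1.0]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
P_next = cross_view_update(P, [identity, identity])
```

With identity kernels and two views, each view simply takes over the other view's matrix, which makes the information exchange between views easy to see.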
S115: Integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix.
In one possible implementation, S115 is specifically:
the sample-to-sample similarity fusion matrix is obtained by integrating the sample-to-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view through the following formula:
where $P$ denotes the inter-sample similarity fusion matrix, and $P_v$ denotes the inter-sample similarity matrix under the v-th view.
In the present invention, the different views provide respective characteristic information. By integrating the sample similarity matrixes under different views, the information of the different views can be fused together, so that more comprehensive and richer sample similarity information is obtained. By doing so, the understanding and judging capability of the clustering algorithm on the relevance among samples can be improved, so that the clustering performance is improved. The sample similarity matrix under different views may reflect the similarity relationships of the different aspects. By integrating the similarity matrices of these views, the similarity information for the different aspects can be integrated, thereby describing the similarity between samples more fully. This helps to overcome the limitations and biases that may exist for a single view, improving the stability and robustness of the clustering algorithm.
S116: Clustering the samples according to the inter-sample similarity fusion matrix by a spectral clustering algorithm.
The spectral clustering algorithm is an unsupervised clustering algorithm based on graph theory and spectral theory. It represents the sample data as a graph and partitions the samples using eigenvectors of the graph, so that similar samples are assigned to the same category.
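The idea can be illustrated with a deliberately simplified two-cluster sketch. It is not the implementation of the invention: it splits the samples by the sign of the second eigenvector of the normalized graph, obtained by power iteration with deflation, whereas a general K-cluster version would take K eigenvectors and cluster their rows. The function name `spectral_bipartition` and the iteration count are assumptions, and every sample is assumed to have positive total similarity.

```python
import math

def spectral_bipartition(P, n_iter=500):
    """Split samples into two groups from a similarity matrix P."""
    n = len(P)
    # Symmetrise so that the similarity graph is undirected.
    W = [[(P[i][j] + P[j][i]) / 2 for j in range(n)] for i in range(n)]
    s = [math.sqrt(sum(row)) for row in W]  # square roots of the degrees
    # M = D^{-1/2} W D^{-1/2}; its leading eigenvector is proportional to s.
    M = [[W[i][j] / (s[i] * s[j]) for j in range(n)] for i in range(n)]
    norm_s = math.sqrt(sum(x * x for x in s))
    top = [x / norm_s for x in s]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(n_iter):
        # Deflation: remove the component along the trivial eigenvector.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        # Power step on (M + I) / 2, whose eigenvalues lie in [0, 1].
        v = [(sum(M[i][j] * v[j] for j in range(n)) + v[i]) / 2 for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Rescale by the degrees and split by sign.
    return [0 if v[i] / s[i] < 0 else 1 for i in range(n)]

# Two groups of three samples: strong similarity inside, weak across.
P = [[1.0 if (i < 3) == (j < 3) else 0.01 for j in range(6)] for i in range(6)]
labels = spectral_bipartition(P)
```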
Compared with the prior art, the invention has at least the following beneficial technical effects:
according to the invention, transcriptome data and histopathological images are synthesized, edge characteristics of the histopathological images and mRNA characteristics and miRNA characteristics of the transcriptome data are extracted, multi-mode fusion is carried out on the edge characteristics of the histopathological images and the mRNA characteristics and the miRNA characteristics of the transcriptome data to obtain an inter-sample similarity fusion matrix, and then automatic clustering is carried out according to the inter-sample similarity fusion matrix. For transcriptome data, only mRNA features and miRNA features in the transcriptome data are required to be focused, so that the data dimension is reduced, the complexity of a clustering algorithm is reduced, the influence of subjective factors in the disease evaluation process is reduced, and the accuracy of clustering evaluation is improved through multi-modal analysis.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Claims (10)
1. A method for clustering multi-modal data, comprising:
S101: obtaining a sample dataset comprising a plurality of sample data, each sample data comprising image data and transcriptome data;
S102: filtering the image data through a bilateral filter;
S103: introducing a Sobel operator, and calculating gradient information of pixel points in the filtered image data, wherein the gradient information comprises gradient strength and gradient direction;
S104: when a plurality of gradient information exists in the filtered image data, retaining maximum-value pixel points and suppressing non-maximum-value pixel points;
S105: denoising the sample data after non-maximum suppression to obtain edge features of the image data;
S106: extracting differential features of the transcriptome data, the differential features including mRNA features and miRNA features;
S107: calculating a correlation coefficient matrix of each sample data according to the edge features, the mRNA features and the miRNA features of the sample data;
S108: performing nonlinear mapping on the correlation coefficient matrix using a soft threshold;
S109: calculating the connectivity of each sample with the remaining samples;
S110: discretizing the connectivity through a Histogram algorithm, and calculating the discretized connectivity and the corresponding probability to obtain an inter-sample distance matrix;
S111: pre-clustering sample data under an mRNA data view, a microRNA data view and an Image data view by a K-means++ clustering algorithm to obtain pre-clustering information;
S112: converting the inter-sample distance matrix into an inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view;
S113: constructing a kernel matrix under the mRNA data view, the microRNA data view and the Image data view according to the pre-clustering information;
S114: iterating the inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view according to the kernel matrix;
S115: integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix;
S116: clustering the samples according to the inter-sample similarity fusion matrix by a spectral clustering algorithm.
2. The multi-modal data clustering method according to claim 1, wherein S102 specifically includes:
S1021: converting the sample data into a pixel matrix;
S1022: performing non-linear fusion of the current pixel point with the pixel points in the neighborhood of radius 1 pixel around it:
where $g(i,j)$ denotes the non-linearly fused pixel value at the current pixel point $(i,j)$, $S(i,j)$ denotes the set of pixel points within the neighborhood of radius 1 pixel around the current pixel point $(i,j)$, $(k,l)$ denotes the coordinates of a pixel point around the current pixel point $(i,j)$, $f(k,l)$ denotes the gray value at the pixel point $(k,l)$, and $w(i,j,k,l)$ denotes the weight parameter between the current pixel point $(i,j)$ and the pixel point $(k,l)$;
the weight parameter $w(i,j,k,l)$ between the current pixel point $(i,j)$ and the pixel point $(k,l)$ is calculated as follows:
$w(i,j,k,l) = d(i,j,k,l) \cdot r(i,j,k,l)$
where $d(i,j,k,l)$ denotes the spatial domain weight between the current pixel point $(i,j)$ and the pixel point $(k,l)$, $r(i,j,k,l)$ denotes the pixel domain weight between the current pixel point $(i,j)$ and the pixel point $(k,l)$, $\sigma_d$ denotes the spatial domain standard deviation, and $\sigma_r$ denotes the pixel domain standard deviation.
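The filtering of S1022 can be sketched as follows. The expressions for $d$ and $r$ are not reproduced in this text, so the sketch assumes the standard bilateral filter with Gaussian spatial and pixel domain weights and a weighted average normalized by the sum of weights. The function name `bilateral_filter` and the default values of `sigma_d` and `sigma_r` are hypothetical.

```python
import math

def bilateral_filter(img, sigma_d=1.0, sigma_r=25.0, radius=1):
    """Bilateral filter over a gray value matrix given as a list of rows."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            num = den = 0.0
            # Neighbourhood of the given radius, clipped at the image border.
            for k in range(max(0, i - radius), min(h, i + radius + 1)):
                for l in range(max(0, j - radius), min(w, j + radius + 1)):
                    # Spatial domain weight d (assumed Gaussian in distance).
                    d = math.exp(-((i - k) ** 2 + (j - l) ** 2) / (2 * sigma_d ** 2))
                    # Pixel domain weight r (assumed Gaussian in gray difference).
                    r = math.exp(-((img[i][j] - img[k][l]) ** 2) / (2 * sigma_r ** 2))
                    wgt = d * r  # w(i,j,k,l) = d(i,j,k,l) * r(i,j,k,l)
                    num += img[k][l] * wgt
                    den += wgt
            out[i][j] = num / den
    return out

step = [[0, 0, 100, 100] for _ in range(3)]
smoothed = bilateral_filter(step, sigma_r=10.0)
```

Because the pixel domain weight is close to zero across the step from 0 to 100, the edge in `step` is preserved while flat regions are averaged.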
3. The multi-modal data clustering method according to claim 1, wherein S103 specifically includes:
S1031: introducing a Sobel operator, and calculating a horizontal feature matrix $S_x$ and a vertical feature matrix $S_y$ of the filtered sample data:
S1032: calculating the horizontal gradient $G_x$ and the vertical gradient $G_y$ according to the horizontal feature matrix $S_x$ and the vertical feature matrix $S_y$:
$G_x = S_x I$
$G_y = S_y I$
where $I$ denotes the gray value matrix of the filtered sample data;
S1033: calculating the gradient intensity $G$ and the gradient direction $\theta$ of the pixel points:
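The gradient computation of S1031 to S1033 can be sketched as follows. The kernel matrices and the final formulas are not reproduced in this text, so the sketch assumes the conventional 3×3 Sobel kernels, $G = \sqrt{G_x^2 + G_y^2}$, and a direction computed with `atan2`. The kernels are applied by sliding-window correlation, border pixels are left at zero, and the names `SOBEL_X`, `SOBEL_Y` and `sobel_gradients` are hypothetical.

```python
import math

# Conventional Sobel kernels (assumed; sign conventions vary between libraries).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

def sobel_gradients(img):
    """Return (G, theta): gradient intensity and direction for interior pixels."""
    h, w = len(img), len(img[0])
    G = [[0.0] * w for _ in range(h)]
    theta = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            G[i][j] = math.hypot(gx, gy)      # gradient intensity
            theta[i][j] = math.atan2(gy, gx)  # gradient direction in radians
    return G, theta

# A vertical step edge yields a purely horizontal gradient.
G, theta = sobel_gradients([[0, 0, 10, 10] for _ in range(3)])
```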
4. The multi-modal data clustering method according to claim 1, wherein S105 specifically includes:
S1051: setting a high threshold and a low threshold;
S1052: when the edge pixel gradient value of the pixel point is larger than the high threshold, determining that the pixel point is a strong edge pixel point;
S1053: when the edge pixel gradient value of the pixel point is between the low threshold and the high threshold, determining that the pixel point is a weak edge pixel point;
S1054: when the edge pixel gradient value of the pixel point is smaller than the low threshold, suppressing the pixel point;
the high threshold $TH$ and the low threshold $TL$ are determined as follows:
$TH = 0.5\max(H^*)$
$TL = 0.1\max(H^*)$
where $H^*$ denotes the pixel matrix after non-maximum suppression, and $\max(H^*)$ denotes the maximum value in the pixel matrix after non-maximum suppression.
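The double-threshold classification of S1051 to S1054 can be sketched as follows. The thresholds follow the two formulas above; the function name `double_threshold`, the string labels and the treatment of values exactly equal to a threshold as weak edges are assumptions made for illustration.

```python
def double_threshold(H):
    """Classify each pixel of the non-maximum-suppressed matrix H."""
    peak = max(max(row) for row in H)
    th, tl = 0.5 * peak, 0.1 * peak  # TH = 0.5 max(H*), TL = 0.1 max(H*)
    return [["strong" if g > th else "weak" if g >= tl else "suppressed"
             for g in row] for row in H]

labels = double_threshold([[0, 5, 20], [60, 100, 9]])
```

For this input the maximum is 100, so the thresholds are 50 and 10: the values 60 and 100 are strong edges, 20 is a weak edge, and 0, 5 and 9 are suppressed.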
5. The multi-modal data clustering method according to claim 1, wherein S108 is specifically:
the correlation coefficient matrix is mapped non-linearly by the following formula:
$a_{ij} = |a_{ij}|^{\beta}$
$S_{m \times m} = [a_{ij}]_{m \times m}$
where $S$ denotes the correlation coefficient matrix, $a_{ij}$ denotes the correlation coefficient between the i-th sample and the j-th sample, $\beta$ denotes the soft threshold, and $m$ denotes the number of samples.
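The soft-threshold mapping can be sketched in a few lines. The function name `soft_threshold` and the default exponent are hypothetical; the claim does not state a value for $\beta$.

```python
def soft_threshold(corr, beta=6):
    """Apply a_ij -> |a_ij| ** beta to every entry of the correlation matrix."""
    return [[abs(a) ** beta for a in row] for row in corr]

mapped = soft_threshold([[1.0, -0.5], [-0.5, 1.0]], beta=2)
```

Raising the absolute correlation to a power greater than 1 shrinks weak correlations much more than strong ones, which is the nonlinear effect the mapping relies on.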
6. The multi-modal data clustering method according to claim 1, wherein the step S109 is specifically:
calculating the connectivity of the current sample and the rest samples by the following formula:
where $k_i$ denotes the connectivity of the i-th sample with the remaining samples.
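The connectivity formula is not reproduced in this text. A common definition, assumed in the sketch below, sums the mapped correlation coefficients of a sample with all other samples, $k_i = \sum_{j \neq i} a_{ij}$. The function name `connectivity` and the exclusion of the diagonal entry are assumptions.

```python
def connectivity(adj):
    """Assumed form: k_i is the sum of row i excluding the diagonal entry."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

k = connectivity([[1.0, 0.25, 0.5],
                  [0.25, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
```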
7. The multi-modal data clustering method according to claim 1, wherein S112 specifically includes:
S1121: the inter-sample distance is converted to an inter-sample similarity by the following equation:
where $w_{ij}$ denotes the similarity between the i-th sample and the j-th sample, $d_{ij}$ denotes the distance between the i-th sample and the j-th sample, $\varepsilon_{ij}$ denotes the adaptive parameter between the i-th sample and the j-th sample, $\mathrm{mean}(d_i, N_i)$ denotes the average distance between the i-th sample and the other $N_i$ samples, and $\mathrm{mean}(d_j, N_j)$ denotes the average distance between the j-th sample and the other $N_j$ samples;
S1122: the inter-sample similarity matrix $P$ is constructed according to the following formula:
where $P_{ij}$ denotes the element in the i-th row and j-th column of the inter-sample similarity matrix, and $w_{ik}$ denotes the similarity between the i-th sample and the k-th sample.
8. The multi-modal data clustering method according to claim 1, wherein S113 is specifically:
the kernel matrix S is constructed by the following formula:
where $S_{ij}$ denotes the element in the i-th row and j-th column of the kernel matrix, and $C_i$ denotes the category of the i-th sample.
9. The multi-modal data clustering method according to claim 1, wherein S114 is specifically:
iterating the inter-sample similarity matrix by the following formula:
where $P_v$ denotes the inter-sample similarity matrix under the v-th view, $P_k$ denotes the inter-sample similarity matrix under the k-th view, $S_v$ denotes the kernel matrix under the v-th view, and $(\cdot)^T$ denotes the matrix transpose.
10. The multi-modal data clustering method according to claim 1, wherein S115 is specifically:
the inter-sample similarity fusion matrix is obtained by integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view through the following formula:
where $P$ denotes the inter-sample similarity fusion matrix, and $P_v$ denotes the inter-sample similarity matrix under the v-th view.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310975304.9A CN117036762B (en) | 2023-08-03 | 2023-08-03 | Multi-mode data clustering method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117036762A (en) | 2023-11-10 |
CN117036762B CN117036762B (en) | 2024-03-22 |
Family
ID=88629217
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310975304.9A Active CN117036762B (en) | 2023-08-03 | 2023-08-03 | Multi-mode data clustering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117036762B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200152289A1 (en) * | 2018-11-09 | 2020-05-14 | The Broad Institute, Inc. | Compressed sensing for screening and tissue imaging |
WO2022178274A1 (en) * | 2021-02-19 | 2022-08-25 | The Broad Institute, Inc. | Multi-scale spatial transcriptomics analysis |
WO2022270725A1 (en) * | 2021-06-25 | 2022-12-29 | 주식회사 포트래이 | Method for searching for intra-tissue molecular marker associated with distribution or physiological activity of probe or for physiological activity information |
CN115631361A (en) * | 2022-10-12 | 2023-01-20 | 山西大学 | Image clustering method fusing low-rank kernel learning and self-adaptive hypergraph |
CN115641957A (en) * | 2022-11-11 | 2023-01-24 | 广州大学 | New auxiliary chemotherapy curative effect prediction method and system based on image genomics |
US20230046438A1 (en) * | 2020-01-14 | 2023-02-16 | Peking University | Method for predicting cell spatial relation based on single-cell transcriptome sequencing data |
CN116312782A (en) * | 2023-05-18 | 2023-06-23 | 南京航空航天大学 | Spatial transcriptome spot region clustering method fusing image gene data |
Non-Patent Citations (1)
Title |
---|
Tan Jun; Yuan Shaoxun; Ming Wenlong; Sun Xiao: "Research progress on analysis methods in imaging genomics" (影像基因组学分析方法研究进展), 生物技术进展, no. 04, pages 7-13 * |
Also Published As
Publication number | Publication date |
---|---|
CN117036762B (en) | 2024-03-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |