CN117036762A - Multi-mode data clustering method - Google Patents

Publication number
CN117036762A
Authority
CN
China
Prior art keywords
sample
matrix
data
samples
pixel
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310975304.9A
Other languages
Chinese (zh)
Other versions
CN117036762B (en)
Inventor
艾冬梅
陈露露
王艺舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Application filed by University of Science and Technology Beijing USTB
Priority to CN202310975304.9A
Publication of CN117036762A
Application granted
Publication of CN117036762B
Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a multi-mode data clustering method, which belongs to the technical field of data processing and comprises the steps of obtaining a sample data set; extracting edge characteristics of the image data; extracting differential features of transcriptome data, the differential features including mRNA features and miRNA features; calculating a correlation coefficient matrix of each sample data; a soft threshold value is adopted to carry out nonlinear mapping on the correlation coefficient matrix; calculating the connectivity of each sample and the rest samples; calculating discretized connectivity and corresponding probability to obtain a distance matrix between samples; pre-clustering the sample data by a K-means++ clustering algorithm; converting the inter-sample distance matrix into an inter-sample similarity matrix; constructing a kernel matrix according to the pre-clustering information; iterating the similarity matrix among the samples according to the kernel matrix; integrating the similarity matrix among samples in the mRNA data view, the microRNA data view and the Image data view to obtain a similarity fusion matrix among samples; and clustering samples by a spectral clustering algorithm.

Description

Multi-mode data clustering method
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a multi-mode data clustering method.
Background
In clinical diagnosis, the variability of response to clinical treatment in different patients with the same tumor is often caused by tumor heterogeneity, which is likely due to mutations during proliferation and differentiation of tumor cells, which has been demonstrated by several studies. Tumor heterogeneity ultimately translates into a difference in phenotype, which refers not only to the difference in response of different patients with the same tumor to the same drug treatment experiment, but also to the differences in the various biomarkers in the patient's tumor microenvironment.
In one aspect, transcriptome data is a critical biomarker for distinguishing tumor subtypes. In 2015, Rikke Karlin Jepsen et al. found, from microRNA expression data of colorectal cancer samples, that microRNA-92a, microRNA-375 and microRNA-424 are differentially expressed across colorectal cancer tumor subtypes. Transcriptome data reflects the expression of genes in cells and provides a large amount of gene expression information, including the expression levels of genes under different conditions, which can reveal differences in cell functions, metabolic pathways, signaling pathways and the like. Transcriptome data typically has high-dimensional feature vectors, which allows more gene expression variation to be considered in cluster analysis and helps to find minor differences. However, because of this high dimensionality, clustering algorithms incur increased computational complexity when processing large-scale transcriptome data.
On the other hand, histopathological images play an important role in the early identification and diagnosis of cancer, and cancer diagnosis based on the analysis of pathological images has been applied and developed for many years. Kowal et al. compared and tested different algorithms for cell nucleus segmentation and, by analyzing a data set of cancer images, determined whether each patient's tumor was benign with an accuracy of over 96%. Histopathological images intuitively show the morphology and structure of tissue cells, and can help doctors or pathologists quickly observe the characteristics of a sample and discover potential abnormalities or pathological changes. However, the interpretation and clustering of histopathological images typically requires the subjective judgment of a specialized pathologist, and may be affected by individual differences and subjective experience. Moreover, obtaining high-quality histopathological images requires processing such as tissue sectioning and staining, which is costly and time-consuming.
In summary, transcriptome data and histopathological images each have advantages and disadvantages in clustering cancer samples.
Disclosure of Invention
The invention provides a multi-mode data clustering method, which aims to solve the technical problems in the prior art that transcriptome data has a high data dimension, which makes clustering algorithms computationally complex, and that the clustering of histopathological images is easily influenced by subjective factors, which makes clustering accuracy poor.
First aspect
The invention provides a multi-mode data clustering method, which comprises the following steps:
S101: Obtaining a sample data set comprising a plurality of sample data, each sample data comprising image data and transcriptome data;
S102: Filtering the image data through a bilateral filter;
S103: Introducing a Sobel operator, and calculating gradient information of pixel points in the filtered image data, wherein the gradient information comprises gradient strength and gradient direction;
S104: When multiple pieces of gradient information exist in the filtered image data, retaining the maximum-value pixel points and suppressing the non-maximum-value pixel points;
S105: Denoising the sample data after non-maximum suppression to obtain edge features of the image data;
S106: Extracting differential features of the transcriptome data, the differential features including mRNA features and miRNA features;
S107: Calculating a correlation coefficient matrix of each sample data according to the edge features, the mRNA features and the miRNA features of the sample data;
S108: Applying a soft threshold to map the correlation coefficient matrix non-linearly;
S109: Calculating the connectivity between each sample and the remaining samples;
S110: Discretizing the connectivity through a Histogram algorithm, and calculating the discretized connectivity and the corresponding probability to obtain a distance matrix between samples;
S111: Pre-clustering the sample data under an mRNA data view, a microRNA data view and an Image data view by a K-means++ clustering algorithm to obtain pre-clustering information;
S112: Converting the inter-sample distance matrix into an inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view;
S113: Constructing a kernel matrix under the mRNA data view, the microRNA data view and the Image data view according to the pre-clustering information;
S114: Iterating the inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view according to the kernel matrix;
S115: Integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix;
S116: Clustering the samples according to the inter-sample similarity fusion matrix by a spectral clustering algorithm.
Compared with the prior art, the invention has at least the following beneficial technical effects:
The invention combines transcriptome data and histopathological images: edge features of the histopathological images and mRNA features and miRNA features of the transcriptome data are extracted and fused across modalities to obtain an inter-sample similarity fusion matrix, and the samples are then clustered automatically according to that matrix. For transcriptome data, only the mRNA features and miRNA features need to be considered, which reduces the data dimension and the complexity of the clustering algorithm. The multi-modal analysis also reduces the influence of subjective factors in the disease evaluation process and improves the accuracy of clustering evaluation.
Drawings
The above characteristics, technical features, advantages and implementations of the present invention are further described below, in a clear and easily understood manner, through preferred embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a multi-mode data clustering method provided by the invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will explain the specific embodiments of the present invention with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
For simplicity of the drawings, only the parts relevant to the invention are schematically shown in each drawing, and they do not represent the actual structure of a product. Additionally, to keep the drawings simple and easy to understand, where several components in a drawing have the same structure or function, only one of them is schematically shown or labeled. Herein, "a" covers not only "only one" but also "more than one".
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
In this context, it should be noted that, unless otherwise explicitly stated and defined, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements. The specific meaning of these terms in the present invention will be understood by those of ordinary skill in the art according to the specific case.
In addition, in the description of the present invention, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Example 1
In one embodiment, referring to fig. 1 of the specification, a flow chart of a multi-modal data clustering method provided by the invention is shown.
The invention provides a multi-mode data clustering method, which comprises the following steps:
s101: a sample dataset is acquired.
Wherein the sample data set comprises a plurality of sample data, each sample data comprising image data and transcriptome data.
In particular, the sample data may be gastrointestinal cancer sample data.
S102: Filter the image data through a bilateral filter.
In one possible implementation, S102 specifically includes:
S1021: Convert the sample data into a matrix of pixels.
S1022: Non-linearly fuse the current pixel point with the pixel points within a neighborhood of radius 1 pixel around it:
$$g(i,j)=\frac{\sum_{(k,l)\in S(i,j)} f(k,l)\,w(i,j,k,l)}{\sum_{(k,l)\in S(i,j)} w(i,j,k,l)}$$
where $g(i,j)$ represents the non-linearly fused pixel value at the current pixel point $(i,j)$, $S(i,j)$ represents the set of pixel points within a neighborhood of radius 1 pixel around the current pixel point $(i,j)$, $(k,l)$ represents the coordinates of a pixel point around the current pixel point $(i,j)$, $f(k,l)$ represents the gray value at the pixel point $(k,l)$, and $w(i,j,k,l)$ represents the weight parameter between the current pixel point $(i,j)$ and the pixel point $(k,l)$.
The weight parameter $w(i,j,k,l)$ between the current pixel point $(i,j)$ and the pixel point $(k,l)$ is calculated as follows:
$$w(i,j,k,l)=d(i,j,k,l)\cdot r(i,j,k,l)$$
$$d(i,j,k,l)=\exp\left(-\frac{(i-k)^2+(j-l)^2}{2\sigma_d^2}\right)$$
$$r(i,j,k,l)=\exp\left(-\frac{\left(f(i,j)-f(k,l)\right)^2}{2\sigma_r^2}\right)$$
where $d(i,j,k,l)$ represents the spatial domain weight between the current pixel point $(i,j)$ and the pixel point $(k,l)$, $r(i,j,k,l)$ represents the pixel domain weight between the current pixel point $(i,j)$ and the pixel point $(k,l)$, $\sigma_d$ represents the spatial domain standard deviation, and $\sigma_r$ represents the pixel domain standard deviation.
In the invention, the bilateral filter is a nonlinear filter that takes both the gray value and the spatial position of each pixel point into account during filtering, so that the edge features of the image are preserved and the image stays clear and sharp. The bilateral filter fuses the gray values of surrounding pixels non-linearly: the weights are not those of a simple weighted average, but are adjusted according to the similarity between pixels. In this way, the details and texture information of the image are better preserved.
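As a minimal sketch of S1022 in Python, assuming the standard Gaussian forms for the spatial domain weight d and the pixel domain weight r, and assuming that borders are handled by replicating edge pixels (the text does not say how borders are treated), the fusion over a radius-1 neighborhood could look like the following. The function name `bilateral_filter` and the default standard deviations are illustrative and are not taken from the patent.

```python
import numpy as np

def bilateral_filter(f, sigma_d=1.0, sigma_r=25.0, radius=1):
    """Bilateral filter over the (2*radius+1)^2 neighbourhood of each pixel."""
    f = np.asarray(f, dtype=float)
    padded = np.pad(f, radius, mode="edge")  # border handling is an assumption
    h, w = f.shape
    num = np.zeros_like(f)  # running sum of f(k,l) * w(i,j,k,l)
    den = np.zeros_like(f)  # running sum of w(i,j,k,l)
    for dk in range(-radius, radius + 1):
        for dl in range(-radius, radius + 1):
            # Neighbour image shifted by (dk, dl) relative to the current pixel.
            nb = padded[radius + dk:radius + dk + h, radius + dl:radius + dl + w]
            d = np.exp(-(dk ** 2 + dl ** 2) / (2 * sigma_d ** 2))  # spatial weight
            r = np.exp(-((f - nb) ** 2) / (2 * sigma_r ** 2))      # gray-value weight
            weight = d * r
            num += nb * weight
            den += weight
    # den >= 1 everywhere because the centre pixel always has weight 1.
    return num / den
```

With a small pixel domain standard deviation, pixels on opposite sides of a sharp edge receive almost no weight from each other, which is why the edge survives the smoothing.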
S103: Introduce a Sobel operator, and calculate gradient information of the pixel points in the filtered image data, wherein the gradient information comprises gradient strength and gradient direction.
The Sobel operator is a commonly used image processing operator for detecting edges in an image. It is a discrete difference operator that finds the edges of an image by calculating the gradient of its pixel points in the horizontal and vertical directions.
Gradient information is an important feature in image processing and computer vision for describing the degree of change in an image and its edge information. It describes how pixel values vary across an image, and thus helps to locate and identify features such as edges, textures and shapes.
In one possible implementation, S103 specifically includes:
S1031: Introduce a Sobel operator, and calculate a horizontal feature matrix $S_x$ and a vertical feature matrix $S_y$ of the filtered sample data.
S1032: According to the horizontal feature matrix $S_x$ and the vertical feature matrix $S_y$, calculate the gradient $G_x$ in the horizontal direction and the gradient $G_y$ in the vertical direction:
$$G_x = S_x * I$$
$$G_y = S_y * I$$
where $I$ represents the gray value matrix of the filtered sample data and $*$ denotes two-dimensional convolution.
S1033: Calculate the gradient strength $G$ and the gradient direction $\theta$ of each pixel point:
$$G=\sqrt{G_x^2+G_y^2},\qquad \theta=\arctan\left(\frac{G_y}{G_x}\right)$$
According to the invention, introducing a Sobel operator and calculating the horizontal and vertical feature matrices of the filtered image data yields the gradient information of the image in the horizontal and vertical directions. The gradient strength and gradient direction of each pixel point are then calculated from these gradients, which realizes edge detection and feature extraction for the image. This gradient information is important for the subsequent image analysis and processing steps.
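The gradient computation of S103 can be sketched as follows. This is an illustration under stated assumptions: the text does not list the Sobel kernels, so the common 3x3 kernels are assumed; the kernels are applied by cross-correlation rather than flipped convolution, which only changes the sign of the gradients; and borders are replicated. The names `S_X`, `S_Y`, `filter3x3` and `sobel_gradient` are hypothetical.

```python
import numpy as np

# Common 3x3 Sobel kernels (assumed; the text does not give them explicitly).
S_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
S_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def filter3x3(image, kernel):
    """Apply a 3x3 kernel by cross-correlation with edge-replicated borders."""
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros_like(image)
    for a in range(3):
        for b in range(3):
            out += kernel[a, b] * padded[a:a + h, b:b + w]
    return out

def sobel_gradient(image):
    """Return (gradient strength, gradient direction in radians) per pixel."""
    gx = filter3x3(image, S_X)      # horizontal gradient
    gy = filter3x3(image, S_Y)      # vertical gradient
    strength = np.hypot(gx, gy)     # sqrt(gx^2 + gy^2)
    direction = np.arctan2(gy, gx)  # quadrant-aware arctan(gy / gx)
    return strength, direction
```

`np.arctan2` is used instead of a plain arctangent of the ratio so that pixels with a zero horizontal gradient do not cause a division by zero.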
S104: When multiple pieces of gradient information exist in the filtered image data, retain the maximum-value pixel points and suppress the non-maximum-value pixel points.
In one possible implementation, S104 specifically includes:
S1041: Select four preset edge angles, namely 0, π/4, π/2 and 3π/4.
S1042: When the gradient strength of the current pixel point is greater than the gradient strengths of its neighbors along the preset edge angle, determine the current pixel point to be a maximum-value pixel point and retain it; otherwise, determine it to be a non-maximum-value pixel point and suppress it.
In the invention, the processing mode of reserving the maximum pixel point and inhibiting the non-maximum pixel point is beneficial to improving the image edge detection effect, enhancing the image characteristics and reducing the influence of noise, thereby obtaining more accurate and clear image edge information and providing a better foundation for the subsequent image processing and analysis tasks.
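A minimal sketch of the non-maximum suppression of S104 follows. It assumes the usual Canny-style procedure: the gradient direction of each pixel is quantized to one of four angles (0, π/4, π/2, 3π/4, taken modulo π), and the pixel is kept only if its strength is not smaller than that of its two neighbors along that angle. The function name and the tie-handling rule (`>=`) are assumptions for illustration.

```python
import numpy as np

def non_max_suppression(strength, direction):
    """Keep a pixel only if it is a local maximum along its quantised direction."""
    strength = np.asarray(strength, dtype=float)
    direction = np.asarray(direction, dtype=float)
    h, w = strength.shape
    # Quantise directions to sectors 0..3, i.e. 0, pi/4, pi/2, 3*pi/4 (mod pi).
    sector = np.round((direction % np.pi) / (np.pi / 4)).astype(int) % 4
    # (row, column) step towards the neighbour "ahead" for each sector.
    offsets = {0: (0, 1), 1: (1, 1), 2: (1, 0), 3: (1, -1)}
    padded = np.pad(strength, 1, mode="constant")  # zeros outside the image
    out = np.zeros_like(strength)
    for i in range(h):
        for j in range(w):
            di, dj = offsets[sector[i, j]]
            ahead = padded[i + 1 + di, j + 1 + dj]
            behind = padded[i + 1 - di, j + 1 - dj]
            if strength[i, j] >= ahead and strength[i, j] >= behind:
                out[i, j] = strength[i, j]
    return out
```

The explicit double loop keeps the sketch readable; a production implementation would normally vectorize the neighbor lookups.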
S105: Denoise the sample data after non-maximum suppression to obtain the edge features of the image data.
In one possible implementation, S105 specifically includes:
S1051: Set a high threshold and a low threshold.
S1052: When the edge pixel gradient value of a pixel point is greater than the high threshold, determine the pixel point to be a strong edge pixel point.
S1053: When the edge pixel gradient value of a pixel point is between the low threshold and the high threshold, determine the pixel point to be a weak edge pixel point.
S1054: When the edge pixel gradient value of a pixel point is smaller than the low threshold, suppress the pixel point.
The high threshold TH and the low threshold TL are determined as follows:
$$TH = 0.5\max(H^*)$$
$$TL = 0.1\max(H^*)$$
where $H^*$ denotes the pixel matrix after non-maximum suppression, and $\max(H^*)$ denotes the maximum value in that matrix.
Denoising the sample data after non-maximum suppression and extracting the edge features of the image data captures the important visual information in the image for use in image processing and computer vision tasks, thereby improving the performance and accuracy of the algorithm.
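The double-threshold classification of S1051 to S1054 can be sketched as below, using the ratios 0.5 and 0.1 given for TH and TL. The function name and return layout are hypothetical, and the sketch stops at classification: how weak edge pixels are finally resolved is not specified in the text and is not implemented here.

```python
import numpy as np

def double_threshold(h_star, high_ratio=0.5, low_ratio=0.1):
    """Classify pixels of the non-maximum-suppressed matrix H*."""
    h_star = np.asarray(h_star, dtype=float)
    th = high_ratio * h_star.max()  # high threshold TH
    tl = low_ratio * h_star.max()   # low threshold TL
    strong = h_star > th                            # strong edge pixels
    weak = (h_star >= tl) & (h_star <= th)          # weak edge pixels
    suppressed = np.where(h_star < tl, 0.0, h_star) # pixels below TL set to 0
    return strong, weak, suppressed
```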
S106: Extract differential features of the transcriptome data, the differential features including mRNA features and miRNA features.
mRNA (messenger RNA) is a class of RNA molecules involved in gene expression in cells: it carries the genetic information transcribed from DNA, which is then translated into the amino acid sequences of proteins. mRNA features generally refer to features that describe the expression levels of mRNA in transcriptomic studies.
miRNA (microRNA) is a class of short, non-coding RNA molecules involved in the regulation of gene expression in cells; by binding to a target mRNA, a miRNA inhibits its translation or promotes its degradation. miRNA features generally refer to features that describe the expression and function of miRNAs in miRNA-omics studies.
S107: Calculate a correlation coefficient matrix of each sample data according to the edge features, the mRNA features and the miRNA features of the sample data.
Specifically, Pearson correlation coefficients may be calculated to construct a Pearson correlation coefficient matrix.
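A sketch of the Pearson correlation coefficient matrix between samples, assuming each view is stored as a matrix with one row per sample and one column per feature (this layout and the function name are assumptions):

```python
import numpy as np

def sample_correlation_matrix(features):
    """features: (m samples, n features) -> (m, m) Pearson correlation matrix."""
    # np.corrcoef treats each row as one variable, so rows must be samples.
    return np.corrcoef(np.asarray(features, dtype=float))
```

For example, two samples whose feature vectors are proportional have a correlation of 1, and two samples whose features move in opposite directions have a correlation of -1.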
S108: Apply a soft threshold to map the correlation coefficient matrix non-linearly.
In one possible implementation, S108 is specifically:
The correlation coefficient matrix is mapped non-linearly by the following formula:
$$a_{ij} = |a_{ij}|^{\beta}$$
$$S_{m\times m} = [a_{ij}]_{m\times m}$$
where $S$ represents the correlation coefficient matrix, $a_{ij}$ represents the correlation coefficient between the i-th sample and the j-th sample, $\beta$ represents the soft threshold, and $m$ represents the number of samples.
The soft threshold takes values in the range $\beta \in [2, 20]$.
In the present invention, soft threshold mapping sharpens the contrast between strongly and weakly correlated samples. Because the absolute value of a correlation coefficient is at most 1 and the soft threshold is at least 2, the mapping shrinks weak correlation coefficients far more than strong ones, so strongly correlated samples stand out and form more definite classes or clusters. Mapping the correlation coefficient matrix non-linearly with a soft threshold therefore helps to improve the accuracy and stability of the clustering algorithm and to better reveal the inherent relationships between samples and the latent structure of the data.
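The soft threshold mapping of S108 is a single element-wise operation. In the sketch below, the function name and the default value of β are illustrative choices within the stated range of 2 to 20.

```python
import numpy as np

def soft_threshold(corr, beta=6):
    """Element-wise mapping a_ij -> |a_ij| ** beta."""
    return np.abs(np.asarray(corr, dtype=float)) ** beta
```

With β = 6, for instance, a correlation of 0.9 maps to about 0.53 while a correlation of 0.3 maps to about 0.0007, which illustrates how weak correlations are pushed towards zero.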
S109: Calculate the connectivity between each sample and the remaining samples.
Connectivity measures the degree of association between samples. Calculating the connectivity between each sample and the remaining samples shows how similar each sample is to the others, and thus which samples in the data are strongly related and which are independent or uncorrelated.
In one possible implementation, S109 is specifically:
The connectivity between the current sample and the remaining samples is calculated by the following formula:
$$k_i=\sum_{j=1,\,j\neq i}^{m} a_{ij}$$
where $k_i$ indicates the connectivity of the i-th sample to the remaining samples.
In the invention, calculating the connectivity between the sample and other samples is helpful for understanding the association degree between the samples, and provides important information and basis for the clustering algorithm, thereby improving the accuracy and the interpretability of the clustering result.
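A sketch of the connectivity computation, assuming that the connectivity of a sample is the sum of its soft-thresholded correlations with all other samples, with the sample itself excluded. The function name is hypothetical.

```python
import numpy as np

def connectivity(a):
    """k_i = sum over j != i of a_ij, for a square matrix a."""
    a = np.asarray(a, dtype=float)
    return a.sum(axis=1) - np.diag(a)  # drop the self-correlation term
```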
S110: Discretize the connectivity through a Histogram algorithm, and calculate the discretized connectivity and the corresponding probability to obtain a distance matrix between samples.
The Histogram algorithm is a method for discretizing data: it maps continuous data into discrete intervals, thereby quantizing the data into a finite set of values.
In one possible implementation, S110 is specifically:
The probability under the discretized connectivity is calculated by the following formula:
$$\log_{10}(p(k_i)) \propto \frac{1}{\log_{10}(k_i)}$$
where $p(k_i)$ represents the probability at connectivity $k_i$.
According to the invention, the connectivity is discretized through a Histogram algorithm, and the probability is calculated, so that the original continuous connectivity data can be converted into the discretized characteristic representation, and effective information is provided for the construction of the distance matrix between samples, thereby providing beneficial support and guidance for subsequent data processing and clustering tasks.
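The discretization step can be sketched with a plain histogram. The sketch only bins the connectivity values and estimates the probability of each bin; the text does not spell out how the distance matrix is derived from these probabilities, so that part is not attempted. The function name and the default number of bins are assumptions.

```python
import numpy as np

def connectivity_probabilities(k, bins=10):
    """Bin connectivity values and return (bin centres, empirical probabilities)."""
    counts, edges = np.histogram(np.asarray(k, dtype=float), bins=bins)
    p = counts / counts.sum()                # empirical probability per bin
    centres = (edges[:-1] + edges[1:]) / 2   # discretised connectivity values
    return centres, p
```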
S111: Pre-cluster the sample data under the mRNA data view, the microRNA data view and the Image data view by a K-means++ clustering algorithm to obtain pre-clustering information.
Optionally, the number of clusters K of the K-means++ clustering algorithm takes values in the range K ∈ [2, 10].
In the invention, pre-clustering the sample data under the different data views with the K-means++ clustering algorithm converts high-dimensional data into low-dimensional cluster labels, reveals structures and patterns in the data, and provides useful information and a basis for the subsequent clustering and data fusion.
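As an illustration of the pre-clustering step, the sketch below implements K-means++ seeding followed by ordinary Lloyd iterations in NumPy. It is a simplified stand-in rather than the patent's implementation: the function name, the iteration cap and the random seed are assumptions, and the sketch assumes the samples are not all identical.

```python
import numpy as np

def kmeans_pp(X, k, n_iter=50, seed=0):
    """K-means with k-means++ seeding. Returns (labels, centres)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Seeding: first centre uniformly at random, each further centre with
    # probability proportional to the squared distance to the nearest centre.
    centres = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centres = np.array(centres)
    # Lloyd iterations: assign to the nearest centre, then recompute the means.
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centres[c] for c in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres
```

The labels returned for each view play the role of the pre-clustering information used later to build the kernel matrix.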
S112: Convert the inter-sample distance matrix into an inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view.
In one possible implementation, S112 is specifically:
S1121: The inter-sample distance is converted to an inter-sample similarity by the following equation:
where $w_{ij}$ represents the similarity between the i-th sample and the j-th sample, $d_{ij}$ represents the distance between the i-th sample and the j-th sample, $\varepsilon_{ij}$ represents the adaptive parameter between the i-th sample and the j-th sample, $\mathrm{mean}(d_i, N_i)$ represents the average of the distances between the i-th sample and its $N_i$ other samples, and $\mathrm{mean}(d_j, N_j)$ represents the average of the distances between the j-th sample and its $N_j$ other samples.
S1122: The inter-sample similarity matrix P is constructed according to the following formula:
where $P_{ij}$ represents the element in the i-th row and j-th column of the inter-sample similarity matrix, and $w_{ik}$ represents the similarity between the i-th sample and the k-th sample.
In the invention, the conversion of the inter-sample distance matrix into the inter-sample similarity matrix is helpful to better describe the similarity and correlation between samples, flexibly adjust the similarity calculation, and provide more meaningful input and basis for subsequent data analysis and fusion.
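The formulas for this step are not reproduced in the text, so the following is only a sketch under a loudly stated assumption: it uses the scaled exponential kernel and row normalization known from similarity network fusion, whose symbols match the definitions given here. The adaptive parameter is taken as the average of the two neighborhood means and the pairwise distance, and the scale `mu`, the neighborhood size `n_neighbors` and both function names are hypothetical.

```python
import numpy as np

def distance_to_similarity(D, n_neighbors=3, mu=0.5):
    """Assumed form: w_ij = exp(-d_ij^2 / (mu * eps_ij))."""
    D = np.asarray(D, dtype=float)
    m = len(D)
    # Mean distance from each sample to its n_neighbors nearest other samples
    # (n_neighbors must not exceed m - 1).
    masked = D + np.diag(np.full(m, np.inf))  # ignore self-distances
    mean_d = np.sort(masked, axis=1)[:, :n_neighbors].mean(axis=1)
    # Adaptive parameter eps_ij; assumes not all distances are zero.
    eps = (mean_d[:, None] + mean_d[None, :] + D) / 3.0
    return np.exp(-(D ** 2) / (mu * eps))

def normalise_similarity(W):
    """Assumed form: P_ii = 1/2, P_ij = w_ij / (2 * sum over k != i of w_ik)."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))  # zero out the diagonal
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P
```

Under this normalization every row of P sums to 1, with half of the mass kept on the sample itself.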
S113: Construct a kernel matrix under the mRNA data view, the microRNA data view and the Image data view according to the pre-clustering information.
In one possible implementation, S113 is specifically:
The kernel matrix S is constructed by the following formula:
where $S_{ij}$ represents the element in the i-th row and j-th column of the kernel matrix, and $C_i$ represents the class of the i-th sample.
In the invention, in multi-modal data clustering, each data view provides different types of characteristic information, and the construction of a kernel matrix can integrate the characteristic information, so that the relevance among samples is more comprehensively described. By constructing the kernel matrix, the sample similarity information under different data views can be integrated, the clustering accuracy and stability are improved, and the calculation complexity is reduced, so that the method plays an important role in multi-mode data clustering.
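The kernel matrix formula is likewise not reproduced in the text. One plausible reading, offered here purely as an assumption, is that similarities are kept only between samples that share a pre-cluster and are then normalized per row. The function name and the exact normalization are hypothetical.

```python
import numpy as np

def kernel_from_clusters(W, labels):
    """Assumed form: S_ij = w_ij / sum_{k in C_i} w_ik if j in C_i, else 0."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]  # True where i and j share a cluster
    masked = np.where(same, W, 0.0)
    # Row sums are positive as long as each sample has w_ii > 0.
    return masked / masked.sum(axis=1, keepdims=True)
```

Because the matrix is sparse outside each pre-cluster, it restricts the later iteration to locally reliable similarities.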
S114: Iterate the inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view according to the kernel matrix.
In one possible implementation, S114 is specifically:
The inter-sample similarity matrix is iterated by the following formula:
where $P_v$ represents the inter-sample similarity matrix under the v-th view, $P_k$ represents the inter-sample similarity matrix under the k-th view, $S_v$ represents the kernel matrix under the v-th view, and $(\cdot)^T$ represents the matrix transpose.
According to the invention, the sample similarity information under different views can be fused by iteratively updating the sample similarity matrix, so that a more comprehensive sample similarity matrix is obtained. By doing so, the information provided by different views can be fully utilized, and the performance of a clustering algorithm is improved. In the iterative process, the sample similarity matrix continuously approaches the kernel matrix, which is equivalent to transmitting sample similarity information under different views to other views. By doing so, information loss among different views can be made up, and the accuracy of sample clustering can be improved.
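Since the update formula is missing from the text, the sketch below assumes a cross-view diffusion of the kind used in similarity network fusion: each view's similarity matrix is replaced by its kernel matrix times the average of the other views' similarity matrices times the transposed kernel matrix. The function name, the fixed iteration count and the update form itself are assumptions.

```python
import numpy as np

def cross_view_iteration(P_views, S_views, n_iter=20):
    """Assumed update: P_v <- S_v @ mean(P_k for k != v) @ S_v.T (needs >= 2 views)."""
    P = [np.asarray(p, dtype=float).copy() for p in P_views]
    S = [np.asarray(s, dtype=float) for s in S_views]
    n = len(P)
    for _ in range(n_iter):
        updated = []
        for v in range(n):
            # Average similarity over all views except v.
            others = sum(P[k] for k in range(n) if k != v) / (n - 1)
            updated.append(S[v] @ others @ S[v].T)
        P = updated  # all views are updated simultaneously
    return P
```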
S115: Integrate the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix.
In one possible implementation, S115 is specifically:
The inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view are integrated into the inter-sample similarity fusion matrix by the following formula:
where $P$ represents the inter-sample similarity fusion matrix, and $P_v$ represents the inter-sample similarity matrix under the v-th view.
In the present invention, the different views provide respective characteristic information. By integrating the sample similarity matrixes under different views, the information of the different views can be fused together, so that more comprehensive and richer sample similarity information is obtained. By doing so, the understanding and judging capability of the clustering algorithm on the relevance among samples can be improved, so that the clustering performance is improved. The sample similarity matrix under different views may reflect the similarity relationships of the different aspects. By integrating the similarity matrices of these views, the similarity information for the different aspects can be integrated, thereby describing the similarity between samples more fully. This helps to overcome the limitations and biases that may exist for a single view, improving the stability and robustness of the clustering algorithm.
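The integration formula is not shown in the text. A simple assumption, used here only for illustration, is an unweighted average of the per-view similarity matrices; the function name is hypothetical.

```python
import numpy as np

def fuse_views(P_views):
    """Assumed fusion: element-wise mean of the per-view similarity matrices."""
    return np.mean(np.stack([np.asarray(p, dtype=float) for p in P_views]), axis=0)
```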
S116: Cluster the samples according to the inter-sample similarity fusion matrix by a spectral clustering algorithm.
The spectral clustering algorithm is an unsupervised clustering algorithm based on graph theory and spectral theory. It represents the sample data as a graph, partitions the samples using the eigenvectors of the graph, and thereby assigns similar samples to the same category.
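The idea can be sketched as follows, assuming the common normalized-Laplacian variant of spectral clustering: embed each sample using the eigenvectors with the smallest eigenvalues, then group the embedded points with k-means. The helper names, the farthest-point initialization and the toy affinity matrix are illustrative assumptions, not details given in the source.

```python
import numpy as np


def spectral_embedding(affinity, n_clusters):
    """Embed samples with the eigenvectors of the normalized graph Laplacian."""
    w = (np.asarray(affinity, dtype=float) + np.asarray(affinity, dtype=float).T) / 2.0
    degree = w.sum(axis=1)  # assumed strictly positive
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    embedding = eigvecs[:, :n_clusters]
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    return embedding / np.maximum(norms, 1e-12)


def kmeans(points, n_clusters, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Toy fused similarity matrix: samples 0-2 and samples 3-5 form two tight groups.
affinity = np.full((6, 6), 0.05)
affinity[:3, :3] = 0.9
affinity[3:, 3:] = 0.9
np.fill_diagonal(affinity, 1.0)
labels = kmeans(spectral_embedding(affinity, 2), 2)
```

In practice a library routine that accepts a precomputed affinity matrix could replace both helpers; the hand-written version is kept only to make each step visible.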
Compared with the prior art, the invention has at least the following beneficial technical effects:
According to the invention, transcriptome data and histopathological images are combined: edge features are extracted from the histopathological images, and mRNA features and miRNA features are extracted from the transcriptome data. These features undergo multi-modal fusion to obtain an inter-sample similarity fusion matrix, and automatic clustering is then performed according to that matrix. For the transcriptome data, only the mRNA features and miRNA features need to be considered, which reduces the data dimension and the complexity of the clustering algorithm. The multi-modal analysis reduces the influence of subjective factors in the disease evaluation process and improves the accuracy of the clustering evaluation.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the invention. Although they are described in specific detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A method for clustering multi-modal data, comprising:
S101: obtaining a sample dataset comprising a plurality of sample data, each sample data comprising image data and transcriptome data;
S102: filtering the image data through a bilateral filter;
S103: introducing a Sobel operator, and calculating gradient information of pixel points in the filtered image data, wherein the gradient information comprises gradient strength and gradient direction;
S104: when a plurality of gradient information exists in the image data after the filtering processing, retaining maximum-value pixel points and suppressing non-maximum-value pixel points;
S105: denoising the sample data after non-maximum suppression to obtain edge features of the image data;
S106: extracting differential features of the transcriptome data, the differential features including mRNA features and miRNA features;
S107: calculating a correlation coefficient matrix of each sample data according to the edge features, the mRNA features and the miRNA features of the sample data;
S108: performing nonlinear mapping on the correlation coefficient matrix using a soft threshold;
S109: calculating the connectivity of each sample with the remaining samples;
S110: discretizing the connectivity through a Histogram algorithm, and calculating the discretized connectivity and the corresponding probability to obtain an inter-sample distance matrix;
S111: pre-clustering sample data under an mRNA data view, a microRNA data view and an Image data view by a K-means++ clustering algorithm to obtain pre-clustering information;
S112: converting the inter-sample distance matrix into an inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view;
S113: constructing, according to the pre-clustering information, a kernel matrix under the mRNA data view, the microRNA data view and the Image data view;
S114: iterating, according to the kernel matrix, the inter-sample similarity matrix under the mRNA data view, the microRNA data view and the Image data view;
S115: integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain an inter-sample similarity fusion matrix;
S116: clustering the samples according to the inter-sample similarity fusion matrix by a spectral clustering algorithm.
2. The multi-modal data clustering method according to claim 1, wherein S102 specifically includes:
S1021: converting the sample data into a pixel matrix;
S1022: performing non-linear fusion of the current pixel point with the pixel points in its surrounding neighborhood of radius 1 pixel:
where g(i, j) represents the non-linearly fused pixel value at the current pixel point (i, j), S(i, j) represents the set of pixel points within a neighborhood of radius 1 pixel around the current pixel point (i, j), (k, l) represents the coordinates of a pixel point around the current pixel point (i, j), f(k, l) represents the gray value at the pixel point (k, l), and w(i, j, k, l) represents the weight parameter between the current pixel point (i, j) and the pixel point (k, l);
the weight parameter w(i, j, k, l) between the current pixel point (i, j) and the pixel point (k, l) is calculated as follows:
w(i, j, k, l) = d(i, j, k, l) · r(i, j, k, l)
where d(i, j, k, l) represents the spatial domain weight between the current pixel point (i, j) and the pixel point (k, l), r(i, j, k, l) represents the pixel domain weight between the current pixel point (i, j) and the pixel point (k, l), σ_d represents the spatial domain standard deviation, and σ_r represents the pixel domain standard deviation.
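The formulas for g, d and r are not reproduced in this text. The sketch below assumes the standard bilateral filter, which matches the symbols defined in this claim: g(i, j) = Σ f(k, l)·w(i, j, k, l) / Σ w(i, j, k, l) over the neighborhood S(i, j), with Gaussian weights d = exp(-((i-k)² + (j-l)²) / (2σ_d²)) and r = exp(-(f(i, j) - f(k, l))² / (2σ_r²)). The function name, the default standard deviations and the truncation of the neighborhood at the image border are assumptions.

```python
import numpy as np


def bilateral_filter_r1(image, sigma_d=1.0, sigma_r=25.0):
    """Bilateral filter over a radius-1 (3x3) neighborhood; borders use a truncated window."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    out = np.empty_like(img)
    for i in range(rows):
        for j in range(cols):
            num = 0.0
            den = 0.0
            for k in range(max(i - 1, 0), min(i + 2, rows)):
                for l in range(max(j - 1, 0), min(j + 2, cols)):
                    # Spatial domain weight d(i, j, k, l)
                    d = np.exp(-((i - k) ** 2 + (j - l) ** 2) / (2 * sigma_d ** 2))
                    # Pixel domain weight r(i, j, k, l)
                    r = np.exp(-((img[i, j] - img[k, l]) ** 2) / (2 * sigma_r ** 2))
                    w = d * r
                    num += img[k, l] * w
                    den += w
            out[i, j] = num / den
    return out


# A sharp vertical edge: with a small sigma_r the edge is left essentially untouched.
step = np.array([[0, 0, 255, 255]] * 4, dtype=float)
smoothed = bilateral_filter_r1(step, sigma_d=1.0, sigma_r=10.0)
```

Because the pixel domain weight collapses across large gray-value differences, smoothing acts within flat regions while edges survive, which is the property the later edge-extraction steps rely on.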
3. The multi-modal data clustering method according to claim 1, wherein S103 specifically includes:
S1031: introducing a Sobel operator, and calculating a horizontal feature matrix S_x and a vertical feature matrix S_y of the sample data after filtering;
S1032: calculating, according to the horizontal feature matrix S_x and the vertical feature matrix S_y, the horizontal gradient G_x and the vertical gradient G_y:
G_x = S_x I
G_y = S_y I
where I represents the gray value matrix of the sample data after the filtering processing;
S1033: calculating the gradient intensity G and the gradient direction θ of the pixel points:
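The formulas for G and θ are not reproduced in this text. The sketch below assumes the usual Sobel definitions: S_x and S_y are the standard 3x3 Sobel kernels, the products S_x I and S_y I are interpreted as sliding-window filtering of the image, G = sqrt(G_x² + G_y²) and θ = arctan(G_y / G_x). The helper names and the edge-replicated padding are assumptions.

```python
import numpy as np

# Standard 3x3 Sobel kernels (assumed to be what S_x and S_y denote).
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)


def filter3x3(image, kernel):
    """Slide a 3x3 kernel over the image (correlation) with edge-replicated padding."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    out = np.zeros((rows, cols))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def sobel_gradients(image):
    """Return gradient strength G and direction theta (radians) for every pixel."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy), np.arctan2(gy, gx)


# Vertical step edge: the gradient is purely horizontal at the two middle columns.
edge = np.array([[0, 0, 1, 1]] * 4, dtype=float)
strength, direction = sobel_gradients(edge)
```

`np.arctan2` is used instead of a plain arctangent so that the direction stays defined where G_x is zero.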
4. The multi-modal data clustering method according to claim 1, wherein S105 specifically includes:
S1051: setting a high threshold and a low threshold;
S1052: when the edge pixel gradient value of the pixel point is larger than the high threshold, determining that the pixel point is a strong edge pixel point;
S1053: when the edge pixel gradient value of the pixel point is between the low threshold and the high threshold, determining that the pixel point is a weak edge pixel point;
S1054: when the edge pixel gradient value of the pixel point is smaller than the low threshold, suppressing the pixel point;
the high threshold TH and the low threshold TL are determined as follows:
TH = 0.5 max(H*)
TL = 0.1 max(H*)
where H* denotes the pixel matrix after non-maximum suppression, and max(H*) denotes the maximum value in the pixel matrix after non-maximum suppression.
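The thresholding in this claim can be sketched directly from TH = 0.5 max(H*) and TL = 0.1 max(H*). The function name is hypothetical, and treating values exactly equal to a threshold as weak edges is an assumption, since the claim does not say how boundary values are handled.

```python
import numpy as np


def double_threshold(nms):
    """Classify pixels of the non-maximum-suppressed matrix H* as strong or weak edges."""
    h_star = np.asarray(nms, dtype=float)
    high = 0.5 * h_star.max()  # TH
    low = 0.1 * h_star.max()   # TL
    strong = h_star > high
    # Boundary values are counted as weak edges (assumption).
    weak = (h_star >= low) & (h_star <= high)
    # Everything below the low threshold is suppressed (neither strong nor weak).
    return strong, weak, high, low


h_star = np.array([[10.0, 4.0, 0.5],
                   [5.0, 2.0, 0.0]])
strong, weak, high, low = double_threshold(h_star)
```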
5. The multi-modal data clustering method according to claim 1, wherein S108 is specifically:
the correlation coefficient matrix is mapped non-linearly by the following formula:
a_ij = |a_ij|^β
S_{m×m} = [a_ij]_{m×m}
where S represents the correlation coefficient matrix, a_ij represents the correlation coefficient between the i-th sample and the j-th sample, β represents the soft threshold, and m represents the number of samples.
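A minimal sketch of the mapping a_ij = |a_ij|^β follows; the function name and the example value β = 2 are illustrative assumptions.

```python
import numpy as np


def soft_threshold(corr, beta):
    """Element-wise power mapping a_ij -> |a_ij| ** beta of a correlation matrix."""
    return np.abs(np.asarray(corr, dtype=float)) ** beta


corr = np.array([[1.0, -0.5],
                 [-0.5, 1.0]])
mapped = soft_threshold(corr, beta=2)
```

Raising the absolute correlation to a power greater than 1 shrinks weak correlations much more than strong ones, so the mapping suppresses noise without a hard cut-off.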
6. The multi-modal data clustering method according to claim 1, wherein S109 is specifically:
calculating the connectivity of the current sample with the remaining samples by the following formula:
where k_i represents the connectivity of the i-th sample with the remaining samples.
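The connectivity formula is not reproduced in this text. The sketch below assumes the common definition from weighted network analysis, k_i = Σ_{j≠i} a_ij, that is, the sum of the mapped correlation coefficients between sample i and all other samples. The function name and the example matrix are hypothetical.

```python
import numpy as np


def connectivity(adjacency):
    """Row sums of the mapped correlation matrix, excluding each sample's self term."""
    a = np.asarray(adjacency, dtype=float)
    return a.sum(axis=1) - np.diag(a)


adjacency = np.array([[1.0, 0.25, 0.5],
                      [0.25, 1.0, 0.1],
                      [0.5, 0.1, 1.0]])
k = connectivity(adjacency)
```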
7. The multi-modal data clustering method according to claim 1, wherein S112 specifically includes:
S1121: converting the inter-sample distance into an inter-sample similarity by the following formula:
where w_ij represents the similarity between the i-th sample and the j-th sample, d_ij represents the distance between the i-th sample and the j-th sample, ε_ij represents the adaptive parameter between the i-th sample and the j-th sample, mean(d_i, N_i) represents the average distance between the i-th sample and the other N_i samples, and mean(d_j, N_j) represents the average distance between the j-th sample and the other N_j samples;
S1122: constructing the inter-sample similarity matrix P according to the following formula:
where P_ij represents the element value in the i-th row and j-th column of the inter-sample similarity matrix, and w_ik represents the similarity between the i-th sample and the k-th sample.
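Neither formula of this claim is reproduced in this text. The symbols resemble the scaled exponential kernel used in similarity network fusion, so the sketch below assumes that form: w_ij = exp(-d_ij² / (μ ε_ij)) with ε_ij = (mean(d_i, N_i) + mean(d_j, N_j) + d_ij) / 3, and P_ij = w_ij / (2 Σ_{k≠i} w_ik) for j ≠ i with P_ii = 1/2. The hyperparameter μ, the choice of N_i as the nearest other samples, and all function names are assumptions.

```python
import numpy as np


def scaled_exp_similarity(dist, n_neighbors=2, mu=0.5):
    """Turn a distance matrix into similarities with a locally adaptive bandwidth."""
    d = np.asarray(dist, dtype=float)
    # Mean distance from each sample to its nearest other samples
    # (column 0 of the sorted rows is the zero self-distance and is skipped).
    mean_nn = np.sort(d, axis=1)[:, 1:n_neighbors + 1].mean(axis=1)
    eps = (mean_nn[:, None] + mean_nn[None, :] + d) / 3.0
    return np.exp(-d ** 2 / (mu * eps))


def normalize_similarity(w):
    """Row-normalize so off-diagonal entries sum to 1/2 and the diagonal is 1/2."""
    w = np.asarray(w, dtype=float)
    off_diag = w - np.diag(np.diag(w))
    p = off_diag / (2.0 * off_diag.sum(axis=1, keepdims=True))
    np.fill_diagonal(p, 0.5)
    return p


dist = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 3.0],
                 [4.0, 3.0, 0.0]])
w = scaled_exp_similarity(dist)
p = normalize_similarity(w)
```

Fixing the diagonal at 1/2 keeps each row a probability distribution while preventing the self-similarity from dominating the normalization.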
8. The multi-modal data clustering method according to claim 1, wherein S113 is specifically:
the kernel matrix S is constructed by the following formula:
where S_ij represents the element value in the i-th row and j-th column of the kernel matrix, and C_i represents the category of the i-th sample.
9. The multi-modal data clustering method according to claim 1, wherein S114 is specifically:
iterating the inter-sample similarity matrix by the following formula:
where P_v represents the inter-sample similarity matrix under the v-th view, P_k represents the inter-sample similarity matrix under the k-th view, S_v represents the kernel matrix under the v-th view, and (·)^T represents the matrix transpose.
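The iteration formula is not reproduced in this text. Based on the symbols defined here, the sketch below assumes the cross-view diffusion update used in similarity network fusion: P_v ← S_v · (Σ_{k≠v} P_k / (V - 1)) · S_v^T, applied to all views simultaneously. The function name, the iteration count and the identity kernels in the example are placeholders, not values from the source.

```python
import numpy as np


def diffuse_views(p_views, s_views, n_iter=10):
    """Propagate each view's similarities through the other views via its kernel matrix."""
    p = [np.asarray(x, dtype=float) for x in p_views]
    s = [np.asarray(x, dtype=float) for x in s_views]
    n_views = len(p)
    for _ in range(n_iter):
        updated = []
        for v in range(n_views):
            # Average similarity matrix of all views except v.
            others = sum(p[k] for k in range(n_views) if k != v) / (n_views - 1)
            updated.append(s[v] @ others @ s[v].T)
        p = updated  # all views are updated from the previous iteration's values
    return p


p_a = np.array([[0.5, 0.5], [0.5, 0.5]])
p_b = np.array([[0.8, 0.2], [0.2, 0.8]])
p_c = np.array([[0.6, 0.4], [0.4, 0.6]])
identity = np.eye(2)  # placeholder kernels; real kernels come from the pre-clustering step
result = diffuse_views([p_a, p_b, p_c], [identity] * 3, n_iter=1)
```

With identity kernels one iteration simply replaces each view by the mean of the other two, which makes the information flow between views easy to check by hand.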
10. The multi-modal data clustering method according to claim 1, wherein S115 is specifically:
the inter-sample similarity fusion matrix is obtained by integrating the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view through the following formula:
where P represents the inter-sample similarity fusion matrix, and P_v represents the inter-sample similarity matrix under the v-th view.
CN202310975304.9A 2023-08-03 2023-08-03 Multi-mode data clustering method Active CN117036762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310975304.9A CN117036762B (en) 2023-08-03 2023-08-03 Multi-mode data clustering method

Publications (2)

Publication Number Publication Date
CN117036762A (en) 2023-11-10
CN117036762B (en) 2024-03-22

Family

ID=88629217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310975304.9A Active CN117036762B (en) 2023-08-03 2023-08-03 Multi-mode data clustering method

Country Status (1)

Country Link
CN (1) CN117036762B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200152289A1 (en) * 2018-11-09 2020-05-14 The Broad Institute, Inc. Compressed sensing for screening and tissue imaging
WO2022178274A1 (en) * 2021-02-19 2022-08-25 The Broad Institute, Inc. Multi-scale spatial transcriptomics analysis
WO2022270725A1 (en) * 2021-06-25 2022-12-29 주식회사 포트래이 Method for searching for intra-tissue molecular marker associated with distribution or physiological activity of probe or for physiological activity information
CN115631361A (en) * 2022-10-12 2023-01-20 山西大学 Image clustering method fusing low-rank kernel learning and self-adaptive hypergraph
CN115641957A (en) * 2022-11-11 2023-01-24 广州大学 New auxiliary chemotherapy curative effect prediction method and system based on image genomics
US20230046438A1 (en) * 2020-01-14 2023-02-16 Peking University Method for predicting cell spatial relation based on single-cell transcriptome sequencing data
CN116312782A (en) * 2023-05-18 2023-06-23 南京航空航天大学 Spatial transcriptome spot region clustering method fusing image gene data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
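Tan Jun; Yuan Shaoxun; Ming Wenlong; Sun Xiao: "Research progress on analysis methods in imaging genomics" (影像基因组学分析方法研究进展), Current Biotechnology (生物技术进展), no. 04, pages 7-13 *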

Also Published As

Publication number Publication date
CN117036762B (en) 2024-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant