CN117036762A

CN117036762A - Multi-mode data clustering method

Info

Publication number: CN117036762A
Application number: CN202310975304.9A
Authority: CN
Inventors: 艾冬梅; 陈露露; 王艺舒
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2023-08-03
Filing date: 2023-08-03
Publication date: 2023-11-10
Anticipated expiration: 2043-08-03
Also published as: CN117036762B

Abstract

The invention discloses a multi-modal data clustering method in the technical field of data processing. The method comprises: obtaining a sample data set; extracting edge features of the image data; extracting differential features of the transcriptome data, the differential features including mRNA features and miRNA features; calculating a correlation coefficient matrix for the sample data; applying a nonlinear mapping to the correlation coefficient matrix using a soft threshold; calculating the connectivity between each sample and the remaining samples; calculating the discretized connectivity and the corresponding probabilities to obtain an inter-sample distance matrix; pre-clustering the sample data with the K-means++ clustering algorithm; converting the inter-sample distance matrix into an inter-sample similarity matrix; constructing a kernel matrix from the pre-clustering information; iterating the inter-sample similarity matrix according to the kernel matrix; integrating the inter-sample similarity matrices of the mRNA data view, the microRNA data view and the Image data view into a fused inter-sample similarity matrix; and clustering the samples with a spectral clustering algorithm.

Description

A multimodal data clustering method

Technical field

The invention belongs to the field of data processing technology and specifically relates to a multi-modal data clustering method.

Background

In clinical diagnosis, differences in how patients with the same tumor respond to treatment are often caused by tumor heterogeneity. Many studies have demonstrated the existence of tumor heterogeneity, which may be attributed to mutations arising during tumor cell proliferation and differentiation. Tumor heterogeneity ultimately translates into phenotypic differences: not only do different patients with the same tumor respond differently to the same drug treatment, but the biomarkers in each patient's tumor microenvironment also differ.

On the one hand, transcriptome data is a key biomarker for different tumor subtypes. In 2015, Rikke Karlin Jepsen et al. analyzed microRNA expression data from colorectal cancer samples and found that microRNA-92a, microRNA-375 and microRNA-424 are differentially expressed across colorectal cancer subtypes. Transcriptome data reflects gene expression within cells and provides a large amount of expression information, including the expression levels of genes under different conditions, which can reveal differences in cell function, metabolic pathways, signaling pathways and so on. Transcriptome data usually has high-dimensional feature vectors, which allows cluster analysis to take more gene expression changes into account and helps uncover subtle differences. However, because of this high dimensionality, the computational complexity of clustering algorithms can increase when processing large-scale data.

On the other hand, histopathological images play an important role in the early identification and diagnosis of cancer, and the analysis of pathological images has been used in cancer diagnosis for many years. Kowal et al. compared and tested different algorithms for cell nucleus segmentation and, by analyzing a data set of cancer images, determined whether a patient's tumor was benign with an accuracy above 96%. Histopathological images display the morphology and structure of tissue cells directly, which helps doctors and pathologists quickly observe the characteristics of a sample and spot potential abnormalities or pathological changes. However, interpreting and clustering histopathological images usually requires the subjective judgment of a professional pathologist and may therefore be affected by individual differences and subjective experience. In addition, obtaining high-quality histopathological images requires tissue sectioning, staining and other processing, which is costly and time-consuming.

In summary, transcriptome data and histopathological images each have their own advantages and disadvantages when clustering cancer samples.

Summary of the invention

To solve the technical problems in the prior art that transcriptome data is high-dimensional, that clustering algorithms have high computational complexity, that histopathological images are susceptible to subjective factors, and that clustering accuracy is poor, the present invention provides a multi-modal data clustering method.

First aspect

The present invention provides a multi-modal data clustering method, comprising:

S101: Obtain a sample data set. The sample data set includes multiple sample data, and each sample data includes image data and transcriptome data;

S102: Filter the image data with a bilateral filter;

S103: Introduce the Sobel operator and calculate the gradient information of the pixels in the filtered image data, the gradient information including gradient intensity and gradient direction;

S104: When the filtered image data contains multiple pieces of gradient information, retain the maximum pixels and suppress the non-maximum pixels;

S105: Denoise the sample data after non-maximum suppression to obtain the edge features of the image data;

S106: Extract the differential features of the transcriptome data, the differential features including mRNA features and miRNA features;

S107: Calculate the correlation coefficient matrix of the sample data from the edge features, mRNA features and miRNA features of the sample data;

S108: Apply a nonlinear mapping to the correlation coefficient matrix using a soft threshold;

S109: Calculate the connectivity between each sample and the remaining samples;

S110: Discretize the connectivity with the Histogram algorithm, and calculate the discretized connectivity and the corresponding probabilities to obtain the inter-sample distance matrix;

S111: Pre-cluster the sample data in the mRNA data view, the microRNA data view and the Image data view with the K-means++ clustering algorithm to obtain pre-clustering information;

S112: Convert the inter-sample distance matrix into an inter-sample similarity matrix in the mRNA data view, the microRNA data view and the Image data view;

S113: From the pre-clustering information, construct the kernel matrices for the mRNA data view, the microRNA data view and the Image data view;

S114: According to the kernel matrices, iterate the inter-sample similarity matrices of the mRNA data view, the microRNA data view and the Image data view;

S115: Combine the inter-sample similarity matrices of the mRNA data view, the microRNA data view and the Image data view to obtain the fused inter-sample similarity matrix;

S116: Cluster the samples with a spectral clustering algorithm according to the fused inter-sample similarity matrix.

Compared with the prior art, the present invention has at least the following beneficial technical effects:

In the present invention, transcriptome data and histopathological images are combined: the edge features of the histopathological images and the mRNA and miRNA features of the transcriptome data are extracted and fused across modalities to obtain a fused inter-sample similarity matrix, from which clustering is performed automatically. For the transcriptome data, only the mRNA and miRNA features need to be considered, which reduces the data dimensionality and the complexity of the clustering algorithm. The method also reduces the influence of subjective factors in disease assessment and improves the accuracy of clustering through multi-modal analysis.

Brief description of the drawings

The above characteristics, technical features, advantages and implementations of the present invention are further explained below, in a clear and easily understood manner, through preferred embodiments described with reference to the accompanying drawings.

Figure 1 is a schematic flow chart of a multi-modal data clustering method provided by the present invention.

Detailed description

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, specific embodiments of the invention are described below with reference to the accompanying drawings. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings and other embodiments from them without creative effort.

To keep the drawings concise, each figure shows only the parts related to the invention schematically, and these do not represent the actual structure of a product. In addition, to keep the drawings concise and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. In this document, "a" or "one" means not only "exactly one" but may also mean "more than one".

It should be further understood that the term "and/or", as used in the specification and the appended claims, refers to and includes any and all possible combinations of one or more of the associated listed items.

Unless otherwise expressly specified and limited, the terms "mounted", "connected" and "coupled" are to be understood broadly in this document. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or an internal communication between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.

In addition, in the description of the present invention, the terms "first", "second" and so on are used only to distinguish the items described and are not to be understood as indicating or implying relative importance.

Example 1

In one embodiment, Figure 1 of the description shows a schematic flow chart of the multi-modal data clustering method provided by the present invention.

The multi-modal data clustering method provided by the present invention includes:

S101: Obtain a sample data set.

The sample data set includes multiple sample data, and each sample data includes image data and transcriptome data.

Specifically, the sample data may be gastrointestinal cancer sample data.

S102: Filter the image data with a bilateral filter.

In a possible implementation, S102 specifically includes:

S1021: Convert the sample data into a pixel matrix.

S1022: Nonlinearly fuse the current pixel with the pixels in its neighborhood of radius 1 pixel:

where g(i,j) denotes the pixel value after nonlinear fusion at the current pixel (i,j), S(i,j) denotes the set of pixels in the neighborhood of radius 1 pixel around the current pixel (i,j), (k,l) denotes the coordinates of a pixel around the current pixel (i,j), f(k,l) denotes the gray value at pixel (k,l), and w(i,j,k,l) denotes the weight parameter between the current pixel (i,j) and the pixel (k,l).

The weight parameter w(i,j,k,l) between the current pixel (i,j) and the pixel (k,l) is calculated as:

w(i,j,k,l) = d(i,j,k,l)·r(i,j,k,l)

where d(i,j,k,l) denotes the spatial-domain weight between the current pixel (i,j) and the pixel (k,l), r(i,j,k,l) denotes the pixel-domain (range) weight between the current pixel (i,j) and the pixel (k,l), σ_d denotes the spatial-domain standard deviation, and σ_r denotes the pixel-domain standard deviation.
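The formulas for g, d and r are not reproduced in this text. A reconstruction consistent with the symbols defined above, assuming the standard bilateral filter form, would be:

```latex
g(i,j) = \frac{\sum_{(k,l)\in S(i,j)} f(k,l)\, w(i,j,k,l)}{\sum_{(k,l)\in S(i,j)} w(i,j,k,l)}

d(i,j,k,l) = \exp\!\left(-\frac{(i-k)^2 + (j-l)^2}{2\sigma_d^2}\right)

r(i,j,k,l) = \exp\!\left(-\frac{\lvert f(i,j) - f(k,l)\rvert^2}{2\sigma_r^2}\right)
```

With these definitions, w = d·r is large only for neighbors that are both spatially close and similar in gray value, which is why edges survive the smoothing.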

In the present invention, the bilateral filter is a nonlinear filter that takes both the gray value and the spatial position of each pixel into account, so it preserves the edge features of the image and keeps the image clear and sharp. The bilateral filter fuses the gray values of the surrounding pixels nonlinearly: the fusion is not a simple weighted average with fixed weights, but is adjusted according to the similarity between pixels. This better preserves the detail and texture information of the image.
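The filtering step can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard Gaussian forms for the spatial and range weights; the function name `bilateral_filter_r1`, the default values of `sigma_d` and `sigma_r`, and the edge-replicating border handling are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def bilateral_filter_r1(f, sigma_d=1.0, sigma_r=25.0):
    """Bilateral filter over a radius-1 (3x3) neighborhood.

    sigma_d and sigma_r defaults are illustrative assumptions.
    """
    f = np.asarray(f, dtype=float)
    h, w = f.shape
    padded = np.pad(f, 1, mode="edge")   # replicate border pixels (assumption)
    num = np.zeros_like(f)               # running sum of w * f(k, l)
    den = np.zeros_like(f)               # running sum of w
    for dk in (-1, 0, 1):
        for dl in (-1, 0, 1):
            neigh = padded[1 + dk:1 + dk + h, 1 + dl:1 + dl + w]
            d = np.exp(-(dk ** 2 + dl ** 2) / (2 * sigma_d ** 2))   # spatial weight
            r = np.exp(-(f - neigh) ** 2 / (2 * sigma_r ** 2))      # range weight
            wgt = d * r
            num += wgt * neigh
            den += wgt
    return num / den
```

With a small `sigma_r`, pixels on opposite sides of a sharp edge receive almost no weight from each other, so the edge is left essentially unchanged while flat regions are smoothed.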

S103: Introduce the Sobel operator and calculate the gradient information of the pixels in the filtered image data, the gradient information including gradient intensity and gradient direction.

The Sobel operator is a commonly used image processing operator for detecting edges. It is a discrete difference operator that finds edges by computing the horizontal and vertical gradients at each pixel.

Gradient information is an important feature in image processing and computer vision. It describes how strongly pixel values change across the image, which makes it possible to locate and identify edges, textures, shapes and other features.

In a possible implementation, S103 specifically includes:

S1031: Introduce the Sobel operator and calculate the horizontal feature matrix S_x and the vertical feature matrix S_y of the filtered sample data:

S1032: From the horizontal feature matrix S_x and the vertical feature matrix S_y, calculate the horizontal gradient G_x and the vertical gradient G_y:

G_x = S_x I

G_y = S_y I

where I denotes the gray value matrix of the filtered sample data.

S1033: Calculate the gradient intensity G and the gradient direction θ of each pixel:
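The matrices of S1031 and the formulas of S1033 are not reproduced in this text. Assuming the conventional 3×3 Sobel kernels (sign conventions vary between references) and the usual magnitude and direction definitions, they would read:

```latex
S_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix},
\qquad
S_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}

G = \sqrt{G_x^{2} + G_y^{2}},
\qquad
\theta = \arctan\!\left(\frac{G_y}{G_x}\right)
```

Under this reading, the products S_x I and S_y I in S1032 denote applying each 3×3 kernel to the neighborhood of every pixel of I.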

In the present invention, introducing the Sobel operator and calculating the horizontal and vertical feature matrices of the filtered image data yields the gradient of the image in the horizontal and vertical directions. The gradient intensity and gradient direction of each pixel are then calculated from these gradients, which provides the basis for edge detection and feature extraction. This gradient information is essential to the subsequent non-maximum suppression and thresholding steps.
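A minimal NumPy sketch of this gradient computation is shown below. It assumes the conventional 3×3 Sobel kernels and edge-replicated borders; the names `correlate3x3` and `sobel_gradient` are hypothetical helpers introduced for illustration.

```python
import numpy as np

# Conventional Sobel kernels (assumed; sign conventions vary between references)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)

def correlate3x3(img, kernel):
    """Apply a 3x3 kernel by cross-correlation with edge-replicated borders."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    for a in range(3):
        for b in range(3):
            out += kernel[a, b] * padded[a:a + h, b:b + w]
    return out

def sobel_gradient(img):
    """Return gradient intensity G and gradient direction theta (radians)."""
    gx = correlate3x3(img, SOBEL_X)
    gy = correlate3x3(img, SOBEL_Y)
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```

`np.arctan2` is used instead of a plain arctangent of G_y/G_x so that pixels with G_x = 0 do not cause a division by zero.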

S104: When the filtered image data contains multiple pieces of gradient information, retain the maximum pixels and suppress the non-maximum pixels.

In a possible implementation, S104 specifically includes:

S1041: Select four preset edge angles: 0, …

S1042: When the gradient intensity of the current pixel is greater than all of the gradients at the preset edge angle, the current pixel is determined to be a maximum pixel and is retained; otherwise, the current pixel is determined to be a non-maximum pixel and is suppressed.

In the present invention, retaining the maximum pixels and suppressing the non-maximum pixels thins the detected edges, strengthens the image features and reduces the influence of noise. The resulting edge information is more accurate and clearer, and provides a better basis for subsequent image processing and analysis.
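Non-maximum suppression can be sketched as below. The list of preset angles is truncated in the text, so this sketch assumes the conventional quantization to 0°, 45°, 90° and 135°, with angles measured with the y axis pointing up; keeping ties (`>=`) is also an illustrative choice, and `non_max_suppression` is a hypothetical name.

```python
import numpy as np

def non_max_suppression(mag, theta):
    """Keep a pixel only if it is a local maximum along its gradient direction.

    Assumes directions are quantized to 0, 45, 90 and 135 degrees.
    """
    mag = np.asarray(mag, dtype=float)
    h, w = mag.shape
    out = np.zeros_like(mag)
    angle = np.rad2deg(np.asarray(theta, dtype=float)) % 180.0
    sector = np.round(angle / 45.0).astype(int) % 4        # 0, 1, 2 or 3
    # (row, col) offset of the neighbor "ahead" for each quantized direction
    offsets = {0: (0, 1), 1: (-1, 1), 2: (-1, 0), 3: (-1, -1)}
    padded = np.pad(mag, 1, mode="constant")                # zeros outside the image
    for i in range(h):
        for j in range(w):
            di, dj = offsets[int(sector[i, j])]
            ahead = padded[i + 1 + di, j + 1 + dj]
            behind = padded[i + 1 - di, j + 1 - dj]
            if mag[i, j] >= ahead and mag[i, j] >= behind:
                out[i, j] = mag[i, j]                       # local maximum: retain
    return out
```

Applied to a blurred ridge, only the crest of the ridge survives, which is the edge-thinning effect described above.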

S105: Denoise the sample data after non-maximum suppression to obtain the edge features of the image data.

In a possible implementation, S105 specifically includes:

S1051: Set a high threshold and a low threshold.

S1052: When the edge gradient value of a pixel is greater than the high threshold, the pixel is determined to be a strong edge pixel.

S1053: When the edge gradient value of a pixel lies between the low threshold and the high threshold, the pixel is determined to be a weak edge pixel.

S1054: When the edge gradient value of a pixel is less than the low threshold, the pixel is suppressed.

The high threshold TH and the low threshold TL are determined as follows:

TH = 0.5·max(H*)

TL = 0.1·max(H*)

where H* denotes the pixel matrix after non-maximum suppression, and max(H*) denotes the maximum value in that matrix.

Denoising the sample data after non-maximum suppression and extracting the edge features of the image data retains the important visual information in the image, which improves the performance and accuracy of the subsequent processing.
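The double-threshold rule of S1051 to S1054 can be sketched as follows. The integer labels (2 for strong, 1 for weak, 0 for suppressed) and the inclusive treatment of values exactly equal to TL or TH are illustrative assumptions; `classify_edges` is a hypothetical name.

```python
import numpy as np

def classify_edges(h_star):
    """Label pixels as strong (2), weak (1) or suppressed (0)."""
    h_star = np.asarray(h_star, dtype=float)
    th = 0.5 * h_star.max()    # high threshold TH = 0.5 * max(H*)
    tl = 0.1 * h_star.max()    # low threshold  TL = 0.1 * max(H*)
    labels = np.zeros(h_star.shape, dtype=int)
    labels[(h_star >= tl) & (h_star <= th)] = 1   # weak edge pixels
    labels[h_star > th] = 2                       # strong edge pixels
    return labels, th, tl
```

Because both thresholds scale with max(H*), the classification adapts to the overall contrast of each image rather than relying on fixed gray-level cutoffs.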

S106: Extract the differential features of the transcriptome data, the differential features including mRNA features and miRNA features.

mRNA (messenger RNA) is a class of RNA molecules involved in gene expression: it carries the genetic information transcribed from DNA and serves as the template that is translated into the amino acid sequence of a protein. mRNA features usually refer to the features used to analyze and describe mRNA expression levels in transcriptomic studies.

miRNA (microRNA) is a class of short non-coding RNA molecules involved in the regulation of gene expression: by binding to its mRNA targets, it suppresses the expression of the target genes. miRNA features usually refer to the features used to analyze miRNA expression and function in miRNA-omics studies.

S107: Calculate the correlation coefficient matrix of the sample data from the edge features, mRNA features and miRNA features of the sample data.

Specifically, Pearson correlation coefficients can be calculated to construct a Pearson correlation coefficient matrix.
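As a minimal sketch, the Pearson correlation matrix between samples can be computed with `numpy.corrcoef`. The layout (one row per sample, one column per feature) and the name `sample_correlation` are assumptions made for illustration.

```python
import numpy as np

def sample_correlation(features):
    """Pearson correlation between samples.

    features: array of shape (m, p), one row per sample (assumed layout).
    Returns an (m, m) matrix whose (i, j) entry is the correlation
    between sample i and sample j.
    """
    # np.corrcoef treats each row as a variable by default (rowvar=True)
    return np.corrcoef(np.asarray(features, dtype=float))
```

In the method described here, this would be evaluated once per data view (mRNA, microRNA and Image), giving one m×m matrix per view.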

S108: Apply a nonlinear mapping to the correlation coefficient matrix using a soft threshold.

In a possible implementation, S108 is specifically:

The correlation coefficient matrix is nonlinearly mapped by the following formulas:

a_ij = |a_ij|^β

S_{m×m} = [a_ij]_{m×m}

where S denotes the correlation coefficient matrix, a_ij denotes the correlation coefficient between the i-th sample and the j-th sample, β denotes the soft threshold, and m denotes the number of samples.

The soft threshold takes values in the range β ∈ [2,20].

In the present invention, soft-threshold mapping sharpens the contrast between related and unrelated samples. Raising the absolute correlation to the power β shrinks weak correlations far more than strong ones, so strongly related samples stand out and group more tightly, forming more clearly separated categories or clusters. Mapping the correlation coefficient matrix nonlinearly with a soft threshold therefore helps improve the accuracy and stability of the clustering algorithm and better reveals the intrinsic relationships between samples and the latent structure of the data.
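The mapping is a single element-wise operation. In the sketch below, the default `beta=6` is an illustrative assumption (the text only gives the range [2, 20]), and `soft_threshold` is a hypothetical name.

```python
import numpy as np

def soft_threshold(corr, beta=6):
    """Element-wise soft-threshold mapping a_ij -> |a_ij| ** beta."""
    if not 2 <= beta <= 20:
        raise ValueError("beta is expected to lie in [2, 20]")
    return np.abs(np.asarray(corr, dtype=float)) ** beta
```

For example, with β = 6 a correlation of 0.9 maps to about 0.53 while a correlation of 0.3 maps to about 0.0007, so the ratio between the two grows from 3 to roughly 730.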

S109: Calculate the connectivity between each sample and the remaining samples.

Connectivity measures the degree of association between samples. Calculating the connectivity between a sample and the remaining samples shows how similar each sample is to the others, and thus which samples in the data are strongly linked and which are relatively independent or unrelated.

In a possible implementation, S109 is specifically:

The connectivity between the current sample and the remaining samples is calculated by the following formula:

where k_i denotes the connectivity between the i-th sample and the remaining samples.

In the present invention, calculating the connectivity between each sample and the remaining samples quantifies the degree of association between samples and gives the clustering algorithm an important basis, thereby improving the accuracy and interpretability of the clustering results.
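The connectivity formula is not reproduced in this text. A minimal sketch, assuming the common weighted-network definition in which k_i is the sum of the soft-thresholded coefficients a_ij over all j ≠ i, could look like this (`connectivity` is a hypothetical name):

```python
import numpy as np

def connectivity(adj):
    """Connectivity k_i = sum over j != i of a_ij (assumed definition).

    adj: (m, m) soft-thresholded correlation matrix.
    """
    adj = np.asarray(adj, dtype=float)
    return adj.sum(axis=1) - np.diag(adj)   # drop the self-term a_ii
```

Under this assumption, a sample that is strongly correlated with many others has high connectivity, while an isolated sample has connectivity close to zero.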

S110: Discretize the connectivity with the Histogram algorithm, and calculate the discretized connectivity and the corresponding probabilities to obtain the inter-sample distance matrix.

The Histogram algorithm is a data discretization method: it maps continuous data into discrete intervals, thereby quantizing the data into a finite set of values.

In a possible implementation, S110 is specifically:

The probability at each discretized connectivity is calculated according to the following relation:

log₁₀(p(k_i)) ∝ 1/log₁₀(k_i)

where p(k_i) denotes the probability at connectivity k_i.

In the present invention, discretizing the connectivity with the Histogram algorithm and calculating the probabilities converts the original continuous connectivity values into a discretized feature representation, which provides the information needed to construct the inter-sample distance matrix and supports the subsequent data processing and clustering steps.
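The discretization and the empirical probabilities can be sketched with `numpy.histogram`. The number of bins and the use of bin centers as the discretized connectivity values are illustrative assumptions; how the distance matrix is then derived from these probabilities is not detailed in the text and is therefore not sketched.

```python
import numpy as np

def discretize_connectivity(k, bins=10):
    """Bin connectivity values and return (bin centers, empirical probabilities)."""
    k = np.asarray(k, dtype=float)
    counts, edges = np.histogram(k, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])   # discretized connectivity values
    probs = counts / counts.sum()              # p(k) for each bin
    return centers, probs
```

Plotting log₁₀ of `probs` against log₁₀ of `centers` for several values of β is one way to inspect the relation stated in S110.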

S111: Pre-cluster the sample data in the mRNA data view, the microRNA data view and the Image data view with the K-means++ clustering algorithm to obtain pre-clustering information.

Optionally, the number of clusters K of the K-means++ clustering algorithm takes values in the range K ∈ [2,10].

In the present invention, pre-clustering the sample data in each data view with the K-means++ clustering algorithm converts the high-dimensional data into low-dimensional cluster labels, reveals structures and patterns in the data, and provides the basis for the subsequent clustering and data fusion.
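Pre-clustering one view can be sketched in plain NumPy as below: k-means++ seeding followed by standard Lloyd iterations. The function name `kmeans_pp`, the iteration cap and the fixed seed are illustrative assumptions; in practice a library implementation such as scikit-learn's `KMeans` with `init="k-means++"` could be used instead.

```python
import numpy as np

def kmeans_pp(x, k, n_iter=50, seed=0):
    """K-means with k-means++ seeding; returns one cluster label per row of x."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # k-means++ seeding: each new center is drawn with probability proportional
    # to the squared distance from the nearest center chosen so far
    centers = [x[rng.integers(len(x))]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):                      # Lloyd iterations
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

Running this once per view (mRNA, microRNA, Image) yields three label vectors, which play the role of the pre-clustering information used in S113.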

S112: Convert the inter-sample distance matrix into an inter-sample similarity matrix in the mRNA data view, the microRNA data view and the Image data view.

In a possible implementation, S112 is specifically:

S1121: Convert the inter-sample distance into the inter-sample similarity by the following formula:

where w_ij denotes the similarity between the i-th sample and the j-th sample, d_ij denotes the distance between the i-th sample and the j-th sample, ε_ij denotes the adaptive parameter between the i-th sample and the j-th sample, mean(d_i,N_i) denotes the mean distance between the i-th sample and the other N_i samples, and mean(d_j,N_j) denotes the mean distance between the j-th sample and the other N_j samples.

S1122: Construct the inter-sample similarity matrix P according to the following formula:

where P_ij denotes the element in row i, column j of the inter-sample similarity matrix, and w_ik denotes the similarity between the i-th sample and the k-th sample.
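The formulas of S1121 and S1122 are not reproduced in this text. The symbols defined above match the scaled exponential kernel and normalization used in Similarity Network Fusion (SNF); assuming that form, they would read as follows, where μ is an SNF scaling hyperparameter that is not among the symbols defined above and is therefore an assumption:

```latex
w_{ij} = \exp\!\left(-\frac{d_{ij}^{2}}{\mu\,\varepsilon_{ij}}\right),
\qquad
\varepsilon_{ij} = \frac{\operatorname{mean}(d_i, N_i) + \operatorname{mean}(d_j, N_j) + d_{ij}}{3}

P_{ij} =
\begin{cases}
\dfrac{w_{ij}}{2\sum_{k \neq i} w_{ik}}, & j \neq i \\[2ex]
\dfrac{1}{2}, & j = i
\end{cases}
```

In this form each row of P sums to 1, and the adaptive parameter ε_ij rescales the distance by the local density around samples i and j.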

In the present invention, converting the inter-sample distance matrix into an inter-sample similarity matrix describes the similarity and correlation between samples more directly, allows the similarity calculation to adapt to the local scale of the data, and provides more meaningful input for the subsequent data analysis and fusion.

S113: From the pre-clustering information, construct the kernel matrices for the mRNA data view, the microRNA data view and the Image data view.

In a possible implementation, S113 is specifically:

The kernel matrix S is constructed by the following formula:

where S_ij denotes the element in row i, column j of the kernel matrix, and C_i denotes the i-th sample category.

In the present invention, each data view in multi-modal data clustering provides a different type of feature information, and the kernel matrices allow this information to be combined so that the associations between samples are described more comprehensively. Constructing the kernel matrices makes it possible to integrate the sample similarity information of the different data views, improves clustering accuracy and stability, and reduces computational complexity, and thus plays an important role in multi-modal data clustering.
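The kernel formula is not reproduced in this text. One plausible reading, by analogy with the local kernel of Similarity Network Fusion with the nearest-neighbor set replaced by the pre-cluster, is that S_ij equals w_ij normalized over the samples in the same pre-cluster as sample i, and 0 for samples in other pre-clusters. The sketch below implements that assumed form; `cluster_kernel` is a hypothetical name.

```python
import numpy as np

def cluster_kernel(w, labels):
    """Kernel matrix restricted to pre-clusters (assumed form).

    S_ij = w_ij / sum of w_ik over k in the pre-cluster of i, if j is in
    that pre-cluster; S_ij = 0 otherwise.
    """
    w = np.asarray(w, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # True where i and j share a pre-cluster
    masked = np.where(same, w, 0.0)
    return masked / masked.sum(axis=1, keepdims=True)
```

Zeroing the entries between different pre-clusters makes the kernel sparse, which is consistent with the reduction in computational complexity mentioned above.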

S114: According to the kernel matrices, iterate the inter-sample similarity matrices of the mRNA data view, the microRNA data view and the Image data view.

In a possible implementation, S114 is specifically:

The inter-sample similarity matrix is iterated by the following formula:

where P^v denotes the inter-sample similarity matrix in the v-th view, P^k denotes the inter-sample similarity matrix in the k-th view, S^v denotes the kernel matrix in the v-th view, and (·)^T denotes matrix transposition.

In the present invention, iteratively updating the inter-sample similarity matrices fuses the sample similarity information of the different views into a more comprehensive similarity matrix. This makes full use of the information provided by each view and improves the performance of the clustering algorithm. During the iteration, the inter-sample similarity matrix continuously approaches the kernel matrix, which is equivalent to passing the sample similarity information of each view to the other views. This helps compensate for information missing from individual views and improves the accuracy of sample clustering.
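The iteration formula is not reproduced in this text. The symbols defined above (P^v, P^k, S^v and a transpose) match the cross-diffusion update of Similarity Network Fusion, in which each view's matrix is replaced by S^v · (average of P^k over the other views k ≠ v) · (S^v)^T. The sketch below assumes that form; the name `cross_diffusion` and the iteration count are illustrative assumptions.

```python
import numpy as np

def cross_diffusion(p_views, s_views, n_iter=20):
    """SNF-style update (assumed): P^v <- S^v @ mean(P^k, k != v) @ S^v.T.

    p_views, s_views: lists of (m, m) matrices, one per view (at least two views).
    """
    p = [np.asarray(mat, dtype=float) for mat in p_views]
    s = [np.asarray(mat, dtype=float) for mat in s_views]
    n_views = len(p)
    for _ in range(n_iter):
        updated = []
        for v in range(n_views):
            # average similarity of all other views
            others = sum(p[k] for k in range(n_views) if k != v) / (n_views - 1)
            updated.append(s[v] @ others @ s[v].T)
        p = updated              # all views are updated simultaneously
    return p
```

Each update propagates the similarities of the other views through the pre-cluster structure of view v, which is how information is exchanged between the mRNA, microRNA and Image views.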

S115: Combine the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view to obtain the inter-sample similarity fusion matrix.

In a possible implementation, S115 is specifically:

Combine the inter-sample similarity matrices under the mRNA data view, the microRNA data view and the Image data view through the following formula to obtain the inter-sample similarity fusion matrix:

where $P$ denotes the inter-sample similarity fusion matrix and $P^{v}$ denotes the inter-sample similarity matrix under the v-th view.

In the present invention, each view provides its own feature information, and the similarity matrix of each view may reflect a different aspect of the similarity relationships between samples. Combining the similarity matrices of the different views fuses this information into a more comprehensive and richer description of sample similarity, which improves the clustering algorithm's ability to judge the associations between samples and thereby improves clustering performance. It also helps overcome the limitations and biases that a single view may have and improves the stability and robustness of the clustering algorithm.
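
The fusion formula is not reproduced in this text. A minimal sketch, assuming the fused matrix $P$ is the element-wise mean of the per-view matrices $P^{v}$ (the mean and the name `fuse_views` are assumptions for illustration):

```python
def fuse_views(P_views):
    """Element-wise mean of the per-view similarity matrices (assumed form)."""
    n_views = len(P_views)
    n = len(P_views[0])
    return [[sum(P[i][j] for P in P_views) / n_views for j in range(n)]
            for i in range(n)]


# Two hypothetical 2-sample views that disagree completely.
fused = fuse_views([[[1.0, 0.0], [0.0, 1.0]],
                    [[0.0, 1.0], [1.0, 0.0]]])
```

A plain mean gives every view equal weight; a weighted mean would be the natural variation if one view were known to be more reliable.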

The spectral clustering algorithm is an unsupervised clustering algorithm based on graph theory and spectral theory. It represents the sample data as a graph and uses the eigenvectors of the graph to partition the samples, assigning similar samples to the same cluster.
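
The text does not specify which spectral clustering variant is used. As a simplified illustration only, the sketch below splits samples into two clusters by the sign of the second eigenvector of the normalized affinity matrix, found by power iteration with deflation. The two-cluster restriction, the power iteration, the fixed starting vector and the name `spectral_bipartition` are all assumptions; a general implementation would take several eigenvectors and run k-means on them.

```python
import math


def spectral_bipartition(P, iters=500):
    """Two-way spectral split of a symmetric, positive similarity matrix P."""
    n = len(P)
    deg = [sum(row) for row in P]
    # Normalized affinity M = D^(-1/2) P D^(-1/2); eigenvalues lie in [-1, 1].
    M = [[P[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of M (eigenvalue 1) is proportional to sqrt(deg).
    total = math.sqrt(sum(deg))
    top = [math.sqrt(d) / total for d in deg]
    # Fixed, non-constant starting vector (a simplification).
    x = [(-1.0) ** i + 0.1 * i for i in range(n)]
    for _ in range(iters):
        # Deflation: remove the component along the leading eigenvector.
        proj = sum(a * b for a, b in zip(x, top))
        x = [a - proj * b for a, b in zip(x, top)]
        # Power step on (M + I) / 2, whose eigenvalues are non-negative.
        y = [0.5 * (sum(M[i][j] * x[j] for j in range(n)) + x[i])
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rescale by D^(-1/2) and split on the sign of each entry.
    return [0 if x[i] / math.sqrt(deg[i]) >= 0 else 1 for i in range(n)]


# Hypothetical fused matrix with two obvious groups: samples {0, 1} and {2, 3}.
P = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.9],
     [0.1, 0.1, 0.9, 1.0]]
labels = spectral_bipartition(P)
```

The label values themselves are arbitrary; only which samples share a label is meaningful.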

In the present invention, transcriptome data and histopathological images are combined: the edge features of the histopathological images and the mRNA and miRNA features of the transcriptome data are extracted and fused across modalities to obtain the inter-sample similarity fusion matrix, on which automated clustering is then performed. For the transcriptome data, only the mRNA and miRNA features need to be considered, which reduces the data dimensionality and the complexity of the clustering algorithm and lessens the influence of subjective factors in disease assessment. The multi-modal analysis improves the accuracy of the clustering assessment.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present invention. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A multimodal data clustering method, characterized by:

S101: Obtain a sample data set, the sample data set includes multiple sample data, each of the sample data includes image data and transcriptome data;

S102: Filter the image data through a bilateral filter;

S103: Introduce the Sobel operator to calculate the gradient information of pixels in the filtered image data, where the gradient information includes gradient intensity and gradient direction;

S104: When there are multiple pieces of gradient information in the filtered image data, retain the maximum-value pixels and suppress the non-maximum-value pixels;

S105: Perform denoising processing on the sample data after non-maximum value suppression to obtain edge features of the image data;

S106: Extract differential features of the transcriptome data, where the differential features include mRNA features and miRNA features;

S107: Calculate the correlation coefficient matrix of each sample data according to the edge features, mRNA features and miRNA features of the sample data;

S108: Use soft thresholds to perform nonlinear mapping on the correlation coefficient matrix;

S109: Calculate the connectivity between each sample and the remaining samples;

S110: Use the Histogram algorithm to discretize the connectivity, calculate the discretized connectivity and the corresponding probability, and obtain the distance matrix between samples;

S111: Use the K-means++ clustering algorithm to pre-cluster the sample data in the mRNA data view, microRNA data view and Image data view to obtain pre-clustering information;

S112: Convert the inter-sample distance matrix into an inter-sample similarity matrix in the mRNA data view, microRNA data view and Image data view;

S113: Based on the pre-clustering information, construct the kernel matrix under the mRNA data view, microRNA data view and Image data view;

S114: According to the kernel matrix, iterate the similarity matrix between samples under the mRNA data view, microRNA data view and Image data view;

S115: Combine the similarity matrices between samples under the mRNA data view, microRNA data view and Image data view to obtain the similarity fusion matrix between samples;

S116: Use the spectral clustering algorithm to cluster the samples according to the similarity fusion matrix between samples.

2. The multi-modal data clustering method according to claim 1, characterized in that said S102 specifically includes:

S1021: Convert the sample data into a pixel matrix;

S1022: Nonlinearly fuse the current pixel with pixels within a neighborhood with a radius of 1 pixel:

Wherein, g(i,j) represents the pixel value after nonlinear fusion at the current pixel (i,j), S(i,j) represents the set of pixels within the neighborhood with a radius of 1 pixel around the current pixel (i,j), (k,l) represents the coordinates of a pixel around the current pixel (i,j), f(k,l) represents the grayscale value at pixel (k,l), and w(i,j,k,l) represents the weight parameter between the current pixel (i,j) and pixel (k,l);

Among them, the weight parameter w(i,j,k,l) between the current pixel point (i,j) and the pixel point (k,l) is calculated as:

w(i,j,k,l)=d(i,j,k,l)·r(i,j,k,l)

Wherein, d(i,j,k,l) represents the spatial-domain weight between the current pixel (i,j) and pixel (k,l), r(i,j,k,l) represents the pixel-domain weight between the current pixel (i,j) and pixel (k,l), $\sigma_d$ represents the spatial-domain standard deviation, and $\sigma_r$ represents the pixel-domain standard deviation.
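
The formulas for g, d and r are not reproduced in this text. For illustration only, the sketch below assumes the standard bilateral filter: g(i,j) is the w-weighted average of f(k,l) over S(i,j), with Gaussian weights $d = \exp(-((i-k)^2+(j-l)^2)/(2\sigma_d^2))$ and $r = \exp(-(f(i,j)-f(k,l))^2/(2\sigma_r^2))$. The Gaussian forms, the default parameter values and the border handling are assumptions.

```python
import math


def bilateral_filter(f, sigma_d=1.0, sigma_r=25.0):
    """Bilateral filter over a radius-1 neighborhood (assumed Gaussian weights).

    f is a grayscale image given as a list of rows of floats.
    """
    rows, cols = len(f), len(f[0])
    g = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            num = den = 0.0
            # Neighborhood S(i, j), clipped at the image border.
            for k in range(max(0, i - 1), min(rows, i + 2)):
                for l in range(max(0, j - 1), min(cols, j + 2)):
                    d = math.exp(-((i - k) ** 2 + (j - l) ** 2)
                                 / (2 * sigma_d ** 2))
                    r = math.exp(-((f[i][j] - f[k][l]) ** 2)
                                 / (2 * sigma_r ** 2))
                    w = d * r  # w(i,j,k,l) = d(i,j,k,l) * r(i,j,k,l)
                    num += f[k][l] * w
                    den += w
            g[i][j] = num / den
    return g
```

Because r shrinks quickly when two gray values differ, pixels on opposite sides of a sharp edge barely influence each other, which is why this kind of filter smooths noise while keeping edges.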

3. The multi-modal data clustering method according to claim 1, characterized in that said S103 specifically includes:

S1031: Introduce the Sobel operator to calculate the horizontal feature matrix $S_x$ and the vertical feature matrix $S_y$ of the filtered sample data:

S1032: Calculate the horizontal gradient $G_x$ and the vertical gradient $G_y$ according to the horizontal feature matrix $S_x$ and the vertical feature matrix $S_y$:

$G_x = S_x I$

$G_y = S_y I$

Among them, I represents the gray value matrix of the filtered sample data;

S1033: Calculate the gradient intensity G and gradient direction θ of the pixel:
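
The Sobel matrices and the formulas for G and θ are not reproduced in this text. A minimal sketch, assuming the commonly used 3×3 Sobel kernels, $G = \sqrt{G_x^2 + G_y^2}$ and $\theta = \operatorname{atan2}(G_y, G_x)$, and applying each kernel as a sliding window over the image (sign conventions for the kernels vary between sources, and border pixels are simply left at zero here):

```python
import math

# Commonly used Sobel kernels (assumed; not taken from the text).
SX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradients(I):
    """Return (G, theta) for the interior pixels of grayscale image I."""
    rows, cols = len(I), len(I[0])
    G = [[0.0] * cols for _ in range(rows)]
    theta = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Sliding-window sum of kernel entries times the 3x3 patch.
            gx = sum(SX[a][b] * I[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SY[a][b] * I[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            G[i][j] = math.hypot(gx, gy)       # gradient intensity
            theta[i][j] = math.atan2(gy, gx)   # gradient direction (radians)
    return G, theta


# A vertical step edge: the gradient at the center points horizontally.
G, theta = sobel_gradients([[0, 0, 1], [0, 0, 1], [0, 0, 1]])
```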

4. The multi-modal data clustering method according to claim 1, characterized in that said S105 specifically includes:

S1051: Set high threshold and low threshold;

S1052: When the edge pixel gradient value of a pixel is greater than the high threshold, determine the pixel to be a strong edge pixel;

S1053: When the edge pixel gradient value of a pixel is between the low threshold and the high threshold, determine the pixel to be a weak edge pixel;

S1054: When the edge pixel gradient value of a pixel is less than the low threshold, suppress the pixel.

Wherein, the determination method of the high threshold TH and the low threshold TL is:

TH = 0.5 max(H*)

TL = 0.1 max(H*)

Among them, H* represents the pixel matrix after non-maximum suppression, and max(H*) represents the maximum value in the pixel matrix after non-maximum suppression.
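
The double-threshold rule of S1051 to S1054 can be sketched as follows. The thresholds follow the stated TH = 0.5 max(H*) and TL = 0.1 max(H*); treating values exactly equal to a threshold as weak edges is an assumption, since the text does not say how ties are handled, and the function name and string labels are illustrative.

```python
def double_threshold(H):
    """Label each pixel of the non-maximum-suppressed matrix H."""
    peak = max(max(row) for row in H)
    th = 0.5 * peak  # high threshold TH
    tl = 0.1 * peak  # low threshold TL

    def label(g):
        if g > th:
            return "strong"
        if g >= tl:  # between TL and TH (ties treated as weak: assumption)
            return "weak"
        return "suppressed"

    return [[label(g) for g in row] for row in H]


# Hypothetical gradient magnitudes: peak 10 gives TH = 5 and TL = 1.
labels = double_threshold([[10.0, 4.0], [0.5, 6.0]])
```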

5. The multimodal data clustering method according to claim 1, characterized in that, the S108 is specifically:

The correlation coefficient matrix is nonlinearly mapped through the following formula:

$a_{ij} = |a_{ij}|^{\beta}$

$S_{m \times m} = [a_{ij}]_{m \times m}$

Wherein, $S$ represents the correlation coefficient matrix, $a_{ij}$ represents the correlation coefficient between the i-th sample and the j-th sample, $\beta$ represents the soft threshold, and $m$ represents the number of samples.
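
As a small illustration of the mapping above, the sketch below raises the absolute value of every correlation coefficient to the power β. The function name and the example values are hypothetical.

```python
def soft_threshold(corr, beta):
    """Apply a_ij -> |a_ij| ** beta to every entry of the correlation matrix."""
    return [[abs(a) ** beta for a in row] for row in corr]


# With beta = 2, a correlation of -0.5 maps to 0.25 while 1.0 stays at 1.0.
mapped = soft_threshold([[1.0, -0.5], [-0.5, 1.0]], 2)
```

For β greater than 1, weak correlations are pushed toward zero faster than strong ones, so the mapping emphasizes strong associations without a hard cut-off.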

6. The multi-modal data clustering method according to claim 1, characterized in that, the S109 is specifically:

Calculate the connectivity between the current sample and the remaining samples through the following formula:

Wherein, $k_i$ represents the connectivity between the i-th sample and the remaining samples.
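
The connectivity formula is not reproduced in this text. A minimal sketch, assuming the usual weighted-network definition in which $k_i$ is the sum of the soft-thresholded correlations $a_{ij}$ between sample i and every other sample j (the exclusion of the diagonal and the function name are assumptions):

```python
def connectivity(A):
    """k_i = sum of a_ij over all j != i (assumed definition)."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]


# Hypothetical soft-thresholded matrix for three samples.
k = connectivity([[1.0, 0.25, 0.5],
                  [0.25, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
```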

7. The multi-modal data clustering method according to claim 1, characterized in that said S112 specifically includes:

S1121: Convert the distance between samples into similarity between samples through the following formula:

Wherein, $w_{ij}$ represents the similarity between the i-th sample and the j-th sample, $d_{ij}$ represents the distance between the i-th sample and the j-th sample, and $\varepsilon_{ij}$ represents the adaptive parameter between the i-th sample and the j-th sample, computed from the average of the distance;

S1122: Construct the similarity matrix P between samples according to the following formula:

Wherein, $P_{ij}$ represents the element in the i-th row and j-th column of the inter-sample similarity matrix, and $w_{ik}$ represents the similarity between the i-th sample and the k-th sample.

8. The multi-modal data clustering method according to claim 1, characterized in that, the S113 is specifically:

Construct the kernel matrix S through the following formula:

Wherein, $S_{ij}$ represents the element in the i-th row and j-th column of the kernel matrix, and $C_i$ represents the category of the i-th sample.

9. The multi-modal data clustering method according to claim 1, characterized in that, the S114 is specifically:

Iterate the similarity matrix between samples through the following formula:

Wherein, $P^{v}$ represents the inter-sample similarity matrix under the v-th view, $P^{k}$ represents the inter-sample similarity matrix under the k-th view, $S^{v}$ represents the kernel matrix under the v-th view, and $(\cdot)^{T}$ represents the matrix transpose.

10. The multi-modal data clustering method according to claim 1, characterized in that, the S115 is specifically:

Through the following formula, the similarity matrix between samples in the mRNA data view, microRNA data view and Image data view is combined to obtain the similarity fusion matrix between samples:

Wherein, $P$ represents the inter-sample similarity fusion matrix, and $P^{v}$ represents the inter-sample similarity matrix under the v-th view.