CN116580848A

CN116580848A - Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers

Info

Publication number: CN116580848A
Application number: CN202310538812.0A
Authority: CN
Inventors: 彭绍亮; 潘良睿; 王练; 刘达政; 窦钰涛; 刘明婷; 许力文; 王鹤恬
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-08-11

Abstract

The invention discloses a multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers, which comprises the following steps: s1, collecting and preprocessing data of multiple groups of cancer data; s2, completing a classification task of multiple groups of cancer data by adopting a supervised multi-head attention model; and S3, learning by adopting an decoupling comparison learning model based on a multi-head attention mechanism to complete the clustering task of the cancer multi-group data. The invention can obtain better effect on classification task and clustering task, and can analyze pathogenesis of cancer with clinical information.

Description

Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers

Technical Field

The invention relates to the technical field of artificial intelligence and bioinformatics, in particular to a method for analyzing multiple groups of chemical data of cancers based on a multi-head attention mechanism.

Background

With the development of high throughput sequencing technology, the era of precise medicine has arrived. A large amount of biomedical data presents an explosive growth and is collected and consolidated in a common database. Extensive efforts such as cancer genetic profiling (TCGA) have accumulated the genome, transcriptome, proteome and clinical data of more than 20 cancers from thousands of patients [1]. The rich data can help researchers understand the heterogeneity of captured biological processes and phenotypes from different perspectives. However, high throughput sequencing techniques acquire a large amount of data, a small number of samples, a large amount of noise between the data, and a large difference between the platforms. Thus, extracting valuable information from high-throughput data presents a significant challenge.

A single group of study can theoretically and efficiently carry out accurate analysis on the study object. Currently, single-group science has become an important research means in the field of life science, and has been widely used in genomics and proteomics. With the development of research, in order to gain insight into the interrelationship and regulatory mechanisms between molecules in organisms, multiple sets of chemical analyses have integrated genomics in an unbiased manner, epigenetic, transcriptomic, and proteomic systems to analyze the mechanisms and phenotypes of the living system. At present, multiple sets of chemical data have become hot spots for research in many fields, such as cancer research, drug development, agriculture and environmental fields, and software and tools for analysis of multiple sets of chemical data, such as R-package limma, DESeq2, edge, etc., and software such as metaanalysis, proteome Discoverer, etc., have also appeared. Second, some researchers have developed various methods of multi-set data processing, such as: the multi-core learning method comprises Bayesian consensus clustering and dimension reduction based on machine learning.

Some algorithms for deep learning have recently been widely used in the study of multiple sets of chemical data. Some researchers have proposed 16 representative deep learning methods to classify and cluster multiple sets of chemical data, including Fully Connected Neural Networks (FCNN), convolutional Neural Networks (CNN), graphic neural networks (GCN), self-encoders (AE), capsule networks (capsule net), and Generation of Antagonism Networks (GAN), among others. Some researchers have proposed an end-to-end multi-modal deep learning model (scMDC) that characterizes different data sources and co-learns the deep embedded potential features for cluster analysis. Some researchers have proposed a unified multi-discipline data multitasking deep learning framework (omi embedded) that supports dimension reduction, multi-discipline integration, tumor type classification, phenotypic feature reconstruction and survival prediction. Some researchers have proposed an extensible and interpretable multi-set of chemical deep learning frameworks (deeplomix) for use in survival analysis of cancer. It is used to extract the relationship between clinical survival time and multiple sets of study data based on a deep learning framework to predict prognosis. Some researchers propose a neural network method based on multiple-input multiple-output deep challenge learning, accurately model complex data, and identify molecular subtypes of tumor samples using consensus clustering and a gaussian mixture model. Some researchers have proposed using the field component analysis (NCA) algorithm to select relevant features from multiple sets of data retrieved from TCGA and cancer drug susceptibility Genomics (GDSC) databases and develop survival and predictive models. Second, there are also some deep learning and machine learning methods applied to the diagnosis and prognosis of tumor subtypes.

Disclosure of Invention

The invention aims to provide a multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers, which overcomes the defects in the prior art.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

a method for analyzing multiple sets of biological data for cancer based on a multi-headed attentional mechanism, comprising the steps of:

s1, collecting and preprocessing data of multiple groups of cancer data;

s2, completing a classification task of multiple groups of cancer data by adopting a supervised multi-head attention model;

and S3, learning by adopting an decoupling comparison learning model based on a multi-head attention mechanism to complete the clustering task of the cancer multi-group data.

Further, the step S1 specifically includes:

s11, carrying out normalization operation on multiple groups of cancer data, and carrying out unified operation on the dimensions of different data;

s12, combining and integrating data features from different groups, and disturbing the sequence of sample data to add noise to the samples to generate training data.

Further, the step of generating the supervised multi-head attention model in the step S2 is as follows:

s21, designing a multi-head attention encoder;

s22, creating symmetrical multi-head attention encoders based on the multi-head attention encoders;

s23, creating a supervised multi-head attention model based on the symmetrical multi-head attention encoder.

Further, the step S21 includes:

s211, carrying out position coding on the cancer multiunit data, and reserving the relation among all positions in the sequence;

s212, extracting features by adopting a symmetrical multi-head attention mechanism;

s213, carrying out head separation calculation on the multiple groups of the learning data features by adopting a multi-head attention mechanism;

s214, performing multi-group self-attention processing on the original input sequence, and then splicing the results of each group of attention together for linear transformation once to obtain a final output result.

Further, the step S22 specifically includes: feature sharing of multiple groups of the learning data is achieved by sharing a weight matrix, feature extraction is performed in a symmetric multi-head self-attention encoder, the learned weight features share weights in a feature map, in back propagation, since the weight matrix is shared, the symmetric multi-head attention encoder updates the weight gradient with the same value, and two independent multi-head attention encoders are connected in parallel to obtain the symmetric multi-head attention encoder.

Further, the step S23 specifically includes:

s231, extracting the characteristics of multiple groups of chemical data by adopting a symmetrical multi-head attention mechanism encoder to generate a characteristic matrix W ₁ And W is ₂ ；

S232, adopting a feature fusion method of multiplying elements by elements to obtain a feature matrix W ₁ And W is ₂ The features of the (a) are multiplied element by element to obtain a fused feature vector;

s233, sending the fused feature vectors into a three-layer perceptron for normalization operation, projecting the feature vectors into a new feature space, and generating a new feature matrixCalculating an error between the single prediction sample and the label by adopting a cross entropy loss function;

s234, calculating the feature matrixAnd the distance between the tags, the total loss function L is obtained.

Further, the formula for calculating the error between the single prediction sample and the label using the cross entropy loss function in step S233 is:

the calculation formula of the total loss function L in step S234 is as follows:

further, the step S3 specifically includes:

s31, projecting the projection in the step S233 to a new feature spaceMiddle->As a training positive sample, n-1 pairs are added +.>As a training negative sample n-1 pairs, the similarity of paired samples is measured by cosine distance:

in the formula, i, j E [1, N]For the purpose of calculationAnd->Error of each view in a database, creating a cross entropy loss functionThen the loss function between positive and negative samples is:

wherein k is [1,2], and τ is a temperature parameter in the model that controls softness;

s32, removing the dead pairs from denominators by adopting a decoupling comparison learning method to realize decoupling comparison learning, wherein the process is as follows:

s33, enhancing all data by calculatingObtaining cross entropy loss of decoupling comparison learning, and enabling a model to identify all positive samples in a data set, wherein the cross entropy loss comprises the following steps:

s34, feature matrixAnd->The cosine similarity is also used to calculate the error between a pair of samples, as follows:

in the formula, i, j E [1, M]For the purpose of calculationAnd->Error of each view in the list, create a cluster penalty function +.>The loss function between each pair of positive and negative samples can be expressed as:

s35, through learning of all positive and negative sample pairs, the total loss function is expressed as:

in the method, in the process of the invention,is the entropy of the subtype cluster allocation probability, outputting most of the label features after each loss calculation.

Further, the feature spaceClustering samples by using decoupling comparison loss function to realize output of clustering labels, wherein the feature space is +.>And->The characteristics use the clustering loss function to calculate and cluster the sample, in order to realize the output of the clustering characteristic, in the clustering task, the total loss function is:

L＝L _D +L _C 。

compared with the prior art, the invention has the advantages that:

1. the multi-headed attention mechanism model (SMA) supervised in the present invention achieves 100% accurate subtype classification on simulated single cell and cancer multi-set of chemical data sets.

2. The invention learns multiple groups of chemical data characteristics through an decoupling comparison learning model (DMACL) and clusters and identifies the subtype of the cancer, and the unsupervised comparison learning method carries out subtype analysis by calculating the similarity among multiple groups of chemical data samples. The DMACL model shows significant advantages over the 16 deep learning models.

3. The invention can obtain better effect on classification task and clustering task, and can analyze pathogenesis of cancer with clinical information.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a diagram of a multi-headed self-attention encoder frame of the present invention.

FIG. 2 is the performance of seven supervised methods of classifying cancer benchmark datasets used in the tasks of the present invention.

FIG. 3 is a graph of C-index, silhouette score and Davies Bouldin score of 11 unsupervised methods of the present invention on a single cell multicellular multi-set of data. Based on the cluster analysis of the single cell dataset, three internal indices C-index, silhouette score, and Davies Bouldin score (a, b, C) were calculated.

FIG. 4 is a C-index of 11 unsupervised methods on a cancer reference dataset for a clustering task according to the present invention.

FIG. 5 is a Davies Bouldin score on a cancer benchmark dataset used in a clustering task with 11 unsupervised methods of the present invention.

FIG. 6 is a Davies Bouldin score of 11 unsupervised methods on a cancer reference dataset used in a clustering task according to the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.

Referring to fig. 1 and 2, the present embodiment discloses a method for analyzing multiple sets of chemical data of cancer based on a multi-head attention mechanism, comprising the following steps:

and S1, collecting and preprocessing the multiple sets of cancer data.

Specifically, step S1 includes the steps of:

and S11, carrying out normalization operation on the cancer multi-group data, and carrying out unified operation on the dimensions of different data so as to facilitate subsequent feature extraction.

And step S12, combining and integrating data features from different groups, wherein the integration purpose is to improve the coverage rate of the data, increase the information quantity of the data and improve the interpretability of the data. Then, the sequence of the sample data is disturbed, noise is added to the samples, and training data is generated.

And S2, completing the task of classifying the multiple groups of cancer data by using a supervised multi-head attention model (SMA).

Specifically, step S2 includes the steps of:

step S21, designing a multi-head attention encoder.

Wherein, step S21 includes:

in step S211, since the multiple sets of chemical data need to be sent into the slice framework for feature extraction and data dimension reduction, the model is not processed according to the order of the multiple sets of chemical data, so that the multiple sets of chemical data of the cancer need to be position-coded, and the relation between the positions in the sequence is preserved.

In step S212, through linear transformation of position coding, the attention mechanism module can better capture the relationship between different positions in the sequence data, so as to improve the performance of the model. In the feature extraction section, the present embodiment performs feature extraction using a symmetrical multi-head attention mechanism. The present embodiment uses a fully connected layer to implement linear transformation of input, and the process can be expressed as:

y _pe ＝x _pe W+b

wherein x is _pe The code vector representing each position, W being the characteristic weight of the data, b being each characteristic weightAnd (5) a bias vector. Through linear transformation of position coding, the attention mechanism module can better capture the relation between different positions in the sequence data, so that the performance of the model is improved.

In the feature extraction section, the present embodiment performs feature extraction using a symmetrical multi-head attention mechanism. Tensor matrix after multiple groups of chemical data cancat is W _mn . Multi-head attention mechanism will W _mn And (5) carrying out head-dividing calculation on the characteristics. Herein, W is _mn Along the last dimension, a number of small feature vectors are divided, each called a head, the number of heads h of the multi-head attention mechanism being set to 80. For each head, it is necessary to calculate its weight for other attentiveness using a dot product attentiveness mechanism, the vector of its self-attentiveness output is:

wherein Q, K respectively represent the feature matrix output by the same head, and V represents the feature matrix obtained by another head.Is the dimension of the matrix and can be used to reduce the dimension after the feature matrix product. The multi-head attention mechanism is characterized in that the original input sequence is subjected to a plurality of groups of self-attention processing procedures; and then splicing the results of each group of attention together for linear transformation once to obtain a final output result. The process of its calculation can be expressed as:

MultiHead(Q，K，V)＝Concat(head ₁ ，…head _n )W ^o

step S213, the multi-group learning data feature is processed by head separation calculation by adopting a multi-head attention mechanism. In this embodiment, the feature is divided into a plurality of small feature vectors along the last dimension, each small feature vector is called a header, and the header number h of the multi-header attention mechanism is set to 80. For each head, a dot product attention mechanism needs to be used to calculate its attention weight for the others. The multi-headed attention mechanism builds an attention layer according to the h-size. During forward pass, the feature matrix is fed into the input layer of the feedforward module. The neurons of each input layer correspond to the columns of each feature matrix, i.e., one feature. Each neuron weights and biases its input, then calculates the output by activating the function, and passes the output to the next layer of neurons, and finally the output layer outputs the feature matrix.

In step S214, the multi-head attention mechanism performs multiple groups of self-attention processing on the original input sequence, and then splices the results of each group of attention together for linear transformation to obtain the final output result.

Step S22, a symmetrical multi-head attention encoder is created based on the multi-head attention encoder. The method comprises the following steps:

feature sharing of multiple sets of chemical data may generally be achieved by sharing a weight matrix. Since feature extraction is performed in a symmetric multi-headed self-attention encoder, the learned weight features are identical, and thus weights can be shared in the feature map. Second, in back propagation, since the weight matrix is shared, the symmetric multi-headed attention encoder can update the weight gradient with the same value.

Step S23, a supervised multi-head attention model is created based on the symmetrical multi-head attention encoder. The method specifically comprises the following steps:

in step S231, interactions and relationships between different types of data can be found and key components and pathways in the biological system can be identified through multiple sets of mathematical data classification. The comprehensive analysis can provide important clues and insights for researching pathogenesis of complex diseases, searching new therapeutic targets and the like. The three data sets provided by the experiment already contain labels for all samples. Thus, the present experiment suggests using a supervised multi-headed attention mechanism (SMA) modelTypes are used to classify cancer types. The experiment adopts a symmetrical multi-head attention mechanism encoder to extract the characteristics of multiple groups of chemical data and generate a characteristic matrix W ₁ And W is ₂ 。

Step S232, adopting a feature fusion method of multiplying elements by elements to obtain a feature matrix W ₁ And W is ₂ The method can highlight the unique features of the encoder to improve the performance and generalization capability of the model.

Step S233, sending the fused feature vectors into a three-layer perceptron (MLP) for normalization operation, projecting the feature vectors into a new feature space, and generating a new feature matrixThe error between the single prediction sample and the label is calculated by adopting the cross entropy loss function, and the process is as follows:

step S234, after calculating the feature matrixAnd the distance between the tags, the total loss function L is obtained.

And S3, learning by adopting an decoupling comparison learning model (DMACL) based on a multi-head attention mechanism to complete the clustering task of the cancer multi-group data.

Clustering of cancer subtypes aims to separate similar cancer samples into the same subtype and minimize the differences between different subtypes in order to better understand the biological characteristics and molecular mechanisms of cancer, providing better diagnosis, treatment and prognosis for patients. Unsupervised decoupling comparison learning can greatly improve matching similarity, and in the task of cancer typing, in view of the fact that available labels are not provided by experiments, both positive samples and negative samples are composed of pseudo labels generated by data enhancement.

in the formula, i, j E [1, N]For the purpose of calculationAnd->Error of each view in the picture, creating a cross entropy loss function +.>Then the loss function between positive and negative samples is:

where k.epsilon.1, 2, τ is the temperature parameter in the model that controls softness, and in general, the negative-positive coupling (NPC) multiplier in the cross entropy loss (InfoNCE) tends to affect the results of model training, and the following two cases occur. Positive samples near the first, anchor point would be considered important information because they are the only positive samples we have. At the same time, the gradient of the negative sample is gradually reduced. Second, when the negative samples are far apart and less informative, the model may erroneously decrease the learning rate from the positive samples. This means that the model will emphasize the negative samples even more than consider the information of the positive and negative samples in balance. This may lead to errors in the model in processing the positive samples, thereby reducing the accuracy of the model.

Step S32, removing the dead pairs from denominators by adopting a decoupling comparison learning method to realize decoupling comparison learning, wherein the process is as follows:

step S33, after enhancing by calculating all dataObtaining cross entropy loss of decoupling comparison learning, and enabling a model to identify all positive samples in a data set, wherein the cross entropy loss comprises the following steps:

the concept of "labels, i.e. representations", is most common in comparative clusters. The basic idea of this approach is to encode the labels as feature vectors and input them into a cluster model for training along with the feature vectors of the data points. The clustering problem can be converted into a contrast learning problem by embedding the labels into the feature space. I.e., data points within the same cluster should be closer in feature space, while data points between different clusters should be farther apart in feature space. This allows the clusters to which data points belong to be determined by comparing the similarity between them.

Step S34, feature matrixAnd->The cosine similarity is also used to calculate the error between a pair of samples, as follows:

step S35, through learning of all positive and negative sample pairs, the total loss function is expressed as:

The characteristics use the decoupling comparison loss function to perform clustering operation on the samples, so that output of clustering labels is realized. />And->Feature sample entry using cluster loss function computationAnd performing row clustering operation, so as to realize the output of clustering features. The clustering model is an end-to-end training and prediction, so that in the training process of the model, the simultaneous optimization of the decoupling comparison loss function and the clustering loss function is ensured, and finally in the clustering task, the total loss function is as follows:

L＝L _D +L _c 。

the invention is further illustrated by the following examples:

the present example compares the classification performance of SMA with the method of classification of 6 common histology data: (1) lfNN model: each histology vector is stitched into a feature vector as input to the model, and multiple neural networks perform feature extraction and use Softmax as the final layer of output classification. (2) efNN model: each histologic vector serves as an input to the model, and multiple neural networks perform feature extraction and connect the outputs into one vector using Softmax as the final layer of output classification. (3) lfCNN model: it is similar to efNN, with the addition of convolution and pooling layers in the lfCNN model. And after the plurality of histology vectors are connected into one feature vector, the feature vector is sent into a convolution layer and a pooling layer, the output features are flattened into and out of a fully-connected network, and final prediction is carried out. (4) efCNN model: it is similar to lfNN. Each histology vector is input to the convolutional and pooling layers and the output features are flattened, connected and fed into a fully connected neural network for final prediction. (5) moGCN model: it uses the GCN to learn the characteristics of the omics data and perform classification tasks. To perform group-specific classification, a multi-layer GCN needs to be built for each group data type. (6) moGAT model: the GAT in the moGCN model replaces the GCN to obtain the moGAT model. In the test method, the lfNN, efNN, lfCNN, efCNN, moGCN, moGAT, SMA model is trained with a direct concatenation of pre-processed multiple sets of mathematical data as inputs, all models being trained using the same pre-processed data. The classification effect of all models can be referred to the data of table 1 under both equivalent and heterogenic conditions in the simulated dataset.

Table 1 7 performance of the supervision method with all clusters being the same size

Samples of 5clusters of random sizes,10clusters of random sizes,15clusters of random sizes were experimentally selected. These seven supervised approaches are essentially designed for sample classification, which classifies samples of a true cluster (subtype). In the classification task, to quantitatively evaluate the 7 methods of supervised models, we used a simple random cross-validation method to train models and test models. Meanwhile, all models apply three evaluation indexes of Accurcry, F1 macro and F1 weighted to measure the classification performance. From table 2, the efNN, moGCN, moGAT, SMA model was obtained in the classification task of multiple school groups. The efCNN model is significantly lower than the performance of the other 6 models in the sample classification task of 15clusters of random sizes. The lfNN model was significantly lower in the sample classification task of 10clusters of random sizes than the other 6 models. This may be because the simulated data set, after passing through the multi-layer convolution and pooling layers, has a model over-fitting phenomenon, resulting in a false positive of the model in classification. The lfNN model only achieves the best effect in the sample classification task of 5clusters of random sizes. This may be that the lfNN model fails to learn the multiple-learning features during the feature extraction process, and misjudgment occurs.

In the classification task, this example will investigate the performance of lfNN, efNN, lfCNN, efCNN, moGCN, moGAT and SMA models in single cell datasets, similar to the method by which these models are evaluated in the simulated dataset. All models used a simple cross-validation method to classify samples of three cancer cell lines and performance of the classification was measured by three evaluation indicators, accuracy, F1 macro, F1 weighted.

Table 2 performance of six supervised methods on single cell multicellular multiple sets of mathematical data

As described in table 2, both lfNN, efNN, moGCN, moGAT and SMA models were found to peak on all three evaluations of Accuracy, F1 macro, F1 weighted, which showed that these models achieved optimal performance on the classification task. The lfCNN and efCNN models still have no other models with good effect in the test, and one of the main reasons is probably that the number of layers of the convolution layer and the pooling layer is relatively small, and the effect of feature extraction is not obvious enough. Another reason may be that no regularization penalty is added to the model, or some other reason not limited to improper learning rate, improper batch size, etc.

In the classification task, experiments were chosen on five datasets with true cancer subtypes, similar to lfNN, efNN, lfCNN, efCNN, moGCN, moGAT and SMA models in the methods of modeling the dataset and single cell dataset evaluation model. These methods classify true cancer subtype samples. All model training and testing used a simple cross-validation method and the performance of the classification was measured by three evaluation indexes, accuracy, F1 macro, F1 weighted. For each cancer dataset we selected three sets of data samples, 59, 272, 206, 144 and 198 samples of BRCA, GBM, SARC, LUAD and STAD, respectively. Five cancer subtypes, luminalA, luminalB, basal-like, normal-like, HER2-enriched, are included in BRCA. GBM includes 4 cancer subtypes, respectively, proserual, classification, mesenchymal, neuroal. SARC also includes five cancer subtypes, dediferentiated liposarcoma, leiomyosarcoma, undiferentiated pleomorphic sarcoma, myxobrarcoma, malignant peripheral nerve sheath tumor, synovial sarca, respectively. LUAD includes four cancer subtypes, formerly bronchioid, formerly squamoid, forming magnetic. STAD includes Epstein-Barr virus, microsatellite instability, genomically stable, chromosomal instability.

As shown in FIG. 3, the SMA model classifies subtypes of cancer in BRCA, GBM, SARC and STAD to obtain three evaluation indexes of Accuracy, F1 macro and F1 weighted, which all reach 1, so as to achieve the performance of accurate classification. However, when classifying the LUAD subtypes, only the three evaluation indexes of Accuracy, F1 macro, and F1 weighted respectively obtain high performance of 0.958,0.93,0.91. Compared with other models, the SMA model has better classification effect in classification task than other models. This is possible because the SMA model is able to focus on the positional information of the input sequence at the same time, thereby enabling global information to be captured. Secondly, the SMA model has deeper structure and more parameters, and can help learn more characteristics, so that the classification accuracy is improved. Thus, SMA models can be used as a standard method of classifying multiple sets of chemical data for cancer.

The clustering performance of the DMACL model is compared with a common method for clustering 10 kinds of histology data: (1) lfAE, namely, multiple groups of chemical data are connected into a feature vector, and then AE formed by an encoder and a decoder is subjected to feature clustering. Wherein the ReLu function is used for the activation function of all layers of the encoder and the intermediate layers of the decoder and tanh is used for the last layer of the decoder. (2) efAE: it is similar to the lfAE model, and only when processing multiple sets of data, the AE extracts features simultaneously for multiple sets of data, respectively. (3) lfDAE: the lfDAE will process the vector features of each omic data independently. The partially corrupted data is constructed by adding noise to the input data and restored to the original input data by encoding and decoding. (4) efDAE: the efDAE will process the vector features of the spliced sets of mathematical data. The other steps thereafter are the same as lfDAE. (5) lfVAE: similar to the efAE model, multiple groups of the mathematical data are spliced into one-dimensional feature vectors, and then feature clustering analysis is carried out on the VAE (compared with AE, the potential vectors of the VAE closely follow unit Gaussian distribution). (6) efVAE: it is similar to the lfVAE model, but at the input of the model, each of the histology data allows the VAE to perform a feature cluster analysis, respectively. (7) lfSVAE: in contrast to lfVAE, this model replaces VAE with only SVAE (SVAE is a stacked VAE model in which all hidden layers follow a unit gaussian distribution) and the rest is unchanged. (8) efSVAE: each hidden layer of the encoder is fully connected to two output layers, the sampling step being identical to the VAE. In the evaluation, a multiplier similar to β -VAE was added to the loss function. (9) lfmmdVAE: it is similar to lfVAE, which is used to train the omics data and finally classify the characteristics of the multi-set of mathematics integration. (10) efmmdVAE: one VAE is also used for training the omics data. The other parts are identical except that the loss function is different from the efVAE.

In the clustering task, the experiment uses a model to perform feature extraction on the simulated multi-set of chemical data to obtain 5-dimensional, 10-dimensional and 15-dimensional embeddings. The embedded dimension is set according to the number of clusters in the simulated multi-set of chemical data. And then, clustering the dimension reduction results of the multiple groups of the mathematical data by adopting a k-means algorithm. Sample clusters were finally obtained to compare the performance of eleven unsupervised methods.

In the clustering task of the simulation dataset, the embodiment firstly uses the C-index evaluation index to measure the consistency between the clusters fused by the multiple groups of the chemical data and the real clusters. The lower the C-index, the smaller the distance between the clustering samples, and the better the clustering effect of the model. As can be seen from the experimental results summarized in table 3, most of the clustering methods have better clustering performance. However, the DMACL model reaches 0.002,0.022,0.023 for each of the values of C-index under the condition that the clusters have a random size, and then 0.005,0.021,0.014 for each of the values of C-index obtained by the DMACL model under the condition that the clusters have the same size. It exceeds the other models in this evaluation index. This may be because when the multi-head attention mechanism is extracting multiple sets of chemical data, it is more focused on the local information of the data to extract more significant data bits. Secondly, the DMACL model clustering effect is still good with the increase of the clustering quantity.

Table 3 simulates C-index for eleven unsupervised methods on the dataset.

Silhouette score is obtained by calculating the contour coefficient of each sample, measuring how well a sample is assigned to the correct cluster. From table 4, we find that the probability of the efVAE model assigning samples into the correct clusters is high. The DMACL model only obtains the ranks 3, 5, and 7, respectively, if the clusters are the same size. The reasons for poor clustering of DMACL models may be poor sample quality of the simulated data, noise in the data set, outliers, etc. Second, the imbalance in the size distribution of the individual clusters in the dataset may also result in a lower value for Silhouette score. Second, silhouette score itself has certain limitations such as inaccurate assessment of clustering effects for density non-uniformity.

Table 4 11 unsupervised methods Silhouette score on simulated dataset.

From Table 5, it can be found that the efVAE model achieves a lower Davies Bouldin score in the clustering of the simulation data. This may be because the VAE encodes the input data into potential vectors and then generates new data from the potential vectors to learn the distribution of the data. The DMACL model only obtains the 3 rd, 6 th and 6 th ranks, respectively, on condition that the clusters have random sizes. The reason the Davies Bouldin score is not high may be because the data features in the simulated data set are not significant enough, resulting in a multi-headed attentiveness mechanism that cannot extract valid features. Second, it can also be found that the number of clusters is also one of the reasons for influencing the Davies Bouldin score.

Table 5 Davies Bouldin score on simulated dataset for 11 unsupervised methods

And (3) evaluating the DMACL model in single-cell data, and for a clustering task of a single-cell data set, performing feature fusion on multiple groups of chemical data by all models to obtain fused two-dimensional embedding. And then using a k-means algorithm to perform dimension reduction and clustering on the multi-learning data. The performance of eleven unsupervised methods was finally compared by obtaining the results of class 1 clusters. Experiments used C-index, silhouette score, and Davies Bouldin score to evaluate the effect of model clustering. As shown in fig. 3, the DMACL model obtained the lowest C-index value and Davies Bouldin score, higher silhouette score when clustering samples. Therefore, the DMACL model becomes the best model for single cell dataset clustering. This is probably because single cell data has information of long sequences, and DMACL models have a multi-headed attention mechanism to handle long sequences, thereby reducing the occurrence of gradient extinction and gradient explosion during model training. In a word, the DMACL model can better capture the characteristics of single cell data, so that the accuracy of clustering is improved.

The cancer multi-group data has the characteristics of high dimensionality, diversity, noise and the like. For clustering tasks, eleven unsupervised models are first used to fuse multiple sets of cancer data to obtain 10-dimensional embedments. And then clustering the multiple groups of the chemical data by adopting a k-means algorithm. Experiments with a number of clusters from 1 to 7 were explored, as the best number of clusters was not determined. And finally, carrying out cluster analysis on the samples by using an unsupervised model. In evaluating the self-supervised clustering model, the performance of the model was measured using the C-index, silhouette score, and Davies Bouldin score evaluation index. As shown in fig. 4, in the clustering experiments of all models, the C-index of the DMACL model was mainly concentrated in the middle part of the radar chart. According to the coordinates of the radar map, the closer the data is to the center point, the smaller the value. Thus, the C-index value of the DMACL model reflects an almost exact clustering of clustered samples. This is probably because the DMACL model has a strong generalization ability, which can help the model capture more features, thereby improving the effects of feature extraction and data dimension reduction. The efmmdVAE, the efVAE and the lfmmdVAE models have good clustering effect and can be used as a reference model of a plurality of groups of cancer data sets.

From fig. 5, it was found that Silhouette scores of the DMACL model takes larger values on most of the cancer multi-set of data, which are distributed mainly on the outer circle of the radar chart. The DMACL model only achieves lower values on the 2 cluster tasks of SKCM and lucc. This may be the case when the structure of the cancer multi-cluster data is complex and the number of data points is large, the 2 clusters may be used to generate an under-fitting condition, that is, the essential features of the data set cannot be captured well, so that some confusing data points may exist between the two clusters after segmentation.

Davies Bouldin scores is also an important evaluation index for analyzing the clustering effect of the DMACL model. Therefore, we also use Davies Bouldin scores to measure the performance of the model. As shown in fig. 6, davies Bouldin scores takes the smallest value among the 2 clusters and the 3 clusters. In clusters 4, 5 and 6, the DMACL model had a poorer clustering effect on LUCS and LIHC. It may be that when the structure of the multi-set data set itself is complex and the data points are small, the use of 4 clusters, 5clusters and 6 clusters may be over-fitted, i.e. the division of the data set into three clusters may lead to some unnecessary subdivision which does not reflect the essential features of the data set well, resulting in poor clustering.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, the patentees may make various modifications or alterations within the scope of the appended claims, and are intended to be within the scope of the invention as described in the claims.

Claims

1. A method for analyzing multiple sets of biological data for cancer based on a multi-headed attentional mechanism, comprising the steps of:

s1, collecting and preprocessing data of multiple groups of cancer data;

2. The method for analyzing multiple sets of chemical data for cancer based on the multi-head attentiveness mechanism as claimed in claim 1, wherein said step S1 specifically comprises:

3. The method for analyzing multiple sets of chemical data for cancer based on the multi-head attention mechanism according to claim 1, wherein the step of generating the supervised multi-head attention model in step S2 is:

s21, designing a multi-head attention encoder;

4. The method for analyzing multiple sets of chemical data for cancer based on the multi-head attentiveness mechanism as claimed in claim 3, wherein said step S21 includes:

5. The method for analyzing multiple sets of chemical data for cancer based on the multi-head attentiveness mechanism as claimed in claim 3, wherein said step S22 is specifically: feature sharing of multiple groups of the learning data is achieved by sharing a weight matrix, feature extraction is performed in a symmetric multi-head self-attention encoder, the learned weight features share weights in a feature map, in back propagation, since the weight matrix is shared, the symmetric multi-head attention encoder updates the weight gradient with the same value, and two independent multi-head attention encoders are connected in parallel to obtain the symmetric multi-head attention encoder.

6. The method for analyzing multiple sets of chemical data for cancer based on the multi-head attentiveness mechanism as claimed in claim 3, wherein said step S23 is specifically:

7. The method for analyzing multiple sets of cancer data based on the multi-head attention mechanism according to claim 6, wherein the formula for calculating the error between the single prediction sample and the label using the cross entropy loss function in step S233 is:

8. the method for analyzing multiple sets of chemical data for cancer based on the multi-head attentiveness mechanism as claimed in claim 6, wherein said step S3 specifically comprises:

s34, feature matrix The cosine similarity is also used to calculate the error between a pair of samples, as follows:

9. The multi-head attention mechanism based method of analyzing multiple sets of cancer data of claim 8,

the feature spaceClustering samples by using decoupling comparison loss function to realize output of clustering labels, wherein the feature space is +.>And->The characteristics use the clustering loss function to calculate and cluster the sample, in order to realize the output of the clustering characteristic, in the clustering task, the total loss function is:

L＝L _D +L _C 。