CN117153260A - Spatial transcriptome data clustering method, device and medium based on contrast learning - Google Patents

Spatial transcriptome data clustering method, device and medium based on contrast learning Download PDF

Info

Publication number
CN117153260A
CN117153260A (application CN202311204657.5A)
Authority
CN
China
Prior art keywords
node
representation
node representation
loss
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311204657.5A
Other languages
Chinese (zh)
Other versions
CN117153260B (en)
Inventor
李君一
韩睿
王旭
王轩
刘博
王亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202311204657.5A
Publication of CN117153260A
Application granted
Publication of CN117153260B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135: Feature extraction based on approximation criteria, e.g. principal component analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a spatial transcriptome data clustering method, device, equipment and storage medium based on contrast learning, wherein the method comprises the following steps: obtaining a weighted feature matrix and an adjacency matrix based on the spatial transcriptome data and constructing an adjacency graph; inputting the adjacency graph into the two encoders of a twin network structure respectively to learn a first node representation and a second node representation; constructing a positive sample set for calculating the contrast loss based on the first node representation and the second node representation; calculating the clustering loss based on the soft cluster distribution and auxiliary distribution of the nodes; and guiding model training through the contrast loss and the clustering loss so as to obtain a clustering result. The node representations used to construct the positive sample set are obtained through contrast learning with the twin network structure, the contrast loss and the clustering loss are then calculated, and model training is guided by the contrast loss and the clustering loss between nodes, so that a data clustering method for spatial transcriptome data is obtained based on contrast learning, and the pertinence and accuracy of spatial transcriptome data clustering are improved.

Description

Spatial transcriptome data clustering method, device and medium based on contrast learning
Technical Field
The invention relates to the technical field of data processing, and in particular to a spatial transcriptome data clustering method, device and equipment based on contrast learning, and a storage medium.
Background
The function of complex tissue is fundamentally related to the spatial environment of its different cell types, and the relative location of transcriptional expression within a tissue is critical for understanding its biological function and for describing the network of biological interactions. Spatial transcriptomics not only provides the transcriptome of the object under study, but also locates each transcriptomic measurement at its spatial position within the tissue, thereby providing valuable insight for research and diagnosis.
Many current studies incorporate spatial information into cluster analysis and develop different clustering algorithms for spatial transcriptomes. These methods generally construct positive and negative sample pairs through data augmentation. Common augmentation schemes for graph structures and gene expression matrices, such as randomly adding or deleting edges and nodes, random masking, and randomly shuffling the gene expression matrix, inevitably destroy the biological meaning contained in the original structure and therefore cannot construct good positive samples. Meanwhile, constructing large numbers of negative samples greatly increases memory requirements and training time. Spatial transcriptomics has great development prospects, and the major platforms are continuously updating their technology. Differences between platform technologies lead to differences in the data: the minimum capture unit may be a cell or a spot, the number of captured genes differs, and histological images may be unavailable, so the ability of existing methods to accommodate data from each major platform still needs to be improved. Meanwhile, most contrast-learning-based methods first learn latent representations of the nodes and only then cluster with the learned representations, so the representation-learning process is not optimized for the clustering task.
Disclosure of Invention
The invention provides a spatial transcriptome data clustering method, device, equipment and storage medium based on contrast learning, aiming to improve the pertinence of spatial transcriptome data clustering.
In order to achieve the above object, the present invention provides a spatial transcriptome data clustering method based on contrast learning, the method comprising:
preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix;
respectively inputting the adjacency graph into a first encoder and a second encoder of a twin network structure, and learning corresponding first node representation and second node representation through the first encoder and the second encoder;
constructing a positive sample set based on the first node representation and the second node representation, and calculating a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set;
determining soft cluster distribution and auxiliary distribution of each node, and calculating cluster loss based on the soft cluster distribution and the auxiliary distribution;
and guiding model training through the contrast loss and the clustering loss, and obtaining a clustering result of the corresponding node based on the soft cluster distribution after model training is completed.
Optionally, preprocessing the gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix includes:
constructing a gene expression matrix based on gene expression level information of the spatial transcriptome, wherein the rows of the gene expression matrix are cells/capture points and the columns are gene expression levels;
preprocessing the gene expression level data in the gene expression matrix, and performing dimension reduction on the gene expression matrix to obtain a feature matrix;
determining an adjacency graph of the feature matrix based on spatial information between the cells/capture points.
Optionally, the determining the adjacency graph of the feature matrix based on the spatial information between the cells/capture points comprises:
taking the cells/capture points as nodes, and calculating the distance between nodes according to their spatial coordinates;
adding edges between each node and its several nearest neighbors to obtain an adjacency matrix;
selecting target edges based on a distance threshold, and weighting each target edge based on the distance between its nodes to obtain a weighted adjacency matrix;
an adjacency graph is obtained based on nodes in the feature matrix and edges of the weighted adjacency matrix.
Optionally, the constructing a positive sample set based on the first node representation and the second node representation, and calculating a contrast loss between a node representation predictor of the second node representation and the first node representation according to the positive sample set includes:
constructing a positive sample set of nodes based on the first node representation and the second node representation of each node;
the contrast loss is calculated based on the total number of nodes, the first node representation within the positive sample set, and the node representation predicted value of the second node representation.
Optionally, the constructing a positive sample set based on the first node representation, the second node representation, and the spatial location comprises:
determining cosine similarity between the target node and the other nodes based on the second node representation of the target node and the first node representation of the other nodes;
after cosine similarity between each target node and other nodes is obtained, determining a K-neighbor node set of each target node;
determining the intersection of the target node in the K-neighbor node set and the neighbor nodes in the adjacency graph as a local semantic positive sample;
determining similar nodes belonging to the same K-means cluster as the target node, and determining the intersection of the K-neighbor node set and the similar nodes as a global semantic positive sample;
And determining the union set of the local semantic positive samples and the global semantic positive samples as a positive sample set of target nodes.
Optionally, before the constructing a positive sample set based on the first node representation and the second node representation, and calculating a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set, the method further includes:
and inputting the second node representation into a predictor, and obtaining a node representation predicted value of the second node representation.
Optionally, the determining a soft cluster distribution and an auxiliary distribution of each node, and calculating a cluster loss based on the soft cluster distribution and the auxiliary distribution includes:
performing initial K-means clustering on the second node representation to obtain a plurality of clusters, and determining centroid characteristic representation of each cluster;
determining the soft cluster distribution of the corresponding node based on the centroid feature representation of the cluster where the node is located and the node representation predicted value of the second node representation;
determining corresponding auxiliary distribution based on the soft cluster distribution of the nodes;
the cluster loss is calculated based on the soft cluster distribution and the auxiliary distribution of all the nodes.
In order to achieve the above object, the present invention further provides a spatial transcriptome data clustering device based on contrast learning, including:
The construction module is used for preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix;
the learning module is used for inputting the adjacency graph into a first encoder and a second encoder of a twin network structure respectively, and learning corresponding first node representation and second node representation through the first encoder and the second encoder;
the contrast loss calculation module is used for constructing a positive sample set based on the first node representation and the second node representation, and calculating the contrast loss between the node representation predicted value of the second node representation and the first node representation according to the positive sample set;
the cluster loss calculation module is used for determining soft cluster distribution and auxiliary distribution of each node and calculating cluster loss based on the soft cluster distribution and the auxiliary distribution;
and the clustering module is used for guiding model training through the contrast loss and the clustering loss, and obtaining a clustering result of the corresponding node based on the soft clustering distribution after model training is completed.
To achieve the above object, the present invention also provides a spatial transcriptome data clustering device based on contrast learning, comprising a memory, a processor and a spatial transcriptome data clustering program based on contrast learning stored on the memory, which when executed by the processor, implements the steps of the method as described above.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a contrast-learning-based spatial transcriptome data clustering program which, when executed by a processor, implements the steps of the method as described above.
Compared with the prior art, the spatial transcriptome data clustering method, device, equipment and storage medium based on contrast learning provided by the invention comprise the following steps: preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix; inputting the adjacency graph into a first encoder and a second encoder of a twin network structure respectively, and learning the corresponding first node representation and second node representation through the first encoder and the second encoder; constructing a positive sample set based on the first node representation and the second node representation, and calculating the contrast loss between the node representation predicted value of the second node representation and the first node representation according to the positive sample set; determining the soft cluster distribution and auxiliary distribution of each node, and calculating the clustering loss based on the soft cluster distribution and the auxiliary distribution; and guiding model training through the contrast loss and the clustering loss, and obtaining the clustering result of the corresponding node based on the soft cluster distribution after model training is completed. In this way, the node representations used to construct the positive sample set are obtained through contrast learning with the twin network structure, the contrast loss and the clustering loss are then calculated, and training is guided by the contrast loss and the clustering loss between nodes, so that a data clustering method for spatial transcriptome data is obtained based on contrast learning, and the pertinence and accuracy of spatial transcriptome data clustering are improved.
Drawings
FIG. 1 is a schematic hardware architecture of a spatial transcriptome data clustering apparatus based on contrast learning according to various embodiments of the present invention;
FIG. 2 is a flow chart of a first embodiment of a spatial transcriptome data clustering method based on contrast learning according to the present invention;
FIG. 3 is a schematic view of a scenario involved in a first embodiment of a contrast learning-based spatial transcriptome data clustering method of the present invention;
FIG. 4 is a schematic diagram of a refinement flow of a first embodiment of a spatial transcriptome data clustering method based on contrast learning according to the present invention;
FIG. 5 is a flow chart of a second embodiment of a spatial transcriptome data clustering method based on contrast learning according to the present invention;
FIG. 6 is a schematic diagram of functional modules of a first embodiment of a spatial transcriptome data clustering apparatus based on contrast learning according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention mainly relates to a spatial transcriptome data clustering device based on contrast learning, which refers to a device capable of establishing a network connection; the spatial transcriptome data clustering device based on contrast learning may be a server, a cloud platform, or the like.
Referring to fig. 1, fig. 1 is a schematic hardware structure of a spatial transcriptome data clustering apparatus based on contrast learning according to various embodiments of the present invention. In an embodiment of the present invention, the spatial transcriptome data clustering device based on contrast learning may include a processor 1001 (e.g., a Central Processing Unit, CPU), a communication bus 1002, an input port 1003, an output port 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components; the input port 1003 is used for data input; the output port 1004 is used for data output; and the memory 1005 may be a high-speed RAM memory or a stable (non-volatile) memory, such as a disk memory, and the memory 1005 may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 is not limiting of the invention and may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components.
With continued reference to FIG. 1, the memory 1005 of FIG. 1, which is a readable storage medium, may include an operating system, a network communication module, an application module, and a contrast learning-based spatial transcriptome data clustering routine. In fig. 1, the network communication module is mainly used for connecting with a server and performing data communication with the server; and the processor 1001 is configured to call a spatial transcriptome data clustering program based on contrast learning stored in the memory 1005, and perform the following operations:
Preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix;
respectively inputting the adjacency graph into a first encoder and a second encoder of a twin network structure, and learning corresponding first node representation and second node representation through the first encoder and the second encoder;
constructing a positive sample set based on the first node representation and the second node representation, and calculating a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set;
determining soft cluster distribution and auxiliary distribution of each node, and calculating cluster loss based on the soft cluster distribution and the auxiliary distribution;
and guiding model training through the contrast loss and the clustering loss, and obtaining a clustering result of the corresponding node based on the soft cluster distribution after model training is completed.
The spatial transcriptome data clustering device based on the contrast learning provides a first embodiment of the spatial transcriptome data clustering method based on the contrast learning. Referring to fig. 2, fig. 2 is a flowchart of a first embodiment of a spatial transcriptome data clustering method based on contrast learning according to the present invention.
The first embodiment of the present invention proposes a spatial transcriptome data clustering method based on contrast learning, as shown in fig. 2, fig. 2 is a schematic flow chart of the first embodiment of the spatial transcriptome data clustering method based on contrast learning, and the method includes:
s101, preprocessing a gene expression matrix constructed based on space transcriptome data to obtain a weighted feature matrix and an adjacent matrix, and constructing an adjacent graph based on the weighted feature matrix and the adjacent matrix;
specifically, referring to fig. 3, fig. 3 is a scene graph related to a first embodiment of a spatial transcriptome data clustering method based on contrast learning according to the present invention. As shown in FIG. 3a, it is first necessary to construct a Gene Expression matrix (Gene Expression) based on the information of the Gene Expression level of the space transcriptome (spatial transcriptiomics), and the behavior cells/capture points of the Gene Expression matrix are listed as the Gene Expression level. And preprocessing the data, and then performing PCA dimension reduction to obtain a feature matrix X (Feature Matrix X). And calculating the distance between the cells/capture points according to the space coordinates by taking the cells/capture points as nodes, and adding edges to k nearest neighbors of each cell/capture point to obtain an adjacency matrix. A weighted adjacency matrix a of adjacency matrices is further obtained. And the feature matrix X is used as the attribute of the node, and finally, an adjacency Graph (X, a) is obtained, and in this embodiment, the adjacency Graph is represented as Graph (X, a) = (V, E), where V represents the node and the node set V contains all cells/capture points, E represents the edge set, and the edge set E contains edges between all nodes.
Step S102, inputting the adjacency graph into a first encoder f_ξ and a second encoder f_θ of a twin network structure respectively, and learning the corresponding first node representation H_ξ and second node representation H_θ through the first encoder f_ξ and the second encoder f_θ;
The spatial transcriptome data clustering model based on contrast learning designed in this embodiment rests on a contrast learning method: the latent representations of the nodes are learned through a twin network structure, a positive sample set of the nodes is constructed to design the contrast loss, and a clustering loss is then introduced to construct a total loss function that guides training, so that both the feature learning and the spatial clustering result are optimized.
The framework of the spatial transcriptome data clustering model based on contrast learning is shown in fig. 3; the scene graph shown in fig. 3 is also the framework of this model. As shown in fig. 3a, after preprocessing the spatial transcriptome data, an adjacency graph Graph(X, A) is obtained, and the adjacency graph Graph(X, A) is input into a first encoder (Teacher Encoder) and a second encoder (Student Encoder) of a twin network structure, respectively. In this embodiment the first encoder is denoted f_ξ and the second encoder is denoted f_θ. That is, the edges in the weighted adjacency matrix A and the nodes in the feature matrix X are input into the twin network structure; the first encoder f_ξ learns the first node representation H_ξ, and the second encoder f_θ learns the second node representation H_θ.
A twin network consists of two networks with identical structure and different parameters, in this embodiment the first encoder f_ξ and the second encoder f_θ. The first encoder f_ξ and the second encoder f_θ are two two-layer GCNs (Graph Convolutional Networks) with the same structure and separately, randomly initialized parameters. In this embodiment, the parameters of the first encoder f_ξ are obtained by a momentum update of the parameters of the second encoder f_θ. Denoting the parameters of the second encoder f_θ as θ and the parameters of the first encoder f_ξ as ξ:
ξ ← τξ + (1-τ)θ
where τ is the momentum update coefficient. ξ is updated more slowly and smoothly, which effectively prevents the possible collapse phenomenon during learning in which the model fails to learn a meaningful representation.
The edges of the weighted adjacency matrix A obtained after preprocessing and the nodes of the feature matrix X are input into the two encoders; the first encoder f_ξ learns the first node representation H_ξ, and the second encoder f_θ learns the second node representation H_θ:
H_ξ = f_ξ(X, A)
H_θ = f_θ(X, A)
Thus, the first node representation H_ξ and the second node representation H_θ of each node are obtained through the twin network structure.
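A minimal PyTorch sketch of the twin encoders and the momentum update ξ ← τξ + (1-τ)θ is given below. The two-layer GCN is written in plain tensor form rather than with a graph library, and the layer widths and the value of τ are illustrative assumptions, not values fixed by the embodiment.

```python
import copy
import torch
import torch.nn as nn

def normalize_adj(adj):
    """Symmetric normalization A_hat = D^(-1/2) (A + I) D^(-1/2)."""
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class GCNEncoder(nn.Module):
    """Two-layer GCN: H = A_hat * ReLU(A_hat * X * W1) * W2."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x, adj_norm):
        h = torch.relu(adj_norm @ self.w1(x))
        return adj_norm @ self.w2(h)

# Student encoder f_theta and teacher encoder f_xi share the same structure;
# the teacher starts as a copy and receives no gradients.
f_theta = GCNEncoder(in_dim=50, hid_dim=128, out_dim=64)   # widths are illustrative
f_xi = copy.deepcopy(f_theta)
for p in f_xi.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def momentum_update(f_xi, f_theta, tau=0.99):
    """xi <- tau * xi + (1 - tau) * theta, applied parameter-wise."""
    for p_xi, p_theta in zip(f_xi.parameters(), f_theta.parameters()):
        p_xi.data.mul_(tau).add_((1.0 - tau) * p_theta.data)
```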
Step S103, constructing a positive sample set based on the first node representation and the second node representation, and calculating a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set;
referring to fig. 4, fig. 4 is a schematic diagram of a refinement flow of a first embodiment of a spatial transcriptome data clustering method based on contrast learning according to the present invention, as shown in fig. 4, step S103 includes:
step S1031: constructing a positive sample set of nodes based on the first node representation and the second node representation of each node;
A row of the first node representation H_ξ and of the second node representation H_θ is taken as the target row, and the corresponding node is denoted v_i; that is, the first target node representation h_ξ,i (the i-th row of H_ξ) and the second target node representation h_θ,i (the i-th row of H_θ) are the different features of node v_i ∈ V learned by the different encoders. For any target node v_i, its positive sample set P_i is determined from h_ξ,i and h_θ,i. In actual operation, each row needs to be taken as the target row in turn to determine its corresponding positive sample set.
Specifically, the cosine similarity sim(v_i, v_j) between the target node and each other node is first determined based on the second node representation of the target node and the first node representations of the other nodes, where the target node may be any one of the nodes.
The selection strategy for positive samples is important and determines whether contrast learning can successfully learn a meaningful node representation. For a given target node v_i, the distance between the second node representation h_θ,i learned for v_i by the second encoder f_θ and the first node representation h_ξ,j learned for another node v_j by the first encoder f_ξ is calculated, i.e. the cosine similarity between h_θ,i and h_ξ,j. The cosine similarity between the target node and the other node is expressed as sim(v_i, v_j):
sim(v_i, v_j) = h_θ,i^T h_ξ,j / (||h_θ,i|| · ||h_ξ,j||)
where || · || denotes the norm (modulus) of a vector. In this way the cosine similarity between any two nodes can be calculated.
After the cosine similarity between each target node and the other nodes is obtained, the K-neighbor node set N_i of each target node is determined. This embodiment determines the K-neighbor node set by the well-known steps of the KNN (K-Nearest Neighbor) algorithm, which are not described in detail here. The nodes in the K-neighbor node set N_i are adjacent to v_i in the representation space, so N_i can be taken as a reasonable initial selection of the positive sample set of v_i.
Considering only the nearest neighbors in the representation space may ignore not only the original structural information in the adjacency graph but also the global semantic information of the graph. Therefore, this embodiment also designs a local semantic positive sample (Local Positive) and a global semantic positive sample (Global Positive) to capture positive samples of the target node v_i in the local and global semantic contexts, respectively.
In this embodiment, the intersection of the target node's K-neighbor node set N_i with its neighbor nodes in the adjacency graph is determined as the local semantic positive sample, denoted L_i:
L_i = N_i ∩ A_i
where A_i denotes the neighbor nodes of v_i in the adjacency graph G. On the basis of the K-neighbor node set N_i, the local semantic positive sample L_i takes the local semantic information between nodes into account.
The similar nodes that belong to the same K-means cluster as the target node are determined, and the intersection of the K-neighbor node set N_i with these similar nodes is determined as the global semantic positive sample, denoted G_i:
G_i = N_i ∩ C_i
where C_i refers to the nodes that lie in the same cluster as the target node v_i in the K-means clustering result. On the basis of the K-neighbor node set N_i, this takes the global semantic information between nodes into account.
Finally, the union of the local semantic positive sample and the global semantic positive sample is determined as the positive sample set P_i of the target node. With continued reference to FIG. 3, as shown in FIG. 3c, the union of the local semantic positive samples and the global semantic positive samples is the positive sample set P_i, i.e. P_i = L_i ∪ G_i.
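The construction of P_i = L_i ∪ G_i can be sketched as follows with numpy and scikit-learn. The choices of k for the K-neighbor set and of the number of K-means clusters are illustrative assumptions, and build_positive_sets is a hypothetical helper name rather than part of the patented method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def build_positive_sets(h_theta, h_xi, adj, k_sim=20, n_clusters=7):
    """Build P_i = L_i ∪ G_i for every node, following the definitions above.

    h_theta: (N, d) second node representations H_theta (rows are nodes).
    h_xi:    (N, d) first node representations H_xi.
    adj:     (N, N) adjacency matrix of the spatial graph.
    """
    n = h_theta.shape[0]
    # sim(v_i, v_j): cosine similarity between h_theta[i] and h_xi[j]
    sim = cosine_similarity(h_theta, h_xi)
    np.fill_diagonal(sim, -np.inf)                       # exclude the node itself

    # N_i: the k_sim most similar nodes in the representation space
    knn_sets = [set(np.argsort(-sim[i])[:k_sim]) for i in range(n)]

    # C_i comes from a K-means clustering of the representations
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(h_theta)

    positives = []
    for i in range(n):
        spatial_neighbors = set(np.nonzero(adj[i])[0])   # A_i from the adjacency graph
        same_cluster = set(np.nonzero(labels == labels[i])[0])
        local_pos = knn_sets[i] & spatial_neighbors      # L_i = N_i ∩ A_i
        global_pos = knn_sets[i] & same_cluster          # G_i = N_i ∩ C_i
        positives.append(local_pos | global_pos)         # P_i = L_i ∪ G_i
    return positives
```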
Step S1032: calculating the contrast loss L_con based on the total number of nodes, the first node representations in the positive sample set, and the node representation predicted value of the second node representation.
With continued reference to fig. 3, as shown in fig. 3b, this embodiment first inputs the second node representation H_θ into a predictor (Predictor q_θ) to obtain the node representation predicted value of the second node representation, which is denoted Z_θ in this embodiment.
The feature transformation of the predictor further enlarges the difference in features between the predicted value Z_θ of the second node representation H_θ and the first node representation H_ξ. On this basis, a contrast loss function is designed with the aim of reducing the distance between each node and the nodes in its positive sample set, i.e. bringing the second node predicted value z_θ,i as close as possible to the first node representations h_ξ,j of the other nodes in the positive sample set P_i. The contrast loss, denoted L_con, is formed from the negative cosine similarity terms z_θ,i^T h_ξ,j / (||z_θ,i|| · ||h_ξ,j||), accumulated over the nodes v_j in each positive sample set P_i and averaged over all N nodes.
Here N denotes the total number of nodes, i denotes the row in which a node is located, z_θ,i is the predicted value of the i-th row node representation of the second node representation H_θ, v_j denotes a node other than the target node v_i, h_ξ,j is the first node representation of that other node, T denotes the transpose of a vector, and || · || denotes the norm of a vector.
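Since the exact normalization of the contrast loss is not reproduced here, the following PyTorch sketch implements the quantity described above under one plausible convention: the negative cosine similarity between z_θ,i and the first node representations h_ξ,j of the positives, averaged first over each positive sample set and then over the nodes. The reduction choices are assumptions.

```python
import torch
import torch.nn.functional as F

def contrast_loss(z_theta, h_xi, positives):
    """Negative cosine similarity between z_theta[i] and h_xi[j] for j in P_i,
    averaged over each positive set and then over the nodes (reduction assumed)."""
    z = F.normalize(z_theta, dim=1)
    h = F.normalize(h_xi.detach(), dim=1)   # teacher output is treated as a fixed target
    losses = []
    for i, p_i in enumerate(positives):
        if not p_i:                         # nodes with an empty positive set are skipped
            continue
        idx = torch.tensor(sorted(p_i), dtype=torch.long)
        losses.append(-(z[i] @ h[idx].T).mean())
    if not losses:
        return z.new_zeros(())
    return torch.stack(losses).mean()
```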
Step S104, determining the soft cluster distribution and auxiliary distribution of each node, and calculating the cluster loss L_cluster based on the soft cluster distribution and the auxiliary distribution;
Because the graph clustering task is unsupervised, it cannot be known during training whether the learned node representations are well optimized. In order to make the generated node representations better serve the clustering task, the spatial transcriptome data clustering model based on contrast learning of this embodiment adds clustering into the training process and optimizes the training of the encoders through the clustering loss.
First, initial K-means clustering is performed on the second node representation to obtain a plurality of clusters, and the centroid feature representation of each cluster is determined; the centroid features are used as soft labels to supervise the learning process of the node representations.
For the learned second node representation H_θ, initial clustering is first performed with k-means; the algorithm produces a number of clusters and gives the centroid feature representation of each cluster, e.g. the centroid feature representation of cluster u is denoted μ_u. One way to solve an unsupervised learning task is to generate a "soft" label and then use this "soft" label to supervise the parameter-learning process.
Then, the soft cluster distribution of the corresponding node is determined based on the centroid feature representation of the cluster where the node is located and the node representation predicted value of the second node representation.
The soft cluster distribution, based on the t-distribution, measures the similarity between the node representation predicted value z_i and the cluster centroid μ_u. From the previous outputs, the feature representation of each node and the centroid feature representation of each cluster are now available, so the probability q_iu that node v_i belongs to cluster u can be obtained and viewed as the soft cluster distribution of each node.
The soft cluster distribution of node i is denoted q_iu:
q_iu = (1 + ||z_i - μ_u||^2)^(-1) / Σ_k (1 + ||z_i - μ_k||^2)^(-1)
where z_i denotes the node representation predicted value of node i, μ_u is the centroid feature representation of cluster u, μ_k is the centroid feature representation of cluster k, and the sum over k runs over all K clusters, K being the number of clusters.
The corresponding auxiliary distribution is then determined based on the soft cluster distribution of the nodes; the auxiliary distribution corresponding to the soft cluster distribution q_iu is denoted p_iu and is obtained by squaring q_iu, normalizing by the cluster frequency, and renormalizing over the clusters:
p_iu = (q_iu^2 / Σ_i q_iu) / Σ_k (q_ik^2 / Σ_i q_ik)
where the sum over k runs over all K clusters, K being the number of clusters.
The cluster loss is calculated based on the soft cluster distribution and the auxiliary distribution of all the nodes. Having defined the soft cluster distribution q_iu and the auxiliary distribution p_iu, this embodiment pulls the soft cluster distribution q_iu towards the auxiliary distribution p_iu through the KL divergence, so as to optimize both the cluster distribution and the node representations, which yields the cluster loss.
The cluster loss is denoted L_cluster and can be expressed as:
L_cluster = KL(P || Q) = Σ_i Σ_u p_iu log(p_iu / q_iu)
where KL(P||Q) denotes the KL divergence (Kullback-Leibler divergence) between the auxiliary distribution p_iu and the soft cluster distribution q_iu. As shown in fig. 3d, the KL divergence pulls the soft cluster distribution q_iu and the auxiliary distribution p_iu closer together.
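A compact PyTorch sketch of the soft cluster distribution, the auxiliary distribution and the KL-based cluster loss follows, using the t-distribution and squared-q definitions recovered above; the degrees-of-freedom parameter alpha and the reduction over nodes are assumptions of the sketch.

```python
import torch

def soft_cluster_distribution(z, centroids, alpha=1.0):
    """q_iu ∝ (1 + ||z_i - mu_u||^2 / alpha)^(-(alpha + 1)/2), row-normalized."""
    dist_sq = torch.cdist(z, centroids).pow(2)
    q = (1.0 + dist_sq / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def auxiliary_distribution(q):
    """p_iu ∝ q_iu^2 / f_u with f_u = sum_i q_iu; squaring emphasizes confident assignments."""
    weight = q.pow(2) / q.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def cluster_loss(q, p, eps=1e-12):
    """KL(P || Q), summed over clusters and averaged over nodes (reduction assumed)."""
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=1).mean()
```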
Step S105, guiding model training through the contrast loss and the clustering loss, and obtaining the clustering result of the corresponding node based on the soft cluster distribution after model training is completed.
In the auxiliary distribution p_iu, squaring q_iu achieves an "emphasis" effect that highlights the high-probability, high-confidence assignments. During training, the auxiliary distribution p_iu in effect provides labels. Finally, the difference between the two probability distributions is fitted through the cluster loss formula, achieving unsupervised clustering. Meanwhile, this formula also serves as the cluster loss L_cluster that guides the whole training process.
In this embodiment, the second node representation H_θ and the contrast loss L_con are obtained through contrast learning, and the second node representation H_θ is further used by the unsupervised clustering module to obtain the clustering loss L_cluster. The sum of the clustering loss L_cluster and the contrast loss L_con is the total loss, denoted L_all:
L_all = L_con + L_cluster
After the training of the spatial transcriptome data clustering model based on contrast learning is completed, an estimate of the label of the target node v_i, namely the clustering result, is obtained from the soft cluster distribution q_iu and expressed as s_i:
s_i = argmax_u q_iu
i.e. the cluster index at which the soft cluster distribution q_iu attains its maximum is taken as the clustering result.
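The pieces can be tied together in one training iteration as sketched below. This reuses the helpers from the earlier sketches (GCNEncoder, momentum_update, contrast_loss, soft_cluster_distribution, auxiliary_distribution, cluster_loss); the predictor is assumed to be a small MLP, and the schedule for refreshing the positive sets and the K-means centroids is left outside the sketch because the embodiment does not fix it.

```python
import torch

def train_step(x, adj_norm, f_theta, f_xi, predictor, centroids,
               optimizer, positives, tau=0.99):
    """One optimization step over L_all = L_con + L_cluster."""
    h_xi = f_xi(x, adj_norm)                # teacher representation H_xi (no gradients)
    h_theta = f_theta(x, adj_norm)          # student representation H_theta
    z_theta = predictor(h_theta)            # predictor output Z_theta

    q = soft_cluster_distribution(z_theta, centroids)
    p = auxiliary_distribution(q).detach()  # auxiliary targets are held fixed

    loss = contrast_loss(z_theta, h_xi, positives) + cluster_loss(q, p)  # L_all

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    momentum_update(f_xi, f_theta, tau)     # xi <- tau * xi + (1 - tau) * theta

    labels = q.argmax(dim=1)                # s_i = argmax_u q_iu
    return loss.item(), labels
```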
According to the above scheme, the gene expression matrix constructed based on the spatial transcriptome data is preprocessed to obtain the weighted feature matrix and the adjacency matrix, and an adjacency graph is constructed based on the weighted feature matrix and the adjacency matrix; the adjacency graph is input into the first encoder and the second encoder of the twin network structure respectively, and the corresponding first node representation and second node representation are learned through the first encoder and the second encoder; a positive sample set is constructed based on the first node representation and the second node representation, and the contrast loss between the node representation predicted value of the second node representation and the first node representation is calculated according to the positive sample set; the soft cluster distribution and auxiliary distribution of each node are determined, and the clustering loss is calculated based on the soft cluster distribution and the auxiliary distribution; model training is guided through the contrast loss and the clustering loss, and the clustering result of the corresponding node is obtained based on the soft cluster distribution after model training is completed. In this way, the node representations used to construct the positive sample set are obtained through contrast learning with the twin network structure, the contrast loss and the clustering loss are then calculated, and training is guided by the contrast loss and the clustering loss between nodes, so that a data clustering method for spatial transcriptome data is obtained based on contrast learning, and the pertinence and accuracy of spatial transcriptome data clustering are improved.
As shown in fig. 5, a second embodiment of the present invention proposes a spatial transcriptome data clustering method based on contrast learning, based on the first embodiment shown in fig. 2, the step S101 includes:
step S1011, constructing a gene expression matrix based on gene expression quantity information of the space transcriptome, wherein behavior cells/capture points of the gene expression matrix are listed as gene expression quantity;
the information of the gene expression level required in this example was extracted from the spatial transcriptome, and mainly includes the cell/capture site and the gene expression level. Then, the cells/capture spots are used as rows and the gene expression amounts are used as columns to construct a gene expression matrix.
Step S1012, preprocessing gene expression quantity data in the gene expression matrix, and performing dimension reduction processing on the gene expression matrix to obtain a feature matrix X;
the original gene expression quantity data has a large amount of noise, and data preprocessing is needed. And deleting genes expressed in less than 3 cells according to the high-dimensional sparsity of the gene expression data, carrying out normalization and standardization treatment on the data, and finally reducing the dimension by using PCA to obtain a feature matrix X.
Step S1013, determining an adjacency graph of the feature matrix X based on the spatial information between the cells/capture points.
Self-supervised clustering is performed on the graph formed by the cells/capture points, so a graph that reflects the relationships between the cells/capture points, namely the adjacency graph, is constructed using the spatial information.
Specifically, the cells/capture points are taken as nodes, and the distance between nodes is calculated according to their spatial coordinates; this embodiment calculates the distance between any two nodes.
Edges are added between each node and its several nearest neighbors to obtain an adjacency matrix; target edges are then selected based on a distance threshold, and each target edge is weighted based on the distance between its nodes to obtain the weighted adjacency matrix A. Referring to fig. 3a, the closer target edges are determined based on the distance threshold (threshold).
The adjacency graph is obtained based on the nodes in the feature matrix X and the edges of the weighted adjacency matrix. Taking the feature matrix X as the attribute of the nodes, the adjacency graph G(X, A) = (V, E) is finally obtained, where V is the node set containing all cells/capture points and E is the edge set containing the edges between all nodes.
In this embodiment, through the above scheme, the weighted feature matrix and the weighted adjacency matrix are obtained by preprocessing the gene expression matrix constructed based on the spatial transcriptome data, and the adjacency graph is constructed based on the weighted feature matrix and the adjacency matrix, so that the spatial transcriptome data is processed into an adjacency graph suitable for contrast learning, which facilitates the implementation of the contrast learning steps.
Further, to achieve the above objective, the present invention further provides a spatial transcriptome data clustering device based on contrast learning, specifically, referring to fig. 6, fig. 6 is a schematic functional block diagram of a first embodiment of a spatial transcriptome data clustering device based on contrast learning according to the present invention, where the device includes:
the construction module 10 is used for preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix;
a learning module 20, configured to input the adjacency graph into a first encoder and a second encoder of a twin network structure, respectively, and learn corresponding first node representations and second node representations through the first encoder and the second encoder;
a contrast loss calculation module 30, configured to construct a positive sample set based on the first node representation and the second node representation, and calculate a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set;
a cluster loss calculation module 40, configured to determine a soft cluster distribution and an auxiliary distribution of each node, and calculate a cluster loss based on the soft cluster distribution and the auxiliary distribution;
And the clustering module 50 is used for guiding model training through the contrast loss and the clustering loss, and obtaining a clustering result of the corresponding node based on the soft clustering distribution after model training is completed.
Further, the building block 10 comprises:
a construction unit for constructing a gene expression matrix based on gene expression level information of the spatial transcriptome, wherein the rows of the gene expression matrix are cells/capture points and the columns are gene expression levels;
the preprocessing unit is used for preprocessing the gene expression quantity data in the gene expression matrix and performing dimension reduction processing on the gene expression matrix to obtain a feature matrix;
and an adjacency graph determining unit for determining an adjacency graph of the feature matrix based on the spatial information between the cells/capture points.
Further, the adjacency graph determination unit includes:
a calculating subunit, configured to calculate a distance between nodes according to the spatial coordinates by using the cell/capture point as a node;
an adding subunit, configured to add edges to a plurality of neighbors of each node that are closest to each node, to obtain an adjacency matrix;
the weighting subunit is used for selecting target edges based on the distance threshold value, and carrying out weighting processing on each target edge based on the distance between the nodes to obtain a weighted adjacent matrix;
An obtaining subunit is configured to obtain an adjacency graph based on the nodes in the feature matrix and the edges of the weighted adjacency matrix.
Further, the contrast loss calculation module 30 includes:
a positive sample set construction unit for constructing a positive sample set of nodes based on the first node representation and the second node representation of each node;
and the contrast loss calculation unit is used for calculating the contrast loss based on the total number of the nodes, the first node representation in the positive sample set and the node representation predicted value of the second node representation.
Further, the positive sample set constructing unit includes:
a cosine similarity determination subunit, configured to determine cosine similarity between the target node and the other nodes based on the second node representation of the target node and the first node representation of the other nodes;
the K-neighbor node set determining subunit is used for determining the K-neighbor node set of each target node after the cosine similarity between each target node and other nodes is obtained;
the local semantic positive sample set determining subunit is used for determining the intersection set of the target node in the K-neighbor node set and the neighbor nodes in the adjacent graph as a local semantic positive sample;
the global semantic positive sample set determining subunit is used for determining similar nodes belonging to the same K-means cluster with the target node, and determining the intersection set of the K-neighbor node set and the similar nodes as a global semantic positive sample;
And the positive sample set determining subunit is used for determining the union set of the local semantic positive samples and the global semantic positive samples as a positive sample set of a target node.
Further, the contrast loss calculation module 30 further includes:
and inputting the second node representation into a predictor, and obtaining a node representation predicted value of the second node representation.
Further, the cluster loss calculation module 40 further includes:
the clustering unit, which is used for carrying out initial K-means clustering on the second node representation, obtaining a plurality of clusters and determining the centroid feature representation of each cluster;
a soft cluster distribution determining unit, configured to determine the soft cluster distribution of the corresponding node based on the centroid feature representation of the cluster where the node is located and the node representation predicted value of the second node representation;
an auxiliary distribution determining unit, configured to determine a corresponding auxiliary distribution based on the soft cluster distribution of the node;
and the cluster loss calculation unit is used for calculating cluster loss based on the soft cluster distribution and the auxiliary distribution of all the nodes.
In addition, the invention also provides a computer readable storage medium, on which a spatial transcriptome data clustering program based on contrast learning is stored, and the steps of the spatial transcriptome data clustering method based on contrast learning described above are implemented when the spatial transcriptome data clustering program based on contrast learning is run by a processor, which are not described herein.
Compared with the prior art, the spatial transcriptome data clustering method, device, equipment and storage medium based on contrast learning provided by the invention comprise the following steps: preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix; inputting the adjacency graph into a first encoder and a second encoder of a twin network structure respectively, and learning the corresponding first node representation and second node representation through the first encoder and the second encoder; constructing a positive sample set based on the first node representation and the second node representation, and calculating the contrast loss between the node representation predicted value of the second node representation and the first node representation according to the positive sample set; determining the soft cluster distribution and auxiliary distribution of each node, and calculating the clustering loss based on the soft cluster distribution and the auxiliary distribution; and guiding model training through the contrast loss and the clustering loss, and obtaining the clustering result of the corresponding node based on the soft cluster distribution after model training is completed. In this way, the node representations used to construct the positive sample set are obtained through contrast learning with the twin network structure, the contrast loss and the clustering loss are then calculated, and training is guided by the contrast loss and the clustering loss between nodes, so that a data clustering method for spatial transcriptome data is obtained based on contrast learning, and the pertinence and accuracy of spatial transcriptome data clustering are improved.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or modifications in the structures or processes described in the specification and drawings, or the direct or indirect application of the present invention to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A spatial transcriptome data clustering method based on contrast learning, the method comprising:
preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix;
respectively inputting the adjacency graph into a first encoder and a second encoder of a twin network structure, and learning corresponding first node representation and second node representation through the first encoder and the second encoder;
constructing a positive sample set based on the first node representation and the second node representation, and calculating a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set;
determining soft cluster distribution and auxiliary distribution of each node, and calculating cluster loss based on the soft cluster distribution and the auxiliary distribution;
training the model under the guidance of the contrast loss and the clustering loss, and obtaining a clustering result of the corresponding node based on the soft cluster distribution after model training is completed.
2. The method of claim 1, wherein preprocessing the gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and adjacency matrix comprises:
constructing a gene expression matrix based on gene expression level information of the spatial transcriptome, wherein the rows of the gene expression matrix are cells/capture points and the columns are gene expression levels;
preprocessing the gene expression level data in the gene expression matrix, and performing dimension reduction on the gene expression matrix to obtain a feature matrix;
determining an adjacency graph of the feature matrix based on spatial information between the cells/capture points.
3. The method of claim 2, wherein the determining the adjacency graph of the feature matrix based on spatial information between the cells/capture points comprises:
taking the cells/capture points as nodes, and calculating the distance between nodes according to their spatial coordinates;
adding edges between each node and its several nearest neighbors to obtain an adjacency matrix;
selecting target edges based on a distance threshold, and weighting each target edge based on the distance between its nodes to obtain a weighted adjacency matrix;
obtaining an adjacency graph based on nodes in the feature matrix and edges of the weighted adjacency matrix.
4. The method of claim 1, wherein constructing a positive sample set based on the first node representation and the second node representation, and calculating a loss of contrast between a node representation predictor of the second node representation and the first node representation from the positive sample set comprises:
constructing a positive sample set of nodes based on the first node representation and the second node representation of each node;
the contrast loss is calculated based on the total number of nodes, the first node representation within the positive sample set, and the node representation predicted value of the second node representation.
5. The method of claim 4, wherein the constructing a positive sample set based on the first node representation, the second node representation, and spatial location comprises:
determining the cosine similarity between a target node and the other nodes based on the second node representation of the target node and the first node representations of the other nodes;
after the cosine similarity between each target node and the other nodes is obtained, determining a K-neighbor node set of each target node;
determining the intersection of the K-neighbor node set of the target node and the neighbor nodes of the target node in the adjacency graph as local semantic positive samples;
determining similar nodes belonging to the same K-means cluster as the target node, and determining the intersection of the K-neighbor node set and the similar nodes as global semantic positive samples;
determining the union of the local semantic positive samples and the global semantic positive samples as the positive sample set of the target node (illustrative sketch below).
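A sketch of the positive-sample-set construction of claim 5, assuming cosine similarity computed with scikit-learn, a K-neighbor set of size k, and K-means labels fitted on the second node representation; k and the cluster count are illustrative assumptions.

```python
# Sketch of claim 5: local semantic positives are K-neighbours (by cosine
# similarity) that are also spatial neighbours in the adjacency graph; global
# semantic positives are K-neighbours sharing the target's K-means cluster;
# the positive set is their union.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def build_positive_sets(h1, h2, adj, k=15, n_clusters=7):
    sim = cosine_similarity(h2, h1)                   # target (rows, h2) vs others (columns, h1)
    np.fill_diagonal(sim, -np.inf)                    # exclude the node itself
    knn = np.argsort(-sim, axis=1)[:, :k]             # K-neighbor node sets
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(h2)
    pos_sets = []
    for i in range(h1.shape[0]):
        knn_i = set(knn[i])
        graph_nb = set(np.flatnonzero(adj[i] > 0))    # neighbours in the adjacency graph
        same_cluster = set(np.flatnonzero(labels == labels[i])) - {i}
        local = knn_i & graph_nb                      # local semantic positives
        global_ = knn_i & same_cluster                # global semantic positives
        pos_sets.append(sorted(local | global_))      # union -> positive sample set
    return pos_sets
```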
6. The method of claim 1, wherein, prior to constructing a positive sample set based on the first node representation and the second node representation and calculating a contrast loss between a node representation predicted value of the second node representation and the first node representation according to the positive sample set, the method further comprises:
inputting the second node representation into a predictor to obtain the node representation predicted value of the second node representation (illustrative sketch below).
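A minimal predictor for claim 6, assumed here to be a two-layer MLP that maps the second node representation to its predicted value; the width and depth are not specified by the claim and are assumptions.

```python
# Hypothetical predictor (claim 6): maps the second node representation h2 to
# its node representation predicted value.
import torch.nn as nn

class Predictor(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, h2):
        return self.net(h2)   # node representation predicted value of h2
```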
7. The method of claim 1, wherein the determining soft cluster distribution and auxiliary distribution for each node and calculating cluster loss based on the soft cluster distribution and auxiliary distribution comprises:
performing initial K-means clustering on the second node representation to obtain a plurality of clusters, and determining a centroid feature representation of each cluster;
determining the soft cluster distribution of the corresponding node based on the centroid feature representation of the cluster where the node is located and the node representation predicted value of the second node representation;
determining the corresponding auxiliary distribution based on the soft cluster distribution of the node;
calculating the cluster loss based on the soft cluster distributions and the auxiliary distributions of all the nodes (illustrative sketch below).
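A DEC-style sketch of the clustering objective in claim 7: a Student's t soft assignment against cluster centroids, a sharpened auxiliary distribution, and a KL-divergence cluster loss. The choice of the Student's t kernel and of this particular auxiliary distribution is an assumption.

```python
# Sketch of the soft cluster distribution Q, auxiliary distribution P, and
# cluster loss KL(P || Q) for claim 7.
import torch
import torch.nn.functional as F

def soft_assignment(z, centroids, alpha=1.0):
    """z: node representation predicted values (nodes x dim); returns Q (nodes x clusters)."""
    dist2 = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)   # Student's t kernel
    return q / q.sum(dim=1, keepdim=True)

def auxiliary_distribution(q):
    """Auxiliary distribution P that sharpens high-confidence assignments."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(q):
    p = auxiliary_distribution(q).detach()                 # target is treated as constant
    return F.kl_div(q.log(), p, reduction='batchmean')     # KL(P || Q)
```

Detaching the auxiliary distribution so that gradients flow only through the soft assignment is the usual design choice for this kind of self-training objective.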
8. A spatial transcriptome data clustering device based on contrast learning, comprising:
the construction module is used for preprocessing a gene expression matrix constructed based on the spatial transcriptome data to obtain a weighted feature matrix and an adjacency matrix, and constructing an adjacency graph based on the weighted feature matrix and the adjacency matrix;
the learning module is used for inputting the adjacency graph into a first encoder and a second encoder of a twin network structure respectively, and learning corresponding first node representation and second node representation through the first encoder and the second encoder;
the contrast loss calculation module is used for constructing a positive sample set based on the first node representation and the second node representation, and calculating the contrast loss between the node representation predicted value of the second node representation and the first node representation according to the positive sample set;
the cluster loss calculation module is used for determining the soft cluster distribution and auxiliary distribution of each node, and calculating the cluster loss based on the soft cluster distribution and the auxiliary distribution;
and the clustering module is used for guiding model training through the contrast loss and the clustering loss, and obtaining a clustering result of the corresponding node based on the soft clustering distribution after model training is completed (illustrative sketch below).
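An illustrative wiring of the modules of claim 8 around a twin graph encoder; the one-layer propagation rule, the independently parameterised encoders, and the learnable centroids are assumptions, not details taken from the patent.

```python
# Hypothetical module wiring for claim 8: twin encoders producing the first and
# second node representations, a predictor for the contrast loss module, and
# centroids used by the cluster loss module.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """One-layer graph encoder: H = ReLU(A_hat @ X @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj_norm):
        return torch.relu(adj_norm @ self.lin(x))

class TwinClusteringModel(nn.Module):
    def __init__(self, in_dim, out_dim, n_clusters):
        super().__init__()
        self.encoder_1 = GraphEncoder(in_dim, out_dim)        # learning module, first view
        self.encoder_2 = GraphEncoder(in_dim, out_dim)        # learning module, second view
        self.predictor = nn.Sequential(                       # used by the contrast loss module
            nn.Linear(out_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))
        self.centroids = nn.Parameter(torch.randn(n_clusters, out_dim))  # used by the cluster loss module

    def forward(self, x, adj_norm):
        h1 = self.encoder_1(x, adj_norm)
        h2 = self.encoder_2(x, adj_norm)
        return h1, h2, self.predictor(h2)
```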
9. A spatial transcriptome data clustering device based on contrast learning, comprising a memory, a processor, and a contrast learning based spatial transcriptome data clustering program stored on the memory, wherein the program, when executed by the processor, performs the steps of the method according to any one of claims 1-7.
10. A computer readable storage medium, characterized in that a contrast learning based spatial transcriptome data clustering program is stored thereon, and the program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
CN202311204657.5A 2023-09-18 2023-09-18 Spatial transcriptome data clustering method, device and medium based on contrast learning Active CN117153260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311204657.5A CN117153260B (en) 2023-09-18 2023-09-18 Spatial transcriptome data clustering method, device and medium based on contrast learning

Publications (2)

Publication Number Publication Date
CN117153260A true CN117153260A (en) 2023-12-01
CN117153260B CN117153260B (en) 2024-06-25

Family

ID=88908033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311204657.5A Active CN117153260B (en) 2023-09-18 2023-09-18 Spatial transcriptome data clustering method, device and medium based on contrast learning

Country Status (1)

Country Link
CN (1) CN117153260B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117763163A (en) * 2023-12-25 2024-03-26 北京智谱华章科技有限公司 Method, device, equipment and medium for fusion encoding of text structure information of graph
CN118016149A (en) * 2024-04-09 2024-05-10 太原理工大学 Spatial domain identification method for integrating space transcriptome multi-mode information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160098519A1 (en) * 2014-06-11 2016-04-07 Jorge S. Zwir Systems and methods for scalable unsupervised multisource analysis
KR20210102039A (en) * 2020-02-11 2021-08-19 삼성전자주식회사 Electronic device and control method thereof
CN115310554A (en) * 2022-08-24 2022-11-08 江苏至信信用评估咨询有限公司 Item allocation strategy, system, storage medium and device based on deep clustering
CN116312782A (en) * 2023-05-18 2023-06-23 南京航空航天大学 Spatial transcriptome spot region clustering method fusing image gene data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
侯海薇: "Research Progress on Deep Clustering Based on Unsupervised Representation Learning", Pattern Recognition and Artificial Intelligence, 15 November 2022 (2022-11-15), pages 999-1014 *

Also Published As

Publication number Publication date
CN117153260B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN117153260B (en) Spatial transcriptome data clustering method, device and medium based on contrast learning
Gu et al. Stack-captioning: Coarse-to-fine learning for image captioning
CN111461226A (en) Countermeasure sample generation method, device, terminal and readable storage medium
CN108665065B (en) Method, device and equipment for processing task data and storage medium
CN107169983B (en) Multi-threshold image segmentation method based on cross variation artificial fish swarm algorithm
CN114927162A (en) Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution
CN103679715B (en) A kind of handset image feature extracting method based on Non-negative Matrix Factorization
CN114202123A (en) Service data prediction method and device, electronic equipment and storage medium
Liu et al. EACP: An effective automatic channel pruning for neural networks
CN110555530B (en) Distributed large-scale gene regulation and control network construction method
Xia et al. TCC-net: A two-stage training method with contradictory loss and co-teaching based on meta-learning for learning with noisy labels
CN111598093B (en) Method, device, equipment and medium for generating structured information of characters in picture
WO2021059527A1 (en) Learning device, learning method, and recording medium
CN109859063B (en) Community discovery method and device, storage medium and terminal equipment
CN114638823B (en) Full-slice image classification method and device based on attention mechanism sequence model
KR102000832B1 (en) miRNA and mRNA ASSOCIATION ANALYSIS METHOD AND GENERATING APPARATUS FOR miRNA and mRNA ASSOCIATION NETWORK
CN115148292A (en) Artificial intelligence-based DNA (deoxyribonucleic acid) motif prediction method, device, equipment and medium
Tuba et al. Modified seeker optimization algorithm for image segmentation by multilevel thresholding
CN115587616A (en) Network model training method and device, storage medium and computer equipment
CN114613433A (en) Method for analyzing pseudo-time trajectory of single-cell transcriptome data and computer system
CN114268625B (en) Feature selection method, device, equipment and storage medium
CN113435519A (en) Sample data enhancement method, device, equipment and medium based on antagonistic interpolation
Li et al. A BYY scale-incremental EM algorithm for Gaussian mixture learning
Li et al. CoAxNN: Optimizing on-device deep learning with conditional approximate neural networks
US9183503B2 (en) Sparse higher-order Markov random field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant