CN116189785A - Spatial domain identification method based on spatial transcriptomics data feature extraction - Google Patents

Spatial domain identification method based on spatial transcriptomics data feature extraction Download PDF

Info

Publication number
CN116189785A
Authority
CN
China
Prior art keywords
gene expression
matrix
spatial
network
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310097081.0A
Other languages
Chinese (zh)
Inventor
贾松卫 (Jia Songwei)
崔议文 (Cui Yiwen)
兰猛 (Lan Meng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310097081.0A priority Critical patent/CN116189785A/en
Publication of CN116189785A publication Critical patent/CN116189785A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention discloses a spatial domain identification method based on spatial transcriptome data feature extraction, which mainly solves the prior-art problems of overfitting during feature extraction from spatial transcriptome data and low spatial domain identification accuracy. The implementation scheme is as follows: preprocess the gene expression data and spatial information measured in a spatial transcriptome; construct a gene similarity network and a spatial neighborhood network from the gene expression feature matrix and the spatial information; apply data enhancement to the gene similarity network and the spatial neighborhood network; construct a feature extraction model and input the enhanced data into the model to calculate a contrast loss and a reconstruction loss; train the model according to the calculated loss, then input the non-enhanced data into the trained model to obtain a low-dimensional embedding; cluster the low-dimensional embedding to complete spatial domain identification. The method avoids overfitting in the feature extraction process, improves the accuracy of spatial domain identification, and can be used to provide reference data for exploring biological development and treating diseases.

Description

Spatial domain identification method based on spatial transcriptomics data feature extraction
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a spatial domain identification method which can be used for providing reference data for exploring biological development and treating diseases.
Background
In tissue sections, some regions share a similar spatial gene expression profile and form specific structures or substructures in the tissue. Owing to differences in cell type composition and gene expression, these regions perform different functions, thereby forming spatial domains with specific, biologically meaningful structures. Identifying spatial domains is critical for studying tissue structure and cell-cell interactions.
Single-cell transcriptome sequencing (scRNA-seq) provides high-resolution gene expression profiles; however, because spatial position information cannot be retained during sample preparation, downstream analysis is limited. Spatial transcriptome sequencing techniques, including imaging techniques based on in situ hybridization and in situ sequencing techniques based on spatial barcodes, provide both the gene expression profile and the spatial location information that are critical to understanding healthy tissue development and the disease tumor microenvironment. Spatial transcriptome data thus help better describe the spatial organization of cells. Mining regions with similar expression patterns from spatial transcriptome data by clustering in order to interpret the spatial organization of cells, i.e., identifying spatial domains, is therefore one of the most important tasks of spatial transcriptomics.
Traditional clustering algorithms such as Louvain and K-means cannot effectively use the available spatial information, so the clustering result cannot consistently identify tissue regions with a clear layered structure in a tissue section and cannot provide an accurate reference for downstream analysis. A spatial clustering method that exploits both the gene expression profile and the spatial position coordinates of spatial transcriptome data is therefore required.
In 2021, Jian Hu et al. proposed in Nature Methods a deep learning algorithm called SpaGCN that integrates gene expression, spatial location and histological images through a graph convolutional network. It first constructs a graph representing the relationship between spots by combining spatial positions with the histological image, then aggregates gene expression information from adjacent spots using graph convolutional layers, and finally clusters the spots on the aggregated expression matrix with an unsupervised iterative clustering algorithm.
In 2021, Edward Zhao et al. proposed in Nature Biotechnology an algorithm named BayesSpace. BayesSpace models a low-dimensional representation of the gene expression matrix and, through a Bayesian statistical method, introduces the spatial neighbor structure into the prior to encourage adjacent pixels to belong to the same cluster, thereby realizing spatial clustering.
In 2022, Shihua Zhang et al. proposed in Nature Communications a new framework, STAGATE, based on a graph attention auto-encoder; it automatically learns the weights of inter-node edges through an attention mechanism while embedding spatial information, taking into account the spatial similarity of pixels at spatial domain boundaries.
In 2022, Chang Xu et al. proposed in Nucleic Acids Research a deep neural network framework, DeepST, which uses a neural network to extract histological image features, creates a spatially enhanced gene expression matrix from gene expression and spatial location, and combines a graph convolutional network with a denoising auto-encoder to generate a latent representation of the enhanced ST data.
These algorithms all suffer from the following disadvantages:
First, the addition of histological features increases the complexity of the model while improving clustering accuracy, so that memory consumption is large and running time is long.
Second, some algorithms place excessive weight on spatial information, over-correcting the gene expression features and causing the clustering to overfit, so that some fine regions cannot be identified and the identified spatial domains cannot support an accurate analysis of biological function.
Third, the results of repeated runs are unstable and differ greatly, and good results are obtained only on datasets measured by in situ sequencing-based spatial transcriptome techniques, while results on imaging-based datasets are poor, so spatial transcriptome datasets cannot be analyzed broadly.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a spatial domain identification method based on spatial transcriptome data feature extraction, so as to extract the joint features of the gene expression profile and the spatial position information in spatial transcriptome data, improve the generalization capability over spatial transcriptomes produced by both sequencing-based and imaging-based means, and complete an accurate analysis of spatial domain biological functions.
The technical scheme of the invention is as follows: preprocess the gene expression data and spatial information measured in a spatial transcriptome; construct a gene similarity network and a spatial neighborhood network from the gene expression feature matrix and the spatial information; apply data enhancement to the two networks; construct a feature extraction model and input the enhanced data into the model to calculate a contrast loss and a reconstruction loss; train the model according to the calculated loss, then input the non-enhanced data into the trained model to obtain a low-dimensional embedding; cluster the low-dimensional embedding to complete spatial domain identification. The implementation comprises the following steps:
(1) Simultaneously measuring the gene expression value and the spatial position coordinate of each pixel point in the required tissue slice by using a spatial transcriptome sequencing technology, obtaining spatial transcriptome data comprising a pixel point-gene expression matrix and the spatial position of each pixel point in the tissue slice;
(2) Preprocessing the gene expression matrix of the spatial transcriptome data:
(2a) Deleting genes that are expressed in fewer than three pixel points in the spatial transcriptome data;
(2b) Normalizing the filtered data numerically so that the total count of each cell equals the median of all cells, applying a logarithmic transformation to the normalized data, and standardizing it to zero mean and unit variance;
(2c) Performing Principal Component Analysis (PCA) on the standardized data, extracting the first n principal components, and generating a feature matrix X of gene expression;
(3) Constructing a spatial neighborhood network:
(3a) Calculating the Euclidean distance d between each pair of pixel points in the tissue slice based on the spatial coordinate information;
(3b) Selecting the first k nearest neighbors of each pixel point based on the Euclidean distance d calculated from the spatial coordinates, and constructing the adjacency matrix A representing the spatial information;
(3c) Taking the gene expression feature matrix X generated in step (2) as the node attribute feature matrix;
(3d) Based on the adjacency matrix A representing the spatial information and the node attribute feature matrix X, forming the spatial neighborhood network G_1(A, X);
(4) Constructing a gene expression similarity network:
(4a) Calculating the Euclidean distance d' between the gene expression values of each pair of pixel points in the tissue slice based on the gene expression feature matrix X generated in step (2);
(4b) Based on the Euclidean distance d' between the gene expression values, selecting the first k nearest neighbors of each pixel point and constructing the adjacency matrix B representing the gene expression similarity;
(4c) Based on the adjacency matrix B representing the gene expression similarity and the node attribute feature matrix X, forming the gene expression similarity network G_2(B, X);
(5) Data enhancement:
(5a) Masking the edges and node attribute features in the spatial neighborhood network according to a given Bernoulli-distributed edge mask probability p_r and node feature mask probability p_m, obtaining the enhanced spatial neighborhood network G_1(A_1, X_1);
(5b) Masking the edges and node attribute features in the gene expression similarity network according to a given Bernoulli-distributed edge mask probability p_r' and node feature mask probability p_m', obtaining the enhanced gene expression similarity network G_2(B_1, X_2);
(6) Constructing a feature extraction model of spatial transcriptome data consisting of an encoder f(·) cascaded respectively with a decoder h(·) and a projector g(·), and using the contrast loss L_con and the reconstruction loss L_recon as the loss function L;
(7) Training a feature extraction model of the spatial transcriptome data:
(7a) Inputting the adjacency matrix A_1 and node attribute feature matrix X_1 of the enhanced spatial neighborhood network G_1(A_1, X_1) and the adjacency matrix B_1 and node attribute feature matrix X_2 of the enhanced gene expression similarity network G_2(B_1, X_2) into the spatial transcriptome feature extraction model, generating the low-dimensional embeddings Z_1 and Z_2 by the encoder and the reconstructed gene expression feature matrices X̂_1 and X̂_2 by the decoder;
(7b) Computing the contrast loss of the low-dimensional embeddings Z_1 and Z_2 and the reconstruction loss between the reconstructed gene expression feature matrices X̂_1, X̂_2 and the node attribute feature matrix X, and updating the network parameters according to the calculated losses until the loss function L converges, obtaining the trained spatial transcriptome feature extraction model;
(8) Inputting the adjacency matrix A and node attribute feature matrix X of the spatial neighborhood network without data enhancement into the spatial transcriptome feature extraction model trained in step (7b), obtaining a combined low-dimensional embedding Z containing spatial information and gene expression;
(9) Clustering the obtained combined low-dimensional embedding Z using the Leiden clustering algorithm, obtaining regions of consistent gene expression on the tissue slice, i.e., spatial domains.
Compared with the prior art, the invention has the following advantages:
1) Because the spatial neighborhood network and the gene expression similarity network are constructed by combining the spatial information and the gene expression profile of the spatial transcriptome data, the invention balances spatial information and gene expression information better than existing methods, prevents overfitting to the gene expression profile, and improves the accuracy and robustness of spatial domain identification.
2) Because the invention uses only the gene expression profile and spatial information, without adding histological image features, the efficiency of the model is improved, the running time is reduced, and larger datasets generated in the future can be handled.
3) Because contrast loss is introduced to train the low-dimensional embedding, similar samples are drawn closer and dissimilar samples are pushed apart, which fits the spatial clustering problem well; compared with the prior art, the generalization capability over datasets generated by the two spatial transcriptome sequencing means is improved.
4) Because a model architecture cascading the encoder with the decoder is designed, and the contrast loss and the reconstruction loss are considered at the same time, denoised data can be generated and the biological meaning in the original samples is better retained.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of data enhancement in accordance with the present invention;
FIG. 3 is a feature extraction model diagram of spatial transcriptome data constructed in the present invention;
FIG. 4 is a visualization of the spatial clustering results obtained with the present invention and the existing STAGATE and DeepST methods, respectively.
Detailed Description
Embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.
Existing spatial transcriptome data include imaging techniques based on in situ hybridization and in situ sequencing techniques based on spatial barcodes, where the imaging techniques include STARmap and MERFISH, and the in situ sequencing techniques include Spatial Transcriptomics, 10x Visium and Slide-seq. This example uses the spatial transcriptome dataset of the 10x Visium-sequenced human dorsolateral prefrontal cortex slice 151673, which contains 3639 pixel points with 33538 genes per pixel point.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, preprocessing a gene expression matrix of space transcriptome data.
1.1 Acquiring pixel-gene expression matrix data in a spatial transcriptome data set of a spatial transcriptome sequenced human dorsal lateral forehead cortex layer 151673 slice of 10x Visum, deleting genes expressed in less than three pixels in gene expression values in the spatial transcriptome data to realize data filtering, and obtaining 3639 pixels and 19151 genes remained after filtering;
1.2 Median normalization of the transcriptome data after filtering, that is, dividing each column of data by the median of the column of data, and then carrying out logarithmic conversion on the data after normalization of the median, and normalizing the data into zero mean and unit variance;
1.3 Main component analysis PCA is carried out on the standardized data, the first 300 main components are extracted, and a feature matrix X of gene expression is generated:
X=[x 1 ;x 2 …;x i ;…;x n ] T
wherein, [;]representing the splicing operation, x i For the gene feature vector of i pixels, i=1..n, n is the number of all pixels in a tissue slice, and T represents the transpose.
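For orientation only, the preprocessing of steps 1.1)-1.3) maps directly onto the Scanpy library; the following is a minimal sketch under that assumption, where the dataset path is hypothetical and Scanpy's normalize_total defaults to the median of total counts:

```python
# Hedged sketch of steps 1.1)-1.3) with Scanpy; the dataset path is hypothetical.
import scanpy as sc

adata = sc.read_visium("DLPFC_151673/")   # 10x Visium slice 151673 (path assumed)

sc.pp.filter_genes(adata, min_cells=3)    # 1.1) drop genes expressed in < 3 pixel points
sc.pp.normalize_total(adata)              # 1.2) normalize each pixel point to the median total count
sc.pp.log1p(adata)                        #      logarithmic transformation
sc.pp.scale(adata)                        #      standardize to zero mean and unit variance
sc.pp.pca(adata, n_comps=300)             # 1.3) PCA, keep the first 300 principal components

X = adata.obsm["X_pca"]                   # gene expression feature matrix X, shape (3639, 300)
```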
Step 2, constructing the spatial neighborhood network.
2.1) Acquire the spatial position coordinate data from the spatial transcriptome dataset of the 10x Visium-sequenced human dorsolateral prefrontal cortex slice 151673, and calculate the Euclidean distance d between the spatial positions of the pixel points based on the spatial coordinate information:

d_ij = sqrt((a_i - a_j)^2 + (b_i - b_j)^2),

where (a_i, b_i) and (a_j, b_j) are the spatial coordinates of pixel point i and pixel point j on the tissue slice;
2.2) Select the first 5 nearest neighbors of each pixel point based on the Euclidean distance d computed from the spatial coordinates, and construct the adjacency matrix A representing the spatial information:

A = [a_ij] ∈ {0,1}^(n×n), with a_ij = 1 if node i is among the first 5 nearest neighbors of node j computed from the spatial coordinates, and a_ij = 0 otherwise,

where a_ij is the element in row i and column j of the adjacency matrix A of the spatial neighborhood network, i and j denote two nodes in the spatial neighborhood network, i, j = 1, …, n, and n = 3639 is the number of nodes contained in the spatial neighborhood network;
2.3) Take the gene expression feature matrix X generated in step 1 as the node attribute feature matrix;
2.4) Based on the adjacency matrix A representing the spatial information and the node attribute feature matrix X, form the spatial neighborhood network G_1(A, X).
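The k-nearest-neighbor construction of steps 2.1)-2.4) can be sketched with scikit-learn as below; the helper name knn_adjacency and the coords array (the (n, 2) spatial coordinates) are our own illustrative assumptions:

```python
# Hedged sketch of the k-nearest-neighbor adjacency construction (k = 5 in this example).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_adjacency(points: np.ndarray, k: int = 5) -> np.ndarray:
    """Return an n x n 0/1 adjacency matrix whose (i, j) entry is 1 when
    node i is among the first k nearest neighbors of node j."""
    n = points.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)  # +1 because each point is its own neighbor
    _, idx = nn.kneighbors(points)                        # idx[j] lists the neighbors of j
    adj = np.zeros((n, n), dtype=np.float32)
    for j in range(n):
        adj[idx[j, 1:], j] = 1.0                          # skip the self-neighbor in column 0
    return adj

A = knn_adjacency(coords, k=5)  # adjacency matrix of the spatial neighborhood network G_1(A, X)
```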
Step 3, constructing the gene expression similarity network.
3.1) Based on the gene expression feature matrix X generated in step 1, calculate the Euclidean distance d' between the gene expression values of the pixel points in the tissue slice:

d'_ij = sqrt(Σ_{k=1}^{m} (x_ik - x_jk)^2),

where x_ik and x_jk are the values of the k-th dimension of the gene expression feature vectors of pixel point i and pixel point j respectively, k = 1, …, m, and m = 300 is the dimension of each pixel point's gene expression feature vector;
3.2) Based on the Euclidean distance d' computed from the gene expression values, select the first 5 nearest neighbors of each pixel point and construct the adjacency matrix B representing the gene expression similarity:

B = [b_ij] ∈ {0,1}^(n×n), with b_ij = 1 if node i is among the first 5 nearest neighbors of node j computed from the gene expression feature matrix, and b_ij = 0 otherwise,

where b_ij is the element in row i and column j of the adjacency matrix B of the gene expression similarity network, i and j denote two nodes in the gene expression similarity network, i, j = 1, …, n, and n = 3639 is the number of nodes contained in the gene expression similarity network, the same as the number of nodes in the spatial neighborhood network;
3.3) Based on the adjacency matrix B representing the gene expression similarity and the node attribute feature matrix X, form the gene expression similarity network G_2(B, X).
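Under the same assumptions, the gene expression similarity network of steps 3.1)-3.3) reuses the helper above, applied to the gene expression feature matrix X instead of the spatial coordinates:

```python
# Euclidean distances over the 300-dimensional gene features yield the second graph.
B = knn_adjacency(X, k=5)  # adjacency matrix of the gene expression similarity network G_2(B, X)
```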
Step 4, enhancing the edge and node attribute features in the spatial neighborhood network G_1(A, X) and the gene expression similarity network G_2(B, X).
In order to increase the training samples and improve the self-supervision capability of the model, data enhancement needs to be applied to the adjacency matrix A and node attribute feature matrix X of the spatial neighborhood network and to the adjacency matrix B and node attribute feature matrix X of the gene expression similarity network.
Referring to FIG. 2, this step is implemented as follows:
4.1) According to each element a_ij in the adjacency matrix A of the spatial neighborhood network, sample an edge mask matrix R̃ ∈ {0,1}^(N×N) from a Bernoulli distribution:

R̃_ij ~ B(1 - p_r) if a_ij = 1, and R̃_ij = 0 if a_ij = 0,

where R̃_ij is the element in row i and column j of the edge mask matrix R̃, p_r = 0.2 is the probability that each edge in the spatial neighborhood network is deleted, i and j denote two nodes in the spatial neighborhood network, i, j = 1, …, n, and n = 3639 is the number of nodes contained in the spatial neighborhood network;
4.2) Multiply the adjacency matrix A of the spatial neighborhood network element-wise with the mask matrix R̃ sampled in 4.1) to obtain the enhanced adjacency matrix A_1:

A_1 = A ⊙ R̃,

where the operator ⊙ denotes element-wise multiplication of the adjacency matrix A of the spatial neighborhood network and the sampled mask matrix R̃, a_ij is the element in row i and column j of A, and R̃_ij is the element in row i and column j of R̃;
4.3) Sample a random vector from the Bernoulli distribution B(1 - p_m) to generate a node feature mask vector m̃ ∈ {0,1}^m with the same dimension as the gene feature vector, where p_m = 0.3 is the probability that each value in a node feature vector of the spatial neighborhood network is deleted;
4.4) Multiply the node attribute feature matrix X of the spatial neighborhood network element-wise with the node feature mask vector m̃ generated in 4.3) to obtain the enhanced node attribute feature matrix X_1:

X_1 = [x_1 ⊙ m̃; x_2 ⊙ m̃; …; x_n ⊙ m̃]^T,

where [;] denotes the concatenation operation, the i-th row of X_1 represents the gene feature vector of the spatial neighborhood network at node i, i = 1, …, n, and n = 3639 is the number of nodes of the spatial neighborhood network;
4.5) According to each element b_ij in the adjacency matrix B of the gene expression similarity network, sample an edge mask matrix R ∈ {0,1}^(N×N) from a Bernoulli distribution:

R_ij ~ B(1 - p_r') if b_ij = 1, and R_ij = 0 if b_ij = 0,

where R_ij is the element in row i and column j of the edge mask matrix R, p_r' = 0.2 is the probability that each edge in the gene expression similarity network is deleted, i and j denote two nodes in the gene expression similarity network, i, j = 1, …, n, and n = 3639 is the number of nodes contained in the gene expression similarity network, consistent with the number of nodes in the spatial neighborhood network;
4.6) Multiply the adjacency matrix B of the gene expression similarity network element-wise with the mask matrix R sampled in 4.5) to obtain the adjacency matrix B_1 of the enhanced gene expression similarity network:

B_1 = B ⊙ R,

where the operator ⊙ denotes element-wise multiplication of the adjacency matrix B of the gene expression similarity network and the sampled mask matrix R, b_ij is the element in row i and column j of B, and R_ij is the element in row i and column j of R;
4.7) Sample a random vector from the Bernoulli distribution B(1 - p_m') to generate a node feature mask vector m with the same dimension as the gene feature vector, where p_m' = 0.3 is the probability that each value in a node feature vector of the gene expression similarity network is deleted;
4.8) Multiply the node attribute feature matrix X of the gene expression similarity network element-wise with the node feature mask vector m generated in 4.7) to obtain the node attribute feature matrix X_2 of the enhanced gene expression similarity network:

X_2 = [x_1 ⊙ m; x_2 ⊙ m; …; x_n ⊙ m]^T,

where [;] denotes the concatenation operation, and the i-th row of X_2 represents the gene feature vector at node i in the gene expression similarity network.
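The Bernoulli masking of steps 4.1)-4.8) amounts to element-wise products with sampled 0/1 masks; a minimal NumPy sketch is given below, assuming A, B and X from the previous steps, with function names of our own choosing. Multiplying the full adjacency matrix by a Bernoulli keep-matrix is equivalent to sampling only at existing edges, since zero entries stay zero:

```python
# Hedged sketch of the edge masking and node-feature masking of steps 4.1)-4.8).
import numpy as np

rng = np.random.default_rng()

def mask_edges(adj: np.ndarray, p_r: float) -> np.ndarray:
    """Keep each existing edge with probability 1 - p_r (steps 4.1)-4.2) and 4.5)-4.6))."""
    keep = rng.binomial(1, 1.0 - p_r, size=adj.shape)  # Bernoulli B(1 - p_r) samples
    return adj * keep                                  # element-wise product, e.g. A ⊙ R̃

def mask_features(feats: np.ndarray, p_m: float) -> np.ndarray:
    """Zero whole feature dimensions with probability p_m (steps 4.3)-4.4) and 4.7)-4.8))."""
    m = rng.binomial(1, 1.0 - p_m, size=feats.shape[1])  # one mask value per feature dimension
    return feats * m                                     # every row x_i is multiplied by the mask

A1, X1 = mask_edges(A, p_r=0.2), mask_features(X, p_m=0.3)  # enhanced G_1(A_1, X_1)
B1, X2 = mask_edges(B, p_r=0.2), mask_features(X, p_m=0.3)  # enhanced G_2(B_1, X_2)
```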
Step 5, constructing the feature extraction model of the spatial transcriptome data.
Referring to FIG. 3, this step is implemented as follows:
5.1) Build an encoder consisting of an input GCN layer and two hidden GCN layers, where the input dimension is the 300-dimensional transcriptome gene feature dimension, the first hidden GCN layer is 256-dimensional, the second hidden GCN layer is 128-dimensional, and a PReLU function is used as the activation function between GCN layers;
5.2) Build a decoder consisting of an input fully connected layer and three hidden fully connected layers in cascade, where the input fully connected layer is 128-dimensional, the first hidden fully connected layer is 128-dimensional, the second hidden fully connected layer is 256-dimensional, the third hidden fully connected layer has the 300-dimensional transcriptome gene feature dimension, and a ReLU function is used as the activation function between fully connected layers;
5.3) Build a projector formed by cascading an input fully connected layer and a hidden fully connected layer, where the input fully connected layer is 128-dimensional and the hidden fully connected layer is 128-dimensional, with no activation function between the layers;
5.4) Cascade the encoder with the decoder and the projector respectively to form the feature extraction model of the spatial transcriptome data;
5.5) Let the loss function L of the feature extraction model be a weighted sum of the contrast loss and the reconstruction loss, expressed as follows:

L = λ_con·L_con + λ_recon·L_recon,

where λ_con = 1 and λ_recon = 0.01 are the hyper-parameters weighting the contrast loss and the reconstruction loss respectively, L_recon denotes the reconstruction loss, and L_con denotes the contrast loss.
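A sketch of the model of steps 5.1)-5.4) using PyTorch and PyTorch Geometric's GCNConv is given below; the class name is our own, and the graph is assumed to be supplied in COO edge_index form rather than as a dense adjacency matrix:

```python
# Hedged sketch of the encoder / decoder / projector of step 5 (PyTorch Geometric assumed).
import torch.nn as nn
from torch_geometric.nn import GCNConv

class FeatureExtractor(nn.Module):
    def __init__(self, in_dim: int = 300):
        super().__init__()
        self.gc1 = GCNConv(in_dim, 256)       # encoder f(.): 300 -> 256 -> 128 GCN layers
        self.gc2 = GCNConv(256, 128)
        self.act = nn.PReLU()                 # PReLU between the GCN layers (step 5.1))
        self.decoder = nn.Sequential(         # decoder h(.): 128 -> 128 -> 256 -> 300 (step 5.2))
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, in_dim))
        self.projector = nn.Sequential(       # projector g(.): two 128-d fully connected
            nn.Linear(128, 128),              # layers with no activation in between (step 5.3))
            nn.Linear(128, 128))

    def forward(self, x, edge_index):
        z = self.gc2(self.act(self.gc1(x, edge_index)), edge_index)  # low-dimensional embedding Z
        return z, self.decoder(z), self.projector(z)                 # Z, X̂, Z'
```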
Step 6, training the feature extraction model of the spatial transcriptome data.
6.1) Extract by the encoder f(·) the node low-dimensional embedding Z_1 of G_1(A_1, X_1) and the node low-dimensional embedding Z_2 of G_2(B_1, X_2):

Z_1 = f(X_1, A_1) = GC_{k+1}(GC_k(X_1, A_1), A_1)
Z_2 = f(X_2, B_1) = GC_{k+1}(GC_k(X_2, B_1), B_1),

where GC_k(·) denotes the k-th GCN layer of the encoder, X_1 and A_1 are respectively the node feature matrix and adjacency matrix of the spatial neighborhood network, X_2 and B_1 are respectively the node feature matrix and adjacency matrix of the gene expression similarity network, and k = 1;
6.2) Use the two low-dimensional embeddings Z_1 and Z_2 obtained in step 6.1) as inputs of the decoder h(·) respectively, obtaining the reconstructed gene expression feature matrix X̂_1 of G_1(A_1, X_1) and the reconstructed gene expression feature matrix X̂_2 of G_2(B_1, X_2):

X̂_1 = h(Z_1)
X̂_2 = h(Z_2);
6.3) Input the two low-dimensional embeddings Z_1 and Z_2 generated in step 6.1) into the projector g(·) respectively, obtaining the low-dimensional embedding Z'_1 of Z_1 for the contrast loss and the low-dimensional embedding Z'_2 of Z_2 for the contrast loss:

Z'_1 = g(Z_1)
Z'_2 = g(Z_2);
6.4) Calculate from the result of step 6.3) the contrast loss l(z'_1i, z'_2i) between each node i and the other nodes k:

l(z'_1i, z'_2i) = -log [ e^(θ(z'_1i, z'_2i)/τ) / ( e^(θ(z'_1i, z'_2i)/τ) + Σ_{k≠i} e^(θ(z'_1i, z'_2k)/τ) + Σ_{k≠i} e^(θ(z'_1i, z'_1k)/τ) ) ],

where θ(·) denotes the cosine similarity, τ is a given hyper-parameter; z'_1i is the vector of the i-th row of Z'_1, representing the output of the projector for pixel point i when G_1(A_1, X_1) is the input; z'_2i is the vector of the i-th row of Z'_2, representing the output of the projector for pixel point i when G_2(B_1, X_2) is the input; i, k = 1, …, n, and n = 3639 is the number of all pixel points in the tissue slice;
6.5) Calculate the contrast loss L_con of the whole model from the contrast loss of each node obtained in step 6.4):

L_con = (1/(2n)) Σ_{i=1}^{n} [ l(z'_1i, z'_2i) + l(z'_2i, z'_1i) ];
6.6) Calculate the reconstruction loss L_recon from the reconstructed gene feature matrices X̂_1 and X̂_2 generated in step 6.2):

L_recon = (1/n) Σ_{i=1}^{n} ( ||x_i - x̂_1i||^2 + ||x_i - x̂_2i||^2 ),

where x_i is the vector of the i-th row of X, representing the gene feature of pixel point i; x̂_1i is the vector of the i-th row of X̂_1, representing the gene feature of pixel point i reconstructed from G_1(A_1, X_1); x̂_2i is the vector of the i-th row of X̂_2, representing the gene feature of pixel point i reconstructed from G_2(B_1, X_2);
6.7) Calculate the loss function L from the contrast loss L_con and the reconstruction loss L_recon:

L = λ_con·L_con + λ_recon·L_recon;

6.8) Update the network parameters of the encoder and the decoder according to the loss function L obtained in step 6.7) until the loss function L converges, obtaining the trained spatial transcriptome feature extraction model.
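Under the same assumptions, the objective of steps 6.4)-6.8) can be sketched as follows; the contrastive term implements the per-node loss of step 6.4) with its inter-view and intra-view negative pairs, and mean squared error stands in for the reconstruction loss of step 6.6):

```python
# Hedged sketch of the contrast loss, the reconstruction loss and one training step.
import torch
import torch.nn.functional as F

def contrast_loss(z1p: torch.Tensor, z2p: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Symmetric cross-view contrastive loss over the projected embeddings Z'_1, Z'_2."""
    def one_side(a, b):
        sim_ab = torch.exp(F.cosine_similarity(a.unsqueeze(1), b.unsqueeze(0), dim=2) / tau)
        sim_aa = torch.exp(F.cosine_similarity(a.unsqueeze(1), a.unsqueeze(0), dim=2) / tau)
        pos = sim_ab.diag()                                    # positive pairs θ(z'_1i, z'_2i)
        denom = sim_ab.sum(1) + sim_aa.sum(1) - sim_aa.diag()  # plus the negatives over k ≠ i
        return -torch.log(pos / denom).mean()
    return 0.5 * (one_side(z1p, z2p) + one_side(z2p, z1p))

def train_step(model, opt, x1, ei1, x2, ei2, x_target,
               lam_con: float = 1.0, lam_recon: float = 0.01) -> float:
    z1, xhat1, z1p = model(x1, ei1)   # view 1: enhanced spatial neighborhood network
    z2, xhat2, z2p = model(x2, ei2)   # view 2: enhanced gene expression similarity network
    loss = (lam_con * contrast_loss(z1p, z2p)
            + lam_recon * (F.mse_loss(xhat1, x_target) + F.mse_loss(xhat2, x_target)))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```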
Step 7, input the adjacency matrix A and node attribute feature matrix X of the spatial neighborhood network without data enhancement into the spatial transcriptome feature extraction model trained in step 6, obtaining the combined low-dimensional embedding Z containing spatial information and gene expression.
Step 8, clustering the combined low-dimensional embedding obtained in step 7 using the Leiden clustering algorithm.
8.1) Calculate the neighbors of each pixel point from the combined low-dimensional embedding Z extracted in step 7, construct a neighborhood graph, and store the neighborhood label l';
8.2) Reduce the dimension of the combined low-dimensional embedding Z by the UMAP algorithm, obtaining a reduced embedding Z';
8.3) Obtain the cluster label l by the Leiden algorithm from the neighborhood label l' of step 8.1) and the reduced embedding Z' of step 8.2);
8.4) Perform UMAP visualization of the cluster label l and the low-dimensional embedding Z', and stain each pixel point on the tissue slice according to its cluster label l; pixel points with the same color are regarded as one domain, realizing the identification of the spatial domain.
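A sketch of step 8 with Scanpy, assuming the combined low-dimensional embedding Z has been attached to the AnnData object from step 1; the key names are illustrative:

```python
# Hedged sketch of step 8: neighborhood graph, UMAP and Leiden clustering on the embedding Z.
import scanpy as sc

adata.obsm["Z_embed"] = Z                         # combined low-dimensional embedding from step 7
sc.pp.neighbors(adata, use_rep="Z_embed")         # 8.1) neighborhood graph over the embedding
sc.tl.umap(adata)                                 # 8.2) UMAP dimension reduction
sc.tl.leiden(adata, key_added="spatial_domain")   # 8.3) Leiden cluster labels
sc.pl.spatial(adata, color="spatial_domain")      # 8.4) stain pixel points by cluster label
sc.pl.umap(adata, color="spatial_domain")         #      UMAP visualization of the clusters
```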
The technical effects of the present invention will be described below in connection with simulation experiments.
First, simulation conditions:
the computer hardware of the simulation experiment is an Intel Core(TM) i7-8700 CPU with 32 GB of memory;
computer software: a Python 3.8 integrated development environment on the WINDOWS 10 system.
Second, simulation content:
simulation 1: the spatial clustering was performed with the present invention and the existing 6 methods SEDR, STAGATE, deepST, scanpy, stlearn, spaGCN on a dataset generated by two spatial transcriptome sequencing means, namely a spatial transcriptome dataset based on 12 slices of 10x visual human dorsal lateral prefrontal cortex layer DLPFC and a spatial transcriptome dataset based on imaged STARmap mouse visual cortex, and the results were as shown in table 1 using the adjusted rand index ARI as an evaluation index for evaluating the spatial clustering results of each method:
table 1 evaluation of the invention and the 6 existing methods in a tagged dataset
[Table 1 is provided as an image in the original publication and is not reproduced here.]
The existing 6 spatial domain identification methods are as follows:
SEDR: Ling S, Huazhu F, et al. Unsupervised Spatially Embedded Deep Representation of Spatial Transcriptomics[J]. bioRxiv, 2021.
STAGATE: Dong K, Zhang S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder[J]. Nature Communications, 2022, 13(1): 1-12.
DeepST: Xu C, Jin X, Wei S, et al. DeepST: identifying spatial domains in spatial transcriptomics by deep learning[J]. Nucleic Acids Research, 2022.
Scanpy: Wolf F A, Angerer P, Theis F J. SCANPY: large-scale single-cell gene expression data analysis[J]. Genome Biology, 2018, 19(1): 1-5.
stLearn: Pham D, Tan X, Xu J, et al. stLearn: integrating spatial location, tissue morphology and gene expression to find cell types, cell-cell interactions and spatial trajectories within undissociated tissues[J]. bioRxiv, 2020.
SpaGCN: Li M, Hu J, Li X, et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network[J]. Nature Methods, 2021, 18(10): 1342-1351.
As can be seen from Table 1, the present invention achieves better results than the other methods on the 12 DLPFC datasets of 10x Visium, with a higher mean value than the other methods. On the STARmap mouse visual cortex dataset, the performance of the invention and of STAGATE is significantly higher than that of the other methods, and the invention is more accurate than STAGATE. These simulation results show that the invention maintains high accuracy on both in situ sequencing-based and imaging-based datasets and has good generalization capability.
Simulation 2: spatial clustering was performed with the present invention and 3 existing methods, DeepST, SEDR and STAGATE, on the 10x Visium mouse brain slice and human breast cancer spatial transcriptome datasets, using the Silhouette Coefficient score and the Davies-Bouldin score as evaluation indicators of the spatial clustering result of each method; the results are shown in Table 2:
table 2 evaluation of the invention and the prior 3 methods in unlabeled dataset
[Table 2 is provided as an image in the original publication and is not reproduced here.]
As can be seen from Table 2, on the spatial transcriptome dataset of the 10x Visium mouse brain, the performance of the invention and of STAGATE is significantly higher than that of the other methods, with the invention scoring slightly higher than STAGATE. On the spatial transcriptome dataset of 10x Visium human breast cancer, the invention has a clear advantage over the other methods. These simulation results show that on multiple unlabeled datasets requiring fine-grained recognition, the clustering result of the invention is better and the biological meaning in the original samples is better retained.
Simulation 3: the present invention and two existing methods, DeepST and STAGATE, were used to identify spatial domains by spatial clustering on the 10x Visium mouse brain coronal-plane dataset, and each pixel point was stained with its clustering result on the histological section; the results are shown in FIG. 4, where FIG. 4(a) shows the spatial clustering visualization of the present invention, FIG. 4(b) shows that of STAGATE, and FIG. 4(c) shows that of DeepST.
As can be seen from FIG. 4, the existing methods STAGATE and DeepST cannot accurately identify the spatial domains on the mouse brain coronal slice and cannot clearly show the differences between domains, especially in the hippocampal region of this dataset, whereas the spatial clustering result of the present invention accords better with biological meaning. These simulation results show that the features extracted by the invention do not overfit the gene expression profile, which improves the accuracy and robustness of spatial domain identification.

Claims (12)

1. A spatial domain identification method based on spatial transcriptomics data feature extraction, characterized by comprising the following steps:
(1) Simultaneously measuring the gene expression value and the spatial position coordinate of each pixel point in the required tissue slice by using a spatial transcriptome sequencing technology, obtaining spatial transcriptome data comprising a pixel point-gene expression matrix and the spatial position of each pixel point in the tissue slice;
(2) Preprocessing the gene expression matrix of the spatial transcriptome data:
(2a) Deleting genes that are expressed in fewer than three pixel points in the spatial transcriptome data;
(2b) Normalizing the filtered data numerically so that the total count of each cell equals the median of all cells, applying a logarithmic transformation to the normalized data, and standardizing it to zero mean and unit variance;
(2c) Performing Principal Component Analysis (PCA) on the standardized data, extracting the first n principal components, and generating a feature matrix X of gene expression;
(3) Constructing a spatial neighborhood network:
(3a) Calculating the Euclidean distance d between each pair of pixel points in the tissue slice based on the spatial coordinate information;
(3b) Selecting the first k nearest neighbors of each pixel point based on the Euclidean distance d calculated from the spatial coordinates, and constructing the adjacency matrix A representing the spatial information;
(3c) Taking the gene expression feature matrix X generated in step (2) as the node attribute feature matrix;
(3d) Based on the adjacency matrix A representing the spatial information and the node attribute feature matrix X, forming the spatial neighborhood network G_1(A, X);
(4) Constructing a gene expression similarity network:
(4a) Calculating the Euclidean distance d' between the gene expression values of each pair of pixel points in the tissue slice based on the gene expression feature matrix X generated in step (2);
(4b) Based on the Euclidean distance d' calculated from the gene expression values, selecting the first k nearest neighbors of each pixel point and constructing the adjacency matrix B representing the gene expression similarity;
(4c) Based on the adjacency matrix B representing the gene expression similarity and the node attribute feature matrix X, forming the gene expression similarity network G_2(B, X);
(5) Data enhancement:
(5a) Masking the edges and node attribute features in the spatial neighborhood network according to a given Bernoulli-distributed edge mask probability p_r and node feature mask probability p_m, obtaining the enhanced spatial neighborhood network G_1(A_1, X_1);
(5b) Masking the edges and node attribute features in the gene expression similarity network according to a given Bernoulli-distributed edge mask probability p_r' and node feature mask probability p_m', obtaining the enhanced gene expression similarity network G_2(B_1, X_2);
(6) Constructing a feature extraction model of spatial transcriptome data consisting of an encoder f(·) cascaded respectively with a decoder h(·) and a projector g(·), and using the contrast loss L_con and the reconstruction loss L_recon as the loss function L;
(7) Training a feature extraction model of the spatial transcriptome data:
(7a) Inputting the adjacency matrix A_1 and node attribute feature matrix X_1 of the enhanced spatial neighborhood network G_1(A_1, X_1) and the adjacency matrix B_1 and node attribute feature matrix X_2 of the enhanced gene expression similarity network G_2(B_1, X_2) into the spatial transcriptome feature extraction model, generating the low-dimensional embeddings Z_1 and Z_2 by the encoder and the reconstructed gene expression feature matrices X̂_1 and X̂_2 by the decoder;
(7b) Computing the contrast loss of the low-dimensional embeddings Z_1 and Z_2 and the reconstruction loss between the reconstructed gene expression feature matrices X̂_1, X̂_2 and the node attribute feature matrix X, and updating the network parameters of the encoder and the decoder according to the calculated losses until the loss function L converges, obtaining the trained spatial transcriptome feature extraction model;
(8) Inputting the adjacency matrix A and node attribute feature matrix X of the spatial neighborhood network without data enhancement into the spatial transcriptome feature extraction model trained in step (7b), obtaining a combined low-dimensional embedding Z containing spatial information and gene expression;
(9) Clustering the obtained combined low-dimensional embedding Z using the Leiden clustering algorithm, obtaining regions of consistent gene expression on the tissue slice, i.e., spatial domains.
2. The method of claim 1, wherein the gene expression feature matrix X generated in step (2c) is expressed as follows:

X = [x_1; x_2; …; x_i; …; x_n]^T,

where [;] denotes the concatenation operation, x_i is the gene feature vector of pixel point i, i = 1, …, n, n is the number of all pixel points in the tissue slice, and T denotes the transpose.
3. The method of claim 1, wherein the Euclidean distance d between the spatial positions of the pixel points in the tissue slice is calculated in step (3a) as follows:

d_ij = sqrt((a_i - a_j)^2 + (b_i - b_j)^2),

where (a_i, b_i) and (a_j, b_j) are the spatial coordinates of pixel point i and pixel point j on the tissue slice, respectively.
4. The method of claim 1, wherein the adjacency matrix A of the spatial neighborhood network constructed in step (3b) is expressed as follows:

A = [a_ij] ∈ {0,1}^(n×n), with a_ij = 1 if node i is among the first k nearest neighbors of node j computed from the spatial coordinates, and a_ij = 0 otherwise,

where a_ij is the element in row i and column j of the adjacency matrix A of the spatial neighborhood network, i and j denote two nodes in the spatial neighborhood network, i, j = 1, …, n, and n is the number of nodes contained in the spatial neighborhood network.
5. The method of claim 1, wherein the Euclidean distance d' between the gene expression values of the pixel points in the tissue slice is calculated in step (4a) as follows:

d'_ij = sqrt(Σ_{k=1}^{m} (x_ik - x_jk)^2),

where x_ik and x_jk are the values of the k-th dimension of the gene expression feature vectors of pixel point i and pixel point j respectively, k = 1, …, m, and m is the dimension of each pixel point's gene expression feature vector.
6. The method of claim 1, wherein the adjacency matrix B of the gene expression similarity network constructed in step (4b) is expressed as follows:

B = [b_ij] ∈ {0,1}^(n×n), with b_ij = 1 if node i is among the first k nearest neighbors of node j computed from the gene expression matrix, and b_ij = 0 otherwise,

where b_ij is the element in row i and column j of the adjacency matrix B of the gene expression similarity network, i and j denote two nodes in the gene expression similarity network, i, j = 1, …, n, and n is the number of nodes contained in the gene expression similarity network, the same as the number of nodes in the spatial neighborhood network.
7. The method of claim 1, wherein step (5a) masks the edge and node attribute features in the spatial neighborhood network G_1(A, X) according to probability as follows:
(5a1) According to each element a_ij in the adjacency matrix A of the spatial neighborhood network, sampling an edge mask matrix R̃ ∈ {0,1}^(N×N) from a Bernoulli distribution, expressed as follows:

R̃_ij ~ B(1 - p_r) if a_ij = 1, and R̃_ij = 0 if a_ij = 0,

where R̃_ij is the element in row i and column j of the edge mask matrix R̃, p_r is the probability that each edge in the spatial neighborhood network is deleted, i and j denote two nodes in the spatial neighborhood network, i, j = 1, …, n, and n is the number of nodes contained in the spatial neighborhood network;
(5a2) Multiplying the adjacency matrix A of the spatial neighborhood network element-wise with the mask matrix R̃ sampled in (5a1) to obtain the enhanced adjacency matrix A_1:

A_1 = A ⊙ R̃,

where the operator ⊙ denotes element-wise multiplication of the adjacency matrix A of the spatial neighborhood network and the sampled mask matrix R̃, a_ij is the element in row i and column j of A, and R̃_ij is the element in row i and column j of R̃;
(5a3) Sampling a random vector from the Bernoulli distribution B(1 - p_m) to generate a node feature mask vector m̃ ∈ {0,1}^m with the same dimension as the gene feature vector, where p_m is the probability that each value in a node feature vector of the spatial neighborhood network is deleted;
(5a4) Multiplying the node attribute feature matrix X of the spatial neighborhood network element-wise with the node feature mask vector m̃ generated in (5a3) to obtain the enhanced node attribute feature matrix X_1:

X_1 = [x_1 ⊙ m̃; x_2 ⊙ m̃; …; x_n ⊙ m̃]^T,

where [;] denotes the concatenation operation, and the i-th row of X_1 represents the gene feature vector at node i in the spatial neighborhood network.
8. The method of claim 1, wherein step (5b) masks the edge and node attribute features in the gene expression similarity network G_2(B, X) according to probability as follows:
(5b1) According to each element b_ij in the adjacency matrix B of the gene expression similarity network, sampling an edge mask matrix R ∈ {0,1}^(N×N) from a Bernoulli distribution, expressed as follows:

R_ij ~ B(1 - p_r') if b_ij = 1, and R_ij = 0 if b_ij = 0,

where R_ij is the element in row i and column j of the edge mask matrix R, p_r' is the probability that each edge in the gene expression similarity network is deleted, i and j denote two nodes in the gene expression similarity network, i, j = 1, …, n, and n is the number of nodes contained in the gene expression similarity network;
(5b2) Multiplying the adjacency matrix B of the gene expression similarity network element-wise with the mask matrix R sampled in (5b1) to obtain the adjacency matrix B_1 of the enhanced gene expression similarity network:

B_1 = B ⊙ R,

where the operator ⊙ denotes element-wise multiplication of the adjacency matrix B of the gene expression similarity network and the sampled mask matrix R, b_ij is the element in row i and column j of B, and R_ij is the element in row i and column j of R;
(5b3) Sampling a random vector from the Bernoulli distribution B(1 - p_m') to generate a node feature mask vector m with the same dimension as the gene feature vector, where p_m' is the probability that each value in a node feature vector of the gene expression similarity network is deleted;
(5b4) Multiplying the node attribute feature matrix X of the gene expression similarity network element-wise with the node feature mask vector m generated in (5b3) to obtain the node attribute feature matrix X_2 of the enhanced gene expression similarity network:

X_2 = [x_1 ⊙ m; x_2 ⊙ m; …; x_n ⊙ m]^T,

where [;] denotes the concatenation operation, and the i-th row of X_2 represents the gene feature vector at node i in the gene expression similarity network.
9. The method of claim 1, wherein the encoder, decoder and projector parameters and the loss function of the spatial transcriptome data feature extraction model constructed in step (6) are as follows:
the encoder f(·) is formed by cascading an input graph convolutional network GCN layer and two hidden GCN layers, where the input dimension is the number of transcriptome gene features, the first hidden GCN layer is 256-dimensional, the second hidden GCN layer is 128-dimensional, and a PReLU function is used as the activation function between GCN layers;
the decoder h(·) consists of an input fully connected layer and three hidden fully connected layers in cascade, where the input fully connected layer is 128-dimensional, the first hidden fully connected layer is 128-dimensional, the second hidden fully connected layer is 256-dimensional, the third hidden fully connected layer has the number of transcriptome gene features as its dimension, and a ReLU function is used as the activation function between fully connected layers;
the projector g(·) consists of an input fully connected layer and a hidden fully connected layer in cascade, where the input fully connected layer is 128-dimensional and the hidden fully connected layer is 128-dimensional, with no activation function between the layers;
the loss function, a weighted sum of the contrast loss and the reconstruction loss, is expressed as:

L = λ_con·L_con + λ_recon·L_recon,

where λ_con and λ_recon are hyper-parameters weighting the contrast loss and the reconstruction loss, L_recon denotes the reconstruction loss, and L_con denotes the contrast loss.
10. The method of claim 1, wherein step (7a) generates the low-dimensional embeddings by the encoder and the reconstructed gene expression feature matrices by the decoder as follows:
(7a1) The encoder f(·) extracts the node low-dimensional embedding Z_1 of G_1(A_1, X_1) and the node low-dimensional embedding Z_2 of G_2(B_1, X_2):

Z_1 = f(X_1, A_1) = GC_{k+1}(GC_k(X_1, A_1), A_1)
Z_2 = f(X_2, B_1) = GC_{k+1}(GC_k(X_2, B_1), B_1),

where GC_k(·) denotes the k-th GCN layer of the encoder, X_1 and A_1 are respectively the node feature matrix and adjacency matrix of the spatial neighborhood network, X_2 and B_1 are respectively the node feature matrix and adjacency matrix of the gene expression similarity network, and k = 1;
(7a2) The two low-dimensional embeddings Z_1 and Z_2 obtained in (7a1) are used as inputs of the decoder h(·) respectively, obtaining the reconstructed gene expression feature matrix X̂_1 of G_1(A_1, X_1) and the reconstructed gene expression feature matrix X̂_2 of G_2(B_1, X_2):

X̂_1 = h(Z_1)
X̂_2 = h(Z_2).
11. The method of claim 1, wherein the contrast loss and the reconstruction loss in step (7b) are calculated as follows:
(7b1) Inputting the low-dimensional embeddings generated in step (7a) into the projector g(·) respectively, obtaining the low-dimensional embeddings Z'_1 and Z'_2 of Z_1 and Z_2 for the contrast loss:

Z'_1 = g(Z_1)
Z'_2 = g(Z_2);
(7b2) Calculating from the result of (7b1) the contrast loss l(z'_1i, z'_2i) between each node i and the other nodes k:

l(z'_1i, z'_2i) = -log [ e^(θ(z'_1i, z'_2i)/τ) / ( e^(θ(z'_1i, z'_2i)/τ) + Σ_{k≠i} e^(θ(z'_1i, z'_2k)/τ) + Σ_{k≠i} e^(θ(z'_1i, z'_1k)/τ) ) ],

where θ(·) denotes the cosine similarity, τ is a given hyper-parameter; z'_1i is the vector of the i-th row of Z'_1, representing the output of the projector for pixel point i when G_1(A_1, X_1) is the input; z'_2i is the vector of the i-th row of Z'_2, representing the output of the projector for pixel point i when G_2(B_1, X_2) is the input; i, k = 1, …, n, and n is the number of all pixel points in the tissue slice;
(7b3) Calculating the contrast loss L_con of the whole model from the contrast loss of each node obtained in (7b2):

L_con = (1/(2n)) Σ_{i=1}^{n} [ l(z'_1i, z'_2i) + l(z'_2i, z'_1i) ];
(7b4) Calculating the reconstruction loss L_recon from the reconstructed gene feature matrices X̂_1 and X̂_2 generated in step (7a):

L_recon = (1/n) Σ_{i=1}^{n} ( ||x_i - x̂_1i||^2 + ||x_i - x̂_2i||^2 ),

where x_i is the vector of the i-th row of X, representing the gene feature of pixel point i; x̂_1i is the vector of the i-th row of X̂_1, representing the gene feature of pixel point i reconstructed from G_1(A_1, X_1); x̂_2i is the vector of the i-th row of X̂_2, representing the gene feature of pixel point i reconstructed from G_2(B_1, X_2);
(7b5) Calculating the loss function L from the contrast loss L_con and the reconstruction loss L_recon:

L = λ_con·L_con + λ_recon·L_recon,

where λ_con and λ_recon are hyper-parameters weighting the contrast loss and the reconstruction loss, respectively.
12. The method of claim 1, wherein step (9) uses Leiden clustering algorithm to cluster the combined low-dimensional embedding Z obtained after training, implemented as follows:
(9a) Calculating the neighbors of each pixel point from the combined low-dimensional embedding Z obtained in step (8), constructing a neighborhood graph, and storing the neighborhood label l';
(9b) Reducing the dimension of the combined low-dimensional embedding Z by the UMAP algorithm to obtain a reduced embedding Z';
(9c) Obtaining the cluster label l by the Leiden algorithm from the neighborhood label l' of step (9a) and the reduced embedding Z' of step (9b);
(9d) Performing UMAP visualization of the cluster label l and the low-dimensional embedding Z', and staining each pixel point on the tissue slice according to its cluster label l; pixel points with the same color are regarded as one domain, realizing the identification of the spatial domain.
CN202310097081.0A 2023-02-10 2023-02-10 Spatial domain identification method based on spatial transcriptomics data feature extraction Pending CN116189785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310097081.0A CN116189785A (en) 2023-02-10 2023-02-10 Spatial domain identification method based on spatial transcriptomics data feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310097081.0A CN116189785A (en) 2023-02-10 2023-02-10 Spatial domain identification method based on spatial transcriptomics data feature extraction

Publications (1)

Publication Number Publication Date
CN116189785A true CN116189785A (en) 2023-05-30

Family

ID=86435993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310097081.0A Pending CN116189785A (en) 2023-02-10 2023-02-10 Spatial domain identification method based on spatial transcriptomics data feature extraction

Country Status (1)

Country Link
CN (1) CN116189785A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118016149A (en) * 2024-04-09 2024-05-10 太原理工大学 Spatial domain identification method for integrating space transcriptome multi-mode information

Similar Documents

Publication Publication Date Title
Xue et al. An application of transfer learning and ensemble learning techniques for cervical histopathology image classification
CN113706487A (en) Multi-organ segmentation method based on self-supervision characteristic small sample learning
Hou Breast cancer pathological image classification based on deep learning
Xu et al. Computerized spermatogenesis staging (CSS) of mouse testis sections via quantitative histomorphological analysis
Shubham et al. Identify glomeruli in human kidney tissue images using a deep learning approach
WO2021073279A1 (en) Staining normalization method and system for digital pathological image, electronic device and storage medium
CN115497623A (en) Lung cancer prognosis prediction system based on image, pathology and gene multiomics
Yu et al. A recognition method of soybean leaf diseases based on an improved deep learning model
Liao et al. A segmentation method for lung parenchyma image sequences based on superpixels and a self-generating neural forest
Shallu et al. Automatic magnification independent classification of breast cancer tissue in histological images using deep convolutional neural network
Routray et al. Ensemble Learning with Symbiotic Organism Search Optimization Algorithm for Breast Cancer Classification & Risk Identification of Other Organs on Histopathological Images
CN117253550A (en) Spatial transcriptome data clustering method
CN116189785A (en) Spatial domain identification method based on spatial transcriptomics data feature extraction
Hacking et al. Deep learning for the classification of medical kidney disease: a pilot study for electron microscopy
CN117036894B (en) Multi-mode data classification method and device based on deep learning and computer equipment
Yu et al. Pyramid multi-loss vision transformer for thyroid cancer classification using cytological smear
Ke et al. Mine local homogeneous representation by interaction information clustering with unsupervised learning in histopathology images
Taheri et al. A Comprehensive Study on Classification of Breast Cancer Histopathological Images: Binary Versus Multi-Category and Magnification-Specific Versus Magnification-Independent
Liu et al. TSDLPP: a novel two-stage deep learning framework for prognosis prediction based on whole slide histopathological images
Martin et al. A graph based neural network approach to immune profiling of multiplexed tissue samples
Su et al. Whole slide cervical image classification based on convolutional neural network and random forest
Chen et al. MSCCNet: Multi-Scale Convolution-Capsule Network for Cervical Cell Classification
Yuan [Retracted] Image Processing Method Based on FGCA and Artificial Neural Network
Li et al. BIS5k: a large-scale dataset for medical segmentation task based on HE-staining images of breast cancer
Qi et al. Discrete Wavelet Transform-Based CNN for Breast Cancer Classification from Histopathology Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination