CN116705151A

CN116705151A - Dimension reduction method and system for spatial transcriptome data

Info

Publication number: CN116705151A
Application number: CN202310674214.6A
Authority: CN
Inventors: 刘瑾
Original assignee: Chinese University of Hong Kong Shenzhen
Current assignee: Chinese University of Hong Kong Shenzhen
Priority date: 2023-06-08
Filing date: 2023-06-08
Publication date: 2023-09-05

Abstract

The application discloses a dimension reduction method and system for spatial transcriptome data, belonging to the technical field of data analysis, which can solve the problem that existing dimension reduction methods have high computational complexity and cannot reduce the dimensions of a plurality of slices simultaneously. The method comprises: step 1, acquiring a gene expression matrix and a spatial coordinate matrix of each of a plurality of slices in spatial transcriptome data; step 2, determining initial dimension reduction parameters and the spatial dependence among spots in each slice according to the gene expression matrix and the spatial coordinate matrix; step 3, constructing a dimension reduction model according to the gene expression matrix, the initial dimension reduction parameters and the spatial dependence among the spots in each slice; and step 4, estimating parameters of the dimension reduction model by using an iterative optimization algorithm to obtain the final low-dimensional representation of each slice. The method is used for reducing the dimension of spatial transcriptome data.

Description

Dimension reduction method and system for spatial transcriptome data

Technical Field

The application relates to a dimension reduction method and system for spatial transcriptome data, and belongs to the technical field of data analysis.

Background

Many dimension reduction methods are available in the existing literature; they are mainly divided into parametric and non-parametric methods. Parametric dimension reduction methods include principal component analysis (PCA) and zero-inflated factor analysis (ZIFA); non-parametric dimension reduction methods include t-distributed stochastic neighbor embedding (tSNE), uniform manifold approximation and projection (UMAP), and single-cell variational inference (scVI). PCA is the most common dimension reduction method and is used in many software pipelines; tSNE and UMAP are two non-parametric dimension reduction methods widely used for visualization; scVI is a deep generative representation learning method for the analysis of single-cell RNA sequencing data. None of these methods, however, takes into account the spatial characteristics of spatially resolved transcriptomics (SRT) data when generating the low-dimensional embeddings.

In SRT data analysis, gene expression at adjacent spatial locations tends to be similar because of the shared cellular microenvironment. The recently proposed SpatialPCA and non-negative spatial factorization (NSF) use Gaussian kernels and spatial-location-based Gaussian process priors, respectively, to achieve spatially aware dimension reduction. To reduce the computational burden, SpatialPCA uses a low-rank approximation strategy with $O(n^2)$ computational complexity, where n is the number of spatial locations. NSF uses a sparse Gaussian process whose computational complexity is proportional to the cube of the number of inducing points, but it is not applicable to spatial locations from multiple slices. As the spatial resolution of SRT technologies increases, the number of spatial locations has grown significantly, and multiple slices are required to generate a spatial atlas or to recover a spatiotemporal transcriptomic atlas of an entire organ. However, existing dimension reduction methods have high computational complexity and cannot perform spatial dimension reduction on a plurality of slices simultaneously.

Disclosure of Invention

The application provides a dimension reduction method and system for spatial transcriptome data, which can solve the problem that existing dimension reduction methods have high computational complexity and cannot perform spatial dimension reduction on a plurality of slices simultaneously.

In one aspect, the present application provides a method for dimension reduction of spatial transcriptome data, the method comprising:

step 1, acquiring a gene expression matrix and a spatial coordinate matrix of each of a plurality of slices in spatial transcriptome data;

step 2, determining initial dimension reduction parameters and the spatial dependence among spots in each slice according to the gene expression matrix and the spatial coordinate matrix;

step 3, constructing a dimension reduction model according to the gene expression matrix, the initial dimension reduction parameters and the spatial dependence among the spots in each slice;

step 4, estimating parameters of the dimension reduction model by using an iterative optimization algorithm to obtain the final low-dimensional representation of each slice.

Optionally, the step 2 specifically includes:

step 21, determining initial dimension reduction parameters and initial low-dimension representation of each slice by using a principal component analysis method according to the gene expression matrix;

step 22, determining the spatial dependency between the spots in each slice according to the spatial coordinate matrix and the initial low-dimensional representation of each slice.

Optionally, the spatial dependencies between the spots in the slice are specifically:

given the low-dimensional representations of all spots in the slice other than the target spot, the low-dimensional representation of the target spot follows a normal distribution; the mean of the normal distribution is the mean of the low-dimensional representations of the neighbors of the target spot.

Optionally, the step 1 specifically includes:

step 11, obtaining the hypervariable genes and the spatial coordinate matrix of each slice in the spatial transcriptome data to be analyzed, and integrating the hypervariable genes of the slices to obtain integrated hypervariable genes;

and step 12, obtaining a gene expression matrix of each slice according to the integrated hypervariable genes.

Optionally, the step 12 specifically includes:

step 121, extracting a count matrix on the integrated hypervariable gene of each slice;

step 122, regularizing the count matrix of each slice to obtain the gene expression matrix of each slice.
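As an illustration of step 122, logarithmic regularization of a count matrix can be sketched as follows. This is a minimal sketch, not the application's implementation: the function name `log_normalize`, the list-of-rows layout (spots by genes) and the scale factor of 10000 are assumptions chosen for the example.

```python
import math

def log_normalize(counts, scale_factor=10000.0):
    """Log-normalize a spots-by-genes count matrix given as a list of rows.

    Each count is divided by the total count of its spot, multiplied by
    scale_factor (an assumed value), and passed through log(1 + x).
    """
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # A spot with no counts carries no information; keep it as zeros.
            normalized.append([0.0] * len(row))
            continue
        normalized.append([math.log1p(c / total * scale_factor) for c in row])
    return normalized

# Hypothetical toy count matrix: 3 spots, 3 genes.
counts = [[1, 0, 3], [0, 0, 0], [2, 2, 0]]
X = log_normalize(counts)
```

Dividing by the per-spot total removes differences in sequencing depth between spots before the logarithm compresses the dynamic range.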

Optionally, the dimension reduction model is:

the gene expression matrix of each slice is the sum of the sample-specific intercept term of that slice, the product of the sample-shared loading matrix and the low-dimensional representation of that slice, and the random noise term of that slice.

Optionally, the dimension reduction model specifically includes:

$$x_{mi} = \alpha_m + W v_{mi} + \epsilon_{mi};$$

wherein $x_{mi} \in R^{p}$ is the gene expression vector of the ith spot of the mth slice, $\alpha_m \in R^{p}$ is the sample-specific intercept term of the mth slice, $W \in R^{p \times q}$ is the sample-shared loading matrix, $v_{mi} \in R^{q}$ is the low-dimensional representation of the ith spot of the mth slice, and $\epsilon_{mi} \sim N(0, \Lambda_m)$ is the normally distributed random noise of the ith spot of the mth slice, where $\Lambda_m = \mathrm{diag}(\lambda_{m1}, \ldots, \lambda_{mp})$ is a diagonal matrix.
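To make the roles of the symbols concrete, the following sketch draws one expression vector from the Gaussian model above. It is only an illustration under assumed toy values: the function name `sample_expression` and all numbers are hypothetical, with p = 3 genes and q = 2 latent dimensions.

```python
import random

def sample_expression(alpha, W, v, lam, rng):
    """Draw x = alpha + W v + eps, with eps_j ~ N(0, lam[j]) independently."""
    x = []
    for j, w_row in enumerate(W):
        mean_j = alpha[j] + sum(w * vk for w, vk in zip(w_row, v))
        # random.gauss takes a standard deviation, so use sqrt of the variance.
        x.append(mean_j + rng.gauss(0.0, lam[j] ** 0.5))
    return x

rng = random.Random(0)
alpha = [1.0, 0.0, -1.0]                    # intercept of one slice, p = 3
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # shared loading matrix, p x q
v = [0.5, -2.0]                             # low-dimensional representation of one spot
noiseless = sample_expression(alpha, W, v, [0.0, 0.0, 0.0], rng)
noisy = sample_expression(alpha, W, v, [0.1, 0.1, 0.1], rng)
```

With all noise variances set to zero the draw reduces to the deterministic part $\alpha_m + W v_{mi}$, which makes the contribution of each term easy to inspect.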

Optionally, the dimension reduction model is:

given the Poisson rate of a target spot in a slice, the gene expression of the target spot follows a Poisson distribution;

and the logarithm of the Poisson rate of the target spot is the sum of the sample-specific intercept term of the slice, the product of the sample-shared loading vector and the low-dimensional representation of the slice, and the random noise term of the slice.

Optionally, the dimension reduction model specifically includes:

$$x_{mij} \mid f_{mij} \sim \mathrm{Poisson}(a_{mi} f_{mij}),$$

$$\ln f_{mij} = \alpha_{mj} + w_j^{T} v_{mi} + \epsilon_{mij};$$

wherein $x_{mij}$ is the count expression level of gene j of the ith spot in slice m, $a_{mi}$ is the regularization factor of the ith spot in slice m, $w_j \in R^{q}$ is the sample-shared loading vector corresponding to gene j, $\alpha_{mj}$ is the sample-specific intercept term corresponding to gene j in slice m, $f_{mij}$ is the Poisson rate of gene j of the ith spot in slice m, $v_{mi} \in R^{q}$ is the low-dimensional representation of the ith spot of slice m, and $\epsilon_{mij} \sim N(0, \lambda_{mj})$ is the normally distributed random noise of gene j of the ith spot of slice m.
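The Poisson version can be illustrated numerically as follows. This sketch assumes that the regularization factor $a_{mi}$ scales the Poisson rate multiplicatively, so that the mean count is $a_{mi} f_{mij}$; the helper names `poisson_rate` and `poisson_log_pmf` and all values are hypothetical.

```python
import math

def poisson_rate(alpha_mj, w_j, v_mi, eps_mij=0.0):
    """Rate f_mij = exp(alpha_mj + w_j . v_mi + eps_mij) for one gene and spot."""
    linear = alpha_mj + sum(w * v for w, v in zip(w_j, v_mi)) + eps_mij
    return math.exp(linear)

def poisson_log_pmf(count, mean):
    """ln P(count | mean) for a Poisson variable with the given mean."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

# Toy values: the linear predictor is 0.0 + (0.5 - 0.5) + 0.0 = 0, so f = 1.
f = poisson_rate(alpha_mj=0.0, w_j=[1.0, -1.0], v_mi=[0.5, 0.5])
a_mi = 2.0                              # assumed regularization (size) factor
log_p = poisson_log_pmf(3, a_mi * f)    # log-probability of observing 3 counts
```

Working on the log scale with `math.lgamma` avoids overflow in the factorial for large counts.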

Optionally, the iterative optimization algorithm is an ICM-EM algorithm.

Optionally, the regularization process is logarithmic regularization or negative binomial regression regularization.

In another aspect, the present application provides a dimension reduction system for spatial transcriptome data, the system comprising:

a data acquisition module, used for acquiring a gene expression matrix and a spatial coordinate matrix of each of a plurality of slices in spatial transcriptome data;

a parameter determining module, used for determining initial dimension reduction parameters and the spatial dependence among spots in each slice according to the gene expression matrix and the spatial coordinate matrix;

a model construction module, used for constructing a dimension reduction model according to the gene expression matrix, the initial dimension reduction parameters and the spatial dependence among the spots in each slice;

and an iterative optimization module, used for estimating parameters of the dimension reduction model by using an iterative optimization algorithm to obtain the final low-dimensional representation of each slice.

The application has the beneficial effects that:

The dimension reduction method for spatial transcriptome data provided by the application, denoted ProFAST, can effectively estimate the low-dimensional representations of a plurality of slices, capturing intrinsic biological effects while accounting for the local expression similarity caused by a shared cellular microenvironment. Compared with the prior art, ProFAST allows simultaneous spatial dimension reduction of multiple slices while modeling the count nature of the data. Furthermore, ProFAST uses a conditional autoregressive model to capture local spatial dependence with $O(n)$ computational complexity, making it suitable for the analysis of multi-slice, high-resolution SRT data.

Drawings

FIG. 1 is a flow chart of a method for dimension reduction of spatial transcriptome data according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the dimension reduction principle for spatial transcriptome data according to an embodiment of the present application.

Detailed Description

The present application is described in detail below with reference to examples, but the present application is not limited to these examples.

The embodiment of the application provides a dimension reduction method for spatial transcriptome data, denoted ProFAST; as shown in FIG. 1 and FIG. 2, the method includes:

Step 1, acquiring a gene expression matrix and a spatial coordinate matrix of each of a plurality of slices in the spatial transcriptome data.

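Step 1 specifically comprises:

step 11, obtaining the hypervariable genes and the spatial coordinate matrix of each slice in the spatial transcriptome data to be analyzed, and integrating the hypervariable genes of the slices to obtain integrated hypervariable genes;

step 12, obtaining the gene expression matrix of each slice according to the integrated hypervariable genes.

Step 12 specifically comprises:

step 121, extracting the count matrix of each slice on the integrated hypervariable genes;

step 122, regularizing the count matrix of each slice to obtain the gene expression matrix of each slice.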

In the embodiment of the application, assuming that the number of slices is M, the gene expression matrix can be denoted as $X_m$ and the spatial coordinate matrix as $P_m$, $m = 1, \ldots, M$. The gene expression matrix $X_m$ is an $n_m \times p$ matrix, where $n_m$ is the sample size (number of spots) of the mth slice and p is the number of hypervariable genes. The gene expression matrix can be a UMI count matrix of gene expression or a regularized expression matrix. The regularization method may be logarithmic regularization, negative binomial regression regularization, or the like. The spatial coordinate matrix $P_m$ is an $n_m \times 2$ matrix whose ith row is the spatial coordinate of the ith spot in the mth slice. In the context of spatial transcriptomics, spots are the discrete locations in a tissue sample at which gene expression data are collected; they are typically identified by physical coordinates or by imaging techniques such as microscopy. From a practical point of view, a spot can be regarded as a cellular unit representing a single cell or a group of cells in the tissue.

By way of example, the data to be analyzed may include UMI count matrices and spatial coordinate information (i.e., the spatial coordinate matrices $P_m$) for a plurality of spatial transcriptome slices obtained from a genomic database (e.g., GEO). The hypervariable genes of each slice are selected using the FindSVGs function in the R package DR.SC or the FindVariableFeatures function in the R package Seurat. The hypervariable genes of the slices are then integrated according to an ordering rule to obtain the integrated hypervariable genes. The UMI counts on the integrated hypervariable genes are regularized (normalized) using the NormalizeData function or the SCTransform function in the R package Seurat to obtain the regularized gene expression matrix $X_m$.
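The integration of per-slice hypervariable genes can be sketched as below. The application does not spell out its ordering rule, so the rule used here is an assumption made purely for illustration: genes selected in more slices come first, ties are broken by the better (lower) mean rank, and then by gene name. The function name `integrate_hvgs` and the gene names are hypothetical.

```python
from collections import defaultdict

def integrate_hvgs(ranked_per_slice, n_keep):
    """Return n_keep genes chosen from per-slice ranked gene lists.

    Assumed ordering rule: number of slices selecting the gene (descending),
    then mean rank within those slices (ascending), then gene name.
    """
    ranks = defaultdict(list)
    for ranked in ranked_per_slice:
        for position, gene in enumerate(ranked):
            ranks[gene].append(position)
    ordered = sorted(
        ranks,
        key=lambda g: (-len(ranks[g]), sum(ranks[g]) / len(ranks[g]), g),
    )
    return ordered[:n_keep]

# Hypothetical ranked hypervariable genes for three slices (best first).
slice_hvgs = [
    ["GeneA", "GeneB", "GeneC"],
    ["GeneB", "GeneD", "GeneA"],
    ["GeneB", "GeneC", "GeneE"],
]
shared = integrate_hvgs(slice_hvgs, n_keep=3)
```

Using one shared gene list for all slices is what lets every slice's expression matrix have the same p columns, as the shared loading matrix requires.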

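Step 2 specifically comprises:

step 21, determining the initial dimension reduction parameters and an initial low-dimensional representation of each slice using the principal component analysis (PCA) method according to the gene expression matrix $X_m$;

step 22, determining the spatial dependence between the spots in each slice according to the spatial coordinate matrix $P_m$ and the initial low-dimensional representation of each slice.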

The spatial dependencies between the spots in the slice are specifically:

under the condition of the low-dimensional representation of all the spot samples outside the target spot in the given slice, the low-dimensional representation of the target spot obeys normal distribution; the average value of the normal distribution is the average value of the low-dimensional representation of the neighbor of the target spot.

Specifically, according to the gene expression matrix $X_m$ and the spatial coordinate matrix $P_m$, the initial dimension reduction parameters (i.e., $W$, $\Psi_m$, $\Lambda_m$) and an initial low-dimensional representation $V_m = (v_{m1}, \ldots, v_{mn_m})^{T}$ of each slice are determined using the principal component analysis (PCA) method, wherein $v_{mi}$ is the low-dimensional representation of the ith spot of the mth slice.
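A PCA-based initialization can be sketched for the simplest case q = 1 as follows. This is a hedged illustration rather than the application's procedure: it uses power iteration to find only the leading principal component, the function name `first_principal_component` is hypothetical, and how the rows of several slices are pooled before PCA is not specified in the text (here a single matrix is passed in).

```python
import math

def first_principal_component(X, n_iter=200):
    """Return (column means, unit loading vector, scores) for the top PC of X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # Deterministic, non-symmetric start so it is unlikely to be orthogonal
    # to the leading direction.
    start = [1.0 / (j + 1) for j in range(p)]
    norm = math.sqrt(sum(c * c for c in start))
    w = [c / norm for c in start]
    for _ in range(n_iter):
        scores = [sum(xj * wj for xj, wj in zip(row, w)) for row in Xc]
        # Apply Xc^T Xc to w without forming the p x p covariance matrix.
        w_new = [sum(s * row[j] for s, row in zip(scores, Xc)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w_new))
        if norm == 0.0:
            break  # no variance along the current direction
        w = [c / norm for c in w_new]
    scores = [sum(xj * wj for xj, wj in zip(row, w)) for row in Xc]
    return means, w, scores

# Toy matrix whose variation lies entirely along the direction (1, 1).
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
means, w, scores = first_principal_component(X)
```

The returned loading vector plays the role of an initial $W$ (a single column), the scores play the role of the initial $v_{mi}$, and the column means give an initial intercept.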

To capture the spatial dependence between the spots in each slice, $v_{mi}$ is modeled using a continuous multivariate hidden Markov random field. Specifically, $v_{mi}$ is assumed to follow a conditional autoregressive (CAR) model.

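The spatial dependence between the spots in the slice is shown in formula (1):

$$v_{mi} \mid v_{m[n_m] \setminus i} \sim N\left(\mu_{v_{mi}},\; L_{mi}^{-1} \Psi_m\right); \tag{1}$$

wherein the subscript $[n_m] \setminus i$ denotes all spots in slice m except the ith spot, $L_{mi}$ is the number of neighbors of the ith spot in slice m, $\mu_{v_{mi}}$ is the conditional mean, equal to the mean of the low-dimensional representations of the neighbors of the ith spot in slice m, and $\Psi_m \in R^{q \times q}$ is the conditional covariance matrix.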
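The neighbor count and the conditional mean used by the CAR model can be computed as in the sketch below. It assumes, for illustration only, that spots lie on a square lattice with integer coordinates and that the neighbors of a spot are the four adjacent lattice positions; the function name `car_conditional_means` and the toy data are hypothetical.

```python
def car_conditional_means(coords, V):
    """Return (neighbor means, neighbor counts) for spots on an integer lattice.

    coords: list of (x, y) integer coordinates, one per spot.
    V: list of low-dimensional representations, one list of length q per spot.
    A spot without neighbors gets None as its conditional mean.
    """
    index = {tuple(c): i for i, c in enumerate(coords)}
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    q = len(V[0])
    means, n_neighbors = [], []
    for x, y in coords:
        # Constant number of dictionary lookups per spot, so the loop is O(n).
        nbrs = [index[(x + dx, y + dy)] for dx, dy in offsets
                if (x + dx, y + dy) in index]
        n_neighbors.append(len(nbrs))
        if not nbrs:
            means.append(None)
            continue
        means.append([sum(V[k][d] for k in nbrs) / len(nbrs) for d in range(q)])
    return means, n_neighbors

coords = [(0, 0), (1, 0), (0, 1), (5, 5)]
V = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0], [1.0, 1.0]]
mu, L = car_conditional_means(coords, V)
```

Because each spot only consults a bounded number of neighbors, the cost grows linearly with the number of spots, which is the property the CAR formulation relies on for large slices.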

Taking modeling of the regularized gene expression matrix $X_m$ as an example, the dimension reduction model is specifically shown in formula (2):

$$x_{mi} = \alpha_m + W v_{mi} + \epsilon_{mi}; \tag{2}$$

wherein $x_{mi} \in R^{p}$ is the gene expression vector of the ith spot of the mth slice, $\alpha_m \in R^{p}$ is the sample-specific intercept term of the mth slice, $W \in R^{p \times q}$ is the sample-shared loading matrix, used for capturing the information shared between slices, $v_{mi} \in R^{q}$ is the low-dimensional representation of the ith spot of the mth slice, used for capturing biological information, and $\epsilon_{mi} \sim N(0, \Lambda_m)$ is the normally distributed random noise of the ith spot of the mth slice, where $\Lambda_m = \mathrm{diag}(\lambda_{m1}, \ldots, \lambda_{mp})$ is a diagonal matrix.

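Formulas (1) and (2) together constitute the Gaussian version of the ProFAST model, and the parameters to be estimated comprise $(W, \alpha_m, \Lambda_m, \Psi_m, m \in \{1, \ldots, M\})$. Let $X = (X_1^{T}, \ldots, X_M^{T})^{T} \in R^{N \times p}$ and $V = (V_1^{T}, \ldots, V_M^{T})^{T} \in R^{N \times q}$, where $N = \sum_{m=1}^{M} n_m$ is the total sample size and $V_m = (v_{m1}, \ldots, v_{mn_m})^{T}$. The pseudo complete-data log-likelihood of the model can be expressed as shown in formula (3):

$$\ln P(X, V) = \sum_{m=1}^{M} \sum_{i=1}^{n_m} \left\{ \ln P(x_{mi} \mid v_{mi}) + \ln P(v_{mi} \mid v_{m[n_m] \setminus i}) \right\}; \tag{3}$$

wherein $P(x_{mi} \mid v_{mi})$ is obtained from formula (2), and $P(v_{mi} \mid v_{m[n_m] \setminus i})$ is obtained from formula (1).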
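As an illustration, the two terms of formula (3) can be written out explicitly. The expansion below is a sketch under stated assumptions: the conditional covariance in formula (1) is taken to be $L_{mi}^{-1} \Psi_m$, $w_j^{T}$ denotes the jth row of $W$, and $\mu_{v_{mi}}$ denotes the mean of the low-dimensional representations of the neighbors of the ith spot. These notational choices are made for the example and are not quoted from the application.

```latex
\ln P(x_{mi} \mid v_{mi})
  = -\frac{1}{2} \sum_{j=1}^{p} \left\{ \ln(2\pi\lambda_{mj})
    + \frac{\left(x_{mij} - \alpha_{mj} - w_j^{T} v_{mi}\right)^{2}}{\lambda_{mj}} \right\},

\ln P(v_{mi} \mid v_{m[n_m] \setminus i})
  = -\frac{q}{2}\ln(2\pi) + \frac{q}{2}\ln L_{mi} - \frac{1}{2}\ln\lvert\Psi_m\rvert
    - \frac{L_{mi}}{2} \left(v_{mi} - \mu_{v_{mi}}\right)^{T} \Psi_m^{-1} \left(v_{mi} - \mu_{v_{mi}}\right).
```

The first line follows from the diagonal form of $\Lambda_m$, and the second from $\ln\lvert L_{mi}^{-1}\Psi_m\rvert = -q \ln L_{mi} + \ln\lvert\Psi_m\rvert$. Each spot contributes a fixed number of terms for given p and q, so evaluating the sum takes time linear in the number of spots when the number of neighbors per spot is bounded.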

Taking modeling of the non-regularized gene expression matrix (i.e., the count matrix) as an example, the dimension reduction model is specifically shown in formula (4) and formula (5):

$$x_{mij} \mid f_{mij} \sim \mathrm{Poisson}(a_{mi} f_{mij}), \tag{4}$$

$$\ln f_{mij} = \alpha_{mj} + w_j^{T} v_{mi} + \epsilon_{mij}; \tag{5}$$

wherein $x_{mij}$ is the count expression level of gene j of the ith spot in slice m; $a_{mi}$ is the regularization factor of the ith spot in slice m, which may be set to 1 or to the sum of the counts of the ith spot in slice m; $w_j \in R^{q}$ is the sample-shared loading vector corresponding to gene j; $\alpha_{mj}$ is the sample-specific intercept term corresponding to gene j in slice m; $f_{mij}$ is the Poisson rate of gene j of the ith spot in slice m, which is an unknown quantity; $v_{mi} \in R^{q}$ is the low-dimensional representation of the ith spot of slice m; and $\epsilon_{mij} \sim N(0, \lambda_{mj})$ is the normally distributed random noise of gene j of the ith spot of slice m.

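Let the loading matrix be $W = (w_1, \ldots, w_p)^{T} \in R^{p \times q}$, the intercept vector be $\alpha_m = (\alpha_{m1}, \ldots, \alpha_{mp})^{T}$, and $\Lambda_m = \mathrm{diag}(\lambda_{m1}, \ldots, \lambda_{mp})$. Formulas (1), (4) and (5) together constitute the Poisson version of the ProFAST model, and the parameters to be estimated comprise $(W, \alpha_m, \Lambda_m, \Psi_m, m \in \{1, \ldots, M\})$. With $V_m = (v_{m1}, \ldots, v_{mn_m})^{T}$ and $F$ collecting all Poisson rates $f_{mij}$, the pseudo complete-data log-likelihood of the model can be expressed as shown in formula (6):

$$\ln P(X, F, V) = \sum_{m=1}^{M} \sum_{i=1}^{n_m} \left\{ \sum_{j=1}^{p} \left[ \ln P(x_{mij} \mid f_{mij}) + \ln P(f_{mij} \mid v_{mi}) \right] + \ln P(v_{mi} \mid v_{m[n_m] \setminus i}) \right\}; \tag{6}$$

wherein $\ln P(x_{mij} \mid f_{mij})$ is obtained from formula (4), $\ln P(f_{mij} \mid v_{mi})$ is obtained from formula (5), and $\ln P(v_{mi} \mid v_{m[n_m] \setminus i})$ is obtained from formula (1).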

In practical application, the ProFAST model may be iteratively optimized using an ICM-EM algorithm, which combines iterated conditional modes (ICM) with expectation-maximization (EM), to update the model parameters. The algorithm alternates ICM steps and EM steps until convergence, and finally outputs the low-dimensional representation of each slice, the parameter estimates of the model, and the likelihood function value.
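The alternating structure of such an algorithm can be sketched for the Gaussian model as below. The application does not give its update formulas, so this is a deliberately simplified stand-in and not the ProFAST algorithm itself: it assumes a single slice, q = 1, a fixed scalar $\psi$, and a parameter step that treats the current v as known (a hard-EM simplification that ignores posterior uncertainty). The name `icm_em_sketch` and the toy data are hypothetical.

```python
import random

def icm_em_sketch(X, neighbors, psi=1.0, n_iter=50, tol=1e-8):
    """Alternate ICM updates of v with closed-form updates of (alpha, w, lam)."""
    n, p = len(X), len(X[0])
    alpha = [sum(row[j] for row in X) / n for j in range(p)]
    w = [1.0 / p ** 0.5] * p
    lam = [1.0] * p
    v = [sum(wj * (row[j] - alpha[j]) for j, wj in enumerate(w)) for row in X]
    for _ in range(n_iter):
        v_old = list(v)
        # ICM step: maximize ln P(x_i | v_i) + ln P(v_i | neighbors) over each v_i.
        precision_data = sum(wj * wj / lj for wj, lj in zip(w, lam))
        for i in range(n):
            L_i = len(neighbors[i])
            mu_i = sum(v[k] for k in neighbors[i]) / L_i if L_i else 0.0
            data_term = sum(w[j] * (X[i][j] - alpha[j]) / lam[j] for j in range(p))
            denom = precision_data + L_i / psi
            if denom > 0.0:
                v[i] = (data_term + L_i * mu_i / psi) / denom
        # Parameter step (simplified): least-squares fit of each gene on v.
        v_bar = sum(v) / n
        s_vv = sum((vi - v_bar) ** 2 for vi in v) or 1e-12
        for j in range(p):
            x_bar = sum(row[j] for row in X) / n
            w[j] = sum((vi - v_bar) * (row[j] - x_bar) for vi, row in zip(v, X)) / s_vv
            alpha[j] = x_bar - w[j] * v_bar
            resid = [row[j] - alpha[j] - w[j] * vi for vi, row in zip(v, X)]
            # The floor keeps a noise variance from collapsing to zero.
            lam[j] = max(sum(r * r for r in resid) / n, 1e-6)
        # Fix the scale: unit-norm loading, with v rescaled so that w * v is unchanged.
        norm = sum(wj * wj for wj in w) ** 0.5 or 1.0
        w = [wj / norm for wj in w]
        v = [vi * norm for vi in v]
        if max(abs(a - b) for a, b in zip(v, v_old)) < tol:
            break
    return alpha, w, lam, v

# Toy data: 20 spots on a chain, two spatial domains, 3 genes.
rng = random.Random(0)
n = 20
v_true = [-1.0] * 10 + [1.0] * 10
w_true = [1.0, 2.0, -1.0]
X = [[0.5 + wj * vi + rng.gauss(0.0, 0.1) for wj in w_true] for vi in v_true]
neighbors = [[k for k in (i - 1, i + 1) if 0 <= k < n] for i in range(n)]
alpha, w, lam, v = icm_em_sketch(X, neighbors)
```

The ICM update is the exact maximizer of the two quadratic terms involving $v_i$, so each spot is updated in closed form using only its own data and its neighbors. A full EM step would additionally account for the posterior variance of the latent representation, which this sketch omits.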

The ProFAST model provided by the application can perform spatial dimension reduction on a plurality of slices simultaneously. The model captures the information shared between slices through the sample-shared loading matrix and captures biological information through the sample-specific low-dimensional representations, which allows ProFAST to analyze multiple slices simultaneously. By placing a continuous multivariate hidden Markov random field on the low-dimensional representations, the spatial dependence among the spots in each slice is captured, so that the model achieves spatially aware dimension reduction.

The ProFAST model provided by the application comes in two versions, a Gaussian version and a Poisson version, which model the regularized matrix and the count matrix, respectively. In addition, ProFAST is computationally efficient: because it models local spatial dependence with a conditional autoregressive model, its computational complexity is only $O(n)$, which makes it applicable to multi-slice, high-resolution SRT data.

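Another embodiment of the present application provides a dimension reduction system for spatial transcriptome data, the system comprising:

a data acquisition module, used for acquiring a gene expression matrix and a spatial coordinate matrix of each of a plurality of slices in spatial transcriptome data;

a parameter determining module, used for determining initial dimension reduction parameters and the spatial dependence among spots in each slice according to the gene expression matrix and the spatial coordinate matrix;

a model construction module, used for constructing a dimension reduction model according to the gene expression matrix, the initial dimension reduction parameters and the spatial dependence among the spots in each slice;

and an iterative optimization module, used for estimating parameters of the dimension reduction model by using an iterative optimization algorithm to obtain the final low-dimensional representation of each slice.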

For a specific description of each module in the dimension reduction system, reference may be made to the description of the corresponding step in the dimension reduction method, which is not repeated here; the dimension reduction system can realize the same functions as the dimension reduction method.

While the application has been described in terms of preferred embodiments, it will be understood by those skilled in the art that various changes and modifications can be made without departing from the scope of the application, and it is intended that the application is not limited to the specific embodiments disclosed.

Claims

1. A method for dimension reduction of spatial transcriptome data, the method comprising:

2. The method according to claim 1, wherein the step 2 specifically comprises:

3. The method according to claim 1 or 2, wherein the spatial dependencies between spots in the slice are in particular:

4. The method according to claim 1, wherein the step 1 specifically comprises:

5. The method according to claim 4, wherein the step 12 specifically includes:

6. The method of claim 5, wherein the dimension reduction model is:

7. The method of claim 4, wherein the dimension reduction model is:

8. The method of claim 1, wherein the iterative optimization algorithm is an ICM-EM algorithm.

9. The method of claim 3, wherein the regularization process is logarithmic regularization or negative binomial regression regularization.

10. A dimension reduction system for spatial transcriptome data, the system comprising: