CN114944194A

CN114944194A - Method and system for inferring cell subset expression patterns in a spatial transcriptome

Info

Publication number: CN114944194A
Application number: CN202210552099.0A
Authority: CN
Inventors: 刘健; 阮志涵; 陈娇
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2022-08-26

Abstract

The invention discloses a method and a system for inferring the expression patterns of cell subsets in a spatial transcriptome, and relates to the technical field of spatial transcriptome sequencing data analysis in bioinformatics. The method comprises: performing quality control and preprocessing on an scRNA-seq data set to obtain cell subset expression matrices; normalizing and standardizing the cell subset expression matrices; constructing a variational neural network to learn the latent variable distribution of each cell subset in the scRNA-seq data set; sampling from the trained latent variable distribution to generate the expression pattern of each cell subset; and deconvoluting the expression patterns of all spatial domains in the spatial transcriptome tissue section based on the expression patterns of the cell subsets, to obtain a maximum a posteriori estimate of the distribution of the cell subsets in the spatial domains. The invention reduces the dimensionality of the single-cell reference data required by spatial transcriptome deconvolution methods while retaining a large amount of relevant information, which improves the running speed and accuracy of the deconvolution method and yields a more accurate estimate of the distribution of cells in tissue sections.

Description

Method and system for inferring cell subset expression patterns in a spatial transcriptome

Technical Field

The invention belongs to the technical field of spatial transcriptome sequencing data analysis in bioinformatics, and particularly relates to a method and a system for inferring the expression patterns of cell subsets in a spatial transcriptome.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Spatial transcriptomics is an interdisciplinary field spanning the life sciences and computer science. Breakthroughs in this area have brought new discoveries to the study of diseases and biological processes. However, current sequencing technologies are limited: spatial transcriptomics techniques can measure where transcripts are produced, but cannot resolve which individual cells produced them. Single-cell RNA sequencing (scRNA-seq), by contrast, captures the transcripts of each cell but loses the spatial information.

Some analytical tools integrate single-cell data with spatial transcriptome data and propose deconvolution methods, which treat each sampling point (spot or bead) as a mixture of multiple cell types. Such a method builds a model from the expression patterns of the cell subsets in the single-cell data, takes the experimental data of each spot of the spatial transcriptome as input, and outputs the maximum a posteriori estimate of the spatial distribution of the cell subsets given the gene expression distribution of the spots.

The inventors found that current deconvolution methods place very high demands on the expression patterns of the cell subsets, while raw scRNA-seq data are large and noisy, which makes the deconvolution method slow and its results mediocre. Down-sampling the data directly can discard a large amount of valuable information.

Therefore, a method for obtaining the expression patterns of cell subsets is needed to solve the above problems.

Disclosure of Invention

The invention aims to provide a method and a system for inferring the expression patterns of cell subsets in a spatial transcriptome, so that the single-cell reference data required by spatial transcriptome deconvolution methods is reduced in dimensionality while retaining a large amount of relevant information, thereby improving the running speed and accuracy of the deconvolution method.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

In a first aspect, the invention provides a method of inferring the expression patterns of cell subsets within a spatial transcriptome, comprising:

performing quality control and preprocessing on the scRNA-seq data set to obtain cell subset expression matrices;

normalizing and standardizing the cell subset expression matrices;

constructing a variational neural network to learn the latent variable distribution of each cell subset expression matrix in the scRNA-seq data set;

sampling from the trained latent variable distribution to generate the expression pattern of each cell subset;

deconvoluting the expression patterns of all spatial domains in the spatial transcriptome tissue section based on the expression patterns of the cell subsets to obtain a maximum a posteriori estimate of the distribution of the cell subsets in the spatial domains.

Preferably, the quality control and preprocessing of the scRNA-seq data set comprises: filtering out cells with low gene counts, genes that are not expressed in any cell, and mitochondrial genes, and selecting the highly expressed genes.

Preferably, the cell subset expression matrix is normalized and standardized as follows:

$$X_i = \log(X_i + 1), \quad i \in C$$

wherein $X_i$ denotes the expression matrix of each cell subset; the normalization uses log normalization and the standardization uses min-max scaling; the resulting expression matrix $X'_i$ has a value range of [0, 1].

Preferably, the variational neural network that learns the latent variable distribution of each cell subset expression matrix in the scRNA-seq data set is constructed as follows:

for a preprocessed single-cell transcriptome gene expression matrix $X_i$, the data is first passed through an encoder consisting of a fully connected layer, which outputs $\mu$ and $\sigma$; the latent variable $Z$ is then sampled from the Gaussian distribution $\mathrm{Norm}(\mu, \sigma^2)$; finally, a decoder consisting of a fully connected layer generates the final reference data;

the formulas of the neural network are as follows:

$$E = \mathrm{ReLU}(X_i W_E)$$

$$\mu = \mathrm{ReLU}(X_i W_\mu)$$

$$\sigma = \mathrm{ReLU}(X_i W_\sigma)$$

$$Z = \mathrm{Sample}[\mathrm{Norm}(\mu, \sigma^2)]$$

$$D = \mathrm{ReLU}(Z W_D)$$

wherein $E$ and $D$ denote the hidden layers of the encoder and the decoder, respectively; $\mu$ and $\sigma$ denote the parameters of the Gaussian distribution in the latent space; $Z$ denotes the latent variable; and $X'_i$ denotes the reconstructed expression matrix of cell subset $i$.

Preferably, the method further comprises the steps of: setting an activation function, a loss function and a reparameterization method.

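Preferably, the loss function combines the reconstruction term $\|X_i - X'_i\|^2$ and the KL divergence term $\mathrm{KL}(N(\mu, \sigma^2) \,\|\, N(0, I))$,

wherein $\alpha$ denotes the ratio between $\|X_i - X'_i\|^2$ and the KL divergence term.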
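The loss described here can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text only says that α sets the ratio between the two terms, so applying α as a multiplier on the KL term is an assumption, and the function names `kl_to_standard_normal` and `vae_loss` are hypothetical. The KL term uses the standard closed form for a diagonal Gaussian against N(0, I).

```python
import numpy as np

def kl_to_standard_normal(mu, sigma, tiny=1e-8):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over all entries.

    `tiny` guards log(0) when sigma is exactly zero (possible after a ReLU).
    """
    var = np.asarray(sigma, dtype=float) ** 2
    mu = np.asarray(mu, dtype=float)
    return -0.5 * np.sum(1.0 + np.log(var + tiny) - mu ** 2 - var)

def vae_loss(X_i, X_rec, mu, sigma, alpha):
    """Reconstruction (L2) loss plus alpha-weighted KL loss.

    ASSUMPTION: alpha multiplies the KL term; the source does not show
    where alpha enters the formula.
    """
    recon = np.sum((np.asarray(X_i, dtype=float) - np.asarray(X_rec, dtype=float)) ** 2)
    kl = kl_to_standard_normal(mu, sigma)
    return recon + alpha * kl, recon, kl
```

When μ = 0 and σ = 1 the KL term vanishes and only the reconstruction error remains, which is a quick sanity check for any implementation of this loss.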

Preferably, the latent variable $Z$ is reparameterized as follows:

$$Z = \mathrm{Sample}[\mathrm{Norm}(\mu, \sigma^2)] = \mu + \varepsilon\sigma$$

wherein $\varepsilon \sim \mathrm{Norm}(0, 1)$.

In a second aspect, the present invention provides a system for inferring the expression patterns of cell subsets within a spatial transcriptome, comprising:

a quality control and preprocessing module configured to: perform quality control and preprocessing on the scRNA-seq data set to obtain cell subset expression matrices;

a normalization module configured to: normalize and standardize the cell subset expression matrices;

a latent variable distribution learning module configured to: construct a variational neural network to learn the latent variable distribution of each cell subset expression matrix in the scRNA-seq data set;

an expression pattern generation module configured to: sample from the trained latent variable distribution to generate the expression pattern of each cell subset;

a deconvolution module configured to: deconvolute the expression patterns of all spatial domains in the spatial transcriptome tissue section based on the expression patterns of the cell subsets to obtain a maximum a posteriori estimate of the distribution of the cell subsets in the spatial domains.

The above one or more technical solutions have the following beneficial effects:

By using a variational autoencoder, the invention accurately obtains the expression pattern of each cell subset in the scRNA-seq data set, so that the spatial transcriptome deconvolution method can accurately obtain the maximum a posteriori estimate of the spatial distribution of the cell subsets given the gene expression distribution of the spots.

The invention reduces the dimensionality of the single-cell reference data required by spatial transcriptome deconvolution methods while retaining a large amount of relevant information, thereby improving the running speed and accuracy of the deconvolution method and yielding a more accurate estimate of the distribution of cells in tissue sections.

Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a block diagram of the variational autoencoder of the present invention.

Detailed Description

The present disclosure is further described with reference to the following drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

Specific embodiments of the present invention are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail, to avoid obscuring the invention with unnecessary or redundant detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention in virtually any appropriately detailed structure.

The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

Example one

This embodiment provides a method for inferring the expression patterns of cell subsets in a spatial transcriptome. The method can be applied in fields such as spatial transcriptomics and single-cell transcriptomics. It uses a variational autoencoder to accurately obtain the expression patterns of the cell subsets, and then uses a deconvolution method to give the maximum a posteriori estimate of the spatial distribution of the cell subsets. The method comprises the following steps:

Step 1: Quality control of the scRNA-seq data set. In this example, the kidney cell data of 18-month-old mice in the Tabula-muris data set were selected; the expression matrix, denoted X, consists of 3138 cells and 20138 genes. Quality control is performed on this matrix: cells with low gene counts and genes that are not expressed in any cell are filtered out, and highly expressed genes are selected. After preprocessing, the expression matrix X consists of 2771 cells and 3000 highly variable genes.
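The filtering in Step 1 can be sketched as follows. This is a minimal NumPy illustration, not the pipeline used in the embodiment: the function name `quality_control`, the threshold `min_genes_per_cell`, the mitochondrial prefix `mt-`, and the use of log-count variance as a proxy for highly variable genes are all assumptions introduced for this example.

```python
import numpy as np

def quality_control(counts, gene_names, min_genes_per_cell=200,
                    n_top_genes=3000, mito_prefix="mt-"):
    """Filter a cells x genes count matrix.

    All thresholds and the gene-selection rule are illustrative assumptions.
    Returns the filtered matrix, the kept gene names, and the cell mask.
    """
    counts = np.asarray(counts, dtype=float)
    gene_names = np.asarray(gene_names).astype(str)

    # Drop cells in which too few genes are detected.
    genes_per_cell = (counts > 0).sum(axis=1)
    cell_mask = genes_per_cell >= min_genes_per_cell
    counts = counts[cell_mask]

    # Drop genes expressed in no remaining cell, and mitochondrial genes.
    expressed = (counts > 0).sum(axis=0) > 0
    mito = np.char.startswith(np.char.lower(gene_names), mito_prefix)
    gene_mask = expressed & ~mito
    counts, gene_names = counts[:, gene_mask], gene_names[gene_mask]

    # Keep the genes with the largest variance of log1p counts,
    # preserving their original column order.
    n_keep = min(n_top_genes, counts.shape[1])
    variances = np.log1p(counts).var(axis=0)
    top = np.sort(np.argsort(variances)[::-1][:n_keep])
    return counts[:, top], gene_names[top], cell_mask
```

In practice a dedicated single-cell toolkit would normally handle this step; the sketch only makes the order of the filters explicit.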

Step 2: The cell subset expression matrices are normalized and standardized. Tabula-muris provides the cell subset C to which each cell belongs. In this example, cell subsets with fewer than 25 cells are excluded, and log normalization and min-max standardization are applied to each cell subset expression matrix $X_i$ ($i \in C$), as shown in the equation:

$$X_i = \log(X_i + 1), \quad i \in C$$
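Step 2 can be illustrated with the short NumPy sketch below. It assumes that min-max scaling is applied over the whole subset matrix (the text does not say whether scaling is global or per gene); the function names and the small `eps` guard against division by zero are hypothetical.

```python
import numpy as np

def normalize_subset(X_i, eps=1e-12):
    """Log-normalize, then min-max scale one subset matrix into [0, 1].

    ASSUMPTION: min and max are taken over the whole matrix.
    """
    X_log = np.log1p(np.asarray(X_i, dtype=float))  # log(X_i + 1)
    lo, hi = X_log.min(), X_log.max()
    return (X_log - lo) / (hi - lo + eps)

def split_and_normalize(X, labels, min_cells=25):
    """Split a cells x genes matrix by subset label and normalize each part.

    Subsets with fewer than `min_cells` cells are skipped.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = {}
    for c in np.unique(labels):
        rows = labels == c
        if rows.sum() < min_cells:
            continue
        out[c.item()] = normalize_subset(X[rows])
    return out
```

Each returned matrix plays the role of $X'_i$, with values in [0, 1], and can be fed to the network of the next step.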

Step 3: A variational autoencoder (VAE) is constructed to learn the latent variable distribution of each cell subset expression matrix in the scRNA-seq data set. In this embodiment, the variational autoencoder is a type of neural network: it learns the cell expression pattern through the connections between nodes, describes the latent variables underlying the observations with a Gaussian distribution, and finally reconstructs the cell subset expression pattern from the latent variables. In this example, the single-cell transcriptome gene expression matrix $X_i$ is first passed through an encoder consisting of a fully connected layer, which outputs $\mu$ and $\sigma$; the latent variable $Z$ is then sampled from the Gaussian distribution $\mathrm{Norm}(\mu, \sigma^2)$; finally, a decoder consisting of a fully connected layer generates the final reference data.

$$E = \mathrm{ReLU}(X_i W_E)$$

$$\mu = \mathrm{ReLU}(X_i W_\mu)$$

$$\sigma = \mathrm{ReLU}(X_i W_\sigma)$$

$$Z = \mathrm{Sample}[\mathrm{Norm}(\mu, \sigma^2)]$$

$$D = \mathrm{ReLU}(Z W_D)$$

wherein $E$ and $D$ denote the hidden layers of the encoder and the decoder, respectively, with a dimension of 400 in this embodiment; $W_E$ and $W_D$ denote the weight parameters of the fully connected layers; $\mu$ and $\sigma$ denote the parameters of the Gaussian distribution in the latent space; $Z$ denotes the latent variable, which has a dimension of 20 in this embodiment; and $X'_i$ denotes the reconstructed expression matrix of cell subset $i$.
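A forward pass through a network of this shape might look like the NumPy sketch below. It is only an illustration with untrained random weights, under two assumptions that are not in the text: the μ and σ layers take the hidden layer E as input (the usual VAE wiring, whereas the formulas above write $X_i$ there), and a hypothetical output weight `W_out` followed by a sigmoid maps D to the reconstruction $X'_i$. The hidden size 400 and latent size 20 follow this embodiment.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_genes, hidden=400, latent=20, seed=0):
    """Random, untrained weights; W_out is a hypothetical output layer."""
    rng = np.random.default_rng(seed)
    def w(rows, cols):
        return rng.normal(0.0, 0.05, size=(rows, cols))
    return {"W_E": w(n_genes, hidden), "W_mu": w(hidden, latent),
            "W_sigma": w(hidden, latent), "W_D": w(latent, hidden),
            "W_out": w(hidden, n_genes)}

def forward(X_i, p, rng):
    """Encode, sample Z with the reparameterization trick, and decode."""
    E = relu(X_i @ p["W_E"])                # encoder hidden layer
    mu = relu(E @ p["W_mu"])                # ASSUMPTION: fed from E, not X_i
    sigma = relu(E @ p["W_sigma"])          # ASSUMPTION: fed from E, not X_i
    eps = rng.standard_normal(mu.shape)
    Z = mu + eps * sigma                    # sample from Norm(mu, sigma^2)
    D = relu(Z @ p["W_D"])                  # decoder hidden layer
    X_rec = sigmoid(D @ p["W_out"])         # reconstruction in (0, 1)
    return X_rec, mu, sigma, Z
```

The sigmoid output matches the [0, 1] range of the standardized input, so the reconstruction and the input are directly comparable.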

Furthermore, because the values of the standardized input expression matrix all lie between 0 and 1, the hidden layers use the ReLU activation function and the output layer uses the sigmoid function. The loss function of the VAE can be expressed as:

$$\mathrm{Loss} = -\mathbb{E}_{z \sim q(z|x)}[\log p(x|z)] + \mathrm{KL}(N(\mu, \sigma^2) \,\|\, N(0, I))$$

where the first term is also called the reconstruction loss; the model here uses the L2 loss, i.e.:

$$\|X_i - X'_i\|^2$$

The second term, the KL loss, measures how far the learned latent distribution $N(\mu, \sigma^2)$ departs from the standard normal prior $N(0, I)$, and can be expressed in the VAE as:

$$\mathrm{KL}(N(\mu, \sigma^2) \,\|\, N(0, I)) = -\frac{1}{2}\sum_{j}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

The final loss function is thus a weighted combination of the reconstruction loss and the KL loss, where $\alpha$, the ratio between the two, is set to 2 in this embodiment. During backpropagation, the latent variable $Z$ must be reparameterized, because the sampling operation is not differentiable.

Because $Z \sim N(\mu, \sigma^2)$, it can be written as:

$$Z = \mathrm{Sample}[\mathrm{Norm}(\mu, \sigma^2)] = \mu + \varepsilon\sigma$$

where $\varepsilon \sim \mathrm{Norm}(0, 1)$. With this technique, the gradient can be propagated back directly through $\mu$ and $\sigma$.
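The reparameterization can be checked numerically with a few lines of NumPy. The sketch below is illustrative only; the helper names `reparameterize` and `backward_through_sample` are hypothetical. It shows that the randomness is isolated in ε, so Z is an ordinary differentiable function of μ and σ.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw Z ~ Norm(mu, sigma^2) as Z = mu + eps * sigma, eps ~ Norm(0, 1)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + eps * sigma, eps

def backward_through_sample(grad_Z, eps):
    """Chain rule for Z = mu + eps * sigma.

    dZ/dmu = 1 and dZ/dsigma = eps, so an upstream gradient grad_Z
    becomes (grad_Z, grad_Z * eps) for mu and sigma respectively.
    """
    return grad_Z, grad_Z * eps
```

Given the gradient of the loss with respect to Z, `backward_through_sample` returns the gradients with respect to μ and σ, which is exactly the path that plain sampling would block.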

Step 4: Samples are drawn from the trained latent variable distribution of each cell subset to generate its expression pattern. Specifically, each cell subset with more than 25 cells is used as input to the variational autoencoder. In this example, the maximum number of iterations is set to 1000 and the learning rate to $10^{-3}$, and training stops when the KL loss falls below $10^{-5}$. The output is down-sampled to a dimension of 25, yielding a standard reference cell subset.
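The stopping rule and the generation step can be sketched as below. This is a hedged illustration: the text does not say how the output is down-sampled, so random selection of 25 generated profiles is an assumption, and `decode` stands for any trained decoder passed in by the caller (a hypothetical callable, not an API from the source).

```python
import numpy as np

def should_stop(iteration, kl_loss, max_iter=1000, kl_tol=1e-5):
    """Stop at the iteration cap or once the KL loss drops below kl_tol."""
    return iteration >= max_iter or kl_loss < kl_tol

def generate_reference(mu, sigma, decode, n_out=25, seed=0):
    """Sample latent vectors, decode them, and keep n_out profiles.

    mu, sigma: cells x latent arrays from the trained encoder.
    decode: callable mapping latent vectors to expression profiles.
    ASSUMPTION: down-sampling is random row selection without replacement.
    """
    rng = np.random.default_rng(seed)
    Z = mu + rng.standard_normal(mu.shape) * sigma
    X_gen = decode(Z)
    n_keep = min(n_out, X_gen.shape[0])
    keep = rng.choice(X_gen.shape[0], size=n_keep, replace=False)
    return X_gen[np.sort(keep)]
```

The returned matrix has at most 25 rows per cell subset, which is the compact reference passed to the deconvolution step.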

Step 5: The expression patterns of all spots in the spatial transcriptome tissue section are deconvoluted based on the expression patterns of the cell subsets to obtain the maximum a posteriori estimate of the spatial distribution of the cell subsets. Specifically, in this example, the FFPE_Kidney spatial transcriptome data Y, obtained with 10X Visium sequencing technology, has 3124 spots on tissue and 19465 genes, of which 2675 genes intersect with those of the cell subsets obtained in step 4. The tissue section is divided into regions by a spatial clustering method; X' and Y are taken as input to a deconvolution method, which outputs the proportion of each cell subset of X' in each region.

It should be noted that the spatial clustering method may be, for example, Seurat, BayesSpace or SpaGCN, and the deconvolution method may be, for example, SPOTlight, spacexr or stereoscope; these methods are well known, and all fall within the scope of protection of the present patent.
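To make the inputs and outputs of Step 5 concrete, the sketch below deconvolves region profiles with plain non-negative least squares solved by projected gradient descent. It is a deliberately simple stand-in, not SPOTlight, spacexr or stereoscope, and it is not a maximum a posteriori estimator. The function names are hypothetical, and `signatures` is assumed to hold one mean expression profile per cell subset derived from X'.

```python
import numpy as np

def nnls_projected_gradient(S, y, n_iter=5000):
    """Minimize 0.5 * ||S w - y||^2 subject to w >= 0."""
    step = 1.0 / (np.linalg.norm(S, 2) ** 2)   # 1 / Lipschitz constant
    w = np.zeros(S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ w - y)
        w = np.maximum(w - step * grad, 0.0)   # project onto w >= 0
    return w

def deconvolve_regions(signatures, sig_genes, Y, Y_genes):
    """Estimate cell subset proportions for each spatial region.

    signatures: subsets x genes, Y: regions x genes. Only genes present
    in both gene lists are used, mirroring the gene intersection above.
    """
    _, i_sig, i_y = np.intersect1d(sig_genes, Y_genes, return_indices=True)
    S = np.asarray(signatures, dtype=float)[:, i_sig].T   # genes x subsets
    Y = np.asarray(Y, dtype=float)[:, i_y]
    props = np.zeros((Y.shape[0], S.shape[1]))
    for r in range(Y.shape[0]):
        w = nnls_projected_gradient(S, Y[r])
        total = w.sum()
        props[r] = w / total if total > 0 else w
    return props
```

Each row of the result sums to 1 and gives the estimated proportion of every cell subset in one region; the dedicated tools named above add statistical models of count noise that this sketch omits.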

Example two

The object of the present embodiment is to provide a computing device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the steps of the method in the first embodiment.

Example three

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of the first embodiment.

Example four

It is an object of this embodiment to provide a system for inferring an expression pattern of a subpopulation of cells within a spatial transcriptome, comprising:

a quality control and preprocessing module configured to: perform quality control and preprocessing on the scRNA-seq data set;

an expression pattern generation module configured to: sample from the trained latent variable distribution to generate the expression pattern of each cell subset;

The steps involved in the apparatuses of the second, third and fourth embodiments above correspond to the method of the first embodiment, and their detailed description can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that causes the processor to perform any of the methods of the present disclosure.

Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims

1. A method of inferring an expression pattern of a subpopulation of cells within a spatial transcriptome, comprising:

normalizing and standardizing the cell subset expression matrix;

sampling from the trained latent variable distribution to generate the expression pattern of each cell subset;

2. The method of claim 1, wherein the quality control and preprocessing of the scRNA-seq data set comprises: filtering out cells with low gene counts, genes that are not expressed in any cell, and mitochondrial genes, and selecting the highly expressed genes.

3. The method of inferring the expression patterns of cell subsets within a spatial transcriptome of claim 1, wherein the cell subset expression matrix is normalized and standardized as follows:

$$X_i = \log(X_i + 1), \quad i \in C$$

wherein $X_i$ denotes the expression matrix of each cell subset; the normalization uses log normalization and the standardization uses min-max scaling; the resulting expression matrix $X'_i$ has a value range of [0, 1].

4. The method of inferring the expression patterns of cell subsets within a spatial transcriptome of claim 1, wherein the variational neural network that learns the latent variable distribution of each cell subset expression matrix in the scRNA-seq data set is constructed as follows:

for a preprocessed single-cell transcriptome gene expression matrix $X_i$, the data is first passed through an encoder consisting of a fully connected layer, which outputs $\mu$ and $\sigma$; the latent variable $Z$ is then sampled from the Gaussian distribution $\mathrm{Norm}(\mu, \sigma^2)$; finally, a decoder consisting of a fully connected layer generates the final reference data;

the formulas of the neural network are as follows:

$$E = \mathrm{ReLU}(X_i W_E)$$

$$\mu = \mathrm{ReLU}(X_i W_\mu)$$

$$\sigma = \mathrm{ReLU}(X_i W_\sigma)$$

$$Z = \mathrm{Sample}[\mathrm{Norm}(\mu, \sigma^2)]$$

$$D = \mathrm{ReLU}(Z W_D)$$

5. The method of inferring the expression pattern of a subpopulation of cells within a spatial transcriptome of claim 1, further comprising the step of: setting an activation function, a loss function and a reparameterization method.

6. The method of inferring the expression patterns of cell subsets within a spatial transcriptome of claim 5, wherein the loss function combines the reconstruction term $\|X_i - X'_i\|^2$ and the KL divergence term $\mathrm{KL}(N(\mu, \sigma^2) \,\|\, N(0, I))$,

wherein $\alpha$ denotes the ratio between $\|X_i - X'_i\|^2$ and the KL divergence term.

7. The method of inferring the expression patterns of cell subsets within a spatial transcriptome of claim 5, wherein the latent variable $Z$ is reparameterized as follows:

$$Z = \mathrm{Sample}[\mathrm{Norm}(\mu, \sigma^2)] = \mu + \varepsilon\sigma$$

wherein $\varepsilon \sim \mathrm{Norm}(0, 1)$.

8. A system for inferring the expression pattern of a subpopulation of cells within a spatial transcriptome, comprising:

9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 7 are performed when the program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method of any one of the preceding claims 1 to 7.