CN116825186A - Single-cell data batch effect correction method based on a generative adversarial network


Info

Publication number: CN116825186A
Application number: CN202310723261.5A
Authority: CN
Prior art keywords: data, cell, batch, model, distribution
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 彭佳杰, 杨光
Current Assignee: Northwestern Polytechnical University
Original Assignee: Northwestern Polytechnical University
Application filed by Northwestern Polytechnical University

Abstract

The embodiment of the application provides a single-cell data batch effect correction method based on a generative adversarial network, which comprises the following steps: sequencing single-cell data to obtain a high-dimensional single-cell gene expression count matrix; performing computation and learning on the high-dimensional single-cell gene expression count matrix with a recognition model in a variational-inference manner so as to approximate the posterior distribution, and using randomly generated Gaussian noise together with the distribution mean and distribution variance learned from the high-dimensional single-cell data to approximate the latent low-dimensional representation distribution, so as to obtain an approximate low-dimensional representation vector; and reconstructing the low-dimensional representation vector with a generative model, thereby obtaining reconstructed single-cell raw data; wherein, when the generative model reconstructs the low-dimensional representation vector, the low-dimensional representation vector is supplemented with a batch label for removing batch effects.

Description

Single-cell data batch effect correction method based on a generative adversarial network
Technical Field
The application relates to the technical field of biotechnology, and in particular to a single-cell data batch effect correction method based on a generative adversarial network.
Background
Single-cell RNA sequencing (scRNA-seq) is a technology for performing high-throughput sequencing of the transcriptome at the single-cell level. The single-cell data obtained by sequencing can reveal the gene structure and gene expression state of individual cells, and is therefore of great significance for discovering and analyzing cellular heterogeneity, mining significant biological changes, and studying correlations among cells. Research and analysis of single-cell data helps to find disease markers, locate disease sources, and improve diagnostic efficiency. Single-cell data has been widely used in fields such as oncology, microbiology, and developmental biology, and has achieved great success in clinical research.
Because of factors such as the experimental environment, sequencing samples, and sequencing technology, there is a certain degree of discrepancy between data collected in different batches; this discrepancy is called the batch effect. The batch effect is a kind of noise that easily confounds real biological differences and, during cluster analysis, easily splits data of the same cell type into different clusters by mistake, thereby producing misleading conclusions and seriously affecting the accuracy of downstream analysis tasks.
At present, conventional batch effect correction and data integration methods for single-cell data consume huge amounts of memory and long running times when facing large datasets, and their correction performance gradually degrades as the data volume and the number of batches increase. In recent years, deep learning has been widely applied in fields such as image recognition and natural language processing, owing to its adaptability to massive data and its ability to extract features automatically.
Some existing methods realize batch effect correction of single-cell data based on deep learning, but their correction performance is not yet excellent, and there remains large room for improvement.
In view of the above problems, no effective solution has been proposed at present.
It should be noted that the foregoing description of the background art is only intended to provide a clear and complete description of the technical solution of the present invention and is presented for the convenience of understanding by those skilled in the art. The above solutions should not be considered known to the person skilled in the art simply because they are set forth in the background section of the invention.
Disclosure of Invention
The present specification aims to provide a single-cell data batch effect correction method based on a generative adversarial network to solve the above problems.
The single-cell data batch effect correction method based on a generative adversarial network provided by the specification comprises the following steps:
sequencing the single-cell data to obtain a high-dimensional single-cell gene expression count matrix;
performing computation and learning on the high-dimensional single-cell gene expression count matrix with a recognition model in a variational-inference manner so as to approximate the posterior distribution, and using randomly generated Gaussian noise together with the distribution mean and distribution variance of the high-dimensional single-cell data to approximate the latent low-dimensional representation distribution, so as to obtain an approximate low-dimensional representation vector;
reconstructing the low-dimensional representation vector based on a generative model, thereby obtaining reconstructed single-cell raw data;
wherein, when the generative model reconstructs the low-dimensional representation vector, the low-dimensional representation vector is labeled with a batch label for removing batch effects;
and combining the generator of a generative adversarial network with the recognition model to form a new generative model, and judging and scoring the high-dimensional single-cell gene expression count matrix and the reconstructed single-cell raw data with the discriminator, so as to improve the generative model's ability to reconstruct data.
Preferably, the reconstructing of the low-dimensional representation vector includes:
fitting the single-cell data by a negative binomial distribution and using a zero-inflation model to fit dropout events, thereby reducing the influence of dropout events and the bias between the reconstructed single-cell raw data and the single-cell data.
Preferably, the fitting of the single-cell data based on the zero-inflation and negative binomial distribution models includes:
dividing the model into dropout events and non-dropout events, and using the generative model to fit the three parameters of the zero-inflated negative binomial distribution model respectively: the probability of a dropout event occurring, the dispersion of the negative binomial distribution, and the expression mean.
Preferably, the discrimination by the discriminator in the generative adversarial network includes:
the generator in the generative adversarial network updating its network parameters according to the scores of the discriminator, so as to generate reconstructed single-cell raw data similar to the real single-cell data.
Preferably, in obtaining the approximate low-dimensional representation vector:
a target batch is selected for the multi-batch single-cell data based on a star-shaped generative adversarial network, and the single-cell data of the remaining batches are uniformly mapped by the generator into the data distribution space of the target batch, so that batch effects among different batches of single-cell data are corrected.
Preferably, during correction of the batch effect between different batches of the single-cell data by the star-shaped generative adversarial network:
pre-clustering is performed within the target batch, and the data of the other batches are assigned to the nearest cell clusters of the target batch, thereby obtaining cell-type pre-labels for the data of multiple batches.
Preferably, after the pre-clustering, a plurality of different cell clusters are obtained, and the center position of each cell cluster is then calculated as the mean of the samples within the cluster.
Preferably, in processing the data of the remaining batches, the distance between each sample of each batch and the centers of the cell clusters of the target batch is calculated separately, and the sample is assigned to the nearest cell cluster of the target batch.
Preferably, when the pre-labels are obtained by the pre-clustering, the pre-labels are taken as reference labels, thereby constraining the generator and preventing the transformation process from changing the cell type of the data.
Preferably, a cell type recognition model is constructed separately when the pre-labels are obtained by the pre-clustering, so as to predict the cell type labels of the data.
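The pre-clustering step above can be sketched as follows: cluster centers are the per-cluster sample means in the target batch, and each sample from the remaining batches is pre-labeled by its nearest center. This is an illustrative sketch only; function names, the Euclidean distance metric, and the toy data are assumptions not taken from the patent.

```python
import math

def cluster_centers(samples, labels):
    """Mean vector of each cell cluster in the target batch."""
    groups = {}
    for vec, lab in zip(samples, labels):
        groups.setdefault(lab, []).append(vec)
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in groups.items()}

def assign_to_nearest(sample, centers):
    """Pre-label a sample from another batch by its nearest cluster center."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centers, key=lambda lab: dist(sample, centers[lab]))

# Toy target batch with two cell clusters (hypothetical values)
target = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
labels = ["T cell", "T cell", "B cell", "B cell"]
centers = cluster_centers(target, labels)
pre_label = assign_to_nearest([4.8, 5.1], centers)  # sample from another batch
```

A sample from another batch lying near the second cluster's center thus receives that cluster's cell-type pre-label.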
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts a variational-inference approach through the variational autoencoder in the recognition model, and uses random Gaussian noise together with the learned distribution mean and distribution variance to approximate the latent low-dimensional representation distribution, thereby obtaining the approximate low-dimensional representation vector.
2. The invention adopts a zero-inflated negative binomial distribution model to fit the sequenced single-cell data, so as to reduce the influence of dropout events.
3. The invention supplements batch information by adding batch labels, so that the generative model can reconstruct the latent low-dimensional representation vectors from which batch effects were removed, and the recognition model can learn the batch-effect-free latent low-dimensional representation from the raw data containing batch effects.
4. In the invention, the discriminator identifies the input data and calculates a corresponding discrimination score; the generator does not directly access the input raw data, but updates its network parameters according to the discriminator's scores, so that the imitation data the generator produces can obtain higher discrimination scores, improving the generative model's ability to reconstruct data and enabling it to generate reconstructed data more similar to the raw data.
5. In the invention, one batch is selected as the target batch and the data are uniformly mapped into the data distribution space of the target batch; by unifying the batch style of the data, the batch effect among data of different batches is corrected.
6. In the method, the center position of each cell cluster is calculated as the mean of the samples within the cluster, the distance between each batch's samples and the cluster centers of the target batch is calculated separately, and the data are assigned to the nearest cluster of the target batch, thereby directly obtaining accurate cell-type pre-labels.
7. The invention predicts the cell-type labels of the data by separately constructing a cell type recognition model when the pre-labels are obtained by pre-clustering.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in this specification, and other drawings may be obtained from them without inventive effort by a person skilled in the art.
FIG. 1 is a general flow chart of the single-cell data batch effect correction method based on a generative adversarial network provided by the embodiments of the present disclosure;
FIG. 2 is a framework diagram of the model for dimension reduction and correction of high-dimensional sparse single-cell data in the single-cell data batch effect correction method based on a generative adversarial network according to an embodiment of the present disclosure;
FIG. 3 is a framework diagram of the model for correcting batch effects in the low-dimensional representation in the single-cell data batch effect correction method based on a generative adversarial network provided in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the pre-clustering algorithm of the single-cell data batch effect correction method based on a generative adversarial network according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing the distribution of cells before and after correction in the single-cell data batch effect correction method based on a generative adversarial network according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of the generative adversarial network (GAN) in the single-cell data batch effect correction method based on a generative adversarial network according to an embodiment of the present disclosure.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
In the description of the present application, it should be noted that the directions or positional relationships indicated by the terms "upper", "middle", "lower", "inner", "outer", "front", "rear", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the apparatus or component to be referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present application. The terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application will be understood in specific cases by those of ordinary skill in the art. Hereinafter, an embodiment of the present application will be described in terms of its overall structure.
Referring to FIGS. 1 to 6, an embodiment of the present application provides a single-cell data batch effect correction method based on a generative adversarial network, including:
sequencing single-cell data to obtain a high-dimensional single-cell gene expression count matrix;
performing computation and learning on the high-dimensional single-cell gene expression count matrix with a recognition model in a variational-inference manner so as to approximate the posterior distribution, and using randomly generated Gaussian noise together with the distribution mean and distribution variance of the high-dimensional single-cell data to approximate the latent low-dimensional representation distribution, so as to obtain an approximate low-dimensional representation vector;
reconstructing the low-dimensional representation vector based on a generative model, thereby obtaining reconstructed single-cell raw data;
wherein, when the generative model reconstructs the low-dimensional representation vector, the low-dimensional representation vector is labeled with a batch label for removing batch effects;
and combining the generator of a generative adversarial network with the recognition model to form a new generative model, and judging and scoring the high-dimensional single-cell gene expression count matrix and the reconstructed single-cell raw data with the discriminator, so as to improve the generative model's ability to reconstruct data.
The variational autoencoder model is a generative model based on variational Bayesian inference. Specifically, the recognition model q(z|x), obtained by learning through variational Bayesian inference, approximately estimates the intractable posterior distribution p(z|x). Because no randomness exists in the latent low-dimensional representation learned by a traditional autoencoder, the traditional autoencoder is only a low-dimensional representation model and cannot be applied to generation tasks; whereas, because the variational autoencoder randomly samples in the approximate distribution space of the low-dimensional representation using random Gaussian noise, it can be used to generate diversified data.
The existing generative adversarial network and variational autoencoder each have advantages as generative models: the images generated by a generative adversarial network are generally clearer than those generated by a variational autoencoder, but the generative adversarial network cannot perform dimension reduction, whereas the variational autoencoder can obtain the latent low-dimensional representation of the data.
The star-shaped generative adversarial network comprises only one generator and one discriminator. The generator takes as input the source-domain data x and the label d of the target domain, where the target-domain label is encoded in one-hot form; using the target-domain label, the source-domain data is converted into imitation target-domain data, with the hope that the discriminator cannot distinguish real target-domain data from imitation target-domain data. The discriminator judges whether the input data is real data or imitation data and calculates a discrimination score for the data.
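The conditioning described above can be sketched minimally: the target-batch label d is one-hot encoded and concatenated onto the source-domain sample before it enters the generator. This is an illustrative sketch; the function names are assumptions, and the actual network layers are omitted.

```python
def one_hot(index, n_classes):
    """One-hot encode the target-domain label d."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

def generator_input(x, target_batch, n_batches):
    """Concatenate source-domain data x with the one-hot target-batch label."""
    return x + one_hot(target_batch, n_batches)

# A 2-gene toy sample mapped toward target batch 2 of 3 (hypothetical values)
inp = generator_input([0.5, 1.2], target_batch=2, n_batches=3)
```

The generator then learns to map `inp` into the data distribution of the requested target batch.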
In this application, the generative model is combined with the generative adversarial network, and the discriminator is used to improve the generative model's ability to reconstruct data, so that the generative model can generate reconstructed data more similar to the raw data.
This application provides a model for dimension reduction and correction of high-dimensional sparse single-cell data. The variational autoencoder in the model consists of a recognition model q(z|x) and a generative model p(x|z, b), and the true posterior distribution is fitted by the recognition model according to variational inference. The recognition model corresponds to the encoder in the variational autoencoder model. Here x is the original high-dimensional data and z is the latent low-dimensional representation of the original high-dimensional data x.
As can be seen from FIG. 2, the high-dimensional single-cell gene expression count matrix x is input to the recognition model. The distribution mean μ' and distribution variance σ'^2 obtained by learning, together with randomly generated Gaussian noise, approximately estimate the latent low-dimensional distribution p(z), further giving the approximate low-dimensional representation vector z = μ' + σ' ⊙ ε, where ε is random Gaussian noise obeying the standard normal distribution.
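The sampling step z = μ' + σ' ⊙ ε is the standard VAE reparameterization trick, sketched below; it keeps sampling differentiable with respect to the learned mean and variance. Function names and the toy values are illustrative, not from the patent.

```python
import random

def reparameterize(mu, sigma, eps=None):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# With eps fixed to zero, the sample collapses to the learned mean
z = reparameterize([1.0, -2.0], [0.5, 0.3], eps=[0.0, 0.0])
```

Because the randomness is isolated in ε, gradients flow through μ' and σ' during training.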
Through the derivation of variational inference, it further follows that minimizing the difference KL(q(z|x) ‖ p(z|x)) between the recognition model q(z|x) and the true posterior distribution p(z|x) is equivalent to maximizing the variational lower bound, which is defined as

L(x) = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z)),

where L(x) is the variational lower bound of variational inference; KL(q(z|x) ‖ p(z)) is the difference between the approximate distribution and the prior of the latent low-dimensional representation; and p(x|z) is the reconstructed data distribution learned by the generative model.
The optimization objective of the above recognition model is to reduce the difference between the approximate distribution q(z|x) and the prior distribution p(z) of the latent low-dimensional representation, i.e., to minimize the negative of the KL term in the variational lower bound; notably, the KL divergence is taken against the prior distribution p(z). With a standard-normal prior, the loss function of the recognition model is

L_KL = −(1/2) Σ_j (1 + log σ_j^2 − μ_j^2 − σ_j^2),

where μ is the mean of the approximate distribution learned by the recognition model and σ^2 is the variance of the approximate distribution learned by the recognition model.
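The closed-form KL term above (Gaussian q(z|x) against a standard-normal prior) can be computed directly; this is a standard VAE result, shown here as a hedged sketch with illustrative names.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# When q(z|x) already equals the prior N(0, I), the loss is exactly zero
loss = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

Parameterizing by log σ^2 rather than σ^2 is a common numerical choice so the variance stays positive without constraints.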
In practical studies, single-cell data may be partially missing due to dropout events occurring in the sequencing process. Some of these missing values are non-true zero values, hence the zero-inflation problem in the data distribution.
Preferably, the reconstructing of the low-dimensional representation vector includes:
fitting the single-cell data by a negative binomial distribution and using a zero-inflation model to fit dropout events, thereby reducing the influence of dropout events and the bias between the reconstructed single-cell raw data and the single-cell data.
It should be noted that in the process of sequencing single-cell data, dropout events occur due to factors such as the sampling technique and the number of transcribed samples, producing a large number of zero expressions in the high-dimensional single-cell gene expression count matrix. According to the findings of those skilled in the art in practice, zero expression arises in two cases: one is a true zero expression level, i.e., the target gene is not expressed in the cell sample and is therefore not detected; the other is a false zero expression level, i.e., part of the gene expression is lost due to a dropout event, so the expression level is not detected. These missing values produce inaccurate dimension-reduction results, i.e., inaccurate low-dimensional representation vectors.
The value at each position in the single-cell gene expression count matrix is the number of detected expressions of the target gene in the cell, so single-cell data are discrete data. The Poisson distribution is defined as

P(k; λ) = λ^k e^{−λ} / k!,

where k is the count value in the gene expression matrix and λ is the parameter of the Poisson distribution.
By this definition, the expectation and variance of the Poisson distribution are both equal to λ, which is fixed; but in a gene expression matrix the sample mean and sample variance are generally unequal, and the gap between them grows as the expression level increases, so the distribution of single-cell data does not fully conform to the properties of the Poisson distribution.
To overcome the above problem, the Poisson distribution model is improved in this application: instead of fitting the single-cell data with the Poisson distribution, the negative binomial distribution is used. The negative binomial distribution is defined as

f_NB(x; r, p) = Γ(x + r) / (Γ(r) · x!) · (1 − p)^r · p^x,

where r is the dispersion parameter of the negative binomial distribution, p is the probability of the expression value, x is the true expression value, and Γ is the Gamma function.
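The pmf above can be evaluated numerically with log-Gamma terms; in this (r, p) parameterization the mean r·p/(1−p) and variance r·p/(1−p)^2 differ, which is why the negative binomial fits overdispersed single-cell counts better than the Poisson. A hedged sketch with illustrative parameter values:

```python
import math

def nb_pmf(x, r, p):
    """Negative binomial pmf: Gamma(x+r)/(Gamma(r) x!) * (1-p)^r * p^x."""
    log_coef = math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
    return math.exp(log_coef + r * math.log(1.0 - p) + x * math.log(p))

# Probabilities over a wide support should sum to (almost) 1
total = sum(nb_pmf(x, r=2.0, p=0.3) for x in range(200))
```

Working in log space via `math.lgamma` avoids overflow in the Gamma terms for large counts.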
Preferably, zero expression is divided into dropout events and non-dropout events, and the generative model is used to fit the three parameters of the zero-inflated negative binomial distribution model respectively: the probability π of a dropout event occurring, the dispersion θ of the negative binomial distribution, and the expression mean μ. This gives the definition of the zero-inflated negative binomial distribution model:

f_ZINB(x; π, μ, θ) = π δ_0(x) + (1 − π) f_NB(x; μ, θ),

where f_ZINB(x; π, μ, θ) is the zero-inflated negative binomial distribution model; δ_0(x) is the Dirac function; π is the probability of a dropout event occurring; θ is the dispersion of the negative binomial distribution; μ is the expression mean; f_NB(x; μ, θ) is the negative binomial distribution model; and Γ is the Gamma function.
As stated above, zero expression arises in two cases: one is a true zero expression level, and the other is a false zero expression level. Accordingly, the zero-inflated negative binomial distribution model is divided into two parts. The first part fits the zero expressions in the data: substituting x = 0 into f_ZINB(x; π, μ, θ) = π δ_0(x) + (1 − π) f_NB(x; μ, θ) gives the model formula for zero expression,

f_ZINB(0; π, μ, θ) = π + (1 − π) (θ / (θ + μ))^θ,

where the former term fits the false zero expression level and the latter term fits the true zero expression level. The second part fits the non-zero expressions in the data: substituting x ≠ 0 into f_ZINB(x; π, μ, θ) = π δ_0(x) + (1 − π) f_NB(x; μ, θ) gives the model formula for non-zero expression,

f_ZINB(x; π, μ, θ) = (1 − π) f_NB(x; μ, θ), x ≠ 0.
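The two-part mixture can be sketched directly, with the NB component in the mean/dispersion (μ, θ) parameterization used in the formulas above. A hedged sketch; the function names and the example parameter values are illustrative assumptions.

```python
import math

def nb_mean_disp_pmf(x, mu, theta):
    """NB pmf parameterized by mean mu and dispersion theta."""
    log_coef = math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
    return math.exp(log_coef
                    + theta * math.log(theta / (theta + mu))
                    + x * math.log(mu / (theta + mu)))

def zinb_pmf(x, pi, mu, theta):
    """pi * delta_0(x) + (1 - pi) * f_NB(x; mu, theta)."""
    dropout = pi if x == 0 else 0.0  # the Dirac point mass at zero
    return dropout + (1.0 - pi) * nb_mean_disp_pmf(x, mu, theta)

# Zero expression mixes false zeros (dropout) with true NB zeros
p_zero = zinb_pmf(0, pi=0.2, mu=3.0, theta=1.5)
```

Note how `p_zero` decomposes exactly as π + (1 − π)(θ/(θ+μ))^θ, matching the zero-expression formula above.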
there is a lot effect in the raw data, and the resulting potential low dimension indicates that there is no lot effect. Then it is not possible for the recognition model to fit the potential low-dimensional representation z containing the batch effect when the potential low-dimensional representation is reconstructed. To address the batch effect in the raw data, a batch tag b is added when reconstructing the data using the generative model. Because of the supplementing batch information, generating the model p (x|z, b) enables reconstruction of the potentially low-dimensional representation z, from which the batch effect was removed, into raw data containing the batch effect.
In a generative adversarial network, the goal of the generator is to generate imitation data consistent with the raw data distribution, while the goal of the generative model is to reconstruct the low-dimensional representation into reconstructed data that approximates the raw data distribution; the goal of the generator is therefore consistent with that of the generative model of the variational autoencoder. This application combines the variational autoencoder model and the generative adversarial network model: the generative model is combined with the generative adversarial network, and the discriminator is used to improve the generative model's ability to reconstruct data, so that it can generate reconstructed data more similar to the raw data.
The adversarial loss of the generative-adversarial-network part is defined as

L_adv = E_x[log D_src(x)] + E_{z,b}[log(1 − D_src(p(x|z, b)))],

where L_adv is the adversarial loss; D_src is the discrimination score calculated by the discriminator; x is the raw data; z is the latent low-dimensional representation; b is the batch label; and p(x|z, b) is the generative model's reconstructed data.
For the discriminator in the generative adversarial network: the discriminator recognizes the input data and calculates a score. The higher the discrimination score, the closer the input data is to the real data distribution; the lower the discrimination score, the closer the input data is to the imitation data distribution. That is, the discriminator is expected to distinguish real data from imitation data, with real data receiving higher discrimination scores and imitation data receiving lower ones.
For the generator: the generator aims to generate imitation data similar to the real data and to fool the discriminator, so that the discriminator cannot distinguish real data from imitation data. The generator does not directly access the input raw data, but updates its network parameters according to the discriminator's scores, so that the imitation data it generates can obtain higher discrimination scores.
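The opposing objectives above can be sketched as generic (non-saturating) GAN losses computed from discriminator scores. This is an illustrative sketch and not the patent's exact loss: the discriminator is rewarded for scoring real data high and imitation data low, while the generator is rewarded when its imitation data scores high.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def discriminator_loss(real_scores, fake_scores):
    """Reward high scores on real data and low scores on imitation data."""
    return (-sum(math.log(sigmoid(s)) for s in real_scores) / len(real_scores)
            - sum(math.log(1.0 - sigmoid(s)) for s in fake_scores) / len(fake_scores))

def generator_loss(fake_scores):
    """The generator only sees the scores of its imitation data."""
    return -sum(math.log(sigmoid(s)) for s in fake_scores) / len(fake_scores)

# Raising the imitation data's score (fooling the discriminator) lowers the
# generator's loss, which is what drives its parameter updates.
worse = generator_loss([-2.0])
better = generator_loss([2.0])
```

This captures the key point of the paragraph: the generator's gradient signal comes entirely through the discriminator's score, never from the raw data itself.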
In this example, the present invention is further described in detail based on the analysis of a human peripheral blood mononuclear cell dataset, a human pancreatic cell dataset, human peripheral blood mononuclear cell datasets obtained by several different sequencing technologies, and a mouse brain dataset.
Parameter settings are required first. The recognition model consists of three layers of fully connected networks: the first two layers have 256 and 128 neurons respectively, and the third layer consists of two fully connected networks whose neuron counts match the dimension of the latent low-dimensional representation, learning the distribution mean μ' and the distribution variance σ'^2 of the latent low-dimensional representation respectively. The output of each fully connected network uses the Rectified Linear Unit (ReLU) activation function:
f(x) = max(x, 0).
To prevent covariate shift of the features and reduce the risk of vanishing gradients, batch normalization is used between every two layers of the fully connected network. To reduce the risk of model overfitting, a dropout layer is adopted between every two layers of the fully connected network, with the dropout rate set to 0.1.
The discriminator consists of three layers of fully connected networks: the first two layers have 256 and 128 neurons respectively, and the third layer consists of two fully connected networks with neuron counts of 1 and the number of batches, outputting the discrimination score and the predicted batch label respectively. The activation function in the discriminator is ReLU, and a dropout layer is added between every two layers, with the dropout rate set to 0.2.
In the present application, the inventors set the learning rates of both the discriminator and the variational autoencoder to 0.001; during mini-batch learning, the batch size is set to 128.
In the single-cell data preprocessing flow, the first step is to clean the single-cell dataset. Useless cell samples and redundant gene features are usually filtered out by removing cell samples that express only a small number of genes and removing genes that are expressed in only a few cells. In this example, cells with fewer than 200 detected genes and genes expressed in no more than 3 cells were filtered. The second step is normalization at the cellular level. A common normalization approach is to divide each gene expression count in a cell by the cell's sequencing depth, multiply by 10000, and then convert the count to a natural logarithmic scale. Because the minimum count value is 0, 1 is added to all count values so that the normalized value is not lower than 0; the conversion formula is

x̃ = ln(x / depth × 10000 + 1),

where x̃ is the transformed value of x and depth is the cell's sequencing depth. The third step is screening for highly variable genes. The more pronounced a feature's variation, the more useful it is for distinguishing different cells, so when analyzing single-cell data the gene features with the most pronounced variation are screened out; these genes are called highly variable genes (HVGs). A common screening method is to select genes with larger variance and mean as highly variable genes, use them as the gene features for identifying cells, and discard the remaining gene features. In this example, the top 2000 genes were selected as highly variable genes.
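The cell-level normalization step can be sketched directly: each count is divided by the cell's sequencing depth, scaled by 10000, and log-transformed with a +1 offset so zero counts map to zero. Names and the toy counts are illustrative assumptions.

```python
import math

def normalize_cell(counts, scale=10000.0):
    """Depth-normalize and log-transform one cell's gene counts."""
    depth = sum(counts)  # the cell's sequencing depth
    return [math.log(c / depth * scale + 1.0) for c in counts]

cell = [0, 5, 95]  # raw gene counts for one cell; depth = 100
norm = normalize_cell(cell)
```

The +1 offset ensures a raw count of 0 transforms to exactly ln(1) = 0, so the normalized values are never negative.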
Table one: human peripheral blood mononuclear cell data set
The human peripheral blood mononuclear cell dataset was merged and cell-type annotated by Polański et al. The dataset contains 2 batches obtained by the 3' method and the 5' method: the 3' data contain 8098 cells and the 5' data contain 7378 cells, and the merged dataset contains 15476 cells and 33694 genes.
As can be seen from fig. 3, there is a significant batch difference between the data obtained by the 3' method and the 5' method: data of the same cell type are separated by a large gap across batches. After integration, the marker genes of different cell types are essentially each concentrated in a single cell cluster, making the partitioning and annotation of cell types more convenient.
The present inventors calculated the integration coefficient LISI for the raw data and for the data corrected with the dimension-reduction and correction model for high-dimensional sparse single-cell data. Depending on the reference label, LISI is divided into the batch coefficient iLISI and the cell-type coefficient cLISI. The higher the iLISI, the higher the degree of mixing between data of different batches, i.e. the stronger the batch-effect correction; the lower the cLISI, the lower the degree of mixing between different cell types. In an ideal correction result, the data of different batches are mixed together uniformly, i.e. iLISI should be as high as possible, while the different cell types remain completely separated and clearly distinct, i.e. cLISI should be as low as possible. In this embodiment, the raw data yield an average iLISI of 1.0081 and an average cLISI of 1.0426; the data corrected with the generative adversarial network model yield an average iLISI of 1.7733 and an average cLISI of 1.0810. Comparing these figures, the average cLISI changes only slightly while the average iLISI increases significantly. That is, the degree of mixing between different batches is substantially improved while the degree of mixing between different cell types remains stable.
Further, the data are partitioned into a number of clusters and assigned cluster labels, and the ARI and NMI of the clustering result are calculated by comparing the assigned cluster labels with the actual cell-type labels. Clustering the raw data gives an ARI of 0.3915 and an NMI of 0.6774; clustering the data corrected with the generative adversarial network model gives an ARI of 0.6605 and an NMI of 0.7710. Note that ARI is the adjusted Rand index, which mainly measures the agreement between the cluster labels of the clustering result and the true labels; NMI is normalized mutual information, which mainly measures the similarity between the cluster labels and the true labels.
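ARI and NMI can be computed from cluster labels and cell-type labels as sketched below, assuming scikit-learn is available; the labels here are toy values, not the data of this embodiment:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy example: three clusters that perfectly match three cell types.
cluster_labels = [0, 0, 1, 1, 2, 2]
true_types = ["B", "B", "T", "T", "NK", "NK"]

ari = adjusted_rand_score(true_types, cluster_labels)           # adjusted Rand index
nmi = normalized_mutual_info_score(true_types, cluster_labels)  # normalized mutual information
```

Perfect agreement yields ARI = NMI = 1; both scores are invariant to permutations of the cluster label ids.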
According to these data, after correction by the generative adversarial network model, the degree of mixing of the data is greatly improved and the batch effect is reduced, while the differences between different cell types are preserved after correction, so the accuracy of the clustering result is significantly improved.
Correction analysis was also performed on single-cell data with inconsistent cell-type composition. The dataset is a human islet cell dataset comprising data collected by 5 different authors through different batches of sequencing experiments. The pooled dataset contains a total of 16293 cells and 34290 genes, covering 10 different cell types.
The inventors of the present application tested the generative adversarial network on the dataset of table two and used the UMAP algorithm to visualize the original data and the corrected distribution.
According to fig. 5, before correction there is a clear batch difference between the data obtained by the 5 different sequencing techniques; that is, the data of the different batches are not completely mixed. After correction, the data obtained by the 5 different sequencing techniques are relatively evenly mixed together.
Table two: Detailed information of the human islet cell dataset
In practice, before correcting the data, the ARI obtained by clustering the raw data was 0.3264 and the NMI was 0.6268; after correcting the data with the dimension-reduction and correction model for high-dimensional sparse single-cell data, clustering yielded an ARI of 0.5665 and an NMI of 0.7012. The model for dimension reduction and correction of high-dimensional sparse single-cell data disclosed in this application, i.e. the model that actually implements the single-cell data batch effect correction method based on a generative adversarial network, may be named scVGAN.
According to this data analysis, after processing by the dimension-reduction and correction model for high-dimensional sparse single-cell data, the batch effect in the data is greatly reduced while the differences between different cell types are preserved. Further, after correction, the accuracy of clustering analysis on the data is improved.
By calculation, before correcting the data the inventors obtained from the raw data an average batch coefficient of 1.7227 and an average cell-type coefficient of 1.4681; after correcting the data with the dimension-reduction and correction model for high-dimensional sparse single-cell data, the average batch coefficient was 3.7878 and the average cell-type coefficient was 1.4258. That is, through correction, the batch effect in the data is significantly reduced while the type differences between cells are preserved.
According to table three, in this example single-cell data collected by a number of different sequencing techniques were corrected. The data are human peripheral blood mononuclear cell datasets obtained by 7 different sequencing techniques: 10x Chromium, CEL-Seq2, Drop-Seq, Seq-Well, Smart-Seq2 and inDrops, where the 10x Chromium data are subdivided into 10x Chromium (v2), 10x Chromium (v2) A, 10x Chromium (v2) B and 10x Chromium (v3). The pooled dataset contains a total of 28986 cells and 24946 genes, covering 10 cell types; across the different batches, the majority of cell types are substantially identical.
Table three: human peripheral blood mononuclear cell data set
According to the data in table three, before correcting the data, the raw data yield an average batch coefficient of 1.7227 and an average cell-type coefficient of 1.4681; after correcting the data with the dimension-reduction and correction model for high-dimensional sparse single-cell data, the average batch coefficient is 3.7878 and the average cell-type coefficient is 1.4258. The batch effect is thus greatly reduced, and the accuracy of clustering analysis on the corrected data is significantly improved.
Table four: data of cerebral cortex and hippocampus in adult mice
According to table four above, in the present example the correction of a large single-cell dataset mainly used data from the cerebral cortex and hippocampus of adult mice. The dataset is divided into 15 batches, each batch representing data of a different brain region: the anterior cingulate area (ACA), agranular insular area (AI), auditory areas (AUD), entorhinal area (ENT), hippocampal region (HIP), primary motor area (MOp), secondary motor area (MOs-FRP), subicular regions (PAR-POST-PRE-SUB-ProS), prefrontal areas (PL-ILA-ORB), posterior parietal association areas (PTLp), retrosplenial area (RSP), primary somatosensory area (SSp), supplemental somatosensory areas (SSs-GU-VISC-AIp), temporal association areas (TEa-PERI-ECT) and visual areas (VIS). The pooled dataset contains a total of 1169213 cells and 27824 genes.
Experimental results show that before correcting the batch effect, the raw data yield an average batch coefficient of 3.2738 and an average cell-type coefficient of 1.0772; after correcting the data with the dimension-reduction and correction model for high-dimensional sparse single-cell data, the average batch coefficient is 3.0312 and the average cell-type coefficient is 1.0597. According to these experimental results, the batch effect can be effectively corrected.
Preferably, in obtaining the approximated low-dimensional representation vector: based on a star-shaped generative adversarial network, a target batch is selected among the multiple batches of single-cell data, and the single-cell data of the remaining batches are uniformly mapped by a generator into the data distribution space of the target batch, thereby correcting the batch effects between single-cell data of different batches.
As shown in fig. 3, the low-dimensional representation is corrected for batch effect using a star-shaped generative adversarial network, yielding single-cell data with a uniform batch style. Notably, during correction, batches are treated as domains, and any one batch can be selected as the target domain.
Specifically, the generator takes as input the source-domain data x_real and the batch label b_trg of the target domain to be converted to, and converts the data into imitated target-domain data x_fake = G(x_real, b_trg). The discriminator takes as input real target-domain data x_real and imitated target-domain data x_fake, judges the input data, and is also given the batch label of the data. Since a gradient penalty is used, the adversarial loss is defined in Wasserstein style as

L_adv = E_x[ D_src(x) ] - E_{x,b}[ D_src(G(x, b)) ]

where L_adv is the adversarial loss; D_src is the discrimination score computed by the discriminator; x is source-domain data; b is the batch label of the target domain; and G(x, b) is the imitated target-domain data converted by the generator.
In the present embodiment, the discriminator not only computes the discrimination score D_src of the input data but also predicts the domain label D_cls of the input data. The objective of the discriminator is to maximize the adversarial loss, i.e. to minimize its negative, so its adversarial loss function is

L_D = -L_adv = -E_x[ D_src(x) ] + E_{x,b}[ D_src(G(x, b)) ]
For the input batch labels, the difference between the predicted batch label and the real batch label is measured by cross entropy, calculated as

L_cls^r = E_{x_real, b_org}[ -log D_cls(b_org | x_real) ]

where L_cls^r is the batch-label prediction loss on real data; b_org is the source-domain batch label; x_real is the real source-domain data; and D_cls(x_real) is the batch label of the source-domain data predicted by the discriminator. The inventors of the present application also introduce a gradient penalty L_gp in the discriminator.
In summary, the loss function of the discriminator is defined as

L_D^total = L_D + λ_cls^r · L_cls^r + λ_gp · L_gp

where λ_cls^r is the coefficient of the real-data batch-label prediction loss L_cls^r; λ_gp is the gradient penalty coefficient; and L_gp is the gradient penalty.
In this embodiment, the objective of the generator is, given the input target-domain label b_trg, to convert the source-domain data x_real into imitated target-domain data x_fake, and it expects the imitated data x_fake to be similar enough to real data that the discriminator cannot distinguish the real data x_real from the imitated data x_fake, so that the imitated data obtain a high discrimination score.
Specifically, the generator adversarial loss is defined as

L_G = -E_{x,b}[ D_src(G(x, b)) ]

where L_G is the generator adversarial loss; b is a batch label; and G(x, b) is the imitated data converted by the generator. During conversion, the generator expects the batch label predicted for the imitated data to be the batch label of the target domain, giving the batch-label prediction loss on imitated data

L_cls^f = E_{x, b_trg}[ -log D_cls(b_trg | G(x, b_trg)) ]

where L_cls^f is the batch-label prediction loss on the imitated data; b_trg is the batch label of the target domain; G(x, b_trg) is the imitated target-domain data converted by the generator; and D_cls(G(x, b_trg)) is the batch label of the imitated target-domain data predicted by the discriminator.
Notably, to ensure that critical target features are preserved in the converted data, the generator must satisfy cycle consistency; that is, the generator must be able to reconstruct the imitated target-domain data back into source-domain data using the source-domain label. The cycle consistency loss is calculated as the mean absolute difference between the source-domain data and the reconstructed data:

L_rec = E_{x, b_trg, b_org}[ || x - G(G(x, b_trg), b_org) ||_1 ]

where L_rec is the cycle consistency loss and G(G(x, b_trg), b_org) is the source-domain data reconstructed by the generator from the imitated target-domain data according to the source-domain label.
In summary, the loss function of the generator is

L_G^total = L_G + λ_cls^f · L_cls^f + λ_rec · L_rec

where λ_rec is the coefficient of the reconstruction loss and λ_cls^f is the coefficient of the discriminator's batch-label prediction loss on the imitated data.
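The cycle consistency term described above can be sketched in numpy as a plain L1 difference between the source data and the round-trip reconstruction; the reconstruction itself would come from the trained generator, so here it is just an assumed toy array:

```python
import numpy as np

def cycle_consistency_loss(x_source, x_reconstructed):
    """L_rec: mean absolute difference between the source-domain data and
    the data reconstructed by converting to the target batch and back,
    i.e. G(G(x, b_trg), b_org)."""
    return np.mean(np.abs(x_source - x_reconstructed))

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x_roundtrip = np.array([[1.5, 2.0],
                        [3.0, 3.5]])  # assumed generator round-trip output
loss = cycle_consistency_loss(x, x_roundtrip)
```

A perfect round trip gives a loss of 0; any deviation from the source data is penalized proportionally to its absolute size.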
Preferably, during correction of the batch effect between different batches of single-cell data by the star-shaped generative adversarial network: pre-clustering is performed on the target batch, and the data of the other batches are assigned to the nearest cell clusters of the target batch, thereby obtaining cell-type pre-labels for the data of all batches.
In particular, the distance between data of the same cell type in different batches is smaller than the distance between data of different cell types in the same batch; that is, the differences between cell types are larger than the differences between batches. More preferably, after pre-clustering, a number of distinct cell clusters is obtained, and the center position of each cell cluster is then calculated as the mean of the samples in that cluster.
In this example, as can be seen from fig. 4, one batch is selected as the target batch, a number of distinct cell clusters are obtained by pre-clustering, and the center positions of the cell clusters are calculated. For the data of the remaining batches, the distance between each sample and each cluster center of the target batch is calculated, and each sample is assigned to the cell cluster of the target batch whose center is closest.
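The pre-clustering assignment step can be sketched as follows: cluster centers are the per-cluster sample means, and each sample from a non-target batch is assigned to the nearest center. The helper names and toy coordinates are illustrative, not from the original disclosure.

```python
import numpy as np

def cluster_centers(data, labels):
    """Center of each pre-clustered cell cluster: the mean of its samples."""
    return np.stack([data[labels == k].mean(axis=0) for k in np.unique(labels)])

def assign_to_nearest_center(centers, samples):
    """Assign each sample of a non-target batch to the nearest target-batch
    cluster center (Euclidean distance), giving a cell-type pre-label."""
    # dists[i, k] = distance of sample i to center k
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Target batch: two tight clusters in 2-D.
target = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [9.8, 10.0]])
target_labels = np.array([0, 0, 1, 1])
centers = cluster_centers(target, target_labels)

# Samples from another batch receive the pre-label of the nearest cluster.
pre_labels = assign_to_nearest_center(centers, np.array([[0.5, -0.5], [9.0, 11.0]]))
```

In practice the same distances also feed the training-weight computation described below.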
By processing the cell clusters of the target batch and the distances between the samples of the remaining batches and those clusters as described above, cell-type pre-labels can be readily obtained.
To reduce the impact on the results of samples whose cell types do not occur in the target batch, the present application also assigns a training weight to each sample based on its distance from the cell-cluster center. A sample whose cell type matches that of the target batch will be closer to its nearest cluster in the target batch than a sample whose cell type does not match. Samples closer to a cluster center in the target batch are therefore given larger training weights, and samples farther away are given smaller ones.
Specifically, the sample distances within each batch are first standardized, and the weights are then mapped into the interval [0, 1] by min-max normalization. The distance weight is calculated as

w_x = ( d̂_x - min_{x∈B} d̂_x ) / ( max_{x∈B} d̂_x - min_{x∈B} d̂_x ),  with  d̂_x = ( d_x - μ_B ) / ( σ_B + ε )

where d_x (x∈B) is the reciprocal of the distance between sample x in batch B and its target-batch cluster center; μ_B is the mean of d_x within batch B; σ_B is the standard deviation of d_x within batch B; ε is a very small constant; min_{x∈B} d̂_x is the minimum of the standardized values; and max_{x∈B} d̂_x is the maximum.
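A numpy sketch of the distance-weight formula above, with `dist_to_center` holding each sample's distance to its assigned target-batch cluster center; the function name and toy values are illustrative:

```python
import numpy as np

def distance_weights(dist_to_center, eps=1e-8):
    """Training weights per the formula above: d is the reciprocal of each
    sample's distance to its target-batch cluster center, standardized
    within the batch and then min-max rescaled into [0, 1]."""
    d = 1.0 / (dist_to_center + eps)           # reciprocal distance
    d_hat = (d - d.mean()) / (d.std() + eps)   # standardize within batch B
    return (d_hat - d_hat.min()) / (d_hat.max() - d_hat.min() + eps)

# Three samples at distances 0.1, 1.0 and 10.0 from their cluster centers.
w = distance_weights(np.array([0.1, 1.0, 10.0]))
```

The closest sample receives a weight near 1 and the farthest a weight of 0, so clean, well-matched samples dominate training.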
Preferably, when the pre-labels are obtained by pre-clustering, they are used as reference labels to constrain the generator so that the conversion process does not change the cell type of the data. With the cell-type pre-labels from the pre-clustering algorithm as reference labels, the generator is constrained to expect that the cell-type labels predicted for the data before and after conversion agree with the pre-labels, which prevents the conversion process from altering the cell type of the data.
It will be appreciated that when converting single-cell data between different batches, the cell types in the data are unknown. With unknown cell types, the data may be randomly mapped onto other cell types in the data distribution space of the target batch, producing erroneous corrections.
Further, according to other embodiments of the present application, the objective of the discriminator is to judge whether the data are real and to predict the batch label of the data. The recognition target of the cell-type recognition model is therefore inconsistent with that of the discriminator, so the two cannot share network parameters.
Preferably, when the pre-labels are obtained by pre-clustering, a separate cell-type recognition model is constructed to predict the cell-type label of the data. The prediction is converted into a probability vector over the corresponding clusters using the softmax function, defined as

C(x)_i = exp(z_i) / Σ_j exp(z_j)

where z_i is the model's score (logit) for cluster i. The cell-type recognition model uses cross entropy as its loss function, defined as

L_C = E_{x, c}[ -log C(c | x) ]

where c is the cell-type pre-label obtained by the pre-clustering algorithm and C(x) is the prediction of the cell-type recognition model.
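The softmax probability vector and cross-entropy loss of the cell-type recognition model can be sketched as follows; `z` denotes the model's unnormalized scores (logits), a symbol introduced here for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cell_type_loss(logits, pre_labels):
    """Cross entropy between the recognition model's softmax probabilities
    and the cell-type pre-labels from pre-clustering."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(pre_labels)), pre_labels] + 1e-12))

logits = np.array([[100.0, 0.0],
                   [0.0, 100.0]])  # confident, correct toy predictions
loss = cell_type_loss(logits, np.array([0, 1]))
```

Each softmax row sums to 1, and confident correct predictions drive the cross-entropy toward 0.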
Notably, the cell-type recognition model is similar to the discriminator: it accesses the real data directly and passes its prediction results to the generator, which adjusts its network parameters accordingly. The cell-type recognition model therefore adopts the same training regime as the discriminator and is trained simultaneously with it.
In summary, in the discriminator, the cell-type recognition model must predict the cell types of real data and expects the predicted labels of real data to agree with the pre-labels; in the generator, the generator expects the converted imitated data to preserve the original cell-type characteristics, and expects the cell-type recognition model's predicted labels on the imitated data to also agree with the pre-labels. The final loss functions of the discriminator and the generator are therefore

L_D^final = L_D + λ_cls^r · L_cls^r + λ_c^r · L_c^r + λ_gp · L_gp
L_G^final = L_G + λ_cls^f · L_cls^f + λ_c^f · L_c^f + λ_rec · L_rec

where L_D^final and L_G^final are the final discriminator and generator loss functions; L_D is the discriminator adversarial loss; L_G is the generator adversarial loss; L_cls^r is the batch-label prediction loss on real data and λ_cls^r its coefficient; L_c^r is the cell-type prediction loss on real data and λ_c^r its coefficient; L_gp is the gradient penalty and λ_gp its coefficient; L_cls^f is the batch-label prediction loss on imitated data and λ_cls^f its coefficient; L_c^f is the cell-type prediction loss on imitated data and λ_c^f its coefficient; and L_rec is the cycle consistency loss with coefficient λ_rec. In this embodiment, λ_cls^r and λ_cls^f are both set to 1, λ_c^r and λ_c^f are both set to 0.5, λ_gp is set to 10, and λ_rec is set to 10.
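Assembling the final objectives with the coefficient values stated in this embodiment can be sketched as plain arithmetic; the individual loss terms are assumed to be precomputed scalars, and all names here are illustrative:

```python
# Coefficient values stated in this embodiment.
LAM_CLS = 1.0   # batch-label prediction loss coefficients (real and imitated)
LAM_C = 0.5     # cell-type prediction loss coefficients (real and imitated)
LAM_GP = 10.0   # gradient penalty coefficient
LAM_REC = 10.0  # cycle-consistency (reconstruction) coefficient

def final_discriminator_loss(l_d_adv, l_cls_real, l_c_real, l_gp):
    """L_D^final = L_D + 1 * L_cls^r + 0.5 * L_c^r + 10 * L_gp."""
    return l_d_adv + LAM_CLS * l_cls_real + LAM_C * l_c_real + LAM_GP * l_gp

def final_generator_loss(l_g_adv, l_cls_fake, l_c_fake, l_rec):
    """L_G^final = L_G + 1 * L_cls^f + 0.5 * L_c^f + 10 * L_rec."""
    return l_g_adv + LAM_CLS * l_cls_fake + LAM_C * l_c_fake + LAM_REC * l_rec
```

For example, unit-valued loss terms with a gradient penalty of 0.01 give a discriminator total of 2.6, and unit terms with a reconstruction loss of 0.1 give a generator total of 3.5.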
Although various embodiments are described in this disclosure, the present application is not limited to the specific embodiments described above; embodiments modified or varied in light of the above description may achieve the same, equivalent, or similar effects in the same or a different form. Data acquisition, processing, output, and judgment performed using such modifications or variations still fall within the scope of alternative embodiments of the present application.
Although the present application has been described by way of examples, one of ordinary skill in the art will recognize that many variations and modifications are possible without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and modifications.

Claims (10)

1. A single cell data batch effect correction method based on generation of an antagonism network, the method comprising:
Sequencing the single-cell data to obtain a high-dimensional single-cell gene expression count matrix;
performing calculation and learning on the high-dimensional single-cell gene expression count matrix with a recognition model by means of variational inference so as to approximate the posterior distribution, and approximating the latent low-dimensional representation distribution using randomly generated Gaussian noise together with the distribution mean and distribution variance of the high-dimensional single-cell data, thereby obtaining an approximate low-dimensional representation vector;
reconstructing the low-dimensional representation vector based on a generation model, thereby obtaining single-cell reconstruction original data;
when the generation model reconstructs the low-dimensional representation vector, the low-dimensional representation vector carries a batch label for removing batch effects;
and incorporating the generator of a generative adversarial network into the recognition model so as to form a new generation model, and utilizing this model to judge and score the high-dimensional single-cell gene expression count matrix and the single-cell reconstructed raw data, so as to improve the ability of the generation model to reconstruct data.
2. The method of claim 1, wherein reconstructing the low-dimensional representation vector comprises:
fitting the single-cell data by a negative binomial distribution and fitting dropout events with a zero-inflation model, thereby accounting for dropout events and reducing the bias between the single-cell reconstructed raw data and the single-cell data.
3. The method for single-cell data batch effect correction based on generating an antagonism network of claim 2, wherein fitting the single-cell data based on the zero-inflation model and the negative binomial distribution model comprises:
distinguishing dropout events from non-dropout events in the negative binomial distribution model, and fitting with the generation model the three parameters of the zero-inflated negative binomial distribution model: the probability of a dropout event occurring, the dispersion of the negative binomial distribution, and the expression mean.
4. The method for correcting single-cell data batch effect based on generating a countermeasure network according to claim 1, wherein, for the discriminator in the generating countermeasure network:
the generator in the generating countermeasure network updates its network parameters according to the scores of the discriminator, thereby generating single-cell reconstructed raw data similar to the real single-cell data.
5. The method for single-cell data batch effect correction based on generating an antagonism network of claim 1, wherein, in obtaining the approximated low-dimensional representation vector:
selecting a target batch for the single-cell data of multiple batches based on a star-shaped generation countermeasure network, and uniformly mapping the single-cell data of the rest batches into a data distribution space where the target batch is located through a generator, so that batch effects among the single-cell data of different batches are corrected.
6. The method of claim 5, wherein, during correction by the star-shaped generation countermeasure network of the batch effects between different batches of the single-cell data:
pre-clustering is performed in the target lot, and data of other lots are divided into cell clusters of the target lot closest to the target lot, thereby obtaining cell type pre-labels of a plurality of lot data.
7. The method of claim 6, wherein the pre-clustering is performed to obtain a plurality of different cell clusters, and the center position of the cell clusters is calculated according to the average value of samples in the plurality of cell clusters.
8. The method of claim 7, wherein in processing the data of the remaining batches, the distance between the sample of each batch and the center of the cell cluster of the target batch is calculated separately, and the data is divided into the cell clusters of the target batch closest to the target batch.
9. The method of claim 6, wherein the pre-labeling is used as a reference label when the pre-labeling is obtained by the pre-clustering, thereby constraining a generator to prevent transformation processes from changing cell types of data.
10. The method of claim 9, wherein a cell type identification model is constructed separately when the pre-label is obtained by the pre-clustering, thereby predicting a cell type label of data.
CN202310723261.5A 2023-06-19 2023-06-19 Single cell data batch effect correction method based on generation of countermeasure network Pending CN116825186A (en)

Publications (1)

Publication Number Publication Date
CN116825186A true CN116825186A (en) 2023-09-29
