CN114038505A - Method and system for integrating multi-source single cell data on line - Google Patents

Method and system for integrating multi-source single cell data on line

Publication number
CN114038505A
Authority
CN
China
Prior art keywords
data
cell
single cell
batch
space
Prior art date
Legal status
Pending
Application number
CN202111213929.9A
Other languages
Chinese (zh)
Inventor
张强锋
熊磊
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Application filed by Tsinghua University
Priority to CN202111213929.9A
Publication of CN114038505A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention discloses a method for online integration of multi-source single cell data, comprising the following steps: inputting single cell data with batch effects from a plurality of different sources; projecting the single cell data, by a batch-effect-independent encoder, into a generalized single cell space that is independent of batch effects and retains only biological information; aligning cells of the same type from different sources in the single cell space, while cells of different types are positioned separately from each other; and adding batch-specific variable information to the information of each single cell in the single cell space through a batch-specific decoder to reconstruct the single cell data. The invention achieves batch-independent single cell data integration and generalizes well: after training, the model aligns the training data well and also aligns data from new batches, so that newly generated data can be integrated continuously and an online integration function is realized.

Description

Method and system for integrating multi-source single cell data on line
Technical Field
The invention relates to the technical field of biology, in particular to a method and a system for integrating multi-source single cell data on line.
Background
Single cell sequencing technology (scRNA-seq) and single cell epigenomic technology (scATAC-seq) can resolve different cell types and states, elucidating the organization and function of various biological systems. With the explosive accumulation of single cell studies, integrated analysis of experimental data from different settings is essential for characterizing heterogeneous cell populations. However, critical biological information is often confounded with batch effects caused by different sample donors, conditions, and analysis platforms. To detect batch effects, the time of each experiment is typically recorded as a variable, and the differentially expressed genes are then clustered to see whether they are associated with time; if so, a batch effect is shown to exist.
Generally, data from different platforms, data from different periods on the same platform, data from the same sample processed with different reagents, and data from the same sample collected at different times will all exhibit batch effects. If widespread, this effect deserves serious attention, because it can invalidate an entire experiment and its final conclusions. Data from different platforms that carry batch effects cannot simply be pooled.
For single cell data, current strategies mainly identify similar cells or cell populations across batches. They have two drawbacks: first, they tend to mix non-overlapping cell populations from different batches of source data; second, they only remove the batch effects of the batches currently at hand and cannot handle batch effects in other, newly generated batches, so the entire integration process must be re-run whenever a new batch is added. Another strategy models the inherent distribution and structure of the input single cell data using a conditional variational autoencoder framework, which can preserve the overall structure of the data in both high- and low-dimensional spaces; however, its encoder also takes a set of batch-conditioning parameters, which hinders the encoder from learning to eliminate batch effects.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a solution that overcomes, or at least partially solves, the above-mentioned problems. Accordingly, in one aspect of the present invention, there is provided a method for online integration of multi-source single-cell data, the method comprising:
inputting single cell data with batch effect from a plurality of different sources;
projecting, by a batch effect-independent encoder, the single cell data to a batch effect-independent, generalized biological information-only single cell space;
aligning the same type of cells from different sources in the single cell space, wherein the different types of cells are respectively positioned and separated from each other;
adding specific batch variable information to each single cell information of the single cell space through a specific decoder to reconstruct single cell data.
Optionally, the method further includes: quantifying the degree of separation between cell types using a silhouette score; and quantifying the degree of alignment of the same cell type between different batches using a batch entropy mixing score.
Optionally, projecting the single cell data, by a batch-effect-independent encoder, into a batch-effect-independent generalized single cell space that retains only biological information comprises:
randomly sampling from all the input single cell data of different batches from different sources to form mini-batches; and performing batch normalization on each mini-batch to reduce distribution deviation.
Optionally, aligning cells of the same type from different sources in the single cell space comprises: constructing test data sets having common cell types by down-sampling the major cell types in partially overlapping data sets, so as to integrate the partially overlapping data sets.
Optionally, the method further includes: after the single-cell space is constructed, additional data is projected onto the established single-cell space.
Optionally, the additional data is projected onto a new location in proximity to a cell similar thereto.
Optionally, aligning the same type of cells from different sources in the single cell space, and positioning the different types of cells separately from each other, comprising: the common cell types are arranged together and at the same location in the cell space, and the non-overlapping cell types are individually localized in the cell space.
The invention also provides a system for integrating multi-source single cell data on line, which comprises:
an original data acquisition module for inputting single cell data with batch effects from a plurality of different sources;
a single cell data space transformation module for projecting the single cell data into a generalized single cell space that is independent of batch effects and retains only biological information;
an asymmetric variational autoencoder for aligning cells of the same type from different sources in the single cell space, with cells of different types positioned separately from each other; and
a batch-specific decoder for adding batch-specific variable information to the information of each single cell in the single cell space so as to reconstruct the single cell data.
Optionally, the system further comprises: an asymmetric autoencoder evaluation module for quantifying the degree of separation between cell types using a silhouette score, and for quantifying the degree of alignment of the same cell type between different batches using a batch entropy mixing score.
Optionally, the single cell data space transformation module includes:
a random sampling submodule for randomly sampling from all the input single cell data of different batches from different sources to form mini-batches; and
a normalization submodule for performing batch normalization on the mini-batches to reduce distribution deviation.
Optionally, the system further comprises: a single cell data updating module for projecting additional data onto the established single cell space after the single cell space is constructed.
The technical scheme provided by the application has at least the following technical effects or advantages. The application adopts a variational autoencoder (VAE) framework with a batch-independent encoder that is trained to retain only batch-independent biological variation, thereby achieving batch-independent single cell data integration with very good generalization: after training, the model aligns the training data well and also aligns data from new batches, so that newly generated data can be integrated continuously and an online integration function is realized. The invention accurately integrates partially overlapping data sets without mixing non-overlapping cell populations. The invention can generate a continuously expandable single cell atlas from single cell data of a plurality of different sources, can process very large data sets efficiently, and supports the construction and study of large-scale single cell atlases based on integrated data from various sources.
The above description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the technical solutions of the present invention and the objects, features, and advantages thereof more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flow chart of a method for integrating multi-source single-cell data online provided by the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
To suppress the batch effects of single cell data from different sources, the application provides a new framework for online integration of single cell data that uses an asymmetric variational autoencoder: batch-related information is withheld from the encoder, so that the encoder retains only biological information unrelated to batch effects and biological differences are preserved, while batch-related information is supplied only to the decoder, which retains batch specificity so that the original data can be restored.
In one aspect of the present invention, there is provided a method for online integration of multi-source single-cell data, as shown in fig. 1, the method comprising:
S1, inputting single cell data with batch effects from a plurality of different sources;
S2, projecting the single cell data, by a batch-effect-independent encoder, into a generalized single cell space that is independent of batch effects and retains only biological information;
S3, aligning cells of the same type from different sources in the single cell space, with cells of different types positioned separately from each other;
S4, adding batch-specific variable information to the information of each single cell in the single cell space through a batch-specific decoder to reconstruct the single cell data.
The underlying idea of the invention is that the encoder acts as a data projector: it integrates single cell data from various batches into a generalized, batch-invariant cell embedding space, thereby removing batch-related variation from the single cell data while preserving batch-invariant biological signals in the cell embedding. It thus becomes a tool for integrated analysis of various single cell data sets that does not depend on searching for similar cells.
The encoder used in the invention is independent of batch and shared by the single cells of every batch, so it generalizes very well: once trained, the encoder model aligns the training data well and, because the batch effect is eliminated, also aligns new data well. Using the batch-independent encoder to project single cell data into a generalized cell embedding space integrates single cell data from a plurality of different sources; the aim is to eliminate the batch effect, and the encoder is trained to retain only biological differences.
Because the trained encoder is independent of batch, even data from new batches can be aligned well, so newly generated data can be integrated continuously and an online integration function is realized. The invention accurately integrates partially overlapping data sets without mixing non-overlapping cell populations. The invention can generate a continuously expandable single cell atlas from single cell data of a plurality of different sources, can process very large data sets efficiently, and supports the construction and study of large-scale single cell atlases based on integrated data from various sources.
The method projects single cell data from different sources into a generalized cell space, aligning the same cell types together, aligning the shared portion of partially overlapping single cell data, and keeping non-overlapping single cell data apart, thereby achieving online integration of multi-source cell data. The model needs to be trained only once; because the encoding is independent of batch, the model integrates even new batches of data well, so the integrated data can be updated continuously.
The integrated single cell data can be used to construct a cell atlas that may cover many tissues and millions of cells. Such an atlas is very useful: as new single cell data are integrated, the atlas is continuously updated, and it can be compared with newly collected cell data to discover cell-specific features.
A batch-specific decoder normalizes each batch independently in order to learn the differences caused by batch effects; this decoupling increases the decoder's capacity to capture batch effects. Domain-specific batch normalization (DSBN), a multi-branch form of batch normalization, is performed in the decoder to support the insertion of explicit batch variables when reconstructing the single cell data.
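The per-batch normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `domain_specific_batch_norm` is hypothetical, and the learned scale and shift parameters and running statistics that a real batch normalization layer carries are omitted.

```python
import math

def domain_specific_batch_norm(rows, batch_labels, eps=1e-5):
    """Standardize each feature using statistics computed within each batch only.

    rows: list of feature vectors (one per cell); batch_labels: one label per cell.
    Assumption: no learned scale/shift, no running statistics.
    """
    normalized = [None] * len(rows)
    for label in set(batch_labels):
        # One "branch" per batch: collect the cells that belong to this batch.
        members = [i for i, b in enumerate(batch_labels) if b == label]
        n_features = len(rows[members[0]])
        for i in members:
            normalized[i] = [0.0] * n_features
        for f in range(n_features):
            column = [rows[i][f] for i in members]
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            for i in members:
                normalized[i][f] = (rows[i][f] - mean) / math.sqrt(var + eps)
    return normalized
```

Two batches that differ only by a constant offset are mapped to the same normalized values, which is the sense in which each branch absorbs its own batch-specific shift.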
During encoder training, all input single cell data of different batches from different sources are randomly sampled to form mini-batches, and batch normalization is applied to each mini-batch so that the data are approximately consistent in scale and distribution, reducing distribution deviation.
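The random sampling step can be sketched as below: cells are pooled across all source batches and shuffled, so each mini-batch typically mixes cells from several sources. The function name `mixed_mini_batches` and its parameters are illustrative assumptions, not names from the patent.

```python
import random

def mixed_mini_batches(batch_labels, mini_batch_size, seed=0):
    """Yield lists of cell indices drawn from the pooled data, ignoring batch boundaries.

    batch_labels: the source batch of each cell (only its length is needed here,
    because sampling deliberately does not depend on the source batch).
    """
    indices = list(range(len(batch_labels)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), mini_batch_size):
        yield indices[start:start + mini_batch_size]
```

Each yielded list of indices would then be passed through the encoder, with batch normalization statistics computed over that mixed mini-batch rather than over any single source batch.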
The data of each mini-batch are aligned in the cell embedding space under a KL divergence constraint, which includes: arranging common cell types together at the same position in the cell space, while non-overlapping cell types are positioned separately in the cell space; integrating partially overlapping data sets by constructing test data sets having common cell types through down-sampling of the major cell types in the partially overlapping data sets; and, after the cell embedding space is constructed, projecting additional data onto the established cell embedding space, where the additional data are projected onto new locations close to similar cells.
Cells of the same type from different sources are aligned in the cell embedding space by the encoder, which places the same cell types together while cells of different types are positioned separately from each other. For partially overlapping single cell data, the overlapping portion is aligned and the non-overlapping single cell data are positioned individually so that true biological differences are maintained; test data sets having common cell types, constructed by down-sampling the major cell types in partially overlapping data sets, are used to integrate the partially overlapping data sets without mixing non-overlapping cell populations.
The specific process of projecting single cell data into a generalized cell embedding space is described in detail below. The central goal of single cell data integration is to identify and align similar cells across different batches while retaining true biological differences within the same cell type and between different cell types. The key is to separate the batch-related components of the single cell data from the batch-independent components and then project the batch-independent components into a generalized, batch-independent cell embedding space. To accomplish this, the invention constructs an asymmetric VAE that combines a batch-independent encoder, which extracts only biologically relevant latent features (z) from the input single cell data (x), with a batch-specific decoder, which reconstructs the original data from the latent features (z) once batch information has been inserted.
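The asymmetry described above can be illustrated with a toy forward pass. This is a sketch under stated assumptions rather than the patented model: the class name `AsymmetricVAE`, the single linear layers, and the use of a one-hot batch code concatenated to z (a simple stand-in for the batch-specific normalization in the decoder) are all illustrative choices.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: x is a vector, weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class AsymmetricVAE:
    """Toy model: the encoder never sees the batch label, the decoder does."""

    def __init__(self, n_genes, n_latent, n_batches, seed=0):
        self.rng = random.Random(seed)
        self.n_batches = n_batches
        self.enc_mu = init_layer(n_genes, n_latent, self.rng)
        self.enc_logvar = init_layer(n_genes, n_latent, self.rng)
        # Assumption: batch information enters the decoder as a one-hot code.
        self.dec = init_layer(n_latent + n_batches, n_genes, self.rng)

    def encode(self, x):
        # Batch-independent: only the expression vector x is used.
        return linear(x, *self.enc_mu), linear(x, *self.enc_logvar)

    def sample(self, mu, logvar):
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z, batch_index):
        # Batch-specific: the same z is reconstructed differently per batch.
        one_hot = [1.0 if b == batch_index else 0.0 for b in range(self.n_batches)]
        return linear(z + one_hot, *self.dec)
```

Because `encode` takes no batch argument, a cell from a batch never seen in training can still be embedded, which is the property the online integration relies on.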
The decoder is responsible for reconstructing the raw data. Because batch information is supplied to the decoder at reconstruction time, the encoder is driven during model training to learn a batch-independent representation of each single cell. During this learning process, randomly sampling across all input single cells, so that the data are organized into mini-batches rather than by their original batches, facilitates the encoder's learning. The data of each mini-batch are forced to align in the same cell embedding space under a KL divergence constraint. The encoder thus serves as a projector that maps single cell data of different batches into a generalized, batch-independent cell embedding space, removing batch-dependent variation while retaining batch-independent biological information.
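The text does not state the training objective explicitly. As an assumption, the standard VAE objective adapted to this asymmetric setting can be written as follows, where b denotes the batch label (a symbol introduced here for illustration), the approximate posterior q(z | x) does not depend on b, and the prior p(z) is a standard normal:

```latex
\mathcal{L}(x, b) = -\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z, b)\big]
  + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big),
\qquad
\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}\sigma^{2}) \,\|\, \mathcal{N}(0, I)\big)
  = -\tfrac{1}{2} \sum_{j} \big(1 + \log \sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\big)
```

The first term is the reconstruction loss computed by the batch-specific decoder; the second is the KL divergence constraint that pulls the embeddings of every mini-batch toward the same prior, and hence into the same cell embedding space.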
The single cell data integration method provided by the invention can be applied to various well-organized scRNA-seq data sets including human pancreas, heart, liver and the like, and can also be applied to human non-small cell lung cancer, peripheral blood mononuclear cells and the like. Moreover, experiments prove that compared with some existing single cell data integration methods, the method can better remove batch characteristics and better realize the fitting of cells of the same type.
The silhouette score assesses how well biological differences (cell types) are separated, and the batch entropy mixing score quantifies the degree to which the same cell type from different batches is aligned. To better illustrate the effectiveness of the invention, its single cell integration performance can be quantified using the silhouette score and the batch entropy mixing score. Experiments show that, as evaluated by the silhouette score and the batch entropy mixing score, the performance of the invention is comparable to that of the best existing data integration methods, Seurat v3 and Harmony. For the liver data set (which contains batch-specific cell types and is therefore a partially overlapping data set), the invention obtains a markedly lower batch entropy mixing score than Seurat v3 and Harmony. However, Seurat v3 and Harmony may achieve a higher batch entropy mixing score because they misalign different cell types together, which in fact is not ideal when evaluating batch mixing on partially overlapping data sets.
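Both metrics can be sketched in a few lines. The silhouette score below follows its usual definition; the batch entropy mixing score is a simplified variant (normalized entropy of batch labels among each cell's k nearest neighbours) and is an assumption for illustration, since the text does not give the exact formula used. Function names are hypothetical.

```python
import math
from collections import Counter

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points, using Euclidean distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dist_by_label = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dist_by_label.setdefault(other, []).append(math.dist(p, q))
        own = dist_by_label.pop(lab, None)
        if not own or not dist_by_label:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in dist_by_label.values()) # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def batch_entropy_mixing(points, batches, k=3):
    """Simplified score: normalized entropy of batch labels among k nearest neighbours."""
    n_batches = len(set(batches))
    total = 0.0
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        counts = Counter(batches[j] for j in neighbours)
        m = len(neighbours)
        entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
        total += entropy / math.log(n_batches)
    return total / len(points)
```

A high silhouette score computed on cell type labels indicates well-separated cell types, while a high mixing score computed on batch labels indicates that neighbouring cells come from several batches. As the text notes, the second number alone can be misleading when batches share only some cell types.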
Experiments show that the method is scalable and computationally efficient on large data sets. As an embodiment, the invention was applied to 1,369,619 cells from a human fetal atlas data set (two data batches); the data were accurately integrated and the same cell types were well aligned. On a data set obtained by down-sampling the human fetal atlas data set, the invention requires significantly less run time and memory than MNN, Seurat v3, and Conos. The method runs efficiently on a GPU device: integrating the 1M-cell data set takes only 10 minutes and 16 GB of storage space, greatly reducing run time and storage requirements compared with other existing methods.
The invention is applicable not only to scRNA-seq data but also to scATAC-seq data and to cross-modality data (such as the scRNA-seq and scATAC-seq data mentioned above). As a specific embodiment, the invention can integrate mouse brain scATAC-seq data (two batches, profiled by snATAC-seq and 10X respectively), aligning the common cell sub-populations and keeping the distinct cell sub-populations separate. The invention also integrates cross-modality PBMC data from scRNA-seq and scATAC-seq, demonstrating that it correctly integrates both types of data and distinguishes rare cells specific to the scRNA-seq data, including pDCs and platelets. The invention therefore has broad integration capability and can be applied to a wide variety of single cell data types.
Methods that integrate single cell data based on local cell similarity have difficulty with partially overlapping data sets because they often overcorrect (for example, by markedly mixing cell types). The invention avoids this problem because it is a global integration method that projects cells into a generalized cell space. For example, the liver data set is a partially overlapping data set: the liver cell population comprises several batch-specific subtypes, three of which are specific to liver-GSE124395, while two others appear only in liver-GSE115469.
To illustrate clearly how the invention handles partially overlapping data sets, as a specific embodiment we constructed test data sets with a range of common cell types by down-sampling six major cell types of the pancreas data set. The integration of the invention is accurate in all cases and aligns the same cell types without overcorrection, unlike the existing methods Seurat v3 and Harmony, which often mix cell types, especially when the overlap is low. As a more complex example of partial overlap, the down-sampling of cell types was repeated on the 12 cell types of the PBMC data set, and similar results were observed. The invention is therefore very effective at retaining biological variation in partially overlapping data sets.
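One way to construct such partially overlapping test data sets is sketched below. The text does not specify the exact procedure, so the details are assumptions: the function name `make_partial_overlap` is hypothetical, a chosen number of major cell types is kept in every batch, and each remaining major type is kept in exactly one batch.

```python
import random

def make_partial_overlap(cells, major_types, n_common, seed=0):
    """Build a partially overlapping data set from (batch, cell_type) records.

    n_common major types stay in every batch; each remaining major type is
    retained in a single batch only (assigned round-robin). Cell types not
    listed in major_types are left untouched.
    """
    rng = random.Random(seed)
    batches = sorted({b for b, _ in cells})
    shuffled = list(major_types)
    rng.shuffle(shuffled)
    # Types after the first n_common are removed from all batches but one.
    owner = {t: batches[i % len(batches)] for i, t in enumerate(shuffled[n_common:])}
    return [(b, t) for b, t in cells if owner.get(t, b) == b]
```

Varying `n_common` from zero up to the number of major types gives a series of test data sets with decreasing overlap, on which alignment of shared types and separation of batch-specific types can be checked.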
In the method provided herein, unseen data can be projected into an existing cell embedding space. The integration provided by the invention can be carried out accurately, quantifiably, and efficiently because of the encoder's ability to project cells from various sources into a generalized, batch-invariant cell embedding space. Once the cell embedding space has been established by integrating the existing data, the invention can use the same encoder to project additional data (such as data not previously seen) onto the same cell embedding space. As a specific embodiment, on the pancreas data set, SCALEX integration removes the strong batch effect present in the raw data, aligning the same cell types together and keeping distinct cell types clearly separated. These cell types include rare cells such as Schwann cells and epsilon cells.
As a specific embodiment, three new batches of pancreatic tissue data are projected into the pancreas cell space using the same encoder that was trained on the pancreas data set mentioned above. After projection, the majority of cells of the new batches are accurately aligned with the correct cell types in the pancreas cell space, so that they can be precisely annotated by transferring the cell type labels. Using the cell type information of the original studies as the gold standard, we benchmarked the accuracy of the annotation by calculating the Adjusted Rand Index (ARI), the Normalized Mutual Information (NMI), and the F1 score. Experiments show that the method is highly accurate.
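Of the three metrics named above, the Adjusted Rand Index is the least familiar, so a small reference computation may help. The sketch below follows the standard pair-counting definition; the function name is illustrative.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand Index from the contingency table of two labelings."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    # Pairs of cells that share both the true and the predicted label.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, predicted_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 when the transferred labels reproduce the gold-standard grouping exactly (label names need not match) and is near 0 for a random assignment.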
Projecting new single cell data into the generalized cell embedding space enables the invention to extend the cell space. As a specific embodiment, we projected two additional melanoma data batches onto the previously constructed PBMC space. The common cell types were correctly projected onto the same positions in the PBMC cell space. For tumor cells and plasma cells, which appear only in the melanoma data batches, SCALEX did not project them onto any existing cell population in the PBMC space but onto new locations near similar cells: plasma cells were projected near B cells, and tumor cells near HSC cells.
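Once new cells have been embedded by the trained encoder, labels can be transferred from the atlas by nearest neighbours in the embedding space. The sketch below is an assumption-laden illustration: the function name `transfer_labels`, the majority vote over k neighbours, and the optional `max_distance` cutoff for flagging cells that fall in a new location are not taken from the text.

```python
import math
from collections import Counter

def transfer_labels(atlas_embeddings, atlas_labels, new_embeddings, k=3, max_distance=None):
    """Annotate projected cells by majority vote among their k nearest atlas cells.

    Returns None for a cell whose nearest atlas cell is farther than max_distance,
    i.e. a cell that landed in a location not occupied by any existing population.
    """
    results = []
    for z in new_embeddings:
        ranked = sorted(range(len(atlas_embeddings)),
                        key=lambda j: math.dist(z, atlas_embeddings[j]))[:k]
        nearest = math.dist(z, atlas_embeddings[ranked[0]])
        if max_distance is not None and nearest > max_distance:
            results.append(None)  # candidate new cell type
        else:
            votes = Counter(atlas_labels[j] for j in ranked)
            results.append(votes.most_common(1)[0][0])
    return results
```

In the melanoma example, cells returned as `None` would correspond to populations such as tumor or plasma cells that sit near, but not on, an existing PBMC population; the appropriate cutoff would have to be chosen per data set.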
The invention enables post hoc annotation of unknown cell types in an existing cell space using new data. As a specific embodiment, a group of cells not previously described in the pancreas data set was found to express known epithelial genes at high levels. A collection of epithelial cells was therefore assembled from bronchial epithelium data and projected into the pancreas cell space, and a group of antigen-presenting airway epithelial cells (SCL16A+ epithelial cells) was found to project onto the same location as the undescribed cells. Combined with the observation that the two cell populations display similar marker gene expression, this indicates that the undescribed cells are also SCL16A+ epithelial cells. By supporting exploratory analysis across a large number of different data sets, the invention enables discovery in cell biology.
Integrating partially overlapping data into a generalized cell embedding space makes the invention a powerful tool for constructing single cell atlases from large and diverse collections of data. As a specific embodiment, the invention is applied to integrate mouse data sets. Although the raw data show strong batch effects, the invention integrates three batches of mouse data into a unified cell embedding space. The common cell types are well aligned at the same positions in the cell space, and the non-overlapping cell types are positioned individually, indicating that biological variation is well preserved.
Importantly, the resulting atlas can be used and further expanded by projecting new single cell data onto it, which supports comparative studies between the original atlas and the new data. To illustrate, we projected two additional data batches of mouse tissues and two single-tissue data sets into the mouse atlas space and found that the same cell types in the new data batches were correctly projected onto the same locations in the cell embedding space of the initial mouse atlas, as confirmed by accurate cell type annotation of the new data through transfer of the labels of the corresponding cell types in the initial atlas. On the one hand, the mouse atlas can be used to accurately identify the cells of new data based on their projected locations in the cell space; on the other hand, projecting new data expands the existing atlas.
In summary, the invention enables researchers to interpret a single cell data set by projecting it onto a large-scale cell atlas and exploiting the atlas's existing information, and enables atlas creators to integrate new data sets, together with their accompanying biological information, from many research projects.
The invention also provides a system for integrating multi-source single cell data on line, which comprises:
an original data acquisition module for inputting single cell data with batch effects from a plurality of different sources;
a single cell data space transformation module for projecting the single cell data into a generalized single cell space that is independent of batch effects and retains only biological information;
an asymmetric variational autoencoder for aligning cells of the same type from different sources in the single cell space, with cells of different types positioned separately from each other; and
a batch-specific decoder for adding batch-specific variable information to the information of each single cell in the single cell space so as to reconstruct the single cell data.
Optionally, the system further comprises: an asymmetric autoencoder evaluation module for quantifying the degree of separation between cell types using a silhouette score, and for quantifying the degree of alignment of the same cell type between different batches using a batch entropy mixing score.
Optionally, the single cell data space transformation module includes:
a random sampling submodule for randomly sampling from all the input single cell data of different batches from different sources to form mini-batches; and
a normalization submodule for performing batch normalization on the mini-batches to reduce distribution deviation.
Optionally, the system further comprises: a single cell data updating module for projecting additional data onto the established single cell space after the single cell space is constructed.
The present application uses a batch-independent encoder that is trained to retain only batch-invariant biological variation. Because the encoder does not depend on batch, it generalizes well: once trained, the model aligns not only the training data but also data from new batches. Single-cell data from different sources are projected into a generalized cell space in which cells of the same type are aligned together; for partially overlapping datasets, the shared cell types are aligned while the non-overlapping cell types are kept apart. This achieves online integration of multi-source cell data: the model needs to be trained only once, and because the encoding is independent of batch, newly acquired batches can also be integrated well, so that the integrated data can be updated continuously.
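By way of illustration only, online integration with an encoder that has already been trained may be sketched as follows. The `toy_encoder` function is a hypothetical stand-in for a trained encoder, and the `CellAtlas` class and its nearest-neighbour annotation step are assumptions introduced solely to show that newly projected cells fall close to similar existing cells.

```python
import math
from typing import Callable, List, Sequence

Encoder = Callable[[Sequence[float]], List[float]]


def toy_encoder(expression: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for an encoder that has already been trained."""
    return [expression[0] - expression[1], expression[2]]


class CellAtlas:
    """Holds embeddings produced by a fixed encoder; the encoder is not updated."""

    def __init__(self, encoder: Encoder) -> None:
        self.encoder = encoder
        self.embeddings: List[List[float]] = []
        self.annotations: List[str] = []

    def add_batch(self, cells: List[Sequence[float]],
                  annotations: List[str]) -> None:
        """Project a new batch into the existing space and append it."""
        for cell, annotation in zip(cells, annotations):
            self.embeddings.append(self.encoder(cell))
            self.annotations.append(annotation)

    def nearest_annotation(self, cell: Sequence[float]) -> str:
        """Return the annotation of the closest cell already in the atlas."""
        latent = self.encoder(cell)
        nearest = min(range(len(self.embeddings)),
                      key=lambda i: math.dist(latent, self.embeddings[i]))
        return self.annotations[nearest]


atlas = CellAtlas(toy_encoder)
atlas.add_batch([[5.0, 0.0, 1.0], [0.0, 5.0, 1.0]], ["T cell", "B cell"])
before = [list(e) for e in atlas.embeddings]

# A cell from a new batch is projected without any retraining.
new_cell = [4.0, 1.0, 1.0]
predicted = atlas.nearest_annotation(new_cell)
atlas.add_batch([new_cell], [predicted])
```

In this sketch, adding the new batch leaves the embeddings of the existing cells unchanged, which is what allows the integrated data to grow incrementally.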
The invention models the batch-independent patterns inherent in single-cell data. It is not based on cell similarity and does not rely on identifying cell types shared across batches, thereby avoiding overcorrection of cell types and the heavy computation required to calculate cell similarities.
The present invention is well suited to constructing and integrating large single-cell datasets from a wide variety of data. Data integration is accomplished by projecting all single-cell data into a generalized cell embedding space using a general data projector (e.g., an encoder). The data projector needs to be trained only once and can then keep integrating newly added data into an existing single-cell atlas. This continuously growing SCALEX atlas is a flexible resource that allows many single-cell studies to be integrated and supports continued expansion.
The invention can integrate data from a variety of studies and platforms, which suits the current era of single-cell biology, and its support for exploratory analysis in the generalized cell space is very helpful for large integrative studies. A single-cell atlas constructed by projecting a variety of different cell types helps to guide the discovery of previously unknown cell types. For example, a pan-cancer single-cell atlas constructed from cells of many different cancer types could potentially reveal unknown cancer types, because some tumors that appear distinct share common features in pathogenesis, malignant progression, and metastasis.
In summary, the present application provides a deep generative framework for integrating heterogeneous single-cell data. The framework projects cells into a generalized, batch-invariant cell embedding space; it integrates heterogeneous single-cell data accurately and efficiently, as assessed on multiple benchmarks; and it can integrate partially overlapping datasets, accurately aligning similar cell populations while preserving true biological differences. The framework can be used to construct continuously expandable single-cell atlases for human, mouse, and COVID-19 data; these atlases integrate multiple data sources and can keep growing as new data sources are added. Analysis of these atlases characterized human and mouse tissues and identified peripheral immune cell subtypes associated with COVID-19 severity.
The method and the system provided by the invention are realized by executing the computer program on the processor, and the computer program is stored in a storage device of a server, a big data analysis processing platform or a special computer for real-time calling.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.

Claims (13)

1. A method for integrating multi-source single-cell data online, the method comprising:
inputting single cell data with batch effect from a plurality of different sources;
projecting, by a batch-effect-independent encoder, the single-cell data into a generalized single-cell space that is independent of batch effects and retains only biological information;
aligning cells of the same type from different sources in the single-cell space, with cells of different types positioned separately from one another;
adding batch-specific variation information, through a specific decoder, to the representation of each cell in the single-cell space so as to reconstruct the single-cell data.
2. The method of claim 1, characterized in that the method further comprises: quantifying the degree of separation between cell types using a silhouette score; and quantifying the degree of alignment of the same cell type across different batches using a batch entropy mixing score.
3. The method of claim 1, characterized in that projecting the single-cell data, by the batch-effect-independent encoder, into the generalized single-cell space that is independent of batch effects and retains only biological information comprises:
randomly sampling the pooled input single-cell data of all batches and sources to form mini-batches; and normalizing the mini-batches to reduce distribution shift.
4. The method of claim 1, characterized in that aligning cells of the same type from different sources in the single-cell space comprises: down-sampling the major cell types in partially overlapping datasets to construct test datasets having common cell types, so as to integrate the partially overlapping datasets.
5. The method of claim 1, characterized in that the method further comprises: after the single-cell space is constructed, projecting additional data onto the established single-cell space.
6. The method of claim 5, characterized in that the additional data are projected to new locations close to cells similar to them.
7. The method of claim 1, characterized in that aligning cells of the same type from different sources in the single-cell space, with cells of different types positioned separately from one another, comprises: aligning common cell types together at the same location in the cell space, and positioning non-overlapping cell types separately in the cell space.
8. A system for integrating multi-source single-cell data online, the system comprising:
an original data acquisition module for inputting single-cell data with batch effects from a plurality of different sources;
a single-cell data space transformation module for projecting the single-cell data into a generalized single-cell space that is independent of batch effects and retains only biological information;
an asymmetric variational autoencoder for aligning cells of the same type from different sources in the single-cell space, with cells of different types positioned separately from one another;
and a specific decoder for adding batch-specific variation information to the representation of each cell in the single-cell space so as to reconstruct the single-cell data.
9. The system of claim 8, characterized in that the system further comprises: an asymmetric autoencoder evaluation module for quantifying the degree of separation between cell types using a silhouette score, and for quantifying the degree of alignment of the same cell type across different batches using a batch entropy mixing score.
10. The system of claim 8, further characterized in that the single cell data space transformation module comprises:
a random sampling submodule for randomly sampling the pooled input single-cell data of all batches and sources to form mini-batches;
and a normalization submodule for applying batch normalization to each mini-batch so as to reduce distribution shift.
11. The system of claim 8, characterized in that the system further comprises: a single-cell data updating module for projecting additional data onto the established single-cell space after the single-cell space has been constructed.
12. A computer program which, when executed by a processor, performs the steps of the method for integrating multi-source single-cell data online according to claim 1.
13. An information storage medium characterized by storing the computer program according to claim 12.
CN202111213929.9A 2021-10-19 2021-10-19 Method and system for integrating multi-source single cell data on line Pending CN114038505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111213929.9A CN114038505A (en) 2021-10-19 2021-10-19 Method and system for integrating multi-source single cell data on line

Publications (1)

Publication Number Publication Date
CN114038505A (en) 2022-02-11

Family

ID=80141638

Country Status (1)

Country Link
CN (1) CN114038505A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160026754A1 (en) * 2013-03-14 2016-01-28 President And Fellows Of Harvard College Methods and systems for identifying a physiological state of a target cell
US20200176080A1 (en) * 2017-07-21 2020-06-04 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Analyzing Mixed Cell Populations
WO2020154885A1 (en) * 2019-01-29 2020-08-06 北京大学 Single cell type detection method, apparatus, device, and storage medium
CN112908414A (en) * 2021-01-28 2021-06-04 中山大学 Large-scale single cell typing method, system and storage medium
US20210287759A1 (en) * 2020-03-12 2021-09-16 Bostongene Corporation Systems and methods for deconvolution of expression data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination