CN117854599A

CN117854599A - Batch effect processing method, device and storage medium for multi-modal cell data

Info

Publication number: CN117854599A
Application number: CN202410259150.8A
Authority: CN
Inventors: 荣志炜
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2024-03-07
Filing date: 2024-03-07
Publication date: 2024-04-09
Anticipated expiration: 2044-03-07
Also published as: CN117854599B

Abstract

The invention relates to the technical field of cell omics and discloses a batch effect processing method, device and storage medium for multi-modal cell data. The method comprises: obtaining omics data of a single cell; inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modality unit into a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters; and obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result. Because the preset single-cell data integration model comprises the self-scaling attention fusion module, the invention can adapt to multi-modal omics data, accurately analyze the features of each modality in the omics data, and reduce the influence of batch effects in data integration.

Description

Batch effect processing method, device and storage medium for multi-modal cell data

Technical Field

The present invention relates to the field of cell omics, and in particular to a method, an apparatus, and a storage medium for processing batch effects of multi-modal cell data.

Background

Single-cell omics is a family of techniques that enables high-throughput analysis of individual cells, including single-cell transcriptome sequencing (single-cell RNA sequencing, scRNA-seq), single-cell proteomics, and the like. These techniques can take multiple omics measurements on the same cell, and integrating the resulting multi-omics data enables finer analysis at the cellular level.

However, because experimental environments, sequencing techniques, and data processing pipelines differ between batches, the integration of multi-modal cell omics data is susceptible to batch effects. Existing single-cell transcriptomic data integration methods require complex computation to reduce the influence of batch effects and are limited to particular computing scenarios, so the accuracy of the integrated data is low when they are applied to multi-modal cell omics data.

The foregoing is provided merely to facilitate understanding of the technical solution of the present invention and is not an admission that the foregoing is prior art.

Disclosure of Invention

The main purpose of the invention is to provide a batch effect processing method, device and storage medium for multi-modal cell data, so as to solve the technical problems that existing single-cell transcriptome data integration methods have high computational complexity and yield integrated data of low accuracy when applied to multi-modal cell omics data.

To achieve the above object, the present invention provides a batch effect processing method for multi-modal cell data, the method comprising the steps of:

obtaining omics data of a single cell;

inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modality unit into a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters;

obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result.

Optionally, the preset single-cell data integration model is constructed based on a variational autoencoder, and the preset single-cell data integration model further comprises: a modal encoding module, a graph encoding module and a discriminator module, wherein the discriminator module is arranged between the self-scaling attention fusion module and the hybrid decoding module.

Optionally, the inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data comprises:

performing feature extraction on the omics data through the modal encoding module to obtain the modal features of each modality unit;

performing feature fusion on the modal features through the self-scaling attention fusion module to obtain a global feature;

performing batch distribution coordination on the global feature through the discriminator module to obtain batch distribution information;

extracting graph features of a preset guidance graph through the graph encoding module to obtain a prior feature;

and performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data.

Optionally, the hybrid decoding module comprises a hybrid fusion sub-module and a decoding sub-module; the performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data comprises:

mapping and fusing the prior feature, the global feature and the batch distribution information per modality unit through the hybrid fusion sub-module to obtain an overall expression feature;

and integrating the overall expression feature through the decoding sub-module to generate the modal feature distribution parameters corresponding to the omics data.

Optionally, the preset single-cell data integration model is trained by semi-supervised learning, and the method further comprises:

collecting multi-modal cell omics data and converting the multi-modal cell omics data into raster data;

constructing a cell omics dataset based on the raster data;

training a single-cell data integration model to be trained with the cell omics dataset to obtain the preset single-cell data integration model.

Optionally, the constructing a cell omics dataset based on the raster data comprises:

determining whether each unit of data in the raster data has annotation information;

taking each unit of data with annotation information as a first training dataset and each unit of data without annotation information as a second training dataset;

constructing the cell omics dataset from the first training dataset and the second training dataset.

Optionally, the training a single-cell data integration model to be trained with the cell omics dataset to obtain the preset single-cell data integration model comprises:

performing, based on a supervised learning algorithm, first training on the single-cell data integration model to be trained with the first training dataset to obtain an initialized single-cell data integration model;

performing, based on an unsupervised learning algorithm, second training on the initialized single-cell data integration model with the second training dataset, and updating the single-cell data integration model according to the learning result;

and repeating the first training and the second training to obtain the preset single-cell data integration model.

Optionally, the obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result comprises:

obtaining each reconstruction feature corresponding to the omics data according to the modal feature generation result;

and performing data integration on the reconstruction features in a shared embedding space to obtain the multi-omics integrated dataset of the single cell.

In addition, in order to achieve the above object, the present invention also provides a batch effect processing apparatus for multi-modal cell data, the apparatus comprising: a memory, a processor, and a batch effect processing program for multi-modal cell data that is stored in the memory and executable on the processor, wherein the batch effect processing program for multi-modal cell data is configured to implement the steps of the batch effect processing method for multi-modal cell data described above.

In addition, in order to achieve the above object, the present invention also provides a storage medium, on which a batch effect processing program of multi-modal cell data is stored, which when executed by a processor, implements the steps of the batch effect processing method of multi-modal cell data as described above.

The invention obtains omics data of a single cell; inputs the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modality unit into a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters; and obtains a multi-omics integrated dataset of the single cell according to the modal feature generation result. Because the preset single-cell data integration model comprises the self-scaling attention fusion module, the invention can adapt to omics data with diverse modalities to obtain fused features, and the hybrid decoding module performs feature mapping to obtain the modal feature distribution parameters. This enables accurate analysis of the features of each modality in the omics data, reduces the influence of batch effects in data integration, and yields a multi-omics integrated dataset of higher accuracy.

Drawings

FIG. 1 is a schematic diagram of a batch effect processing device for multi-modal cell data in a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flow chart of a first embodiment of a batch effect processing method for multi-modal cell data in accordance with the present invention;

FIG. 3 is a flow chart of a second embodiment of a batch effect processing method for multi-modal cell data in accordance with the present invention;

FIG. 4 is a diagram of a model structure of a pre-set single cell data integration model in a batch effect processing method of multi-modal cell data according to an embodiment of the present invention;

FIG. 5 is a flow chart of a third embodiment of a batch effect processing method for multi-modal cell data in accordance with the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a batch effect processing apparatus for multi-modal cell data in a hardware operating environment according to an embodiment of the present invention.

As shown in fig. 1, the batch effect processing apparatus for multi-modal cell data may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001 described above.

It will be appreciated by those skilled in the art that the configuration shown in FIG. 1 does not constitute a limitation of the batch effect processing apparatus for multi-modal cell data and may include more or fewer components than shown, or certain components in combination, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a batch effect processing program for multi-modal cell data.

In the batch effect processing device of the multi-modal cell data shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the batch effect processing device for multi-modal cell data may be disposed in the batch effect processing device for multi-modal cell data, where the batch effect processing device for multi-modal cell data invokes a batch effect processing program for multi-modal cell data stored in the memory 1005 through the processor 1001, and executes the batch effect processing method for multi-modal cell data provided by the embodiment of the invention.

An embodiment of the present invention provides a batch effect processing method for multi-modal cell data, and referring to fig. 2, fig. 2 is a flow chart of a first embodiment of the batch effect processing method for multi-modal cell data.

In this embodiment, the batch effect processing method of the multi-modal cell data includes the following steps:

Step S10: obtaining omics data of a single cell.

It should be noted that single-cell omics techniques can extract single-cell omics data of different modalities from cells sampled from the same biological sample or tissue, and integrating these data establishes connections across modalities, which supports a more comprehensive analysis of the underlying biology. However, because of differences in experimental environment, individuals or tissues, the omics data of cells exhibit batch effects during integrated analysis, which reduces the accuracy of the integrated data.

It should be noted that most existing methods are limited to processing single-cell data of one specific omics type, that is, they correct batch effects for single-cell omics data in a single modality. However, a single cell may have omics data in multiple modalities; different modalities lack common features, and their features lie in spaces of different dimensions. The cell states of the omics data of each modality unit therefore need to be aligned before single-cell multi-modal data integration.

It should be understood that this embodiment provides a batch effect processing method for multi-modal cell data in which the omics data of a single cell are input into a preset single-cell data integration model to obtain the modal feature generation result of the corresponding omics data. Because the preset single-cell data integration model is provided with a self-scaling attention fusion module, the modal features under different modality combinations in the omics data can be adaptively fused into a global feature, which is then decoded and output to obtain the modal feature distribution parameters of the omics data. This achieves cell state alignment and integration of the omics data, which helps to obtain an accurate multi-omics integrated dataset.

It should be further noted that the method of this embodiment may be applied to batch effect elimination for multi-modal omics data, or to cell alignment before single-cell multi-modal omics data integration. The execution subject of the method may be a computing service device with data processing, model calling, data storage and program running functions, such as a mobile phone, a personal computer or a cell analyzer, or another electronic device capable of implementing the same or similar functions, which is not limited in this embodiment. This embodiment and the following embodiments are described by taking a batch effect processing device for multi-modal cell data (hereinafter referred to as the processing device) as an example.

It should be understood that the omics data of a single cell may include omics data of different modalities, such as chromatin accessibility (DNA level), gene expression (RNA level) and epitopes (protein level).

Step S20: inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modality unit into a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters.

It should be noted that the preset single-cell data integration model may be a model constructed based on a variational autoencoder (VAE), which can integrate and analyze data by learning a latent representation of the omics data in advance. The VAE may include an encoder structure and a decoder structure: the encoder maps the input data to a latent space, the decoder maps latent vectors back to the original space of the input data, and data integration analysis is performed based on the resulting feature distributions.
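The encoder and decoder roles described above can be illustrated with a minimal sketch in plain Python. It assumes a single linear layer for each of the encoder and decoder and a Gaussian latent variable; the function names, toy weights and dimensions are illustrative assumptions and are not taken from this embodiment.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def encode(x, w_mu, b_mu, w_logvar, b_logvar):
    """Encoder: map an input vector to the mean and log-variance of its latent Gaussian."""
    return linear(x, w_mu, b_mu), linear(x, w_logvar, b_logvar)


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def decode(z, w_dec, b_dec):
    """Decoder: map a latent vector back to the input space."""
    return linear(z, w_dec, b_dec)


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


rng = random.Random(0)
x = [1.0, 0.0, 2.0]                            # toy profile with 3 features (assumed)
w_mu = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]     # 3 inputs -> 2 latent dimensions
w_logvar = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # unit variance for simplicity
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 2 latent dimensions -> 3 outputs

mu, logvar = encode(x, w_mu, [0.0, 0.0], w_logvar, [0.0, 0.0])
z = reparameterize(mu, logvar, rng)
x_hat = decode(z, w_dec, [0.0, 0.0, 0.0])
kl = kl_to_standard_normal(mu, logvar)
```

In a trained model the weights would be learned by minimizing a reconstruction loss together with the KL term; the sketch only shows the data flow from input to latent space and back.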

It should also be noted that the self-scaling attention mechanism allows attention to be computed between each position in the modal features and all other positions, and a scaling factor can be introduced to adjust the magnitude of the attention weights. In this embodiment, the omics data may be a combination of data from different modality units, so the self-scaling attention fusion module in the preset single-cell data integration model can perform attention-weight-based adaptive fusion of the modal features under different modality combinations to obtain a global feature. This global feature is a low-dimensional representation of the cell state of the single cell.
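As a rough illustration of attention-weighted fusion over whichever modalities are present, the following sketch scores each modal feature against a query vector, divides by a scaling factor, and takes a softmax-weighted sum. The query vector, the default scaling factor of sqrt(d) and the function name `attention_fuse` are assumptions made for this example; the text above does not specify the exact form of the self-scaling attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(modal_features, query, scale=None):
    """Fuse the available modal features into one global feature.

    modal_features: dict mapping modality name -> feature vector of length d.
    query: hypothetical learned vector of length d used to score each modality.
    scale: scaling factor applied to the scores (assumed default: sqrt(d)).
    """
    names = sorted(modal_features)
    d = len(query)
    scale = math.sqrt(d) if scale is None else scale
    scores = [sum(q * f for q, f in zip(query, modal_features[n])) / scale
              for n in names]
    weights = softmax(scores)
    fused = [sum(w * modal_features[n][i] for w, n in zip(weights, names))
             for i in range(d)]
    return fused, dict(zip(names, weights))


# A cell measured in two modalities, and a cell measured in only one.
fused, weights = attention_fuse({"rna": [1.0, 0.0], "protein": [0.0, 1.0]}, [1.0, 1.0])
fused_single, _ = attention_fuse({"rna": [1.0, 0.0]}, [1.0, 1.0])
```

Because the softmax runs only over the modalities that are present, the same function handles any modality combination, which is the adaptive property the paragraph above relies on.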

It should be appreciated that the global feature can be parameterized by a mixture distribution composed of a discrete variable and a continuous variable. The discrete variable can represent different cell state categories, while the continuous variable can represent the degree of variation within a cell state category or other relevant information.
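One possible reading of this mixture parameterization is a Gaussian mixture: a categorical variable selects a cell state category, and a Gaussian variable centered on that category's mean captures variation within it. The sketch below samples from such a distribution; the Gaussian form, the names `category_probs`, `means` and `log_vars`, and the toy values are assumptions made for illustration.

```python
import math
import random


def sample_mixture(category_probs, means, log_vars, rng):
    """Sample (c, z): a discrete cell state category c and a continuous vector z.

    category_probs: probability of each category (discrete variable).
    means, log_vars: per-category Gaussian parameters (continuous variable).
    """
    c = rng.choices(range(len(category_probs)), weights=category_probs, k=1)[0]
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(means[c], log_vars[c])]
    return c, z


rng = random.Random(0)
category_probs = [0.7, 0.3]              # two hypothetical cell state categories
means = [[-2.0, 0.0], [2.0, 0.0]]        # category-specific centers
log_vars = [[-1.0, -1.0], [-1.0, -1.0]]  # within-category spread

c, z = sample_mixture(category_probs, means, log_vars, rng)
```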

It can be understood that, through the self-scaling attention fusion module, the cell state can be effectively extracted and modeled from the omics data input to the model, and the information of different modalities can be encoded and fused into a low-dimensional global feature, providing a more effective feature representation for subsequent tasks.

It should be noted that, after the global feature is obtained, in order to integrate the omics data based on the model, the global feature may be input to the hybrid decoding module and combined with the prior feature and the batch distribution information that are also input to the hybrid decoding module, to obtain the modal feature distribution parameters corresponding to the omics data and thus a richer modal feature generation result.

Step S30: obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result.

It should be noted that the processing device may obtain each reconstruction feature corresponding to the omics data according to the modal feature generation result, and perform data integration on the reconstruction features in a shared embedding space to obtain the multi-omics integrated dataset of the single cell.

It can be understood that the modal feature generation result may include the modal feature distribution parameters corresponding to the omics data. Because the hybrid decoding module can map the overall expression feature, obtained by fusing the global feature with the prior feature and the batch distribution information, to the shared embedding space (common embedding space), a mapping to a low-dimensional space is achieved. The reconstructed omics features of the single cell can thus be obtained in the low-dimensional space, the influence of batch effects can be eliminated, and the (multiple) omics data of the single cell are integrated to obtain the multi-omics integrated dataset.

This embodiment obtains omics data of a single cell; inputs the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modality unit into a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters; and obtains a multi-omics integrated dataset of the single cell according to the modal feature generation result. Because the preset single-cell data integration model in this embodiment comprises the self-scaling attention fusion module, it can adapt to omics data with diverse modalities to obtain fused features, and the hybrid decoding module then performs feature mapping to obtain the modal feature distribution parameters. This enables accurate analysis of the features of each modality in the omics data, reduces the influence of batch effects in data integration, and yields a multi-omics integrated dataset of higher accuracy.

Referring to FIG. 3, FIG. 3 is a flow chart illustrating a second embodiment of a batch effect processing method for multi-modal cell data according to the present invention.

Based on the above embodiment, in order to further describe how the preset single-cell data integration model processes the input data, in this embodiment the preset single-cell data integration model further comprises a modal encoding module, a graph encoding module and a discriminator module, the discriminator module being disposed between the self-scaling attention fusion module and the hybrid decoding module. Step S20 therefore includes:

Step S201: performing feature extraction on the omics data through the modal encoding module to obtain the modal features of each modality unit.

It should be noted that the modal encoding module may be the encoder structure in the VAE architecture. Since the omics data of a single cell include a plurality of modality units, a corresponding encoder may be provided for each modality unit. Extracting features from the data of each modality unit with its own encoder yields the corresponding modal features, which helps to better express the modality information of each modality unit in the omics data.
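The idea of one encoder per modality unit can be sketched as a dictionary of encoders that map inputs of different sizes to features of a common size. The one-layer tanh encoders, the modality names and the toy weights below are assumptions made for illustration, not the encoders of this embodiment.

```python
import math


def make_encoder(weights, bias):
    """Return a one-layer encoder computing tanh(W x + b)."""
    def encoder(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, bias)]
    return encoder


# Hypothetical encoders: each modality has its own input size but a shared output size of 2.
encoders = {
    "rna": make_encoder([[0.1, 0.2, 0.0, -0.1], [0.3, 0.0, 0.1, 0.2]], [0.0, 0.0]),
    "protein": make_encoder([[0.5, -0.5, 0.1], [0.2, 0.2, 0.2]], [0.0, 0.1]),
}


def encode_cell(cell):
    """Encode only the modality units that were measured for this cell."""
    return {name: encoders[name](values) for name, values in cell.items()}


cell = {"rna": [1.0, 0.0, 3.0, 2.0], "protein": [0.5, 1.5, 0.0]}
modal_features = encode_cell(cell)
protein_only = encode_cell({"protein": [0.5, 1.5, 0.0]})
```

Mapping every modality to the same feature size is what allows a later fusion step to combine them regardless of which modalities a given cell has.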

Step S202: performing feature fusion on the modal features through the self-scaling attention fusion module to obtain a global feature.

It should be noted that, in order to model the cell state of a single cell, feature fusion may be performed on the modal features of the modality units. The single cells input into the model may differ, and the composition of modality units in their omics data may differ as well; for example, the omics data may include RNA-modality and protein-modality data, or DNA-modality and protein-modality data. The modal features of the modality units can therefore be adaptively fused based on the adaptive attention mechanism to obtain the global feature.

Step S203: performing batch distribution coordination on the global feature through the discriminator module to obtain batch distribution information.

It will be appreciated that, in order to coordinate the distribution of the global features across different batches, the parameterized global features may be input to the discriminator. A batch is a subset of samples randomly drawn from the input omics data, and the global features are the low-dimensional representations derived from these drawn samples. Inputting the global features into the discriminator can make the reconstructed features output by the model more realistic.
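One common way to coordinate batch distributions with a discriminator is adversarial alignment: the discriminator tries to predict the batch from the global feature, and the encoder is trained so that this prediction fails. The sketch below shows the two opposing losses for a linear softmax discriminator. Treating the discriminator of this embodiment as adversarial in this sense is an assumption, as are all names and values in the code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_probs(u, weights, bias):
    """Predicted probability of each batch given the global feature u."""
    logits = [sum(w * ui for w, ui in zip(row, u)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def discriminator_loss(u, batch_index, weights, bias):
    """Cross-entropy of predicting the true batch; the discriminator minimizes this."""
    return -math.log(discriminator_probs(u, weights, bias)[batch_index])


def encoder_adversarial_loss(u, batch_index, weights, bias):
    """The encoder benefits when the discriminator fails, so it minimizes the negated loss."""
    return -discriminator_loss(u, batch_index, weights, bias)


u = [0.5, -1.0]                        # toy global feature
zero_weights = [[0.0, 0.0], [0.0, 0.0]]  # an uninformed discriminator over two batches
chance_loss = discriminator_loss(u, 0, zero_weights, [0.0, 0.0])
```

When the global features carry no batch information, the discriminator can do no better than chance, and its loss for two batches stays near log 2.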

Step S204: extracting graph features of the preset guidance graph through the graph encoding module to obtain the prior feature.

It should be noted that the graph encoding module may be a graph encoder, which performs graph encoding on a guidance graph carrying prior modality knowledge and converts it into feature vectors, yielding the prior feature.
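A graph encoder can be illustrated by a single propagation step in the style of a graph convolution: each node's feature becomes the average of its own feature and those of its neighbours in the guidance graph. The choice of this propagation rule, the interpretation of nodes as omics features (for example genes and proteins), and the toy graph are assumptions for this sketch.

```python
def normalize_with_self_loops(adj):
    """Add self-loops to an adjacency matrix and normalize each row to sum to 1."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[value / sum(row) for value in row] for row in a_hat]


def graph_encode(adj, node_features):
    """One propagation step: each node averages itself and its neighbours."""
    a_norm = normalize_with_self_loops(adj)
    n = len(adj)
    dim = len(node_features[0])
    return [[sum(a_norm[i][j] * node_features[j][k] for j in range(n))
             for k in range(dim)]
            for i in range(n)]


# Toy guidance graph: nodes 0 and 1 are linked by prior knowledge, node 2 is isolated.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
node_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
prior_features = graph_encode(adj, node_features)
```

Linked nodes end up with similar vectors, which is how prior knowledge about related features can be carried into the decoder.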

Step S205: performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data.

The hybrid decoding module may be the decoder structure in the VAE architecture, specifically a hybrid decoder structure with modality-specific parts for a plurality of modalities, which can generate and output specific expression features for each modality unit.

It should be further noted that, in order to make full use of the prior knowledge and eliminate the influence of batch effects, the hybrid decoding module may be divided into a hybrid fusion sub-module and a decoding sub-module to further illustrate its function, so step S205 may include:

Step S2051: mapping and fusing the prior feature, the global feature and the batch distribution information per modality unit through the hybrid fusion sub-module to obtain an overall expression feature.

It should be noted that the prior feature may include information such as the correlations and prior distributions among the data of the modality units in the omics data; establishing a mapping relationship between the global feature and the prior feature makes it easier to learn the modal feature representations of different modality units. In addition, to further eliminate the influence of batch effects, the distribution of each modality unit in the omics data can be mapped together with the prior feature and the global feature, introducing information about how the data of each modality unit are distributed within the batch.

Step S2052: integrating the overall expression feature through the decoding sub-module to generate the modal feature distribution parameters corresponding to the omics data.

In a specific implementation, the prior feature, the global feature and the batch distribution information are distribution-mapped and then fused to generate an overall expression feature that covers the modal features of every modality unit. The modality-specific hybrid decoder structures can then use the association information among modalities to generate the distribution of the features of each modality, namely the modal feature distribution parameters. The omics data can be reconstructed from these parameters, and the reconstructed omics features are obtained in a low-dimensional space.
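To make the combination of the three inputs concrete, the sketch below computes a per-feature logit as the inner product of the global feature with each feature's prior vector plus a batch-specific offset, then turns the logits into expected values with a softmax scaled by a total count. The inner-product form, the softmax output, and the names `batch_offsets` and `total_count` are assumptions chosen for illustration; the text does not fix the decoder's functional form.

```python
import math


def decode_distribution_params(u, prior_features, batch_offset, total_count):
    """Return an expected value per omics feature for one cell.

    u: global feature of the cell.
    prior_features: one prior vector per omics feature (same length as u).
    batch_offset: one additive offset per omics feature for the cell's batch.
    total_count: hypothetical scale, e.g. the cell's total measured count.
    """
    logits = [sum(ui * vi for ui, vi in zip(u, v)) + off
              for v, off in zip(prior_features, batch_offset)]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    norm = sum(exps)
    return [total_count * e / norm for e in exps]


u = [1.0, -0.5]
prior_features = [[0.2, 0.1], [1.0, 0.0], [-0.3, 0.4]]
batch_offsets = {"batch_a": [0.0, 0.0, 0.0], "batch_b": [0.1, -0.2, 0.0]}

means_a = decode_distribution_params(u, prior_features, batch_offsets["batch_a"], 100.0)
means_b = decode_distribution_params(u, prior_features, batch_offsets["batch_b"], 100.0)
```

Keeping the batch offset as a separate input means the global feature itself does not have to encode batch differences, which is one way a decoder can absorb batch effects.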

In addition, reference may be made to fig. 4, which is a schematic diagram of the model structure of the preset single-cell data integration model in the batch effect processing method for multi-modal cell data according to an embodiment of the invention.

In FIG. 4, the encoders (Encoder) in the modal encoding module perform feature extraction on the omics data of each modality unit of a single cell to obtain the corresponding modal features. Attention-based feature fusion (Attention Fusion) is then carried out in the self-scaling attention fusion module to obtain the global feature, which is parameterized by a mixture distribution composed of a discrete variable and a continuous variable and then input into the discriminator module, where the discriminator (Discriminator) coordinates the distributions of different batches to obtain the batch distribution information. Meanwhile, the graph encoder (Graph Encoder) in the graph encoding module performs graph encoding on the guidance graph carrying prior knowledge and converts it into feature vectors, yielding the prior feature V.

It should be noted that the graph encoder may correspond to a graph decoder (Graph Decoder). The hybrid decoder (Hybrid Decoder) in the hybrid decoding module finally maps the global feature, the batch distribution information and the prior feature V to generate the distribution of the features of each modality, namely the modal feature distribution parameters, so that the omics data can be reconstructed from these parameters and the reconstruction (Reconstructions) of the omics features is achieved in a low-dimensional space.

In this embodiment, in the preset single-cell data integration model, feature extraction is performed on the omics data through the modal encoding module to obtain the modal features of each modality unit; feature fusion is performed on the modal features through the self-scaling attention fusion module to obtain a global feature; batch distribution coordination is performed on the global feature through the discriminator module to obtain batch distribution information; graph features of a preset guidance graph are extracted through the graph encoding module to obtain a prior feature; and feature mapping and fusion output are performed on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data. Using the self-scaling attention mechanism, the method encodes, fuses, decodes and outputs the omics data of each modality unit to obtain probability distribution information for each modal feature, which helps to better reconstruct the omics data and to integrate and analyze multi-modal omics data.

Referring to FIG. 5, FIG. 5 is a flow chart illustrating a third embodiment of the batch effect processing method for multi-modal cell data according to the present invention.

Based on the above embodiment, in order to obtain a model with higher accuracy and robustness, the model may be trained in a semi-supervised learning manner. Therefore, in this embodiment, the batch effect processing method of multi-modal cell data of the present invention further includes:

Step S01: acquiring multi-modal cytological data and converting the multi-modal cytological data into raster data.

It will be appreciated that cytological data differ in experimental environment, individual, tissue or species, and that data from different sequencing technologies come in mismatched formats. For this reason, the collected cytological data of each modality may be converted into raster data with a grid structure for presentation.

It should be noted that the raster data may be a mosaic dataset (mosaic dataset). In cytohistology, mosaic datasets are widely used to integrate cell image data from different sequencing technologies, experiments or data sources. A mosaic dataset can stitch the image data into a continuous, accurate image mosaic. With a mosaic dataset, cell image data can be retrieved and analyzed as needed, making it easy to view and compare the structural and morphological features of different cell samples.

Step S02: a cytological dataset is constructed based on the raster data.

It should be noted that a mosaic dataset may store not only the original image data but also additional attribute data, such as cell type annotation information and marker information. Therefore, when the cytological dataset is constructed, it may first be determined whether each unit of data in the raster data carries annotation information; each unit of data with annotation information is then placed in a first training data set, and each unit of data without annotation information is placed in a second training data set; finally, the cytological dataset is constructed from the first training data set and the second training data set.
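The partition described above can be sketched as follows. This is a minimal illustration under stated assumptions: each unit of data is represented as a dictionary, and the keys `id`, `values` and `annotation`, as well as the function name `build_cytological_dataset`, are hypothetical and not taken from the source.

```python
def build_cytological_dataset(units):
    """Split unit data by whether annotation information is present.

    units: iterable of dicts; a unit counts as annotated when its
           (hypothetical) "annotation" key exists and is not None.
    """
    first_training_set = []   # annotated units, used for supervised training
    second_training_set = []  # unannotated units, used for unsupervised training
    for unit in units:
        if unit.get("annotation") is not None:
            first_training_set.append(unit)
        else:
            second_training_set.append(unit)
    return {
        "first_training_set": first_training_set,
        "second_training_set": second_training_set,
    }


# Hypothetical raster units; cell_2 and cell_3 carry no annotation.
units = [
    {"id": "cell_1", "values": [0.2, 0.8], "annotation": "T cell"},
    {"id": "cell_2", "values": [0.9, 0.1], "annotation": None},
    {"id": "cell_3", "values": [0.4, 0.6]},
]
dataset = build_cytological_dataset(units)
```

Treating a missing key and an explicit `None` the same way is a design choice of this sketch; a real pipeline would follow whatever convention the raster format uses for absent annotations.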

Step S03: training the single-cell data integration model to be trained with the cytological dataset to obtain the preset single-cell data integration model.

It will be appreciated that, since the cytological dataset includes some annotated unit data, step S03 further comprises the following sub-steps for training the single-cell data integration model:

Step S031: based on a supervised learning algorithm, performing first training on the single-cell data integration model to be trained with the first training data set to obtain an initialized single-cell data integration model.

Step S032: based on an unsupervised learning algorithm, performing second training on the initialized single-cell data integration model with the second training data set, and updating the single-cell data integration model according to the learning result.

Step S033: repeatedly executing the first training and the second training to obtain the preset single-cell data integration model.

In a specific implementation, a supervised learning algorithm is first applied to the annotated unit data in the first training data set to initialize the single-cell data integration model. The unannotated unit data in the second training data set are then used to adjust the model, and the single-cell data integration model is updated. This training process is repeated, optimizing and updating the model parameters until a preset stopping condition is reached (for example, a maximum number of iterations or a target value of a performance index), yielding the preset single-cell data integration model.
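The alternation between supervised and unsupervised training can be illustrated with a toy sketch. It deliberately swaps in a much simpler technique than the model described here: a nearest-centroid classifier with pseudo-labelling (self-training). The function names, the tolerance `tol` and the example points are assumptions made only for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def nearest(point, centroids):
    """Return the label of the centroid closest to `point`."""
    def sqdist(label):
        return sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return min(sorted(centroids), key=sqdist)


def train_semi_supervised(labeled, unlabeled, max_iters=20, tol=1e-6):
    """Alternate supervised and unsupervised updates until a stop condition.

    labeled: list of (vector, label) pairs (the first training data set).
    unlabeled: list of vectors (the second training data set).
    """
    # First training (supervised): initialize centroids from annotated units.
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    centroids = {label: centroid(pts) for label, pts in by_label.items()}

    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Second training (unsupervised): pseudo-label the unannotated units,
        # then refit centroids on annotated plus pseudo-labelled units.
        pooled = {label: list(pts) for label, pts in by_label.items()}
        for point in unlabeled:
            pooled[nearest(point, centroids)].append(point)
        updated = {label: centroid(pts) for label, pts in pooled.items()}
        shift = max(
            abs(a - b)
            for label in updated
            for a, b in zip(updated[label], centroids[label])
        )
        centroids = updated
        # Stop when the parameters no longer move (or max_iters is reached).
        if shift < tol:
            break
    return centroids, iteration


labeled = [([0.0, 0.0], "A"), ([10.0, 10.0], "B")]
unlabeled = [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0], [10.0, 8.0]]
centroids, iterations = train_semi_supervised(labeled, unlabeled)
```

In this toy setting the annotated units stay in the pool on every round, so the supervised signal anchors each update while the unannotated units refine the centroids; the loop ends either when the centroids stop moving or when `max_iters` is reached, mirroring the preset stopping condition.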

In this embodiment, a cytological dataset is constructed and divided according to the presence or absence of annotations, and the model is trained with the annotated unit data and the unannotated unit data respectively. This realizes model training in a semi-supervised learning manner, makes full use of the annotation information in the multi-modal cytological data, improves the accuracy and reliability of the trained model, and in turn improves the efficiency of subsequent multi-omics data integration.

In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a batch effect processing program of the multi-modal cell data, and the batch effect processing program of the multi-modal cell data realizes the steps of the batch effect processing method of the multi-modal cell data when being executed by a processor.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention. Any equivalent structure or equivalent process derived from the content disclosed herein, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A method for batch effect processing of multi-modal cell data, the method comprising:

obtaining omics data of a single cell;

2. The batch effect processing method of multi-modal cell data as claimed in claim 1, wherein the preset single-cell data integration model is constructed based on a variational autoencoder, the preset single-cell data integration model further comprising: a modal encoding module, a graph encoding module and a discriminator module, wherein the discriminator module is arranged between the self-scaling attention fusion module and the hybrid decoding module.

3. The method for batch effect processing of multi-modal cell data as set forth in claim 2, wherein the inputting the omics data into a preset single cell data integration model to obtain a modal feature generation result corresponding to the omics data includes:

4. The batch effect processing method of multi-modal cell data as claimed in claim 3, wherein the hybrid decoding module includes: a hybrid fusion sub-module and a decoding sub-module; and the performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data includes:

5. The batch effect processing method of multi-modal cell data as claimed in claim 1 wherein the preset single cell data integration model is trained using semi-supervised learning, the method further comprising:

constructing a cytological dataset based on the raster data;

6. The batch effect processing method of multi-modal cell data as claimed in claim 5 wherein the constructing a cytological dataset based on the raster data includes:

7. The batch effect processing method of multi-modal cell data according to claim 6, wherein training the single cell data integration model to be trained through the cytology data set to obtain a preset single cell data integration model includes:

8. The method of claim 1, wherein the obtaining multi-omics integrated data of the single cell based on the modal feature generation result further comprises:

obtaining each reconstruction feature of the corresponding omics data according to the modal feature generation result;

9. A batch effect processing apparatus for multimodal cell data, the apparatus comprising: a memory, a processor and a batch effect processing program of multi-modal cell data stored on the memory and executable on the processor, the batch effect processing program of multi-modal cell data configured to implement the steps of the batch effect processing method of multi-modal cell data as claimed in any one of claims 1 to 8.

10. A storage medium, wherein a batch effect processing program of multi-modal cell data is stored on the storage medium, and the batch effect processing program of multi-modal cell data, when executed by a processor, implements the steps of the batch effect processing method of multi-modal cell data as claimed in any one of claims 1 to 8.