CN117854599A - Batch effect processing method, equipment and storage medium for multi-mode cell data - Google Patents

Info

Publication number
CN117854599A
Authority
CN
China
Prior art keywords
data
modal
cell data
module
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410259150.8A
Other languages
Chinese (zh)
Other versions
CN117854599B (en
Inventor
荣志炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202410259150.8A priority Critical patent/CN117854599B/en
Publication of CN117854599A publication Critical patent/CN117854599A/en
Application granted granted Critical
Publication of CN117854599B publication Critical patent/CN117854599B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of cell omics and discloses a batch effect processing method, device and storage medium for multi-modal cell data. The method comprises: obtaining omics data of a single cell; inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modal unit to obtain a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters; and obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result. Because the preset single-cell data integration model comprises the self-scaling attention fusion module, the invention can adapt to multi-modal omics data, accurately resolve the features of each modality in the omics data, and reduce the influence of batch effects in data integration.

Description

Batch effect processing method, equipment and storage medium for multi-mode cell data
Technical Field
The present invention relates to the technical field of cell omics, and in particular to a batch effect processing method, device and storage medium for multi-modal cell data.
Background
Single-cell omics refers to techniques that enable high-throughput analysis of individual cells, including single-cell transcriptome sequencing (single-cell RNA sequencing, scRNA-seq), single-cell proteomics, and the like. These techniques make it possible to perform multiple omics measurements on the same cell and to integrate the resulting multi-omics data, so that these rich resources can be used for finer analysis at the cellular level.
However, because experimental environments, sequencing technologies and data processing pipelines differ between batches, the integration of omics data of cells with diverse modalities is susceptible to batch effects. Existing single-cell transcriptomic data integration methods require complex computation to reduce the influence of batch effects and are limited in the scenarios they can handle, so the accuracy of the integrated data is low when they are applied to multi-modal cell omics data.
The foregoing is provided merely to facilitate understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.
Disclosure of Invention
The main object of the invention is to provide a batch effect processing method, device and storage medium for multi-modal cell data, so as to solve the technical problems that existing single-cell transcriptomic data integration methods have high computational complexity and yield integrated data of low accuracy when applied to multi-modal cell omics data.
To achieve the above object, the present invention provides a batch effect processing method for multi-modal cell data, the method comprising the steps of:
obtaining omics data of a single cell;
inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modal unit to obtain a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters;
obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result.
Optionally, the preset single-cell data integration model is constructed based on a variational autoencoder, and further comprises: a modal encoding module, a graph encoding module and a discriminator module, wherein the discriminator module is arranged between the self-scaling attention fusion module and the hybrid decoding module.
Optionally, the inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data includes:
performing feature extraction on the omics data through the modal encoding module to obtain the modal features of each modal unit;
performing feature fusion on the modal features through the self-scaling attention fusion module to obtain a global feature;
performing batch distribution coordination on the global feature through the discriminator module to obtain batch distribution information;
extracting graph features of a preset guidance graph through the graph encoding module to obtain a prior feature;
and performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data.
Optionally, the hybrid decoding module includes a hybrid fusion sub-module and a decoding sub-module; the performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data includes:
mapping and fusing the prior feature, the global feature and the batch distribution information on a per-modal-unit basis through the hybrid fusion sub-module to obtain an overall expression feature;
and performing integrated generation on the overall expression feature through the decoding sub-module to obtain the modal feature distribution parameters corresponding to the omics data.
Optionally, the preset single-cell data integration model is trained by semi-supervised learning, and the method further comprises:
collecting multi-modal cell omics data and converting the multi-modal cell omics data into raster data;
constructing a cell omics dataset based on the raster data;
training a single-cell data integration model to be trained with the cell omics dataset to obtain the preset single-cell data integration model.
Optionally, the constructing a cell omics dataset based on the raster data includes:
judging whether each unit of data in the raster data has annotation information;
taking the units of data that have annotation information as a first training dataset and the units of data that have no annotation information as a second training dataset;
constructing the cell omics dataset from the first training dataset and the second training dataset.
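The split into a first (annotated) and second (unannotated) training dataset described above can be sketched as follows. This is only an illustrative sketch: the `CellRecord` type, its field names and the dictionary layout of the resulting dataset are hypothetical and do not appear in the specification.

```python
from typing import List, NamedTuple, Optional, Tuple


class CellRecord(NamedTuple):
    """Hypothetical container for one unit of data."""
    cell_id: str
    features: List[float]
    annotation: Optional[str]  # None when the unit carries no annotation


def split_by_annotation(
    records: List[CellRecord],
) -> Tuple[List[CellRecord], List[CellRecord]]:
    """Return (first_set, second_set): annotated units and unannotated units."""
    first_set = [r for r in records if r.annotation is not None]
    second_set = [r for r in records if r.annotation is None]
    return first_set, second_set


records = [
    CellRecord("c1", [0.1, 2.0], "T cell"),
    CellRecord("c2", [1.3, 0.0], None),
    CellRecord("c3", [0.7, 0.4], "B cell"),
]
first_set, second_set = split_by_annotation(records)

# The dataset keeps both parts so later training can use each one separately.
dataset = {"labeled": first_set, "unlabeled": second_set}
```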
Optionally, the training a single-cell data integration model to be trained with the cell omics dataset to obtain the preset single-cell data integration model includes:
performing, based on a supervised learning algorithm, first training on the single-cell data integration model to be trained with the first training dataset to obtain an initialized single-cell data integration model;
performing, based on an unsupervised learning algorithm, second training on the initialized single-cell data integration model with the second training dataset, and updating the single-cell data integration model according to the learning result;
and repeatedly executing the first training and the second training to obtain the preset single-cell data integration model.
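The alternation between the first (supervised) training and the second (unsupervised) training can be illustrated with a deliberately small stand-in. The sketch below does not train the integration model itself; it substitutes a toy nearest-centroid model, and the function names, learning rate and synthetic data are all assumptions made purely for illustration.

```python
import numpy as np


def supervised_step(centroids, x, y, lr=0.5):
    """First training: move each class centroid toward the mean of its labeled cells."""
    new = centroids.copy()
    for k in range(len(centroids)):
        if np.any(y == k):
            new[k] += lr * (x[y == k].mean(axis=0) - new[k])
    return new


def unsupervised_step(centroids, x, lr=0.5):
    """Second training: assign unlabeled cells to the nearest centroid, then update."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)
    new = centroids.copy()
    for k in range(len(centroids)):
        if np.any(assign == k):
            new[k] += lr * (x[assign == k].mean(axis=0) - new[k])
    return new


def train(centroids, labeled, unlabeled, rounds=10):
    """Repeat the supervised and unsupervised steps in turn."""
    x_l, y_l = labeled
    for _ in range(rounds):
        centroids = supervised_step(centroids, x_l, y_l)
        centroids = unsupervised_step(centroids, unlabeled)
    return centroids


rng = np.random.default_rng(0)
x_l = np.array([[0.0, 0.0], [4.0, 4.0]])          # two annotated cells
y_l = np.array([0, 1])
x_u = np.concatenate([rng.normal(0, 0.3, (50, 2)),  # unannotated cells
                      rng.normal(4, 0.3, (50, 2))])
centroids = train(np.zeros((2, 2)), (x_l, y_l), x_u)
```

In this toy setting the few annotated cells break the symmetry between classes, and the unannotated cells then refine the class centroids, which is the general motivation for alternating the two kinds of training.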
Optionally, the obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result includes:
obtaining the reconstructed features corresponding to the omics data according to the modal feature generation result;
and integrating the reconstructed features in a shared embedding space to obtain the multi-omics integrated dataset of the single cell.
In addition, in order to achieve the above object, the present invention also provides a batch effect processing apparatus for multi-modal cell data, the apparatus comprising: the system comprises a memory, a processor and a batch effect processing program of the multi-modal cell data stored on the memory and capable of running on the processor, wherein the batch effect processing program of the multi-modal cell data is configured to realize the steps of the batch effect processing method of the multi-modal cell data.
In addition, in order to achieve the above object, the present invention also provides a storage medium, on which a batch effect processing program of multi-modal cell data is stored, which when executed by a processor, implements the steps of the batch effect processing method of multi-modal cell data as described above.
The invention obtains omics data of a single cell; inputs the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modal unit to obtain a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters; and obtains a multi-omics integrated dataset of the single cell according to the modal feature generation result. Because the preset single-cell data integration model comprises the self-scaling attention fusion module, the invention can adapt to omics data with diverse modalities to obtain fused features, and the hybrid decoding module then performs feature mapping to obtain the modal feature distribution parameters. The features of each modality in the omics data are thereby accurately resolved, the influence of batch effects in data integration is reduced, and a multi-omics integrated dataset of higher accuracy is obtained.
Drawings
FIG. 1 is a schematic diagram of a batch effect processing device for multi-modal cell data in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of a batch effect processing method for multi-modal cell data in accordance with the present invention;
FIG. 3 is a flow chart of a second embodiment of a batch effect processing method for multi-modal cell data in accordance with the present invention;
FIG. 4 is a diagram of a model structure of a pre-set single cell data integration model in a batch effect processing method of multi-modal cell data according to an embodiment of the present invention;
FIG. 5 is a flow chart of a third embodiment of a batch effect processing method for multi-modal cell data in accordance with the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a batch effect processing apparatus for multi-modal cell data in a hardware operating environment according to an embodiment of the present invention.
As shown in FIG. 1, the batch effect processing device for multi-modal cell data may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the configuration shown in FIG. 1 does not constitute a limitation of the batch effect processing apparatus for multi-modal cell data and may include more or fewer components than shown, or certain components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a batch effect processing program for multi-modal cell data.
In the batch effect processing device for multi-modal cell data shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 may be disposed in the batch effect processing device for multi-modal cell data, which invokes, through the processor 1001, the batch effect processing program for multi-modal cell data stored in the memory 1005 and executes the batch effect processing method for multi-modal cell data provided by the embodiments of the invention.
An embodiment of the present invention provides a batch effect processing method for multi-modal cell data, and referring to fig. 2, fig. 2 is a flow chart of a first embodiment of the batch effect processing method for multi-modal cell data.
In this embodiment, the batch effect processing method of the multi-modal cell data includes the following steps:
Step S10: obtaining omics data of a single cell.
It should be noted that, with single-cell omics technology, single-cell omics data of different modalities can be extracted from cells sampled from the same biological sample or tissue. Integrating these single-cell omics data establishes links between modalities, which supports a more comprehensive analysis of the underlying biology. However, because of differences in experimental environment, individuals or tissues, the omics data of cells exhibit batch effects during integrated analysis, which affects the accuracy of the integrated data.
It should be noted that most existing methods are limited to processing single-cell data of one specific omics type, that is, they correct batch effects for single-cell omics data of one modality only. However, a single cell may correspond to omics data of multiple modalities. Different modalities lack common features, and the features of different modalities lie in spaces of different dimensions, so the cell states of the omics data of each modal unit need to be aligned before single-cell multi-modal data integration.
It should be understood that this embodiment provides a batch effect processing method for multi-modal cell data, in which the omics data of a single cell are input into a preset single-cell data integration model to obtain the modal feature generation result of the corresponding omics data. Because the preset single-cell data integration model is provided with a self-scaling attention fusion module, the modal features under different modality combinations in the omics data can be fused adaptively to obtain a global feature, which is then decoded and output to obtain the modal feature distribution parameters of the omics data. Cell state alignment is thereby achieved and the omics data are integrated, which helps to obtain an accurate multi-omics integrated dataset.
It should be further noted that the method of this embodiment may be applied to batch effect removal for multi-modal omics data, or to cell alignment before single-cell multi-modal omics data integration. The execution subject of the method may be a computing service device with data processing, model invocation, data storage and program execution functions, such as a mobile phone, a personal computer or a cell analyzer, or another electronic device capable of implementing the same or similar functions, which is not limited in this embodiment. This embodiment and the following embodiments are described taking a batch effect processing device for multi-modal cell data (hereinafter referred to as the processing device) as an example.
It should be understood that the omics data of a single cell may include omics data of different modalities, such as chromatin accessibility (DNA level), gene expression (RNA level), and epitopes (protein level), among others.
Step S20: inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modal unit to obtain a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters.
It should be noted that the preset single-cell data integration model may be constructed based on a variational autoencoder (VAE). As a generative model, the VAE learns the latent form of the omics data in advance and thereby enables integration and analysis of the data. The VAE may include an encoder structure and a decoder structure: the encoder maps the input data to a latent space, the decoder maps the latent vectors back to the original space of the input data, and integrated data analysis is performed based on the feature distributions.
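The encoder and decoder mapping of a VAE can be sketched in a few lines. The sketch below is a generic VAE forward pass with linear maps, not the model of this embodiment; the weight matrices are randomly initialized placeholders, and all names and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_features, n_latent = 5, 20, 3

# Hypothetical, randomly initialized weights standing in for trained ones.
W_mu = rng.normal(0, 0.1, (n_features, n_latent))
W_logvar = rng.normal(0, 0.1, (n_features, n_latent))
W_dec = rng.normal(0, 0.1, (n_latent, n_features))


def encode(x):
    """Map input data to the mean and log-variance of a latent Gaussian."""
    return x @ W_mu, x @ W_logvar


def reparameterize(mu, logvar):
    """Draw a latent vector as mu + sigma * eps (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z):
    """Map latent vectors back to the original feature space."""
    return z @ W_dec


def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, sigma^2) and N(0, I), one value per cell."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)


x = rng.poisson(2.0, (n_cells, n_features)).astype(float)  # toy count matrix
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
kl = kl_to_standard_normal(mu, logvar)
```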
It should also be noted that the self-scaling attention mechanism allows attention to be computed between each position in a modal feature and all other positions, and additionally introduces a scaling factor to adjust the magnitude of the attention weights. In this embodiment, the omics data may be a combination of data of different modal units, so the self-scaling attention fusion module included in the preset single-cell data integration model can adaptively fuse the modal features of different modality combinations based on the attention weights to obtain a global feature. This global feature is a low-dimensional representation of the cell state of the single cell.
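One possible reading of attention-weighted fusion over a variable set of modalities is sketched below. The specification does not give the exact formula, so the scaled dot-product scoring, the `query` vector, the default scale of the square root of the feature dimension, and the masking of absent modalities are all assumptions chosen for illustration.

```python
import numpy as np


def attention_fuse(modal_feats, present, query, scale=None):
    """Fuse modal features into one global feature.

    modal_feats: (n_modalities, d) array, rows of absent modalities set to zero
    present:     (n_modalities,) boolean mask of modalities measured for this cell
    query:       (d,) hypothetical learned query vector
    scale:       scaling factor applied to the scores (defaults to sqrt(d))
    """
    d = modal_feats.shape[1]
    scale = np.sqrt(d) if scale is None else scale
    scores = modal_feats @ query / scale
    scores = np.where(present, scores, -np.inf)     # absent modalities get zero weight
    weights = np.exp(scores - scores[present].max())
    weights = weights / weights.sum()
    return weights @ modal_feats, weights


rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4))          # e.g. DNA, RNA and protein features
present = np.array([True, False, True])      # this cell has no RNA measurement
feats[~present] = 0.0
query = rng.standard_normal(4)

global_feature, weights = attention_fuse(feats, present, query)
```

Because the weights are renormalized over the modalities that are actually present, the same function handles cells measured with different modality combinations.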
It should be appreciated that the global feature can be parameterized by a mixture distribution composed of a discrete variable and a continuous variable. The discrete variable can represent the different cell state categories, while the continuous variable can represent the degree of variation within a cell state category or other relevant information.
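As an illustration of such a mixture parameterization, the sketch below assumes a Gaussian mixture: a categorical variable selects the cell state category and a Gaussian variable describes variation within that category. The choice of Gaussian components, the variable names and all numeric values are assumptions, since the specification does not fix the form of the distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_latent = 3, 2

# Hypothetical mixture parameters: category probabilities, per-category means and spreads.
pi = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
stds = np.full((n_states, n_latent), 0.5)


def sample_global_feature(n):
    """Sample a discrete cell state, then a continuous offset within that state."""
    state = rng.choice(n_states, size=n, p=pi)       # discrete variable
    offset = rng.standard_normal((n, n_latent))      # continuous variable
    return state, means[state] + stds[state] * offset


state, z = sample_global_feature(1000)
```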
It can be understood that the self-scaling attention fusion module effectively extracts and models the cell state from the omics data input to the model, and encodes and fuses the information of the different modalities into a low-dimensional global feature, which provides a more effective feature representation for subsequent tasks.
It should be noted that, after the global feature is obtained, in order to integrate the omics data with the model, the global feature may be input to the hybrid decoding module and combined with the prior feature and the batch distribution information that are also input to the hybrid decoding module, so as to obtain the modal feature distribution parameters corresponding to the omics data and thus a richer modal feature generation result.
Step S30: obtaining a multi-omics integrated dataset of the single cell according to the modal feature generation result.
It should be noted that the processing device may obtain the reconstructed features corresponding to the omics data according to the modal feature generation result, and integrate the reconstructed features in a shared embedding space to obtain the multi-omics integrated dataset of the single cell.
It can be understood that the modal feature generation result may include the modal feature distribution parameters corresponding to the omics data. The hybrid decoding module can map the overall expression feature, obtained by fusing the global feature with the prior feature and the batch distribution information, to the shared embedding space (common embedding space), thereby realizing a mapping to a low-dimensional space. The reconstructed omics features of the single cell can therefore be obtained in the low-dimensional space, the influence of batch effects can be removed, and the (multiple) omics data of the single cell are integrated to obtain the multi-omics integrated dataset.
This embodiment obtains omics data of a single cell; inputs the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being used to fuse the modal features corresponding to the omics data of each modal unit to obtain a global feature and to input the global feature into the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters; and obtains a multi-omics integrated dataset of the single cell according to the modal feature generation result. Because the preset single-cell data integration model in this embodiment comprises the self-scaling attention fusion module, it can adapt to omics data with diverse modalities to obtain fused features, and the hybrid decoding module then performs feature mapping to obtain the modal feature distribution parameters. The features of each modality in the omics data are thereby accurately resolved, the influence of batch effects in data integration is reduced, and a multi-omics integrated dataset of higher accuracy is obtained.
Referring to FIG. 3, FIG. 3 is a flow chart illustrating a second embodiment of a batch effect processing method for multi-modal cell data according to the present invention.
Based on the above embodiment, in order to further describe how the preset single-cell data integration model processes the input data, in this embodiment the preset single-cell data integration model further includes a modal encoding module, a graph encoding module and a discriminator module, the discriminator module being disposed between the self-scaling attention fusion module and the hybrid decoding module. Accordingly, step S20 includes:
Step S201: performing feature extraction on the omics data through the modal encoding module to obtain the modal features of each modal unit.
It should be noted that the modal encoding module may be the encoder structure in the VAE architecture. Since the omics data of a single cell include a plurality of modal units, a corresponding encoder may be provided for each modal unit. Extracting features from the data of each modal unit in the omics data with these different encoders yields the corresponding modal features, which helps to better express the modal information of each modal unit in the omics data.
Step S202: performing feature fusion on the modal features through the self-scaling attention fusion module to obtain a global feature.
It should be noted that, in order to model the cell state of a single cell, feature fusion may be performed on the modal features of each modal unit. The single cells input into the model may differ, and the composition of modal units in the corresponding omics data may also differ. For example, the omics data may include omics data of the RNA modality and the protein modality, or omics data of the DNA modality and the protein modality. Therefore, the modal features of each modal unit can be fused adaptively based on the adaptive attention mechanism to obtain the global feature.
Step S203: performing batch distribution coordination on the global feature through the discriminator module to obtain batch distribution information.
It will be appreciated that, in order to adjust the distribution of the global features across different batches, the parameterized global features may be input to the discriminator. A batch here is a portion of samples randomly drawn from the input omics data, and the global features are the low-dimensional representations of these drawn samples. Passing the global features through the discriminator makes the reconstructed features output by the model more realistic.
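A common way to coordinate latent distributions across batches is adversarial training against a batch classifier. The sketch below assumes that formulation (a softmax discriminator with a cross-entropy loss, and an encoder objective that is the negative of that loss); the specification does not state the discriminator's loss, so this is an assumed stand-in, and all names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def discriminator_loss(z, batch_ids, W, b):
    """Cross-entropy of predicting each cell's batch from its global feature."""
    probs = softmax(z @ W + b)
    return -np.mean(np.log(probs[np.arange(len(z)), batch_ids]))


rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4))                   # global features of 8 cells
batch_ids = np.array([0, 0, 1, 1, 2, 2, 0, 1])    # batch label of each cell
W = np.zeros((4, 3))                              # untrained discriminator weights
b = np.zeros(3)

d_loss = discriminator_loss(z, batch_ids, W, b)   # minimized by the discriminator
encoder_adv_loss = -d_loss                        # minimized by the encoder side
```

When the discriminator cannot tell the batches apart, its predictions are uniform and the loss equals the logarithm of the number of batches, which is the state the encoder side pushes toward.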
Step S204: extracting graph features of a preset guidance graph through the graph encoding module to obtain a prior feature.
It should be noted that the graph encoding module may be a graph encoder, which performs graph encoding on a guidance graph carrying prior modal knowledge and converts it into feature vectors to obtain the prior feature.
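Graph encoding of a guidance graph can be illustrated with a single graph-convolution layer. The specification does not name the graph encoder architecture, so the symmetric-normalized graph convolution below, the three-node toy graph and the random weights are assumptions for illustration only.

```python
import numpy as np


def gcn_layer(adj, node_feats, weight):
    """One graph-convolution layer: normalize the adjacency, propagate, apply ReLU."""
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ node_feats @ weight, 0.0)


# Toy guidance graph with three nodes; an edge encodes a known prior relation
# between two features (for example, between features of different modalities).
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
node_feats = np.eye(3)                                 # one-hot node identities
rng = np.random.default_rng(0)
weight = rng.normal(0, 0.5, (3, 2))                    # hypothetical layer weights

V = gcn_layer(adj, node_feats, weight)                 # one prior feature vector per node
```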
Step S205: performing feature mapping and fusion output on the prior feature, the global feature and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data.
The hybrid decoding module may be the decoder structure in the VAE architecture, and specifically may be a hybrid decoder structure made up of a plurality of modality-specific decoders, which generates and outputs the specific expression features of each modal unit.
It should be further noted that, in order to make full use of the prior knowledge and remove the influence of batch effects, the hybrid decoding module may be divided into a hybrid fusion sub-module and a decoding sub-module to further illustrate its function, so step S205 may include:
Step S2051: mapping and fusing the prior feature, the global feature and the batch distribution information on a per-modal-unit basis through the hybrid fusion sub-module to obtain an overall expression feature.
It should be noted that the prior feature may include information such as the correlations and prior distributions between the data of the modal units in the omics data. Establishing a mapping relationship between the global feature and the prior feature makes it easier to learn the modal feature representations of the different modal units. In addition, to further remove the influence of batch effects, the distribution of each modal unit in the omics data can be mapped together with the prior feature and the global feature, which introduces information about how the data of each modal unit are distributed within a batch.
Step S2052: performing integrated generation on the overall expression feature through the decoding sub-module to obtain the modal feature distribution parameters corresponding to the omics data.
In a specific implementation, the prior feature, the global feature and the batch distribution information are distribution-mapped and then fused to generate an overall expression feature covering the modal features of every modal unit. The modality-specific hybrid decoder structures can then use the association information between modalities to generate the distribution of the features of each modality, namely the modal feature distribution parameters. The omics data can thus be reconstructed according to the modal feature distribution parameters, and the reconstructed omics features are obtained in a low-dimensional space.
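To make the data flow of the hybrid decoder concrete, the sketch below combines a global feature, per-modality prior features and batch-specific terms into the mean parameters of a count distribution for each modality. The inner-product form, the batch scale and bias terms, the softmax normalization and the library-size scaling are all assumptions made for illustration; the specification does not disclose the decoder's exact form.

```python
import numpy as np


def decode_modality(z, V_m, batch_ids, batch_scale, batch_bias, library_size):
    """Return per-cell, per-feature mean parameters for one modality.

    z:            (cells, d) global features
    V_m:          (features, d) prior features of this modality's features
    batch_scale:  (batches, features) hypothetical batch-specific scale
    batch_bias:   (batches, features) hypothetical batch-specific bias
    library_size: (cells,) total count per cell in this modality
    """
    logits = z @ V_m.T                                  # map z onto the prior features
    logits = batch_scale[batch_ids] * logits + batch_bias[batch_ids]
    shifted = logits - logits.max(axis=1, keepdims=True)
    frac = np.exp(shifted)
    frac /= frac.sum(axis=1, keepdims=True)             # expected fraction per feature
    return frac * library_size[:, None]


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3))
batch_ids = np.array([0, 0, 1, 1])
V = {"rna": rng.standard_normal((6, 3)), "protein": rng.standard_normal((2, 3))}
lib = {"rna": np.array([100.0, 80.0, 120.0, 90.0]),
       "protein": np.array([30.0, 25.0, 40.0, 35.0])}

params = {}
for name, V_m in V.items():                             # one decoder per modality
    n_feat = V_m.shape[0]
    params[name] = decode_modality(z, V_m, batch_ids,
                                   np.ones((2, n_feat)), np.zeros((2, n_feat)),
                                   lib[name])
```

Each entry of `params` could then serve as the mean of a per-feature distribution from which the data of that modality are reconstructed.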
In addition, reference may be made to fig. 4, and fig. 4 is a schematic diagram of a model structure of a preset single cell data integration model in the batch effect processing method of multi-modal cell data according to an embodiment of the invention.
In FIG. 4, the Encoder (Encoder) in the Encoder module of the graph has data on the histology of individual mode units of a single cell) Extracting features to obtain corresponding modal features; then, attention-based feature Fusion (Attention Fusion) is carried out in a self-scaling Attention module to obtain global feature +.>And by the fact that the variable ∈ ->And the continuous variable->The composition mixed distribution is parameterized and then is input into a Discriminator module, and the Discriminator (Discriminator) performs distribution coordination among different batches to obtain batch distribution information +.>. Meanwhile, the image encoder (image encoder) in the image encoding module hasInstruction graph with a priori knowledge->And (5) carrying out graph coding, and converting to obtain a feature vector to obtain a priori feature V.
It should be noted that the graph encoder may be paired with a corresponding graph decoder (Graph Decoder). The hybrid decoder (Hybrid Decoder) in the hybrid decoding module finally maps the global feature, the batch distribution information and the prior feature V to the generated feature distribution of each modality, that is, the modal feature distribution parameters. The omics data are then reconstructed from these parameters, so that the reconstruction (Reconstruction) of the omics features is achieved in a low-dimensional space.
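The attention-based fusion step of FIG. 4 can be sketched as follows. This is a generic scaled dot-product attention over the per-modality feature vectors, offered only as an assumption about what such a fusion might look like; the source does not define the internals of the self-scaling attention fusion module, and the names `softmax`, `attention_fuse` and the learned `query` vector are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(modal_features, query):
    """Fuse per-modality feature vectors into one global feature vector.

    Each modality is scored by its dot product with `query`, scaled by
    sqrt(dim); the global feature is the attention-weighted sum.
    """
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in modal_features
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, modal_features))
        for i in range(dim)
    ]
    return fused, weights
```

Because the weights sum to one, the global feature stays on the same scale as the modal features regardless of how many modalities are present, which is one reason attention weighting is a common choice for fusing a variable number of inputs.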
In this embodiment, within the preset single-cell data integration model, the modal encoding module extracts features from the omics data to obtain the modal features of each modal unit; the self-scaling attention fusion module fuses the modal features to obtain the global feature; the discriminator module coordinates the batch distributions of the global feature to obtain the batch distribution information; the graph encoding module extracts graph features from a preset guidance graph to obtain the prior feature; and the hybrid decoding module performs feature mapping and fused output on the prior feature, the global feature and the batch distribution information to obtain the modal feature distribution parameters corresponding to the omics data. The method thus uses a self-scaling attention mechanism to encode, fuse and decode the omics data of each modal unit and to output the probability distribution information of each modal feature, which helps to reconstruct the omics data more faithfully and to integrate and analyze multi-modal omics data.
Referring to FIG. 5, FIG. 5 is a flow chart of a third embodiment of the batch effect processing method for multi-modal cell data according to the present invention.
Based on the above embodiments, a semi-supervised learning mode may be used for model training in order to obtain a model with higher accuracy and robustness. Therefore, in this embodiment, the batch effect processing method for multi-modal cell data of the present invention further includes:
Step S01: collecting multi-modal cell omics data and converting the multi-modal cell omics data into raster data.
It will be appreciated that cell omics data differ in experimental environment, individual, tissue or species, and that data from different sequencing technologies have mismatched patterns. The collected cell omics data of each modality may therefore be converted into raster data, that is, data with a grid structure, for representation.
It should be noted that the raster data may be a mosaic dataset. In cell omics, mosaic datasets are widely used to integrate cell image data from different sequencing technologies, experiments or data sources, because a mosaic dataset can stitch the image data into a continuous, accurate image mosaic. With a mosaic dataset, cell image data can be retrieved and analyzed on demand, which makes it easy to view and compare cell structure and morphological features across cell samples.
Step S02: a cytological dataset is constructed based on the raster data.
It should be noted that the mosaicdataset may store not only the original image data, but also additional attribute data such as annotation information of cell type, marker information, etc. Therefore, when the cytomic data set is constructed, whether annotation information exists in each unit data in the raster data can be judged first; then taking each unit data with annotation information as a first training data set and taking each unit data without annotation information as a second training data set; and finally, constructing a cytological data set according to the first training data set and the second training data set.
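The split described above can be sketched in a few lines of Python. The record layout (a dictionary per unit with an optional `"annotation"` key) and the function name `build_cell_omics_dataset` are assumptions for illustration only; the source does not define how unit data or annotations are stored.

```python
def build_cell_omics_dataset(units):
    """Split raster unit records by whether they carry annotation information.

    `units` is a list of dicts; a unit counts as annotated when its
    "annotation" entry is present and not None (hypothetical layout).
    """
    first_training_set = [u for u in units if u.get("annotation") is not None]
    second_training_set = [u for u in units if u.get("annotation") is None]
    return {"first": first_training_set, "second": second_training_set}
```

Keeping both subsets in one returned structure mirrors the text: the dataset is a single object, while the two training stages each read only their own subset.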
Step S03: training the single cell data integration model to be trained through the cytology data set to obtain a preset single cell data integration model.
It will be appreciated that, since the cytology dataset includes part of the annotated good unit data, step 03 further comprises, for training the single cell data integration model:
step S031: and based on a supervised learning algorithm, performing first training on the single-cell data integration model to be trained through the first training data set to obtain an initialized single-cell data integration model.
Step S032: and based on an unsupervised learning algorithm, performing second training on the initialized single-cell data integration model through the second training data set, and updating the single-cell data integration model according to a learning result.
Step S033: and repeatedly executing the first training and the second training to obtain a preset single-cell data integration model.
In a specific implementation, a supervised learning algorithm is first used to train the model on the annotated unit data in the first training dataset, which initializes the single-cell data integration model. The unannotated unit data in the second training dataset are then used to adjust the model, which updates the single-cell data integration model. This training process is repeated, and the model parameters are optimized and updated, until a preset stopping condition is reached (such as a maximum number of iterations or a certain performance index), giving the preset single-cell data integration model.
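The alternation of the first and second training can be illustrated with a deliberately small toy model. In the sketch below the "model" is just one scalar centroid per cell type: the supervised step moves each centroid toward the mean of its annotated points, and the unsupervised step assigns unannotated points to the nearest centroid and moves the centroids again. The toy model, the learning rate, the loss, the tolerance and every name here are assumptions chosen so that the loop is runnable; they are not the training procedure of the patented model.

```python
def _update(centroids, groups, lr):
    """Move each centroid toward the mean of its group; return the summed loss."""
    loss = 0.0
    for label, points in groups.items():
        if not points:
            continue
        mean = sum(points) / len(points)
        loss += sum((x - centroids[label]) ** 2 for x in points) / len(points)
        centroids[label] += lr * (mean - centroids[label])
    return loss


def supervised_step(centroids, labeled, lr=0.5):
    """First training: group annotated points by their given label."""
    groups = {label: [x for x, y in labeled if y == label] for label in centroids}
    return _update(centroids, groups, lr)


def unsupervised_step(centroids, unlabeled, lr=0.5):
    """Second training: group unannotated points by nearest centroid."""
    groups = {label: [] for label in centroids}
    for x in unlabeled:
        nearest = min(centroids, key=lambda k: abs(x - centroids[k]))
        groups[nearest].append(x)
    return _update(centroids, groups, lr)


def train(centroids, labeled, unlabeled, max_rounds=50, tol=1e-6):
    """Repeat both steps until the loss stops changing or max_rounds is hit."""
    previous, rounds = None, 0
    while rounds < max_rounds:
        total = supervised_step(centroids, labeled) + unsupervised_step(centroids, unlabeled)
        rounds += 1
        if previous is not None and abs(previous - total) < tol:
            break
        previous = total
    return centroids, rounds
```

The two stopping conditions in `train` correspond to the examples given in the text: `max_rounds` is the maximum iteration count, and the loss-change tolerance `tol` stands in for a performance index.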
In this embodiment, a cell omics dataset is constructed and divided according to whether annotation is present, and the model is trained with the annotated unit data and the unannotated unit data respectively. This realizes model training in a semi-supervised learning mode, makes full use of the annotation information in the multi-modal cell omics data, improves the accuracy and reliability of the trained model, and in turn improves the efficiency of subsequent multi-omics data integration.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a batch effect processing program of the multi-modal cell data, and the batch effect processing program of the multi-modal cell data realizes the steps of the batch effect processing method of the multi-modal cell data when being executed by a processor.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A method for batch effect processing of multi-modal cell data, the method comprising:
obtaining omics data of a single cell;
inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data, wherein the preset single-cell data integration model comprises a self-scaling attention fusion module and a hybrid decoding module, the self-scaling attention fusion module being configured to fuse the modal features corresponding to the omics data of each modal unit to obtain a global feature and to input the global feature to the hybrid decoding module, so that the hybrid decoding module performs feature mapping to obtain modal feature distribution parameters;
obtaining multi-omics integrated data of the single cell according to the modal feature generation result.
2. The batch effect processing method of multi-modal cell data as claimed in claim 1, wherein the preset single-cell data integration model is constructed based on a variational autoencoder, the preset single-cell data integration model further comprising: a modal encoding module, a graph encoding module and a discriminator module, wherein the discriminator module is arranged between the self-scaling attention fusion module and the hybrid decoding module.
3. The batch effect processing method of multi-modal cell data as claimed in claim 2, wherein the inputting the omics data into a preset single-cell data integration model to obtain a modal feature generation result corresponding to the omics data includes:
extracting features from the omics data through the modal encoding module to obtain the modal features of each modal unit;
performing feature fusion on the modal features through the self-scaling attention fusion module to obtain global features;
performing batch distribution coordination on the global features through the discriminator module to obtain batch distribution information;
extracting graph features of a preset guidance graph through the graph encoding module to obtain prior features;
performing feature mapping and fused output on the prior features, the global features and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data.
4. The batch effect processing method of multi-modal cell data as claimed in claim 3, wherein the hybrid decoding module includes: a hybrid fusion sub-module and a decoding sub-module; and the performing feature mapping and fused output on the prior features, the global features and the batch distribution information through the hybrid decoding module to obtain the modal feature distribution parameters corresponding to the omics data includes:
mapping and fusing the prior features, the global features and the batch distribution information on a per-modal-unit basis through the hybrid fusion sub-module to obtain an overall expression feature;
performing integrated generation on the overall expression feature through the decoding sub-module to obtain the modal feature distribution parameters corresponding to the omics data.
5. The batch effect processing method of multi-modal cell data as claimed in claim 1, wherein the preset single-cell data integration model is trained using semi-supervised learning, the method further comprising:
collecting multi-modal cell omics data and converting the multi-modal cell omics data into raster data;
constructing a cell omics dataset based on the raster data;
training the single-cell data integration model to be trained with the cell omics dataset to obtain the preset single-cell data integration model.
6. The batch effect processing method of multi-modal cell data as claimed in claim 5, wherein the constructing a cell omics dataset based on the raster data includes:
determining whether annotation information exists in each unit of data in the raster data;
taking the unit data with annotation information as a first training dataset and the unit data without annotation information as a second training dataset;
constructing the cell omics dataset from the first training dataset and the second training dataset.
7. The batch effect processing method of multi-modal cell data according to claim 6, wherein the training the single-cell data integration model to be trained with the cell omics dataset to obtain the preset single-cell data integration model includes:
based on a supervised learning algorithm, performing first training on the single-cell data integration model to be trained through the first training data set to obtain an initialized single-cell data integration model;
based on an unsupervised learning algorithm, performing second training on the initialized single-cell data integration model through the second training data set, and updating the single-cell data integration model according to a learning result;
and repeatedly executing the first training and the second training to obtain a preset single-cell data integration model.
8. The batch effect processing method of multi-modal cell data as claimed in claim 1, wherein the obtaining multi-omics integrated data of the single cell according to the modal feature generation result further comprises:
obtaining each reconstructed feature of the corresponding omics data according to the modal feature generation result;
performing data integration on the reconstructed features in a shared embedding space to obtain a multi-omics integrated dataset of the single cell.
9. A batch effect processing apparatus for multimodal cell data, the apparatus comprising: a memory, a processor and a batch effect processing program of multi-modal cell data stored on the memory and executable on the processor, the batch effect processing program of multi-modal cell data configured to implement the steps of the batch effect processing method of multi-modal cell data as claimed in any one of claims 1 to 8.
10. A storage medium, wherein a batch effect processing program of multi-modal cell data is stored on the storage medium, and the batch effect processing program of multi-modal cell data, when executed by a processor, implements the steps of the batch effect processing method of multi-modal cell data as claimed in any one of claims 1 to 8.
CN202410259150.8A 2024-03-07 2024-03-07 Batch effect processing method, equipment and storage medium for multi-mode cell data Active CN117854599B (en)

Publications (2)

Publication Number Publication Date
CN117854599A true CN117854599A (en) 2024-04-09
CN117854599B CN117854599B (en) 2024-05-28

Family

ID=90540466


Country Status (1)

Country Link
CN (1) CN117854599B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822224A (en) * 2021-10-12 2021-12-21 中国人民解放军国防科技大学 Rumor detection method and device integrating multi-modal learning and multi-granularity structure learning
CN114187969A (en) * 2021-11-19 2022-03-15 厦门大学 Deep learning method and system for processing single-cell multi-modal omics data
CN114819056A (en) * 2022-03-16 2022-07-29 西北工业大学 Single cell data integration method based on domain confrontation and variation inference
CN115346602A (en) * 2022-07-14 2022-11-15 西北工业大学 Data analysis method and device
CN115732034A (en) * 2022-11-17 2023-03-03 山东大学 Identification method and system of spatial transcriptome cell expression pattern
US20230139567A1 (en) * 2020-03-09 2023-05-04 Pioneer Hi-Bred International, Inc. Multi-modal methods and systems
CN116580848A (en) * 2023-05-15 2023-08-11 湖南大学 Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
CN116629123A (en) * 2023-05-25 2023-08-22 南开大学 Pairing-based single-cell multi-group data integration method and system
CN116758397A (en) * 2023-06-27 2023-09-15 华东师范大学 Single-mode induced multi-mode pre-training method and system based on deep learning


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SAYED HASHIM ET AL.: "SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification", 《2022 10TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND COMPUTATIONAL BIOLOGY (ICBCB)》, 13 May 2022 (2022-05-13), pages 66 - 72, XP034139382, DOI: 10.1109/ICBCB55259.2022.9802478 *
YOULIN ZHAN ET AL.: "scMIC: A Deep Multi-Level Information Fusion Framework for Clustering Single-Cell Multi-Omics Data", 《IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS》, vol. 27, no. 12, 19 September 2023 (2023-09-19), pages 6121 *
宁念文: "基于表示学习的多层网络融合关键技术的研究", 《中国博士学位论文全文数据库 信息科技辑》, no. 01, 15 January 2022 (2022-01-15), pages 1 - 154 *

Also Published As

Publication number Publication date
CN117854599B (en) 2024-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant