CN117037917A - Cell type prediction model training method, cell type prediction method and device - Google Patents

Publication number
CN117037917A
Authority
CN
China
Prior art keywords
feature
cell
tissue
data
cell type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211216020.3A
Other languages
Chinese (zh)
Inventor
杨帆
王芳
姚建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A cell type prediction model training method, a cell type prediction method, and corresponding devices are provided, relating to the field of machine learning in artificial intelligence. The model training method comprises the following steps: acquiring first proteome data of clinical sample tissues and second proteome data of single cells of multiple cell types; inputting the first proteome data into a feature encoder to obtain a first tissue feature; inputting the second proteome data into the feature encoder to obtain first single-cell features; carrying out weighted summation on the first single-cell features corresponding to the multiple cell types to obtain a first weighted feature; inputting the first weighted feature into a predictor to obtain cell proportions of the multiple cell types; and updating parameters of the predictor and the feature encoder according to the first tissue feature, the first weighted feature and the cell proportions to obtain a cell type prediction model. The application can analyze real tissue proteome data with reference to single-cell proteome data and reveal the tumor microenvironment at the protein level.

Description

Cell type prediction model training method, cell type prediction method and device
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a cell type prediction model training method, a cell type prediction method and a cell type prediction device.
Background
The tissues of the human body contain many cell types. Tumor tissue, for example, contains tumor cells, interstitial tissue, extracellular matrix and immune cells, which together constitute a complex tumor microenvironment. Conventional tissue sequencing lyses and sequences all the cells in a tumor tissue together and therefore measures the average expression across all cells of the tissue; such data can hardly reflect the heterogeneity of the tumor microenvironment. With the development of single-cell sequencing technology, researchers can separate single cells from tissues for analysis. Since a tissue is composed of cells of different types in certain proportions, in theory, once the expression profile of each cell type is known, the proportion of each cell type can be obtained by solving a system of linear equations against the average expression profile of the whole tissue. In reality, however, the single cells within one cell type do not share exactly the same expression profile; their profiles follow a non-uniform distribution affected by cell cycle, cell state, microenvironment and sequencing noise.
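The idealized linear case described above can be illustrated with a small numerical sketch. The signature matrix, the number of proteins and the true proportions below are made-up values chosen only for illustration, and ordinary least squares is used as one simple way of solving the linear system; it is not the method of the application.

```python
import numpy as np

# Hypothetical reference profiles: rows are 4 measured features (e.g. proteins),
# columns are 3 cell types. All values are invented for illustration.
signatures = np.array([
    [10.0, 1.0, 0.5],
    [0.5, 8.0, 1.0],
    [1.0, 0.5, 6.0],
    [2.0, 2.0, 2.0],
])

# Assumed ground-truth proportions of the 3 cell types in the tissue.
true_props = np.array([0.6, 0.3, 0.1])

# Idealized bulk profile: a noise-free weighted average of the cell-type profiles.
bulk = signatures @ true_props

# Solve signatures @ p = bulk for p in the least-squares sense.
est, *_ = np.linalg.lstsq(signatures, bulk, rcond=None)

# Clip tiny negative values and renormalize so the result reads as proportions.
est = np.clip(est, 0.0, None)
est = est / est.sum()
```

With noise-free data the estimate matches the assumed proportions. Once cells of the same type differ from one another and measurement noise is present, as the paragraph above notes, a single fixed signature per cell type no longer describes the data well.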
As single-cell sequencing technology matures and spreads, single-cell data sets can be used as a reference to re-mine the cell types contained in previously reported clinical tissues. The proportions of these cell types reflect the microenvironment of tumor tissue, provide a reference for immunotherapy, and are of clinical significance. With the recent development of single-cell proteomics, a certain amount of single-cell proteome data has been generated. How to re-analyze currently published disease tissue proteome data on the basis of single-cell proteome data remains a problem to be solved.
Disclosure of Invention
The application provides a cell type prediction model training method, a cell type prediction method and a cell type prediction device, which can analyze real tissue proteome data with reference to single-cell proteome data and reveal the tumor microenvironment at the protein level.
In a first aspect, an embodiment of the present application provides a cell type prediction model training method, including:
acquiring first proteome data of clinical sample tissues and second proteome data of single cells of multiple cell types;
inputting the first proteome data into a feature encoder to obtain a first tissue feature;
inputting the second proteome data corresponding to the cell types into the feature encoder to obtain first single cell features corresponding to the cell types;
carrying out weighted summation on the first single cell features corresponding to the multiple cell types to obtain a first weighted feature;
inputting the first weighted feature into a predictor to obtain cell ratios of the plurality of cell types;
and updating parameters of the predictor and the feature encoder according to the first tissue feature, the first weighting feature and the cell proportion to obtain a cell type prediction model, wherein the cell type prediction model comprises the feature encoder and the predictor.
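The data flow of the steps above can be sketched as a forward pass. This is a minimal sketch under stated assumptions, not the network of the application: the encoder and predictor are each reduced to a single layer, the dimensions are arbitrary, the data is random, one single cell per cell type is used, and the names `W_enc`, `W_pred`, `feature_encoder` and `predictor` are hypothetical. The parameter update itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proteins, n_hidden, n_types = 50, 16, 4  # arbitrary sizes for illustration

# Hypothetical parameters of the shared feature encoder and of the predictor.
W_enc = rng.normal(scale=0.1, size=(n_proteins, n_hidden))
W_pred = rng.normal(scale=0.1, size=(n_hidden, n_types))

def feature_encoder(x):
    """Shared encoder: one linear layer followed by ReLU (assumed architecture)."""
    return np.maximum(x @ W_enc, 0.0)

def predictor(z):
    """Linear layer followed by softmax, so the outputs sum to 1 like proportions."""
    logits = z @ W_pred
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

# First proteome data: 8 tissue samples. Second proteome data: one cell per type.
tissue_data = rng.random((8, n_proteins))
single_cell_data = rng.random((n_types, n_proteins))

# Both inputs pass through the same encoder.
tissue_feat = feature_encoder(tissue_data)      # shape (8, n_hidden)
cell_feat = feature_encoder(single_cell_data)   # shape (n_types, n_hidden)

# Weighted summation of single-cell features with randomly sampled proportions.
weights = rng.dirichlet(np.ones(n_types))       # non-negative, sums to 1
weighted_feat = weights @ cell_feat             # shape (n_hidden,)

# The predictor maps the weighted feature to predicted cell proportions.
pred_props = predictor(weighted_feat)           # shape (n_types,)
```

In this sketch the mixing happens in feature space, after encoding, and `weights` would serve as the supervision signal for `pred_props`, while `tissue_feat` and `weighted_feat` are the quantities the update step compares.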
In a second aspect, embodiments of the present application provide a cell type prediction method, including:
acquiring proteome data of clinical sample tissues;
inputting the proteome data into a feature encoder in a cell type prediction model to obtain tissue features;
inputting the tissue characteristics into a predictor in the cell type prediction model to obtain a cell type proportion predicted value of the clinical sample tissue;
wherein the cell type predictive model is trained according to the method of the first aspect.
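The prediction method of the second aspect reduces to two calls at inference time. The sketch below assumes, purely for illustration, a single-layer encoder and a single-layer softmax predictor; `predict_cell_proportions`, `W_enc` and `W_pred` are hypothetical names, and the random weights merely stand in for parameters that would come from training.

```python
import numpy as np

def predict_cell_proportions(tissue_proteome, W_enc, W_pred):
    """Encode tissue proteome data, then predict cell type proportions."""
    # Feature encoder (assumed form: linear layer + ReLU) yields tissue features.
    tissue_feat = np.maximum(tissue_proteome @ W_enc, 0.0)
    # Predictor (assumed form: linear layer + softmax) yields proportions.
    logits = tissue_feat @ W_pred
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
W_enc = rng.normal(scale=0.1, size=(50, 16))   # stand-in for trained encoder weights
W_pred = rng.normal(scale=0.1, size=(16, 4))   # stand-in for trained predictor weights
tissue_proteome = rng.random((3, 50))          # 3 tissue samples, 50 proteins

proportions = predict_cell_proportions(tissue_proteome, W_enc, W_pred)
```

Each row of `proportions` is non-negative and sums to 1, one row per tissue sample. No single-cell data is needed at this stage.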
In a third aspect, an embodiment of the present application provides a cell type prediction model training apparatus, including:
an acquisition unit for acquiring first proteome data of a clinical sample tissue and second proteome data of single cells of a plurality of cell types;
a feature encoder for receiving the first proteome data as input to obtain a first tissue feature;
the feature encoder being further used for receiving the second proteome data corresponding to the respective cell types as input to obtain first single cell features corresponding to the respective cell types;
a summing unit for carrying out weighted summation on the first single cell features corresponding to the plurality of cell types to obtain a first weighted feature;
a predictor for receiving the first weighted feature as input to obtain cell proportions of the plurality of cell types;
and the training unit is used for updating parameters of the predictor and the feature encoder according to the first tissue features, the first weighting features and the cell proportion to obtain a cell type prediction model, wherein the cell type prediction model comprises the feature encoder and the predictor.
In a fourth aspect, an embodiment of the present application provides a cell type prediction apparatus, including:
an acquisition unit for acquiring proteome data of a clinical sample tissue;
a cell type prediction model comprising a feature encoder and a predictor; the cell type predictive model is trained according to the method of the first aspect;
the feature encoder is used for receiving the proteome data as input to obtain tissue features;
the predictor is used for receiving the tissue features as input to obtain a predicted cell type proportion for the clinical sample tissue.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory for performing the method as in the first or second aspect.
In a sixth aspect, embodiments of the application provide a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform a method as in the first or second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product comprising computer program instructions for causing a computer to perform the method as in the first or second aspect.
In an eighth aspect, embodiments of the present application provide a computer program that causes a computer to perform the method as in the first or second aspect.
With the above technical solution, single-cell proteome data and real tissue proteome data share one feature encoder for feature representation. The single-cell features corresponding to single-cell proteomes of different cell types are weighted and summed to obtain weighted features, and the parameters of the predictor and the feature encoder are updated according to the weighted features and the tissue features corresponding to the real tissue proteome, yielding a cell type prediction model that comprises the trained feature encoder and predictor. With this model, the proportions of the cell types contained in a real tissue can be predicted from real tissue proteome data. The embodiment of the application can therefore analyze real tissue proteome data with reference to single-cell proteome data, realize a proteome deconvolution algorithm, reveal the tumor microenvironment at the protein level, and open up the possibility of new clinical findings.
In the embodiment of the application, the single-cell features of different cell types are weighted and summed, which produces the simulated data in the manner of data augmentation. This achieves the effect of combining cells of different types, improves data utilization efficiency, and avoids the separate step of generating a large-scale simulated data set. Meanwhile, updating the model with both the weighted features obtained through data augmentation and the tissue features corresponding to the real tissue proteome yields a feature encoder that generalizes, improving the generalization capability of the model on real data.
Furthermore, the embodiment of the application can make full use of previously reported proteome data of valuable clinical sample tissues, analyze cell types at the protein level, mine the tumor microenvironment, advance the understanding of disease mechanisms, and provide a reference for immunotherapy and prognosis. For example, patient clinical information corresponding to real clinical tissue proteome data, such as immunotherapy response, tumor metastasis status and prognosis information, can be collected, and correlation analysis can be performed between the proteome deconvolution predictions and the clinical information, so as to reveal correlations between the tumor microenvironment and immunotherapy or tumor metastasis. The method can also be applied to newly collected tissue proteome data in the future to assist diagnosis and improve patients' quality of life.
Drawings
FIG. 1 is a schematic diagram of a system architecture of an embodiment of the present application;
FIG. 2 is a schematic flow chart of a cell type predictive model training method according to an embodiment of the application;
FIG. 3A is a diagram of a training model network architecture according to an embodiment of the present application;
FIG. 3B is a specific example of model training according to the network architecture of FIG. 3A;
FIG. 4A is a schematic diagram of a particular model structure of a feature encoder;
FIG. 4B is a schematic diagram of a particular model structure of a predictor;
FIG. 5 is a schematic flow chart diagram of another model training method according to an embodiment of the application;
FIG. 6 is a schematic flow chart of another model training method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a specific model structure of a discriminator;
FIG. 8 is a schematic flow chart diagram of a method of cell type prediction according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a cell type prediction model according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of a cell type predictive model training apparatus in accordance with an embodiment of the application;
FIG. 11 is a schematic block diagram of a cell type prediction device according to embodiments of the present disclosure;
FIG. 12 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
It should be understood that in embodiments of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean that B is determined from A alone; B may also be determined from A and/or other information.
In the description of the present application, unless otherwise indicated, "at least one" means one or more, and "a plurality" means two or more. In addition, "and/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B may indicate: A alone, both A and B, or B alone, wherein A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or similar expressions means any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be further understood that the description of the first, second, etc. in the embodiments of the present application is for illustration and distinction of descriptive objects, and is not intended to represent any limitation on the number of devices in the embodiments of the present application, nor is it intended to constitute any limitation on the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Embodiments of the present application may relate to the field of artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare and smart customer service. It is believed that, as technology develops, artificial intelligence will be applied in more fields and become increasingly valuable.
Embodiments of the application may also relate to the field of machine learning in artificial intelligence. Machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Deep learning (DL) is a branch of machine learning: a class of algorithms that attempt to abstract data at a high level using multiple processing layers that contain complex structures or consist of multiple nonlinear transformations. Deep learning learns the inherent regularities and representation hierarchy of the training sample data, and the information obtained during learning greatly helps the interpretation of data such as text, images and sound. Its ultimate goal is to give machines the ability to analyze and learn like a person and to recognize text, image and sound data. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed those of earlier techniques.
Embodiments of the present application may also relate to the field of transfer learning in machine learning. Transfer learning is a machine learning method in which a model developed for task A is used as the starting point and reused when developing a model for task B. Using pre-trained models as the starting point for new models is common in deep learning, because developing these pre-trained models has usually already consumed large amounts of time and computing resources, and transfer learning can carry the knowledge they have learned over to related problems. Transfer learning only works if the features learned by the deep model on the first task are general features.
Domain adaptation is an important branch of transfer learning. It aims to map data from a source domain and a target domain that have different distributions into the same feature space and to find a metric criterion under which the two are as close as possible in that space. A classifier trained on the (labeled) source domain can then be used directly to classify target domain data.
Terms relevant to the application are described below.
1. Transcriptome: in a broad sense, the collection of all transcripts in a cell under a given physiological condition, including messenger RNA (mRNA), ribosomal RNA, transfer RNA and non-coding RNA; in a narrow sense, the collection of all mRNAs. Proteins are the main executors of cellular functions, and the proteome is the most direct description of cellular function and state. Transcriptomics is an important means of studying gene expression, and the transcriptome is the necessary link between the genetic information of the genome and the proteome that carries out biological function.
2. Transcriptome sequencing: obtaining the sequence information and expression information of almost all transcripts of a specific cell or tissue in a certain state, including protein-coding mRNA and various non-coding RNAs.
3. Proteome: all the proteins expressed by a genome, or by a cell or tissue. Because one gene can be spliced into multiple mRNA forms during transcription, the proteome is not a direct product of the genome, and the number of proteins in a proteome may exceed the number of genes in the genome.
4. Proteomic sequencing: determination of the primary structure of proteins, or of the abundance of each protein in the proteome of a particular cell or tissue. The primary structure of a protein includes the number of polypeptide chains that make up the protein. The amino acid sequence of a polypeptide chain is the basis of the biological function of a protein.
5. Transcriptome deconvolution algorithm: a process of mining the proportion of various cell types contained in clinical tissue samples using single cell transcriptome sequencing datasets as references. The proportion of cell types can reflect the microenvironment of tumor tissues and provide a reference for immunotherapy.
6. Proteome deconvolution algorithm: the process of mining the proportion of various cell types contained in clinical tissue samples using single cell proteome data as a reference.
7. Deconvolution: an algorithm-based process for reversing the effect of convolution on recorded data. In general, the purpose of deconvolution is to find a solution to one form of convolution equation:
f*g=h
Typically, h is the recorded signal and f is the signal to be recovered, which was convolved with another signal g before being recorded. The function g may represent the transfer function of the instrument or a driving force applied to the physical system. If g is known, deterministic deconvolution can be performed. If g is not known, it must be estimated.
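As a small worked example of the deterministic case, consider finite discrete sequences, for which convolution is the same operation as polynomial multiplication, so that recovering f from h with known g amounts to polynomial division. The sequences below are made-up values, and the example assumes exact, noise-free data.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])   # signal to be recovered (invented values)
g = np.array([1.0, 0.5])        # known signal that f was convolved with
h = np.convolve(f, g)           # recorded signal: h = f * g

# Discrete convolution equals polynomial multiplication of the coefficient
# sequences, so dividing h by g as polynomials undoes the convolution.
f_rec, remainder = np.polydiv(h, g)
```

Here `f_rec` reproduces `f` and the remainder is zero. With noisy recordings the division is no longer exact, which is why practical deconvolution methods rely on estimation rather than direct inversion.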
In the related art, the deconvolution algorithm commonly used in the transcriptome field is as follows:
1) Collecting single cell transcriptome sequencing data and corresponding cell types;
2) Mixing single-cell transcriptome sequencing data according to a random proportion to obtain simulated tissue data;
3) The simulated tissue data is used for training a deep learning network (generally consisting of a plurality of fully connected layers), the network output is the proportion of various cells in the simulated tissue, and the supervision signal is the proportion of the cell types when the simulated tissue data is generated.
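Step 2) of this related-art pipeline can be sketched as follows. The sketch assumes synthetic Poisson-distributed single-cell profiles, Dirichlet-sampled proportions and averaging of the sampled cells; these choices, and the name `simulate_tissue`, are illustrative assumptions rather than details taken from any specific published algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)
n_types, n_cells_per_type, n_genes = 3, 20, 30  # arbitrary sizes

# Hypothetical single-cell data: for each cell type, a (cells, genes) matrix.
cells_by_type = {
    t: rng.poisson(lam=5.0 * (t + 1), size=(n_cells_per_type, n_genes)).astype(float)
    for t in range(n_types)
}

def simulate_tissue(cells_by_type, rng, n_sampled=100):
    """Mix single cells at a random proportion into one simulated tissue sample."""
    types = sorted(cells_by_type)
    # Draw a random proportion vector, then an integer cell count per type.
    props = rng.dirichlet(np.ones(len(types)))
    counts = rng.multinomial(n_sampled, props)
    picked = []
    for t, c in zip(types, counts):
        # Sample c cells of type t with replacement.
        idx = rng.integers(0, len(cells_by_type[t]), size=c)
        picked.append(cells_by_type[t][idx])
    # The simulated tissue profile is the average over all sampled cells.
    mixture = np.concatenate(picked, axis=0).mean(axis=0)
    # The supervision signal is the realized proportion of each cell type.
    return mixture, counts / n_sampled

mixture, label = simulate_tissue(cells_by_type, rng)
```

Each call yields one training pair for step 3): `mixture` is the network input and `label` is the target proportion vector. Covering many proportions requires many such calls per tissue, which is the cost discussed in the limitations below.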
If the existing transcriptome deconvolution algorithm is applied directly to proteomes, there are the following limitations:
1) A deep-learning-based deconvolution algorithm must first generate simulated data at random proportions to train a deep learning network model, and then use the trained model to predict on real tissue data. For each tissue, however, large-scale simulated data must be generated from the corresponding single-cell data set before every training run so as to cover the cell types and cell numbers of the target tissue as fully as possible, which is time-consuming and inefficient.
2) Transcriptome sequencing technology and data processing are mature, so a deep learning network model built on simulated single-cell transcriptome data transfers fairly accurately to transcriptome data of real tissues. Single-cell proteome sequencing technology, by contrast, is not yet mature: compared with mature tissue proteome sequencing, its data quality is poorer, fewer protein species can be detected at once, and noise and batch effects are more severe. As a result, the distribution of simulated tissue data built from single-cell proteomes differs considerably from that of real tissue data, and an algorithm that deconvolves simulated data well may transfer very poorly to real data.
In order to realize a deconvolution algorithm for tissue proteome data based on single-cell proteome data, the embodiment of the application uses a feature encoder to produce feature representations of both the reference single-cell proteome data and the real tissue proteome data. The single-cell features corresponding to single-cell proteomes of different cell types are then weighted and summed to obtain weighted features, and the parameters of the predictor and the feature encoder are updated according to the weighted features and the tissue features corresponding to the real tissue proteome, yielding a cell type prediction model that comprises the trained feature encoder and the trained predictor. With this model, the proportions of the cell types contained in a real tissue can be predicted from real tissue proteome data. The embodiment of the application can therefore analyze real tissue proteome data with reference to single-cell proteome data, realize a proteome deconvolution algorithm, reveal the tumor microenvironment at the protein level, and open up the possibility of new clinical findings.
In the embodiment of the application, the single-cell features of different cell types are weighted and summed, which produces the simulated data in the manner of data augmentation. This achieves the effect of combining cells of different types, improves data utilization efficiency, and avoids the separate step of generating a large-scale simulated data set. Meanwhile, updating the model with both the weighted features obtained through data augmentation and the tissue features corresponding to the real tissue proteome yields a feature encoder that generalizes, improving the generalization capability of the model on real data.
Further, during model training, the embodiment of the application can jointly optimize a first loss for the predictor and a second loss for the feature encoder. The first loss is determined from the cell proportions of the multiple cell types that the predictor outputs for the weighted feature and the sampling proportions of the single-cell features that make up the weighted feature; the second loss is determined from the tissue features and the weighted features. Updating the model with both losses lets the predictor predict cell type proportions accurately while the feature encoder outputs single-cell features and tissue features that are as close as possible in feature space, that is, the feature encoder extracts similar feature distributions from single-cell proteome data and tissue proteome data. A deep learning network model built on simulated single-cell proteome data can thus be transferred more accurately to proteome data of real tissues, further improving the generalization capability of the model on real data.
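A joint objective of this kind might be computed as in the sketch below. The concrete loss forms are assumptions made for illustration, since this passage does not fix them: the first loss is taken as the mean absolute error between predicted and sampled proportions, the second loss as the squared distance between the batch means of tissue features and weighted features, and `lam` is a hypothetical weighting coefficient.

```python
import numpy as np

def first_loss(pred_props, sampled_props):
    """Predictor loss (assumed form): mean absolute error between proportions."""
    return float(np.mean(np.abs(pred_props - sampled_props)))

def second_loss(tissue_feats, weighted_feats):
    """Encoder loss (assumed form): squared distance between batch feature means."""
    gap = tissue_feats.mean(axis=0) - weighted_feats.mean(axis=0)
    return float(np.sum(gap ** 2))

def joint_loss(pred_props, sampled_props, tissue_feats, weighted_feats, lam=1.0):
    """Weighted sum of both losses; lam is a hypothetical trade-off coefficient."""
    return first_loss(pred_props, sampled_props) + lam * second_loss(
        tissue_feats, weighted_feats
    )

# Toy values, invented for illustration.
pred_props = np.array([[0.5, 0.3, 0.2]])       # predictor output
sampled_props = np.array([[0.6, 0.3, 0.1]])    # sampling proportions used for mixing
tissue_feats = np.array([[1.0, 2.0], [3.0, 4.0]])
weighted_feats = np.array([[2.0, 2.0], [2.0, 2.0]])

total = joint_loss(pred_props, sampled_props, tissue_feats, weighted_feats, lam=0.5)
```

Minimizing the first term pushes the predictor toward the known mixing proportions, while minimizing the second term pulls the two feature distributions together. A distribution-level distance or an adversarial criterion could take the place of the mean-matching term used here.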
In the field of intelligent medical treatment, the embodiment of the application can make full use of already reported proteome data of valuable clinical sample tissues, analyze cell types at the protein level, explore the tumor microenvironment, advance the understanding of disease mechanisms, and provide important references for immunotherapy and prognosis. For example, patient clinical information corresponding to real clinical tissue proteome data, such as immunotherapy effect, tumor metastasis status, and prognosis information, can be collected, and correlation analysis can be carried out between the proteome deconvolution prediction results and the clinical information, so as to reveal the correlation between the tumor microenvironment and immunotherapy and tumor metastasis, thereby realizing clinical application value. The method can also be applied to newly collected tissue proteome data in the future, so as to assist diagnosis and improve the quality of life of patients.
A system architecture suitable for use with the present application is described below in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture may include a user device 101, a data acquisition device 102, a training device 103, an execution device 104, a database 105, and a content library 106.
The data acquisition device 102 is configured to read training data from the content library 106, and store the read training data in the database 105. The training data related to the embodiment of the application comprises single-cell proteome data and labels thereof, and proteome data of real tissues (the real labels are not needed), wherein the labels of the single-cell proteome data are cell types of the single-cell proteome data.
Training device 103 trains the deep learning model based on training data maintained in database 105. The model obtained by the training device 103 can analyze the real tissue proteome data to obtain the proportion of the cell types contained in the real tissue. The model obtained by training device 103 may be applied to different systems or devices.
In addition, referring to fig. 1, the execution device 104 is configured with an I/O interface 107 for data interaction with external devices. For example, the execution device 104 receives data to be predicted, such as real tissue proteome data, sent by the user device 101 through the I/O interface 107. The calculation module 109 in the execution device 104 processes the input data using the trained model, outputs the prediction result for the data, and sends the result to the user device 101 through the I/O interface 107.
The user device 101 may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a mobile internet device (mobile internet device, MID), or other terminal devices.
The execution device 104 may be a server. By way of example, the server may be a rack server, a blade server, a tower server, or a cabinet server, among other computing devices. The server may be an independent server or a server cluster formed by a plurality of servers.
In this embodiment, the execution device 104 is connected to the user device 101 through a network. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, a telephony network, etc.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings does not constitute any limitation. In some embodiments, the data acquisition device 102 may be the same device as the user device 101, the training device 103, and the execution device 104. The database 105 may be distributed over one server or over a plurality of servers, and the content library 106 may be distributed over one server or over a plurality of servers.
The following describes the technical scheme of the embodiments of the present application in detail through some embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
The model training process according to the embodiment of the present application will be described with reference to fig. 2.
FIG. 2 is a schematic flow diagram of a cell type prediction model training method 200 according to an embodiment of the application. The method 200 may be performed by any electronic device having data processing capability, for example a server, or for example the training device 103 of FIG. 1, which is not limited by the present application.
In some embodiments, a network architecture for training a model may be included (e.g., deployed) in an electronic device for performing the cell type prediction model training method 200. Fig. 3A shows a schematic diagram of a network architecture for model training that may be used to perform the method 200. As shown in fig. 3A, the network architecture for model training may include a feature encoder 310, a weighted summation module 320, and a predictor 330. Fig. 3B shows a specific example of model training according to the network architecture in fig. 3A. In fig. 3A and 3B, single cell proteome data and tissue proteome data may be input to the feature encoder 310, which extracts single cell features of the single cell proteome data and tissue features of the tissue proteome data, respectively. The weighted summation module 320 performs weighted summation on the single cell features corresponding to the single cell proteome data of different cell types to obtain weighted features, and the weighted features are input to the predictor 330 to obtain a predicted value of the cell type proportion.
The first loss of the predictor 330 and the second loss of the feature encoder 310 may be jointly optimized during model training. The first loss can be obtained from the prediction result output by the predictor and the sampling proportions that the weighted summation module applies to the single cell features corresponding to the single cell proteome data of different cell types, and the second loss can be obtained from the weighted features and the tissue features.
The various steps of method 200 will be described below in connection with the network architecture of fig. 3A.
As shown in fig. 2, method 200 includes steps 210 through 260.
210, obtaining first proteome data of a clinical sample tissue, and second proteome data of single cells of a plurality of cell types.
The first proteome data is true tissue proteome data, which may also be referred to as tissue proteome data. The second proteome data may also be referred to as single cell proteome data. Wherein the second proteome data may have a tag, which is a cell type to which the single cell proteome data corresponds. It should be understood that the plurality of cell types corresponding to the second proteome data in the embodiments of the present application may cover the cell types corresponding to the clinical sample tissue.
Referring to fig. 3A and 3B, the second proteome data includes, for example, single cell proteome data #1, single cell proteome data #2, and single cell proteome data #3, and the first proteome data includes, for example, tissue proteome data. The cell types corresponding to the single cell proteome data #1, #2 and #3 cover the cell types contained in the tissue proteome data. It should be understood that FIG. 3B uses single cell proteome data #1, #2 and #3 only as an example, and embodiments of the present application are not limited thereto.
Alternatively, the first proteome data may have a tag, which is the proportion of cell types in the corresponding real tissue. It should be noted that, since tags of real tissue are difficult to obtain, and in order to improve the generalization performance of the model, the tag of the first proteome data may not be used in the model training process of the embodiment of the present application. Alternatively, the tag of the first proteome data may be used to measure the test performance of the model.
It should be noted that, the single-cell proteome data may be obtained by a single-cell proteome sequencing technique, and the tissue proteome data may be obtained by a tissue proteome sequencing technique. Due to the differences between single cell proteome sequencing technology and tissue proteome sequencing technology, there is a large difference between the distribution of single cell proteome data and real tissue proteome data.
In some embodiments, the collection of first proteome data of clinical sample tissue may be referred to as a sequencing dataset, and the collection of second proteome data of single cells of multiple cell types may be referred to as a reference dataset. Alternatively, the second proteome data of single cells in the reference dataset may be defined as the source domain, and the first proteome data of real clinical sample tissue in the sequencing dataset, which requires deconvolution, may be defined as the target domain. As an implementation, a domain adaptation technique may be used to reduce the large difference between the distributions of the first proteome data and the second proteome data, thereby improving the generalization ability of the model.
220, inputting the first proteome data into a feature encoder to obtain a first tissue feature.
Specifically, the feature encoder may perform feature extraction on the first proteome data to obtain a first tissue feature. For example, referring to fig. 3B, each tissue proteome data in the sequencing dataset may be input to the feature encoder 310, such that a corresponding tissue feature, i.e., a first tissue feature, may be derived for each proteome data.
Illustratively, a feature encoder (feature encoder) may be a neural network model that extracts features of the input high-dimensional data. The neural network model may be, for example, a convolutional neural network (convolutional neural network, CNN), i.e. a computational network consisting of a plurality of convolutional operations. The feature encoder may be obtained by pre-training.
As a specific example, the feature encoder may include a fully connected layer, such as a fully connected layer including two or more layers. Fig. 4A shows a schematic diagram of a specific model structure of a feature encoder that includes N (N is a positive integer) layers of fully-connected layers, each of which may include a linear layer, a normalization layer, and an activation layer. The input of the characteristic encoder is proteome data, and the output is characteristic encoding of the proteome data. Specifically, in this step 220, the feature encoder inputs tissue proteome data and outputs feature codes with tissue specificity, i.e., tissue features.
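As a minimal illustrative sketch of the fully connected structure described above, the following Python code stacks two blocks, each made of a linear layer, a normalization layer, and an activation layer. The choice of layer normalization and ReLU, the random initialization, the layer sizes, and all function names are assumptions made for illustration; the application does not specify them.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one linear layer (hypothetical init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def fc_block(x, weights, bias, eps=1e-5):
    """One block: linear layer -> layer normalization -> ReLU activation."""
    z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    z = [(v - mean) / math.sqrt(var + eps) for v in z]
    return [max(0.0, v) for v in z]

def encode(x, layers):
    """Apply the N stacked blocks to one proteome vector."""
    for weights, bias in layers:
        x = fc_block(x, weights, bias)
    return x

rng = random.Random(0)
layers = [make_layer(6, 4, rng), make_layer(4, 3, rng)]  # N = 2 blocks
protein_abundance = [0.2, 1.5, 0.0, 3.1, 0.7, 2.2]       # made-up input vector
feature = encode(protein_abundance, layers)              # 3-dimensional feature code
```

The same `encode` function would be applied to both tissue and single cell proteome vectors, which is what makes the encoder shared.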
In some embodiments, the feature encoder may also be replaced by other encoding modules, such as a Transformer, without limitation.
230, inputting second proteome data corresponding to the plurality of cell types into the feature encoder to obtain first single cell features corresponding to the plurality of cell types.
Specifically, the feature encoder may perform feature extraction on the second proteome data to obtain a first single cell feature corresponding to the second proteome data. The feature encoder is the same as the feature encoder in step 220, and may therefore also be referred to as a shared feature encoder. When the feature encoder has the structure shown in fig. 4A, in this step 230, the feature encoder takes single cell proteome data as input and outputs feature codes having cell specificity, i.e., single cell features.
For example, with continued reference to FIG. 3B, single cell proteome data #1, #2 and #3 for a plurality of cell types may be input to the feature encoder 310 to obtain single cell features #1, #2 and #3 for each cell type, respectively.
It should be noted that the order of steps 220 and 230 is not limited in the present application. For example, step 220 may be performed before step 230, or after, or both steps may be performed simultaneously.
240, performing weighted summation on the first single cell features corresponding to the plurality of cell types to obtain a first weighted feature.
For example, referring to fig. 3B, the single cell features #1, #2, and #3 corresponding to the multiple cell types may be input to the weighted summation module 320, where the weighted summation module 320 samples the single cell features of at least two cell types according to the sampling ratio, and sums the sampled single cell features to obtain a weighted feature, that is, a first weighted feature. The weighted features may be characteristic of the simulated tissue data.
Alternatively, the sampling ratio of the first single cell characteristic for the plurality of cell types may be preset, such as determined based on a priori knowledge of the cell ratio of the tissue. Alternatively, the sampling ratio of the first single cell characteristic corresponding to the plurality of cell types may be randomly determined.
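As one hedged illustration of randomly determined sampling ratios, the sketch below draws a proportion vector that is positive and sums to 1 by normalizing independent gamma draws, which is equivalent to sampling from a flat Dirichlet distribution. The use of a Dirichlet distribution and the function name `sample_proportions` are assumptions; the application only states that the ratios may be preset or random.

```python
import random

def sample_proportions(n_types, rng):
    """Draw random proportions that sum to 1.

    Normalizing independent Gamma(1, 1) draws yields a sample from a
    flat Dirichlet distribution over n_types cell types.
    """
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_types)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
proportions = sample_proportions(3, rng)  # one ratio per cell type
```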
In some embodiments, the first weighted feature may be obtained by linearly weighted summation of the first single cell features corresponding to a plurality of cell types using a class-mixing enhancement algorithm. Illustratively, a class-mixing enhancement algorithm, such as mixup, is a data augmentation strategy that blends features between different classes to augment a training dataset. For example, mixup constructs new training samples and corresponding labels by taking convex combinations of the model inputs and of their labels, thereby improving the generalization capability of the model.
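The linear weighted summation can be sketched as a convex combination of per-cell-type feature vectors, with the sampling ratios acting both as the mixing weights and as the label of the resulting simulated sample. The feature values and ratios below are made up, and the function name `weighted_sum` is hypothetical.

```python
def weighted_sum(features, proportions):
    """Convex combination of per-cell-type feature vectors."""
    if len(features) != len(proportions):
        raise ValueError("one proportion is required per cell type")
    dim = len(features[0])
    return [sum(p * f[i] for p, f in zip(proportions, features))
            for i in range(dim)]

# Made-up single cell features #1, #2 and #3 (two dimensions each).
single_cell_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sampling_proportions = [0.5, 0.25, 0.25]  # also the training label

weighted_feature = weighted_sum(single_cell_features, sampling_proportions)
```

Because the sampling ratios are known at mixing time, each simulated sample comes with an exact label at no extra cost.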
In some embodiments, the first single cell characteristic comprises at least one of a single cell data characterization characteristic, a single cell primary protein profile characteristic, and a cell type characteristic. For example, the first weighting feature may be obtained by performing a class-mixing enhancement operation on the single-cell data representation layer of different cell types encoded by the feature encoder, or performing a class-mixing enhancement operation on the single-cell original protein spectrum layer of different cell types, or performing a class-mixing enhancement operation on the single-cell type layer of different cell types.
Therefore, the embodiment of the application generates the simulation data set in a data enhancement mode by carrying out weighted summation on single cell characteristics of different cell types, reasonably and skillfully realizes the effect of combining different types of cells, effectively improves the data utilization efficiency and avoids the process of generating a large-scale simulation data set.
250, inputting the first weighted feature into a predictor to obtain the cell proportions of the plurality of cell types.
Specifically, the predictor may receive the first weighted feature as input and make a prediction. The output prediction result is the cell proportion of the plurality of cell types corresponding to the first weighted feature, that is, the predicted proportion of the cells contained in the simulated tissue data. Illustratively, the cell proportion may refer to the proportion of the number of cells of each type, and may also be referred to as the cell type proportion, without limitation. Referring to fig. 3B, the weighted feature is input to the predictor 330, which outputs a predicted value of the cell type proportion.
Illustratively, the predictor may be a neural network model. As a specific example, the predictor may include a fully connected layer, such as a fully connected layer including two or more layers. Fig. 4B shows a schematic diagram of a specific model structure of a predictor that includes N (N is a positive integer) layer fully connected layers, each of which may include a linear layer, a normalization layer, and an activation layer. The predictor input is the weighted/tissue features and the output is the cell type proportion prediction value. Specifically, in step 250, the predictor inputs the weighted features and outputs the cell type proportion predicted value of the simulated tissue data corresponding to the weighted features.
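A minimal sketch of a predictor output stage follows. It assumes a single linear layer followed by a softmax, so that the predicted proportions are positive and sum to 1; the softmax output, the weights, and the function names are assumptions for illustration and are not stated in the application.

```python
import math

def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proportions(feature, weights, bias):
    """Linear layer followed by softmax (assumed output normalization)."""
    logits = [sum(w * v for w, v in zip(row, feature)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical parameters mapping a 2-dimensional feature to 3 cell types.
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
bias = [0.0, 0.0, 0.0]
predicted = predict_proportions([0.75, 0.5], weights, bias)
```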
In some embodiments, the predictor may also be replaced by other neural network models, such as multi-layer perceptrons (Multilayer Perceptron, MLP), etc., without limitation.
260, updating parameters of the predictor and the feature encoder according to the first tissue feature, the first weighted feature and the cell proportion to obtain a trained cell type prediction model, wherein the cell type prediction model comprises the feature encoder and the predictor.
According to the embodiment of the application, the model is updated using the weighted features obtained by data augmentation together with the tissue features corresponding to the real tissue proteome data, so that a feature encoder with generalization capability can be built and the generalization capability of the model on real data is improved. That is, the embodiment of the application establishes the cell type prediction model based on both the single-cell proteome simulation data and the real tissue proteome data, so that the cell type prediction model can accurately predict the cell type proportions in a real tissue from the proteome data of that tissue.
Therefore, the embodiment of the application can analyze real tissue proteome data based on single cell proteome data, realize a proteome deconvolution algorithm, reveal the tumor microenvironment at the protein level, and open up the possibility of new clinical discoveries.
In some embodiments, the parameters of the predictor and feature encoder may be updated during model training by jointly optimizing the first loss of the predictor and the second loss of the feature encoder. Referring to fig. 5, updating parameters of the predictor and feature encoder may be implemented through steps 261 to 263.
261, determining a first loss based on the cell fraction and the sampling fraction of the first single cell characteristic corresponding to the plurality of cell types.
The sampling ratio, that is, the sampling ratio of the single cell features of different cell types in step 240, is the true value corresponding to the predicted value of the cell proportion output by the predictor. With continued reference to FIG. 3B, the first loss may be determined based on the proportion prediction value and the sampling ratio that the weighted summation module 320 applies to the single cell features #1, #2, and #3.
Illustratively, the first loss may be, without limitation, a mean absolute error (Mean Absolute Error, MAE), a mean square error (Mean Square Error, MSE), or the like. The MAE is the mean of the absolute differences between the model predicted values and the true values, and may also be referred to as L1 loss. The MSE is the mean of the squared differences between the model predicted values and the true values, and may also be referred to as L2 loss.
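The two candidate forms of the first loss can be written out directly. In the sketch below the predicted and sampled proportions are made-up numbers, and the function names are hypothetical.

```python
def mean_absolute_error(predicted, target):
    """L1 loss: mean of absolute differences."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)

def mean_squared_error(predicted, target):
    """L2 loss: mean of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

predicted_proportions = [0.5, 0.3, 0.2]  # predictor output (made up)
sampled_proportions = [0.6, 0.2, 0.2]    # sampling ratios used as ground truth

l1 = mean_absolute_error(predicted_proportions, sampled_proportions)
l2 = mean_squared_error(predicted_proportions, sampled_proportions)
```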
262, determining a second loss according to the first tissue feature and the first weighted feature.
Specifically, the first weighted feature serves as a feature of the simulated tissue data. Because the data distributions of the single cell proteome data and the tissue proteome data differ, the feature distribution of the first weighted feature differs from the feature distribution of the first tissue feature of the real clinical sample tissue. The second loss may be determined from the difference between the first tissue feature and the first weighted feature.
263, according to the first loss and the second loss, parameter updating is carried out on the predictor and the feature encoder, and a cell type prediction model is obtained.
For example, the parameters of the predictor and the feature encoder may be updated by the backpropagation algorithm according to the first loss and the second loss until the loss no longer decreases; once the prediction of the predictor reaches a certain accuracy, the model is considered to have converged and training is stopped. The embodiment of the application optimizes the model to convergence based on single-cell proteome data with its labels and on real tissue proteome data. The process is fully automatic and requires no human intervention, thereby realizing an automatic training process for the proteome deconvolution model.
In particular, the feature encoder is capable of outputting single cell features and tissue features that are spatially as close as possible while the predictor accurately predicts the proportion of cell types in the tissue sample by combining the first loss and the second loss to update parameters of the predictor and the feature encoder. That is, the feature encoder can extract similar feature distribution from the single-cell proteome data and the tissue proteome data, so that the deep learning network model established based on the single-cell proteome simulation data can be more accurately migrated to the proteome data of the real tissue, and the generalization capability of the model on the real data is further improved.
In some embodiments, a domain adaptation method may be used, in which the second proteome data of single cells is used as the source domain and the first proteome data of real clinical sample tissue is used as the target domain. Data in the source domain and the target domain, which have different distributions, are mapped to the same feature space, and similar feature distributions are extracted for the two kinds of data, so that the "distance" between them in the feature space is as small as possible and a predictor trained on the source domain can be used directly for predicting the target domain data. Specifically, referring to fig. 6, model training based on domain adaptation may be implemented through steps 2621 to 2623.
2621, inputting the first tissue feature and the first weighted feature into a discriminator to obtain a discrimination domain of the first tissue feature and of the first weighted feature, the discrimination domain being a source domain or a target domain.
In particular, the discriminator may also be referred to as a domain classifier, and is used to determine the domain to which an input feature belongs, for example, whether the first tissue feature and the first weighted feature come from the source domain or the target domain. For example, an output of 1 from the discriminator indicates that the discrimination domain is the source domain, and an output of 0 indicates that the discrimination domain is the target domain. As another example, an output of 0 from the discriminator indicates that the discrimination domain is the source domain, and an output of 1 indicates that the discrimination domain is the target domain.
With continued reference to fig. 3B, after tissue proteome data is input to the feature encoder 310 to obtain tissue features, the tissue features may be input to the discriminator 340 to obtain a discrimination domain of the tissue features, and the weighted features output by the weighted summation module 320 are input to the discriminator 340 to obtain a discrimination domain of the weighted features. The discrimination domain of the tissue features may be the target domain, and the discrimination domain of the weighted features may be the source domain.
Illustratively, the discriminator may be a neural network model. As a specific example, the discriminator may comprise fully connected layers, such as two or more fully connected layers. Fig. 7 shows a schematic diagram of a specific model structure of a discriminator that includes N (N is a positive integer) fully connected layers, each of which may include a linear layer, a normalization layer, and an activation layer. The input of the discriminator is a weighted feature or a tissue feature, and the output is a discrimination domain.
2622, determining a second loss based on the discrimination domains and the real domains of the first tissue feature and the first weighted feature, wherein the real domain of the first tissue feature is the target domain and the real domain of the first weighted feature is the source domain.
Illustratively, the second loss may be the binary cross-entropy between the discrimination domain and the real domain.
Illustratively, the real domain of the first tissue feature is the target domain, which may be denoted as 0, and the real domain of the first weighted feature is the source domain, which may be denoted as 1.
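Under the labelling convention just described (source domain as 1, target domain as 0), the binary cross-entropy can be sketched as follows. The sketch assumes the discriminator outputs a probability that its input comes from the source domain; the output values and the function name are made up for illustration.

```python
import math

SOURCE, TARGET = 1.0, 0.0  # labelling convention from the text above

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted P(source) and domain labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

# Assumed discriminator outputs P(source) for a weighted feature and a tissue feature.
discriminator_outputs = [0.9, 0.2]
real_domains = [SOURCE, TARGET]

second_loss = binary_cross_entropy(discriminator_outputs, real_domains)
```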
2623, updating parameters of the feature encoder, the predictor and the discriminator according to the first loss and the second loss to obtain a cell type prediction model.
Specifically, in the training process according to the first loss and the second loss, the training target of the discriminator is to classify the input information into the correct domain type (source domain or target domain) as far as possible, while the training target of the feature encoder is the opposite: to extract feature distributions that are as similar as possible in the source domain and the target domain, so that the discriminator cannot correctly judge which domain the information comes from. Thus, the feature encoder and the discriminator form an adversarial relationship. It can be seen that when the discriminator cannot correctly classify the received information as a source domain sample or a target domain sample, the feature encoder is extracting similar feature distributions for the source domain and the target domain, so that the "distance" between the two in the same feature space is as small as possible.
In some embodiments, the updated source domain features and target domain features may be extracted and input into the discriminator again, so that the discriminator cannot determine the data source, thereby achieving the goal of adversarial training. Specifically, with continued reference to fig. 6, adversarial training may be achieved through the following steps 2624 to 2629.
2624, inputting the first proteome data into the updated feature encoder to obtain a second tissue feature.
Wherein the first proteome data is real tissue proteome data, such as data in a sequencing dataset. An updated feature encoder, such as a feature encoder that has been parameter updated according to step 2623. In particular, step 2624 is similar to step 220 above, and reference may be made to the description above.
2625, inputting the second proteome data corresponding to the plurality of cell types into the updated feature encoder to obtain second single cell features corresponding to the plurality of cell types.
Wherein the second proteome data is single cell proteome data, such as data in a reference dataset. In particular, step 2625 is similar to step 230 above and reference may be made to the description above.
2626, performing weighted summation on the second single cell characteristics corresponding to the plurality of cell types to obtain a second weighted characteristic.
In particular, step 2626 is similar to step 240 above and reference may be made to the description above.
2627, inputting the second tissue feature and the second weighted feature into the updated discriminator to obtain a discrimination domain of the second tissue feature and of the second weighted feature.
In particular, step 2627 is similar to step 2621 above and reference may be made to the description above. Unlike step 2621, if the source domain is represented as 1 and the target domain is represented as 0 in the discriminating domain in step 2621, the source domain is represented as 0 and the target domain is represented as 1 in step 2627 in order to confuse the discriminator; if the source domain is represented as 0 and the target domain is represented as 1 in step 2621, the source domain is represented as 1 and the target domain is represented as 0 in step 2627.
2628, determining a third loss based on the discrimination domains and the real domains of the second tissue feature and the second weighted feature, wherein the real domain of the second tissue feature is the target domain and the real domain of the second weighted feature is the source domain.
Here, the real domain is the same as that in step 2622.
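One possible reading of steps 2627 and 2628 is that the third loss is a binary cross-entropy computed against flipped domain labels, so that minimizing it pushes the feature encoder toward features the discriminator misclassifies. The sketch below illustrates only that reading; the flipping helper, the output values, and the function names are assumptions.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted P(source) and domain labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

def flip_labels(labels):
    """Swap source (1) and target (0) labels to confuse the discriminator."""
    return [1.0 - y for y in labels]

real_domains = [1.0, 0.0]            # weighted feature: source, tissue feature: target
discriminator_outputs = [0.9, 0.2]   # assumed P(source) from the updated discriminator

third_loss = binary_cross_entropy(discriminator_outputs, flip_labels(real_domains))
ordinary_loss = binary_cross_entropy(discriminator_outputs, real_domains)
```

When the discriminator is confident and correct, the flipped-label loss is large, which gives the feature encoder a strong signal to make the two domains harder to tell apart.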
2629, updating parameters of the feature encoder and the discriminator according to the third loss.
Therefore, the embodiment of the application can confuse the discriminator so that it cannot judge the data source, that is, it cannot correctly classify the received information as a source domain sample or a target domain sample, and the feature encoder thereby extracts similar feature distributions for the data of the source domain and the target domain. On this basis, the deep learning network model established from the single-cell proteome simulation data can be transferred more accurately to the proteome data of real tissue, and the generalization capability of the model on real data is further improved.
In some embodiments, step 262 may be specifically implemented as follows: a similarity measurement is performed on the first tissue feature and the first weighted feature, for example according to the Seurat integration algorithm, and the similarity between the first weighted feature and the first tissue feature is comprehensively evaluated to determine the second loss. The parameters of the predictor and the feature encoder are then updated according to the first loss and the second loss to obtain a cell type prediction model.
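The following sketch is a deliberately simple stand-in for a similarity-based second loss. It is not the Seurat integration algorithm; it only measures the squared distance between the mean tissue feature and the mean weighted feature of a batch, which is one elementary way to quantify how far apart two feature distributions are. All names and values are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(row[i] for row in rows) / n for i in range(len(rows[0]))]

def mean_discrepancy(tissue_features, weighted_features):
    """Squared Euclidean distance between the two batch means."""
    mu_tissue = mean_vector(tissue_features)
    mu_weighted = mean_vector(weighted_features)
    return sum((a - b) ** 2 for a, b in zip(mu_tissue, mu_weighted))

tissue_batch = [[1.0, 2.0], [3.0, 4.0]]     # made-up tissue features
weighted_batch = [[0.0, 1.0], [2.0, 1.0]]   # made-up weighted features

second_loss = mean_discrepancy(tissue_batch, weighted_batch)
```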
After model training is complete, a model test phase may be entered. The trained cell type prediction model can be used to deconvolve the real tissue proteome data in the sequencing dataset, or the proteome data of a real clinical tissue sample, to determine the predicted cell type proportions of the real tissue. The process of model prediction according to an embodiment of the present application will be described with reference to fig. 8 and 9.
Fig. 8 shows a schematic flow chart of a cell type prediction method 800 according to an embodiment of the present application. As shown in fig. 8, method 800 includes steps 810 through 830.
810, proteome data of clinical sample tissue is obtained.
For example, tissue proteome sequencing techniques may be employed to obtain proteome data of clinical sample tissue. The proteomic data can be reported clinical proteomic samples or newly collected tissue proteomic samples, and is not limited. As a specific example, the proteome data of the clinical sample tissue may be proteome data in a sequencing dataset.
820, inputting proteome data into a feature encoder in a cell type prediction model to obtain tissue features. The cell type prediction model is obtained by training according to the cell type prediction model training method provided by the embodiment of the application. Specifically, the process of extracting the tissue features by the feature encoder is referred to above, and will not be described herein.
830, inputting the tissue characteristics into a predictor in a cell type prediction model to obtain a cell type proportion predicted value of the clinical sample tissue. Specifically, the process of obtaining the cell type proportion predicted value by the predictor is described hereinabove, and will not be repeated here.
Referring to fig. 9, tissue proteome data of the clinical sample tissue can be input into the cell type prediction model. The feature encoder first performs feature extraction on the tissue proteome data to obtain tissue features. The predictor then takes the tissue features as input and outputs predicted proportions of the cell types contained in the clinical sample tissue. The cell type prediction model is trained according to the model training method described above.
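By way of example and not limitation, the flow of steps 810 to 830 can be sketched as follows. The linear encoder with ReLU, the softmax predictor, and all parameter values are assumptions made for illustration; the application does not specify these architectures.

```python
import math

def feature_encoder(proteome, weights):
    """Hypothetical linear encoder with ReLU: one output value per weight row."""
    return [max(0.0, sum(w * x for w, x in zip(row, proteome))) for row in weights]

def predictor(tissue_feature, weights):
    """Hypothetical linear layer followed by softmax, so the outputs sum to 1."""
    logits = [sum(w * f for w, f in zip(row, tissue_feature)) for row in weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters (assumed values, not trained)
encoder_weights = [[0.5, -0.2, 0.1, 0.0], [0.0, 0.3, 0.3, -0.1]]
predictor_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

proteome = [1.2, 0.4, 0.0, 2.5]                               # step 810: one tissue sample
tissue_feature = feature_encoder(proteome, encoder_weights)   # step 820
proportions = predictor(tissue_feature, predictor_weights)    # step 830
```

The softmax output is chosen here so that the predicted values can be read directly as proportions of three hypothetical cell types.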
In some embodiments, when the proteome data of the clinical sample tissue is proteome data in the sequencing dataset, the predicted cell type proportions of the real tissue can be compared with the labels of the proteome data to evaluate the deconvolution performance of the cell type prediction model on real data.
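By way of example and not limitation, the comparison between predicted proportions and labels can be sketched with two common error measures. The choice of mean absolute error and root mean squared error is an assumption, and the proportion values are invented for illustration.

```python
import math

def mean_absolute_error(predicted, label):
    """Average absolute difference between predicted and labeled proportions."""
    return sum(abs(p - t) for p, t in zip(predicted, label)) / len(label)

def root_mean_squared_error(predicted, label):
    """Square root of the average squared difference."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, label)) / len(label))

# Invented toy values for one tissue sample with three cell types
predicted = [0.50, 0.30, 0.20]
label = [0.40, 0.35, 0.25]
mae = mean_absolute_error(predicted, label)
rmse = root_mean_squared_error(predicted, label)
```

Lower values of either measure indicate that the deconvolution result is closer to the labeled composition.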
Therefore, the embodiments of the application can analyze real tissue proteome data on the basis of single-cell proteome data, predict the proportions of the cell types contained in real tissue, implement a proteome deconvolution algorithm, reveal the tumor microenvironment at the protein level, and open the possibility of new clinical findings.
In the embodiments of the application, patient clinical information corresponding to the real clinical tissue proteome data, such as immunotherapy response, tumor metastasis status and prognosis information, can also be collected. Correlation analysis can then be performed between the proteome deconvolution predictions and the clinical information to reveal associations between the tumor microenvironment and immunotherapy or tumor metastasis, which gives the method clinical application value.
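By way of example and not limitation, the correlation analysis can be sketched with a Pearson correlation coefficient between a predicted cell type proportion and a numeric clinical variable. The choice of Pearson correlation is an assumption, and the cohort values below are invented purely to show the calculation.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Invented toy cohort: predicted proportion of one cell type and a clinical score per patient
cell_type_proportion = [0.10, 0.20, 0.30, 0.40]
clinical_score = [1.0, 2.0, 3.0, 4.0]
r = pearson_correlation(cell_type_proportion, clinical_score)
```

A coefficient near 1 or -1 would suggest a strong linear association between the predicted proportion and the clinical variable, while a value near 0 would suggest none.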
The specific embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described further. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be regarded as the disclosure of the present application.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. It is to be understood that the numbers may be interchanged where appropriate such that the described embodiments of the application may be practiced otherwise than as shown or described.
The method embodiments of the present application are described above in detail in conjunction with fig. 1 to 9, and the apparatus embodiments of the present application are described below in detail in conjunction with fig. 10 to 12.
FIG. 10 is a schematic block diagram of a cell type predictive model training apparatus 900 according to an embodiment of the application. As shown in fig. 10, the apparatus 900 may include an acquisition unit 910, a feature encoder 920, a summing unit 930, a predictor 940, and a training unit 950.
An acquisition unit 910, configured to acquire first proteome data of a clinical sample tissue and second proteome data of single cells of a plurality of cell types;
a feature encoder 920, configured to take the first proteome data as input to obtain a first tissue feature;
the feature encoder 920 being further configured to take the second proteome data corresponding to each of the plurality of cell types as input to obtain first single-cell features corresponding to each of the plurality of cell types;
a summing unit 930, configured to perform weighted summation on the first single-cell features corresponding to the plurality of cell types to obtain a first weighted feature;
a predictor 940, configured to take the first weighted feature as input to obtain a cell proportion of the plurality of cell types;
and a training unit 950, configured to update parameters of the predictor and the feature encoder according to the first tissue feature, the first weighted feature, and the cell proportion, to obtain a cell type prediction model, where the cell type prediction model includes the feature encoder and the predictor.
In some embodiments, the training unit 950 is specifically configured to:
determining a first loss according to the cell proportion and the sampling proportion of the first single cell characteristic corresponding to the plurality of cell types;
determining a second loss based on the first tissue characteristic and the first weighted characteristic;
and updating parameters of the predictor and the feature encoder according to the first loss and the second loss to obtain the cell type prediction model.
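By way of example and not limitation, the combination of the first loss and the second loss can be sketched as follows. The mean squared error form of the first loss and the trade-off factor `second_weight` are assumptions; the application only states that parameters are updated according to both losses.

```python
def proportion_loss(predicted, sampled):
    """First loss (assumed form): mean squared error between predicted and sampled proportions."""
    return sum((p - s) ** 2 for p, s in zip(predicted, sampled)) / len(sampled)

def total_loss(first_loss, second_loss, second_weight=1.0):
    """Combine both losses; second_weight is a hypothetical trade-off factor."""
    return first_loss + second_weight * second_loss

predicted = [0.6, 0.3, 0.1]   # cell proportion output by the predictor (toy values)
sampled = [0.5, 0.3, 0.2]     # sampling proportion used to build the weighted feature
first = proportion_loss(predicted, sampled)
loss = total_loss(first, second_loss=0.25, second_weight=0.5)
```

In a gradient-based implementation, the combined value would be the quantity minimized with respect to the parameters of the predictor and the feature encoder.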
In some embodiments, the training unit 950 is specifically configured to:
inputting the first tissue feature and the first weighted feature into a discriminator to obtain a discrimination domain of each of the first tissue feature and the first weighted feature, wherein the discrimination domain is a source domain or a target domain;
determining the second loss according to the discrimination domains and the real domains of the first tissue feature and the first weighted feature, wherein the real domain of the first tissue feature is the target domain, and the real domain of the first weighted feature is the source domain;
and updating parameters of the feature encoder, the predictor and the discriminator according to the first loss and the second loss to obtain the cell type prediction model.
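By way of example and not limitation, a discriminator-based second loss can be sketched as a binary cross-entropy between the domain output by the discriminator and the real domain. The logistic discriminator, the 0/1 label encoding, and the averaging of the two terms are assumptions made for illustration.

```python
import math

SOURCE, TARGET = 0.0, 1.0  # assumed label encoding for the two domains

def discriminator(feature, weights, bias):
    """Hypothetical logistic discriminator: probability that a feature is from the target domain."""
    z = sum(w * f for w, f in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy between a predicted probability and a 0/1 domain label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def domain_loss(tissue_feature, weighted_feature, weights, bias):
    """Second loss: tissue features are labeled target, weighted features are labeled source."""
    p_tissue = discriminator(tissue_feature, weights, bias)
    p_weighted = discriminator(weighted_feature, weights, bias)
    return 0.5 * (binary_cross_entropy(p_tissue, TARGET)
                  + binary_cross_entropy(p_weighted, SOURCE))

# An uninformative discriminator (output 0.5 for both inputs) gives a loss of ln 2
second_loss = domain_loss([0.0, 0.0], [0.0, 0.0], weights=[0.3, -0.2], bias=0.0)
```

A loss near ln 2 means the discriminator cannot tell the two domains apart, whereas a loss near 0 means the tissue features and the weighted single-cell features are still easily distinguishable.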
In some embodiments, the updated feature encoder 920 is further configured to take the first proteome data as input to obtain a second tissue feature, and to take the second proteome data corresponding to each of the plurality of cell types as input to obtain second single-cell features corresponding to each of the plurality of cell types;
the summing unit 930 is further configured to perform weighted summation on the second single-cell features corresponding to the plurality of cell types to obtain a second weighted feature;
the updated discriminator is further configured to take the second tissue feature and the second weighted feature as input to obtain a discrimination domain of each of the second tissue feature and the second weighted feature;
the training unit 950 is further configured to: determine a third loss according to the discrimination domains and the real domains of the second tissue feature and the second weighted feature, wherein the real domain of the second tissue feature is a target domain and the discrimination domain of the second weighted feature is a source domain; and update parameters of the feature encoder and the discriminator according to the third loss.
In some embodiments, the summing unit 930 is specifically configured to:
performing linear weighted summation on the first single-cell features corresponding to the plurality of cell types by using a mixed enhancement algorithm to obtain the first weighted feature.
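By way of example and not limitation, the linear weighted summation can be sketched as follows, with one feature vector per cell type and one proportion per cell type. The function name `weighted_sum` and the toy values are hypothetical, and the sketch covers only the linear combination itself, not any further detail of the mixed enhancement algorithm.

```python
def weighted_sum(single_cell_features, proportions):
    """Linearly combine one feature vector per cell type using the given proportions."""
    if len(single_cell_features) != len(proportions):
        raise ValueError("need one proportion per cell type")
    dim = len(single_cell_features[0])
    return [sum(p * feat[i] for p, feat in zip(proportions, single_cell_features))
            for i in range(dim)]

# Three hypothetical cell types with two-dimensional single-cell features
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proportions = [0.5, 0.3, 0.2]
first_weighted_feature = weighted_sum(features, proportions)
```

Because the proportions sum to 1, the result can be read as the feature of a simulated tissue whose composition is known, which is what allows the proportions to serve as a training label.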
In some embodiments, the sampling proportion of the first single-cell features corresponding to the plurality of cell types is determined based on a priori knowledge of the cell proportion of the tissue.
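By way of example and not limitation, one way to derive sampling proportions from prior knowledge is to draw them from a Dirichlet distribution centred on the prior cell proportions. The Dirichlet scheme, the `concentration` parameter, and the prior values are assumptions; the application does not specify how the prior is used.

```python
import random

def sample_proportions(prior, concentration=50.0, rng=random):
    """Draw proportions from a Dirichlet distribution centred on the prior (assumed scheme).

    Each prior entry must be positive. A larger concentration keeps samples closer to the prior.
    """
    draws = [rng.gammavariate(p * concentration, 1.0) for p in prior]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)    # fixed seed so the sketch is repeatable
prior = [0.6, 0.3, 0.1]   # hypothetical prior cell proportions for a tissue
sampled = sample_proportions(prior, concentration=50.0, rng=rng)
```

Sampling around the prior, rather than using it directly, would give the training procedure many different simulated compositions while keeping them biologically plausible for the tissue in question.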
In some embodiments, the first single-cell feature comprises at least one of a single-cell data characterization feature, a single-cell primary protein profile feature, and a cell type feature.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 900 shown in fig. 10 may perform the above method embodiments, and the foregoing and other operations and/or functions of each module in the apparatus 900 are respectively for implementing the corresponding flows in the method 200, which are not described herein for brevity.
FIG. 11 is a schematic block diagram of a cell type prediction apparatus 1000 according to an embodiment of the present application. As shown in fig. 11, the cell type prediction apparatus 1000 may include an acquisition unit 1010 and a cell type prediction model 1020, wherein the cell type prediction model 1020 further includes a feature encoder 1021 and a predictor 1022.
An acquisition unit 1010 for acquiring proteome data of a clinical sample tissue;
a cell type prediction model 1020 comprising a feature encoder 1021 and a predictor 1022; the cell type prediction model 1020 is trained according to the model training method 200 provided by the embodiment of the application;
the feature encoder 1021 is configured to input the proteome data to obtain a tissue feature;
the predictor 1022 is configured to input the tissue characteristic to obtain a predicted value of the cell type ratio of the clinical sample tissue.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 1000 shown in fig. 11 may perform the above method embodiments, and the foregoing and other operations and/or functions of each module in the apparatus 1000 are respectively for implementing the corresponding flow in the above method 800, which is not described herein for brevity.
The apparatus of the embodiments of the present application is described above in terms of functional modules with reference to the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiment in the embodiment of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in a software form, and the steps of the method disclosed in connection with the embodiment of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 12 is a schematic block diagram of an electronic device 1100 provided by an embodiment of the application.
As shown in fig. 12, the electronic device 1100 may include:
a memory 1110 and a processor 1120, the memory 1110 being for storing a computer program and transmitting the program code to the processor 1120. In other words, the processor 1120 may call and run a computer program from the memory 1110 to implement the methods of embodiments of the present application.
For example, the processor 1120 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the application, the processor 1120 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the application, the memory 1110 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
In some embodiments of the application, the computer program may be partitioned into one or more modules that are stored in the memory 1110 and executed by the processor 1120 to perform the methods provided by the present application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are used to describe the execution of the computer program in the electronic device.
As shown in fig. 12, the electronic device 1100 may further include:
a transceiver 1130, the transceiver 1130 may be coupled to the processor 1120 or memory 1110.
Wherein the processor 1120 may control the transceiver 1130 to communicate with other devices, and in particular, may send information or data to other devices, or receive information or data sent by other devices. Transceiver 1130 may include a transmitter and a receiver. Transceiver 1130 may further include antennas, the number of which may be one or more.
It will be appreciated that the various components in the electronic device are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It will be appreciated that in the specific implementation of the present application, when the above embodiments of the present application are applied to specific products or technologies and relate to data related to user information and the like, user permissions or consents need to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method of training a cell type predictive model, comprising:
acquiring first proteome data of clinical sample tissues and second proteome data of single cells of multiple cell types;
inputting the first proteome data into a feature encoder to obtain a first tissue feature;
inputting the second proteome data corresponding to the plurality of cell types into the feature encoder to obtain first single-cell features corresponding to the plurality of cell types;
performing weighted summation on the first single-cell features corresponding to the plurality of cell types to obtain a first weighted feature;
inputting the first weighted feature into a predictor to obtain a cell proportion of the plurality of cell types;
and updating parameters of the predictor and the feature encoder according to the first tissue feature, the first weighted feature and the cell proportion to obtain a cell type prediction model, wherein the cell type prediction model comprises the feature encoder and the predictor.
2. The method of claim 1, wherein the updating parameters of the predictor and the feature encoder according to the first tissue feature, the first weighted feature and the cell proportion to obtain a cell type prediction model comprises:
determining a first loss according to the cell proportion and the sampling proportion of the first single-cell features corresponding to the plurality of cell types;
determining a second loss according to the first tissue feature and the first weighted feature;
and updating parameters of the predictor and the feature encoder according to the first loss and the second loss to obtain the cell type prediction model.
3. The method of claim 2, wherein the determining a second loss according to the first tissue feature and the first weighted feature comprises:
inputting the first tissue feature and the first weighted feature into a discriminator to obtain a discrimination domain of each of the first tissue feature and the first weighted feature, wherein the discrimination domain is a source domain or a target domain;
determining the second loss according to the discrimination domains and the real domains of the first tissue feature and the first weighted feature, wherein the real domain of the first tissue feature is the target domain, and the real domain of the first weighted feature is the source domain;
and wherein the updating parameters of the predictor and the feature encoder according to the first loss and the second loss to obtain the cell type prediction model comprises:
updating parameters of the feature encoder, the predictor and the discriminator according to the first loss and the second loss to obtain the cell type prediction model.
4. The method of claim 3, further comprising:
inputting the first proteome data into the updated feature encoder to obtain a second tissue feature;
inputting the second proteome data corresponding to the plurality of cell types into the updated feature encoder to obtain second single-cell features corresponding to the plurality of cell types;
performing weighted summation on the second single-cell features corresponding to the plurality of cell types to obtain a second weighted feature;
inputting the second tissue feature and the second weighted feature into the updated discriminator to obtain a discrimination domain of each of the second tissue feature and the second weighted feature;
determining a third loss according to the discrimination domains and the real domains of the second tissue feature and the second weighted feature, wherein the real domain of the second tissue feature is a target domain and the discrimination domain of the second weighted feature is a source domain;
and updating parameters of the feature encoder and the discriminator according to the third loss.
5. The method of claim 1, wherein the performing weighted summation on the first single-cell features corresponding to the plurality of cell types to obtain a first weighted feature comprises:
performing linear weighted summation on the first single-cell features corresponding to the plurality of cell types by using a mixed enhancement algorithm to obtain the first weighted feature.
6. The method of claim 2 or 5, wherein the sampling proportion of the first single-cell features corresponding to the plurality of cell types is determined based on a priori knowledge of the cell proportion of the tissue.
7. The method of any one of claims 1-5, wherein the first single-cell feature comprises at least one of a single-cell data characterization feature, a single-cell primary protein profile feature, and a cell type feature.
8. A method of cell type prediction comprising:
acquiring proteome data of clinical sample tissues;
inputting the proteome data into a feature encoder in a cell type prediction model to obtain tissue features;
inputting the tissue characteristics into a predictor in the cell type prediction model to obtain a cell type proportion predicted value of the clinical sample tissue;
wherein the cell type predictive model is trained according to the method of any one of claims 1-7.
9. A cell type prediction model training apparatus, comprising:
an acquisition unit, configured to acquire first proteome data of a clinical sample tissue and second proteome data of single cells of a plurality of cell types;
a feature encoder, configured to take the first proteome data as input to obtain a first tissue feature;
the feature encoder being further configured to take the second proteome data corresponding to each of the plurality of cell types as input to obtain first single-cell features corresponding to each of the plurality of cell types;
a summing unit, configured to perform weighted summation on the first single-cell features corresponding to the plurality of cell types to obtain a first weighted feature;
a predictor, configured to take the first weighted feature as input to obtain a cell proportion of the plurality of cell types;
and a training unit, configured to update parameters of the predictor and the feature encoder according to the first tissue feature, the first weighted feature and the cell proportion to obtain a cell type prediction model, wherein the cell type prediction model comprises the feature encoder and the predictor.
10. A cell type prediction apparatus, comprising:
an acquisition unit for acquiring proteome data of a clinical sample tissue;
a cell type prediction model comprising a feature encoder and a predictor, the cell type prediction model being trained according to the method of any one of claims 1-7;
the feature encoder is used for inputting the proteome data to obtain tissue features;
the predictor is used for inputting the tissue characteristics to obtain a cell type proportion predicted value of the clinical sample tissue.
11. An electronic device comprising a processor and a memory, the memory having instructions stored therein that when executed by the processor cause the processor to perform the method of any of claims 1-8.
12. A computer storage medium for storing a computer program, the computer program comprising instructions for performing the method of any one of claims 1-8.
13. A computer program product comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1-8.
CN202211216020.3A 2022-09-30 2022-09-30 Cell type prediction model training method, cell type prediction method and device Pending CN117037917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211216020.3A CN117037917A (en) 2022-09-30 2022-09-30 Cell type prediction model training method, cell type prediction method and device

Publications (1)

Publication Number Publication Date
CN117037917A true CN117037917A (en) 2023-11-10

Family

ID=88641724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211216020.3A Pending CN117037917A (en) 2022-09-30 2022-09-30 Cell type prediction model training method, cell type prediction method and device

Country Status (1)

Country Link
CN (1) CN117037917A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746995A (en) * 2024-02-21 2024-03-22 厦门大学 Cell type identification method, device and equipment based on single-cell RNA sequencing data
CN117746995B (en) * 2024-02-21 2024-05-28 厦门大学 Cell type identification method, device and equipment based on single-cell RNA sequencing data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination