CN117292750A - Cell type duty ratio prediction method, device, equipment and storage medium - Google Patents
- Publication number
- CN117292750A (application CN202311279355.4A)
- Authority
- CN
- China
- Prior art keywords
- information
- cell
- model
- data
- distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a method, a device, equipment and a storage medium for predicting cell type proportions, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring spatial omics data, wherein the spatial omics data comprises characteristic information and position information of at least one cell measurement point; acquiring label information corresponding to the spatial omics data, wherein the label information indicates the cell type proportions of each cell measurement point; obtaining distance information based on the position information of the at least one cell measurement point, wherein the distance information indicates the distances between the cell measurement points; and training a first model based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model, wherein the trained first model is used for predicting the cell type proportions of cell measurement points. This training method places no requirement on the distribution of the spatial omics data and is not specific to spatial transcriptome data or any other single kind of spatial omics data, so it has strong universality.
Description
Technical Field
The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a method, a device, equipment and a storage medium for predicting cell type proportions.
Background
Omics refers to the systematic study in biology of collections of various study objects, such as genomics, transcriptomics, proteomics, metabolomics, and the like. Spatial omics builds on general omics by retaining the spatial positions of the study objects and incorporating them into the study process.
Due to limitations of sequencing technology, a large part of data obtained by the current technology cannot reach cell-level resolution, but takes a plurality of cells as a unit. For this type of data, the need for cell type deconvolution tasks arises naturally. Cell type deconvolution is the process of calculating the proportion of different types of cells in each unit.
In the related art, cell type deconvolution may be performed on spatial transcriptome data. However, such methods rely on an assumed distribution of the spatial transcriptome data (e.g., in some schemes that use probabilistic models for cell type deconvolution, the input spatial transcriptome data must be modeled as a negative binomial distribution, which matches the empirical characteristics of spatial transcriptome data). As a result, they cannot be extended to other kinds of spatial omics data (e.g., spatial proteome data), and their versatility is poor.
Disclosure of Invention
The embodiments of the application provide a method, a device, equipment and a storage medium for predicting cell type proportions. The technical solutions provided by the embodiments of the application are as follows:
According to one aspect of embodiments of the present application, there is provided a method for predicting cell type proportions, the method comprising:
acquiring spatial omics data, wherein the spatial omics data comprises characteristic information and position information of at least one cell measurement point, each cell measurement point comprises a plurality of cells, the characteristic information indicates the characteristic expression conditions of the cell measurement point, and the position information indicates the position of the cell measurement point;
acquiring label information corresponding to the spatial omics data, wherein the label information indicates the cell type proportions of each cell measurement point;
obtaining distance information based on the position information of the at least one cell measurement point, wherein the distance information indicates the distances between the cell measurement points;
training a first model based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model, wherein the trained first model is used for predicting the cell type proportions of cell measurement points.
According to an aspect of embodiments of the present application, there is provided a device for predicting cell type proportions, the device comprising:
a first acquisition module, configured to acquire spatial omics data, wherein the spatial omics data comprises characteristic information and position information of at least one cell measurement point, each cell measurement point comprises a plurality of cells, the characteristic information indicates the characteristic expression conditions of the cell measurement point, and the position information indicates the position of the cell measurement point;
a second acquisition module, configured to acquire label information corresponding to the spatial omics data, wherein the label information indicates the cell type proportions of each cell measurement point;
a distance acquisition module, configured to obtain distance information based on the position information of the at least one cell measurement point, wherein the distance information indicates the distances between the cell measurement points;
a training module, configured to train a first model based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model, wherein the trained first model is used for predicting the cell type proportions of cell measurement points.
According to another aspect of embodiments of the present application, there is provided a device for predicting cell type proportions, the device comprising:
an acquisition module, configured to acquire spatial omics data to be analyzed, wherein the spatial omics data to be analyzed comprises characteristic information and position information of at least one cell measurement point, each cell measurement point comprises a plurality of cells, the characteristic information indicates the characteristic expression conditions of the cell measurement point, and the position information indicates the position of the cell measurement point;
a distance acquisition module, configured to obtain distance information based on the position information of the at least one cell measurement point, wherein the distance information indicates the distances between the cell measurement points;
a prediction module, configured to obtain first information through a first model based on the characteristic information of the at least one cell measurement point and the distance information, wherein the first information indicates the first model's prediction of the cell type proportions of each cell measurement point.
According to an aspect of embodiments of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein a computer program loaded and executed by the processor to implement the above-described method of predicting a cell type fraction.
According to an aspect of embodiments of the present application, there is provided a computer readable storage medium having stored therein a computer program loaded and executed by a processor to implement the above-described method of predicting a cell type fraction.
According to an aspect of embodiments of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the above-described method of predicting a cell type fraction.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
Spatial omics data comprising characteristic information and position information of at least one cell measurement point is acquired, together with the corresponding label information. Distance information indicating the distances between the cell measurement points is obtained based on the position information of the at least one cell measurement point. The first model is then trained based on the characteristic information, the label information and the distance information of the at least one cell measurement point, yielding a trained first model for predicting the cell type proportions of cell measurement points. This training method places no requirement on the distribution of the spatial omics data and is not specific to spatial transcriptome data or any other single kind of spatial omics data, so it has strong universality.
Drawings
FIG. 1 is a schematic diagram of an implementation environment for an embodiment provided herein;
FIG. 2 is a flow chart of a method for predicting cell type proportions provided in one embodiment of the present application;
FIG. 3 is a schematic view of the structure of a first model provided in one embodiment of the present application;
FIG. 4 is a schematic diagram of a first probability distribution provided by one embodiment of the present application;
FIG. 5 is a flow chart of a method for predicting cell type proportions provided in another embodiment of the present application;
FIG. 6 is a schematic diagram of a complete cell type deconvolution scheme provided in one embodiment of the present application;
FIG. 7 is a schematic representation of spatial transcriptome data provided by one embodiment of the present application;
FIG. 8 is a schematic diagram of deconvolution results provided by one embodiment of the present application;
FIG. 9 is a schematic diagram of deconvolution results provided in another embodiment of the present application;
FIG. 10 is a block diagram of a cell type proportion prediction apparatus provided in one embodiment of the present application;
FIG. 11 is a block diagram of a cell type proportion prediction apparatus provided in another embodiment of the present application;
FIG. 12 is a block diagram of a computer device provided in one embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence (AI) is the theory, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-trained model technologies, operation/interaction systems, mechatronics, and the like. A pre-trained model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is the science of how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to recognize and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large model technology has brought important change to the development of computer vision: pre-trained models in the vision field, such as Swin-Transformer, ViT (Vision Transformer), V-MoE (Vision Mixture of Experts), and MAE (Masked Autoencoder), can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
Machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and how they can reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. The pre-trained model is the latest development of deep learning and integrates the above techniques.
The technical scheme provided by the embodiment of the application relates to the technologies of computer vision, machine learning and the like of artificial intelligence, and is specifically described through the following embodiments.
Referring to FIG. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The implementation environment may include a model training apparatus 10 and a model using apparatus 20.
The model training device 10 may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a server, an intelligent robot, an intelligent television, a multimedia playing device, or some other electronic device with a relatively high computing power, which is not limited in this application. Model training apparatus 10 is used to train a first model 30 and a second model 40.
In the present embodiment, the first model 30 and the second model 40 are machine learning models. Optionally, the model training apparatus 10 may train the first model 30 and the second model 40 by machine learning so that they have better performance.
Optionally, the training process of the first model 30 is as follows (only briefly described here; the specific training process is described in the following embodiments): spatial omics data comprising characteristic information and position information of at least one cell measurement point, and label information corresponding to the spatial omics data, are acquired, wherein each cell measurement point comprises a plurality of cells. Distance information indicating the distances between the cell measurement points is obtained based on the position information of the at least one cell measurement point. The first model is trained based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model. Optionally, the label information may be generated by a trained second model based on the characteristic information of the at least one cell measurement point.
Optionally, the training process of the second model 40 is as follows (only briefly described here; the specific training process is described in the following embodiments): reference data corresponding to the spatial omics data is obtained, wherein the reference data indicates the characteristic expression conditions respectively corresponding to a plurality of cell types, and the plurality of cell types include the cell types respectively corresponding to the plurality of cells. Based on the reference data, simulation data is generated that includes characteristic information of at least one simulated cell measurement point. The second model is trained based on the simulation data, resulting in a trained second model 40. The trained second model is used to generate the label information used for training the first model 30.
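The generation of simulation data from reference data can be illustrated with a minimal sketch. The application does not specify the simulation procedure at this point, so the approach below (mixing per-cell-type reference profiles with randomly drawn proportions) is only one plausible assumption, and the names `simulate_spots`, `reference`, and `num_spots` are hypothetical.

```python
import random

def simulate_spots(reference, num_spots, seed=0):
    """Generate simulated cell measurement points from reference profiles.

    reference: c x n list, one mean expression profile per cell type
               (an assumed format, not taken from the application).
    Returns (features, proportions): num_spots x n and num_spots x c lists.
    """
    rng = random.Random(seed)
    c = len(reference)
    n = len(reference[0])
    features, proportions = [], []
    for _ in range(num_spots):
        # Draw positive random weights and normalize them into proportions.
        weights = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(weights)
        props = [w / total for w in weights]
        # Simulated expression is the proportion-weighted mix of the profiles.
        mixed = [sum(props[k] * reference[k][g] for k in range(c)) for g in range(n)]
        features.append(mixed)
        proportions.append(props)
    return features, proportions

# Toy reference data: 3 cell types, 4 expression features (made-up values).
reference = [
    [10.0, 0.0, 2.0, 1.0],
    [0.0, 8.0, 1.0, 3.0],
    [1.0, 1.0, 9.0, 0.0],
]
sim_features, sim_labels = simulate_spots(reference, num_spots=5)
```

Because the proportions used to build each simulated point are known, they can serve as training targets for a model such as the second model described above.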
In some embodiments, the model using device 20 may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a server, a smart robot, a smart television, a multimedia playing device, or some other electronic device with relatively high computing power, which is not limited in this application. For example, for spatial omics data to be analyzed, the trained first model may predict the cell type proportions of the cell measurement points therein, i.e., perform a cell type deconvolution task.
The model training apparatus 10 and the model using apparatus 20 may be two independent apparatuses or the same apparatus.
In the method provided by the embodiment of the application, the execution subject of each step may be a computer device, and the computer device refers to an electronic device with data computing, processing and storage capabilities. When the electronic device is a server, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The computer device may be the model training device 10 of fig. 1 or the model using device 20.
Referring to fig. 2, a flowchart of a method for predicting a cell type ratio according to an embodiment of the present application is shown. The subject of execution of the steps of the method may be the model training apparatus described above. In the following method embodiments, for convenience of description, only the execution subject of each step is described as "computer device". The method may include at least one of the following steps 210-240.
Step 210, obtaining spatial omics data, wherein the spatial omics data comprises characteristic information and position information of at least one cell measurement point, each cell measurement point comprises a plurality of cells, the characteristic information indicates the characteristic expression conditions of the cell measurement point, and the position information indicates the position of the cell measurement point.
Sequencing refers to a technique for determining the sequence of a biomolecule such as a nucleic acid or an amino acid.
Omics data is data obtained by sequencing a cell tissue with a plurality of cells as one unit. Spatial omics data additionally comprises the spatial position information of the study objects on top of the omics data. Each such unit, referred to as a cell measurement point in the embodiments herein, can be understood as a measurement point in the sequencing process. Accordingly, the position of a cell measurement point is its position in the tissue sample used for sequencing. In some embodiments, a cell measurement point may also be referred to as a "spot", and the (spatial) omics data may also be referred to as bulk data.
The spatial omics data may include spatial genomics data, spatial transcriptomics data, spatial proteomics data, spatial metabolomics data, and the like, which is not limited in this application. Illustratively, spatial transcriptome data can be obtained by using sequencing techniques to measure the gene expression of cells within a tissue; spatial proteome data can be obtained by measuring the protein expression of cells within the tissue.
Accordingly, the characteristic expression conditions may include the gene expression, protein expression, metabolite expression, and the like of the cell measurement points, which is not limited in this application. For example, if the spatial omics data is spatial proteomics data, the characteristic expression condition indicated by the characteristic information is the protein expression of the cell measurement point.
In some embodiments, the characteristic information of the at least one cell measurement point may be represented as an m×n characteristic expression matrix, where m and n are both positive integers, m is the number of cell measurement points, and n is the number of expression feature dimensions; that is, each row of the matrix corresponds to one cell measurement point, and each column corresponds to one expression feature dimension. Illustratively, taking spatial transcriptomics data as an example of spatial omics data, if the characteristic information of the at least one cell measurement point is represented as a 5×6 characteristic expression matrix, the matrix indicates the expression of 6 different genes at each of 5 cell measurement points.
In some embodiments, the position information of the at least one cell measurement point may be represented as an m×2 coordinate matrix, that is, a matrix formed by the coordinates of the m cell measurement points. The coordinates are two-dimensional; illustratively, in a planar rectangular coordinate system, each coordinate consists of an abscissa (the coordinate in the horizontal direction) and an ordinate (the coordinate in the vertical direction).
In the embodiments of the present application, matrices are used only as an example of how the various data/information may be represented. The data/information mentioned in the embodiments of the present application may also be represented as vectors, graphs, etc., which is not limited in this application.
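The matrix representations described above can be sketched with toy data. The snippet below builds a 5×6 characteristic expression matrix and a matching 5×2 coordinate matrix; all values, and the names `features`, `coords`, and `check_shapes`, are illustrative assumptions and are not taken from the application.

```python
# Toy spatial omics data: m = 5 cell measurement points, n = 6 expression features.
# All numbers are made up for illustration.
features = [
    [3.0, 0.0, 1.0, 7.0, 2.0, 0.0],
    [0.0, 5.0, 2.0, 1.0, 0.0, 4.0],
    [1.0, 1.0, 0.0, 6.0, 3.0, 2.0],
    [4.0, 0.0, 0.0, 2.0, 8.0, 1.0],
    [2.0, 3.0, 5.0, 0.0, 1.0, 0.0],
]

# One (x, y) coordinate per cell measurement point: an m x 2 matrix.
coords = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
]

def check_shapes(features, coords):
    """Return (m, n) after checking that both matrices describe the same m points."""
    m = len(features)
    n = len(features[0])
    if any(len(row) != n for row in features):
        raise ValueError("every row of the feature matrix must have n entries")
    if len(coords) != m or any(len(xy) != 2 for xy in coords):
        raise ValueError("the coordinate matrix must be m x 2")
    return m, n

m, n = check_shapes(features, coords)
print(m, n)  # 5 6
```

Row i of `features` and row i of `coords` refer to the same cell measurement point, which is what allows expression and position to be combined later.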
Step 220, obtaining label information corresponding to the spatial omics data, wherein the label information indicates the cell type proportions of each cell measurement point.
In some embodiments, the label information may be represented as an m×c label matrix, where c is the number of cell types; that is, each row of the label matrix corresponds to one cell measurement point and each column corresponds to one cell type. Here c is an integer greater than 1 and should be greater than or equal to the number of cell types contained in the tissue sample corresponding to the spatial omics data, and the values in each row of the matrix sum to 1. For example, if the label information is represented as a 4×3 label matrix whose first row contains the element values 0.5, 0.4 and 0.1 in the first, second and third columns respectively, this indicates that, in the cell measurement point corresponding to the first row, the cell type corresponding to the first column accounts for a proportion of 0.5, the cell type corresponding to the second column for 0.4, and the cell type corresponding to the third column for 0.1.
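The constraints on the label matrix stated above (c greater than 1, every row summing to 1) can be checked with a short sketch. The helper name `is_valid_label_matrix`, the non-negativity check, and all rows other than the first (0.5, 0.4, 0.1) are assumptions added for illustration.

```python
import math

def is_valid_label_matrix(labels, tol=1e-9):
    """Check that labels is an m x c matrix (c > 1) of proportions whose rows sum to 1."""
    if not labels:
        return False
    c = len(labels[0])
    if c < 2:
        return False
    for row in labels:
        if len(row) != c:
            return False
        # Proportions are assumed to be non-negative.
        if any(p < 0 for p in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True

# A 4 x 3 label matrix: 4 cell measurement points, 3 cell types.
labels = [
    [0.5, 0.4, 0.1],  # the example row from the text
    [0.2, 0.2, 0.6],
    [1.0, 0.0, 0.0],
    [0.3, 0.3, 0.4],
]
print(is_valid_label_matrix(labels))  # True
```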
In some embodiments, the tag information is not a genuine tag that has been validated by calibration, but a pseudo tag obtained by some other cell type deconvolution scheme. Illustratively, the tag information may be obtained by deconvolution schemes in the related art, such as Cell2location, RCTD (Robust Cell Type Decomposition), Stereoscope, SPOTlight (based on non-negative matrix factorization), SpatialDWLS (accurate deconvolution of spatial transcriptomic data), Tangram (based on deep learning), and the like.
It should be noted that the deconvolution schemes in the above related art rely on an assumed distribution of spatial transcriptome data and therefore cannot be applied widely.
In some embodiments, the tag information may also be obtained based on the characteristic information of the at least one cell site by the trained second model.
In some embodiments, the second model may be a DNN (Deep Neural Network) model, a one-dimensional CNN (Convolutional Neural Network) model, or another model capable of predicting the cell type ratio based on the characteristic information.
The second model places no requirement on the distribution of the input characteristic information, so it can perform cell type deconvolution on various types of spatial histology data rather than being tailored to spatial transcriptome data or any one type of spatial histology data; it can thus generate the corresponding tag information with greater generality. In addition, the tag information is obtained from the characteristic information of the at least one cell site, and no data other than the spatial histology data needs to be introduced in this process, which reduces data redundancy in the training of the first model.
The specific training process of the second model is shown in the following examples, and will not be described herein.
Step 230, obtaining distance information based on the position information of at least one cell measurement point, wherein the distance information is used for indicating the distance between each cell measurement point.
In some embodiments, the distance information may directly indicate a specific value of the distance between the individual cell sites.
In some embodiments, the distance information may indicate how far or near the distance between the individual cell sites. Illustratively, the distance between two cell sites may be divided into a plurality of different levels, e.g., the distance between cell sites (via a set threshold) may be divided into 3 levels, adjacent, near, far, respectively, and characterized in different ways (e.g., different values).
In some embodiments, the distance information may indicate adjacency between individual cell sites. For example, two cell sites whose distance is within a set range may be regarded as adjacent cell sites, and two cell sites whose distance is outside the set range may be regarded as non-adjacent cell sites. Alternatively, the k-nearest neighbor method can be used to determine whether one cell site is adjacent to another, i.e., whether the latter is among the k nearest neighbors of the former (over all cell sites).
In some embodiments, the distance information may be represented as an m×m-dimensional adjacency matrix, where each row or column corresponds to a cell site. Accordingly, the element in the i-th row and j-th column (i and j are positive integers less than or equal to m) represents the adjacency of the i-th cell site to the j-th cell site. Illustratively, when the coordinates of the i-th and j-th cell sites satisfy a set condition (typically, the distance between the two is within a certain range, or the latter is among the k nearest neighbors of the former), the element in the i-th row and j-th column is set to the negative of the distance between the two, normalized to [0,1]; otherwise the element value is zero. Thus, the closer two cell sites are, the greater the value of the corresponding element in the adjacency matrix.
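The adjacency matrix described above can be sketched as follows. This is only an illustration under stated assumptions: the application does not specify the normalization, so the sketch uses a distance threshold `radius` as the set condition and the weight 1 - d / radius, which is the negated distance rescaled from [-radius, 0] to [0, 1]. The function name, the coordinates, and the choice to keep the diagonal (distance 0, weight 1) are all hypothetical.

```python
import math


def adjacency_from_coords(coords, radius):
    """Build an m x m weighted adjacency matrix from m (x, y) coordinates.

    Assumption: two sites are adjacent when their Euclidean distance is at
    most `radius`; the weight is 1 - d / radius, so closer sites get larger
    values and non-adjacent pairs stay at 0.
    """
    m = len(coords)
    adj = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            d = math.dist(coords[i], coords[j])
            if d <= radius:
                adj[i][j] = 1.0 - d / radius
    return adj


# Hypothetical m x 2 coordinate matrix with m = 4 cell sites.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
adj = adjacency_from_coords(coords, radius=2.5)
```

A k-nearest-neighbor condition could replace the radius test without changing the rest of the sketch; the resulting matrix would then not necessarily be symmetric.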
Step 240, training the first model based on the characteristic information, the label information and the distance information of at least one cell measurement point to obtain a trained first model, wherein the trained first model is used for predicting the cell type duty ratio of the cell measurement point.
Referring to fig. 3, in some embodiments, the first model 30 includes a first sub-model 31 including a first encoder and a first decoder, a second sub-model 32 including a second encoder and a second decoder, and a classifier 33.
In some embodiments, the first model may be a GCN (Graph Convolutional Network) model.
In some embodiments, the first sub-model may be a DAE (Deep Auto-Encoder), an MAE (Masked Auto-Encoder), or another model capable of encoding and decoding the characteristic information.
In some embodiments, the second sub-model may be a GAE (Graph Auto-Encoder), a VGAE (Variational Graph Auto-Encoder), or another model capable of encoding and decoding the distance information.
In some embodiments, step 240 includes at least one of the following sub-steps 242-256:
In step 242, the first encoder obtains first encoded information based on the characteristic information of at least one cell site.
The first encoded information is obtained by encoding (compressing) the characteristic information of the at least one cell site with the first encoder, and can be used to characterize the features of that characteristic information.
In a substep 244, the second encoded information is obtained by the second encoder based on the first encoded information and the distance information.
The second encoded information is obtained by encoding the first encoded information and the distance information with the second encoder, and can be used to characterize the features obtained by combining the distance information with the first encoded information.
And a sub-step 246 of splicing the first encoded information and the second encoded information to obtain third encoded information.
The third encoded information may be used to characterize the characteristic information of the at least one cell site in combination with the distance information.
In a sub-step 248, the first decoder obtains the characteristic information of the reconstructed at least one cell site based on the third encoded information.
The reconstructed characteristic information of the at least one cell measurement point is obtained by decoding (reconstructing) the third encoded information by the first decoder, and can also be used for indicating the characteristic expression condition of the at least one cell measurement point.
For example, if the characteristic information of at least one cell measurement point is represented as an mxn-dimensional characteristic expression matrix, the characteristic information of the reconstructed at least one cell measurement point may also be represented as an mxn-dimensional characteristic expression matrix, and in the reconstructed characteristic expression matrix, the values of the elements may change.
In a sub-step 252, the reconstructed distance information is obtained by the second decoder based on the third encoded information.
The reconstructed distance information is obtained by decoding (reconstructing) the third encoded information by the second decoder, and can also be used for indicating the distance between the cell measurement points.
For example, if the distance information is represented as an m×m-dimensional adjacency matrix, the reconstructed distance information may also be represented as an m×m-dimensional adjacency matrix, and in the reconstructed adjacency matrix, the values of the elements may change.
In step 254, based on the third encoded information, first information is obtained by the classifier, where the first information is used to indicate a predicted result of the first model on the cell type ratio of each cell site.
In some embodiments, the first information may also be represented as an m×c-dimensional matrix, similar to the tag information described above.
In some embodiments, the classifier classifies (predicts) the cell types of the cell measurement points for a plurality of cell types based on the third coding information, so as to obtain probabilities that the cell measurement points respectively belong to the cell types, and the probabilities can be used as prediction results of the first model on the cell type ratio of the cell measurement points.
In the substep 256, parameters of the first model are adjusted based on the characteristic information of the at least one cell site, the reconstructed characteristic information of the at least one cell site, the distance information, the reconstructed distance information, the first information and the tag information, so as to obtain an adjusted first model.
By training the first model through the method, the model can learn the characteristics of combining the characteristic information and the distance information, and can learn the mapping relation between the characteristics and the cell type ratio, so that the performance of the first model for predicting the cell type ratio is improved.
In some embodiments, substep 256 includes at least one of the following steps:
1. based on the characteristic information of the at least one cell site and the characteristic information of the reconstructed at least one cell site, a first reconstruction loss is calculated, the first reconstruction loss being used to characterize a difference between the characteristic information of the at least one cell site and the characteristic information of the reconstructed at least one cell site.
In some embodiments, the MSE (Mean Squared Error) loss between the characteristic information of the at least one cell site and the characteristic information of the reconstructed at least one cell site is used as the first reconstruction loss.
2. Based on the distance information and the reconstructed distance information, a second reconstruction loss is calculated, the second reconstruction loss being used to characterize a difference between the distance information and the reconstructed distance information.
In some embodiments, the cross-entropy loss between the distance information and the reconstructed distance information is used as the second reconstruction loss.
3. Based on the first information and the tag information, a classification loss is calculated, the classification loss being used to characterize the difference between the first information and the tag information.
In some embodiments, the MSE loss between the first information and the tag information is used as the classification loss.
4. And carrying out weighted summation on the first reconstruction loss, the second reconstruction loss and the classification loss to obtain a first loss, wherein the first loss is used for measuring the performance of the first model for predicting the cell type ratio.
That is, the first loss is $L = W_D \times L_D + W_G \times L_G + W_C \times L_C$, where $L_D$ is the first reconstruction loss, $L_G$ is the second reconstruction loss, $L_C$ is the classification loss, and $W_D$, $W_G$, and $W_C$ are respectively different weight parameters, which can be set by a technician according to requirements and are not limited in this application.
5. And adjusting parameters of the first model according to the first loss to obtain a trained first model.
In some embodiments, parameters of the first model are adjusted with the aim of minimizing the first loss, resulting in a trained first model.
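The weighted combination of the three losses in steps 1 to 4 can be illustrated with a small plain-Python sketch. The helper names (`mse`, `binary_cross_entropy`, `first_loss`) and the toy inputs are assumptions made for illustration; in particular, the application only says "cross-entropy loss" for the second reconstruction loss, and the element-wise binary form used here is one possible reading.

```python
import math


def mse(a, b):
    """Mean squared error over all elements of two equally shaped matrices."""
    count = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count


def binary_cross_entropy(target, pred, eps=1e-12):
    """Element-wise binary cross entropy, averaged over all elements."""
    total, count = 0.0, 0
    for row_t, row_p in zip(target, pred):
        for t, p in zip(row_t, row_p):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def first_loss(x, x_rec, adj, adj_rec, pred, labels, w_d=1.0, w_g=1.0, w_c=1.0):
    """L = W_D * L_D + W_G * L_G + W_C * L_C, as in the formula above."""
    l_d = mse(x, x_rec)                       # first reconstruction loss
    l_g = binary_cross_entropy(adj, adj_rec)  # second reconstruction loss (assumed form)
    l_c = mse(pred, labels)                   # classification loss
    return w_d * l_d + w_g * l_g + w_c * l_c


# Toy one-row inputs, purely illustrative.
loss = first_loss(
    x=[[1.0, 2.0]], x_rec=[[1.0, 4.0]],
    adj=[[1.0, 0.0]], adj_rec=[[0.5, 0.5]],
    pred=[[0.5, 0.5]], labels=[[1.0, 0.0]],
    w_d=1.0, w_g=0.0, w_c=2.0,
)
```

Lowering `w_c` relative to the other weights corresponds to trusting the pseudo tags less, which is the kind of adjustment the weight parameters are meant to allow.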
The method balances the self-supervision and the supervised learning of the first model, and does not require the first information to be completely aligned with the tag information. And under the condition of comprehensively considering the expression characteristics and the spatial information, the parameters of the first model are adjusted by combining the label information, so that the first information which is closer to the real situation than the label information can be finally generated. And based on the credibility of the label information, technicians can flexibly adjust the weight parameters, so that more reasonable first loss is obtained.
According to the technical scheme provided by the embodiments of the present application, spatial histology data comprising the characteristic information and the position information of at least one cell measurement point, and the label information corresponding to the spatial histology data, are obtained. Distance information indicating the distance between the respective cell measurement points is obtained based on the position information of the at least one cell measurement point. The first model is trained based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model for predicting the cell type ratio of the cell measurement point. This training method for the first model places no requirement on the distribution of the spatial histology data, is not tailored to spatial transcriptome data or any one type of spatial histology data, and therefore has strong universality.
By converting the position information into distance information and training the first model with it, the first model can be trained on the relative spatial position and neighborhood information of each cell measurement point in combination with the corresponding characteristic information. The spatial histology data is thus fully utilized, which ultimately improves the performance of the first model in predicting the cell type ratio for spatial histology data.
The training process of the second model is described below by some embodiments.
In some embodiments, the second model may be trained by the following steps.
1. Obtaining reference data corresponding to the spatial histology data, wherein the reference data is used for indicating the characteristic expression conditions respectively corresponding to a plurality of cell types, and the plurality of cell types comprise the cell types respectively corresponding to a plurality of cells.
The plurality of cells are cells included in a cell site corresponding to the spatial histology data. That is, the reference data indicates a number of cell types greater than or equal to the number of cell types included in the tissue sample to which the spatial histology data corresponds.
In some embodiments, the reference data may be represented as a c×n-dimensional reference matrix. Each row of the reference matrix corresponds to a cell type and each column corresponds to a dimension of the expressed feature.
The reference data corresponds to the spatial histology data. For example, if the spatial histology data is spatial proteomics data, the reference data is used to indicate the protein expression of each of the plurality of cell types. The reference data may be typical characteristic expression profiles of various cell types that are well known in the art, or alternatively, the reference data may be obtained by non-spatial sequencing techniques (e.g., single-cell sequencing).
It should be noted that the above reference matrix is merely an exemplary representation method of the reference data, and the form of the reference data is not limited in this application. For example, some cell types may have a variety of different profiles, where the reference matrix cannot be used to characterize the profile of one cell type in a row. In some embodiments, the reference data may also be raw sequencing data.
2. Based on the reference data, simulation data is generated, the simulation data comprising characteristic information of at least one simulated cell site.
The simulated cell sites are not cell sites in a true sequencing process, and the number of cell types and the cell type ratios that they contain are set by the skilled artisan.
In some embodiments, the simulated data may be generated based on the reference data by:
(1) Based on the number of cell types included in the plurality of cell types, a first probability distribution is generated, the first probability distribution being indicative of a probability distribution of the number of cell types respectively corresponding to each of the simulated cell sites.
The first probability distribution described above may also be understood as the proportions, within the simulated data, of simulated cell sites containing different numbers of cell types. For example, if the number of cell types included in the plurality of cell types is 3, the simulated cell sites containing 1, 2, and 3 cell types in the simulated data may each be assigned a proportion, with the three proportions summing to 1, and these proportions serve as the first probability distribution.
In some embodiments, in the first probability distribution, the smaller the number of cell types, the higher the probability that the number of cell types corresponds. For example, referring to fig. 4, the first probability distribution may refer to a normal distribution with a mean value of 0. In fig. 4, curve 41 is a continuous normal distribution, and broken line 42 is a probability distribution (i.e., a first probability distribution) of the number of cell types corresponding to each simulated cell site, respectively, in one example.
By setting the first probability distribution through the method, the simulation data can better simulate the real situation that the probability of occurrence of the cell measuring points with fewer cell types is higher.
(2) Based on the first probability distribution, simulated tag information is generated, the simulated tag information being used to indicate the cell type ratio of each simulated cell site.
In some embodiments, the number of cell types at each simulated cell site is assigned based on the first probability distribution, and the cell type ratios at each simulated cell site are randomly assigned while guaranteeing that the cell type ratios at each simulated cell site sum to 1.
In some embodiments, the simulated tag information may be represented as a k×c-dimensional simulated tag matrix, where k is the number of simulated cell sites and k is a positive integer. Each row of the simulated tag matrix corresponds to a simulated cell site and each column corresponds to a cell type.
In some embodiments, the number of non-zero elements in each row of the simulated tag matrix may be determined based on the first probability distribution. For example, if the first probability distribution is as shown in fig. 4 (c=19 in fig. 4), the values on the horizontal axis are i (i = 1, 2, …, c) and the corresponding probability values are $p_i$. A k×c-dimensional initialization matrix can be preset; for each i, $k \times p_i$ rows are randomly selected from the initialization matrix without repetition, and the number of non-zero elements of these rows is designated as i. Further, for each row in the initialization matrix, the values of the non-zero elements are randomly generated based on the determined number of non-zero elements, while ensuring that the sum of all elements in each row is 1, so as to obtain the simulated tag matrix.
(3) Based on the reference data and the simulated tag information, simulated data is generated.
In some embodiments, the simulated data may be represented as a k×n-dimensional simulation matrix. Each row of the simulation matrix corresponds to a simulated cell site and each column corresponds to a dimension of the expressed feature.
In some embodiments, the simulated tag matrix is multiplied by a reference matrix to obtain the simulated matrix.
The above method generates simulated data by randomly mixing the expression characteristics of different types of cells in different ratios. In the above step, since the simulation data is generated based on the reference data and the simulation tag information, it is ensured that the simulation tag information is necessarily an error-free true tag of the simulation data, thereby improving the efficiency of performing supervised learning of the subsequent second model.
3. And training the second model based on the simulation data to obtain a trained second model.
In some embodiments, the second model may be trained based on the simulation data by:
(1) And obtaining second information based on the simulation data through the second model, wherein the second information is used for indicating the prediction result of the second model on the cell type ratio of each simulated cell measurement point.
In some embodiments, the second information may also be represented as a k×c-dimensional matrix, similar to the simulated tag information described above.
(2) Based on the second information and the simulated tag information, a second loss is calculated, the second loss being used to measure the performance of the second model in predicting the cell type fraction.
In some embodiments, MSE Loss between the second information and the simulated tag information is used as the second Loss.
(3) And adjusting parameters of the second model based on the second loss to obtain a trained second model.
In some embodiments, parameters of the second model are adjusted to obtain a trained second model with the objective of minimizing the second loss.
By training the second model through the method, the second model can learn the mapping relation between the expression characteristics and the cell type distribution proportion, so that more accurate label information can be generated for training the first model.
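The three steps above (first probability distribution, simulated tags, simulated data, then supervised training with an MSE loss) can be sketched end to end in plain Python. Several choices here are assumptions rather than the method of the application: the first probability distribution is a discretised half-normal with a hypothetical `sigma`; the number of cell types is sampled per row instead of assigning exactly $k \times p_i$ rows; the reference matrix is made up; and a single linear layer trained by gradient descent stands in for the DNN or one-dimensional CNN.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mse(a, b):
    """Mean squared error over all elements of two equally shaped matrices."""
    count = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count


def first_probability_distribution(c, sigma):
    """Discretised half-normal over 1..c: fewer cell types are more likely."""
    weights = [math.exp(-(i ** 2) / (2.0 * sigma ** 2)) for i in range(1, c + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_tag_matrix(k, c, probs, rng):
    """k x c matrix; each row has a sampled number of non-zero entries summing to 1."""
    tags = []
    for _ in range(k):
        n_types = rng.choices(range(1, c + 1), weights=probs)[0]
        chosen = rng.sample(range(c), n_types)
        raw = [rng.random() + 1e-9 for _ in chosen]  # strictly positive values
        scale = sum(raw)
        row = [0.0] * c
        for index, value in zip(chosen, raw):
            row[index] = value / scale
        tags.append(row)
    return tags


def train_linear(x, y, lr=0.1, steps=200):
    """Fit y ~ x @ w by gradient descent on the MSE (the second loss)."""
    n_in, n_out = len(x[0]), len(y[0])
    w = [[0.0] * n_out for _ in range(n_in)]
    scale = 2.0 / (len(x) * n_out)
    history = []
    for _ in range(steps):
        pred = matmul(x, w)
        history.append(mse(pred, y))
        err = [[p - t for p, t in zip(rp, rt)] for rp, rt in zip(pred, y)]
        grad = matmul([list(col) for col in zip(*x)], err)  # x^T @ err
        w = [[wv - lr * scale * g for wv, g in zip(rw, rg)] for rw, rg in zip(w, grad)]
    return w, history


rng = random.Random(0)
k, c = 20, 3
# Hypothetical c x n reference matrix (3 cell types, 4 expression features).
reference = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.1], [0.2, 0.1, 0.0, 1.0]]
probs = first_probability_distribution(c, sigma=1.5)
tags = simulate_tag_matrix(k, c, probs, rng)      # k x c simulated tag matrix
simulated = matmul(tags, reference)               # k x n simulation matrix
weights, history = train_linear(simulated, tags)  # history holds the second loss per step
```

Because the simulated data is produced directly from the simulated tags, the tags are exact targets for the simulated data, which is what makes the supervised step well posed.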
This training method for the second model places no requirement on the distribution of the reference data and the simulated data, is not tailored to spatial transcriptome data or any one type of spatial histology data, and therefore has strong universality. In addition, in this method the training samples of the second model can be obtained from the reference data alone, so the data redundancy is low.
It should also be noted that the reference data and the spatial histology data differ because of sequencing techniques or other reasons. For the spatial histology data, the tag information generated by the trained second model is therefore not the most ideal deconvolution result, but only a pseudo tag. It is thus still necessary to predict the cell type ratio more accurately with the first model in combination with the position information in the spatial histology data.
The usage flow of the first model is described below by way of embodiments. The content of the model training process and the content of the usage process correspond to each other and are mutually applicable; where one side is not described in detail, reference may be made to the description of the other side.
Referring to fig. 5, a flowchart of a method for predicting a cell type ratio according to another embodiment of the present application is shown. The subject of execution of the steps of the method may be the model-using device described above. In the following method embodiments, for convenience of description, only the execution subject of each step is described as "computer device". The method may include at least one of the following steps 510-530.
Step 510, obtaining spatial histology data to be analyzed, wherein the spatial histology data to be analyzed comprises characteristic information and position information of at least one cell measuring point, the cell measuring point comprises a plurality of cells, the characteristic information is used for indicating characteristic expression conditions of the cell measuring point, and the position information is used for indicating the position of the cell measuring point.
Step 520, obtaining distance information based on the position information of at least one cell measurement point, wherein the distance information is used for indicating the distance between each cell measurement point.
Step 530, obtaining first information based on the characteristic information and the distance information of at least one cell measurement point through a first model; the first information is used for indicating the prediction result of the cell type ratio of the first model to each cell measurement point.
In some embodiments, step 530 includes at least one of the following sub-steps 532-538.
In a substep 532, first encoded information is obtained by a first encoder of the first model based on the characteristic information of the at least one cell site.
In a substep 534, the second encoded information is obtained by the second encoder of the first model based on the first encoded information and the distance information.
In a substep 536, the first encoded information and the second encoded information are spliced to obtain third encoded information.
In a substep 538, the first information is obtained by the classifier of the first model based on the third encoded information.
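Sub-steps 532 to 538 can be traced with a toy forward pass. The sketch below is an assumption-laden illustration: each encoder is a single linear layer with ReLU, the second encoder is a one-layer graph convolution over a row-normalised adjacency matrix, the classifier is a linear layer followed by softmax, and all weights and inputs are made-up numbers rather than trained parameters.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def row_normalise(adj):
    """Divide each row by its sum so neighbour weights add up to 1."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def predict(features, adj, w_enc1, w_enc2, w_cls):
    h1 = relu(matmul(features, w_enc1))                        # sub-step 532
    h2 = relu(matmul(matmul(row_normalise(adj), h1), w_enc2))  # sub-step 534
    h3 = [a + b for a, b in zip(h1, h2)]                       # sub-step 536: splice per site
    return softmax_rows(matmul(h3, w_cls))                     # sub-step 538


# Toy inputs: 3 cell sites, 2 features, 3 cell types (all values hypothetical).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[1.0, 0.6, 0.0], [0.6, 1.0, 0.2], [0.0, 0.2, 1.0]]
w_enc1 = [[0.5, -0.2], [0.1, 0.4]]
w_enc2 = [[0.3, 0.1], [-0.1, 0.2]]
w_cls = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1], [0.1, 0.0, -0.2], [-0.3, 0.2, 0.1]]
first_information = predict(features, adj, w_enc1, w_enc2, w_cls)
```

Each row of `first_information` is non-negative and sums to 1, so it can be read as the predicted cell type ratio of one cell site; the two decoders are not needed at prediction time.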
In the method, the first model predicts the cell type duty ratio of the cell measuring point by encoding and classifying the characteristic information and the distance information of at least one cell measuring point, and fully utilizes the space histology data to be analyzed, so that a relatively accurate cell type deconvolution result can be obtained.
This cell type ratio prediction method places no requirement on the distribution of the spatial histology data, is not tailored to spatial transcriptome data or any one type of spatial histology data, and therefore has strong universality.
For spatial histology data that does not reach cell-level resolution (i.e., a cell site includes multiple cells), cell type deconvolution is the basis for most downstream analyses, just as identifying the type of each cell is the basis for subsequent analysis of spatial histology data at cell-level resolution (i.e., each cell is taken as a cell site). The cell type ratio of each cell site obtained by cell type deconvolution can be used for a series of subsequent analyses, such as cell-type-specific distribution, cell interactions, differential gene expression, and cell functions and pathways. At present, apart from some spatial transcriptome technologies, most spatial sequencing technologies cannot reach cell-level resolution, so this scheme has broad application prospects.
Referring to fig. 6, a schematic diagram of a complete cell type deconvolution scheme provided in one embodiment of the present application is shown. In this scheme, only the respective data/information is exemplarily illustrated in a matrix form. The scheme may include the following four phases.
Stage one, the training stage of the second model 40. In this stage, simulated data is generated based on the reference data, and the second model 40 is trained based on the simulated data.
Stage two, the prediction stage of the second model 40. In this phase, based on the characteristic information of at least one cell site in the spatial histology data, the tag information corresponding to the spatial histology data is obtained by the (trained) second model 40.
Stage three, the training stage of the first model 30. In this stage, the position information in the spatial histology data is first converted into distance information. The first model 30 is then trained based on the distance information, the characteristic information of the at least one cell site, and the tag information. In stage three, the first encoder A, the first decoder B, the second encoder C, the second decoder D, and the classifier E included in the first model 30 participate in the training process.
Stage four, the prediction stage of the first model 30. In this stage, the position information in the spatial histology data (to be analyzed) is first converted into distance information. Then, based on the distance information and the characteristic information of the at least one cell site, the first information is obtained through the first model; the first information indicates the first model's prediction of the cell type ratio of each cell site and is the output cell type deconvolution result. In stage four, the first encoder A, the second encoder C, and the classifier E included in the first model 30 participate in the prediction process.
The parts of the above four stages that are not described in detail are referred to the above embodiments, and are not described here again.
Next, the technical effects obtained by the technical solutions provided in the embodiments of the present application are described by some test examples.
Referring to fig. 7, a schematic diagram of spatial transcriptome data provided by one embodiment of the present application is shown; the data is primary motor cortex data of a mouse brain. Each dot in fig. 7 represents a cell. First, the cells in each grid cell of fig. 7 are treated as one cell site, and the gene expression of all cells in each cell site is combined as the gene expression of that cell site to generate simulated expression data. Since the simulated expression data is obtained by simulation, the specific ratio of each cell type, i.e., the true tag, is known. Table 1 compares this scheme with Cell2location, Tangram, and Stereoscope on samples of this dataset, using the average PCC (Pearson Correlation Coefficient) between the deconvolution results and the true tags as the evaluation index; this scheme outperforms the other three classical spatial transcriptome cell type deconvolution methods.
TABLE 1
Metric | This scheme | Cell2location | Tangram | Stereoscope
PCC | 0.913 | 0.872 | 0.864 | 0.876
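For reference, an average PCC between predicted and true cell type ratios can be computed as in the following plain-Python sketch. The application does not state whether the correlation is taken per cell site or per cell type; this sketch assumes one PCC per cell site (per row), averaged over sites, and the matrices shown are made-up values, not the data behind Table 1.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences.

    Undefined (division by zero) when either sequence is constant.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_pcc(pred, true):
    """Mean of the per-site PCC between predicted and true ratios (assumed scheme)."""
    return sum(pearson(p, t) for p, t in zip(pred, true)) / len(pred)


# Hypothetical 2 x 3 matrices (2 cell sites, 3 cell types).
true_tags = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
predicted = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
score = average_pcc(predicted, true_tags)
```

A value close to 1 means the predicted ratios rise and fall with the true ratios across cell types at each site.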
Referring to fig. 8, a schematic diagram of deconvolution results provided by one embodiment of the present application is shown.
The left panel in fig. 8 is spatial proteome data of a whole mouse brain, showing the distribution of a marker protein of the mouse cerebral cortex; the darker the color, the greater the expression level, and the distribution indicates the approximate area of the cerebral cortex. The right panel shows the distribution of the neuron (Neuron) cell type in the deconvolution results of this scheme, with darker colors indicating a larger proportion of neurons. According to knowledge in the biological field, neurons are distributed more densely in the cerebral cortex, so the deconvolution result of this scheme accords with the biological rule.
Referring to fig. 9, a schematic diagram of deconvolution results provided in another embodiment of the present application is shown.
The left image in fig. 9 is a tissue image corresponding to the spatial proteome human skin data, wherein the circled area is a skin hair follicle area; the right panel shows the distribution of the keratinocyte (Keratinocyte) cell type in the deconvolution result of this scheme, with darker colors indicating a larger proportion of keratinocytes. It can be seen that the keratinocytes are indeed concentrated mainly in the hair follicle area, which demonstrates the accuracy of the deconvolution results of this scheme.
It should be noted that, in the present application, when the related technology for acquiring biological information (such as human skin data) of a user is applied to a specific product or technology, the related data collecting, using and processing processes should comply with requirements of national laws and regulations, and before collecting biological information of the user, the information processing rules should be notified and independent consent (or legal basis) of the target object should be solicited, and biological information of the user is processed strictly in compliance with requirements of laws and regulations and personal information processing rules, and technical measures are taken to ensure safety of the related data.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 10, a block diagram of a device for predicting a cell type ratio according to an embodiment of the present application is shown. The device may be a computer device or may be provided in a computer device. The apparatus 1000 may include: a first acquisition module 1010, a second acquisition module 1020, a distance acquisition module 1030, and a training module 1040.
The first acquisition module 1010 is configured to obtain spatial histology data, where the spatial histology data includes feature information and position information of at least one cell measurement point, the cell measurement point includes a plurality of cells, the feature information is used to indicate a feature expression condition of the cell measurement point, and the position information is used to indicate a position of the cell measurement point.
The second acquisition module 1020 is configured to obtain tag information corresponding to the spatial histology data, where the tag information is used to indicate a cell type ratio of each cell measurement point.
The distance acquisition module 1030 is configured to obtain distance information based on the position information of the at least one cell measurement point, where the distance information is used to indicate a distance between the cell measurement points.
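As a minimal illustrative sketch of how distance information could be derived from position information, the following Python code computes a pairwise distance matrix between cell measurement points. The Euclidean metric, the function name `pairwise_distances`, and the sample coordinates are assumptions made for illustration only; the embodiment merely requires that the distance information indicate the distance between the cell measurement points.

```python
import numpy as np


def pairwise_distances(positions):
    """Return the (n, n) matrix of Euclidean distances between measurement points.

    positions: array of shape (n, d), one coordinate row per cell measurement point.
    The Euclidean metric is an assumption; another distance could be substituted.
    """
    positions = np.asarray(positions, dtype=float)
    # Broadcasting builds every coordinate difference: shape (n, n, d).
    diff = positions[:, None, :] - positions[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical 2-D coordinates of three cell measurement points.
positions = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
distance_info = pairwise_distances(positions)
print(distance_info)
```

The resulting matrix is symmetric with a zero diagonal, so entry (i, j) can be read directly as the distance between measurement points i and j.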
The training module 1040 is configured to train the first model based on the feature information, the tag information, and the distance information of the at least one cell measurement point, so as to obtain a trained first model, where the trained first model is used to predict a cell type duty ratio of the cell measurement point.
In some embodiments, the first model includes a first sub-model including a first encoder and a first decoder, a second sub-model including a second encoder and a second decoder, and a classifier. The training module 1040 includes a prediction sub-module and an adjustment sub-module.
The prediction sub-module is configured to: obtain, through the first encoder, first coding information based on the characteristic information of the at least one cell measurement point; obtain, through the second encoder, second coding information based on the first coding information and the distance information; concatenate (splice) the first coding information and the second coding information to obtain third coding information; obtain, through the first decoder, reconstructed characteristic information of the at least one cell measurement point based on the third coding information; obtain, through the second decoder, reconstructed distance information based on the third coding information; and obtain, through the classifier, first information based on the third coding information, where the first information is used to indicate a prediction result of the first model for the cell type ratio of each cell measurement point.
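The data flow through the two encoders, the two decoders, and the classifier can be sketched as follows. This is a hedged illustration, not the network of the embodiment: the single-layer encoders, the Gaussian kernel that turns distances into neighbor weights, the way the second decoder rebuilds distances, and all names (`forward`, `params`, `bandwidth`) are assumptions introduced here.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, so each row is a set of proportions summing to one."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def forward(features, distances, params):
    # First encoder: characteristic information -> first coding information.
    z1 = np.tanh(features @ params["enc1"])
    # Second encoder: combines the first codes of nearby points, using weights
    # derived from the distance information (assumed Gaussian kernel).
    weights = np.exp(-(distances / params["bandwidth"]) ** 2)
    weights = weights / weights.sum(axis=1, keepdims=True)
    z2 = np.tanh(weights @ z1 @ params["enc2"])
    # Splicing: concatenate first and second codes into the third coding information.
    z3 = np.concatenate([z1, z2], axis=1)
    # First decoder: reconstruct the characteristic information.
    recon_features = z3 @ params["dec1"]
    # Second decoder: reconstruct distances from a low-dimensional projection (assumed form).
    h = z3 @ params["dec2"]
    recon_distances = np.linalg.norm(h[:, None, :] - h[None, :, :], axis=-1)
    # Classifier: predicted cell type ratio of each measurement point (first information).
    first_info = softmax(z3 @ params["clf"])
    return recon_features, recon_distances, first_info


rng = np.random.default_rng(0)
n_points, n_features, n_types, code_dim = 5, 8, 3, 4
params = {
    "enc1": rng.normal(size=(n_features, code_dim)),
    "enc2": rng.normal(size=(code_dim, code_dim)),
    "dec1": rng.normal(size=(2 * code_dim, n_features)),
    "dec2": rng.normal(size=(2 * code_dim, 2)),
    "clf": rng.normal(size=(2 * code_dim, n_types)),
    "bandwidth": 1.0,
}
features = rng.random((n_points, n_features))
positions = rng.random((n_points, 2))
distances = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
recon_features, recon_distances, first_info = forward(features, distances, params)
print(first_info.sum(axis=1))
```

The softmax output keeps every row of the first information non-negative and summing to one, which matches its reading as a cell type ratio.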
The adjustment sub-module is configured to adjust the parameters of the first model based on the characteristic information of the at least one cell measurement point, the reconstructed characteristic information of the at least one cell measurement point, the distance information, the reconstructed distance information, the first information, and the tag information, to obtain the adjusted first model.
In some embodiments, the adjustment sub-module is configured to: calculate a first reconstruction loss based on the characteristic information of the at least one cell measurement point and the reconstructed characteristic information of the at least one cell measurement point, the first reconstruction loss being used to characterize the difference between the two; calculate a second reconstruction loss based on the distance information and the reconstructed distance information, the second reconstruction loss being used to characterize the difference between the distance information and the reconstructed distance information; calculate a classification loss based on the first information and the tag information, the classification loss being used to characterize the difference between the first information and the tag information; perform weighted summation on the first reconstruction loss, the second reconstruction loss, and the classification loss to obtain a first loss, the first loss being used to measure the performance of the first model in predicting the cell type ratio; and adjust the parameters of the first model according to the first loss to obtain the trained first model.
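The weighted summation of the three losses can be illustrated with the short sketch below. The concrete loss forms (mean squared error for both reconstruction losses, cross-entropy for the classification loss), the weight values, and the function name `first_loss` are assumptions chosen for illustration; the embodiment only specifies that the three losses are weighted and summed.

```python
import numpy as np


def first_loss(features, recon_features, distances, recon_distances,
               first_info, tag_info, weights=(1.0, 1.0, 1.0), eps=1e-12):
    """Weighted sum of two reconstruction losses and one classification loss."""
    # First reconstruction loss: feature information vs. its reconstruction (assumed MSE).
    rec1 = np.mean((features - recon_features) ** 2)
    # Second reconstruction loss: distance information vs. its reconstruction (assumed MSE).
    rec2 = np.mean((distances - recon_distances) ** 2)
    # Classification loss: predicted ratios vs. tag ratios (assumed cross-entropy).
    cls = -np.mean(np.sum(tag_info * np.log(first_info + eps), axis=1))
    w1, w2, w3 = weights
    return w1 * rec1 + w2 * rec2 + w3 * cls, (rec1, rec2, cls)


# Tiny hand-checkable example: perfect reconstructions, uniform prediction.
features = np.array([[1.0, 2.0]])
distances = np.array([[0.0]])
first_info = np.array([[0.5, 0.5]])
tag_info = np.array([[1.0, 0.0]])
total, (rec1, rec2, cls) = first_loss(
    features, features, distances, distances, first_info, tag_info,
    weights=(1.0, 1.0, 2.0),
)
print(total, rec1, rec2, cls)
```

In this example both reconstruction losses are zero, so the first loss reduces to twice the cross-entropy of a uniform two-class prediction, that is, 2·ln 2.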
In some embodiments, the second acquisition module 1020 includes a tag generation sub-module.
The tag generation sub-module is configured to obtain, through the trained second model, the tag information based on the characteristic information of the at least one cell measurement point.
In some embodiments, the second acquisition module 1020 further includes an acquisition sub-module, a simulation sub-module, and a training sub-module.
The acquisition sub-module is configured to acquire reference data corresponding to the spatial histology data, where the reference data is used to indicate feature expression conditions respectively corresponding to a plurality of cell types, and the plurality of cell types include the cell types respectively corresponding to the plurality of cells.
The simulation sub-module is configured to generate simulation data based on the reference data, where the simulation data includes characteristic information of at least one simulated cell measurement point.
The training sub-module is configured to train the second model based on the simulation data to obtain the trained second model.
In some embodiments, the simulation sub-module is configured to: generate a first probability distribution based on the number of cell types included in the plurality of cell types, where the first probability distribution is used to indicate the probability distribution of the number of cell types corresponding to each simulated cell measurement point; generate simulated tag information based on the first probability distribution, where the simulated tag information is used to indicate the cell type ratio of each simulated cell measurement point; and generate the simulation data based on the reference data and the simulated tag information.
In some embodiments, in the first probability distribution, the smaller a number of cell types is, the higher the probability assigned to that number.
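One possible way to realize the simulation step is sketched below. The inverse-count form of the first probability distribution, the normalized Gamma draws used as ratios, the noise-free mixing of the reference profiles, and the names `count_distribution` and `simulate` are all assumptions made for illustration; the embodiment only requires that smaller numbers of cell types receive higher probability.

```python
import numpy as np


def count_distribution(n_types):
    """First probability distribution over the number of cell types (1..n_types).

    Assumed form p(k) proportional to 1/k, so smaller counts are more probable.
    """
    counts = np.arange(1, n_types + 1)
    p = 1.0 / counts
    return p / p.sum()


def simulate(reference, n_points, rng):
    """Generate simulated tag information and simulated feature information.

    reference: (n_types, n_features) feature expression per cell type.
    """
    n_types = reference.shape[0]
    p_counts = count_distribution(n_types)
    tags = np.zeros((n_points, n_types))
    for i in range(n_points):
        # Draw how many cell types this simulated measurement point contains.
        k = rng.choice(np.arange(1, n_types + 1), p=p_counts)
        chosen = rng.choice(n_types, size=k, replace=False)
        # Random positive weights normalized to sum to one give the ratios.
        raw = rng.gamma(1.0, size=k)
        tags[i, chosen] = raw / raw.sum()
    # Simulated features: ratio-weighted mixture of the reference profiles.
    sim_features = tags @ reference
    return sim_features, tags


# Hypothetical reference data: 4 cell types, 3 features.
reference = np.array([[10.0, 0.0, 1.0],
                      [0.0, 8.0, 2.0],
                      [1.0, 1.0, 9.0],
                      [5.0, 5.0, 0.0]])
rng = np.random.default_rng(0)
p_counts = count_distribution(reference.shape[0])
sim_features, sim_tags = simulate(reference, 6, rng)
print(p_counts)
```

Because the simulated tag information is generated first and the features are derived from it, every simulated measurement point comes with a known cell type ratio that can serve as a training target.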
In some embodiments, the training sub-module is configured to obtain, based on the simulation data by the second model, second information, where the second information is used to indicate a predicted result of the second model on a cell type ratio of each of the simulated cell measurement points; calculating a second loss based on the second information and the simulated tag information, wherein the second loss is used for measuring the performance of the second model for predicting the cell type ratio; and adjusting parameters of the second model based on the second loss to obtain the trained second model.
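The training loop of the second model can be sketched as follows. The linear softmax model, the cross-entropy form of the second loss, plain gradient descent, and the names `train_second_model`, `lr`, and `steps` are assumptions for illustration; the embodiment does not fix the structure of the second model or the loss function.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(predicted, tags, eps=1e-12):
    """Second loss (assumed form): cross-entropy between predicted and simulated ratios."""
    return -np.mean(np.sum(tags * np.log(predicted + eps), axis=1))


def train_second_model(sim_features, sim_tags, lr=0.5, steps=200):
    n, n_features = sim_features.shape
    weights = np.zeros((n_features, sim_tags.shape[1]))
    losses = []
    for _ in range(steps):
        # Second information: predicted cell type ratio of each simulated point.
        second_info = softmax(sim_features @ weights)
        losses.append(cross_entropy(second_info, sim_tags))
        # Gradient of the cross-entropy with respect to the weights.
        grad = sim_features.T @ (second_info - sim_tags) / n
        # Adjust the parameters of the second model based on the second loss.
        weights -= lr * grad
    return weights, losses


# Hypothetical simulated data: 3 cell types, 3 features, 50 simulated points.
rng = np.random.default_rng(0)
reference = np.array([[1.0, 0.0, 0.1],
                      [0.0, 1.0, 0.2],
                      [0.1, 0.1, 1.0]])
raw = rng.gamma(1.0, size=(50, 3))
sim_tags = raw / raw.sum(axis=1, keepdims=True)
sim_features = sim_tags @ reference
weights, losses = train_second_model(sim_features, sim_tags)
second_info = softmax(sim_features @ weights)
print(losses[0], losses[-1])
```

After training, applying the same model to the characteristic information of real cell measurement points would yield the tag information used to train the first model.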
Referring to fig. 11, a block diagram of a device for predicting a cell type ratio according to another embodiment of the present application is shown. The device may be a computer device or may be provided in a computer device. The apparatus 1100 may include: an acquisition module 1110, a distance acquisition module 1120, and a prediction module 1130.
The acquisition module 1110 is configured to obtain spatial histology data to be analyzed, where the spatial histology data to be analyzed includes feature information and position information of at least one cell measurement point, the cell measurement point includes a plurality of cells, the feature information is used to indicate a feature expression condition of the cell measurement point, and the position information is used to indicate a position of the cell measurement point.
The distance acquisition module 1120 is configured to obtain distance information based on the position information of the at least one cell measurement point, where the distance information is used to indicate a distance between the cell measurement points.
The prediction module 1130 is configured to obtain, through a first model, first information based on the characteristic information of the at least one cell measurement point and the distance information, where the first information is used to indicate a prediction result of the first model for the cell type ratio of each cell measurement point.
In some embodiments, the prediction module 1130 is configured to: obtain, through a first encoder of the first model, first coding information based on the characteristic information of the at least one cell measurement point; obtain, through a second encoder of the first model, second coding information based on the first coding information and the distance information; concatenate (splice) the first coding information and the second coding information to obtain third coding information; and obtain, through a classifier of the first model, the first information based on the third coding information.
It should be noted that the apparatus provided in the foregoing embodiments is described only by way of example with the above division into functional modules. In practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process of the apparatus, refer to the method embodiments, which is not repeated here.
Referring to fig. 12, a block diagram of a computer device according to an embodiment of the present application is schematically shown.
In general, the computer device 1200 includes: a processor 1201 and a memory 1202.
Processor 1201 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1201 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), is a processor for processing data in an awake state; the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1201 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1201 may also include an AI processor for processing computing operations related to machine learning.
Memory 1202 may include one or more computer-readable storage media, which may be tangible and non-transitory. Memory 1202 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1202 stores a computer program that is loaded and executed by processor 1201 to implement the above-described method for predicting a cell type ratio.
Those skilled in the art will appreciate that the architecture shown in fig. 12 does not limit the computer device 1200, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
In some embodiments, a computer readable storage medium having a computer program stored therein is also provided, the computer program being loaded and executed by a processor to implement the above-described method of predicting a cell type fraction.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State Drive), an optical disc, or the like. The random access memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory), among others.
In some embodiments, there is also provided a computer program product comprising a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the above-described method of predicting a cell type fraction.
It should be noted that all user data (including biological information of the user) collected in the present application is processed strictly according to the requirements of relevant national laws and regulations; informed consent or separate consent of the personal information subject is obtained, and subsequent use and processing of the data are carried out within the scope of laws and regulations and of the authorization given by the personal information subject.
It should be understood that "a plurality" herein means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it. In addition, the step numbers described herein merely show one possible execution order of the steps; in some other embodiments, the steps may be executed out of numerical order, for example two differently numbered steps may be executed simultaneously or in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing describes only exemplary embodiments of the present application and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.
Claims (15)
1. A method of predicting a cell type fraction, the method comprising:
acquiring space histology data, wherein the space histology data comprises characteristic information and position information of at least one cell measuring point, the cell measuring point comprises a plurality of cells, the characteristic information is used for indicating characteristic expression conditions of the cell measuring point, and the position information is used for indicating the position of the cell measuring point;
acquiring label information corresponding to the space histology data, wherein the label information is used for indicating the cell type ratio of each cell measuring point;
obtaining distance information based on the position information of the at least one cell measurement point, wherein the distance information is used for indicating the distance between the cell measurement points;
training the first model based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model, wherein the trained first model is used for predicting the cell type duty ratio of the cell measurement point.
2. The method of claim 1, wherein the first model comprises a first sub-model, a second sub-model, and a classifier, the first sub-model comprising a first encoder and a first decoder, the second sub-model comprising a second encoder and a second decoder;
wherein the training the first model based on the characteristic information, the label information and the distance information of the at least one cell measurement point to obtain a trained first model comprises:
obtaining first coding information based on the characteristic information of the at least one cell measurement point through the first encoder;
obtaining second coding information based on the first coding information and the distance information through the second encoder;
splicing the first coding information and the second coding information to obtain third coding information;
obtaining the characteristic information of at least one reconstructed cell measurement point based on the third coding information through the first decoder;
obtaining reconstructed distance information based on the third coding information through the second decoder;
obtaining first information based on the third coding information through the classifier, wherein the first information is used for indicating a prediction result of the first model on the cell type ratio of each cell measurement point;
and adjusting parameters of the first model based on the characteristic information of the at least one cell measurement point, the characteristic information of the reconstructed at least one cell measurement point, the distance information, the reconstructed distance information, the first information and the label information to obtain the adjusted first model.
3. The method of claim 2, wherein adjusting parameters of the first model based on the characteristic information of the at least one cell measurement point, the characteristic information of the reconstructed at least one cell measurement point, the distance information, the reconstructed distance information, the first information, and the label information to obtain the adjusted first model comprises:
calculating a first reconstruction loss based on the characteristic information of the at least one cell measurement point and the characteristic information of the reconstructed at least one cell measurement point, wherein the first reconstruction loss is used for representing the difference between the characteristic information of the at least one cell measurement point and the characteristic information of the reconstructed at least one cell measurement point;
calculating a second reconstruction loss based on the distance information and the reconstructed distance information, wherein the second reconstruction loss is used for representing the difference between the distance information and the reconstructed distance information;
calculating a classification loss based on the first information and the label information, wherein the classification loss is used for representing the difference between the first information and the label information;
performing weighted summation on the first reconstruction loss, the second reconstruction loss and the classification loss to obtain a first loss, wherein the first loss is used for measuring the performance of the first model for predicting the cell type ratio;
and adjusting parameters of the first model according to the first loss to obtain the trained first model.
4. The method according to claim 1, wherein the acquiring label information corresponding to the space histology data comprises:
and obtaining the label information based on the characteristic information of the at least one cell measurement point through the trained second model.
5. The method according to claim 4, wherein the method further comprises:
acquiring reference data corresponding to the space histology data, wherein the reference data is used for indicating characteristic expression conditions respectively corresponding to a plurality of cell types, and the plurality of cell types comprise cell types respectively corresponding to the plurality of cells;
generating simulated data based on the reference data, the simulated data including characteristic information of at least one simulated cell site;
and training the second model based on the simulated data to obtain the trained second model.
6. The method of claim 5, wherein generating simulated data based on the reference data comprises:
generating a first probability distribution based on the number of cell types included in the plurality of cell types, wherein the first probability distribution is used for indicating probability distribution of the number of cell types respectively corresponding to each simulated cell measurement point;
generating simulated tag information based on the first probability distribution, wherein the simulated tag information is used for indicating the cell type ratio of each simulated cell measurement point;
the simulated data is generated based on the reference data and the simulated tag information.
7. The method of claim 6, wherein in the first probability distribution, the smaller a number of cell types is, the higher the probability corresponding to that number.
8. The method of claim 6, wherein training the second model based on the simulated data to obtain the trained second model comprises:
obtaining second information based on the simulation data through the second model, wherein the second information is used for indicating a predicted result of the second model on the cell type ratio of each simulated cell measurement point;
calculating a second loss based on the second information and the simulated tag information, wherein the second loss is used for measuring the performance of the second model for predicting the cell type ratio;
and adjusting parameters of the second model based on the second loss to obtain the trained second model.
9. A method of predicting a cell type fraction, the method comprising:
acquiring space histology data to be analyzed, wherein the space histology data to be analyzed comprises characteristic information and position information of at least one cell measuring point, the cell measuring point comprises a plurality of cells, the characteristic information is used for indicating characteristic expression conditions of the cell measuring point, and the position information is used for indicating the position of the cell measuring point;
obtaining distance information based on the position information of the at least one cell measurement point, wherein the distance information is used for indicating the distance between the cell measurement points;
obtaining first information based on the characteristic information of the at least one cell measurement point and the distance information through a first model; the first information is used for indicating a predicted result of the first model on the cell type ratio of each cell measurement point.
10. The method of claim 9, wherein the obtaining, by the first model, first information based on the characteristic information of the at least one cell measurement point and the distance information comprises:
obtaining first coding information based on the characteristic information of the at least one cell measurement point through a first encoder of the first model;
obtaining second coding information based on the first coding information and the distance information through a second encoder of the first model;
splicing the first coding information and the second coding information to obtain third coding information;
and obtaining the first information based on the third coding information through a classifier of the first model.
11. A device for predicting the cell type fraction, the device comprising:
the first acquisition module is used for acquiring space histology data, wherein the space histology data comprises characteristic information and position information of at least one cell measuring point, the cell measuring point comprises a plurality of cells, the characteristic information is used for indicating characteristic expression conditions of the cell measuring point, and the position information is used for indicating the position of the cell measuring point;
the second acquisition module is used for acquiring label information corresponding to the space histology data, wherein the label information is used for indicating the cell type ratio of each cell measurement point;
the distance acquisition module is used for acquiring distance information based on the position information of the at least one cell measuring point, and the distance information is used for indicating the distance between the cell measuring points;
the training module is used for training the first model based on the characteristic information, the label information and the distance information of the at least one cell measuring point to obtain a trained first model, and the trained first model is used for predicting the cell type duty ratio of the cell measuring point.
12. A device for predicting the cell type fraction, the device comprising:
the acquisition module is used for acquiring space histology data to be analyzed, wherein the space histology data to be analyzed comprises characteristic information and position information of at least one cell measuring point, the cell measuring point comprises a plurality of cells, the characteristic information is used for indicating characteristic expression conditions of the cell measuring point, and the position information is used for indicating the position of the cell measuring point;
the distance acquisition module is used for acquiring distance information based on the position information of the at least one cell measuring point, and the distance information is used for indicating the distance between the cell measuring points;
the prediction module is used for obtaining first information based on the characteristic information of the at least one cell measurement point and the distance information through a first model; the first information is used for indicating a predicted result of the first model on the cell type ratio of each cell measurement point.
13. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any one of claims 1 to 8 or to implement the method of any one of claims 9 to 10.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, which is loaded and executed by a processor to implement the method of any of claims 1 to 8 or to implement the method of any of claims 9 to 10.
15. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the method according to any one of claims 1 to 8 or to implement the method according to any one of claims 9 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311279355.4A CN117292750A (en) | 2023-09-28 | 2023-09-28 | Cell type duty ratio prediction method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117292750A (en) | 2023-12-26 |
Family ID: 89251520
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311279355.4A Pending CN117292750A (en) | 2023-09-28 | 2023-09-28 | Cell type duty ratio prediction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117292750A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854600A (en) * | 2024-03-07 | 2024-04-09 | Peking University | Cell identification method, device, equipment and storage medium based on multiple sets of chemical data |
CN117854600B (en) * | 2024-03-07 | 2024-05-21 | Peking University | Cell identification method, device, equipment and storage medium based on multiple sets of chemical data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||