CN114429787B - Omics data processing method and device, electronic device and storage medium - Google Patents

Omics data processing method and device, electronic device and storage medium

Publication number
CN114429787B
Authority
CN
China
Prior art keywords
omics data
data
omics
training
enhanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111653938.XA
Other languages
Chinese (zh)
Other versions
CN114429787A (en)
Inventor
郜杰
赵国栋
方晓敏
何径舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111653938.XA
Publication of CN114429787A
Application granted
Publication of CN114429787B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The disclosure provides an omics data processing method and device, an electronic device and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning and intelligent medical treatment. The specific implementation scheme is as follows: obtaining omics data; encoding the omics data according to the expression levels of a plurality of genes in the omics data to obtain features of the omics data; and executing a target task for the omics data according to the features of the omics data. In this way, an accurate low-dimensional representation of the omics data can be obtained, which improves the accuracy of downstream target tasks such as cancer-typing classification tasks and individual survival analysis tasks.

Description

Omics data processing method and device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning and intelligent medical technologies, and more specifically to an omics data processing method, apparatus, electronic device, and storage medium.
Background
With the development of high-throughput sequencing technology, omics data are used more and more in modern medicine. Because omics data can comprehensively depict the health condition of a patient, they are widely applied in disease diagnosis, medication and other areas.
However, the high dimensionality, high noise, batch effects and other characteristics of omics data make practical application difficult. Obtaining an accurate low-dimensional representation of omics data is therefore of great significance for improving the accuracy of downstream applications such as cancer-typing classification and individual survival analysis.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device and storage medium for omics data processing.
According to an aspect of the present disclosure, there is provided an omics data processing method, the method including: obtaining omics data; encoding the omics data according to the expression levels of a plurality of genes in the omics data to obtain features of the omics data; and executing a target task for the omics data according to the features of the omics data.
According to another aspect of the present disclosure, there is provided a model training method for omics data processing, the method comprising: acquiring training omics data; adjusting the expression levels of a plurality of genes in the training omics data by adopting at least two data enhancement strategies to obtain at least two enhanced omics data; encoding, by a deep neural network model, according to the expression levels of a plurality of genes in the at least two enhanced omics data to obtain corresponding features; and adjusting model parameters of the deep neural network model based on a difference between the features of the at least two enhanced omics data to minimize the difference.
According to another aspect of the present disclosure, there is provided an omics data processing apparatus including: a first acquisition module for acquiring omics data; a first encoding module for encoding the omics data according to the expression levels of a plurality of genes in the omics data to obtain features of the omics data; and a processing module for executing a target task for the omics data according to the features of the omics data.
According to another aspect of the present disclosure, there is provided a model training apparatus for omics data processing, comprising: a second acquisition module for acquiring training omics data; a first adjusting module for adjusting the expression levels of a plurality of genes in the training omics data by adopting at least two data enhancement strategies to obtain at least two enhanced omics data; a second encoding module for encoding, by a deep neural network model, according to the expression levels of a plurality of genes in the at least two enhanced omics data to obtain corresponding features; and a second adjusting module for adjusting model parameters of the deep neural network model based on a difference between the features of the at least two enhanced omics data to minimize the difference.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the omics data processing method of the present disclosure or to perform the model training method for omics data processing of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform an omics data processing method disclosed in embodiments of the present disclosure or a model training method for omics data processing disclosed in embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the omics data processing method of the present disclosure, or implements the steps of the model training method for omics data processing of the present disclosure.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1 is a schematic flow diagram of an omics data processing method according to a first embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of an omics data processing method according to a second embodiment of the present disclosure;
FIG. 3 is a schematic flow diagram of a model training method for omics data processing according to a third embodiment of the present disclosure;
FIG. 4 is an architectural schematic diagram of a model training method for omics data processing according to a third embodiment of the present disclosure;
FIG. 5 is a schematic flow diagram of a model training method for omics data processing according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an omics data processing device according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a model training apparatus for omics data processing according to a sixth embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing an omics data processing method or a model training method for omics data processing of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The disclosure provides an omics data processing method, a model training method for omics data processing, a device, an electronic device, a non-transitory computer readable storage medium and a computer program product, and relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and intelligent medical treatment.
Artificial intelligence is the discipline that studies how to enable computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies and the like.
Intelligent medical treatment builds a regional medical information platform for health records and uses Internet of Things technology to enable interaction among patients, medical staff, medical institutions and medical equipment, gradually achieving informatization. In the near future, the medical industry will incorporate more advanced technologies such as artificial intelligence and sensing technology, so that medical services become intelligent in the true sense and the development of the medical industry is promoted.
At present, the high dimensionality, high noise, batch effects and other characteristics of omics data make practical application difficult, and obtaining an accurate low-dimensional representation of omics data is of great significance for improving the accuracy of downstream applications such as cancer-typing classification and survival analysis.
The method of the present disclosure encodes the omics data according to the expression levels of a plurality of genes in the omics data to obtain features of the omics data, and then executes a target task for the omics data according to those features. An accurate low-dimensional representation of the omics data can thus be obtained, which improves the accuracy of downstream target tasks based on the omics data, such as cancer typing and survival analysis of patients.
An omics data processing method, a model training method for omics data processing, an apparatus, an electronic device, a non-transitory computer-readable storage medium, and a computer program product of the embodiments of the present disclosure are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flow diagram of an omics data processing method according to a first embodiment of the present disclosure. It should be noted that the execution subject of the omics data processing method of this embodiment is an omics data processing device. The omics data processing device may be implemented by software and/or hardware and may be configured in an electronic device. The electronic device may include, but is not limited to, a terminal device, a server and the like, and this embodiment does not specifically limit the electronic device.
As shown in fig. 1, the omics data processing method may include:
Step 101, obtaining omics data.
The omics data are the omics data to be processed, and include the expression levels of a plurality of genes.
Step 102, encoding the omics data according to the expression levels of a plurality of genes in the omics data to obtain features of the omics data.
Gene expression refers to the process by which a cell, during its life cycle, converts the genetic information stored in a DNA (deoxyribonucleic acid) sequence into biologically active protein molecules through transcription and translation. The expression level of a gene is a quantitative value of gene expression.
In the embodiment of the present disclosure, encoding the omics data means performing feature extraction on the omics data. By performing feature extraction according to the expression levels of a plurality of genes in the omics data, features can be fully extracted from the omics data, so that accurate features of the omics data are obtained. The features of the omics data are a low-dimensional omics characterization of the omics data, that is, a low-dimensional representation.
The feature extraction of the omics data may be implemented by a deep neural network model. For example, a trained MLP (Multi-Layer Perceptron) model may perform feature extraction on the omics data according to the expression levels of a plurality of genes in the omics data. Alternatively, the feature extraction may be implemented in other ways, which is not limited in the present disclosure.
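As an illustration of the MLP-based encoding described above, the following is a minimal sketch in plain Python. The layer sizes, the ReLU activation, the random (untrained) weights, and the helper names `init_layer`, `dense` and `encode` are all assumptions made for this example; the disclosure does not specify them, and in practice the weights would come from the training method described in the later embodiments.

```python
import random


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: weights @ inputs + biases."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Hypothetical initialiser: small random weights, zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(expression, layers):
    """Map a vector of gene expression levels to a low-dimensional feature."""
    hidden = list(expression)
    for index, (weights, biases) in enumerate(layers):
        hidden = dense(hidden, weights, biases)
        if index < len(layers) - 1:  # no activation on the output layer
            hidden = relu(hidden)
    return hidden


rng = random.Random(0)
n_genes, hidden_dim, feature_dim = 8, 6, 3  # illustrative sizes only
layers = [init_layer(n_genes, hidden_dim, rng),
          init_layer(hidden_dim, feature_dim, rng)]

# Made-up expression levels for eight genes of one sample.
expression = [2.5, 0.0, 7.1, 1.2, 0.4, 3.3, 0.0, 5.8]
features = encode(expression, layers)
```

Here `features` has three entries for an eight-gene input, which is the sense in which the encoder yields a low-dimensional representation.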
Step 103, executing a target task for the omics data according to the features of the omics data.
The target task may be any downstream task such as an individual survival analysis task, a disease diagnosis task, a medication recommendation task, a cancer classification task, and the like, which is not limited in the present disclosure.
In the embodiment of the disclosure, after the characteristics of the omics data are obtained, the target task of the omics data can be executed according to the characteristics of the omics data. Because accurate low-dimensional representation of omics data can be obtained in the method, the expression capability of the low-dimensional representation of the omics data is improved, so that the state of a patient can be better described, and further, a target task of the omics data is executed based on the obtained low-dimensional representation of the omics data, and the accuracy of the target task can be improved.
The omics data processing method of the embodiment of the disclosure can obtain the characteristics of the omics data by obtaining the omics data and encoding the omics data according to the expression quantities of a plurality of genes in the omics data, and then execute the target tasks of the omics data according to the characteristics of the omics data, so that the accurate low-dimensional representation of the omics data can be obtained, and the accuracy of downstream target tasks such as classification tasks of cancer typing and individual survival analysis tasks is further improved.
The process of encoding the omics data according to the expression levels of a plurality of genes in the omics data to obtain the characteristics of the omics data in the omics data processing method provided by the present disclosure is further described below with reference to fig. 2.
FIG. 2 is a schematic flow diagram of an omics data processing method according to a second embodiment of the present disclosure. As shown in fig. 2, the omics data processing method may include the following steps:
Step 201, obtaining omics data.
The omics data are the omics data to be processed, and include the expression levels of a plurality of genes.
Step 202, generating an input vector according to the expression levels of a plurality of genes in the omics data.
Step 203, inputting the input vector into the deep neural network model for encoding to obtain features of the omics data.
The deep neural network model may be any model capable of implementing feature extraction, such as an MLP model, which is not limited by this disclosure.
In the embodiment of the present disclosure, a deep neural network model may be trained in advance by contrastive learning. The input of the deep neural network model is an input vector generated according to the expression levels of a plurality of genes in omics data, and its output is the features of the omics data. Thus, after the omics data are obtained, an input vector may be generated according to the expression levels of the plurality of genes in the omics data and input into the trained deep neural network model, which performs feature extraction on the omics data to obtain the features of the omics data.
For the training process of the deep neural network model, reference may be made to the following embodiments, which are not described herein again.
The deep neural network model is adopted to fully extract the characteristics from the omics data to obtain the accurate characteristics of the omics data.
In the embodiment of the present disclosure, the input vector includes one or more dimensions, and each dimension of the input vector corresponds to one or more genes in the omics data. When the input vector includes a single dimension, the value of that dimension can be determined according to the expression levels of the plurality of genes in the omics data. When the input vector includes a plurality of dimensions, the value of any dimension can be determined according to the expression levels of the one or more genes corresponding to that dimension.
That is, in the embodiment of the present disclosure, when the input vector is generated according to the expression amounts of the plurality of genes in the omics data, the expression amounts of the plurality of genes in the omics data may be used as the values of the corresponding dimensions in the input vector.
Therefore, the expression quantities of a plurality of genes in the omics data can be contained in the generated input vector, and further, the deep neural network model can fully extract the characteristics from the omics data according to the expression quantities of the plurality of genes contained in the input vector.
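The mapping from gene expression levels to input-vector dimensions can be sketched as follows. This is only an illustration: the gene symbols and values are made up, the function name `build_input_vector` is hypothetical, and averaging the genes assigned to one dimension and filling a missing gene with 0.0 are assumptions, since the disclosure only states that the value of a dimension is determined according to the expression levels of its corresponding genes.

```python
def build_input_vector(expression_by_gene, dimension_genes):
    """Build an input vector with one value per dimension.

    expression_by_gene: dict mapping a gene identifier to its expression level.
    dimension_genes: list in which entry i lists the genes assigned to
        dimension i of the input vector.
    """
    vector = []
    for genes in dimension_genes:
        # Assumption: a gene absent from the sample contributes 0.0.
        values = [expression_by_gene.get(gene, 0.0) for gene in genes]
        # Assumption: several genes sharing a dimension are averaged.
        vector.append(sum(values) / len(values))
    return vector


# Illustrative sample; identifiers and values are placeholders.
sample = {"TP53": 4.0, "BRCA1": 2.0, "EGFR": 6.0}

# Case 1: each gene has its own dimension, so the expression level
# is used directly as the value of that dimension.
vector_a = build_input_vector(sample, [["TP53"], ["BRCA1"], ["EGFR"]])

# Case 2: two genes share the first dimension.
vector_b = build_input_vector(sample, [["TP53", "BRCA1"], ["EGFR"]])
```

Keeping the gene-to-dimension assignment fixed across samples matters, because the encoder can only learn from a dimension if it always refers to the same gene or gene group.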
Step 204, executing a target task for the omics data according to the features of the omics data.
The target task may be any downstream task such as an individual survival analysis task, a disease diagnosis task, a medication recommendation task, a cancer classification task, and the like, which is not limited in the present disclosure.
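As one possible illustration of a downstream classification task executed on the encoded features, the sketch below uses a nearest-centroid classifier. The classifier choice, the label names and the two-dimensional feature values are all assumptions made for the example; the disclosure leaves the form of the target task open.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroids(features, labels):
    """Compute one centroid per label from labelled feature vectors."""
    grouped = {}
    for feature, label in zip(features, labels):
        grouped.setdefault(label, []).append(feature)
    return {label: centroid(vectors) for label, vectors in grouped.items()}


def predict(feature, centroids):
    """Return the label whose centroid is closest to the feature."""
    return min(centroids,
               key=lambda label: euclidean(feature, centroids[label]))


# Placeholder low-dimensional features, as if produced by the encoder.
train_features = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, 0.0]]
train_labels = ["subtype_A", "subtype_A", "subtype_B", "subtype_B"]

centroids = fit_centroids(train_features, train_labels)
predicted = predict([0.7, 0.05], centroids)
```

Any other model that consumes the feature vector, such as a survival model, could take the place of the classifier here.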
The omics data processing method of the embodiment of the disclosure generates an input vector according to the expression quantities of a plurality of genes in the omics data by obtaining the omics data, inputs the input vector into the deep neural network model for coding to obtain the characteristics of the omics data, and then executes the target task of the omics data according to the characteristics of the omics data, so that accurate low-dimensional representation of the omics data can be obtained, and the accuracy of downstream target tasks such as classification task of cancer typing, individual survival analysis task and the like can be improved.
According to an embodiment of the present disclosure, a model training method for omics data processing is also provided. FIG. 3 is a schematic flow diagram of a model training method for omics data processing according to a third embodiment of the present disclosure.
It should be noted that, in the model training method for omics data processing provided in the embodiments of the present disclosure, the execution subject is a model training device for omics data processing, which is hereinafter referred to as a model training device for short. The model training apparatus may be implemented by software and/or hardware, and the model training apparatus may be configured in an electronic device, which may include, but is not limited to, a terminal device, a server, and the like.
As shown in fig. 3, the model training method for omics data processing may include the following steps:
Step 301, obtaining training omics data.
The training omics data include the expression levels of a plurality of genes.
Step 302, adjusting the expression levels of a plurality of genes in the training omics data by adopting at least two data enhancement strategies to obtain at least two enhanced omics data.
The expression level of a gene is a quantitative value of gene expression.
A data enhancement strategy is a strategy for performing data enhancement (data augmentation) on the training omics data. The data enhancement strategy may be set as needed, which is not limited in the present disclosure.
In the embodiment of the disclosure, at least two data enhancement strategies can be adopted to perform data enhancement on training omics data to obtain at least two enhanced omics data. And performing data enhancement on the training omics data by adopting each data enhancement strategy to correspondingly obtain enhanced omics data.
Specifically, at least two data enhancement strategies can be adopted to adjust the expression quantity of a plurality of genes in the training omics data, so that the data enhancement of the training omics data is realized. And adjusting the expression quantity of a plurality of genes in the training omics data by adopting each data enhancement strategy to correspondingly obtain enhanced omics data.
Step 303, encoding, by a deep neural network model, according to the expression levels of a plurality of genes in the at least two enhanced omics data to obtain corresponding features.
The deep neural network model may be an MLP model or other models, which is not limited in this disclosure.
In the embodiment of the disclosure, for each enhanced omics data, the enhanced omics data can be encoded by using a deep neural network model according to the expression amount of a plurality of genes therein, so as to obtain the corresponding characteristics of the enhanced omics data. The features corresponding to the enhanced omics data are the low-dimensional omics characterization, namely the low-dimensional representation, of the enhanced omics data. Wherein, the enhanced omics data is coded, namely, the enhanced omics data is subjected to feature extraction.
Step 304, adjusting model parameters of the deep neural network model based on differences between the features of the at least two enhanced omics data to minimize the differences.
In the embodiment of the present disclosure, model training may be performed by contrastive learning. Specifically, after the features corresponding to the at least two enhanced omics data are obtained, a contrastive learning loss function can be calculated according to the differences between those features, and the model parameters of the deep neural network model are optimized by adjusting them to minimize the contrastive learning loss function. By repeating steps 302 to 304 multiple times, the model parameters of the deep neural network model are optimized multiple times to obtain the trained deep neural network model.
The model parameters of the deep neural network model may be optimized by any optimization method such as an SGD (Stochastic Gradient Descent) method and a BGD (Batch Gradient Descent) method, which is not limited in this disclosure.
The contrastive learning loss function may be an InfoNCE (Information Noise-Contrastive Estimation) loss function, a Barlow Twins (a self-supervised learning method) loss function, or another loss function, and may be selected according to the contrastive learning method used, which is not limited in this disclosure.
Taking the InfoNCE loss function as an example, positive and negative samples can be constructed among the omics data samples by increasing the batch size, so that the differences and correlations between omics data samples are learned in the latent space. Here, an omics data sample can be understood as one enhanced omics data.
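To make the InfoNCE loss concrete, the following is a minimal sketch in plain Python. It assumes cosine similarity between L2-normalised features, a temperature of 0.1, and a one-directional formulation in which, for the i-th sample, the feature of its second enhanced view is the positive and the second-view features of all other samples in the batch are negatives. These choices and the function names are assumptions for illustration and are not taken from the disclosure.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(features_a, features_b, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired features.

    features_a[i] and features_b[i] are the features of two enhanced
    views of the same training sample (a positive pair); features_b[j]
    with j != i act as negatives for features_a[i].
    """
    a = [l2_normalize(v) for v in features_a]
    b = [l2_normalize(v) for v in features_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_sum = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits))
        total += log_sum - logits[i]  # equals -log softmax of the positive
    return total / len(a)


# Two samples whose views agree give a small loss ...
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# ... while pairing each sample with the other sample's view gives a large one.
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

A larger batch supplies more negatives per anchor, which is one way to read the remark above about increasing the batch size.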
Take as an example the case in which two data enhancement strategies are adopted to adjust the expression levels of a plurality of genes in the training omics data, yielding two enhanced omics data on which model training is performed. Referring to the architecture diagram shown in fig. 4, a two-tower model can be adopted in the embodiment of the present disclosure, and the model comprises a data enhancement module, an encoding module and a loss function calculation module. The data enhancement module can apply the two data enhancement strategies to the input training omics data 401 to obtain two enhanced omics data 402 and 403. The encoding module is implemented by a deep neural network model 404: the enhanced omics data 402 is input into the deep neural network model 404, which performs feature extraction on it to obtain the corresponding low-dimensional features 405. Similarly, the enhanced omics data 403 is input into the deep neural network model 404, which performs feature extraction on it to obtain the corresponding low-dimensional features 406. After obtaining the features 405 and 406 output by the encoding module, the loss function calculation module may calculate a contrastive learning loss function according to the difference between the features 405 of the enhanced omics data 402 and the features 406 of the enhanced omics data 403, and the model parameters of the deep neural network model 404 are optimized by adjusting them to minimize the contrastive learning loss function.
By repeatedly applying the two data enhancement strategies to the training omics data 401, the model parameters of the deep neural network model 404 are optimized multiple times based on the difference between the features of the two enhanced omics data, so that the optimal parameters of the deep neural network model 404 can be obtained and the training of the deep neural network model 404 is completed.
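The parameter update in this two-tower arrangement can be sketched with a deliberately simplified setup. The example below assumes a single linear layer as the shared encoder, the squared difference between the two features as the loss, a hand-derived gradient, and a fixed learning rate; none of these are specified by the disclosure, which describes a deep neural network trained with a contrastive learning loss. Note also that minimising only this difference admits a trivial solution (all-zero weights), which is one reason contrastive losses with negative samples, such as InfoNCE, are used in practice.

```python
import random


def encode(weights, x):
    """Shared linear encoder used for both enhanced views."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def alignment_loss(weights, view_a, view_b):
    """Squared difference between the features of the two views."""
    feature_a = encode(weights, view_a)
    feature_b = encode(weights, view_b)
    return sum((p - q) ** 2 for p, q in zip(feature_a, feature_b))


def sgd_step(weights, view_a, view_b, lr):
    """One gradient-descent step on the alignment loss.

    For a linear encoder W, the loss is ||W d||^2 with d = a - b,
    so its gradient with respect to W is 2 (W d) d^T.
    """
    diff_in = [p - q for p, q in zip(view_a, view_b)]
    diff_out = encode(weights, diff_in)
    return [[w - lr * 2.0 * g * d for w, d in zip(row, diff_in)]
            for row, g in zip(weights, diff_out)]


rng = random.Random(0)
weights = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]

# Two enhanced views of one made-up sample, masked at different genes.
view_a = [2.5, 0.0, 7.1, 1.2]
view_b = [2.5, 0.3, 0.0, 1.2]

losses = []
for _ in range(5):
    losses.append(alignment_loss(weights, view_a, view_b))
    weights = sgd_step(weights, view_a, view_b, lr=0.005)
```

Both views pass through the same `weights`, mirroring the shared deep neural network model 404 in fig. 4; each iteration corresponds to one pass through encoding, loss calculation and parameter adjustment.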
It should be noted that the trained deep neural network model in the embodiment of the present disclosure may be used, after omics data are obtained, to encode the omics data according to the expression levels of a plurality of genes in the omics data to obtain the features of the omics data. For the process of executing these steps with the trained deep neural network model, reference may be made to the embodiments of the omics data processing method, which is not described herein again.
By learning in a contrastive learning manner, the correlations and differences between the at least two enhanced omics data can be learned, so that the trained deep neural network model can fully extract features from the omics data and obtain an accurate low-dimensional representation of the omics data. This improves the expressive capacity of the low-dimensional representation, so that the state of a patient can be better described. Executing a target task for the omics data based on the obtained low-dimensional representation can therefore improve the accuracy of the target task.
In summary, according to the model training method for omics data processing provided in the embodiments of the present disclosure, training omics data are obtained; at least two data enhancement strategies are employed to adjust the expression levels of a plurality of genes in the training omics data to obtain at least two enhanced omics data; a deep neural network model is employed to encode according to the expression levels of a plurality of genes in the at least two enhanced omics data to obtain corresponding features; and model parameters of the deep neural network model are adjusted according to differences between the features of the at least two enhanced omics data to minimize the differences. The deep neural network model is thereby trained on the training omics data to obtain a deep neural network model for omics data processing. When the trained deep neural network model is used for omics data processing, features can be fully extracted from the omics data to obtain an accurate low-dimensional representation of the omics data, which further improves the accuracy of downstream target tasks such as cancer-typing classification tasks and individual survival analysis tasks.
The model training method for omics data processing provided by the present disclosure is further described below in conjunction with Fig. 5.
Fig. 5 is a schematic flow diagram of a model training method for omics data processing according to a fourth embodiment of the present disclosure.
As shown in fig. 5, the model training method for omics data processing may include the following steps:
Step 501, obtaining training omics data.
Step 502, masking the expression quantity of at least one gene in the training omics data by adopting at least two data enhancement strategies to obtain at least two enhanced omics data.
In the embodiments of the present disclosure, when at least two data enhancement strategies are used to perform data enhancement on the training omics data, each of the strategies may mask the expression level of at least one gene in the training omics data, with the mask positions differing between strategies; that is, different data enhancement strategies mask the expression levels of different genes. Because the masked genes differ, applying the at least two data enhancement strategies yields at least two enhanced omics data.
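The masking variant described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the choice of zero as the mask value, and the random selection of non-overlapping mask positions are all assumptions made for the example.

```python
import random

MASK_VALUE = 0.0  # assumption: a masked expression level is replaced by zero


def mask_expression(expression, positions, mask_value=MASK_VALUE):
    """Return a copy of `expression` with the given gene positions masked."""
    masked = list(expression)
    for i in positions:
        masked[i] = mask_value
    return masked


def two_masked_views(expression, n_masked, seed=0):
    """Build two enhanced views whose mask positions do not overlap."""
    rng = random.Random(seed)
    # Sample 2 * n_masked distinct positions, then split them between the views.
    chosen = rng.sample(range(len(expression)), 2 * n_masked)
    first, second = chosen[:n_masked], chosen[n_masked:]
    return mask_expression(expression, first), mask_expression(expression, second)


# Hypothetical expression levels of six genes in one training sample.
expression = [5.2, 0.7, 3.1, 8.4, 1.9, 2.6]
view_a, view_b = two_masked_views(expression, n_masked=2)
```

Each view hides a different pair of genes, so the two views are distinct enhanced versions of the same sample.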
Alternatively, in the embodiments of the present disclosure, when at least two data enhancement strategies are used to perform data enhancement on the training omics data, each of the strategies may add noise to the expression level of at least one gene in the training omics data, with the manner of adding noise differing between strategies. For example, different data enhancement strategies may add noise to the expression levels of different genes; or may add noise of different amplitudes to the expression level of the same gene; or may add noise to the expression levels of different genes with the amplitudes of the added noise also differing. Because the strategies differ, adding noise with the at least two data enhancement strategies yields at least two enhanced omics data.
Accordingly, step 502 may be replaced with: adding noise to the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
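The noise-adding variant of step 502 can be sketched in the same way. The Gaussian noise model, the clamping at zero, and all names below are assumptions chosen for illustration; the disclosure does not specify the noise distribution.

```python
import random


def add_noise(expression, positions, amplitude, rng):
    """Return a copy with zero-mean Gaussian noise added at `positions`."""
    noisy = list(expression)
    for i in positions:
        # Clamp at zero on the assumption that expression levels are non-negative.
        noisy[i] = max(0.0, noisy[i] + rng.gauss(0.0, amplitude))
    return noisy


rng = random.Random(0)
# Hypothetical expression levels of six genes in one training sample.
expression = [5.2, 0.7, 3.1, 8.4, 1.9, 2.6]

# Strategy 1: small noise on genes 0 and 1.
view_a = add_noise(expression, positions=[0, 1], amplitude=0.1, rng=rng)
# Strategy 2: larger noise on genes 3 and 4.
view_b = add_noise(expression, positions=[3, 4], amplitude=0.5, rng=rng)
```

Here the two strategies differ both in which genes are perturbed and in the noise amplitude, which corresponds to the third of the cases listed above.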
Masking the expression level of at least one gene in the training omics data with at least two data enhancement strategies, or adding noise to the expression level of at least one gene in the training omics data with at least two data enhancement strategies, performs data enhancement on the training omics data and yields at least two enhanced omics data. A deep neural network model trained on the enhanced omics data is more robust to interference when encoding omics data, so that a more accurate low-dimensional representation of the omics data can be obtained through the trained deep neural network model.
Step 503, encoding, by a deep neural network model, according to the expression levels of a plurality of genes in the at least two enhanced omics data, to obtain corresponding features.
Step 504, adjusting model parameters of the neural network model according to the difference between the features of the at least two enhanced omics data, so as to minimize the difference.
For the specific implementation process and principle of steps 503 to 504, reference may be made to the description of the foregoing embodiments, which are not described herein again.
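Steps 503 and 504 can be illustrated with a toy training loop. This is a sketch under strong simplifying assumptions: a single linear layer stands in for the deep neural network model, the difference between features is taken to be the squared Euclidean distance, and the learning rate, feature size, and function names are all hypothetical. Like the description, it only minimizes the difference between the two views and does nothing to prevent the features of all samples from becoming identical.

```python
import random


def encode(weights, vector):
    """Linear stand-in for the deep neural network encoder: features = W @ x."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def difference(feat_a, feat_b):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b))


def train_step(weights, view_a, view_b, lr=0.005):
    """One gradient-descent step on the difference between the views' features."""
    delta = [a - b for a, b in zip(view_a, view_b)]
    residual = encode(weights, delta)  # equals W @ a - W @ b by linearity
    # Gradient of ||W @ delta||^2 with respect to W is 2 * outer(residual, delta).
    return [
        [w - lr * 2.0 * r * d for w, d in zip(row, delta)]
        for row, r in zip(weights, residual)
    ]


rng = random.Random(0)
n_genes, n_features = 6, 2
weights = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(n_features)]

# Two enhanced views of one sample, with different genes masked to zero.
view_a = [0.0, 0.7, 3.1, 8.4, 1.9, 2.6]
view_b = [5.2, 0.7, 3.1, 0.0, 1.9, 2.6]

loss_before = difference(encode(weights, view_a), encode(weights, view_b))
for _ in range(50):
    weights = train_step(weights, view_a, view_b)
loss_after = difference(encode(weights, view_a), encode(weights, view_b))
```

After the loop, `loss_after` is smaller than `loss_before`, meaning the encoder maps the two views of the same sample to closer features.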
In summary, in the model training method for omics data processing provided in the embodiments of the present disclosure, training omics data is obtained; at least two data enhancement strategies are used to mask the expression level of at least one gene in the training omics data, yielding at least two enhanced omics data; a deep neural network model encodes the expression levels of a plurality of genes in each of the at least two enhanced omics data to obtain corresponding features; and model parameters of the neural network model are adjusted according to the difference between the features of the at least two enhanced omics data so as to minimize that difference. The deep neural network model for omics data processing is thus trained on the training omics data. When the trained model is used for omics data processing, features can be fully extracted from the omics data to obtain an accurate low-dimensional representation, which in turn improves the accuracy of downstream target tasks such as cancer typing classification and individual survival analysis.
The omics data processing device provided by the present disclosure will be described with reference to fig. 6.
Fig. 6 is a schematic structural diagram of an omics data processing device according to a fifth embodiment of the present disclosure.
As shown in fig. 6, the present disclosure provides an omics data processing apparatus 600 comprising: a first obtaining module 601, a first encoding module 602, and a processing module 603.
The first obtaining module 601 is configured to obtain omics data;
the first encoding module 602 is configured to encode the omics data according to the expression levels of a plurality of genes in the omics data, to obtain the features of the omics data;
the processing module 603 is configured to perform a target task of the omics data according to the features of the omics data.
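When the apparatus is implemented in software, the three modules can be thought of as three composable stages. The sketch below is only one hypothetical arrangement: the class name, the callable-based design, and the stand-in encoder and task are assumptions for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class OmicsDataProcessor:
    """Hypothetical software counterpart of apparatus 600."""

    obtain: Callable[[], Sequence[float]]             # first obtaining module 601
    encode: Callable[[Sequence[float]], List[float]]  # first encoding module 602
    run_task: Callable[[List[float]], str]            # processing module 603

    def process(self) -> str:
        omics = self.obtain()
        features = self.encode(omics)
        return self.run_task(features)


processor = OmicsDataProcessor(
    obtain=lambda: [5.2, 0.7, 3.1, 8.4],
    # Stand-in encoder: averages two halves instead of using a trained network.
    encode=lambda x: [sum(x[:2]) / 2, sum(x[2:]) / 2],
    # Stand-in target task: a toy two-class decision on the features.
    run_task=lambda f: "type A" if f[0] > f[1] else "type B",
)
result = processor.process()
```

Keeping the stages as separate callables mirrors the module split, so a trained encoder or a different target task can be swapped in without touching the other stages.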
It should be noted that the omics data processing device 600 provided in this embodiment can execute the omics data processing method of the foregoing embodiments. The omics data processing device 600 can be implemented in software and/or hardware and can be configured in an electronic device, which may include, but is not limited to, a terminal device, a server, and the like; this embodiment does not specifically limit the electronic device.
As a possible implementation manner of the embodiment of the present disclosure, the first encoding module 602 includes:
a generation unit, configured to generate an input vector from the expression levels of a plurality of genes in the omics data;
an encoding unit, configured to input the input vector into the deep neural network model for encoding, to obtain the features of the omics data.
As a possible implementation manner of the embodiment of the present disclosure, the generating unit includes:
a processing subunit, configured to take the expression levels of the plurality of genes in the omics data as the values of the corresponding dimensions in the input vector.
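The mapping from expression levels to input-vector dimensions can be sketched as below. The gene names, the fixed gene order, and the default of zero for a gene absent from the sample are assumptions made for the example.

```python
GENE_ORDER = ["TP53", "BRCA1", "EGFR", "MYC"]  # hypothetical fixed gene order


def build_input_vector(expression_by_gene, gene_order=GENE_ORDER, missing=0.0):
    """Place each gene's expression level at that gene's fixed dimension."""
    return [float(expression_by_gene.get(gene, missing)) for gene in gene_order]


# Hypothetical sample in which BRCA1 was not measured.
sample = {"EGFR": 3.1, "TP53": 5.2, "MYC": 8.4}
vector = build_input_vector(sample)
# vector == [5.2, 0.0, 3.1, 8.4]
```

Fixing the gene order once ensures that a given dimension of the input vector always refers to the same gene across samples.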
It should be noted that the foregoing description of the embodiments of the omics data processing method is also applicable to the omics data processing apparatus provided in the present disclosure, and is not repeated herein.
The omics data processing device provided by the embodiments of the present disclosure obtains omics data, encodes the omics data according to the expression levels of a plurality of genes in the omics data to obtain the features of the omics data, and then performs the target task of the omics data according to those features. An accurate low-dimensional representation of the omics data can thereby be obtained, improving the accuracy of downstream target tasks such as cancer typing classification and individual survival analysis.
According to an embodiment of the present disclosure, a model training device for omics data processing is also provided.
The model training device for omics data processing provided by the present disclosure is described below with reference to fig. 7.
Fig. 7 is a schematic structural diagram of a model training apparatus for omics data processing according to a sixth embodiment of the present disclosure.
As shown in fig. 7, the present disclosure provides a model training apparatus 700 for omics data processing, comprising: a second obtaining module 701, a first adjusting module 702, a second encoding module 703 and a second adjusting module 704.
The second obtaining module 701 is configured to obtain training omics data;
the first adjusting module 702 is configured to adjust the expression levels of a plurality of genes in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data;
the second encoding module 703 is configured to encode, by using a deep neural network model, according to the expression levels of a plurality of genes in the at least two enhanced omics data, to obtain corresponding features;
the second adjusting module 704 is configured to adjust model parameters of the neural network model according to the difference between the features of the at least two enhanced omics data, so as to minimize the difference.
It should be noted that the model training device 700 for omics data processing, referred to as a model training device for short, provided in this embodiment may perform the model training method for omics data processing of the foregoing embodiment. The model training apparatus may be implemented by software and/or hardware, and the model training apparatus may be configured in an electronic device, which may include, but is not limited to, a terminal device, a server, and the like.
As a possible implementation manner of the embodiment of the present disclosure, the first adjusting module 702 includes:
a mask unit, configured to mask the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
As a possible implementation manner of the embodiment of the present disclosure, the first adjusting module 702 includes:
a processing unit, configured to add noise to the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
It should be noted that the foregoing description of the embodiment of the model training method for omics data processing is also applicable to the model training device for omics data processing provided in the present disclosure, and is not repeated herein.
The model training device for omics data processing provided in the embodiments of the present disclosure obtains training omics data; adjusts the expression levels of a plurality of genes in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data; encodes, by using a deep neural network model, according to the expression levels of a plurality of genes in the at least two enhanced omics data, to obtain corresponding features; and adjusts model parameters of the neural network model according to the difference between the features of the at least two enhanced omics data so as to minimize that difference. The deep neural network model for omics data processing is thus trained on the training omics data. When the trained model is used for omics data processing, features can be fully extracted from the omics data to obtain an accurate low-dimensional representation, thereby improving the accuracy of downstream target tasks such as cancer typing classification and individual survival analysis.
Based on the above embodiments, the present disclosure also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the omics data processing method of the present disclosure or to perform the model training method for omics data processing of the present disclosure.
Based on the above embodiments, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute an omics data processing method disclosed in the embodiments of the present disclosure or execute a model training method for omics data processing disclosed in the embodiments of the present disclosure.
Based on the above embodiments, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the omics data processing method of the present disclosure, or implements the steps of the model training method for omics data processing of the present disclosure.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 8, the electronic device 800 may include a computing unit 801 that may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as the omics data processing method or the model training method for omics data processing. For example, in some embodiments, the omics data processing method or the model training method for omics data processing can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When loaded into the RAM 803 and executed by the computing unit 801, the computer program may perform one or more steps of the omics data processing method or the model training method for omics data processing described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the omics data processing method or the model training method for omics data processing by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. An omics data processing method, comprising:
obtaining omics data;
generating an input vector according to the expression levels of a plurality of genes in the omics data;
inputting the input vector into a deep neural network model for encoding, so as to obtain features of the omics data;
performing a target task of the omics data according to the features of the omics data;
wherein a training method of the deep neural network model comprises:
acquiring training omics data;
adjusting the expression levels of a plurality of genes in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data;
encoding, by the deep neural network model, according to the expression levels of the plurality of genes in the at least two enhanced omics data, to obtain corresponding features;
adjusting model parameters of the neural network model according to the difference between the features of the at least two enhanced omics data, so as to minimize the difference.
2. The method of claim 1, wherein the generating an input vector according to the expression levels of a plurality of genes in the omics data comprises:
taking the expression levels of the plurality of genes in the omics data as the values of the corresponding dimensions in the input vector.
3. The method of claim 1, wherein the adjusting the expression levels of a plurality of genes in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data, comprises:
masking the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
4. The method of claim 1, wherein the adjusting the expression levels of a plurality of genes in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data, comprises:
adding noise to the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
5. An omics data processing apparatus, comprising:
a first obtaining module, configured to obtain omics data;
a first encoding module, comprising:
a generation unit, configured to generate an input vector according to the expression levels of a plurality of genes in the omics data;
an encoding unit, configured to input the input vector into a deep neural network model for encoding, so as to obtain features of the omics data, wherein the deep neural network model is obtained by model training based on at least two enhanced omics data, and the at least two enhanced omics data are obtained by adjusting the expression levels of a plurality of genes in training omics data by using at least two data enhancement strategies;
a processing module, configured to perform a target task of the omics data according to the features of the omics data;
wherein training of the deep neural network model is performed by:
a second obtaining module, configured to obtain the training omics data;
a first adjusting module, configured to adjust the expression levels of a plurality of genes in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data;
a second encoding module, configured to encode, by using the deep neural network model, according to the expression levels of the plurality of genes in the at least two enhanced omics data, to obtain corresponding features;
a second adjusting module, configured to adjust model parameters of the neural network model according to the difference between the features of the at least two enhanced omics data, so as to minimize the difference.
6. The apparatus of claim 5, wherein the generation unit comprises:
a processing subunit, configured to take the expression levels of the plurality of genes in the omics data as the values of the corresponding dimensions in the input vector.
7. The apparatus of claim 5, wherein the first adjusting module comprises:
a mask unit, configured to mask the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
8. The apparatus of claim 5, wherein the first adjusting module comprises:
a processing unit, configured to add noise to the expression level of at least one gene in the training omics data by using at least two data enhancement strategies, to obtain at least two enhanced omics data.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
CN202111653938.XA 2021-12-30 2021-12-30 Omics data processing method and device, electronic device and storage medium Active CN114429787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111653938.XA CN114429787B (en) 2021-12-30 2021-12-30 Omics data processing method and device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114429787A CN114429787A (en) 2022-05-03
CN114429787B true CN114429787B (en) 2023-04-18

Family

ID=81310491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111653938.XA Active CN114429787B (en) 2021-12-30 2021-12-30 Omics data processing method and device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114429787B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant