CN117116350B - Correction method and device for RNA sequencing data, electronic equipment and storage medium - Google Patents

Publication number
CN117116350B
Authority
CN
China
Prior art keywords
rna sequencing
information distribution
encoder
biological information
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311388051.1A
Other languages
Chinese (zh)
Other versions
CN117116350A (en)
Inventor
钱坤
李若男
刘万飞
林强
崔鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute Of Agricultural Genome Chinese Academy Of Agricultural Sciences Shenzhen Branch Of Guangdong Provincial Laboratory Of Lingnan Modern Agricultural Science And Technology
Original Assignee
Shenzhen Institute Of Agricultural Genome Chinese Academy Of Agricultural Sciences Shenzhen Branch Of Guangdong Provincial Laboratory Of Lingnan Modern Agricultural Science And Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute Of Agricultural Genome Chinese Academy Of Agricultural Sciences Shenzhen Branch Of Guangdong Provincial Laboratory Of Lingnan Modern Agricultural Science And Technology

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a method and device for correcting RNA sequencing data, an electronic device, and a storage medium. The method for correcting RNA sequencing data comprises: correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data. The correction model is obtained by training a variational autoencoder that contains the correction model. The variational autoencoder is used for extracting the noise information distribution of an RNA sequencing sample and the biological information distribution of the corrected RNA sequencing sample, and for jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample. The variational autoencoder is trained based at least on the reconstructed RNA sequencing sample, the RNA sequencing sample, and the discrimination results for the noise information distribution and the biological information distribution.

Description

Correction method and device for RNA sequencing data, electronic equipment and storage medium
Technical Field
The application relates to the technical field of RNA sequencing, in particular to a correction method and device of RNA sequencing data, electronic equipment and a storage medium.
Background
Transcriptome sequencing (RNA-seq), which builds on second-generation high-throughput DNA sequencing, provides full transcript information at single-base resolution and has become an indispensable tool in molecular biology. RNA-seq is now widely used for gene expression quantification, transcription start site identification, functional identification of non-coding RNA, single-cell analysis, and other applications.
The accumulation of high-throughput sequencing data makes it increasingly feasible to integrate large numbers of public transcriptome sequencing datasets and to discover biological patterns from them. How to correct the noise that batch effects introduce into such large-scale datasets has therefore become a primary problem.
Disclosure of Invention
In order to solve the technical problems, the application provides a method and a device for correcting RNA sequencing data, electronic equipment and a storage medium.
According to a first aspect of embodiments of the present application, there is provided a method of correcting RNA sequencing data, comprising:
correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction eliminates noise data in the RNA sequencing data;
the correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used for extracting the noise information distribution of an RNA sequencing sample and the biological information distribution of the corrected RNA sequencing sample, and for jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample;
the variational autoencoder is trained based at least on the reconstructed RNA sequencing sample and on the discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating the noise information and the biological information in the noise information distribution, and discriminating the noise information and the biological information in the biological information distribution.
In an alternative embodiment of the present application, the correction model includes: a first encoder and a first decoder;
correcting the RNA sequencing data by using the pre-constructed correction model to obtain corrected RNA sequencing data comprises:
encoding the RNA sequencing data in the biological dimension by using the first encoder to obtain the biological information distribution of the RNA sequencing data;
and decoding the biological information distribution by using the first decoder to obtain the corrected RNA sequencing data.
In an alternative embodiment of the present application, the correction model is trained by:
encoding the RNA sequencing sample in the noise dimension to obtain the noise information distribution of the RNA sequencing sample;
encoding the corrected RNA sequencing sample to obtain the biological information distribution of the RNA sequencing sample;
jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample;
optimizing the variational autoencoder based on the difference between the RNA sequencing sample and the reconstructed RNA sequencing sample to train the correction model.
In an optional embodiment of the present application, encoding the RNA sequencing sample in the noise dimension to obtain the noise information distribution of the RNA sequencing sample, encoding the corrected RNA sequencing sample to obtain the biological information distribution of the RNA sequencing sample, and jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample comprises:
encoding the RNA sequencing sample in the noise dimension by using a second encoder to obtain the noise information distribution of the RNA sequencing sample;
encoding the corrected RNA sequencing sample by using a third encoder to obtain the biological information distribution of the RNA sequencing sample;
and jointly decoding the noise information distribution and the biological information distribution by using a second decoder to obtain a reconstructed RNA sequencing sample.
In an alternative embodiment of the present application, further comprising:
performing discrimination on the noise information distribution and the biological information distribution, respectively, by using a first noise information discriminator to obtain a first discrimination result for the noise information in the noise information distribution and a second discrimination result for the noise information in the biological information distribution;
performing discrimination on the noise information distribution and the biological information distribution by using a first biological information discriminator to obtain a third discrimination result for the biological information in the noise information distribution and a fourth discrimination result for the biological information in the biological information distribution;
and optimizing the first encoder and the second encoder based on the first discrimination result, the second discrimination result, the third discrimination result, and the fourth discrimination result to train the correction model.
In an alternative embodiment of the present application, further comprising:
optimizing the variational autoencoder based on the difference between the noise information distribution and a prior assumption, and the difference between the biological information distribution and the prior assumption, to train the correction model; wherein the prior assumption is a normal distribution.
In an alternative embodiment of the present application, further comprising:
discriminating the biological information of the RNA sequencing sample and of the corrected RNA sequencing sample by using a second biological information discriminator to obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample;
discriminating the noise information of the RNA sequencing sample and of the corrected RNA sequencing sample by using a second noise information discriminator to obtain a seventh discrimination result for the RNA sequencing sample and an eighth discrimination result for the corrected RNA sequencing sample;
optimizing the variational autoencoder based on the fifth, sixth, seventh, and eighth discrimination results to train the correction model.
In an alternative embodiment of the present application, further comprising:
acquiring an RNA sequencing data set, wherein the RNA sequencing data set comprises a plurality of corrected RNA sequencing data;
and classifying the corrected RNA sequencing data to obtain the RNA sequencing data of different categories.
According to a second aspect of embodiments of the present application, there is provided a correction device for RNA sequencing data, comprising:
a first unit for correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction eliminates noise data in the RNA sequencing data;
the correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used for extracting the noise information distribution of an RNA sequencing sample and the biological information distribution of the corrected RNA sequencing sample, and for jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample;
the variational autoencoder is trained based at least on the reconstructed RNA sequencing sample and on the discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating the noise information and the biological information in the noise information distribution, and discriminating the noise information and the biological information in the biological information distribution.
According to a third aspect of embodiments of the present application, there is provided an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for executing the correction method of the RNA sequencing data by running the instructions in the memory.
According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium storing a computer program which, when executed by a processor, performs the method of correcting RNA sequencing data described above.
The application provides a method and device for correcting RNA sequencing data, an electronic device, and a storage medium, wherein the method for correcting RNA sequencing data comprises: correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction eliminates noise data in the RNA sequencing data; the correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used for extracting the noise information distribution of an RNA sequencing sample and the biological information distribution of the corrected RNA sequencing sample, and for jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample; the variational autoencoder is trained based at least on the reconstructed RNA sequencing sample and on the discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating the noise information and the biological information in the noise information distribution, and discriminating the noise information and the biological information in the biological information distribution.
In this method, the RNA sequencing data is corrected by the correction model. The correction model is trained by training a variational autoencoder that contains it, using the discrimination results for the noise information distribution and the biological information distribution in the RNA sequencing data, which improves the accuracy with which the correction model corrects RNA sequencing data.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flowchart of a method for correcting RNA sequencing data according to an embodiment of the present application.
FIG. 2 is a flowchart of preprocessing RNA sequencing data according to another embodiment of the present application.
FIG. 3 is a schematic structural diagram of a variational autoencoder according to another embodiment of the present application.
FIG. 4 is a schematic structural diagram of a device for correcting RNA sequencing data according to another embodiment of the present application.
FIG. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Transcriptome sequencing (RNA-seq), which builds on second-generation high-throughput DNA sequencing, provides full transcript information at single-base resolution and has become an indispensable tool in molecular biology. RNA-seq is now widely used for gene expression quantification, transcription start site identification, functional identification of non-coding RNA, single-cell analysis, and other applications.
The accumulation of high-throughput sequencing data makes it increasingly feasible to integrate large numbers of public transcriptome sequencing datasets and to discover biological patterns from them. How to correct the noise that batch effects introduce into such large-scale datasets has therefore become a primary problem.
In order to solve the above technical problems, the present application provides a method, an apparatus, an electronic device and a storage medium for correcting RNA sequencing data, which are described in detail in the following examples.
Exemplary method
The embodiment of the application first provides a method for correcting RNA sequencing data, which aims to correct noise data (also called batch data) in the RNA sequencing data while preserving the biological information in the RNA sequencing data, so as to obtain noise-free RNA sequencing data.
Here, RNA sequencing data is understood to be RNA-seq data, i.e., transcriptome data obtained by RNA sequencing technology. It contains all RNA information transcribed in cells or tissues and can be used to study gene expression levels, splice variation, transcript diversity, transcription start sites, transcription factor binding sites, etc. In embodiments of the present application, RNA-seq data can be generated by high-throughput sequencing techniques.
In the field of biology, biological information in the RNA sequencing data mainly includes the following aspects:
1. Tissue source, i.e., the tissue from which the RNA sequencing sample originates. Samples may come from different tissues, such as liver, heart, or lung, each of which has specific characteristics and expression profiles at the gene expression level.
2. Physiological status, i.e., the health status of the source of the RNA sequencing sample. Samples may come from a healthy individual or from an individual suffering from a disease; in a disease state the transcriptome may be altered, e.g., the expression of immune-related genes may be up-regulated under inflammatory conditions.
3. Developmental status, i.e., the developmental stage of the individual from which the RNA sequencing sample is derived. Samples may come from individuals at different developmental stages, such as embryos, young children, or adults, and the transcriptome changes as the developmental stage changes.
4. Treatment or stimulation, i.e., whether the RNA sequencing sample has been subjected to an external treatment or stimulation, such as drug treatment or viral infection, which causes changes in the transcriptome.
Referring to FIG. 1, FIG. 1 is a flowchart of a method for correcting RNA sequencing data according to an embodiment of the present application.
As shown in FIG. 1, the method for correcting RNA sequencing data comprises:
Step S101: correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction eliminates noise data in the RNA sequencing data.
The correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used for extracting the noise information distribution of an RNA sequencing sample and the biological information distribution of the corrected RNA sequencing sample, and for jointly decoding the noise information distribution and the biological information distribution to obtain a reconstructed RNA sequencing sample;
the variational autoencoder is trained based at least on the reconstructed RNA sequencing sample and on the discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating the noise information and the biological information in the noise information distribution, and discriminating the noise information and the biological information in the biological information distribution.
In an embodiment of the application, the correction model is itself essentially a variational autoencoder (Variational AutoEncoder, VAE) comprising two parts: a first encoder (Encoder), which compresses a high-dimensional input into a low-dimensional hidden variable, and a first decoder (Decoder), which decodes the low-dimensional hidden variable and reconstructs the high-dimensional input.
In the embodiment of the application, the RNA sequencing data contains not only biological information but also noise information. For example, when RNA sequencing data from different laboratories are integrated, batch effects may arise because the laboratories use different instruments and experimental conditions.
Further, the first encoder is specifically configured to encode the RNA sequencing data in the biological dimension to obtain the biological information distribution of the RNA sequencing data; the first decoder is configured to decode the biological information distribution to obtain the corrected RNA sequencing data.
In an alternative embodiment of the present application, the RNA sequencing data is further pre-processed before being subjected to the correction process using the pre-constructed correction model, in order to make the RNA sequencing data available for input to the convolutional neural network.
Specifically, the process of preprocessing the RNA sequencing data includes:
First, the RNA sequencing data is expressed as a TPM (Transcripts Per Million) expression matrix. The TPM values are then normalized (in an alternative embodiment of the application, the library size of the normalized TPM is 1,000,000), log-transformed with an offset of 1 and a base of 2, and finally max-normalized at the gene level so that the TPM values lie between 0 and 1.
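The preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the applicant's implementation: the function name `preprocess_tpm` and its parameters are hypothetical, and "max normalization at the gene level" is assumed to mean dividing each gene by its maximum across samples.

```python
import numpy as np

def preprocess_tpm(expr, library_size=1_000_000):
    """Normalize a (n_samples, n_genes) non-negative expression matrix.

    Assumed steps: rescale each sample to the library size, apply
    log2(x + 1), then divide each gene by its maximum across samples.
    """
    expr = np.asarray(expr, dtype=float)
    # Rescale every sample (row) so that its values sum to the library size.
    totals = expr.sum(axis=1, keepdims=True)
    tpm = expr / np.where(totals > 0, totals, 1.0) * library_size
    # Logarithmic transform with offset 1 and base 2.
    logged = np.log2(tpm + 1.0)
    # Max normalization per gene (column); all-zero genes stay at zero.
    gene_max = logged.max(axis=0, keepdims=True)
    return logged / np.where(gene_max > 0, gene_max, 1.0)
```

Under these assumptions every value lies in [0, 1], and each gene with non-zero expression reaches 1 in at least one sample.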
Because the model framework of the present application is based on a convolutional neural network (CNN), the standardized TPM (a one-dimensional vector) can be converted into a two-dimensional image to improve the training efficiency of the subsequent model. Specifically, referring to FIG. 2, FIG. 2 is a flowchart of preprocessing RNA sequencing data provided in another embodiment of the present application.
As shown in FIG. 2, the two-dimensional image conversion of the standardized TPM includes: (1) performing nonlinear dimension reduction on all genes in the TPM by using the t-SNE algorithm, mapping each gene to a point in a two-dimensional coordinate system; (2) using a convex hull algorithm to find the smallest enclosure that contains all genes in the coordinate system; (3) transforming the genes from the Cartesian coordinate system to pixels in a two-dimensional matrix by rotation and feature localization. That is, in the present embodiment, the RNA sequencing sample is input to the correction model in the form of an image.
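Steps (2) and (3) can be sketched as below, assuming the two-dimensional gene coordinates from step (1) are already available (for example from a t-SNE implementation). The sketch assumes that the enclosure is the minimum-area bounding rectangle of the convex hull and that genes sharing a pixel are averaged; neither detail is stated in the text, and all function names are hypothetical.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return np.array(pts, dtype=float)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return np.array(half(pts) + half(reversed(pts)), dtype=float)

def min_area_rotation(hull):
    """Rotation that aligns the minimum-area bounding rectangle with the axes."""
    best_area, best_rot = np.inf, np.eye(2)
    # The minimum-area rectangle has one side collinear with a hull edge.
    for dx, dy in np.roll(hull, -1, axis=0) - hull:
        theta = -np.arctan2(dy, dx)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta), np.cos(theta)]])
        rotated = hull @ rot.T
        extent = rotated.max(axis=0) - rotated.min(axis=0)
        if extent[0] * extent[1] < best_area:
            best_area, best_rot = extent[0] * extent[1], rot
    return best_rot

def genes_to_pixels(coords, size=32):
    """Map (n_genes, 2) Cartesian coordinates to integer pixel indices."""
    coords = np.asarray(coords, dtype=float)
    rotated = coords @ min_area_rotation(convex_hull(coords)).T
    lo, hi = rotated.min(axis=0), rotated.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.floor((rotated - lo) / span * (size - 1) + 0.5).astype(int)

def sample_to_image(values, pixels, size=32):
    """Place one sample's gene values on the grid, averaging shared pixels."""
    image, counts = np.zeros((size, size)), np.zeros((size, size))
    np.add.at(image, (pixels[:, 0], pixels[:, 1]), values)
    np.add.at(counts, (pixels[:, 0], pixels[:, 1]), 1)
    return image / np.maximum(counts, 1)
```

The pixel layout is computed once from the gene coordinates and then reused for every sample, so all images share the same gene-to-pixel mapping.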
Further, to facilitate understanding of the training process of the correction model mentioned in the embodiments of the present application, the training process is described in detail below with reference to FIG. 3.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a variational autoencoder according to another embodiment of the present application.
The correction model is included in the variational autoencoder, and in the embodiment of the application, the correction model is trained by training the variational autoencoder.
As shown in FIG. 3, the variational autoencoder includes: a first encoder 101, a first decoder 102, a second encoder 103, a third encoder 104, and a second decoder 105.
The first encoder 101 and the first decoder 102 constitute the correction model;
the second encoder 103 is configured to encode the RNA sequencing sample $x$ in the noise dimension to obtain the noise information distribution $z_b$ of the RNA sequencing sample.
The first encoder 101 is configured to encode the RNA sequencing sample $x$ in the biological dimension to obtain the biological information distribution $z_c$ of the RNA sequencing sample; the first decoder 102 is configured to decode the biological information distribution $z_c$ to obtain the corrected RNA sequencing sample $x_{bf}$ (i.e., the noise-free RNA sequencing sample).
Specifically, in the present embodiment, the second encoder 103 is used to construct a $q(z_b \mid x)$ model that maps the RNA sequencing sample $x$ directly to the hidden layer, representing the noise information distribution $z_b$; the first encoder 101 is used to construct a $q(z_c \mid x)$ model that maps the RNA sequencing sample $x$ directly to the hidden layer, representing the biological information distribution $z_c$.
As in the original variational autoencoder, the first encoder 101 and the second encoder 103 do not embed the input $x$ directly as the hidden-layer representation $z$. Instead, they assume that $p(z)$ follows a standard Gaussian prior, first reduce the dimension to obtain the mean $\mu$ and the variance $\sigma^2$ of the hidden layer, and then obtain the hidden layer $z$ by sampling.
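The noise information distribution is $z_b \sim \mathcal{N}(\mu_b, \sigma_b^2)$ and the biological information distribution is $z_c \sim \mathcal{N}(\mu_c, \sigma_c^2)$, where $\mu_b, \sigma_b, \mu_c, \sigma_c \in \mathbb{R}^J$ and $J$ is the dimension of the hidden layer.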
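The sampling step described above is commonly implemented with the reparameterization trick, which keeps the draw differentiable with respect to the mean and variance. The following is a minimal NumPy sketch; the function name `sample_latent` and the log-variance parameterization are assumptions, not details taken from the application.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    mu and log_var have shape (..., J), where J is the hidden-layer
    dimension; log_var = log(sigma^2) is an assumed parameterization.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps
```

The same function would serve both $z_b$ and $z_c$, each with its own mean and variance produced by the corresponding encoder.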
In the present embodiment, the first decoder 102 is represented by $p(x_{bf} \mid z_c)$ and is used to restore the biological information distribution $z_c$ to the original dimension, obtaining the RNA sequencing sample $x_{bf}$ that is free from the influence of noise information. Further, a third encoder 104 is used to encode the corrected RNA sequencing sample $x_{bf}$ to obtain the biological information distribution $z_c$ of the RNA sequencing sample.
Finally, a second decoder 105 is used to jointly decode the noise information distribution $z_b$ and the biological information distribution $z_c$ to obtain a reconstructed RNA sequencing sample $x'$.
Specifically, the second decoder 105 combines the biological information distribution and the noise information distribution to construct $p(x \mid z_b, z_c)$, which is assumed to follow a multivariate Gaussian distribution, and reconstructs the input RNA sequencing sample $x$. The biological information and the noise information in the RNA sequencing sample $x$ are split by using the first encoder 101 and the third encoder 104 and then combined and reconstructed by using the second decoder 105, so that the framework of the variational autoencoder is constructed in the absence of noise information.
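The data flow between the five components can be sketched as follows. This is only an illustration of how the parts connect: random linear maps stand in for the trained networks, flat vectors replace the image input, concatenation is assumed for the joint decoding, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, J = 64, 8  # hypothetical input size and hidden-layer dimension

def make_encoder(n_in, n_latent):
    """Stand-in encoder: linear mean and log-variance, then sampling."""
    w_mu = rng.normal(scale=0.1, size=(n_in, n_latent))
    w_lv = rng.normal(scale=0.1, size=(n_in, n_latent))

    def encode(x):
        mu, log_var = x @ w_mu, x @ w_lv
        return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

    return encode

def make_decoder(n_latent, n_out):
    """Stand-in decoder: linear map followed by a sigmoid, output in (0, 1)."""
    w = rng.normal(scale=0.1, size=(n_latent, n_out))
    return lambda z: 1.0 / (1.0 + np.exp(-(z @ w)))

encoder1 = make_encoder(N_GENES, J)      # first encoder 101: x -> z_c
decoder1 = make_decoder(J, N_GENES)      # first decoder 102: z_c -> x_bf
encoder2 = make_encoder(N_GENES, J)      # second encoder 103: x -> z_b
encoder3 = make_encoder(N_GENES, J)      # third encoder 104: x_bf -> z_c
decoder2 = make_decoder(2 * J, N_GENES)  # second decoder 105: (z_b, z_c) -> x'

def forward(x):
    """Training-time pass through the whole variational autoencoder."""
    z_b = encoder2(x)                      # noise information distribution
    x_bf = decoder1(encoder1(x))           # corrected (noise-free) sample
    z_c = encoder3(x_bf)                   # biological information distribution
    x_rec = decoder2(np.concatenate([z_b, z_c], axis=-1))  # reconstructed x'
    return x_bf, z_b, z_c, x_rec

def correct(x):
    """Inference-time correction uses only the correction model."""
    return decoder1(encoder1(x))
```

At inference time only `encoder1` and `decoder1` are needed, which matches the statement that the first encoder and first decoder constitute the correction model.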
As is apparent from the above description, the second encoder 103, the correction model composed of the first encoder 101 and the first decoder 102, the third encoder 104, and the second decoder 105 together constitute the framework of the variational autoencoder.
In an alternative embodiment of the present application, the variational autoencoder may be optimized based on the difference between the RNA sequencing sample and the reduced RNA sequencing sample, so as to train the correction model.
In this embodiment, the main purpose of the second decoder 105 in the variational autoencoder is to jointly decode the biological information distribution and the noise information distribution to obtain the reduced RNA sequencing data. Denoting the RNA sequencing data by $x$ and the reduced RNA sequencing data by $x'$, the variational autoencoder may be trained by constructing a log-likelihood, so that the output $x'$ is as similar as possible to the input $x$.
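Specifically, the log-likelihood can be represented by the following formula (1):
$$\log p(x) = \log \iint p(x \mid z_b, z_c)\, p(z_b, z_c)\, \mathrm{d}z_b\, \mathrm{d}z_c \tag{1}$$
where $z_b$ and $z_c$ are independent of each other, $z_b$ represents the noise information distribution, $z_c$ represents the biological information distribution, $p(z_b, z_c)$ is the joint prior distribution of $z_b$ and $z_c$, and $p(x \mid z_b, z_c)$ is the likelihood function given $z_b$ and $z_c$.
Since the integral is difficult to calculate, the posterior probability $p(z_b, z_c \mid x)$ is also difficult to compute. Encoding networks $q(z_b \mid x)$ and $q(z_c \mid x)$ can therefore be constructed to approximate $p(z_b, z_c \mid x)$, and the log-likelihood can also be expressed by the following formula (2):
$$\log p(x) = \mathbb{E}_{q(z_b, z_c \mid x)}\left[\log p(x \mid z_b, z_c)\right] - \mathrm{KL}\left(q(z_b \mid x) \,\|\, p(z_b)\right) - \mathrm{KL}\left(q(z_c \mid x) \,\|\, p(z_c)\right) + \mathrm{KL}\left(q(z_b, z_c \mid x) \,\|\, p(z_b, z_c \mid x)\right) \tag{2}$$
where $\mathbb{E}_{q(z_b, z_c \mid x)}[\log p(x \mid z_b, z_c)]$ is the reconstruction term of the variational autoencoder, and $p(x \mid z_b, z_c)$ is the likelihood function given $z_b$ and $z_c$.
$\mathrm{KL}(q(z_b \mid x) \,\|\, p(z_b)) + \mathrm{KL}(q(z_c \mid x) \,\|\, p(z_c))$ is the KL divergence between the approximate posterior distribution and the prior distribution of the variational autoencoder, where $q(z_b \mid x)$ is the approximation, generated by the encoding network, of the posterior distribution $p(z_b \mid x)$, $p(z_b)$ is the prior distribution of $z_b$, $q(z_c \mid x)$ is the approximation, generated by the encoding network, of the posterior distribution $p(z_c \mid x)$, $p(z_c)$ is the prior distribution of $z_c$, and $p(z_b, z_c \mid x)$ is the joint posterior distribution of $z_b$ and $z_c$.
$\mathrm{KL}(q(z_b, z_c \mid x) \,\|\, p(z_b, z_c \mid x))$ is the KL divergence between the approximate posterior distribution and the true posterior distribution of the variational autoencoder, where $q(z_b, z_c \mid x) = q(z_b \mid x)\, q(z_c \mid x)$ is the approximation, generated by the encoding network, of the posterior distribution $p(z_b, z_c \mid x)$.
However, since the divergence between the approximate posterior distribution and the true posterior distribution is difficult to calculate, and the KL divergence is non-negative, the following loss function (3) can be constructed by maximizing the variational lower bound of the data log-likelihood, which is equivalent to minimizing its negative:
$$\mathcal{L}_{VAE} = -\mathbb{E}_{q(z_b, z_c \mid x)}\left[\log p(x \mid z_b, z_c)\right] + \mathrm{KL}\left(q(z_b \mid x) \,\|\, p(z_b)\right) + \mathrm{KL}\left(q(z_c \mid x) \,\|\, p(z_c)\right) \tag{3}$$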
The first term on the right-hand side of formula (3) above is the reconstruction loss, which aims to minimize the difference between the reduced RNA sequencing sample $x'$ and the RNA sequencing sample $x$, and can therefore be calculated using the following mean square error formula (4):
$$\mathcal{L}_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - x'_i\right)^2 \tag{4}$$
where $x$ represents the RNA sequencing sample, $x'$ represents the reduced RNA sequencing sample, and $N$ is the number of elements of $x$.
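As a small worked example of the mean square error used for the reconstruction loss, the sketch below compares a sample with its reduced counterpart. The function name `mse` and the sample values are assumptions chosen for illustration.

```python
def mse(x, x_rec):
    """Mean square error between a sample x and its reduced version x_rec."""
    if len(x) != len(x_rec):
        raise ValueError("x and x_rec must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

# Hypothetical expression values for three genes.
loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```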
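The second and third terms on the right-hand side of formula (3) above are KL-divergence regularization terms, whose purpose is to bring the approximate posterior distributions $q(z_b \mid x)$ and $q(z_c \mid x)$ of the hidden layers of the first encoder 101 and the second encoder 103 close to the prior assumption $p(z)$. They can be represented by the following formulas (5) and (6):
$$\mathrm{KL}\left(q(z_b \mid x) \,\|\, p(z_b)\right) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_{b,j}^2 - \mu_{b,j}^2 - \sigma_{b,j}^2\right) \tag{5}$$
$$\mathrm{KL}\left(q(z_c \mid x) \,\|\, p(z_c)\right) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_{c,j}^2 - \mu_{c,j}^2 - \sigma_{c,j}^2\right) \tag{6}$$
where $\sigma_b^2$ represents the variance of the noise information distribution, $\mu_b$ represents the mean of the noise information distribution, $\sigma_c^2$ represents the variance of the biological information distribution, $\mu_c$ represents the mean of the biological information distribution, and $J$ is the dimension of the hidden layer.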
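For a diagonal Gaussian posterior and a standard Gaussian prior, the KL regularization term has a well-known closed form. The sketch below evaluates it, assuming the encoder outputs a mean and a log-variance per hidden dimension; the function name and this parameterization are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over hidden dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# A posterior equal to the prior gives zero divergence.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean of one dimension by 1 adds 0.5 to the divergence.
kl_shift = kl_to_standard_normal([1.0], [0.0])
```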
In addition, the embodiment of the present application adds a similarity regularization loss to the output layer of the first decoder 102 to constrain the corrected RNA sequencing data to stay close to the RNA sequencing data. In practical applications it may be calculated with L2 regularization, and the similarity loss can be expressed by the following formula (7):
$$\mathcal{L}_{sim} = \left\| x - x_{bf} \right\|_2^2 \tag{7}$$
where $x$ represents the RNA sequencing sample and $x_{bf}$ represents the corrected RNA sequencing sample.
Further, so that the first encoder 101 and the second encoder 103 can accurately separate the noise information and the biological information in the RNA sequencing sample, a first biological information discriminator 106 and a first noise information discriminator 107 are further provided for the first encoder 101 and the second encoder 103.
Wherein, the first noise information discriminator 107 is configured to perform discrimination processing on the noise information distribution and the biological information distribution, respectively, to obtain a first discrimination result for noise information in the noise information distribution, and a second discrimination result for noise information in the biological information distribution.
A first biological information discriminator 106 for discriminating the noise information distribution and the biological information distribution, respectively, to obtain a third discrimination result for the biological information in the noise information distribution, and a fourth discrimination result for the biological information in the biological information distribution.
Specifically, the first noise information discriminator 107 is used to predict the noise information from the biological information distribution $z_c$ and the noise information distribution $z_b$. The second encoder 103 wants the noise information to be predictable from its hidden layer, whereas the first encoder 101 wants to deceive the first noise information discriminator 107, so that the noise information is hard to predict from the hidden layer of the first encoder. Correspondingly, the first biological information discriminator 106 is used to predict the biological information from the biological information distribution $z_c$ and the noise information distribution $z_b$, and the biological information should be hard to predict from the hidden layer that carries the noise information distribution. Training of the first noise information discriminator 107 and the first biological information discriminator 106 may be understood as maximizing the log-likelihood of the conditional probability $p(y \mid z)$, where $y \in \{b, c\}$, $z \in \{z_b, z_c\}$, $p(y \mid z)$ represents the probability of predicting the biological information tag or the noise information tag from the biological information distribution $z_c$ or the noise information distribution $z_b$, $b$ represents the noise information tag, $c$ represents the biological information tag, and $1 - p(y \mid z)$ represents the probability that the noise information tag or the biological information tag cannot be predicted from the hidden layer. Maximizing the log-likelihood means maximizing the classification objective shown in the following formula (8):
$$\mathcal{L}_{D} = \log p(b \mid z_b) + \log p(b \mid z_c) + \log p(c \mid z_b) + \log p(c \mid z_c) \tag{8}$$
where $p(b \mid z_b)$ represents the probability of predicting the noise information tag from the noise information distribution; $p(b \mid z_c)$ represents the probability of predicting the noise information tag from the biological information distribution; $p(c \mid z_b)$ represents the probability of predicting the biological information tag from the noise information distribution; and $p(c \mid z_c)$ represents the probability of predicting the biological information tag from the biological information distribution.
Further, the training of the first encoder 101 and the second encoder 103 minimizes the following two classification loss functions (9) and (10):
$$\mathcal{L}_{E_b} = -\left[\log p(b \mid z_b) + \log\left(1 - p(c \mid z_b)\right)\right] \tag{9}$$
$$\mathcal{L}_{E_c} = -\left[\log\left(1 - p(b \mid z_c)\right) + \log p(c \mid z_c)\right] \tag{10}$$
where $p(b \mid z_b)$ represents the probability of predicting the noise information tag from the noise information distribution; $1 - p(c \mid z_b)$ represents the probability that the biological information tag cannot be predicted from the noise information distribution; $1 - p(b \mid z_c)$ represents the probability that the noise information tag cannot be predicted from the biological information distribution; and $p(c \mid z_c)$ represents the probability of predicting the biological information tag from the biological information distribution.
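One plausible reading of the adversarial objectives described above is sketched below, using the four probabilities named in the text as plain numbers. The exact form of the losses in the embodiment is not recoverable from this text, so the function names, the negative-log form, and the example probabilities are all assumptions.

```python
import math

def discriminator_objective(p_b_zb, p_b_zc, p_c_zb, p_c_zc):
    """Log-likelihood that the two discriminators try to maximize."""
    return sum(math.log(p) for p in (p_b_zb, p_b_zc, p_c_zb, p_c_zc))

def encoder_losses(p_b_zb, p_c_zb, p_b_zc, p_c_zc):
    """Losses the encoders try to minimize (assumed negative-log form).

    The noise encoder wants noise tags predictable and biological tags
    unpredictable from z_b; the biological encoder wants the opposite for z_c.
    """
    loss_noise_encoder = -(math.log(p_b_zb) + math.log(1.0 - p_c_zb))
    loss_bio_encoder = -(math.log(1.0 - p_b_zc) + math.log(p_c_zc))
    return loss_noise_encoder, loss_bio_encoder

# Hypothetical discriminator outputs: well separated versus entangled latents.
separated = encoder_losses(p_b_zb=0.9, p_c_zb=0.1, p_b_zc=0.1, p_c_zc=0.9)
entangled = encoder_losses(p_b_zb=0.5, p_c_zb=0.5, p_b_zc=0.5, p_c_zc=0.5)
```

Under this reading, latent representations that keep noise and biological information apart yield lower encoder losses than entangled ones.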
In another alternative embodiment of the present application, as shown in fig. 3, the training of the correction model is further based on a second biological information discriminator 108 and a second noise information discriminator 109;
the second biological information discriminator is configured to discriminate biological information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample.
The second noise information discriminator is configured to discriminate noise information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtain a seventh discrimination result for the RNA sequencing sample and an eighth discrimination result for the corrected RNA sequencing sample.
Further, training of the correction model with the second biological information discriminator 108 and the second noise information discriminator 109 includes: optimizing the variational autoencoder based on the fifth, sixth, seventh, and eighth discrimination results to train the correction model.
Similar to the first biological information discriminator 106 and the first noise information discriminator 107, training of the second biological information discriminator 108 and the second noise information discriminator 109 may be understood as maximizing the log-likelihood of the conditional probability $p(y \mid s)$, where $y \in \{b, c\}$, $s \in \{x, x_{bf}\}$, $p(y \mid s)$ represents the probability of predicting the biological information tag or the noise information tag from the RNA sequencing sample or the corrected RNA sequencing sample, $b$ represents the noise information tag, $c$ represents the biological information tag, and $1 - p(y \mid s)$ represents the probability that the noise information tag or the biological information tag cannot be predicted from the RNA sequencing sample or the corrected RNA sequencing sample. Maximizing the log-likelihood means maximizing the classification objective shown in the following formula (11):
$$\mathcal{L}_{D'} = \log p(b \mid x_{bf}) + \log p(c \mid x_{bf}) + \log p(b \mid x) + \log p(c \mid x) \tag{11}$$
where $p(b \mid x_{bf})$ represents the probability of predicting noise information from the corrected RNA sequencing sample; $p(c \mid x_{bf})$ represents the probability of predicting biological information from the corrected RNA sequencing sample; $p(b \mid x)$ represents the probability of predicting noise information from the RNA sequencing sample; and $p(c \mid x)$ represents the probability of predicting biological information from the RNA sequencing sample.
Further, training the correction model based on the second biological information discriminator 108 and the second noise information discriminator 109 is achieved by minimizing the following classification loss functions (12) and (13):
$$\mathcal{L}_{x_{bf}} = -\left[\log\left(1 - p(b \mid x_{bf})\right) + \log p(c \mid x_{bf})\right] \tag{12}$$
$$\mathcal{L}_{x} = -\left[\log p(b \mid x) + \log p(c \mid x)\right] \tag{13}$$
where $1 - p(b \mid x_{bf})$ represents the probability that noise information cannot be predicted from the corrected RNA sequencing sample; $p(c \mid x_{bf})$ represents the probability of predicting biological information from the corrected RNA sequencing sample; $p(b \mid x)$ represents the probability of predicting noise information from the RNA sequencing sample; and $p(c \mid x)$ represents the probability of predicting biological information from the RNA sequencing sample.
The loss function required for training of the correction model by the binding discriminator can be expressed by the following equation (14) in combination with the above equations (8) to (9):
(14);
where the three coefficients are the loss weights of the respective parts. In practical applications, different weights can be set according to the training difficulty of each part; in an alternative embodiment of the present application, the three weights are set to 5, 15, and 10, respectively.
In combination with the above formulas (1) to (14), the training of the correction model can be represented by the following formula (15):
(15);
where the two coefficients are the weights of the corresponding loss terms; in an alternative embodiment of the present application, they are set to 0.5 and 150, respectively.
In an optional embodiment of the present application, after training the correction model based on the above manner, a fully-connected network and a softmax output layer may be added after the first encoder in the correction model, and phenotype classification training may be performed to obtain a prediction model for predicting phenotype information such as a cancer type, a primary site, a disease type, and the like of a tumor sample.
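The classification head described above can be sketched as a single fully connected layer followed by a softmax. The weights, bias, class count, and function names below are hypothetical placeholders; in practice they would be learned during phenotype classification training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(z_c, weights, bias):
    """Fully connected layer on the biological latent z_c, then softmax."""
    logits = [sum(w * z for w, z in zip(row, z_c)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical 2-dimensional latent and three phenotype classes.
probs = classify([1.0, 0.0],
                 weights=[[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]],
                 bias=[0.0, 0.0, 0.0])
predicted_class = probs.index(max(probs))
```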
Further, after correcting the RNA sequencing data based on the correction method for RNA sequencing data described above, the corrected RNA sequencing data may also be integrated and classified.
Specifically, the method further comprises the following steps:
acquiring an RNA sequencing data set, wherein the RNA sequencing data set comprises a plurality of corrected RNA sequencing data;
and classifying the corrected RNA sequencing data to obtain the RNA sequencing data of different categories.
Further, to ensure the correction quality of the correction model, the correction model may also be evaluated.
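In an alternative embodiment of the present application, the correction model may be evaluated using the adjusted Rand index (ARI), mutual information (MI), the silhouette coefficient, and kBET, which is based on the k-nearest-neighbor distribution, as quantitative evaluation indicators.
The adjusted Rand index is used to calculate the similarity between two clusterings of the RNA sequencing data in the RNA sequencing data set after correction by the correction model.
Specifically, the adjusted Rand index can be expressed by the following formula (16):
$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2}\right] - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] / \binom{n}{2}} \tag{16}$$
where $n_{ij}$, $a_i$, and $b_j$ are values in the contingency table of the two clusterings: $n_{ij}$ is the number of samples shared by cluster $i$ of the first clustering and cluster $j$ of the second, $a_i$ and $b_j$ are the corresponding row and column sums, and $n$ is the total number of samples. ARI denotes the adjusted Rand index, which is at most 1: a value of 1 means that the two clusterings are identical, a value close to 0 means that the clustering result is close to a random partition, and negative values are possible when the agreement is worse than chance.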
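The adjusted Rand index can be computed directly from the pair counts of the contingency table, as in the following sketch. The function name and the toy labels are assumptions; the sketch expects at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples (n >= 2)."""
    n = len(labels_a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_ij - expected) / (max_index - expected)

# Same partition with renamed clusters versus a partition that cuts across it.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The second example yields a negative value, which illustrates that the index is not bounded below by 0.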
Mutual information is used to measure the similarity between the true labels and the cluster labels of the same data. In the embodiment of the present application, adjusted mutual information (AMI) is used for the calculation, and it can be represented by the following formula (17):
$$AMI(U, V) = \frac{MI(U, V) - E\left[MI(U, V)\right]}{\max\left\{H(U), H(V)\right\} - E\left[MI(U, V)\right]} \tag{17}$$
where $U$ and $V$ are the true labels and the cluster labels obtained after clustering the RNA sequencing data, $MI$ is the mutual information, $E[\cdot]$ is its expected value under random labeling, and $H$ is the Shannon entropy.
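The two building blocks named above, mutual information and Shannon entropy, can be computed from label counts as sketched below. The chance-correction term (the expected mutual information under random labeling) is deliberately omitted, so this is not a full adjusted mutual information implementation; the function names and labels are assumptions.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Mutual information (natural log) between two labelings of the same samples."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((c / n) * log(c * n / (count_u[a] * count_v[b]))
               for (a, b), c in joint.items())

true_labels = [0, 0, 1, 1]
mi_match = mutual_information(true_labels, [1, 1, 0, 0])    # same partition
mi_unrelated = mutual_information(true_labels, [0, 1, 0, 1])  # independent partition
```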
The silhouette coefficient evaluates the quality of the clustering by measuring compactness within clusters and separation between clusters. Specifically, for each RNA sequencing data item $i$, the silhouette coefficient can be represented by the following formula (18):
$$s(i) = \frac{b(i) - a(i)}{\max\left\{a(i), b(i)\right\}} \tag{18}$$
where $a(i)$ is the mean distance from RNA sequencing data $i$ to the other data in the same cluster, and $b(i)$ is the minimum average distance from RNA sequencing data $i$ to the other clusters. The silhouette coefficient of the clustering as a whole can be the average of the silhouette coefficients of all RNA sequencing samples. Its value range is $[-1, 1]$, and the larger the value, the better the clustering effect.
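A direct, unoptimized computation of the mean silhouette coefficient is sketched below. It assumes Euclidean distance, at least two clusters, and distinct points; the function name and the toy coordinates are illustrative.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        within = []    # distances to other members of the same cluster
        between = {}   # distances to the members of each other cluster
        for j, (q, other) in enumerate(zip(points, labels)):
            if j == i:
                continue
            d = math.dist(p, q)
            if other == label:
                within.append(d)
            else:
                between.setdefault(other, []).append(d)
        if not within:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(within) / len(within)
        b = min(sum(ds) / len(ds) for ds in between.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across the geometry
```

The second labeling gives a negative score, since each point is on average closer to the other cluster than to its own.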
kBET (k-nearest-neighbor batch effect test) is a quality indicator that measures batch effects in RNA sequencing data. In the embodiments herein, a data set with a total of $n$ RNA sequencing samples is defined as $D$, and $f_i = n_i / n$ is the global proportion of each batch, where $i$ is the batch index, a positive integer, and $n_i$ is the number of RNA sequencing samples in batch $i$. In addition, subsets $S_j$ consisting of the $k$ nearest neighbors of randomly selected RNA sequencing samples are formed, and $f_i^{S_j}$ is defined as the proportion of batch $i$ within the local subset $S_j$. The null hypothesis is that the data are already well mixed, i.e., $f_i^{S_j}$ and $f_i$ follow the same distribution. The Pearson chi-square test is applied in each subset $S_j$ ($j = 1, 2, \ldots, m$) and returns a P value $p_j$, and the average kBET acceptance rate over all tests is finally obtained.
The average kBET acceptance rate may be represented by the following formula (19):
$$\text{acceptance rate} = \frac{1}{m} \sum_{j=1}^{m} I\left(p_j > \alpha\right) \tag{19}$$
where $I(\cdot)$ is an indicator function and $\alpha$ is the significance level threshold; the higher the acceptance rate, the better the batch correction effect.
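The acceptance-rate calculation can be illustrated with the following sketch. To stay within the standard library it covers only the two-batch case, where the chi-square test has one degree of freedom and the P value has a closed form; the function names, the neighborhood counts, and the threshold of 0.05 are assumptions.

```python
import math

def chi_square_statistic(local_counts, global_fractions):
    """Pearson chi-square statistic of local batch counts against global fractions."""
    k = sum(local_counts)
    return sum((obs - k * f) ** 2 / (k * f)
               for obs, f in zip(local_counts, global_fractions))

def p_value_two_batches(statistic):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

def acceptance_rate(p_values, alpha=0.05):
    """Fraction of neighborhoods in which the well-mixed null hypothesis is kept."""
    return sum(1 for p in p_values if p > alpha) / len(p_values)

global_fractions = [0.5, 0.5]
# Two hypothetical neighborhoods of k = 10: one well mixed, one from a single batch.
neighborhoods = [[5, 5], [10, 0]]
p_values = [p_value_two_batches(chi_square_statistic(c, global_fractions))
            for c in neighborhoods]
rate = acceptance_rate(p_values)
```

With more than two batches, the P value would come from a chi-square distribution with one fewer degree of freedom than the number of batches, typically taken from a statistics library.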
In summary, according to the correction method for RNA sequencing data, the RNA sequencing data are corrected using the correction model. The correction model is trained as part of a variational autoencoder that contains it, and the training uses the discrimination results for the noise information distribution and the biological information distribution in the RNA sequencing data, which improves the accuracy with which the correction model corrects RNA sequencing data.
Exemplary apparatus
Correspondingly, the embodiment of the application also provides a device for correcting the RNA sequencing data, please refer to fig. 4, and fig. 4 is a schematic structural diagram of a device for correcting the RNA sequencing data according to another embodiment of the application.
As shown in fig. 4, the correction device for RNA sequencing data comprises:
a first unit 401, configured to perform correction processing on RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, where the correction processing is used to eliminate noise data in the RNA sequencing data;
The correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used for extracting the noise information distribution in an RNA sequencing sample and the biological information distribution in the corrected RNA sequencing sample, and for decoding and restoring the RNA sequencing sample by combining the noise information distribution and the biological information distribution to obtain a reduced RNA sequencing sample;
the variational autoencoder is trained at least based on the reduced RNA sequencing sample and the discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating the biological information and the noise information in the noise information distribution, and discriminating the noise information and the biological information in the biological information distribution.
In an alternative embodiment of the present application, the correction model includes: a first encoder and a first decoder;
correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the method comprises the following steps of:
Performing biological dimension coding processing on the RNA sequencing data by using the first coder to obtain biological information distribution of the RNA sequencing data;
and decoding and reducing the biological information distribution by using the first decoder to obtain the corrected RNA sequencing data.
In an alternative embodiment of the present application, the correction model is trained by:
carrying out noise dimension coding treatment on the RNA sequencing sample to obtain noise information distribution of the RNA sequencing sample;
coding the corrected RNA sequencing sample to obtain biological information distribution of the RNA sequencing sample;
performing joint decoding reduction treatment on the noise information distribution and the biological information distribution to obtain a reduced RNA sequencing sample;
optimizing the variational autoencoder based on the difference between the RNA sequencing sample and the reduced RNA sequencing sample to train the correction model.
In an optional embodiment of the present application, the coding processing of the noise dimension is performed on the RNA sequencing sample, so as to obtain noise information distribution of the RNA sequencing sample; coding the corrected RNA sequencing sample to obtain biological information distribution of the RNA sequencing sample; performing joint decoding reduction processing on the noise information distribution and the biological information distribution to obtain a reduced RNA sequencing sample, wherein the method comprises the following steps of:
Performing noise dimension coding processing on the RNA sequencing sample by using a second coder to obtain noise information distribution of the RNA sequencing sample;
encoding the corrected RNA sequencing sample by using a third encoder to obtain biological information distribution of the RNA sequencing sample;
and carrying out joint decoding reduction treatment on the noise information distribution and the biological information distribution by using a second decoder to obtain a reduced RNA sequencing sample.
In an alternative embodiment of the present application, the training of the correction model further includes:
respectively carrying out discrimination processing on the noise information distribution and the biological information distribution by using a first noise information discriminator to obtain a first discrimination result aiming at the noise information in the noise information distribution and a second discrimination result aiming at the noise information in the biological information distribution;
performing discrimination processing on the noise information distribution and the biological information distribution by using a first biological information discriminator to obtain a third discrimination result for biological information in the noise information distribution and a fourth discrimination result for biological information in the biological information distribution;
And optimizing a first encoder and the second encoder based on the first discrimination result, the second discrimination result, the third discrimination result, and the fourth discrimination result to train the correction model.
In an alternative embodiment of the present application, the training of the correction model further includes: optimizing the variational autoencoder based on the difference between the noise information distribution and the prior assumption, and the difference between the biological information distribution and the prior assumption, to train the correction model; wherein the prior assumption is a normal distribution.
In an alternative embodiment of the present application, the training of the correction model further includes: discriminating biological information of the RNA sequencing sample and the corrected RNA sequencing sample by using a second biological information discriminator to obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample;
utilizing a second noise information discriminator to discriminate noise information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtaining a seventh discrimination result for the RNA sequencing sample and an eighth discrimination result for the corrected RNA sequencing sample;
optimizing the variational autoencoder based on the fifth, sixth, seventh, and eighth discrimination results to train the correction model.
In an alternative embodiment of the present application, the training of the correction model further includes: acquiring an RNA sequencing data set, wherein the RNA sequencing data set comprises a plurality of corrected RNA sequencing data;
and classifying the corrected RNA sequencing data to obtain the RNA sequencing data of different categories.
The correction device for RNA sequencing data provided in this embodiment belongs to the same application concept as the correction method for RNA sequencing data provided in the above embodiments of the present application. It can execute the correction method for RNA sequencing data provided in any of the above embodiments, and it has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in detail in this embodiment, refer to the specific processing content of the correction method for RNA sequencing data provided in the above embodiments of the present application, which is not repeated here.
Exemplary electronic device
An electronic device is further provided in another embodiment of the present application, please refer to fig. 5, fig. 5 is a schematic structural diagram of an electronic device provided in another embodiment of the present application.
As shown in fig. 5, the electronic device includes:
a memory 200 and a processor 210;
wherein the memory 200 is connected to the processor 210, and is used for storing a program;
the processor 210 is configured to implement the method for correcting RNA sequencing data disclosed in any of the above embodiments by executing the program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are interconnected by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
Processor 210 may be a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, or may be an application-specific integrated circuit (ASIC) or one or more integrated circuits for controlling the execution of programs in accordance with aspects of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Processor 210 may include a main processor, and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for implementing the technical scheme of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer-operating instructions. More specifically, the memory 200 may include read-only memory (ROM), other types of static storage devices that may store static information and instructions, random access memory (random access memory, RAM), other types of dynamic storage devices that may store information and instructions, disk storage, flash, and the like.
The input device 230 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include means, such as a display screen, printer, speakers, etc., that allow information to be output to a user.
The communication interface 220 may include any transceiver-type device for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
Processor 210 executes programs stored in memory 200 and invokes other devices that may be used to implement the various steps of any of the methods for correcting RNA sequencing data provided in the above-described embodiments of the present application.
Exemplary computer program product and storage Medium
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a method of correcting RNA sequencing data according to various embodiments of the present application described in the "exemplary methods" section of the present specification.
The computer program product may write program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a storage medium having stored thereon a computer program that is executed by a processor to perform the steps in the method for correcting RNA sequencing data according to various embodiments of the present application described in the above "exemplary method" section of the present specification, and specifically may implement the following steps:
s101, correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correcting the RNA sequencing data is used for eliminating noise data in the RNA sequencing data;
the correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used for extracting the noise information distribution in an RNA sequencing sample and the biological information distribution in the corrected RNA sequencing sample, and for decoding and restoring the RNA sequencing sample by combining the noise information distribution and the biological information distribution to obtain a reduced RNA sequencing sample;
the variational autoencoder is trained at least based on the reduced RNA sequencing sample and the discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating the biological information in the noise information distribution and discriminating the noise information in the biological information distribution.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts described, as some acts may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
The steps in the method of each embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.
The modules and sub-modules in the device and the terminal of the embodiments of the present application may be combined, divided, and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in each embodiment of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A method of correcting RNA sequencing data, comprising:
performing correction processing on the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction processing is used for eliminating noise data in the RNA sequencing data, the correction model comprises a first encoder and a first decoder, the first encoder is used for performing biological dimension encoding processing on the RNA sequencing data to obtain a biological information distribution of the RNA sequencing data, and the first decoder is used for performing decoding reduction processing on the biological information distribution to obtain the corrected RNA sequencing data;
wherein the correction model is obtained by training a variational self-encoder; the variational self-encoder comprises the correction model, a second encoder, a third encoder and a second decoder, wherein the second encoder is used for extracting a noise information distribution in an RNA sequencing sample, the third encoder is used for extracting a biological information distribution in the corrected RNA sequencing sample, and the second decoder is used for combining the noise information distribution and the biological information distribution in the corrected RNA sequencing sample and performing decoding reduction processing on the RNA sequencing sample to obtain a reduced RNA sequencing sample;
and training of the variational self-encoder comprises at least: training the variational self-encoder based on the reduced RNA sequencing sample and the RNA sequencing sample, and optimizing the first encoder and the second encoder based on discrimination results for the noise information distribution and for the biological information distribution in the RNA sequencing sample obtained by the first encoder; wherein the discrimination of the noise information distribution and of the biological information distribution in the RNA sequencing sample obtained by the first encoder comprises: performing discrimination processing on the noise information distribution and the biological information distribution, respectively, by using a first noise information discriminator to obtain a first discrimination result for the noise information in the noise information distribution and a second discrimination result for the noise information in the biological information distribution; and performing discrimination processing on the noise information distribution and the biological information distribution, respectively, by using a first biological information discriminator to obtain a third discrimination result for the biological information in the noise information distribution and a fourth discrimination result for the biological information in the biological information distribution.
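The inference path of claim 1 (first encoder, then first decoder) can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the single linear layers, the randomly initialised weights from the hypothetical `init_params` helper, and the choice to decode from the mean of the biological information distribution are all assumptions made for the example.

```python
import random

def linear(x, weights, bias):
    """Dense layer on plain lists: returns weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_params(n_genes, n_latent, seed=0):
    """Hypothetical random initialisation; a real model would be trained."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_mu": mat(n_latent, n_genes), "b_mu": [0.0] * n_latent,
        "w_logvar": mat(n_latent, n_genes), "b_logvar": [0.0] * n_latent,
        "w_dec": mat(n_genes, n_latent), "b_dec": [0.0] * n_genes,
    }

def first_encoder(x, params):
    """Encode an expression vector into the mean and log-variance of the
    biological information distribution (assumed diagonal Gaussian)."""
    mu = linear(x, params["w_mu"], params["b_mu"])
    logvar = linear(x, params["w_logvar"], params["b_logvar"])
    return mu, logvar

def first_decoder(z, params):
    """Decode a biological latent vector back to expression space."""
    return linear(z, params["w_dec"], params["b_dec"])

def correct(x, params):
    """Correction model: encode, then decode from the latent mean
    (using the mean at inference time is an assumption)."""
    mu, _logvar = first_encoder(x, params)
    return first_decoder(mu, params)

params = init_params(n_genes=4, n_latent=2)
corrected = correct([1.0, 2.0, 0.0, 3.0], params)
```

Only the first encoder and first decoder are needed once training is finished; the second and third encoders, the second decoder, and the discriminators take part in training only.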
2. The method of claim 1, wherein the correction model is trained by:
performing noise dimension encoding processing on the RNA sequencing sample to obtain the noise information distribution of the RNA sequencing sample;
encoding the corrected RNA sequencing sample to obtain the biological information distribution of the RNA sequencing sample;
performing joint decoding reduction processing on the noise information distribution and the biological information distribution to obtain the reduced RNA sequencing sample;
optimizing the variational self-encoder based on the difference between the RNA sequencing sample and the reduced RNA sequencing sample to train the correction model.
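The training pass of claim 2 can be sketched as a data flow. The sketch below is hedged: the claims do not state how the two distributions are joined or how the "difference" is measured, so concatenating the two latent vectors and using mean squared error are assumptions, and the `toy_*` functions are hypothetical stand-ins for trained networks.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def training_forward(sample, correction_model, second_encoder,
                     third_encoder, second_decoder):
    """One forward pass: returns the reduced (reconstructed) sample and
    the reconstruction loss against the original sample."""
    corrected = correction_model(sample)   # first encoder + first decoder
    z_noise = second_encoder(sample)       # noise information
    z_bio = third_encoder(corrected)       # biological information
    reduced = second_decoder(z_noise + z_bio)  # joint decoding (concatenation assumed)
    return reduced, mse(sample, reduced)

# Hypothetical stand-ins: "biology" is the mean level, "noise" is the residual.
def toy_correction(x):
    m = sum(x) / len(x)
    return [m] * len(x)

def toy_second_encoder(x):
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def toy_third_encoder(x):
    return list(x)

def toy_second_decoder(z):
    n = len(z) // 2
    return [z[i] + z[n + i] for i in range(n)]

reduced, loss = training_forward(
    [1.0, 2.0, 3.0],
    toy_correction, toy_second_encoder, toy_third_encoder, toy_second_decoder,
)
```

In a real training loop the loss would be back-propagated through all four networks; the stand-ins here only show which component consumes which input.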
3. The method as recited in claim 2, further comprising:
optimizing the first encoder and the second encoder based on the first discrimination result, the second discrimination result, the third discrimination result, and the fourth discrimination result to train the correction model.
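One way to turn the four discrimination results into a training signal is a binary cross-entropy objective. This is an assumed formulation, not one stated in the claims: it supposes each discriminator outputs a probability, that the noise discriminator should fire on the noise distribution (first result) and not on the biological distribution (second result), and the reverse for the biological discriminator (third and fourth results).

```python
import math

def bce(p, target, eps=1e-12):
    """Binary cross-entropy for one probability, clamped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def disentanglement_loss(first, second, third, fourth):
    """Sum of BCE terms over the four discrimination results.

    first:  noise discriminator on the noise distribution      (target 1)
    second: noise discriminator on the biological distribution (target 0)
    third:  bio discriminator on the noise distribution        (target 0)
    fourth: bio discriminator on the biological distribution   (target 1)
    The targets are assumptions made for this sketch.
    """
    return (bce(first, 1.0) + bce(second, 0.0)
            + bce(third, 0.0) + bce(fourth, 1.0))

well_separated = disentanglement_loss(1.0, 0.0, 0.0, 1.0)
uninformative = disentanglement_loss(0.5, 0.5, 0.5, 0.5)
```

Under these assumptions the loss is near zero when the two latent spaces are cleanly separated and equals 4 ln 2 when the discriminators cannot tell them apart.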
4. The method as recited in claim 2, further comprising:
optimizing the variational self-encoder based on the difference between the noise information distribution and a prior assumption, and the difference between the biological information distribution and the prior assumption, to train the correction model; wherein the prior assumption is a normal distribution.
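For a variational autoencoder, the difference between a latent distribution and a normal prior is commonly measured with the Kullback-Leibler divergence. The claim does not name the measure, so the sketch below assumes a diagonal Gaussian latent distribution and a standard normal prior, for which the divergence has the closed form 0.5 * sum(mu^2 + exp(logvar) - logvar - 1).

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))

def prior_loss(noise_mu, noise_logvar, bio_mu, bio_logvar):
    """Sum of the prior terms for the noise and biological distributions
    (equal weighting of the two terms is an assumption)."""
    return (kl_to_standard_normal(noise_mu, noise_logvar)
            + kl_to_standard_normal(bio_mu, bio_logvar))

matches_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
shifted_mean = kl_to_standard_normal([1.0], [0.0])
```

The term is zero when the latent distribution already equals the prior and grows as the mean shifts or the variance departs from one.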
5. The method as recited in claim 2, further comprising:
discriminating biological information of the RNA sequencing sample and the corrected RNA sequencing sample by using a second biological information discriminator to obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample;
utilizing a second noise information discriminator to discriminate noise information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtaining a seventh discrimination result for the RNA sequencing sample and an eighth discrimination result for the corrected RNA sequencing sample;
optimizing the variational self-encoder based on the fifth, sixth, seventh, and eighth discrimination results to train the correction model.
6. The method as recited in claim 1, further comprising:
acquiring an RNA sequencing data set, wherein the RNA sequencing data set comprises a plurality of corrected RNA sequencing data;
and classifying the corrected RNA sequencing data to obtain the RNA sequencing data of different categories.
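The classification step of claim 6 can be illustrated with a small clustering routine. The claim does not specify a classifier, so k-means is an assumed choice here, with the first `k` points used as initial centroids to keep the sketch deterministic; the toy data stands in for corrected expression vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Plain k-means; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # deterministic initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical corrected data: two well-separated groups of samples.
corrected_data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels, centroids = kmeans(corrected_data, k=2)
```

Any classifier or clustering method could take the place of k-means; the point is only that the grouping is computed on the corrected data rather than on the raw data.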
7. A device for correcting RNA sequencing data, comprising:
a first unit, used for performing correction processing on the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction processing is used for eliminating noise data in the RNA sequencing data, the correction model comprises a first encoder and a first decoder, the first encoder is used for performing biological dimension encoding processing on the RNA sequencing data to obtain a biological information distribution of the RNA sequencing data, and the first decoder is used for performing decoding reduction processing on the biological information distribution to obtain the corrected RNA sequencing data;
wherein the correction model is obtained by training a variational self-encoder; the variational self-encoder comprises the correction model, a second encoder, a third encoder and a second decoder, wherein the second encoder is used for extracting a noise information distribution in an RNA sequencing sample, the third encoder is used for extracting a biological information distribution in the corrected RNA sequencing sample, and the second decoder is used for combining the noise information distribution and the biological information distribution in the corrected RNA sequencing sample and performing decoding reduction processing on the RNA sequencing sample to obtain a reduced RNA sequencing sample;
and training of the variational self-encoder comprises at least: training the variational self-encoder based on the reduced RNA sequencing sample and the RNA sequencing sample, and optimizing the first encoder and the second encoder based on discrimination results for the noise information distribution and for the biological information distribution in the RNA sequencing sample obtained by the first encoder; wherein the discrimination of the noise information distribution and of the biological information distribution in the RNA sequencing sample obtained by the first encoder comprises: performing discrimination processing on the noise information distribution and the biological information distribution, respectively, by using a first noise information discriminator to obtain a first discrimination result for the noise information in the noise information distribution and a second discrimination result for the noise information in the biological information distribution; and performing discrimination processing on the noise information distribution and the biological information distribution, respectively, by using a first biological information discriminator to obtain a third discrimination result for the biological information in the noise information distribution and a fourth discrimination result for the biological information in the biological information distribution.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to execute the method for correcting RNA sequencing data according to any one of claims 1 to 6 by executing the instructions in the memory.
9. A computer storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, performs the method of correcting RNA sequencing data according to any of the preceding claims 1-6.
CN202311388051.1A 2023-10-25 2023-10-25 Correction method and device for RNA sequencing data, electronic equipment and storage medium Active CN117116350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311388051.1A CN117116350B (en) 2023-10-25 2023-10-25 Correction method and device for RNA sequencing data, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311388051.1A CN117116350B (en) 2023-10-25 2023-10-25 Correction method and device for RNA sequencing data, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117116350A (en) 2023-11-24
CN117116350B (en) 2024-02-27

Family

ID=88798856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311388051.1A Active CN117116350B (en) 2023-10-25 2023-10-25 Correction method and device for RNA sequencing data, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117116350B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110106178A (en) * 2019-05-17 2019-08-09 淮海工学院 The DNAzyme and screening technique and purposes for identifying and cutting for Vibrio anguillarum
CN110929733A (en) * 2019-12-09 2020-03-27 上海眼控科技股份有限公司 Denoising method and device, computer equipment, storage medium and model training method
CN111523668A (en) * 2020-05-06 2020-08-11 支付宝(杭州)信息技术有限公司 Training method and device of data generation system based on differential privacy
CN111808854A (en) * 2020-07-09 2020-10-23 中国农业科学院农业基因组研究所 Balanced joint with molecular bar code and method for quickly constructing transcriptome library
CN111966998A (en) * 2020-07-23 2020-11-20 华南理工大学 Password generation method, system, medium, and apparatus based on variational automatic encoder
CN112397149A (en) * 2020-11-11 2021-02-23 天津现代创新中药科技有限公司 Transcriptome analysis method and system without reference genome sequence
CN113642716A (en) * 2021-08-31 2021-11-12 南方电网数字电网研究院有限公司 Depth variation autoencoder model training method, device, equipment and storage medium
CN114067915A (en) * 2021-11-22 2022-02-18 湖南大学 scRNA-seq data dimension reduction method based on deep antithetical variational self-encoder
WO2022141714A1 (en) * 2020-12-30 2022-07-07 科大讯飞股份有限公司 Information synthesis method and apparatus, electronic device, and computer readable storage medium
CN115052984A (en) * 2019-06-21 2022-09-13 奥基生物技术公司 Methods and systems for identifying target genes
WO2023025956A1 (en) * 2021-08-27 2023-03-02 NEC Laboratories Europe GmbH Method and system for deconvolution of bulk rna-sequencing data
WO2023025419A1 (en) * 2021-08-27 2023-03-02 NEC Laboratories Europe GmbH Method and system for deconvolution of bulk rna-sequencing data
WO2023123941A1 (en) * 2021-12-31 2023-07-06 深圳前海微众银行股份有限公司 Data anomaly detection method and apparatus
CN116580769A (en) * 2023-05-31 2023-08-11 平安科技(深圳)有限公司 Single-cell RNA sequencing data processing method, device, equipment and storage medium
CN116597274A (en) * 2023-05-02 2023-08-15 西北工业大学 Unsupervised bidirectional variation self-coding essential image decomposition network, method and application
CN116825186A (en) * 2023-06-19 2023-09-29 西北工业大学 Single cell data batch effect correction method based on generation of countermeasure network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021059348A1 (en) * 2019-09-24 2021-04-01 富士通株式会社 Learning method, learning program, and learning device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Simultaneous dimensionality reduction and integration for single-cell ATAC-seq data using deep learning; Wolfgang Kopp et al.; Nature Machine Intelligence; Vol. 4; pp. 162-168 *
Sparsely-connected autoencoder (SCA) for single cell RNAseq data mining; Luca Alessandri et al.; npj Systems Biology and Applications; pp. 1-10 *
VEGA is an interpretable generative model for inferring biological network activity in single-cell transcriptomics; Lucas Seninge et al.; Nature Communications; pp. 1-9 *
Screening and annotation of differentially expressed genes in the longissimus dorsi muscle of different pig breeds based on RNA-seq technology; Qian Kun et al.; Journal of Northwest A&F University (Natural Science Edition); Vol. 44, No. 6; pp. 1-8 *
Research on spatial transcriptome cell clustering based on variational autoencoders; Liu Teng et al.; Chinese Journal of Bioinformatics; pp. 1-10 *

Also Published As

Publication number Publication date
CN117116350A (en) 2023-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant