CN117116350B

CN117116350B - Correction method and device for RNA sequencing data, electronic equipment and storage medium

Info

Publication number: CN117116350B
Application number: CN202311388051.1A
Authority: CN
Inventors: 钱坤; 李若男; 刘万飞; 林强; 崔鹏
Original assignee: Shenzhen Institute Of Agricultural Genome Chinese Academy Of Agricultural Sciences Shenzhen Branch Of Guangdong Provincial Laboratory Of Lingnan Modern Agricultural Science And Technology
Current assignee: Shenzhen Institute Of Agricultural Genome Chinese Academy Of Agricultural Sciences Shenzhen Branch Of Guangdong Provincial Laboratory Of Lingnan Modern Agricultural Science And Technology
Priority date: 2023-10-25
Filing date: 2023-10-25
Publication date: 2024-02-27
Anticipated expiration: 2043-10-25
Also published as: CN117116350A

Abstract

The application provides a method and device for correcting RNA sequencing data, an electronic device, and a storage medium. The method comprises: correcting RNA sequencing data with a pre-constructed correction model to obtain corrected RNA sequencing data. The correction model is obtained by training a variational autoencoder. The variational autoencoder comprises the correction model and is used to extract the noise information distribution in an RNA sequencing sample and the biological information distribution in the corrected RNA sequencing sample, and to decode and restore the RNA sequencing sample by combining the noise information distribution and the biological information distribution, obtaining a restored RNA sequencing sample. The variational autoencoder is trained based at least on the restored RNA sequencing sample and the RNA sequencing sample, and on discrimination results for the noise information distribution and the biological information distribution.

Description

Correction method and device for RNA sequencing data, electronic equipment and storage medium

Technical Field

The application relates to the technical field of RNA sequencing, in particular to a correction method and device of RNA sequencing data, electronic equipment and a storage medium.

Background

Transcriptome sequencing (RNA-seq) technology, which is based on second-generation high-throughput DNA sequencing, provides full transcript information at single-base resolution and has become an indispensable tool in molecular biology. Today, RNA-seq is widely used in fields such as gene expression quantification, transcription start site identification, functional identification of non-coding RNA, and single-cell analysis.

The accumulation of high-throughput sequencing data makes it increasingly feasible to discover biological patterns through integrated analysis of large numbers of public transcriptome sequencing datasets. However, correcting the noise that batch effects introduce into such large-scale datasets has become a primary problem.

Disclosure of Invention

In order to solve the technical problems, the application provides a method and a device for correcting RNA sequencing data, electronic equipment and a storage medium.

According to a first aspect of embodiments of the present application, there is provided a method of correcting RNA sequencing data, comprising:

correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correcting process is performed on the RNA sequencing data and is used for eliminating noise data in the RNA sequencing data;

the correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used to extract the noise information distribution in an RNA sequencing sample and the biological information distribution in the corrected RNA sequencing sample, and to decode and restore the RNA sequencing sample by combining the noise information distribution and the biological information distribution, obtaining a restored RNA sequencing sample;

the variational autoencoder is trained based at least on the restored RNA sequencing sample and on discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating noise information and biological information in the noise information distribution, and discriminating noise information and biological information in the biological information distribution.

In an alternative embodiment of the present application, the correction model includes: a first encoder and a first decoder;

correcting the RNA sequencing data by using the pre-constructed correction model to obtain corrected RNA sequencing data comprises:

performing biological-dimension encoding on the RNA sequencing data by using the first encoder to obtain the biological information distribution of the RNA sequencing data;

and decoding and restoring the biological information distribution by using the first decoder to obtain the corrected RNA sequencing data.

In an alternative embodiment of the present application, the correction model is trained by:

performing noise-dimension encoding on the RNA sequencing sample to obtain the noise information distribution of the RNA sequencing sample;

encoding the corrected RNA sequencing sample to obtain the biological information distribution of the RNA sequencing sample;

performing joint decoding and restoration on the noise information distribution and the biological information distribution to obtain a restored RNA sequencing sample;

optimizing the variational autoencoder based on the difference between the RNA sequencing sample and the restored RNA sequencing sample, so as to train the correction model.

In an optional embodiment of the present application, performing noise-dimension encoding on the RNA sequencing sample to obtain the noise information distribution of the RNA sequencing sample, encoding the corrected RNA sequencing sample to obtain the biological information distribution of the RNA sequencing sample, and performing joint decoding and restoration on the noise information distribution and the biological information distribution to obtain a restored RNA sequencing sample comprises:

performing noise-dimension encoding on the RNA sequencing sample by using a second encoder to obtain the noise information distribution of the RNA sequencing sample;

encoding the corrected RNA sequencing sample by using a third encoder to obtain the biological information distribution of the RNA sequencing sample;

and performing joint decoding and restoration on the noise information distribution and the biological information distribution by using a second decoder to obtain a restored RNA sequencing sample.

In an alternative embodiment of the present application, further comprising:

respectively carrying out discrimination processing on the noise information distribution and the biological information distribution by using a first noise information discriminator to obtain a first discrimination result aiming at the noise information in the noise information distribution and a second discrimination result aiming at the noise information in the biological information distribution;

performing discrimination processing on the noise information distribution and the biological information distribution by using a first biological information discriminator to obtain a third discrimination result for biological information in the noise information distribution and a fourth discrimination result for biological information in the biological information distribution;

and optimizing the first encoder and the second encoder based on the first discrimination result, the second discrimination result, the third discrimination result, and the fourth discrimination result, so as to train the correction model.

In an alternative embodiment of the present application, further comprising:

optimizing the variational autoencoder based on the difference between the noise information distribution and the prior assumption, and the difference between the biological information distribution and the prior assumption, so as to train the correction model; wherein the prior assumption is a normal distribution.

In an alternative embodiment of the present application, further comprising:

discriminating biological information of the RNA sequencing sample and the corrected RNA sequencing sample by using a second biological information discriminator to obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample;

utilizing a second noise information discriminator to discriminate noise information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtaining a seventh discrimination result for the RNA sequencing sample and an eighth discrimination result for the corrected RNA sequencing sample;

optimizing the variational autoencoder based on the fifth, sixth, seventh, and eighth discrimination results, so as to train the correction model.

In an alternative embodiment of the present application, further comprising:

acquiring an RNA sequencing data set, wherein the RNA sequencing data set comprises a plurality of corrected RNA sequencing data;

and classifying the corrected RNA sequencing data to obtain the RNA sequencing data of different categories.

According to a second aspect of embodiments of the present application, there is provided a correction device for RNA sequencing data, comprising:

the first unit is used for correcting the RNA sequencing data by utilizing a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correcting process is carried out on the RNA sequencing data and is used for eliminating noise data in the RNA sequencing data;

According to a third aspect of embodiments of the present application, there is provided an electronic device, including:

a processor;

a memory for storing the processor-executable instructions;

the processor is used for executing the correction method of the RNA sequencing data by running the instructions in the memory.

According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium storing a computer program which, when executed by a processor, performs the method of correcting RNA sequencing data described above.

The application provides a method and device for correcting RNA sequencing data, an electronic device, and a storage medium. The method comprises: correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction eliminates noise data in the RNA sequencing data. The correction model is obtained by training a variational autoencoder; the variational autoencoder comprises the correction model and is used to extract the noise information distribution in an RNA sequencing sample and the biological information distribution in the corrected RNA sequencing sample, and to decode and restore the RNA sequencing sample by combining the noise information distribution and the biological information distribution, obtaining a restored RNA sequencing sample. The variational autoencoder is trained based at least on the restored RNA sequencing sample and on discrimination results for the noise information distribution and the biological information distribution; wherein the discrimination of the noise information distribution and the biological information distribution includes: discriminating noise information and biological information in the noise information distribution, and discriminating noise information and biological information in the biological information distribution.

In this method, the RNA sequencing data is corrected by the correction model. The correction model is trained by training a variational autoencoder that contains it, together with discrimination results for the noise information distribution and the biological information distribution in the RNA sequencing data, which improves the accuracy with which the correction model corrects RNA sequencing data.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a method for correcting RNA sequencing data according to an embodiment of the present application.

Fig. 2 is a flowchart of preprocessing RNA sequencing data according to another embodiment of the present application.

Fig. 3 is a schematic structural diagram of a variational autoencoder according to another embodiment of the present application.

Fig. 4 is a schematic structural diagram of a correction device for RNA sequencing data according to another embodiment of the present application.

Fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

In order to solve the above technical problems, the present application provides a method, an apparatus, an electronic device and a storage medium for correcting RNA sequencing data, which are described in detail in the following examples.

Exemplary method

The embodiment of the application first provides a method for correcting RNA sequencing data, which aims to remove noise data (also called batch data) from the RNA sequencing data while preserving the biological information it contains, yielding noise-free RNA sequencing data.

Here, RNA sequencing data can be understood as RNA-seq data, that is, transcriptome data obtained by RNA sequencing technology. It contains information on all RNA transcribed in cells or tissues and can be used to study gene expression levels, splice variation, transcript diversity, transcription start sites, transcription factor binding sites, and so on. In embodiments of the present application, RNA-seq data can be generated by high-throughput sequencing techniques.

In the field of biology, biological information in the RNA sequencing data mainly includes the following aspects:

1. Tissue source: indicates the tissue from which the RNA sequencing sample is derived. Samples may come from different tissues, such as liver, heart, or lung, each of which has specific characteristics and expression profiles at the gene expression level.

2. Physiological state: indicates the health status of the source of the RNA sequencing sample. Samples may come from a healthy individual or from an individual with a disease, in which case the transcriptome may be altered; for example, the expression of immune-related genes may be up-regulated under inflammatory conditions.

3. Developmental stage: indicates the developmental stage of the individual from which the RNA sequencing sample is derived. Samples may come from individuals at different developmental stages, such as embryos, young children, or adults, and the transcriptome changes as the developmental stage changes.

4. Treatment or stimulation: indicates whether the RNA sequencing sample has been subjected to an external treatment or stimulation, such as drug treatment or viral infection, which causes changes in the transcriptome.

Referring to fig. 1, fig. 1 is a flowchart of a method for correcting RNA sequencing data according to an embodiment of the present application.

As shown in fig. 1, the method for correcting RNA sequencing data comprises:

Step S101: correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction eliminates noise data in the RNA sequencing data.

In an embodiment of the application, the correction model is itself essentially a variational autoencoder (VAE) comprising two parts: a first encoder, which compresses a high-dimensional input into a low-dimensional hidden variable, and a first decoder, which decodes the low-dimensional hidden variable and reconstructs the high-dimensional input.

In the embodiment of the application, the RNA sequencing data includes not only biological information but also noise information. For example, when RNA sequencing data from different laboratories are integrated, batch effects may arise because the laboratories use different instruments and experimental conditions.

Further, the first encoder is specifically configured to perform biological-dimension encoding on the RNA sequencing data to obtain the biological information distribution of the RNA sequencing data; the first decoder is configured to decode and restore the biological information distribution to obtain the corrected RNA sequencing data.

In an alternative embodiment of the present application, the RNA sequencing data is further pre-processed before being subjected to the correction process using the pre-constructed correction model, in order to make the RNA sequencing data available for input to the convolutional neural network.

Specifically, the process of preprocessing the RNA sequencing data includes:

First, the RNA sequencing data is expressed as a TPM (Transcripts Per Million) expression matrix. The TPM values are then normalized (in an alternative embodiment of the application, the library size of the normalized TPM is 1,000,000), log-transformed with an offset of 1 and a base of 2, and finally max-normalized at the gene level so that the log-transformed TPM values lie between 0 and 1.
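The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the applicant's implementation: the function name `preprocess_tpm` is hypothetical, and "maximum standardization at the gene level" is assumed here to mean dividing each gene by its maximum across samples.

```python
import math


def preprocess_tpm(tpm, library_size=1_000_000):
    """Normalize a samples-by-genes TPM matrix (a list of lists) into [0, 1].

    Assumed reading of the text: rescale each sample to a fixed library
    size, apply log2(x + 1), then divide each gene by its maximum value
    across samples.
    """
    n_genes = len(tpm[0])

    # Step 1: rescale every sample so that its values sum to library_size.
    scaled = []
    for sample in tpm:
        total = sum(sample)
        factor = library_size / total if total > 0 else 0.0
        scaled.append([value * factor for value in sample])

    # Step 2: log transform with offset 1 and base 2.
    logged = [[math.log2(value + 1.0) for value in sample] for sample in scaled]

    # Step 3: per-gene max normalization; all-zero genes stay at 0.
    gene_max = [max(sample[g] for sample in logged) for g in range(n_genes)]
    return [
        [value / gene_max[g] if gene_max[g] > 0 else 0.0
         for g, value in enumerate(sample)]
        for sample in logged
    ]
```

With this reading, every gene reaches exactly 1 in the sample where it is most highly expressed, and genes that are zero everywhere remain zero.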

Because the model framework of the present application is based on a convolutional neural network (CNN), the normalized TPM values (a one-dimensional vector) can be converted into a two-dimensional image to improve the training efficiency of the subsequent model. Specifically, please refer to fig. 2, which is a flowchart of preprocessing RNA sequencing data provided in another embodiment of the present application.

As shown in fig. 2, the two-dimensional image conversion of the normalized TPM includes: (1) performing nonlinear dimensionality reduction on all genes in the TPM matrix using the t-SNE algorithm, mapping each gene to a point on two-dimensional coordinate axes; (2) using a convex hull algorithm to find the smallest enclosure that contains all genes in the coordinate system; (3) converting the genes from the Cartesian coordinate system to pixels in a two-dimensional matrix using rotation and feature localization. That is, in this embodiment, the RNA sequencing sample is input to the correction model in the form of an image.
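The last step, turning gene coordinates into pixels, can be illustrated with a simplified sketch. It assumes the two-dimensional gene coordinates have already been produced (for example by t-SNE) and already rotated, so it uses an axis-aligned bounding box instead of the convex hull and rotation described above. The function names and the choice to average genes that fall on the same pixel are assumptions made for illustration.

```python
def coords_to_pixels(coords, size):
    """Map 2-D gene coordinates to (row, col) positions on a size x size grid."""
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid division by zero
    span_y = (max(ys) - min_y) or 1.0
    pixels = []
    for x, y in coords:
        col = min(int((x - min_x) / span_x * size), size - 1)
        row = min(int((y - min_y) / span_y * size), size - 1)
        pixels.append((row, col))
    return pixels


def sample_to_image(values, pixels, size):
    """Place one sample's gene values on the grid.

    Genes that share a pixel are averaged (an assumption); empty pixels are 0.
    """
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for value, (row, col) in zip(values, pixels):
        sums[row][col] += value
        counts[row][col] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else 0.0 for c in range(size)]
        for r in range(size)
    ]
```

The pixel layout is computed once from the gene coordinates and then reused for every sample, so all samples share the same gene-to-pixel mapping.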

Further, in order to facilitate understanding of the training process of the correction model mentioned in the embodiments of the present application, the training process of the correction model is described in detail below with reference to fig. 3.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a variational autoencoder according to another embodiment of the present application.

The correction model is included in the variational autoencoder, and in the embodiment of the application, the correction model is trained by training the variational autoencoder.

As shown in fig. 3, the variational autoencoder includes: a first encoder 101, a first decoder 102, a second encoder 103, a third encoder 104, and a second decoder 105.

Wherein the first encoder 101 and the first decoder 102 constitute the correction model;

the second encoder 103 is configured to perform noise-dimension encoding on the RNA sequencing sample $x$ to obtain the noise information distribution $z_b$ of the RNA sequencing sample.

The first encoder 101 is used to perform biological-dimension encoding on the RNA sequencing sample $x$ to obtain the biological information distribution $z_c$ of the RNA sequencing sample; the first decoder 102 is used to decode the biological information distribution $z_c$ to obtain the corrected RNA sequencing sample $x_{bf}$ (i.e., the noise-free RNA sequencing sample).

Specifically, in this embodiment, a $q(z_b \mid x)$ model is constructed using the second encoder 103 and maps the RNA sequencing sample $x$ directly to the hidden layer, representing the noise information distribution $z_b$; a $q(z_c \mid x)$ model is constructed using the first encoder 101 and maps the RNA sequencing sample $x$ directly to the hidden layer, representing the biological information distribution $z_c$.

Consistent with the original variational autoencoder, the first encoder 101 and the second encoder 103 do not directly embed the input $x$ as the hidden layer representation $z$. Instead, they assume that $p(z)$ follows a standard Gaussian prior, first reduce the dimensionality to obtain the hidden-layer mean $\mu$ and variance $\sigma^2$, and then obtain the hidden layer $z$ by sampling.

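Wherein the noise information distribution is $z_b \sim \mathcal{N}(\mu_b, \sigma_b^2)$ and the biological information distribution is $z_c \sim \mathcal{N}(\mu_c, \sigma_c^2)$, where $\mu_b, \mu_c \in \mathbb{R}^J$, $\sigma_b^2, \sigma_c^2 \in \mathbb{R}^J$, and $J$ is the dimension of the hidden layer.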
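Sampling the hidden layer from a mean and a variance is commonly done with the reparameterization trick used in standard variational autoencoders. The sketch below assumes that technique; the text only states that the hidden layer is obtained by sampling, and the function name `sample_latent` is hypothetical.

```python
import math
import random


def sample_latent(mu, var, rng):
    """Draw z = mu + sigma * eps per dimension, with eps ~ N(0, 1).

    mu and var are lists of length J (the hidden-layer dimension);
    rng is a random.Random instance so that draws are reproducible.
    """
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


rng = random.Random(0)
z_b = sample_latent([0.0, 0.5], [1.0, 0.25], rng)  # toy noise latent, J = 2
```

Writing the sample as a deterministic function of the mean, the variance, and an independent noise term is what allows gradients to flow back into the encoder in an automatic-differentiation framework.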

In the present embodiment, the first decoder 102 is represented by $p(x_{bf} \mid z_c)$ and restores the biological information distribution $z_c$ to the original dimension, obtaining the RNA sequencing sample $x_{bf}$ that is free from the influence of noise information. Further, the third encoder 104 encodes the corrected RNA sequencing sample $x_{bf}$ to obtain the biological information distribution $z_c$ of the RNA sequencing sample.

Finally, the second decoder 105 is used to perform joint decoding and restoration on the noise information distribution $z_b$ and the biological information distribution $z_c$ to obtain the restored RNA sequencing sample $x'$.

Specifically, the second decoder 105 combines the biological information distribution and the noise information distribution to construct $p(x \mid z_b, z_c)$, which conforms to a multivariate Gaussian distribution, and reconstructs the input RNA sequencing sample $x$. The biological information and the noise information in the RNA sequencing sample $x$ are split using the first encoder 101 and the third encoder 104 and then combined and restored using the second decoder 105, so that the framework of the variational autoencoder is constructed in the absence of noise information.

As is apparent from the above description, the second encoder 103, the correction model composed of the first encoder 101 and the first decoder 102, the third encoder 104, and the second decoder 105 together constitute the framework of the variational autoencoder.
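The data flow between the five modules can be traced with a small sketch. The module behaviour here is deliberately trivial: the lambdas are hypothetical stand-ins for trained networks, and the "noise" is modelled as a constant offset purely for illustration. The sketch also assumes that the second decoder receives the biological latent produced by the third encoder, which is one reading of the description above.

```python
def forward_pass(x, encoder1, decoder1, encoder2, encoder3, decoder2):
    """Trace one sample through the five modules named in the text."""
    z_b = encoder2(x)               # second encoder 103: noise latent z_b
    z_c = encoder1(x)               # first encoder 101: biological latent z_c
    x_bf = decoder1(z_c)            # first decoder 102: corrected sample x_bf
    z_c_bf = encoder3(x_bf)         # third encoder 104: biological latent of x_bf
    x_rec = decoder2(z_b, z_c_bf)   # second decoder 105: restored sample x'
    return {"z_b": z_b, "z_c": z_c, "x_bf": x_bf, "z_c_bf": z_c_bf, "x_rec": x_rec}


# Toy stand-ins (hypothetical): the batch effect is a constant offset.
OFFSET = 1.0
toy = forward_pass(
    x=[1.5, 2.0, 3.0],
    encoder1=lambda x: [v - OFFSET for v in x],
    decoder1=lambda z_c: list(z_c),
    encoder2=lambda x: [OFFSET],
    encoder3=lambda x_bf: list(x_bf),
    decoder2=lambda z_b, z_c: [v + z_b[0] for v in z_c],
)
```

At inference time only `encoder1` and `decoder1` (the correction model) are needed; the remaining modules exist to provide training signals.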

In an alternative embodiment of the present application, the variational autoencoder may be optimized based on the difference between the RNA sequencing sample and the restored RNA sequencing sample, so as to train the correction model.

In this embodiment, the main purpose of the second decoder 105 in the variational autoencoder is to jointly decode the biological information distribution and the noise information distribution to obtain the restored RNA sequencing data. Denoting the RNA sequencing data as x and the restored RNA sequencing data as x', the variational autoencoder may be trained by constructing a log-likelihood so that the output x' is as similar as possible to the input x.

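Specifically, the log-likelihood can be represented by the following formula (1):

$$\log p(x) = \log \iint p(x \mid z_b, z_c)\, p(z_b, z_c)\, dz_b\, dz_c \tag{1}$$

wherein $z_b$ and $z_c$ are independent of each other, $z_b$ represents the noise information distribution, $z_c$ represents the biological information distribution, $p(z_b, z_c)$ represents the joint prior distribution of $z_b$ and $z_c$, and $p(x \mid z_b, z_c)$ represents the likelihood function given $z_b$ and $z_c$.

Since the integral is difficult to calculate, the posterior probability $p(z_b, z_c \mid x)$ is also difficult to calculate. Encoding networks $q(z_b \mid x)$ and $q(z_c \mid x)$ can therefore be constructed to approximate $p(z_b, z_c \mid x)$, and the log-likelihood can also be expressed by the following formula (2):

$$\log p(x) = \mathbb{E}_{q(z_b, z_c \mid x)}\big[\log p(x \mid z_b, z_c)\big] - \mathrm{KL}\big(q(z_b \mid x) \,\|\, p(z_b)\big) - \mathrm{KL}\big(q(z_c \mid x) \,\|\, p(z_c)\big) + \mathrm{KL}\big(q(z_b, z_c \mid x) \,\|\, p(z_b, z_c \mid x)\big) \tag{2}$$

wherein $\mathbb{E}_{q(z_b, z_c \mid x)}[\log p(x \mid z_b, z_c)]$ is the reconstruction loss of the variational autoencoder, and $p(x \mid z_b, z_c)$ represents the likelihood function given $z_b$ and $z_c$.

$\mathrm{KL}(q(z_b \mid x) \,\|\, p(z_b))$ and $\mathrm{KL}(q(z_c \mid x) \,\|\, p(z_c))$ are the KL divergences between the approximate posterior distributions and the prior distributions of the variational autoencoder: $q(z_b \mid x)$ represents the approximation of the posterior distribution of $z_b$ generated by the encoding network, $p(z_b)$ represents the prior distribution of $z_b$, $q(z_c \mid x)$ represents the approximation of the posterior distribution of $z_c$ generated by the encoding network, $p(z_c)$ represents the prior distribution of $z_c$, and $q(z_b, z_c \mid x)$ represents the joint posterior distribution of $z_b$ and $z_c$.

$\mathrm{KL}(q(z_b, z_c \mid x) \,\|\, p(z_b, z_c \mid x))$ is the KL divergence between the approximate posterior distribution and the true posterior distribution of the variational autoencoder, where $q(z_b, z_c \mid x)$ represents the approximation of the posterior distribution generated by the encoding network.

However, since the divergence between the approximate posterior distribution and the true posterior distribution is difficult to calculate, and the KL divergence is non-negative, the following loss function (3) can be constructed by maximizing the variational lower bound of the data log-likelihood:

$$\mathcal{L}(x) = \mathbb{E}_{q(z_b, z_c \mid x)}\big[\log p(x \mid z_b, z_c)\big] - \mathrm{KL}\big(q(z_b \mid x) \,\|\, p(z_b)\big) - \mathrm{KL}\big(q(z_c \mid x) \,\|\, p(z_c)\big) \tag{3}$$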

The first term on the right-hand side of formula (3) is the reconstruction loss, which aims to minimize the difference between the restored RNA sequencing sample x' and the RNA sequencing sample x, and can therefore be calculated using the following mean squared error formula (4):

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - x'_i\right)^2 \tag{4}$$

wherein $x$ represents the RNA sequencing sample, $x'$ represents the restored RNA sequencing sample, and $N$ is the number of elements in the sample.
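As a concrete illustration of the reconstruction term, the mean squared error between a sample and its restoration can be computed as follows. This is a plain-Python sketch with a hypothetical function name; a real training loop would use the equivalent loss from a deep learning framework.

```python
def mean_squared_error(x, x_restored):
    """Mean squared difference between a sample and its restoration."""
    if len(x) != len(x_restored):
        raise ValueError("x and x_restored must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_restored)) / len(x)


perfect = mean_squared_error([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
```

A perfect restoration gives a loss of zero, and the loss grows quadratically with the per-element error.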

The second and third terms on the right-hand side of formula (3) are KL-divergence regularization terms, whose purpose is to bring the approximate posterior distributions $q(z \mid x)$ of the hidden layers of the first encoder 101 and the second encoder 103 close to the prior assumption $p(z)$. They can be represented by the following formulas (5) and (6):

$$\mathrm{KL}\big(q(z_b \mid x) \,\|\, p(z_b)\big) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_{b,j}^2 - \mu_{b,j}^2 - \sigma_{b,j}^2\right) \tag{5}$$

$$\mathrm{KL}\big(q(z_c \mid x) \,\|\, p(z_c)\big) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_{c,j}^2 - \mu_{c,j}^2 - \sigma_{c,j}^2\right) \tag{6}$$

wherein $\sigma_b^2$ represents the variance of the noise information distribution and $\mu_b$ represents its mean; $\sigma_c^2$ represents the variance of the biological information distribution and $\mu_c$ represents its mean; $j$ indexes the $J$ dimensions of the hidden layer.
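For a diagonal Gaussian posterior and a standard normal prior, the KL regularization term has a well-known closed form. The sketch below assumes that closed form applies to formulas (5) and (6), since the text states that the prior is a standard Gaussian; the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + math.log(v) - m * m - v for m, v in zip(mu, var)
    )


kl_noise = kl_to_standard_normal([0.3, -0.2], [0.8, 1.1])  # toy z_b statistics
```

The term is zero only when the posterior already equals the prior (zero mean, unit variance) and is positive otherwise, which is what pulls both latent distributions toward the prior during training.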

In addition, the embodiment of the present application adds a similarity regularization loss to the output layer of the first decoder 102 to constrain the corrected RNA sequencing data so that it stays close to the RNA sequencing data. In practical applications, this loss may be calculated by L2 regularization, and the similarity loss can be expressed by the following formula (7):

$$\mathcal{L}_{\mathrm{sim}} = \left\lVert x - x_{bf} \right\rVert_2^2 \tag{7}$$

wherein $x$ represents the RNA sequencing sample and $x_{bf}$ represents the corrected RNA sequencing sample.

Further, to enable the first encoder 101 to accurately identify noise information in the RNA sequencing sample, and to enable the second encoder 103 to accurately identify biological information in the RNA sequencing sample, a first biological information discriminator 106 and a first noise information discriminator 107 are further provided for the first encoder 101 and the second encoder 103.

Wherein, the first noise information discriminator 107 is configured to perform discrimination processing on the noise information distribution and the biological information distribution, respectively, to obtain a first discrimination result for noise information in the noise information distribution, and a second discrimination result for noise information in the biological information distribution.

A first biological information discriminator 106 for discriminating the noise information distribution and the biological information distribution, respectively, to obtain a third discrimination result for the biological information in the noise information distribution, and a fourth discrimination result for the biological information in the biological information distribution.

Specifically, the first noise information discriminator 107 is used to predict noise information from the biological information distribution $z_c$ and the noise information distribution $z_b$. The second encoder 103 wants noise information to be predictable from its hidden layer, whereas the first encoder 101 wants to fool the first noise information discriminator 107, so that noise information is difficult to predict from the hidden layer of the first encoder. Correspondingly, the first biological information discriminator 106 is used to predict biological information from the biological information distribution $z_c$ and the noise information distribution $z_b$, and the aim is that biological information is difficult to predict from the noise information distribution. In particular, training of the first noise information discriminator 107 and the first biological information discriminator 106 may be understood as maximizing the log-likelihood of the conditional probability $p(y \mid z)$, where $y \in \{b, c\}$ and $z \in \{z_b, z_c\}$. Here $p(y \mid z)$ represents the probability of predicting the biological information label or the noise information label from the biological information distribution $z_c$ or the noise information distribution $z_b$, $b$ represents the noise information label, $c$ represents the biological information label, and $1 - p(y \mid z)$ represents the probability that a noise information label or a biological information label cannot be predicted from the hidden layer. Maximizing the log-likelihood means maximizing the classification loss function shown in the following formula (8):

$$\mathcal{L}_{D} = \mathbb{E}\big[\log p(b \mid z_b)\big] + \mathbb{E}\big[\log p(b \mid z_c)\big] + \mathbb{E}\big[\log p(c \mid z_b)\big] + \mathbb{E}\big[\log p(c \mid z_c)\big] \tag{8}$$

wherein $p(b \mid z_b)$ represents the probability of predicting the noise information label from the noise information distribution; $p(b \mid z_c)$ represents the probability of predicting the noise information label from the biological information distribution; $p(c \mid z_b)$ represents the probability of predicting the biological information label from the noise information distribution; and $p(c \mid z_c)$ represents the probability of predicting the biological information label from the biological information distribution.
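The discriminator objective can be illustrated for a single sample as the sum of the four log-probabilities defined above. This sketch is an assumed reading of formula (8) based on those definitions; the function names are hypothetical and the probabilities would in practice come from the two discriminator networks.

```python
import math

EPS = 1e-12  # clamp to keep log() finite; an implementation detail, not from the text


def safe_log(p):
    """Natural log of a probability, clamped to [EPS, 1]."""
    return math.log(min(max(p, EPS), 1.0))


def discriminator_objective(p_b_given_zb, p_b_given_zc, p_c_given_zb, p_c_given_zc):
    """Sum of log-probabilities that both discriminators assign to the true
    labels, from both latents. The discriminators are trained to maximize it."""
    return (
        safe_log(p_b_given_zb)
        + safe_log(p_b_given_zc)
        + safe_log(p_c_given_zb)
        + safe_log(p_c_given_zc)
    )


objective = discriminator_objective(0.9, 0.6, 0.55, 0.95)
```

The objective is at most zero and reaches zero only when the discriminators recover both labels with certainty from both latents, which is the situation the encoders are trained to prevent for the "wrong" latent.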

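Further, the first encoder 101 and the second encoder 103 are trained to minimize the following two classification loss functions (9) and (10):

$$\mathcal{L}_{E_b} = -\mathbb{E}\big[\log p(b \mid z_b)\big] - \mathbb{E}\big[\log\big(1 - p(c \mid z_b)\big)\big] \tag{9}$$

$$\mathcal{L}_{E_c} = -\mathbb{E}\big[\log\big(1 - p(b \mid z_c)\big)\big] - \mathbb{E}\big[\log p(c \mid z_c)\big] \tag{10}$$

wherein $p(b \mid z_b)$ represents the probability of predicting the noise information label from the noise information distribution; $1 - p(c \mid z_b)$ represents the probability that the biological information label cannot be predicted from the noise information distribution; $1 - p(b \mid z_c)$ represents the probability that the noise information label cannot be predicted from the biological information distribution; and $p(c \mid z_c)$ represents the probability of predicting the biological information label from the biological information distribution.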
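The two encoder-side losses can be sketched for a single sample as negative log-probabilities of the four quantities defined above. The pairing of terms (the two quantities computed from the noise latent in one loss, the two computed from the biological latent in the other) is an assumption inferred from the order of those definitions, and the function names are hypothetical.

```python
import math

EPS = 1e-12  # clamp to keep log() finite; an implementation detail, not from the text


def safe_log(p):
    """Natural log of a probability, clamped to [EPS, 1]."""
    return math.log(min(max(p, EPS), 1.0))


def noise_latent_loss(p_b_given_zb, p_c_given_zb):
    """Low when z_b predicts the noise label but not the biological label."""
    return -(safe_log(p_b_given_zb) + safe_log(1.0 - p_c_given_zb))


def bio_latent_loss(p_b_given_zc, p_c_given_zc):
    """Low when z_c predicts the biological label but not the noise label."""
    return -(safe_log(1.0 - p_b_given_zc) + safe_log(p_c_given_zc))


well_separated = noise_latent_loss(0.95, 0.05) + bio_latent_loss(0.05, 0.95)
entangled = noise_latent_loss(0.95, 0.90) + bio_latent_loss(0.90, 0.95)
```

In this toy comparison the entangled latents incur the larger loss, which is the pressure that pushes each encoder to keep only its own kind of information.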

In another alternative embodiment of the present application, as shown in fig. 3, the training of the correction model is further based on a second biological information discriminator 108 and a second noise information discriminator 109;

the second biological information discriminator is configured to discriminate biological information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample.

The second noise information discriminator is configured to discriminate noise information of the RNA sequencing sample and the corrected RNA sequencing sample, and obtain a seventh discrimination result for the RNA sequencing sample and an eighth discrimination result for the corrected RNA sequencing sample.

Further, training of the correction model by the second biological information discriminator 108 and the second noise information discriminator 109 includes: optimizing the variation self-encoder based on the fifth, sixth, seventh, and eighth discrimination results to train the correction model.

Similar to the first biological information discriminator 106 and the first noise information discriminator 107, training of the second biological information discriminator 108 and the second noise information discriminator 109 may be understood as maximizing the log likelihood of conditional probabilities, where each conditional probability is the probability of predicting the biological information tag or the noise information tag from the RNA sequencing sample or the corrected RNA sequencing sample, b denotes the noise information tag, c denotes the biological information tag, and the complementary probability denotes the probability that the noise information tag or the biological information tag cannot be predicted from the RNA sequencing sample or the corrected RNA sequencing sample. Maximizing the log likelihood thus means maximizing the classification loss function shown in the following formula (11):

（11）；

wherein the four probabilities in formula (11) respectively represent: the probability of predicting noise information from the corrected RNA sequencing sample; the probability of predicting biological information from the corrected RNA sequencing sample; the probability of predicting noise information from the RNA sequencing sample; and the probability of predicting biological information from the RNA sequencing sample.

Further, training the correction model based on the second biological information discriminator 108 and the second noise information discriminator 109 is achieved by minimizing the following classification loss functions (12), (13):

（12）；

（13）；

wherein the four probabilities in formulas (12) and (13) respectively represent: the probability that noise information cannot be predicted from the corrected RNA sequencing sample; the probability of predicting biological information from the corrected RNA sequencing sample; the probability of predicting noise information from the RNA sequencing sample; and the probability of predicting biological information from the RNA sequencing sample.

The loss function required for training the correction model in combination with the discriminators can be expressed by the following formula (14), which combines the above formulas (8) to (9):

（14）；

wherein the three coefficients in formula (14) respectively denote the loss weights of the individual parts. In practical application, different weights may be set according to the training difficulty of each part; in an alternative embodiment of the present application, the three weights are set to 5, 15, and 10, respectively.
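As a minimal sketch of such a weighted combination, the following Python snippet forms a weighted sum of three partial losses using the weights 5, 15, and 10 of the alternative embodiment. Which partial loss each weight multiplies is not recoverable from the text here, so the names `part_1`, `part_2`, `part_3` and the example loss values are purely hypothetical.

```python
def weighted_loss(parts, weights):
    """Return the weighted sum of partial losses (one weight per part)."""
    if len(parts) != len(weights):
        raise ValueError("each partial loss needs exactly one weight")
    return sum(w * p for w, p in zip(weights, parts))

# Weights from the alternative embodiment; the loss values are made up.
weights = (5, 15, 10)
part_1, part_2, part_3 = 0.2, 0.1, 0.3
total = weighted_loss((part_1, part_2, part_3), weights)
print(round(total, 6))
```

A larger weight makes the optimizer pay more attention to the corresponding part, which is one way to compensate for parts that are harder to train.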

In combination with the above formulas (1) to (14), the training of the correction model can be represented by the following formula (15):

（15）；

wherein the two coefficients in formula (15) respectively denote the weights of the corresponding loss terms; in an alternative embodiment of the present application, they are set to 0.5 and 150, respectively.

In an optional embodiment of the present application, after training the correction model based on the above manner, a fully-connected network and a softmax output layer may be added after the first encoder in the correction model, and phenotype classification training may be performed to obtain a prediction model for predicting phenotype information such as a cancer type, a primary site, a disease type, and the like of a tumor sample.
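A minimal Python sketch of such a classification head follows. It assumes that the first encoder yields a latent vector (hard-coded here) and uses a single fully-connected layer followed by softmax; the weights, the latent values, and the three class names are hypothetical placeholders, and the training step itself is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict_phenotype(latent, weights, bias, classes):
    """Map a latent vector from the encoder to a class and its probabilities."""
    probs = softmax(dense(latent, weights, bias))
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs

# Hypothetical 2-dimensional latent vector and untrained example weights.
latent = [1.0, -0.5]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
label, probs = predict_phenotype(latent, weights, bias,
                                 ["type_A", "type_B", "type_C"])
print(label, [round(p, 3) for p in probs])
```

In an actual model the latent vector would come from the trained first encoder, and the head would be fitted on labelled phenotype data.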

Further, after correcting the RNA sequencing data based on the correction method for RNA sequencing data described above, the corrected RNA sequencing data may also be integrated and classified.

Specifically, the method further comprises the following steps:

Further, to ensure the correction quality of the correction model, the correction model may also be evaluated.

In an alternative embodiment of the present application, the correction model may be evaluated using the adjusted Rand index (ARI), mutual information (MI), the silhouette coefficient, and the k-nearest-neighbor batch effect test (kBET) as quantitative evaluation indicators.

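The adjusted Rand index is used to calculate the similarity between two clusterings, that is, it is calculated after the RNA sequencing data in the RNA sequencing data set have been clustered based on the correction model.

Specifically, the adjusted Rand index can be expressed by the following formula (16):

$$\mathrm{ARI}=\frac{\sum_{i,j}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2}+\sum_{j}\binom{b_j}{2}\right]-\left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}} \tag{16}$$

wherein $n_{ij}$, $a_i$, and $b_j$ are values in the contingency table of the two clusterings ($n_{ij}$ is the number of RNA sequencing data shared by cluster i of one clustering and cluster j of the other, and $a_i$ and $b_j$ are the corresponding row and column sums), n is the total number of RNA sequencing data, and ARI denotes the adjusted Rand index. Its value is at most 1: a value of 1 indicates that the two clusterings are identical, and a value close to 0 indicates a clustering result close to random division (the value can be negative when the agreement is worse than chance).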
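By way of illustration, the adjusted Rand index can be computed from two label lists with the Python standard library alone. The sketch below assumes the standard contingency-table (Hubert-Arabie) definition; the function name and the toy labels are illustrative and not taken from the application.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two label lists of equal length."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0
    pair_counts = Counter(zip(labels_a, labels_b))   # contingency cells
    row_sums = Counter(labels_a)                     # row totals
    col_sums = Counter(labels_b)                     # column totals
    sum_ij = sum(comb(v, 2) for v in pair_counts.values())
    sum_a = sum(comb(v, 2) for v in row_sums.values())
    sum_b = sum(comb(v, 2) for v in col_sums.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:      # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # relabelled copy
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])   # one cluster split
worse = adjusted_rand_index([0, 1, 0, 1], [0, 0, 1, 1])     # worse than chance
print(same, round(partial, 4), worse)
```

The index depends only on how samples are grouped, not on the label names, so a relabelled copy of a clustering scores the same as the original.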

The mutual information is used to measure the similarity between the true labels and the cluster labels of the same data. In the embodiments of the present application, adjusted mutual information (AMI) is used for the calculation, and it can be represented by the following formula (17):

（17）；

wherein U and V are the true labels and the cluster labels obtained after clustering the RNA sequencing data, MI is the mutual information, and H is the Shannon entropy.
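The two building blocks named above, the mutual information MI and the Shannon entropy H, can be sketched in Python as follows. This sketch uses natural logarithms and deliberately stops short of the adjusted score, which additionally corrects MI for the value expected by chance; the function names and toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H of a label list, in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(labels_u, labels_v):
    """Mutual information MI(U, V) between two label lists, in nats."""
    if len(labels_u) != len(labels_v):
        raise ValueError("label lists must have the same length")
    n = len(labels_u)
    count_u = Counter(labels_u)
    count_v = Counter(labels_v)
    joint = Counter(zip(labels_u, labels_v))
    return sum((c / n) * log(n * c / (count_u[u] * count_v[v]))
               for (u, v), c in joint.items())

true_labels = [0, 0, 1, 1]
mi_match = mutual_information(true_labels, [1, 1, 0, 0])   # same grouping
mi_indep = mutual_information(true_labels, [0, 1, 0, 1])   # unrelated grouping
print(mi_match, mi_indep, entropy(true_labels))
```

When the cluster labels reproduce the true grouping, MI reaches the entropy of the true labels; when the two labelings are unrelated, MI is zero.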

The silhouette coefficient is used to evaluate the quality of the clustering effect by measuring the compactness within clusters and the separation between clusters. Specifically, for each RNA sequencing data i, the silhouette coefficient can be represented by the following formula (18):

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}} \tag{18}$$

wherein $a(i)$ is the mean distance from RNA sequencing data i to the other data in the same cluster, and $b(i)$ is the minimum mean distance from RNA sequencing data i to the data of any other cluster. The silhouette coefficient of the clustering as a whole may be the mean of the silhouette coefficients of all RNA sequencing samples. Its value lies in [-1, 1], and the larger the value, the better the clustering effect.
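A minimal Python sketch of the silhouette coefficient is given below. It assumes Euclidean distance and the common convention that a sample in a single-member cluster scores 0; the function names and the toy coordinates are illustrative and not taken from the application.

```python
import math

def silhouette_scores(points, labels):
    """Per-sample silhouette (b - a) / max(a, b) using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                      # single-member cluster: score 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

def mean_silhouette(points, labels):
    """Overall silhouette: mean of the per-sample scores."""
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])   # clusters match the geometry
bad = mean_silhouette(points, [0, 1, 0, 1])    # clusters cut across it
print(round(good, 3), round(bad, 3))
```

A clustering that follows the geometry of the data gives a score near 1, whereas a clustering that cuts across it gives a negative score.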

kBET (k-nearest-neighbor batch effect test) is a quality indicator that measures batch effects in RNA sequencing data. In the embodiments of the present application, let D be a data set containing n RNA sequencing samples in total, and let $f_i = n_i / n$ be the global proportion of batch i, where $n_i$ is the number of RNA sequencing samples in batch i and i is a positive integer. In addition, m subsets $S_j$ (j = 1, 2, …, m) are formed, each consisting of the k nearest neighbors of a randomly selected RNA sequencing sample, and $f_i^{(j)}$ is defined as the proportion of batch i within the local subset $S_j$. The null hypothesis is that the data are already well mixed, that is, the local proportions $f_i^{(j)}$ and the global proportions $f_i$ follow the same distribution. Pearson's chi-square test is applied to each subset $S_j$ and returns a P value $p_j$, and the average kBET acceptance rate over all tests is finally obtained.

The average kBET acceptance rate may be represented by the following formula (19):

$$\text{acceptance rate}=\frac{1}{m}\sum_{j=1}^{m} I\left(p_j>\alpha\right) \tag{19}$$

wherein $I(\cdot)$ is an indicator function and α is the significance level threshold; the higher the acceptance rate, the better the batch correction effect.
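The procedure described above can be sketched in Python as follows. This is a simplified illustration and not the reference kBET implementation: it assumes Euclidean distance, excludes the tested sample from its own neighborhood, counts a test as accepted when its P value exceeds α, and computes the chi-square P value with a closed-form recurrence for integer degrees of freedom. All function names, parameter defaults, and toy data are hypothetical.

```python
import math
import random
from collections import Counter

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-y), 1.0              # Q(1, y)
    else:
        q, a = math.erfc(math.sqrt(y)), 0.5   # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def kbet_acceptance_rate(points, batches, k=10, m=20, alpha=0.05, seed=0):
    """Share of tested neighborhoods whose batch mix is not rejected."""
    n = len(points)
    global_counts = Counter(batches)
    batch_ids = sorted(global_counts)
    if len(batch_ids) < 2 or k >= n:
        raise ValueError("need at least two batches and k < number of samples")
    # Expected neighbor count per batch under the well-mixed null hypothesis
    expected = {b: k * global_counts[b] / n for b in batch_ids}
    tested = random.Random(seed).sample(range(n), min(m, n))
    accepted = 0
    for i in tested:
        # k nearest neighbors of sample i (sample i itself is excluded)
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        local = Counter(batches[j] for j in order[:k])
        stat = sum((local[b] - expected[b]) ** 2 / expected[b] for b in batch_ids)
        if chi2_sf(stat, len(batch_ids) - 1) > alpha:
            accepted += 1
    return accepted / len(tested)

# Toy 1-D data: 40 samples in two batches of 20 (values are made up).
mixed_points = [(float(x),) for x in range(40)]
mixed_batches = [x % 2 for x in range(40)]
split_points = [(float(x),) for x in range(20)] + [(1000.0 + x,) for x in range(20)]
split_batches = [0] * 20 + [1] * 20

mixed_rate = kbet_acceptance_rate(mixed_points, mixed_batches)
split_rate = kbet_acceptance_rate(split_points, split_batches)
print(mixed_rate, split_rate)
```

Interleaved batches pass every local test, whereas batches that occupy separate regions fail every test, which matches the interpretation that a higher acceptance rate indicates better batch mixing.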

In summary, according to the correction method for the RNA sequencing data, correction of the RNA sequencing data is achieved by using the correction model, meanwhile, training of the correction model is based on training of a variation self-encoder comprising the correction model, and discrimination results of noise information distribution and biological information distribution in the RNA sequencing data are obtained, so that correction accuracy of the correction model on the RNA sequencing data is improved.

Exemplary apparatus

Correspondingly, the embodiment of the application also provides a device for correcting the RNA sequencing data, please refer to fig. 4, and fig. 4 is a schematic structural diagram of a device for correcting the RNA sequencing data according to another embodiment of the application.

As shown in fig. 4, the correction device for RNA sequencing data comprises:

a first unit 401, configured to perform correction processing on RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction processing is used to eliminate noise data in the RNA sequencing data;

the variation self-encoder is trained and obtained at least based on the reduced RNA sequencing sample and the discrimination results of the noise information distribution and the biological information; wherein the discrimination of the noise information distribution and the biological information distribution includes: and distinguishing the biological information and the noise information in the noise information distribution, and distinguishing the noise information and the biological information in the biological information distribution.

In an alternative embodiment of the present application, the training of the correction model further includes:

In an alternative embodiment of the present application, the training of the correction model further includes: optimizing the variational self-encoder based on the differences between the noise information distribution and the prior hypothesis, and the differences between the biological information distribution and the prior hypothesis, to train the correction model; wherein the prior assumption is a normal distribution.
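The difference between an encoded distribution and a normal prior is commonly measured with the Kullback-Leibler divergence. The Python sketch below assumes that the prior is the standard normal distribution and that an encoder outputs a mean and a log-variance per latent dimension; both are assumptions made for illustration, since the text only states that the prior is a normal distribution, and the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL divergence from N(mean, diag(exp(log_var))) to N(0, I), summed over dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))

at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])   # matches the prior
shifted = kl_to_standard_normal([1.0], [0.0])              # mean moved by 1
print(at_prior, shifted)
```

The term is zero when the encoded distribution coincides with the prior and grows as the mean or the variance drifts away from it, so minimizing it pulls both latent distributions toward the prior.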

In an alternative embodiment of the present application, the training of the correction model further includes: discriminating biological information of the RNA sequencing sample and the corrected RNA sequencing sample by using a second biological information discriminator to obtain a fifth discrimination result for the RNA sequencing sample and a sixth discrimination result for the corrected RNA sequencing sample;

In an alternative embodiment of the present application, the training of the correction model further includes: acquiring an RNA sequencing data set, wherein the RNA sequencing data set comprises a plurality of corrected RNA sequencing data;

The correction device for RNA sequencing data provided in this embodiment belongs to the same inventive concept as the correction method for RNA sequencing data provided in the above embodiments of the present application, can execute the correction method for RNA sequencing data provided in any of the above embodiments, and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in detail in this embodiment, reference may be made to the specific processing of the correction method for RNA sequencing data provided in the above embodiments of the present application, which is not repeated here.

Exemplary electronic device

Another embodiment of the present application further provides an electronic device. Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device provided in another embodiment of the present application.

As shown in fig. 5, the electronic device includes:

a memory 200 and a processor 210;

wherein the memory 200 is connected to the processor 210, and is used for storing a program;

the processor 210 is configured to implement the method for correcting RNA sequencing data disclosed in any of the above embodiments by executing the program stored in the memory 200.

Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.

The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are interconnected by a bus. Wherein:

a bus may comprise a path that communicates information between components of a computer system.

Processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, or may be an application-specific integrated circuit (ASIC) or one or more integrated circuits for controlling the execution of programs in accordance with aspects of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Processor 210 may include a main processor, and may also include a baseband chip, modem, and the like.

The memory 200 stores programs for implementing the technical scheme of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operation instructions. More specifically, the memory 200 may include read-only memory (ROM), other types of static storage devices that may store static information and instructions, random access memory (RAM), other types of dynamic storage devices that may store information and instructions, disk storage, flash memory, and the like.

The input device 230 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.

Output device 240 may include means, such as a display screen, printer, speakers, etc., that allow information to be output to a user.

The communication interface 220 may include any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

Processor 210 executes the programs stored in memory 200 and invokes the other devices, so as to implement the steps of any of the methods for correcting RNA sequencing data provided in the above-described embodiments of the present application.

Exemplary computer program product and storage Medium

In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a method of correcting RNA sequencing data according to various embodiments of the present application described in the "exemplary methods" section of the present specification.

The computer program product may include program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present application may also be a storage medium having stored thereon a computer program that is executed by a processor to perform the steps in the method for correcting RNA sequencing data according to various embodiments of the present application described in the above "exemplary method" section of the present specification, and specifically may implement the following steps:

S101, correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction processing is used to eliminate noise data in the RNA sequencing data;

the variation self-encoder is trained and obtained at least based on the reduced RNA sequencing sample and the discrimination results of the noise information distribution and the biological information; wherein the discrimination of the noise information distribution and the biological information distribution includes: and judging the biological information in the noise information distribution and judging the noise information in the biological information distribution.
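The data flow of step S101 can be illustrated with a toy Python sketch. It assumes, for illustration only, that the trained first encoder and first decoder are simple affine maps and that the mean of the biological information distribution is used as the latent code; the class name, the weights, and the example expression values are hypothetical and do not represent a trained model.

```python
def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

class CorrectionModel:
    """Toy stand-in for the trained first encoder and first decoder."""

    def __init__(self, enc_w, enc_b, dec_w, dec_b):
        self.enc_w, self.enc_b = enc_w, enc_b
        self.dec_w, self.dec_b = dec_w, dec_b

    def encode(self, expression):
        """First encoder: expression vector -> biological latent code."""
        return linear(expression, self.enc_w, self.enc_b)

    def decode(self, latent):
        """First decoder: biological latent code -> corrected expression."""
        return linear(latent, self.dec_w, self.dec_b)

    def correct(self, expression):
        """Correction step: only the encoder and decoder above are involved."""
        return self.decode(self.encode(expression))

# Hypothetical weights: the third input component is treated as noise and dropped.
model = CorrectionModel(
    enc_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], enc_b=[0.0, 0.0],
    dec_w=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], dec_b=[0.0, 0.0, 0.0],
)
corrected = model.correct([2.0, 3.0, 9.0])
print(corrected)
```

The sketch reflects that the correction step itself uses only the first encoder and the first decoder; the remaining components of the variation self-encoder and the discriminators are needed only during training.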

For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts described, as some acts may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference is made to the description of the method embodiments for relevant details.

The steps in the method of each embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.

The modules and sub-modules in the device and the terminal of the embodiments of the present application may be combined, divided, and deleted according to actual needs.

In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. The above-described terminal embodiments are merely illustrative: the division of modules or sub-modules is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices or modules, and may be in electrical, mechanical, or other forms.

The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, each functional module or sub-module in each embodiment of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for correcting RNA sequencing data, comprising:

correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction processing of the RNA sequencing data is used for eliminating noise data in the RNA sequencing data, the correction model comprises a first encoder and a first decoder, the first encoder is used for performing biological dimension encoding processing on the RNA sequencing data to obtain a biological information distribution of the RNA sequencing data, and the first decoder is used for performing decoding reduction processing on the biological information distribution to obtain the corrected RNA sequencing data;

the correction model is obtained based on training of a variation self-encoder; the variation self-encoder comprises the correction model, a second encoder, a third encoder and a second decoder, wherein the second encoder of the variation self-encoder is used for extracting the noise information distribution in an RNA sequencing sample, the third encoder is used for extracting the biological information distribution in the corrected RNA sequencing sample, and the second decoder is used for performing decoding reduction processing on the RNA sequencing sample by combining the noise information distribution with the biological information distribution in the corrected RNA sequencing sample, to obtain a reduced RNA sequencing sample;

training of the variation self-encoder comprises at least: training the variation self-encoder based on the reduced RNA sequencing sample and the RNA sequencing sample, and optimizing the first encoder and the second encoder based on discrimination results for the noise information distribution and for the biological information distribution in the RNA sequencing sample obtained by the first encoder; wherein the discrimination of the noise information distribution and of the biological information distribution in the RNA sequencing sample obtained by the first encoder comprises: performing discrimination processing on the noise information distribution and the biological information distribution respectively by using a first noise information discriminator, to obtain a first discrimination result for the noise information in the noise information distribution and a second discrimination result for the noise information in the biological information distribution; and performing discrimination processing on the noise information distribution and the biological information distribution respectively by using a first biological information discriminator, to obtain a third discrimination result for the biological information in the noise information distribution and a fourth discrimination result for the biological information in the biological information distribution.

2. The method of claim 1, wherein the correction model is trained by:

3. The method as recited in claim 2, further comprising:

4. The method as recited in claim 2, further comprising:

5. The method as recited in claim 2, further comprising:

6. The method as recited in claim 1, further comprising:

7. A device for correcting RNA sequencing data, comprising:

a first unit, used for correcting the RNA sequencing data by using a pre-constructed correction model to obtain corrected RNA sequencing data, wherein the correction processing of the RNA sequencing data is used for eliminating noise data in the RNA sequencing data, the correction model comprises a first encoder and a first decoder, the first encoder is used for performing biological dimension encoding processing on the RNA sequencing data to obtain a biological information distribution of the RNA sequencing data, and the first decoder is used for performing decoding reduction processing on the biological information distribution to obtain the corrected RNA sequencing data;

training of the variation self-encoder comprises at least: training the variation self-encoder based on the reduced RNA sequencing sample and the RNA sequencing sample, and optimizing the first encoder and the second encoder based on discrimination results for the noise information distribution and for the biological information distribution in the RNA sequencing sample obtained by the first encoder; wherein the discrimination of the noise information distribution and of the biological information distribution in the RNA sequencing sample obtained by the first encoder comprises: performing discrimination processing on the noise information distribution and the biological information distribution respectively by using a first noise information discriminator, to obtain a first discrimination result for the noise information in the noise information distribution and a second discrimination result for the noise information in the biological information distribution; and performing discrimination processing on the noise information distribution and the biological information distribution respectively by using a first biological information discriminator, to obtain a third discrimination result for the biological information in the noise information distribution and a fourth discrimination result for the biological information in the biological information distribution.

8. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to execute the method for correcting RNA sequencing data according to any one of claims 1 to 6 by executing the instructions in the memory.

9. A computer storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, performs the method of correcting RNA sequencing data according to any of the preceding claims 1-6.