CN109063416A - Gene expression prediction method based on LSTM recurrent neural network - Google Patents

Gene expression prediction method based on LSTM recurrent neural network

Info

Publication number
CN109063416A
Authority
CN
China
Prior art keywords
gene expression
data
gene
value
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810810239.3A
Other languages
Chinese (zh)
Other versions
CN109063416B (en)
Inventor
王会青
李春
董春林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201810810239.3A priority Critical patent/CN109063416B/en
Publication of CN109063416A publication Critical patent/CN109063416A/en
Application granted granted Critical
Publication of CN109063416B publication Critical patent/CN109063416B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a gene expression prediction method based on an LSTM recurrent neural network. An LSTM recurrent neural network is introduced into the construction of the gene expression prediction model to reduce prediction error, and data preprocessing methods such as K-Means clustering and Z-Score standardization are applied to improve training efficiency. Combined with the about 1000 known landmark genes, the LSTM recurrent neural network extracts the nonlinear features of gene expression profile data, a nonlinear prediction model is constructed, and finally the expression of about 21000 target genes is predicted. The invention effectively solves the problem that the LINCS program uses linear regression as its estimation method and therefore ignores the nonlinear characteristics of gene expression profiles; it reduces the cost of obtaining gene expression profiles and improves the efficiency of gene expression research.

Description

Gene expression prediction method based on LSTM recurrent neural network
Technical field
The present invention relates to the field of gene expression research and analysis, and more particularly to a gene expression prediction method based on an LSTM recurrent neural network.
Background art
Combining molecular biology with machine learning and deep learning from the computer field has made gene expression profiles widely used in gene association analysis, superior crop breeding, complex disease subtyping, disease-related gene discovery, drug screening and so on. However, obtaining gene expression profiles is expensive and the process is complicated, so at present only a small number of well-funded laboratories are able to carry out large-scale genome-wide expression profiling. The LINCS program currently uses linear regression (LR) as its gene expression prediction method, but for complicated gene expression profile data the LR method ignores the nonlinear factors in the expression profile, which reduces the accuracy of the final experimental results.
Summary of the invention
The object of the present invention is to avoid the deficiencies of the prior art and to provide a gene expression prediction method based on an LSTM recurrent neural network.
The object of the present invention can be achieved by the following technical measures: a gene expression prediction method based on an LSTM recurrent neural network is designed, comprising: preprocessing gene expression profile data; feeding the preprocessed gene expression profile data into an LSTM recurrent neural network to construct a gene expression prediction model, using mean squared error as the loss function of the model, and updating the weights to train the model according to the standard back-propagation algorithm; varying the parameter combinations of the LSTM recurrent neural network to train the model, and using mean absolute error as the performance evaluation index of the prediction model to test the model error under each parameter combination; and using the statistical coefficient of determination (R²) to measure the goodness of fit of the model.
Wherein, the preprocessing at least comprises: removing repeated data in the original gene expression profile data with an unsupervised clustering algorithm; converting the format of the de-duplicated gene expression profile data and saving the data in numpy format in python; determining the gene annotations of the landmark genes and target genes in the gene expression profile data, and standardizing the format-converted gene expression profile data; wherein the standardization is Z-Score standardization.
Wherein, removing the repeated data in the original gene expression profile data with an unsupervised clustering algorithm means classifying the original gene expression profile data with the K-Means clustering algorithm and using the Euclidean distance to measure the similarity between data within each class, so as to judge whether repeated data exist within the same class and to remove the repeated data when they exist; this specifically comprises the steps of:
Step1: initialize K cluster centers;
Step2: calculate the distance from each gene expression profile sample datum to each cluster center, and assign the sample datum to the cluster represented by the nearest cluster center;
Step3: calculate the coordinate average of all gene expression profile sample data in each cluster, and take the coordinate average as the new cluster center;
Step4: repeat Step2 and Step3 until the movement of the cluster centers is less than a preset error value or the number of clustering iterations reaches a preset value;
Step5: measure the Euclidean distance between gene expression profile data within each cluster; if the Euclidean distance between two gene expression profile data in a cluster is less than a set threshold, the pair of expression profiles is defined as a duplicate and one of them is deleted.
Wherein, the step of determining the gene annotations of the landmark genes and target genes in the gene expression profile data and standardizing the format-converted gene expression profile data comprises the steps of:
extracting, according to the gene coding annotations, the expression values of the 943 landmark gene probes and the 15744 target gene probes in the gene expression profile data; determining the multiple probes in the gene expression profile data that correspond to the same gene coding annotation in the RNA-Seq data, and taking the average of the multiple probe expressions as the expression value of the gene expression profile data, which yields 9520 combined target genes with a one-to-one correspondence between the gene expression profile data and the RNA-Seq data; standardizing the expression values of the 943 landmark genes and the 9520 combined target genes of the gene expression profile data with Z-Score; and, for each expression profile of the RNA-Seq data, concatenating the expression values of the 943 landmark genes and the 9520 target genes and standardizing the data with the Z-Score method.
Wherein, the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function comprises the steps of:
Suppose there are N training samples, L landmark genes and T target genes; the training set is expressed as D = {(x^i, y^i)}_{i=1}^{N}, where x^i ∈ R^L denotes the expression values of the L landmark genes of the i-th sample and y^i ∈ R^T denotes the expression values of the T target genes of the i-th sample;
The LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby implementing the functions of forgetting and remembering; an LSTM unit has three such gates: the forget gate, the input gate and the output gate;
The forget gate implements, through a sigmoid layer, the forgetting of partial information in the cell state of the previous time step; during training of the LSTM recurrent neural network, with the previous output h_{t-1} and the current input x_t as inputs, a value in [0,1] is generated for each item in the previous cell state C_{t-1}, indicating how much information is retained, where 1 means fully retain and 0 means fully discard; this value is multiplied with C_{t-1}; the update of f_t is shown in formula (1):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, b_f and W_f are the bias and input weights of the forget gate, and f_t indicates how much previous information is retained;
The input gate records new partial information into the cell state; its implementation has two parts: (1) an input gate layer determines the content i_t to be updated; (2) a tanh layer creates a candidate value vector C̃_t that is added to the cell state; the updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (3)
The cell state C_t is then updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
The output gate controls the output and is implemented as follows: (1) an output gate layer determines the content o_t of the new cell state to output; (2) the cell state is passed through a tanh layer and then multiplied by the output of the output gate layer to give h_t; the updates of o_t and h_t are shown in formulas (5) and (6):
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
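The gate updates of formulas (1)-(6) can be illustrated with the following minimal numpy sketch of a single LSTM cell step; the hidden size, input dimension and random weight initialization are illustrative assumptions and not part of the claimed method.

```python
# Minimal numpy sketch of one LSTM cell step implementing formulas (1)-(6);
# weights are randomly initialized purely for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # W and b hold the parameters of the four layers: forget (f), input (i),
    # candidate (C) and output (o).
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate, formula (1)
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate, formula (2)
    C_hat = np.tanh(W["C"] @ z + b["C"])         # candidate values, formula (3)
    C_t = f_t * C_prev + i_t * C_hat             # cell state update, formula (4)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate, formula (5)
    h_t = o_t * np.tanh(C_t)                     # output, formula (6)
    return h_t, C_t

# Illustrative shapes: input dimension 1, hidden size 4.
rng = np.random.default_rng(0)
hidden, dim = 4, 1
W = {k: rng.standard_normal((hidden, hidden + dim)) for k in "fiCo"}
b = {k: np.zeros(hidden) for k in "fiCo"}
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(np.array([0.5]), h, C, W, b)
```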
Mean squared error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} (ŷ_{i(t)} − y_{i(t)})² (7)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
Wherein, after the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function, the method further includes the step of updating the prediction model weights according to the standard back-propagation algorithm, accelerating model training with the Adam optimization algorithm and the Dropout technique, enhancing model robustness and reducing the prediction model error, comprising the steps of:
updating the prediction model weights according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the partial derivatives of the loss function with respect to all parameters are computed, and the gradient is calculated according to formula (8):
g_θ = ∂L/∂θ for each model parameter θ, where L is the loss function of formula (7) (8)
replacing traditional stochastic gradient descent with the Adam optimization algorithm in the back-propagation algorithm, iteratively updating the network weights of the prediction model based on the training data, and designing independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient;
adding the Dropout technique to the training process of the prediction model; during training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability.
Wherein, in the step of varying the parameter combinations of the LSTM recurrent neural network to train the model, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination:
the mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} |ŷ_{i(t)} − y_{i(t)}| (9)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
Wherein, the coefficient of determination (R²) is a commonly used statistic in regression analysis and is often taken as the standard for measuring the predictive ability of a model; the relevant formulas are shown in (10)-(13):
Sum of Squares due to Error: SSE = Σ_{i=1}^{N} (y_i − ŷ_i)² (10)
Sum of Squares due to Regression: SSR = Σ_{i=1}^{N} (ŷ_i − ȳ)² (11)
Total Sum of Squares: SST = Σ_{i=1}^{N} (y_i − ȳ)² (12)
R² = 1 − SSE/SST (13)
In formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
The value of R² lies in [0,1]: a model with an R² of 0 cannot predict the target variable at all, while a model with an R² of 1 predicts the target variable perfectly; a value between 0 and 1 indicates the proportion of the target variable that can be explained by the features in the model.
Different from the prior art, the gene expression prediction method based on an LSTM recurrent neural network of the present invention performs data preprocessing with methods such as K-Means clustering and standardization to improve model training efficiency; combined with the about 1000 known landmark genes, nonlinear features are extracted with the LSTM recurrent neural network, a nonlinear prediction model is constructed, and finally the expression of about 21000 target genes is inferred, which effectively solves the problem that the LINCS program uses linear regression as its prediction method and ignores the nonlinear characteristics of gene expression profiles. The dropout strategy is used in model training to improve the generalization ability and robustness of the prediction model; the prediction performance of the gene expression prediction model is tested with data from different platforms and the experimental results are evaluated with different metrics, further verifying the advantages of the proposed gene expression prediction model in performance, cross-platform generalization ability, and the ability to learn nonlinear features.
Brief description of the drawings
Fig. 1 is a flow diagram of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention;
Fig. 2 is a diagram of the mean absolute error on the GEO dataset of the prediction model under different parameter combinations of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention;
Fig. 3 is a comparison of the mean absolute error on the GEO dataset between the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention and the existing linear regression gene expression prediction model (LR) and k-nearest-neighbor gene expression prediction model (KNN-GE);
Fig. 4 is a diagram of the mean absolute error on the GTEx/1000G dataset of the prediction model under different parameter combinations of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention;
Fig. 5 is a comparison of the mean absolute error on the GTEx/1000G dataset between the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention and the existing linear regression gene expression prediction model (LR) and k-nearest-neighbor gene expression prediction model (KNN-GE);
Fig. 6 is a comparison of the coefficient of determination (R²) between the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention and the linear regression gene expression prediction model (LR).
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative labor shall fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a flow diagram of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention. The method comprises the following steps:
S110: preprocess the gene expression profile data.
The preprocessing at least comprises:
1. Removing repeated data in the original gene expression profile data with an unsupervised clustering algorithm.
The original gene expression profile data are divided into multiple classes with the K-Means clustering algorithm, the Euclidean distance is then used to measure the similarity between data within each class so as to judge whether repeated data exist within the same class, and finally the repeated data are removed. The procedure comprises the following steps (a minimal Python sketch is given after the steps below):
Step1: initialize K cluster centers;
Step2: calculate the distance from each gene expression profile sample to each cluster center, and assign the sample to the cluster represented by the nearest cluster center;
Step3: calculate the coordinate average of all gene expression profile samples in each cluster, and take this average as the new cluster center;
Step4: repeat Step2 and Step3 until the movement of the cluster centers is less than a certain error value or the number of clustering iterations reaches a preset value;
Step5: measure the Euclidean distance between gene expression profiles within each cluster; if the Euclidean distance between two expression profiles is less than the set threshold, the pair of expression profiles is defined as a duplicate and one of them is deleted.
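The following is a minimal Python sketch of this duplicate-removal procedure; the array name `profiles`, the number of clusters and the duplicate threshold are illustrative assumptions rather than values specified by the invention.

```python
# Minimal sketch of Step1-Step5: K-Means clustering followed by within-cluster
# Euclidean-distance duplicate removal; K and the threshold are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

def remove_duplicates(profiles, n_clusters=50, threshold=1e-3):
    # Step1-Step4: K-Means handles initialization, assignment and center
    # updates internally until convergence.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(profiles)

    keep = np.ones(len(profiles), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        # Step5: pairwise Euclidean distances within the cluster; pairs closer
        # than the threshold are treated as duplicates and one is dropped.
        dist = squareform(pdist(profiles[idx], metric="euclidean"))
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if keep[idx[a]] and keep[idx[b]] and dist[a, b] < threshold:
                    keep[idx[b]] = False
    return profiles[keep]
```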
2. Converting the format of the de-duplicated gene expression profile data and saving the data in numpy format in python.
3. Determining the gene annotations of the landmark genes and target genes in the gene expression profile data, and standardizing the format-converted gene expression profile data; wherein the standardization is Z-Score standardization.
According to the gene coding annotations, the expression values of the 943 landmark gene probes and the 15744 target gene probes in the gene expression profile data are extracted.
For the multiple probes in the gene expression profile data that correspond to the same gene coding annotation in the RNA-Seq data, the average of the multiple probe expressions is taken as the expression value of the gene, which finally yields 9520 combined target genes with a one-to-one correspondence between the gene expression profile platform and the RNA-Seq platform.
The expression values of the 943 landmark genes and the 9520 combined target genes of the gene expression profile data are standardized with Z-Score.
For each expression profile of the RNA-Seq data, the expression values of the 943 landmark genes and the 9520 target genes are concatenated, and the data are standardized with the Z-Score method.
The Z-Score standardization is x'_{ij} = (x_{ij} − μ_j) / σ_j, where μ_j is the mean of all sample data in the j-th column and σ_j is the standard deviation of all sample data in the j-th column; a minimal sketch of this standardization is given below.
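```python
# Minimal sketch of the Z-Score standardization step, assuming `data` is a
# numpy array of shape (n_samples, n_genes); column means and standard
# deviations (mu_j, sigma_j) are computed from the data themselves.
import numpy as np

def z_score(data):
    mu = data.mean(axis=0)        # mean of each gene column (mu_j)
    sigma = data.std(axis=0)      # standard deviation of each column (sigma_j)
    sigma[sigma == 0] = 1.0       # guard against constant columns
    return (data - mu) / sigma
```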
S120: feed the preprocessed gene expression profile data into the LSTM recurrent neural network, construct the gene expression prediction model, use mean squared error as the loss function of the model, and update the weights to train the model according to the standard back-propagation algorithm.
Gene expression prediction can be regarded as a multi-task regression problem. Suppose there are N training samples, L landmark genes and T target genes; the training set is expressed as D = {(x^i, y^i)}_{i=1}^{N}, where x^i ∈ R^L denotes the expression values of the L landmark genes of the i-th sample and y^i ∈ R^T denotes the expression values of the T target genes of the i-th sample. The purpose of the present invention is to build a mapping model from R^L to R^T.
The Long Short-Term Memory network, abbreviated LSTM, is a special type of RNN that can learn long-term dependencies. The LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby implementing the functions of forgetting and remembering. An LSTM unit has three such gates: the forget gate, the input gate and the output gate.
The forget gate implements, through a sigmoid layer, the forgetting of partial information in the cell state of the previous time step. With the previous output h_{t-1} and the current input x_t as inputs, a value in [0,1] is generated for each item in C_{t-1}, indicating how much information is retained (1 means fully retain, 0 means fully discard), and this value is then multiplied with C_{t-1}. The update of f_t is shown in formula (1):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, and b_f and W_f are the bias and input weights of the forget gate.
The input gate determines what is stored in the cell state by recording new partial information into it. Its implementation has two parts: (1) a sigmoid layer (the input gate layer) determines which values to update; (2) a tanh layer creates a candidate value vector C̃_t that will be added to the cell state. The updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (3)
The cell state C_t is then updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
where f_t indicates how much previous information is retained, i_t indicates which values are updated, and C̃_t is the new candidate value.
The output gate controls the output and is implemented as follows: (1) a sigmoid layer (the output gate layer) determines which parts of the new cell state to output; (2) the cell state is passed through a tanh layer and then multiplied by the output of the sigmoid layer. The updates are shown in formulas (5) and (6):
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
Mean squared error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} (ŷ_{i(t)} − y_{i(t)})² (7)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
The prediction model weights are updated according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the key point being to compute the partial derivatives of the loss function with respect to all parameters. The gradient is calculated according to formula (8):
g_θ = ∂L/∂θ for each model parameter θ, where L is the loss function of formula (7) (8)
In the back-propagation algorithm, the Adam optimization algorithm replaces traditional stochastic gradient descent; it iteratively updates the network weights of the prediction model based on the training data and designs independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient.
The Dropout technique is added to the training process of the prediction model. During training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability. Since the hidden nodes that appear are random each time a sample is fed in for a weight update, it cannot be guaranteed that any two hidden nodes appear together every time; the weight updates therefore no longer depend on the joint action of hidden nodes with fixed relationships, which prevents situations in which certain features are only effective in the presence of other specific features and improves the generalization ability of the model. A minimal sketch of the model construction and training step is given below.
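The following Keras sketch illustrates constructing and training an LSTM-based prediction model with an MSE loss, the Adam optimizer and Dropout; the hidden size, dropout rate, learning rate, batch size, epoch count, the array names `x_train`/`y_train`, and the way the landmark vector is presented to the LSTM (here as a length-L sequence with one feature per step) are all illustrative assumptions, not parameters fixed by the invention.

```python
# Minimal sketch of the LSTM gene expression prediction model: input is the
# expression of L=943 landmark genes, output is the expression of T=9520
# target genes; sizes and hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

L, T = 943, 9520

model = Sequential([
    LSTM(256, input_shape=(L, 1)),   # extracts nonlinear features of the landmark profile
    Dropout(0.25),                   # Dropout to improve generalization and robustness
    Dense(T)                         # predicted expression of the T target genes
])
model.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")

# x_train: array of shape (N, L) with landmark expression values,
# y_train: array of shape (N, T) with target expression values.
# model.fit(x_train.reshape(-1, L, 1), y_train, batch_size=200, epochs=100)
```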
S140: vary the parameter combinations of the LSTM recurrent neural network to train the model, use mean absolute error as the performance evaluation index of the prediction model, and test the model error under each parameter combination.
The mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} |ŷ_{i(t)} − y_{i(t)}| (9)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample. The error results of the prediction model under different parameter combinations are shown in Figs. 2-6. Fig. 2 shows the mean absolute error on the GEO dataset of the prediction model of the present invention under different parameter combinations. Fig. 3 compares the mean absolute error on the GEO dataset of the method of the present invention with the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE). Fig. 4 shows the mean absolute error on the GTEx/1000G dataset of the prediction model of the present invention under different parameter combinations. Fig. 5 compares the mean absolute error on the GTEx/1000G dataset of the method of the present invention with the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE). Fig. 6 compares the coefficient of determination (R²) of the present invention with that of the linear regression gene expression prediction model (LR). A minimal sketch of computing the mean absolute error is given below.
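```python
# Minimal sketch of the mean-absolute-error evaluation of formula (9),
# assuming y_true and y_pred are numpy arrays of shape (N, T).
import numpy as np

def mae_per_gene(y_true, y_pred):
    # MAE of each target gene t over the N test samples
    return np.abs(y_true - y_pred).mean(axis=0)

def overall_mae(y_true, y_pred):
    # average over all samples and all T target genes
    return np.abs(y_true - y_pred).mean()
```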
S150: use the statistical coefficient of determination (R²) to measure the goodness of fit of the model.
The coefficient of determination (R²) is a commonly used statistic in regression analysis and is often taken as the standard for measuring the predictive ability of a model. The relevant formulas are shown in (10)-(13):
Sum of Squares due to Error: SSE = Σ_{i=1}^{N} (y_i − ŷ_i)² (10)
Sum of Squares due to Regression: SSR = Σ_{i=1}^{N} (ŷ_i − ȳ)² (11)
Total Sum of Squares: SST = Σ_{i=1}^{N} (y_i − ȳ)² (12)
R² = 1 − SSE/SST (13)
In formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
The value of R² lies in [0,1]: a model with an R² of 0 cannot predict the target variable at all, while a model with an R² of 1 predicts the target variable perfectly; a value between 0 and 1 indicates the proportion of the target variable that can be explained by the features in the model. The comparison of the coefficient of determination between the linear regression method and the gene expression prediction model based on the LSTM recurrent neural network proposed herein is shown in the accompanying drawings; a minimal sketch of computing this metric follows.
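```python
# Minimal sketch of the coefficient-of-determination computation of formulas
# (10)-(13), assuming y_true and y_pred are 1-D numpy arrays of the true and
# predicted expression values over the test samples.
import numpy as np

def r_squared(y_true, y_pred):
    y_mean = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)   # Sum of Squares due to Error, formula (10)
    ssr = np.sum((y_pred - y_mean) ** 2)   # Sum of Squares due to Regression, formula (11)
    sst = np.sum((y_true - y_mean) ** 2)   # Total Sum of Squares, formula (12)
    return 1.0 - sse / sst                 # R^2, formula (13)
```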
Different from the prior art, the gene expression prediction method based on an LSTM recurrent neural network of the present invention performs data preprocessing with methods such as K-Means clustering and standardization to improve model training efficiency; combined with the about 1000 known landmark genes, nonlinear features are extracted with the LSTM recurrent neural network, a nonlinear prediction model is constructed, and finally the expression of about 21000 target genes is inferred, which effectively solves the problem that the LINCS program uses linear regression as its prediction method and ignores the nonlinear characteristics of gene expression profiles. The dropout strategy is used in model training to improve the generalization ability and robustness of the prediction model; the prediction performance of the gene expression prediction model is tested with data from different platforms and the experimental results are evaluated with different metrics, further verifying the advantages of the proposed gene expression prediction model in performance, cross-platform generalization ability, and the ability to learn nonlinear features.
The above are only embodiments of the present invention and are not intended to limit the scope of the invention; any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the protection scope of the present invention.

Claims (9)

1. A gene expression prediction method based on LSTM, characterized by comprising:
preprocessing gene expression profile data;
feeding the preprocessed gene expression profile data into an LSTM recurrent neural network to construct a gene expression prediction model, using mean squared error as the loss function of the model, and updating the weights to train the model according to the standard back-propagation algorithm;
varying the parameter combinations of the LSTM recurrent neural network to train the model, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination; and
using the statistical coefficient of determination (R²) to measure the goodness of fit of the model.
2. The gene expression prediction method based on LSTM according to claim 1, characterized in that the preprocessing at least comprises:
removing repeated data in the original gene expression profile data with an unsupervised clustering algorithm;
converting the format of the de-duplicated gene expression profile data and saving the data in numpy format in python;
determining the gene annotations of the landmark genes and target genes in the gene expression profile data, and standardizing the format-converted gene expression profile data; wherein the standardization is Z-Score standardization.
3. The gene expression prediction method based on LSTM according to claim 2, characterized in that removing the repeated data in the original gene expression profile data with an unsupervised clustering algorithm means classifying the original gene expression profile data with the K-Means clustering algorithm, measuring the similarity between data within each class with the Euclidean distance so as to judge whether repeated data exist within the same class, and removing the repeated data when they exist; specifically comprising the steps of:
Step1: initializing K cluster centers;
Step2: calculating the distance from each gene expression profile sample datum to each cluster center, and assigning the sample datum to the cluster represented by the nearest cluster center;
Step3: calculating the coordinate average of all gene expression profile sample data in each cluster, and taking the coordinate average as the new cluster center;
Step4: repeating Step2 and Step3 until the movement of the cluster centers is less than a preset error value or the number of clustering iterations reaches a preset value;
Step5: measuring the Euclidean distance between gene expression profile data within each cluster; if the Euclidean distance between two gene expression profile data in a cluster is less than a set threshold, the pair of expression profiles is defined as a duplicate and one of them is deleted.
4. The gene expression prediction method based on LSTM according to claim 2, characterized in that the step of determining the gene annotations of the landmark genes and target genes in the gene expression profile data and standardizing the format-converted gene expression profile data comprises the steps of:
extracting, according to the gene coding annotations, the expression values of the 943 landmark gene probes and the 15744 target gene probes in the gene expression profile data;
determining the multiple probes in the gene expression profile data that correspond to the same gene coding annotation in the RNA-Seq data, and taking the average of the multiple probe expressions as the expression value of the gene expression profile data, which yields 9520 combined target genes with a one-to-one correspondence between the gene expression profile data and the RNA-Seq data;
standardizing the expression values of the 943 landmark genes and the 9520 combined target genes of the gene expression profile data with Z-Score;
for each expression profile of the RNA-Seq data, concatenating the expression values of the 943 landmark genes and the 9520 target genes, and standardizing the data with the Z-Score method.
5. The gene expression prediction method based on LSTM according to claim 1, characterized in that the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function comprises the steps of:
supposing there are N training samples, L landmark genes and T target genes, the training set is expressed as D = {(x^i, y^i)}_{i=1}^{N}, where x^i ∈ R^L denotes the expression values of the L landmark genes of the i-th sample and y^i ∈ R^T denotes the expression values of the T target genes of the i-th sample;
the LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby implementing the functions of forgetting and remembering; an LSTM unit has three such gates: the forget gate, the input gate and the output gate;
the forget gate implements, through a sigmoid layer, the forgetting of partial information in the cell state of the previous time step; during training of the LSTM recurrent neural network, the sigmoid function takes the previous output h_{t-1} and the current input x_t as inputs and generates a value in [0,1] for each item in the previous cell state C_{t-1}, indicating how much information is retained, where 1 means fully retain and 0 means fully discard; this value is multiplied with C_{t-1}; the update of f_t is shown in formula (1):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, b_f and W_f are the bias and input weights of the forget gate, and f_t indicates how much previous information is retained;
the input gate records new partial information into the cell state; its implementation has two parts: (1) an input gate layer determines the content i_t to be updated; (2) a tanh layer creates a candidate value vector C̃_t that is added to the cell state; the updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (3)
the cell state C_t is then updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
the output gate controls the output and is implemented as follows: (1) an output gate layer determines the content o_t of the new cell state to output; (2) the cell state is passed through a tanh layer and then multiplied by the output of the output gate layer to give h_t; the updates of o_t and h_t are shown in formulas (5) and (6):
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
mean squared error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} (ŷ_{i(t)} − y_{i(t)})² (7)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
6. The gene expression prediction method based on LSTM according to claim 1, characterized in that, after the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function, the method further comprises the step of updating the prediction model weights according to the standard back-propagation algorithm, accelerating model training with the Adam optimization algorithm and the Dropout technique, enhancing model robustness and reducing the prediction model error, comprising the steps of:
updating the prediction model weights according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the partial derivatives of the loss function with respect to all parameters are computed, and the gradient is calculated according to formula (8):
g_θ = ∂L/∂θ for each model parameter θ, where L is the loss function of formula (7) (8)
replacing traditional stochastic gradient descent with the Adam optimization algorithm in the back-propagation algorithm, iteratively updating the network weights of the prediction model based on the training data, and designing independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient;
adding the Dropout technique to the training process of the prediction model; during training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability.
7. The gene expression prediction method based on LSTM according to claim 1, characterized in that, in the step of varying the parameter combinations of the LSTM recurrent neural network to train the model, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination:
the mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} |ŷ_{i(t)} − y_{i(t)}| (9)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
8. The gene expression prediction method based on LSTM according to claim 1, characterized in that the coefficient of determination (R²) is a commonly used statistic in regression analysis and is often taken as the standard for measuring the predictive ability of a model; the relevant formulas are shown in (10)-(13):
Sum of Squares due to Error: SSE = Σ_{i=1}^{N} (y_i − ŷ_i)² (10)
Sum of Squares due to Regression: SSR = Σ_{i=1}^{N} (ŷ_i − ȳ)² (11)
Total Sum of Squares: SST = Σ_{i=1}^{N} (y_i − ȳ)² (12)
R² = 1 − SSE/SST (13)
where, in formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
9. The gene expression prediction method based on LSTM according to claim 8, characterized in that the value of R² lies in [0,1]: a model with an R² of 0 cannot predict the target variable at all, while a model with an R² of 1 predicts the target variable perfectly; a value between 0 and 1 indicates the proportion of the target variable that can be explained by the features in the model.
CN201810810239.3A 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network Active CN109063416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810810239.3A CN109063416B (en) 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810810239.3A CN109063416B (en) 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network

Publications (2)

Publication Number Publication Date
CN109063416A true CN109063416A (en) 2018-12-21
CN109063416B CN109063416B (en) 2019-08-27

Family

ID=64834852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810810239.3A Active CN109063416B (en) 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network

Country Status (1)

Country Link
CN (1) CN109063416B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978228A (en) * 2019-01-31 2019-07-05 中南大学 A kind of PM2.5 concentration prediction method, apparatus and medium
CN110070145A (en) * 2019-04-30 2019-07-30 天津开发区精诺瀚海数据科技有限公司 LSTM wheel hub single-item energy consumption prediction based on increment cluster
CN110111848A (en) * 2019-05-08 2019-08-09 南京鼓楼医院 A kind of human cyclin expressing gene recognition methods based on RNN-CNN neural network fusion algorithm
CN110502806A (en) * 2019-07-31 2019-11-26 电子科技大学 A kind of wireless frequency spectrum degree prediction technique based on LSTM network
CN111611835A (en) * 2019-12-23 2020-09-01 珠海大横琴科技发展有限公司 Ship detection method and device
CN111785326A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Method for predicting gene expression profile after drug action based on generation of confrontation network
CN113178234A (en) * 2021-02-23 2021-07-27 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN114418071A (en) * 2022-01-24 2022-04-29 中国光大银行股份有限公司 Cyclic neural network training method
CN116705150A (en) * 2023-06-05 2023-09-05 国家超级计算天津中心 Method, device, equipment and medium for determining gene expression efficiency

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003403B1 (en) * 2000-06-15 2006-02-21 The United States Of America As Represented By The Department Of Health And Human Services Quantifying gene relatedness via nonlinear prediction of gene
CN107025386A (en) * 2017-03-22 2017-08-08 杭州电子科技大学 A kind of method that gene association analysis is carried out based on deep learning algorithm
US20180144261A1 (en) * 2016-11-18 2018-05-24 NantOmics, LLC. Methods and systems for predicting dna accessibility in the pan-cancer genome
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center load predicting method based on shot and long term memory network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003403B1 (en) * 2000-06-15 2006-02-21 The United States Of America As Represented By The Department Of Health And Human Services Quantifying gene relatedness via nonlinear prediction of gene
US20180144261A1 (en) * 2016-11-18 2018-05-24 NantOmics, LLC. Methods and systems for predicting dna accessibility in the pan-cancer genome
CN107025386A (en) * 2017-03-22 2017-08-08 杭州电子科技大学 A kind of method that gene association analysis is carried out based on deep learning algorithm
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center load predicting method based on shot and long term memory network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BYUNGHAN LEE et al.: "DNA-Level Splice Junction Prediction using Deep Recurrent Neural Networks", Semantic Scholar *
李洪顺: "Research on deep learning models for predicting nucleotide-binding proteins using only sequence information", China Masters' Theses Full-text Database *
程国建: "Research on the application of neural networks in gene sequence prediction", Wanfang Data *
薛燕娜: "Application of machine learning algorithms in protein structure prediction", CNKI *
黄易初: "Research on protein domain boundary prediction based on deep learning", Wanfang Data *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978228A (en) * 2019-01-31 2019-07-05 中南大学 A kind of PM2.5 concentration prediction method, apparatus and medium
CN109978228B (en) * 2019-01-31 2023-12-12 中南大学 PM2.5 concentration prediction method, device and medium
CN110070145A (en) * 2019-04-30 2019-07-30 天津开发区精诺瀚海数据科技有限公司 LSTM wheel hub single-item energy consumption prediction based on increment cluster
CN110111848B (en) * 2019-05-08 2023-04-07 南京鼓楼医院 Human body cycle expression gene identification method based on RNN-CNN neural network fusion algorithm
CN110111848A (en) * 2019-05-08 2019-08-09 南京鼓楼医院 A kind of human cyclin expressing gene recognition methods based on RNN-CNN neural network fusion algorithm
CN110502806A (en) * 2019-07-31 2019-11-26 电子科技大学 A kind of wireless frequency spectrum degree prediction technique based on LSTM network
CN110502806B (en) * 2019-07-31 2022-03-15 电子科技大学 Wireless spectrum occupancy rate prediction method based on LSTM network
CN111611835A (en) * 2019-12-23 2020-09-01 珠海大横琴科技发展有限公司 Ship detection method and device
CN111785326A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Method for predicting gene expression profile after drug action based on generation of confrontation network
CN111785326B (en) * 2020-06-28 2024-02-06 西安电子科技大学 Gene expression profile prediction method after drug action based on generation of antagonism network
CN113178234A (en) * 2021-02-23 2021-07-27 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN113178234B (en) * 2021-02-23 2023-10-31 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN114418071A (en) * 2022-01-24 2022-04-29 中国光大银行股份有限公司 Cyclic neural network training method
CN116705150A (en) * 2023-06-05 2023-09-05 国家超级计算天津中心 Method, device, equipment and medium for determining gene expression efficiency

Also Published As

Publication number Publication date
CN109063416B (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN109063416B (en) Gene expression prediction method based on LSTM recurrent neural network
CN112101630A (en) Multi-target optimization method for injection molding process parameters of thin-wall plastic part
CN107992976B (en) Hot topic early development trend prediction system and prediction method
CN106453293A (en) Network security situation prediction method based on improved BPNN (back propagation neural network)
CN109523021A (en) A kind of dynamic network Structure Prediction Methods based on long memory network in short-term
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
CN109543731A (en) A kind of three preferred Semi-Supervised Regression algorithms under self-training frame
CN110990718B (en) Social network model building module of company image lifting system
CN111680786B (en) Time sequence prediction method based on improved weight gating unit
CN114547974A (en) Dynamic soft measurement modeling method based on input variable selection and LSTM neural network
CN110598902A (en) Water quality prediction method based on combination of support vector machine and KNN
CN116721537A (en) Urban short-time traffic flow prediction method based on GCN-IPSO-LSTM combination model
CN115188412A (en) Drug prediction algorithm based on Transformer and graph neural network
CN116451556A (en) Construction method of concrete dam deformation observed quantity statistical model
CN113103535A (en) GA-ELM-GA-based injection molding part mold parameter optimization method
CN113052373A (en) Monthly runoff change trend prediction method based on improved ELM model
Barman et al. A neuro-evolution approach to infer a Boolean network from time-series gene expressions
Haixiang et al. Optimizing reservoir features in oil exploration management based on fusion of soft computing
CN104899507A (en) Detecting method for abnormal intrusion of large high-dimensional data of network
Li et al. Solubility prediction of gases in polymers using fuzzy neural network based on particle swarm optimization algorithm and clustering method
Khajeh et al. Diffusion coefficient prediction of acids in water at infinite dilution by QSPR method
Li et al. Data cleaning method for the process of acid production with flue gas based on improved random forest
Liu et al. A quantitative study of the effect of missing data in classifiers
CN117350146A (en) GA-BP neural network-based drainage pipe network health evaluation method
CN116542382A (en) Sewage treatment dissolved oxygen concentration prediction method based on mixed optimization algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant