CN109063416A - Gene expression prediction method based on LSTM recurrent neural network - Google Patents

Gene expression prediction method based on LSTM recurrent neural network

Info

Publication number
CN109063416A
Authority
CN
China
Prior art keywords
gene expression
data
gene
value
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810810239.3A
Other languages
Chinese (zh)
Other versions
CN109063416B (en)
Inventor
王会青
李春
董春林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201810810239.3A priority Critical patent/CN109063416B/en
Publication of CN109063416A publication Critical patent/CN109063416A/en
Application granted granted Critical
Publication of CN109063416B publication Critical patent/CN109063416B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a gene expression prediction method based on an LSTM recurrent neural network. An LSTM recurrent neural network is introduced into the construction of the gene expression prediction model to reduce prediction error, and data preprocessing methods such as K-Means clustering and Z-Score standardization are applied to improve training efficiency. Combined with the about 1000 known landmark genes, the LSTM recurrent neural network extracts the nonlinear features of gene expression profile data, a nonlinear prediction model is constructed, and finally the expression of about 21000 target genes is predicted. The invention effectively solves the problem that the LINCS program uses linear regression as its estimation method and therefore ignores the nonlinear characteristics of gene expression profiles; it reduces the cost of obtaining gene expression profiles and improves the efficiency of gene expression research.

Description

Gene expression prediction method based on LSTM recurrent neural network
Technical field
The present invention relates to the field of gene expression research and analysis, and more particularly to a gene expression prediction method based on an LSTM recurrent neural network.
Background art
Combining molecular biology with machine learning and deep learning from the computer field has made gene expression profiles widely used in gene association analysis, superior crop breeding, complex disease subtyping, disease-related gene discovery, drug screening and so on. However, obtaining gene expression profiles is expensive and the process is complicated, so at present only a small number of well-funded laboratories are able to carry out large-scale genome-wide expression profiling. The LINCS program currently uses linear regression (LR) as its gene expression prediction method, but for complicated gene expression profile data the LR method ignores the nonlinear factors in the expression profile, which reduces the accuracy of the final experimental results.
Summary of the invention
The object of the present invention is to avoid the deficiencies of the prior art and to provide a gene expression prediction method based on an LSTM recurrent neural network.
The object of the present invention can be achieved by the following technical measures: a gene expression prediction method based on an LSTM recurrent neural network is designed, comprising: preprocessing gene expression profile data; feeding the preprocessed gene expression profile data into an LSTM recurrent neural network to construct a gene expression prediction model, using mean squared error as the loss function of the model, and updating the weights to train the model according to the standard back-propagation algorithm; varying the parameter combinations of the LSTM recurrent neural network to train the model, and using mean absolute error as the performance evaluation index of the prediction model to test the model error under each parameter combination; and using the statistical coefficient of determination (R²) to measure the goodness of fit of the model.
Wherein, the preprocessing at least comprises: removing repeated data in the original gene expression profile data with an unsupervised clustering algorithm; converting the format of the de-duplicated gene expression profile data and saving the data in numpy format in python; determining the gene annotations of the landmark genes and target genes in the gene expression profile data, and standardizing the format-converted gene expression profile data; wherein the standardization is Z-Score standardization.
Wherein, removing the repeated data in the original gene expression profile data with an unsupervised clustering algorithm means classifying the original gene expression profile data with the K-Means clustering algorithm and using the Euclidean distance to measure the similarity between data within each class, so as to judge whether repeated data exist within the same class and to remove the repeated data when they exist; this specifically comprises the steps of:
Step1: initialize K cluster centers;
Step2: calculate the distance from each gene expression profile sample datum to each cluster center, and assign the sample datum to the cluster represented by the nearest cluster center;
Step3: calculate the coordinate average of all gene expression profile sample data in each cluster, and take the coordinate average as the new cluster center;
Step4: repeat Step2 and Step3 until the movement of the cluster centers is less than a preset error value or the number of clustering iterations reaches a preset value;
Step5: measure the Euclidean distance between gene expression profile data within each cluster; if the Euclidean distance between two gene expression profile data in a cluster is less than a set threshold, the pair of expression profiles is defined as a duplicate and one of them is deleted.
Wherein, the step of determining the gene annotations of the landmark genes and target genes in the gene expression profile data and standardizing the format-converted gene expression profile data comprises the steps of:
extracting, according to the gene coding annotations, the expression values of the 943 landmark gene probes and the 15744 target gene probes in the gene expression profile data; determining the multiple probes in the gene expression profile data that correspond to the same gene coding annotation in the RNA-Seq data, and taking the average of the multiple probe expressions as the expression value of the gene expression profile data, which yields 9520 combined target genes with a one-to-one correspondence between the gene expression profile data and the RNA-Seq data; standardizing the expression values of the 943 landmark genes and the 9520 combined target genes of the gene expression profile data with Z-Score; and, for each expression profile of the RNA-Seq data, concatenating the expression values of the 943 landmark genes and the 9520 target genes and standardizing the data with the Z-Score method.
Wherein, the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function comprises the steps of:
Suppose there are N training samples, L landmark genes and T target genes; the training set is expressed as D = {(x^i, y^i)}_{i=1}^{N}, where x^i ∈ R^L denotes the expression values of the L landmark genes of the i-th sample and y^i ∈ R^T denotes the expression values of the T target genes of the i-th sample;
The LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby implementing the functions of forgetting and remembering; an LSTM unit has three such gates: the forget gate, the input gate and the output gate;
The forget gate implements, through a sigmoid layer, the forgetting of partial information in the cell state of the previous time step; during training of the LSTM recurrent neural network, with the previous output h_{t-1} and the current input x_t as inputs, a value in [0,1] is generated for each item in the previous cell state C_{t-1}, indicating how much information is retained, where 1 means fully retain and 0 means fully discard; this value is multiplied with C_{t-1}; the update of f_t is shown in formula (1):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, b_f and W_f are the bias and input weights of the forget gate, and f_t indicates how much previous information is retained;
The input gate records new partial information into the cell state; its implementation has two parts: (1) an input gate layer determines the content i_t to be updated; (2) a tanh layer creates a candidate value vector C̃_t that is added to the cell state; the updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (3)
The cell state C_t is then updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
The output gate controls the output and is implemented as follows: (1) an output gate layer determines the content o_t of the new cell state to output; (2) the cell state is passed through a tanh layer and then multiplied by the output of the output gate layer to give h_t; the updates of o_t and h_t are shown in formulas (5) and (6):
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
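The gate updates of formulas (1)-(6) can be illustrated with the following minimal numpy sketch of a single LSTM cell step; the hidden size, input dimension and random weight initialization are illustrative assumptions and not part of the claimed method.

```python
# Minimal numpy sketch of one LSTM cell step implementing formulas (1)-(6);
# weights are randomly initialized purely for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # W and b hold the parameters of the four layers: forget (f), input (i),
    # candidate (C) and output (o).
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate, formula (1)
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate, formula (2)
    C_hat = np.tanh(W["C"] @ z + b["C"])         # candidate values, formula (3)
    C_t = f_t * C_prev + i_t * C_hat             # cell state update, formula (4)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate, formula (5)
    h_t = o_t * np.tanh(C_t)                     # output, formula (6)
    return h_t, C_t

# Illustrative shapes: input dimension 1, hidden size 4.
rng = np.random.default_rng(0)
hidden, dim = 4, 1
W = {k: rng.standard_normal((hidden, hidden + dim)) for k in "fiCo"}
b = {k: np.zeros(hidden) for k in "fiCo"}
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(np.array([0.5]), h, C, W, b)
```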
Mean squared error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} (ŷ_{i(t)} − y_{i(t)})² (7)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
Wherein, after the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function, the method further includes the step of updating the prediction model weights according to the standard back-propagation algorithm, accelerating model training with the Adam optimization algorithm and the Dropout technique, enhancing model robustness and reducing the prediction model error, comprising the steps of:
updating the prediction model weights according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the partial derivatives of the loss function with respect to all parameters are computed, and the gradient is calculated according to formula (8):
g_θ = ∂L/∂θ for each model parameter θ, where L is the loss function of formula (7) (8)
replacing traditional stochastic gradient descent with the Adam optimization algorithm in the back-propagation algorithm, iteratively updating the network weights of the prediction model based on the training data, and designing independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient;
adding the Dropout technique to the training process of the prediction model; during training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability.
Wherein, in the step of varying the parameter combinations of the LSTM recurrent neural network to train the model, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination:
the mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} |ŷ_{i(t)} − y_{i(t)}| (9)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
Wherein, the coefficient of determination (R²) is a commonly used statistic in regression analysis and is often taken as the standard for measuring the predictive ability of a model; the relevant formulas are shown in (10)-(13):
Sum of Squares due to Error: SSE = Σ_{i=1}^{N} (y_i − ŷ_i)² (10)
Sum of Squares due to Regression: SSR = Σ_{i=1}^{N} (ŷ_i − ȳ)² (11)
Total Sum of Squares: SST = Σ_{i=1}^{N} (y_i − ȳ)² (12)
R² = 1 − SSE/SST (13)
In formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
The value of R² lies in [0,1]: a model with an R² of 0 cannot predict the target variable at all, while a model with an R² of 1 predicts the target variable perfectly; a value between 0 and 1 indicates the proportion of the target variable that can be explained by the features in the model.
Different from the prior art, the gene expression prediction method based on an LSTM recurrent neural network of the present invention performs data preprocessing with methods such as K-Means clustering and standardization to improve model training efficiency; combined with the about 1000 known landmark genes, nonlinear features are extracted with the LSTM recurrent neural network, a nonlinear prediction model is constructed, and finally the expression of about 21000 target genes is inferred, which effectively solves the problem that the LINCS program uses linear regression as its prediction method and ignores the nonlinear characteristics of gene expression profiles. The dropout strategy is used in model training to improve the generalization ability and robustness of the prediction model; the prediction performance of the gene expression prediction model is tested with data from different platforms and the experimental results are evaluated with different metrics, further verifying the advantages of the proposed gene expression prediction model in performance, cross-platform generalization ability, and the ability to learn nonlinear features.
Brief description of the drawings
Fig. 1 is a flow diagram of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention;
Fig. 2 is a diagram of the mean absolute error on the GEO dataset of the prediction model under different parameter combinations of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention;
Fig. 3 is a comparison of the mean absolute error on the GEO dataset between the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention and the existing linear regression gene expression prediction model (LR) and k-nearest-neighbor gene expression prediction model (KNN-GE);
Fig. 4 is a diagram of the mean absolute error on the GTEx/1000G dataset of the prediction model under different parameter combinations of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention;
Fig. 5 is a comparison of the mean absolute error on the GTEx/1000G dataset between the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention and the existing linear regression gene expression prediction model (LR) and k-nearest-neighbor gene expression prediction model (KNN-GE);
Fig. 6 is a comparison of the coefficient of determination (R²) between the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention and the linear regression gene expression prediction model (LR).
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative labor shall fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a flow diagram of the gene expression prediction method based on an LSTM recurrent neural network provided by the present invention. The method comprises the following steps:
S110: preprocess the gene expression profile data.
The preprocessing at least comprises:
1. Removing repeated data in the original gene expression profile data with an unsupervised clustering algorithm.
The original gene expression profile data are divided into multiple classes with the K-Means clustering algorithm, the Euclidean distance is then used to measure the similarity between data within each class so as to judge whether repeated data exist within the same class, and finally the repeated data are removed. The procedure comprises the following steps (a minimal Python sketch is given after the steps below):
Step1: initialize K cluster centers;
Step2: calculate the distance from each gene expression profile sample to each cluster center, and assign the sample to the cluster represented by the nearest cluster center;
Step3: calculate the coordinate average of all gene expression profile samples in each cluster, and take this average as the new cluster center;
Step4: repeat Step2 and Step3 until the movement of the cluster centers is less than a certain error value or the number of clustering iterations reaches a preset value;
Step5: measure the Euclidean distance between gene expression profiles within each cluster; if the Euclidean distance between two expression profiles is less than the set threshold, the pair of expression profiles is defined as a duplicate and one of them is deleted.
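The following is a minimal Python sketch of this duplicate-removal procedure; the array name `profiles`, the number of clusters and the duplicate threshold are illustrative assumptions rather than values specified by the invention.

```python
# Minimal sketch of Step1-Step5: K-Means clustering followed by within-cluster
# Euclidean-distance duplicate removal; K and the threshold are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

def remove_duplicates(profiles, n_clusters=50, threshold=1e-3):
    # Step1-Step4: K-Means handles initialization, assignment and center
    # updates internally until convergence.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(profiles)

    keep = np.ones(len(profiles), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        # Step5: pairwise Euclidean distances within the cluster; pairs closer
        # than the threshold are treated as duplicates and one is dropped.
        dist = squareform(pdist(profiles[idx], metric="euclidean"))
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if keep[idx[a]] and keep[idx[b]] and dist[a, b] < threshold:
                    keep[idx[b]] = False
    return profiles[keep]
```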
2. Converting the format of the de-duplicated gene expression profile data and saving the data in numpy format in python.
3. Determining the gene annotations of the landmark genes and target genes in the gene expression profile data, and standardizing the format-converted gene expression profile data; wherein the standardization is Z-Score standardization.
According to the gene coding annotations, the expression values of the 943 landmark gene probes and the 15744 target gene probes in the gene expression profile data are extracted.
For the multiple probes in the gene expression profile data that correspond to the same gene coding annotation in the RNA-Seq data, the average of the multiple probe expressions is taken as the expression value of the gene, which finally yields 9520 combined target genes with a one-to-one correspondence between the gene expression profile platform and the RNA-Seq platform.
The expression values of the 943 landmark genes and the 9520 combined target genes of the gene expression profile data are standardized with Z-Score.
For each expression profile of the RNA-Seq data, the expression values of the 943 landmark genes and the 9520 target genes are concatenated, and the data are standardized with the Z-Score method.
The Z-Score standardization is x'_{ij} = (x_{ij} − μ_j) / σ_j, where μ_j is the mean of all sample data in the j-th column and σ_j is the standard deviation of all sample data in the j-th column; a minimal sketch of this standardization is given below.
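```python
# Minimal sketch of the Z-Score standardization step, assuming `data` is a
# numpy array of shape (n_samples, n_genes); column means and standard
# deviations (mu_j, sigma_j) are computed from the data themselves.
import numpy as np

def z_score(data):
    mu = data.mean(axis=0)        # mean of each gene column (mu_j)
    sigma = data.std(axis=0)      # standard deviation of each column (sigma_j)
    sigma[sigma == 0] = 1.0       # guard against constant columns
    return (data - mu) / sigma
```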
S120: feed the preprocessed gene expression profile data into the LSTM recurrent neural network, construct the gene expression prediction model, use mean squared error as the loss function of the model, and update the weights to train the model according to the standard back-propagation algorithm.
Gene expression prediction can be regarded as a multi-task regression problem. Suppose there are N training samples, L landmark genes and T target genes; the training set is expressed as D = {(x^i, y^i)}_{i=1}^{N}, where x^i ∈ R^L denotes the expression values of the L landmark genes of the i-th sample and y^i ∈ R^T denotes the expression values of the T target genes of the i-th sample. The purpose of the present invention is to build a mapping model from R^L to R^T.
The Long Short-Term Memory network, abbreviated LSTM, is a special type of RNN that can learn long-term dependencies. The LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby implementing the functions of forgetting and remembering. An LSTM unit has three such gates: the forget gate, the input gate and the output gate.
The forget gate implements, through a sigmoid layer, the forgetting of partial information in the cell state of the previous time step. With the previous output h_{t-1} and the current input x_t as inputs, a value in [0,1] is generated for each item in C_{t-1}, indicating how much information is retained (1 means fully retain, 0 means fully discard), and this value is then multiplied with C_{t-1}. The update of f_t is shown in formula (1):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, and b_f and W_f are the bias and input weights of the forget gate.
The input gate determines what is stored in the cell state by recording new partial information into it. Its implementation has two parts: (1) a sigmoid layer (the input gate layer) determines which values to update; (2) a tanh layer creates a candidate value vector C̃_t that will be added to the cell state. The updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (3)
The cell state C_t is then updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
where f_t indicates how much previous information is retained, i_t indicates which values are updated, and C̃_t is the new candidate value.
The output gate controls the output and is implemented as follows: (1) a sigmoid layer (the output gate layer) determines which parts of the new cell state to output; (2) the cell state is passed through a tanh layer and then multiplied by the output of the sigmoid layer. The updates are shown in formulas (5) and (6):
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
Mean squared error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} (ŷ_{i(t)} − y_{i(t)})² (7)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
The prediction model weights are updated according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the key point being to compute the partial derivatives of the loss function with respect to all parameters. The gradient is calculated according to formula (8):
g_θ = ∂L/∂θ for each model parameter θ, where L is the loss function of formula (7) (8)
In the back-propagation algorithm, the Adam optimization algorithm replaces traditional stochastic gradient descent; it iteratively updates the network weights of the prediction model based on the training data and designs independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient.
The Dropout technique is added to the training process of the prediction model. During training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability. Since the hidden nodes that appear are random each time a sample is fed in for a weight update, it cannot be guaranteed that any two hidden nodes appear together every time; the weight updates therefore no longer depend on the joint action of hidden nodes with fixed relationships, which prevents situations in which certain features are only effective in the presence of other specific features and improves the generalization ability of the model. A minimal sketch of the model construction and training step is given below.
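The following Keras sketch illustrates constructing and training an LSTM-based prediction model with an MSE loss, the Adam optimizer and Dropout; the hidden size, dropout rate, learning rate, batch size, epoch count, the array names `x_train`/`y_train`, and the way the landmark vector is presented to the LSTM (here as a length-L sequence with one feature per step) are all illustrative assumptions, not parameters fixed by the invention.

```python
# Minimal sketch of the LSTM gene expression prediction model: input is the
# expression of L=943 landmark genes, output is the expression of T=9520
# target genes; sizes and hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

L, T = 943, 9520

model = Sequential([
    LSTM(256, input_shape=(L, 1)),   # extracts nonlinear features of the landmark profile
    Dropout(0.25),                   # Dropout to improve generalization and robustness
    Dense(T)                         # predicted expression of the T target genes
])
model.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")

# x_train: array of shape (N, L) with landmark expression values,
# y_train: array of shape (N, T) with target expression values.
# model.fit(x_train.reshape(-1, L, 1), y_train, batch_size=200, epochs=100)
```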
S140: vary the parameter combinations of the LSTM recurrent neural network to train the model, use mean absolute error as the performance evaluation index of the prediction model, and test the model error under each parameter combination.
The mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} |ŷ_{i(t)} − y_{i(t)}| (9)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample. The error results of the prediction model under different parameter combinations are shown in Figs. 2-6. Fig. 2 shows the mean absolute error on the GEO dataset of the prediction model of the present invention under different parameter combinations. Fig. 3 compares the mean absolute error on the GEO dataset of the method of the present invention with the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE). Fig. 4 shows the mean absolute error on the GTEx/1000G dataset of the prediction model of the present invention under different parameter combinations. Fig. 5 compares the mean absolute error on the GTEx/1000G dataset of the method of the present invention with the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE). Fig. 6 compares the coefficient of determination (R²) of the present invention with that of the linear regression gene expression prediction model (LR). A minimal sketch of computing the mean absolute error is given below.
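```python
# Minimal sketch of the mean-absolute-error evaluation of formula (9),
# assuming y_true and y_pred are numpy arrays of shape (N, T).
import numpy as np

def mae_per_gene(y_true, y_pred):
    # MAE of each target gene t over the N test samples
    return np.abs(y_true - y_pred).mean(axis=0)

def overall_mae(y_true, y_pred):
    # average over all samples and all T target genes
    return np.abs(y_true - y_pred).mean()
```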
S150: use the statistical coefficient of determination (R²) to measure the goodness of fit of the model.
The coefficient of determination (R²) is a commonly used statistic in regression analysis and is often taken as the standard for measuring the predictive ability of a model. The relevant formulas are shown in (10)-(13):
Sum of Squares due to Error: SSE = Σ_{i=1}^{N} (y_i − ŷ_i)² (10)
Sum of Squares due to Regression: SSR = Σ_{i=1}^{N} (ŷ_i − ȳ)² (11)
Total Sum of Squares: SST = Σ_{i=1}^{N} (y_i − ȳ)² (12)
R² = 1 − SSE/SST (13)
In formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
The value of R² lies in [0,1]: a model with an R² of 0 cannot predict the target variable at all, while a model with an R² of 1 predicts the target variable perfectly; a value between 0 and 1 indicates the proportion of the target variable that can be explained by the features in the model. The comparison of the coefficient of determination between the linear regression method and the gene expression prediction model based on the LSTM recurrent neural network proposed herein is shown in the accompanying drawings; a minimal sketch of computing this metric follows.
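```python
# Minimal sketch of the coefficient-of-determination computation of formulas
# (10)-(13), assuming y_true and y_pred are 1-D numpy arrays of the true and
# predicted expression values over the test samples.
import numpy as np

def r_squared(y_true, y_pred):
    y_mean = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)   # Sum of Squares due to Error, formula (10)
    ssr = np.sum((y_pred - y_mean) ** 2)   # Sum of Squares due to Regression, formula (11)
    sst = np.sum((y_true - y_mean) ** 2)   # Total Sum of Squares, formula (12)
    return 1.0 - sse / sst                 # R^2, formula (13)
```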
Different from the prior art, the gene expression prediction method based on an LSTM recurrent neural network of the present invention performs data preprocessing with methods such as K-Means clustering and standardization to improve model training efficiency; combined with the about 1000 known landmark genes, nonlinear features are extracted with the LSTM recurrent neural network, a nonlinear prediction model is constructed, and finally the expression of about 21000 target genes is inferred, which effectively solves the problem that the LINCS program uses linear regression as its prediction method and ignores the nonlinear characteristics of gene expression profiles. The dropout strategy is used in model training to improve the generalization ability and robustness of the prediction model; the prediction performance of the gene expression prediction model is tested with data from different platforms and the experimental results are evaluated with different metrics, further verifying the advantages of the proposed gene expression prediction model in performance, cross-platform generalization ability, and the ability to learn nonlinear features.
The above are only embodiments of the present invention and are not intended to limit the scope of the invention; any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the protection scope of the present invention.

Claims (9)

1. A gene expression prediction method based on LSTM, characterized by comprising:
preprocessing gene expression profile data;
feeding the preprocessed gene expression profile data into an LSTM recurrent neural network to construct a gene expression prediction model, using mean squared error as the loss function of the model, and updating the weights to train the model according to the standard back-propagation algorithm;
varying the parameter combinations of the LSTM recurrent neural network to train the model, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination; and
using the statistical coefficient of determination (R²) to measure the goodness of fit of the model.
2. The gene expression prediction method based on LSTM according to claim 1, characterized in that the preprocessing at least comprises:
removing repeated data in the original gene expression profile data with an unsupervised clustering algorithm;
converting the format of the de-duplicated gene expression profile data and saving the data in numpy format in python;
determining the gene annotations of the landmark genes and target genes in the gene expression profile data, and standardizing the format-converted gene expression profile data; wherein the standardization is Z-Score standardization.
3. The gene expression prediction method based on LSTM according to claim 2, characterized in that removing the repeated data in the original gene expression profile data with an unsupervised clustering algorithm means classifying the original gene expression profile data with the K-Means clustering algorithm, measuring the similarity between data within each class with the Euclidean distance so as to judge whether repeated data exist within the same class, and removing the repeated data when they exist; specifically comprising the steps of:
Step1: initializing K cluster centers;
Step2: calculating the distance from each gene expression profile sample datum to each cluster center, and assigning the sample datum to the cluster represented by the nearest cluster center;
Step3: calculating the coordinate average of all gene expression profile sample data in each cluster, and taking the coordinate average as the new cluster center;
Step4: repeating Step2 and Step3 until the movement of the cluster centers is less than a preset error value or the number of clustering iterations reaches a preset value;
Step5: measuring the Euclidean distance between gene expression profile data within each cluster; if the Euclidean distance between two gene expression profile data in a cluster is less than a set threshold, the pair of expression profiles is defined as a duplicate and one of them is deleted.
4. The gene expression prediction method based on LSTM according to claim 2, characterized in that the step of determining the gene annotations of the landmark genes and target genes in the gene expression profile data and standardizing the format-converted gene expression profile data comprises the steps of:
extracting, according to the gene coding annotations, the expression values of the 943 landmark gene probes and the 15744 target gene probes in the gene expression profile data;
determining the multiple probes in the gene expression profile data that correspond to the same gene coding annotation in the RNA-Seq data, and taking the average of the multiple probe expressions as the expression value of the gene expression profile data, which yields 9520 combined target genes with a one-to-one correspondence between the gene expression profile data and the RNA-Seq data;
standardizing the expression values of the 943 landmark genes and the 9520 combined target genes of the gene expression profile data with Z-Score;
for each expression profile of the RNA-Seq data, concatenating the expression values of the 943 landmark genes and the 9520 target genes, and standardizing the data with the Z-Score method.
5. The gene expression prediction method based on LSTM according to claim 1, characterized in that the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function comprises the steps of:
supposing there are N training samples, L landmark genes and T target genes, the training set is expressed as D = {(x^i, y^i)}_{i=1}^{N}, where x^i ∈ R^L denotes the expression values of the L landmark genes of the i-th sample and y^i ∈ R^T denotes the expression values of the T target genes of the i-th sample;
the LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby implementing the functions of forgetting and remembering; an LSTM unit has three such gates: the forget gate, the input gate and the output gate;
the forget gate implements, through a sigmoid layer, the forgetting of partial information in the cell state of the previous time step; during training of the LSTM recurrent neural network, the sigmoid function takes the previous output h_{t-1} and the current input x_t as inputs and generates a value in [0,1] for each item in the previous cell state C_{t-1}, indicating how much information is retained, where 1 means fully retain and 0 means fully discard; this value is multiplied with C_{t-1}; the update of f_t is shown in formula (1):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, b_f and W_f are the bias and input weights of the forget gate, and f_t indicates how much previous information is retained;
the input gate records new partial information into the cell state; its implementation has two parts: (1) an input gate layer determines the content i_t to be updated; (2) a tanh layer creates a candidate value vector C̃_t that is added to the cell state; the updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (3)
the cell state C_t is then updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
the output gate controls the output and is implemented as follows: (1) an output gate layer determines the content o_t of the new cell state to output; (2) the cell state is passed through a tanh layer and then multiplied by the output of the output gate layer to give h_t; the updates of o_t and h_t are shown in formulas (5) and (6):
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
mean squared error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} (ŷ_{i(t)} − y_{i(t)})² (7)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
6. The gene expression prediction method based on LSTM according to claim 1, characterized in that, after the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model and using mean squared error as the model loss function, the method further comprises the step of updating the prediction model weights according to the standard back-propagation algorithm, accelerating model training with the Adam optimization algorithm and the Dropout technique, enhancing model robustness and reducing the prediction model error, comprising the steps of:
updating the prediction model weights according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the partial derivatives of the loss function with respect to all parameters are computed, and the gradient is calculated according to formula (8):
g_θ = ∂L/∂θ for each model parameter θ, where L is the loss function of formula (7) (8)
replacing traditional stochastic gradient descent with the Adam optimization algorithm in the back-propagation algorithm, iteratively updating the network weights of the prediction model based on the training data, and designing independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient;
adding the Dropout technique to the training process of the prediction model; during training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability.
7. The gene expression prediction method based on LSTM according to claim 1, characterized in that, in the step of varying the parameter combinations of the LSTM recurrent neural network to train the model, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination:
the mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} |ŷ_{i(t)} − y_{i(t)}| (9)
where N is the number of test samples, ŷ_{i(t)} denotes the predicted expression value of target gene t for the i-th sample, and y_{i(t)} denotes the true expression value of target gene t for the i-th sample.
8. The gene expression prediction method based on LSTM according to claim 1, characterized in that the coefficient of determination (R²) is a commonly used statistic in regression analysis and is often taken as the standard for measuring the predictive ability of a model; the relevant formulas are shown in (10)-(13):
Sum of Squares due to Error: SSE = Σ_{i=1}^{N} (y_i − ŷ_i)² (10)
Sum of Squares due to Regression: SSR = Σ_{i=1}^{N} (ŷ_i − ȳ)² (11)
Total Sum of Squares: SST = Σ_{i=1}^{N} (y_i − ȳ)² (12)
R² = 1 − SSE/SST (13)
where, in formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
9. The gene expression prediction method based on LSTM according to claim 8, characterized in that the value of R² lies in [0,1]: a model with an R² of 0 cannot predict the target variable at all, while a model with an R² of 1 predicts the target variable perfectly; a value between 0 and 1 indicates the proportion of the target variable that can be explained by the features in the model.
CN201810810239.3A 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network Active CN109063416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810810239.3A CN109063416B (en) 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810810239.3A CN109063416B (en) 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network

Publications (2)

Publication Number Publication Date
CN109063416A true CN109063416A (en) 2018-12-21
CN109063416B CN109063416B (en) 2019-08-27

Family

ID=64834852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810810239.3A Active CN109063416B (en) 2018-07-23 2018-07-23 Gene expression prediction method based on LSTM recurrent neural network

Country Status (1)

Country Link
CN (1) CN109063416B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978228A (en) * 2019-01-31 2019-07-05 中南大学 A kind of PM2.5 concentration prediction method, apparatus and medium
CN110070145A (en) * 2019-04-30 2019-07-30 天津开发区精诺瀚海数据科技有限公司 LSTM wheel hub single-item energy consumption prediction based on increment cluster
CN110111848A (en) * 2019-05-08 2019-08-09 南京鼓楼医院 A kind of human cyclin expressing gene recognition methods based on RNN-CNN neural network fusion algorithm
CN110502806A (en) * 2019-07-31 2019-11-26 电子科技大学 A kind of wireless frequency spectrum degree prediction technique based on LSTM network
CN111611835A (en) * 2019-12-23 2020-09-01 珠海大横琴科技发展有限公司 Ship detection method and device
CN111785326A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Method for predicting gene expression profile after drug action based on generation of confrontation network
CN113178234A (en) * 2021-02-23 2021-07-27 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN114418071A (en) * 2022-01-24 2022-04-29 中国光大银行股份有限公司 Cyclic neural network training method
CN116705150A (en) * 2023-06-05 2023-09-05 国家超级计算天津中心 Method, device, equipment and medium for determining gene expression efficiency

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003403B1 (en) * 2000-06-15 2006-02-21 The United States Of America As Represented By The Department Of Health And Human Services Quantifying gene relatedness via nonlinear prediction of gene
CN107025386A (en) * 2017-03-22 2017-08-08 杭州电子科技大学 A kind of method that gene association analysis is carried out based on deep learning algorithm
US20180144261A1 (en) * 2016-11-18 2018-05-24 NantOmics, LLC. Methods and systems for predicting dna accessibility in the pan-cancer genome
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center load predicting method based on shot and long term memory network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003403B1 (en) * 2000-06-15 2006-02-21 The United States Of America As Represented By The Department Of Health And Human Services Quantifying gene relatedness via nonlinear prediction of gene
US20180144261A1 (en) * 2016-11-18 2018-05-24 NantOmics, LLC. Methods and systems for predicting dna accessibility in the pan-cancer genome
CN107025386A (en) * 2017-03-22 2017-08-08 杭州电子科技大学 A kind of method that gene association analysis is carried out based on deep learning algorithm
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center load predicting method based on shot and long term memory network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BYUNGHAN LEE et al.: "DNA-Level Splice Junction Prediction using Deep Recurrent Neural Networks", Semantic Scholar *
李洪顺: "Research on deep learning models for predicting nucleotide-binding proteins using only sequence information", China Masters' Theses Full-text Database *
程国建: "Research on the application of neural networks in gene sequence prediction", Wanfang Data *
薛燕娜: "Application of machine learning algorithms in protein structure prediction", CNKI *
黄易初: "Research on protein domain boundary prediction based on deep learning", Wanfang Data *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978228A (en) * 2019-01-31 2019-07-05 中南大学 A kind of PM2.5 concentration prediction method, apparatus and medium
CN109978228B (en) * 2019-01-31 2023-12-12 中南大学 PM2.5 concentration prediction method, device and medium
CN110070145A (en) * 2019-04-30 2019-07-30 天津开发区精诺瀚海数据科技有限公司 LSTM wheel hub single-item energy consumption prediction based on increment cluster
CN110111848B (en) * 2019-05-08 2023-04-07 南京鼓楼医院 Human body cycle expression gene identification method based on RNN-CNN neural network fusion algorithm
CN110111848A (en) * 2019-05-08 2019-08-09 南京鼓楼医院 A kind of human cyclin expressing gene recognition methods based on RNN-CNN neural network fusion algorithm
CN110502806A (en) * 2019-07-31 2019-11-26 电子科技大学 A kind of wireless frequency spectrum degree prediction technique based on LSTM network
CN110502806B (en) * 2019-07-31 2022-03-15 电子科技大学 Wireless spectrum occupancy rate prediction method based on LSTM network
CN111611835A (en) * 2019-12-23 2020-09-01 珠海大横琴科技发展有限公司 Ship detection method and device
CN111785326A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Method for predicting gene expression profile after drug action based on generation of confrontation network
CN111785326B (en) * 2020-06-28 2024-02-06 西安电子科技大学 Gene expression profile prediction method after drug action based on generation of antagonism network
CN113178234A (en) * 2021-02-23 2021-07-27 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN113178234B (en) * 2021-02-23 2023-10-31 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN114418071A (en) * 2022-01-24 2022-04-29 中国光大银行股份有限公司 Cyclic neural network training method
CN116705150A (en) * 2023-06-05 2023-09-05 国家超级计算天津中心 Method, device, equipment and medium for determining gene expression efficiency

Also Published As

Publication number Publication date
CN109063416B (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN109063416B (en) Gene expression prediction method based on LSTM recurrent neural network
CN112101630A (en) Multi-target optimization method for injection molding process parameters of thin-wall plastic part
CN107992976B (en) Hot topic early development trend prediction system and prediction method
CN106453293A (en) Network security situation prediction method based on improved BPNN (back propagation neural network)
CN109523021A (en) A kind of dynamic network Structure Prediction Methods based on long memory network in short-term
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
CN109543731A (en) A kind of three preferred Semi-Supervised Regression algorithms under self-training frame
CN110990718B (en) Social network model building module of company image lifting system
CN111680786B (en) Time sequence prediction method based on improved weight gating unit
CN114547974A (en) Dynamic soft measurement modeling method based on input variable selection and LSTM neural network
CN110598902A (en) Water quality prediction method based on combination of support vector machine and KNN
CN116721537A (en) Urban short-time traffic flow prediction method based on GCN-IPSO-LSTM combination model
CN115188412A (en) Drug prediction algorithm based on Transformer and graph neural network
CN116451556A (en) Construction method of concrete dam deformation observed quantity statistical model
CN113103535A (en) GA-ELM-GA-based injection molding part mold parameter optimization method
CN113052373A (en) Monthly runoff change trend prediction method based on improved ELM model
Barman et al. A neuro-evolution approach to infer a Boolean network from time-series gene expressions
Haixiang et al. Optimizing reservoir features in oil exploration management based on fusion of soft computing
CN104899507A (en) Detecting method for abnormal intrusion of large high-dimensional data of network
Li et al. Solubility prediction of gases in polymers using fuzzy neural network based on particle swarm optimization algorithm and clustering method
Khajeh et al. Diffusion coefficient prediction of acids in water at infinite dilution by QSPR method
Li et al. Data cleaning method for the process of acid production with flue gas based on improved random forest
Liu et al. A quantitative study of the effect of missing data in classifiers
CN117350146A (en) GA-BP neural network-based drainage pipe network health evaluation method
CN116542382A (en) Sewage treatment dissolved oxygen concentration prediction method based on mixed optimization algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant