CN109063416A - Gene expression prediction method based on an LSTM recurrent neural network - Google Patents
Gene expression prediction method based on an LSTM recurrent neural network Download PDF Info
- Publication number
- CN109063416A (application CN201810810239.3A / CN201810810239A)
- Authority
- CN
- China
- Prior art keywords
- gene expression
- data
- gene
- value
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a gene expression prediction method based on an LSTM recurrent neural network. Introducing an LSTM recurrent neural network into the construction of the gene expression prediction model reduces the model's prediction error, while preprocessing the data with K-Means clustering and Z-Score standardization improves the efficiency of model training. The LSTM recurrent neural network extracts the nonlinear features of gene expression profile data and, combined with the roughly 1,000 known landmark genes, builds a nonlinear prediction model that finally predicts the expression of roughly 21,000 target genes. The method effectively addresses a shortcoming of the LINCS program, which uses linear regression as its estimation method and therefore ignores the nonlinearity in gene expression profiles; it reduces the cost of obtaining gene expression profiles and improves the efficiency of gene expression research.
Description
Technical field
The present invention relates to the field of gene expression research and analysis, and in particular to a gene expression prediction method based on an LSTM recurrent neural network.
Background art
Combining molecular biology with machine learning and deep learning from the computer field has made gene expression profiles widely used in gene function annotation, selective crop breeding, complex disease subtyping, disease-related gene discovery, drug screening, and similar applications. However, obtaining gene expression profiles is expensive and the process is laborious, so at present only a small number of well-funded laboratories can carry out large-scale genome-wide expression profile analysis. The LINCS program currently uses linear regression (LR) as its gene expression prediction method, but for complex gene expression profile data the LR method ignores the nonlinear factors in the profiles, which reduces the accuracy of the final experimental results.
Summary of the invention
The object of the present invention is to avoid the deficiencies of the prior art by providing a gene expression prediction method based on an LSTM recurrent neural network.
This object can be achieved by the following technical measures: a gene expression prediction method based on an LSTM recurrent neural network, comprising: preprocessing gene expression profile data; feeding the preprocessed gene expression profile data into an LSTM recurrent neural network to construct a gene expression prediction model, using mean squared error as the loss function of the model and updating the weights with the standard back-propagation algorithm to train the model; training the model under different parameter combinations of the LSTM recurrent neural network and, with mean absolute error as the performance evaluation metric, measuring the model error for each combination; and using the statistical coefficient of determination (R2) to measure the goodness of fit of the model.
The preprocessing includes at least: removing duplicate data from the original gene expression profile data with an unsupervised clustering algorithm; converting the format of the deduplicated gene expression profile data and saving it in numpy format in Python; and determining the landmark-gene and target-gene annotations in the gene expression profile data and standardizing the format-converted gene expression profile data, where standardization means Z-Score standardization.
Removing duplicate data from the original gene expression profile data with an unsupervised clustering algorithm means partitioning the original gene expression profile data with the K-Means clustering algorithm, measuring the similarity between data points within each cluster by Euclidean distance to judge whether duplicates exist in the same cluster, and removing the duplicates when they do. This specifically comprises the steps:
Step 1: initialize K cluster centers;
Step 2: compute the distance from each gene expression profile sample to every cluster center and assign the sample to the cluster of the nearest center;
Step 3: compute the coordinate mean of all gene expression profile samples in each cluster and take the coordinate mean as the new cluster center;
Step 4: repeat steps 2 and 3 until the movement of the cluster centers falls below a preset error value or the number of clustering iterations reaches a preset limit;
Step 5: measure the Euclidean distance between gene expression profiles within each cluster; if the Euclidean distance between two profiles is below the set threshold, the pair of profiles is defined as duplicates and one of them is deleted.
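Steps 1-5 above can be sketched in plain Python. This is a minimal illustration under the assumption of small in-memory profiles, not the patent's implementation; the function names are ours:

```python
import math
import random

def kmeans(data, k, max_iter=100, tol=1e-6, seed=0):
    """Steps 1-4: assign each profile to its nearest center, then move each
    center to the coordinate mean of its cluster, until centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)                      # Step 1: initialize K centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:                                 # Step 2: nearest-center assignment
            j = min(range(k), key=lambda c: math.dist(x, centers[c]))
            clusters[j].append(x)
        new_centers = []
        for j, cl in enumerate(clusters):              # Step 3: coordinate means
            new_centers.append([sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                                # Step 4: convergence test
            break
    return clusters

def remove_duplicates(clusters, threshold):
    """Step 5: within each cluster, drop one profile of any pair whose
    Euclidean distance falls below the threshold."""
    kept = []
    for cl in clusters:
        unique = []
        for x in cl:
            if all(math.dist(x, u) >= threshold for u in unique):
                unique.append(x)
        kept.extend(unique)
    return kept
```

For example, with two near-identical profiles [0.0, 0.0] and [0.01, 0.0] and a threshold of 0.5, one of the pair is removed while distinct profiles survive.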
The step of determining the landmark-gene and target-gene annotations in the gene expression profile data and standardizing the format-converted gene expression profile data comprises the steps of:
extracting, according to the gene coding annotation, the expression values of 943 landmark gene probes and 15,744 target gene probes from the gene expression profile data; identifying the probes in the gene expression profile data that map to the same gene coding annotation in the RNA-Seq data and taking the mean of their expression values as the expression value in the gene expression profile data, yielding 9,520 matched target genes with a one-to-one correspondence between the gene expression profile data and the RNA-Seq data; standardizing the expression values of the 943 landmark genes and the 9,520 matched target genes of the gene expression profile data with Z-Score; and, for each expression profile of the RNA-Seq data, concatenating the expression values of the 943 landmark genes and the 9,520 target genes and standardizing the data with the Z-Score method.
The step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model, using mean squared error as the model loss function, comprises the steps of:
setting N training samples, L landmark genes, and T target genes, the training set being expressed as D = {(x_i, y_i)}, i = 1..N, where x_i ∈ R^L denotes the expression values of the landmark genes of the i-th sample and y_i ∈ R^T denotes the expression values of its target genes;
the LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby realizing forgetting or remembering; an LSTM unit has three such gates: a forget gate, an input gate, and an output gate;
the forget gate discards part of the information in the cell state of the previous time step, implemented by a sigmoid layer; during training of the LSTM recurrent neural network, the previous hidden state h_{t-1} and the current input x_t are taken as input, and for each entry of the previous cell state C_{t-1} a value in [0, 1] is produced that indicates how much information to retain, where 1 means retain completely and 0 means discard completely; this value is multiplied with C_{t-1}; the update of f_t is shown in formula (1):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, b_f and W_f are the bias and input weight of the forget gate, and f_t indicates how much previous information to retain;
the input gate records new information into the cell state; its implementation has two parts: (1) an input gate layer determines the content i_t to update; (2) a tanh layer creates a candidate value vector C̃_t to add to the cell state; the updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) (3)
next, the cell state C_t is updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
the output gate controls the output: (1) an output gate layer determines the content o_t of the new cell state to output; (2) the cell state is then passed through a tanh layer and multiplied by the output of the output gate layer to give h_t; the updates of o_t and h_t are shown in formulas (5) and (6):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
mean squared error is used as the loss function of the prediction model, as shown in formula (7):
L = (1/N) Σ_{i=1..N} (ŷ_i(t) − y_i(t))² (7)
where N is the number of test samples, ŷ_i(t) is the predicted expression value of target gene t for the i-th sample, and y_i(t) is the true expression value of target gene t for the i-th sample.
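The loss of formula (7) for one target gene can be sketched as a small plain-Python helper (an illustration, not the patent's implementation; the function name is ours):

```python
def mean_squared_error(y_true, y_pred):
    """Formula (7): mean of squared prediction errors for one target gene
    over the N test samples."""
    assert len(y_true) == len(y_pred)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```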
After the step of feeding the preprocessed gene expression profile data into the LSTM recurrent neural network to construct the gene expression prediction model with mean squared error as the model loss function, the method further comprises updating the prediction model weights according to the standard back-propagation algorithm, accelerating model training with the Adam optimization algorithm and the Dropout technique, enhancing model robustness, and reducing prediction error, comprising the steps of:
updating the prediction model weights according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, computing the partial derivative of the loss function with respect to every parameter; the gradient update follows formula (8):
W ← W − η · ∂L/∂W (8)
where η is the learning rate;
replacing traditional stochastic gradient descent with the Adam optimization algorithm in back-propagation, which updates the network weights of the prediction model iteratively from the training data and designs an independent adaptive learning rate for each parameter from first-moment and second-moment estimates of the gradient;
adding the Dropout technique to the training process of the prediction model: during training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability.
In the step of training the model under different parameter combinations of the LSTM recurrent neural network and measuring the model error of each combination with mean absolute error as the performance evaluation metric, the mean absolute error is defined as in formula (9):
MAE = (1/N) Σ_{i=1..N} |ŷ_i(t) − y_i(t)| (9)
where N is the number of test samples, ŷ_i(t) is the predicted expression value of target gene t for the i-th sample, and y_i(t) is the true expression value of target gene t for the i-th sample.
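Formula (9) can likewise be sketched in plain Python (an illustrative helper with a name of our choosing):

```python
def mean_absolute_error(y_true, y_pred):
    """Formula (9): mean of |predicted - true| for one target gene
    over the N test samples."""
    assert len(y_true) == len(y_pred)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)
```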
The coefficient of determination (R2) is a statistic commonly used in regression analysis as a standard for measuring the predictive ability of a model. The relevant formulas are shown in (10)-(13):
SSE = Σ_i (y_i − ŷ_i)² (Sum of Squares due to Error) (10)
SSR = Σ_i (ŷ_i − ȳ)² (Sum of Squares due to Regression) (11)
SST = Σ_i (y_i − ȳ)² (Total Sum of Squares) (12)
R² = 1 − SSE/SST (13)
In formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
R2 takes values in [0, 1]: a model with an R2 value of 0 cannot predict the target variable at all, while a model with an R2 value of 1 predicts the target variable well; an R2 value between 0 and 1 indicates the percentage of the target variable that can be explained by the features in the model.
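The coefficient of determination can be computed from the residual and total sums of squares; the sketch below uses the conventional definition R² = 1 − SSE/SST (our reading of the formulas, with a function name of our choosing):

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE/SST: SSE is the residual sum of squares of the
    predictions, SST the total sum of squares around the sample mean."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # error sum of squares
    sst = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return 1.0 - sse / sst
```

A perfect predictor scores 1.0, while predicting the sample mean everywhere scores 0.0, matching the interpretation given above.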
In contrast to the prior art, the gene expression prediction method based on an LSTM recurrent neural network of the present invention preprocesses the data with K-Means clustering and standardization, improving model training efficiency; it combines the roughly 1,000 known landmark genes with nonlinear features extracted by the LSTM recurrent neural network to construct a nonlinear prediction model that finally infers the expression of roughly 21,000 target genes, thereby effectively addressing the LINCS program's use of linear regression as its prediction method, which ignores the nonlinearity in gene expression profiles. The dropout strategy used in model training improves the generalization ability and robustness of the prediction model. The prediction performance of the gene expression prediction model was tested on data from different platforms and evaluated with several metrics, further demonstrating the advantages of the proposed model in performance, cross-platform generalization ability, and the ability to learn nonlinear features.
Description of the drawings
Fig. 1 is a flow diagram of the gene expression prediction method based on an LSTM recurrent neural network provided by the invention.
Fig. 2 shows the mean absolute error on the GEO data set of the prediction model under different parameter combinations of the method provided by the invention.
Fig. 3 compares the mean absolute error on the GEO data set of the method of the invention against the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE).
Fig. 4 shows the mean absolute error on the GETx/1000G data set of the prediction model under different parameter combinations of the method provided by the invention.
Fig. 5 compares the mean absolute error on the GETx/1000G data set of the method of the invention against the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE).
Fig. 6 compares the coefficient of determination (R2) of the method of the invention against the linear regression gene expression prediction model (LR).
Specific embodiments
The technical solution of the present invention is described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative labor shall fall within the scope of protection of the present invention.
Referring to Fig. 1, Fig. 1 is a flow diagram of the gene expression prediction method based on an LSTM recurrent neural network provided by the invention. The method includes the following steps:
S110: preprocess the gene expression profile data.
The preprocessing includes at least:
1. Removing duplicate data from the original gene expression profile data with an unsupervised clustering algorithm.
The original gene expression profile data is partitioned into multiple clusters with the K-Means clustering algorithm, the similarity between data points within each cluster is then measured by Euclidean distance to judge whether duplicates exist in the same cluster, and the duplicate data is finally removed. This includes the following steps:
Step 1: initialize K cluster centers;
Step 2: compute the distance from each gene expression profile sample to every cluster center and assign the sample to the cluster of the nearest center;
Step 3: compute the coordinate mean of all gene expression profile samples in each cluster and take this mean as the new cluster center;
Step 4: repeat steps 2 and 3 until the movement of the cluster centers falls below a preset error value or the number of clustering iterations reaches a preset limit;
Step 5: measure the Euclidean distance between gene expression profiles within each cluster; if the Euclidean distance between two profiles is below the set threshold, the pair of profiles is defined as duplicates and one of them is deleted.
2. Converting the format of the deduplicated gene expression profile data and saving it in numpy format in Python.
3. Determining the landmark-gene and target-gene annotations in the gene expression profile data and standardizing the format-converted gene expression profile data; the standardization used is Z-Score standardization.
According to the gene coding annotation, the expression values of 943 landmark gene probes and 15,744 target gene probes are extracted from the gene expression profile data.
For probes in the gene expression profile data that map to the same gene coding annotation in the RNA-Seq data, the mean of their expression values is taken as the gene's expression value, finally yielding 9,520 matched target genes with a one-to-one correspondence between the gene expression profile platform and the RNA-Seq platform.
The expression values of the 943 landmark genes and the 9,520 matched target genes of the gene expression profile data are standardized with Z-Score.
For each expression profile of the RNA-Seq data, the expression values of the 943 landmark genes and the 9,520 target genes are concatenated and the data is standardized with the Z-Score method:
x'_ij = (x_ij − μ_j) / σ_j
where μ_j is the mean of all sample values in column j and σ_j is the standard deviation of all sample values in column j.
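The column-wise transform just described (subtract μ_j, divide by σ_j) can be sketched in plain Python (an illustrative helper, not the patent's implementation; the function name is ours, and columns with zero variance are mapped to zero as a defensive assumption):

```python
import math

def z_score(matrix):
    """Column-wise Z-Score: for column j with mean mu_j and standard
    deviation sigma_j, each entry x_ij becomes (x_ij - mu_j) / sigma_j."""
    columns = list(zip(*matrix))
    standardized = []
    for col in columns:
        mu = sum(col) / len(col)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col))
        standardized.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*standardized)]
```

After the transform, every column (gene) has zero mean and unit variance, so landmark and target genes measured on different scales become comparable.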
S120: feed the preprocessed gene expression profile data into an LSTM recurrent neural network, construct the gene expression prediction model, use mean squared error as the loss function of the model, and update the weights according to the standard back-propagation algorithm to train the model.
Gene expression prediction can be treated as a multi-task regression problem. Assume N training samples, L landmark genes, and T target genes; the training set is expressed as D = {(x_i, y_i)}, i = 1..N, where x_i ∈ R^L denotes the expression values of the landmark genes of the i-th sample and y_i ∈ R^T denotes the expression values of its target genes. The purpose of the present invention is to build a mapping model from R^L to R^T.
A Long Short-Term Memory network (LSTM) is a special type of RNN that can learn long-term dependencies. An LSTM recurrent neural network controls the discarding or adding of information through "gates", thereby realizing forgetting or remembering. An LSTM unit has three such gates: a forget gate, an input gate, and an output gate.
The forget gate discards part of the information in the cell state of the previous time step, implemented by a sigmoid layer. Taking the previous hidden state h_{t-1} and the current input x_t as input, it produces for each entry of C_{t-1} a value in [0, 1] indicating how much information to retain (1 means retain completely, 0 means discard completely), which is then multiplied with C_{t-1}. The update of f_t is shown in formula (1):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) (1)
where x_t is the current input vector, h_{t-1} is the output vector at time t-1, and b_f and W_f are the bias and input weight of the forget gate.
The input gate decides what to store in the cell state by recording new information into it. Its implementation has two parts: (1) a sigmoid layer (the input gate layer) determines which values to update; (2) a tanh layer creates a candidate value vector C̃_t to be added to the cell state. The updates of i_t and C̃_t are shown in formulas (2) and (3):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) (3)
Next, the cell state C_t is updated as shown in formula (4):
C_t = f_t * C_{t-1} + i_t * C̃_t (4)
where f_t indicates how much previous information to retain, i_t indicates which values to update, and C̃_t denotes the new candidate values.
The output gate controls the output: (1) a sigmoid layer (the output gate layer) determines which part of the new cell state to output; (2) the cell state is then passed through a tanh layer and multiplied by the output of the sigmoid layer. The updates are shown in formulas (5) and (6):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) (5)
h_t = o_t * tanh(C_t) (6)
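Formulas (1)-(6) can be exercised end to end in a scalar sketch, where the matrix product W · [h_{t-1}, x_t] reduces to wh * h_prev + wx * x_t. This is an illustration only; real layers use weight matrices, and all parameter names here are ours:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step, formulas (1)-(6), with scalar states."""
    f_t = sigmoid(p["wf_h"] * h_prev + p["wf_x"] * x_t + p["b_f"])       # (1) forget gate
    i_t = sigmoid(p["wi_h"] * h_prev + p["wi_x"] * x_t + p["b_i"])       # (2) input gate
    c_cand = math.tanh(p["wc_h"] * h_prev + p["wc_x"] * x_t + p["b_c"])  # (3) candidate values
    c_t = f_t * c_prev + i_t * c_cand                                    # (4) new cell state
    o_t = sigmoid(p["wo_h"] * h_prev + p["wo_x"] * x_t + p["b_o"])       # (5) output gate
    h_t = o_t * math.tanh(c_t)                                           # (6) new hidden state
    return h_t, c_t
```

With all weights at zero, every gate outputs sigmoid(0) = 0.5, so the cell state is simply halved at each step, which makes the gating arithmetic easy to verify by hand.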
Mean squared error is used as the loss function of the prediction model, as shown in formula (7):
L = (1/N) Σ_{i=1..N} (ŷ_i(t) − y_i(t))² (7)
where N is the number of test samples, ŷ_i(t) is the predicted expression value of target gene t for the i-th sample, and y_i(t) is the true expression value of target gene t for the i-th sample.
The prediction model weights are updated according to the standard back-propagation algorithm: the weight parameters of the prediction model are updated iteratively by gradient descent, the key point being to compute the partial derivative of the loss function with respect to every parameter. The gradient update follows formula (8):
W ← W − η · ∂L/∂W (8)
where η is the learning rate.
Traditional stochastic gradient descent is replaced by the Adam optimization algorithm in back-propagation, which updates the network weights of the prediction model iteratively from the training data and designs an independent adaptive learning rate for each parameter from first-moment and second-moment estimates of the gradient.
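The adaptive per-parameter step that Adam derives from these moment estimates can be sketched for a single weight. This is the textbook Adam update shown for illustration; the hyperparameter defaults follow common convention and are not specified by the patent:

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight: bias-corrected first and second
    moment estimates give the parameter its own adaptive step size."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction, step count t >= 1
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w_new, m, v
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude 1, so the effective step size is approximately lr regardless of the raw gradient scale.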
The Dropout technique is added to the training process of the prediction model. During training of the deep learning network, Dropout temporarily drops neural network units from the network with a certain probability. Because the hidden nodes present at each weight update with an input sample are chosen at random, no two hidden nodes are guaranteed to appear together every time, so a weight update no longer depends on the joint action of a fixed set of hidden nodes; this prevents features that are effective only in the presence of other specific features, and improves the generalization ability of the model.
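The random dropping of hidden units described above can be sketched as an inverted-dropout mask (an illustrative sketch; the patent does not fix the drop probability or the scaling scheme, so both are our assumptions):

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: during training each unit is zeroed with probability
    p_drop and survivors are scaled by 1/(1 - p_drop), keeping the expected
    activation unchanged; at prediction time nothing is dropped."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With p_drop = 0.5 roughly half the units are zeroed on each call, and the surviving activations are doubled so the layer's expected output is preserved.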
S140: train the model under different parameter combinations of the LSTM recurrent neural network and, with mean absolute error as the prediction model performance evaluation metric, measure the model error for each combination.
The mean absolute error is defined as in formula (9):
MAE = (1/N) Σ_{i=1..N} |ŷ_i(t) − y_i(t)| (9)
where N is the number of test samples, ŷ_i(t) is the predicted expression value of target gene t for the i-th sample, and y_i(t) is the true expression value of target gene t for the i-th sample. The prediction model errors measured under the different parameter combinations are shown in Figs. 2-6. Fig. 2 shows the mean absolute error on the GEO data set of the prediction model of the present invention under different parameter combinations. Fig. 3 compares the mean absolute error on the GEO data set of the method of the invention against the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE). Fig. 4 shows the mean absolute error on the GETx/1000G data set of the prediction model of the present invention under different parameter combinations. Fig. 5 compares the mean absolute error on the GETx/1000G data set of the method of the invention against the existing linear regression gene expression prediction model (LR) and the k-nearest-neighbor gene expression prediction model (KNN-GE). Fig. 6 compares the coefficient of determination (R2) of the present invention against the linear regression gene expression prediction model (LR).
S150: use the statistical coefficient of determination (R2) to measure the goodness of fit of the model.
The coefficient of determination (R2) is a statistic commonly used in regression analysis as a standard for measuring the predictive ability of a model. The relevant formulas are shown in (10)-(13):
SSE = Σ_i (y_i − ŷ_i)² (Sum of Squares due to Error) (10)
SSR = Σ_i (ŷ_i − ȳ)² (Sum of Squares due to Regression) (11)
SST = Σ_i (y_i − ȳ)² (Total Sum of Squares) (12)
R² = 1 − SSE/SST (13)
In formulas (10)-(13), ŷ_i denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and y_i denotes the i-th true gene expression profile value.
R2 takes values in [0, 1]: a model with an R2 value of 0 cannot predict the target variable at all, while a model with an R2 value of 1 predicts the target variable very well; an R2 value between 0 and 1 indicates what percentage of the target variable can be explained by the features in the model. The comparison of the coefficient of determination between the linear regression method and the proposed gene expression prediction model based on an LSTM recurrent neural network is shown in the accompanying drawings.
In contrast to the prior art, the gene expression prediction method based on an LSTM recurrent neural network of the present invention preprocesses the data with K-Means clustering and standardization, improving model training efficiency; it combines the roughly 1,000 known landmark genes with nonlinear features extracted by the LSTM recurrent neural network to construct a nonlinear prediction model that finally infers the expression of roughly 21,000 target genes, thereby effectively addressing the LINCS program's use of linear regression as its prediction method, which ignores the nonlinearity in gene expression profiles. The dropout strategy used in model training improves the generalization ability and robustness of the prediction model. The prediction performance of the gene expression prediction model was tested on data from different platforms and evaluated with several metrics, further demonstrating the advantages of the proposed model in performance, cross-platform generalization ability, and the ability to learn nonlinear features.
The above are only embodiments of the present invention and are not intended to limit its scope; all equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, are likewise included within the scope of protection of the present invention.
Claims (9)
1. A gene expression prediction method based on an LSTM, characterized by comprising:
preprocessing gene expression profile data;
feeding the preprocessed gene expression profile data into an LSTM recurrent neural network to construct a gene expression prediction model, using mean squared error as the loss function of the model and updating the weights according to the standard back-propagation algorithm to train the model;
training the model under different parameter combinations of the LSTM recurrent neural network and, with mean absolute error as the prediction model performance evaluation metric, measuring the model error for each combination;
using the statistical coefficient of determination (R2) to measure the goodness of fit of the model.
2. The gene expression prediction method based on an LSTM according to claim 1, characterized in that the preprocessing includes at least:
removing duplicate data from the original gene expression profile data with an unsupervised clustering algorithm;
converting the format of the deduplicated gene expression profile data and saving it in numpy format in Python;
determining the landmark-gene and target-gene annotations in the gene expression profile data and standardizing the format-converted gene expression profile data, where the standardization used is Z-Score standardization.
3. The gene expression prediction method based on an LSTM according to claim 2, characterized in that removing duplicate data from the original gene expression profile data with an unsupervised clustering algorithm means partitioning the original gene expression profile data with the K-Means clustering algorithm, measuring the similarity between data points within each cluster by Euclidean distance to judge whether duplicates exist in the same cluster, and removing the duplicates when they do; specifically comprising the steps:
Step 1: initialize K cluster centers;
Step 2: compute the distance from each gene expression profile sample to every cluster center and assign the sample to the cluster of the nearest center;
Step 3: compute the coordinate mean of all gene expression profile samples in each cluster and take the coordinate mean as the new cluster center;
Step 4: repeat steps 2 and 3 until the movement of the cluster centers falls below a preset error value or the number of clustering iterations reaches a preset limit;
Step 5: measure the Euclidean distance between gene expression profiles within each cluster; if the Euclidean distance between two profiles is below the set threshold, the pair of profiles is defined as duplicates and one of them is deleted.
4. The gene expression prediction method based on an LSTM according to claim 2, characterized in that the step of determining the landmark-gene and target-gene annotations in the gene expression profile data and standardizing the format-converted gene expression profile data comprises the steps of:
extracting, according to the gene coding annotation, the expression values of 943 landmark gene probes and 15,744 target gene probes from the gene expression profile data;
identifying the probes in the gene expression profile data that map to the same gene coding annotation in the RNA-Seq data and taking the mean of their expression values as the expression value in the gene expression profile data, yielding 9,520 matched target genes with a one-to-one correspondence between the gene expression profile data and the RNA-Seq data;
standardizing the expression values of the 943 landmark genes and the 9,520 matched target genes of the gene expression profile data with Z-Score;
for each expression profile of the RNA-Seq data, concatenating the expression values of the 943 landmark genes and the 9,520 target genes and standardizing the data with the Z-Score method.
5. the gene expression prediction technique according to claim 1 based on LSTM, which is characterized in that by pretreated base
Gene expression prediction model is constructed because expression modal data introduces LSTM Recognition with Recurrent Neural Network, mean square error is used to lose as model
In the step of function, comprising steps of
N number of training sample, L landmark genes, T target genes are set, training set is expressed asWherein, xi∈RLIndicate the expression value of i-th of landmark genes, yi∈RTIndicate i-th of target
The expression value of genes;
The LSTM recurrent neural network controls the discarding or addition of information through "gates", realizing the functions of forgetting and memory; an LSTM unit has three such gates: the forget gate, the input gate and the output gate;
The forget gate discards part of the information in the cell state of the previous moment and is realized by a sigmoid layer; during training of the LSTM recurrent neural network, with ht-1 of the previous moment and xt of the current moment as input, the sigmoid function generates a value in [0, 1] for each entry of the previous cell state Ct-1, indicating how much information is retained, where 1 means fully retained and 0 means completely discarded; the result is then multiplied element-wise with Ct-1; ft is updated as shown in formula (1):
ft = σ(Wf·[ht-1, xt] + bf)   (1)
where xt is the current input vector and ht-1 is the output vector at time t-1; bf and Wf are the bias and input weight of the forget gate respectively, and ft indicates how much of the previous information is retained;
The input gate records new information into the cell state; its implementation includes two parts: 1. determining the content it to be updated through the input gate layer; 2. creating a candidate value vector C̃t through a tanh layer and adding it to the cell state; it and C̃t are updated as shown in formulas (2) and (3):
it = σ(Wi·[ht-1, xt] + bi)   (2)
C̃t = tanh(WC·[ht-1, xt] + bC)   (3)
The cell state Ct is then updated as shown in formula (4):
Ct = ft * Ct-1 + it * C̃t   (4)
The output gate controls the output and is realized as follows: 1. determining the content ot of the new cell state to be output through the output gate layer; 2. passing the cell state through a tanh layer and multiplying the result by the output of the output gate layer to obtain ht; ot and ht are updated as shown in formulas (5) and (6):
ot = σ(Wo·[ht-1, xt] + bo)   (5)
ht = ot * tanh(Ct)   (6)
The mean square error is used as the loss function of the prediction model, as shown in formula (7):
MSE = (1/(N·T)) Σi=1..N Σt=1..T (ŷi(t) − yi(t))²   (7)
where N is the number of test samples, ŷi(t) denotes the predicted expression value of target gene t for the i-th sample, and yi(t) denotes the true expression value of target gene t for the i-th sample.
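Formulas (1)-(6) can be traced with a minimal scalar sketch. This is illustrative only, not the patent's implementation: a real LSTM applies weight matrices to vector-valued ht-1 and xt, whereas here all gates use scalar weights so each equation can be checked by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One scalar LSTM step following formulas (1)-(6).
    W[k] = (w_h, w_x) are the weights applied to [h_{t-1}, x_t] for gate k."""
    f_t = sigmoid(W['f'][0] * h_prev + W['f'][1] * x_t + b['f'])      # (1) forget gate
    i_t = sigmoid(W['i'][0] * h_prev + W['i'][1] * x_t + b['i'])      # (2) input gate
    c_hat = math.tanh(W['c'][0] * h_prev + W['c'][1] * x_t + b['c'])  # (3) candidate value
    c_t = f_t * c_prev + i_t * c_hat                                  # (4) cell state update
    o_t = sigmoid(W['o'][0] * h_prev + W['o'][1] * x_t + b['o'])      # (5) output gate
    h_t = o_t * math.tanh(c_t)                                        # (6) hidden output
    return h_t, c_t

# Arbitrary illustrative weights and a single input step
W = {k: (0.5, 0.5) for k in 'fico'}
b = {k: 0.0 for k in 'fico'}
h, c = lstm_step(1.0, 0.0, 0.0, W, b)
```

Because ht is an output-gated tanh of the cell state, it is always bounded in (-1, 1) regardless of how large Ct grows across time steps.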
6. The gene expression prediction method based on LSTM according to claim 1, characterized in that, after the step of introducing the preprocessed gene expression profile data into an LSTM recurrent neural network to construct the gene expression prediction model and using mean square error as the model loss function, further comprising the steps of updating the prediction model weights according to the standard back-propagation algorithm, accelerating model training with the Adam optimization algorithm and the Dropout technique, enhancing model robustness, and reducing the prediction error of the model, comprising the steps of:
Updating the prediction model weights according to the standard back-propagation algorithm: iteratively updating the weight parameters of the prediction model by gradient descent, computing the partial derivatives of all parameters with respect to the loss function, and updating the weights according to formula (8):
W ← W − η·(∂MSE/∂W)   (8)
where η is the learning rate;
Replacing traditional stochastic gradient descent with the Adam optimization algorithm in the back-propagation algorithm, iteratively updating the network weights of the prediction model based on the training data, and designing independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient;
Adding the Dropout technique to the training process of the prediction model; during the training of a deep learning network, Dropout temporarily discards neural network units from the network with a certain probability.
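The Adam moment-estimation scheme the claim refers to can be sketched for a single parameter. This is a hedged, illustrative implementation of the standard Adam update (bias-corrected first and second moments), not code from the patent; the toy loss (w − 3)² and all constants are chosen only for demonstration.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w at iteration t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad          # first moment estimate of the gradient
    v = b2 * v + (1 - b2) * grad ** 2     # second moment estimate of the gradient
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)             # bias correction
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Minimize the toy loss L(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
```

Because the step size is scaled by the per-parameter second moment, parameters with large, noisy gradients take smaller effective steps, which is what gives each weight its own adaptive learning rate.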
7. The gene expression prediction method based on LSTM according to claim 1, characterized in that, in the step of training the model under different parameter combinations of the LSTM recurrent neural network, using mean absolute error as the performance evaluation index of the prediction model, and testing the model error under each parameter combination,
the mean absolute error is defined as shown in formula (9):
MAE = (1/(N·T)) Σi=1..N Σt=1..T |ŷi(t) − yi(t)|   (9)
where N is the number of test samples, ŷi(t) denotes the predicted expression value of target gene t for the i-th sample, and yi(t) denotes the true expression value of target gene t for the i-th sample.
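A minimal sketch of the mean absolute error in formula (9), averaging over all samples and all target genes; the tiny `y_true`/`y_pred` matrices are illustrative values, not data from the patent.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE over all samples (rows) and target genes (columns), per formula (9)."""
    n = sum(len(row) for row in y_true)
    total = sum(abs(t - p)
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return total / n

y_true = [[1.0, 2.0], [3.0, 4.0]]
y_pred = [[1.5, 2.0], [2.0, 4.5]]
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 1.0 + 0.5) / 4 = 0.5
```

Unlike the squared error in formula (7), MAE weights all deviations linearly, so a few badly predicted genes do not dominate the evaluation score.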
8. The gene expression prediction method based on LSTM according to claim 1, characterized in that the coefficient of determination (R²) is a statistic commonly used in regression analysis and is often treated as a standard for measuring the predictive ability of a model; the relevant formulas are shown in (10)-(13):
R² = 1 − SSE/SST   (10)
Sum of Squares due to Error (SSE): SSE = Σi=1..N (yi − ŷi)²   (11)
Sum of Squares due to Regression (SSR): SSR = Σi=1..N (ŷi − ȳ)²   (12)
Total Sum of Squares (SST): SST = Σi=1..N (yi − ȳ)²   (13)
In formulas (10)-(13), ŷi denotes the i-th predicted gene expression profile value, ȳ denotes the sample mean, and yi denotes the i-th true gene expression profile value.
9. The gene expression prediction method based on LSTM according to claim 8, characterized in that the value of R² lies in [0, 1]; a model with an R² value of 0 cannot predict the target variable at all, while a model with an R² value of 1 predicts the target variable perfectly; an R² value between 0 and 1 indicates the percentage of the variation in the target variable that can be explained by the features in the model.
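The R² computation of formulas (10)-(13) can be sketched as follows; the small `y_true`/`y_pred` vectors are illustrative values, not results from the patent.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination per formulas (10)-(13): R^2 = 1 - SSE/SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # (11) error sum of squares
    ssr = sum((p - mean_y) ** 2 for p in y_pred)             # (12) regression sum of squares
    sst = sum((y - mean_y) ** 2 for y in y_true)             # (13) total sum of squares
    # SSE + SSR = SST holds exactly only for least-squares linear fits;
    # for a general predictor, 1 - SSE/SST is the usual definition (10).
    return 1.0 - sse / sst

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_true, y_pred)
```

Here SSE = 0.10 and SST = 5.0, so R² = 0.98: the predictions explain 98% of the variation in the target values.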
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810810239.3A CN109063416B (en) | 2018-07-23 | 2018-07-23 | Gene expression prediction technique based on LSTM Recognition with Recurrent Neural Network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109063416A true CN109063416A (en) | 2018-12-21 |
CN109063416B CN109063416B (en) | 2019-08-27 |
Family
ID=64834852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810810239.3A Active CN109063416B (en) | 2018-07-23 | 2018-07-23 | Gene expression prediction technique based on LSTM Recognition with Recurrent Neural Network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063416B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978228A (en) * | 2019-01-31 | 2019-07-05 | 中南大学 | A kind of PM2.5 concentration prediction method, apparatus and medium |
CN110070145A (en) * | 2019-04-30 | 2019-07-30 | 天津开发区精诺瀚海数据科技有限公司 | LSTM wheel hub single-item energy consumption prediction based on increment cluster |
CN110111848A (en) * | 2019-05-08 | 2019-08-09 | 南京鼓楼医院 | A kind of human cyclin expressing gene recognition methods based on RNN-CNN neural network fusion algorithm |
CN110502806A (en) * | 2019-07-31 | 2019-11-26 | 电子科技大学 | A kind of wireless frequency spectrum degree prediction technique based on LSTM network |
CN111611835A (en) * | 2019-12-23 | 2020-09-01 | 珠海大横琴科技发展有限公司 | Ship detection method and device |
CN111785326A (en) * | 2020-06-28 | 2020-10-16 | 西安电子科技大学 | Method for predicting gene expression profile after drug action based on generation of confrontation network |
CN113178234A (en) * | 2021-02-23 | 2021-07-27 | 北京亿药科技有限公司 | Compound function prediction method based on neural network and connection graph algorithm |
CN114418071A (en) * | 2022-01-24 | 2022-04-29 | 中国光大银行股份有限公司 | Cyclic neural network training method |
CN116705150A (en) * | 2023-06-05 | 2023-09-05 | 国家超级计算天津中心 | Method, device, equipment and medium for determining gene expression efficiency |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7003403B1 (en) * | 2000-06-15 | 2006-02-21 | The United States Of America As Represented By The Department Of Health And Human Services | Quantifying gene relatedness via nonlinear prediction of gene |
CN107025386A (en) * | 2017-03-22 | 2017-08-08 | 杭州电子科技大学 | A kind of method that gene association analysis is carried out based on deep learning algorithm |
US20180144261A1 (en) * | 2016-11-18 | 2018-05-24 | NantOmics, LLC. | Methods and systems for predicting dna accessibility in the pan-cancer genome |
CN108170529A (en) * | 2017-12-26 | 2018-06-15 | 北京工业大学 | A kind of cloud data center load predicting method based on shot and long term memory network |
Non-Patent Citations (5)
Title |
---|
BYUNGHAN LEE ET AL.: "DNA-Level Splice Junction Prediction using Deep Recurrent Neural Networks", Semantic Scholar * |
李洪顺 (Li Hongshun): "Research on a deep learning model for predicting nucleotide-binding proteins using only sequence information", China Masters' Theses Full-text Database * |
程国建 (Cheng Guojian): "Research on the application of neural networks in gene sequence prediction", Wanfang Data * |
薛燕娜 (Xue Yanna): "Application of machine learning algorithms in protein structure prediction", CNKI * |
黄易初 (Huang Yichu): "Research on protein domain boundary prediction based on deep learning", Wanfang Data * |
Also Published As
Publication number | Publication date |
---|---|
CN109063416B (en) | 2019-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109063416B (en) | Gene expression prediction technique based on LSTM Recognition with Recurrent Neural Network | |
CN112101630A (en) | Multi-target optimization method for injection molding process parameters of thin-wall plastic part | |
CN107992976B (en) | Hot topic early development trend prediction system and prediction method | |
CN106453293A (en) | Network security situation prediction method based on improved BPNN (back propagation neural network) | |
CN109523021A (en) | A kind of dynamic network Structure Prediction Methods based on long memory network in short-term | |
CN109840595B (en) | Knowledge tracking method based on group learning behavior characteristics | |
CN109543731A (en) | A kind of three preferred Semi-Supervised Regression algorithms under self-training frame | |
CN110990718B (en) | Social network model building module of company image lifting system | |
CN111680786B (en) | Time sequence prediction method based on improved weight gating unit | |
CN114547974A (en) | Dynamic soft measurement modeling method based on input variable selection and LSTM neural network | |
CN110598902A (en) | Water quality prediction method based on combination of support vector machine and KNN | |
CN116721537A (en) | Urban short-time traffic flow prediction method based on GCN-IPSO-LSTM combination model | |
CN115188412A (en) | Drug prediction algorithm based on Transformer and graph neural network | |
CN116451556A (en) | Construction method of concrete dam deformation observed quantity statistical model | |
CN113103535A (en) | GA-ELM-GA-based injection molding part mold parameter optimization method | |
CN113052373A (en) | Monthly runoff change trend prediction method based on improved ELM model | |
Barman et al. | A neuro-evolution approach to infer a Boolean network from time-series gene expressions | |
Haixiang et al. | Optimizing reservoir features in oil exploration management based on fusion of soft computing | |
CN104899507A (en) | Detecting method for abnormal intrusion of large high-dimensional data of network | |
Li et al. | Solubility prediction of gases in polymers using fuzzy neural network based on particle swarm optimization algorithm and clustering method | |
Khajeh et al. | Diffusion coefficient prediction of acids in water at infinite dilution by QSPR method | |
Li et al. | Data cleaning method for the process of acid production with flue gas based on improved random forest | |
Liu et al. | A quantitative study of the effect of missing data in classifiers | |
CN117350146A (en) | GA-BP neural network-based drainage pipe network health evaluation method | |
CN116542382A (en) | Sewage treatment dissolved oxygen concentration prediction method based on mixed optimization algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||