CN108764019A - Video event detection method based on multi-source deep learning - Google Patents

Video event detection method based on multi-source deep learning

Info

Publication number
CN108764019A
CN108764019A
Authority
CN
China
Prior art keywords
video
gate
output
indicate
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810290777.4A
Other languages
Chinese (zh)
Inventor
刘安安
邵壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201810290777.4A priority Critical patent/CN108764019A/en
Publication of CN108764019A publication Critical patent/CN108764019A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video event detection method based on multi-source deep learning, which includes the following steps: convert each frame of the video to grayscale and compute inter-frame differences to obtain a three-dimensional array; feed the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, obtaining image features with the C3D algorithm as a 4096-dimensional feature vector; take the RGB image of the first frame of the original video as input, extract picture features with a CNN network structure, and use the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector; concatenate the feature vectors to obtain a 6144-dimensional vector, and after all training videos have been processed and converted into vectors, perform dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data; start the model training stage once each video to be processed has been mapped to a 256-dimensional vector; and test unknown samples with the trained model. The present invention greatly improves the efficiency and accuracy of video surveillance information processing.

Description

Video event detection method based on multi-source deep learning
Technical field
The present invention relates to the field of video event detection, and in particular to a video event detection method based on multi-source deep learning.
Background technology
Video surveillance is finding increasingly wide application in fields such as public safety and commerce. Because it is intuitive, accurate, timely, and rich in information, it is used in many settings. In recent years, with the rapid development of computer, network, image processing, and transmission technologies, video surveillance technology has also made significant progress. However, current video surveillance processing still relies on manual work, which is costly, and its processing speed and efficiency are low.
A variety of methods have been proposed to detect video events. First, markerless vision-based human motion analysis can provide a cheap, unobtrusive way to estimate human pose, and is therefore widely used in motion analysis. Fujiyoshi et al. (reference [1]) proposed a "star" skeletonization procedure to analyze target motion. Second, action or collective-activity recognition can reveal the existence of group events in video.
A video action recognition method based on shallow high-dimensional encoding of early local spatio-temporal features has also been proposed in the prior art. Sparse spatio-temporal interest points can be described with local spatio-temporal features, including histograms of oriented gradients (HOG) and histograms of optical flow (HOF) (reference [2]). These features are then encoded into bag-of-features (BoF) descriptors (reference [3]), and a support vector machine is used for the classification task. In addition, a large body of related work in the last decade has been devoted to the study of group activities in video. The vast majority of earlier work used hand-crafted features to describe individual spatio-temporal regions (reference [4]). Lan et al. (reference [5]) proposed a model that represents the hierarchical and interactive relationships from lower-level person information to higher-level group information and supports adaptive learning of latent structure.
Recently, multi-task learning methods have been applied to human social activity recognition. Liu et al. (reference [6]) proposed hierarchical clustering multi-task learning for grouping and recognizing human behavior. Video summarization is another approach to visual understanding and presentation; several methods can generate a summary from a long video. A representative approach is to generate a synopsis from an object and its activities appearing in different time periods of the video. Pritch et al. (reference [7]) also proposed a new method that clusters similar events or activities into short, coherent video synopses. Another class of methods generates text-based summaries. Chu et al. (reference [8]) proposed a multimedia analysis framework that processes video and text simultaneously, jointly building relationships between entities via scene graphs to understand events (reference [9]).
Most current methods need to handle multiple challenging visual analysis tasks. Lee (reference [10]) proposed an effective Gaussian mixture learning method for video background removal. Dai et al. (reference [11]) proposed a robust R-FCN object detection network. Although existing methods have shown effectiveness on some aspects of the problem, the automatic understanding of surveillance video still faces many challenges and limitations. The main challenges come from two sources: the complexity of the data and the limitations of the processing methods. For the data itself, the main challenges are low resolution, large data volume, complex event sets and scenes, and occlusion in the data sources.
On the method side, the main limitations are as follows:
1) Many methods depend on foreground-background segmentation techniques; however, these techniques can cause errors to accumulate.
2) Many methods depend on detection and tracking; however, the robustness of detection and tracking is relatively low across different videos and moving objects, and these shortcomings reduce the efficiency of temporal analysis.
3) When the data volume increases, the computational load rises sharply.
4) Most event detection problems in real life are multi-label problems; especially in surveillance video, multiple events can occur simultaneously. However, action recognition and group activity recognition are both single-label event detection methods, so both lose events that occur at the same time.
5) Most methods use only one feature extraction method, so the features are relatively homogeneous, and the representation of the underlying data is prone to inaccuracy.
Summary of the invention
The present invention provides a video event detection method based on multi-source deep learning. The invention processes video information on the basis of multi-source information, and the original model can be optimized with new data. Because C3D features are fused with traditional CNN features, the advantages of C3D features can be added while the original features remain mature and easy to extract, so the video is characterized more accurately and the efficiency and accuracy of video surveillance information processing can be greatly improved. The method is described below:
A video event detection method based on multi-source deep learning, the method comprising the following steps:
converting each frame of the video to grayscale and subtracting adjacent video frames to obtain a three-dimensional array; feeding the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtaining image features with the C3D algorithm as a 4096-dimensional feature vector;
taking the RGB image of the first frame of the original video as input, extracting picture features with a CNN network structure, and using the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector;
concatenating the feature vectors to obtain a 6144-dimensional vector; after all training videos have been processed and converted into vectors, performing dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
starting the model training stage after each video to be processed has been mapped to a 256-dimensional vector; and testing unknown samples with the trained model.
Converting each frame of the video to grayscale and computing inter-frame differences to obtain the three-dimensional array is specifically:
For a given image sequence x = {x_1, x_2, …, x_n} with corresponding label set y = {y_1, y_2, …, y_m}, first convert each frame of the video to grayscale; then subtract the first frame from the second frame, the second frame from the third frame, and so on, thereby converting the original video data from one three-dimensional array into another three-dimensional array.
The C3D network is specifically:
All frames of the video image sequence x are divided into groups of 8 frames, and every 8 frames the fc7-layer data of a C3D network is output as a feature extraction result, yielding k 4096-dimensional feature vectors; finally, the fc7-layer data in the network structure is output as the feature, giving 4096-dimensional feature vectors.
The CNN network structure is specifically: a network structure in which convolutional layers, fully connected layers and pooling layers are interconnected, as shown in Fig. 5.
The model training stage specifically includes:
including the labels of the actions to be recognized and the <BOS> tag in the candidate words;
learning by maximizing the class index function and predicting output weights by the maximum likelihood method;
computing the loss of the softmax layer and propagating it in the LSTM.
The method further includes:
In the decoding stage, the probability distribution of a hidden state is built by the following formula:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1})$$

where p(y_t | h_{n+t-1}, y_{t-1}) is obtained from a softmax function over all words, h_t is the hidden state at step t, and the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
Testing unknown samples with the trained model is specifically:
starting with the <BOS> tag, all hidden state values h_t of the obtained video image sequence x are input into the second LSTM network system, and the following variables are computed separately:
$$
\begin{aligned}
f1_t &= \sigma(W_{1xf} z_t + W_{1zf} z_{t-1} + b_{1f}) \\
i1_t &= \sigma(W_{1xi} h_t + W_{1zi} z_{t-1} + b_{1i}) \\
g1_t &= \tanh(W_{1xg} h_t + W_{1zg} z_{t-1} + b_{1g}) \\
c1_t &= f1_t \odot c1_{t-1} + i1_t \odot g1_t \\
o1_t &= \sigma(W_{1of} h_t + W_{1zo} z_{t-1} + b_{1o}) \\
z_t &= o1_t \odot \tanh(c1_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{1xf} and W_{1zf} are the weight matrices applied to the input and to the previous output for the forget gate; W_{1xi} and W_{1zi} are the weight matrices applied to the input and to the previous output for the input gate; W_{1xg} and W_{1zg} are the weight matrices applied to the input and to the previous output when computing the new candidate value c1_t; W_{1of} and W_{1zo} are the weight matrices applied to the input and to the previous output for the output gate; b_{1f}, b_{1i}, b_{1g} and b_{1o} are the corresponding biases; f1_t is the forget-gate output, i1_t the input-gate output, and o1_t the output-gate output; c1_t is the cell state value, z_t the output value, and g1_t the new candidate value for c1_t.
The advantageous effects of the technical solution provided by the invention are:
1. Multiple events occurring in video surveillance are reported by this method, avoiding the object detection and tracking process;
2. The image information and motion information, spatial information and temporal information contained in the video are processed separately in different deep networks, achieving multi-source data fusion;
3. This method designs a completely new two-stream network architecture for video surveillance processing, improving processing efficiency and processing performance.
Description of the drawings
Fig. 1 is the flow chart of the video event detection method based on multi-source deep learning;
Fig. 2 is the bilateral LSTM network structure;
Fig. 3 is the flow diagram of C3D convolution;
Fig. 4 is the schematic network structure of C3D;
Fig. 5 is the schematic network structure of the CNN used to extract picture features.
Specific embodiments
To make the object, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Embodiment 1
An embodiment of the present invention provides a video event detection method based on multi-source deep learning. Referring to Fig. 1, the detection method includes the following steps:
101: Convert each frame of the video to grayscale and compute inter-frame differences to obtain a three-dimensional array; feed the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtain image features with the C3D algorithm as a 4096-dimensional feature vector;
102: Take the RGB image of the first frame of the original video as input, extract picture features with a CNN network structure, and use the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector;
103: Concatenate the feature vectors to obtain a 6144-dimensional vector; after all training videos have been processed and converted into vectors, perform dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
104: Start the model training stage after each video to be processed has been mapped to a 256-dimensional vector; test unknown samples with the trained model.
In summary, the embodiment of the present invention processes video information on the basis of multi-source information, the original model can be optimized with new data, and the efficiency and accuracy of video surveillance information processing are greatly improved.
Embodiment 2
The scheme of Embodiment 1 is further described below with reference to specific calculation formulas and examples, as detailed in the following description:
C3D is a piece of work from Facebook; the network is built from 3D convolution and 3D pooling. Through 3D convolution, C3D can process video directly (or rather, volumes of video frames). The greatest advantage of C3D is its speed: 314 fps, and this was measured on a graphics card from two years earlier; with an Nvidia 1080 card it can reach 600 fps or more. The efficiency of C3D is therefore significantly higher than that of other methods.
201: First, extract video features with the C3D algorithm;
Convolutional neural networks (CNNs) have been widely applied in computer vision in recent years, for tasks including classification, detection and segmentation. These tasks are typically performed on images, using two-dimensional convolution (i.e., the convolution kernel is two-dimensional). For problems based on video analysis, however, 2D convolution cannot capture temporal information well, so 3D convolution was proposed.
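For intuition, the shape difference between 2D and 3D convolution can be seen in a minimal PyTorch sketch (PyTorch is used here only for illustration; the patent does not prescribe a framework):

```python
import torch
import torch.nn as nn

# 2D convolution sees a single image: (batch, channels, height, width).
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
image = torch.randn(1, 3, 112, 112)
print(conv2d(image).shape)   # torch.Size([1, 64, 112, 112])

# 3D convolution sees a clip of frames: (batch, channels, frames, height, width),
# so the kernel also slides along the temporal axis.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 8, 112, 112)
print(conv3d(clip).shape)    # torch.Size([1, 64, 8, 112, 112])
```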
C3D is an improvement on the original convolutional neural network algorithm. Each convolutional layer of the original CNN performs convolution with a two-dimensional kernel; in the C3D algorithm this is changed to a three-dimensional kernel, which makes it possible to process multi-frame data. The network structure used is shown in Figure 3. The input video is divided into groups of 8 frames, which are then fed into the C3D algorithm to obtain multiple groups of 4096-dimensional vectors.
202: For a given image sequence x = {x_1, x_2, …, x_n} with corresponding label set y = {y_1, y_2, …, y_m}, first convert each frame of the video to grayscale; then subtract the first frame from the second frame, the second frame from the third frame, and so on; in this way, the original video data is converted from one three-dimensional array into another three-dimensional array;
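A minimal sketch of this preprocessing step, assuming OpenCV and NumPy are available; the function name is illustrative:

```python
import cv2
import numpy as np

def frame_difference_volume(video_path):
    """Grayscale every frame, then stack the inter-frame differences
    into a three-dimensional array of shape (n_frames - 1, H, W)."""
    cap = cv2.VideoCapture(video_path)
    grays = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grays.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    grays = np.stack(grays).astype(np.int16)  # int16 avoids uint8 wrap-around
    # Second frame minus first, third minus second, and so on.
    return grays[1:] - grays[:-1]
```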
203: Take the transformed three-dimensional array obtained in the above step as input and feed it into the C3D network shown in Figure 3; feature extraction is performed with the Sport1M model trained in advance for 1,900,000 iterations on another data set, and image features are obtained with the C3D algorithm;
Referring to Fig. 3, the C3D network is specifically: for each input video image sequence x = {x_1, x_2, …, x_t, …, x_n}, where x_1, x_2, …, x_t, …, x_n correspond respectively to the 1st frame, the 2nd frame, …, the t-th frame, …, the n-th frame of the sequence, all frames of the video image sequence x are divided into groups of 8 frames, and every 8 frames the fc7-layer data of a C3D network is output as a feature extraction result, yielding k 4096-dimensional feature vectors, where k is n ÷ 8 rounded down. Finally, the fc7-layer data in the network structure is output as the feature, giving 4096-dimensional feature vectors.
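A sketch of the 8-frame grouping described above; c3d_fc7 stands for a wrapper around the pre-trained Sport1M C3D model that returns the 4096-dimensional fc7 activation of one clip (a hypothetical interface, since the patent does not fix an implementation):

```python
import numpy as np

CLIP_LEN = 8  # frames per clip, as specified in the text

def extract_c3d_features(volume, c3d_fc7):
    """Split the frame volume into consecutive 8-frame clips and keep
    the 4096-d fc7 activation of each clip; returns an array of shape
    (k, 4096), where k = floor(n_frames / 8)."""
    k = volume.shape[0] // CLIP_LEN
    feats = [c3d_fc7(volume[i * CLIP_LEN:(i + 1) * CLIP_LEN])
             for i in range(k)]
    return np.stack(feats)
```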
204: Take the RGB image of the first frame of the original video as input, extract picture features with the CNN network structure shown in Fig. 4, and use the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector;
Referring to Fig. 4, the CNN network structure is specifically: a network structure in which convolutional layers, fully connected layers and pooling layers are interconnected.
205: Then concatenate the feature vectors extracted in the previous two steps to obtain a 6144-dimensional vector; after all training videos have been processed in this way and fully converted into vectors, perform dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
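A sketch of this fusion-and-reduction step; PCA is an assumption here, since the patent does not name the dimensionality-reduction method:

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_and_reduce(c3d_feats, cnn_feats, n_components=256):
    """Concatenate the (N, 4096) C3D features and (N, 2048) CNN features
    of N training videos into an (N, 6144) matrix, then reduce each row
    to 256 dimensions (PCA requires N >= 256 training videos here)."""
    fused = np.concatenate([c3d_feats, cnn_feats], axis=1)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(fused), pca  # (N, 256) data and the fitted reducer
```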
206: After each video to be processed has been mapped to a 256-dimensional vector by step 205, the model training stage begins;
In the decoding training stage, suppose there are n actions to be recognized; the labels take the form 1 to n, and these n labels are also called candidate words. During data preprocessing, the actions occurring in a video are connected into a sentence (for example, "1, 3, 6" means that actions 1, 3 and 6 occurred in the video in sequence). In addition, since the candidate words generally need a start marker, the so-called <BOS> tag, <BOS> is also included in the candidate words (giving <BOS>, 1, 3, 6).
Then, this framework attempts to learn by maximizing the class index function (ensuring that the probability distribution in formula (5) attains its maximum value θ_max) via the maximum likelihood method, and then predicts the output weights from the hidden state obtained from the probability distribution in formula (2) and from the output at the previous moment (at t = 0, the predicted output can be an arbitrary initial value).
Finally, the loss of the softmax layer is computed based on formula (4) and propagated in the LSTM. The LSTM is an upgraded variant of the RNN (recurrent neural network): it allows earlier information to be connected with the current task, enabling the computer to handle inputs and outputs that are sequences of different lengths. For the input x_t at time t, the following variables can be computed successively from h_{t-1} and x_t by the following formulas (1):
$$
\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma(W_{of} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{xf} and W_{hf} are the weight matrices applied to the input and to the hidden state for the forget gate; W_{xi} and W_{hi} are the weight matrices applied to the input and to the hidden state for the input gate; W_{xg} and W_{hg} are the weight matrices applied to the input and to the hidden state when computing the new candidate value c_t; W_{of} and W_{ho} are the weight matrices applied to the input and to the hidden state for the output gate; b_f, b_i, b_g and b_o are the corresponding biases; f_t is the forget-gate output, i_t the input-gate output, and o_t the output-gate output; c_t is the cell state value, h_t the hidden state value, and g_t the new candidate value for c_t.
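A NumPy transcription of one step of formulas (1); the weight matrices W and biases b are assumed to be already-initialized dictionaries keyed by gate name (an illustrative layout):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following formulas (1)."""
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])  # forget gate
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])  # input gate
    g_t = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev + b["g"])  # candidate value
    c_t = f_t * c_prev + i_t * g_t                            # new cell state
    o_t = sigmoid(W["of"] @ x_t + W["ho"] @ h_prev + b["o"])  # output gate
    h_t = o_t * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t
```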
The LSTM is also a tool of the encoder-decoder framework. Suppose the input is x = {x_1, x_2, …, x_n} and the output is y = {y_1, y_2, …, y_m}. In the encoding stage, x is encoded and the variables above are computed by the LSTM. In the decoding stage, the probability distribution of a hidden state is built by a formula of our own design:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1}) \qquad (2)$$

where p(y_t | h_{n+t-1}, y_{t-1}) is obtained from a softmax function over all words (labels), and h_t is the hidden state at step t, computed by formula (1); the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
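A small sketch of how p(y_t | h_{n+t-1}, y_{t-1}) can be realized as a softmax over the candidate words, assuming a learned output projection W_out and bias b_out (illustrative names):

```python
import numpy as np

def label_distribution(h_t, W_out, b_out):
    """Project a decoder hidden state onto the candidate words
    (the n action labels plus <BOS>) and normalize with softmax."""
    scores = W_out @ h_t + b_out
    scores -= scores.max()                # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # one probability per candidate word
```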
207: Test unknown samples with the model trained in step 206.
In the test phase, the input video is processed in the same way as in the training steps above, yielding a 256-dimensional feature, and decoding begins. Starting with the <BOS> tag, all hidden state values h_t, t = 1, 2, …, n, of the video image sequence x obtained in step 206 are input into the second LSTM network system exclusive to this method, and the following variables are computed separately:

$$
\begin{aligned}
f1_t &= \sigma(W_{1xf} z_t + W_{1zf} z_{t-1} + b_{1f}) \\
i1_t &= \sigma(W_{1xi} h_t + W_{1zi} z_{t-1} + b_{1i}) \\
g1_t &= \tanh(W_{1xg} h_t + W_{1zg} z_{t-1} + b_{1g}) \\
c1_t &= f1_t \odot c1_{t-1} + i1_t \odot g1_t \\
o1_t &= \sigma(W_{1of} h_t + W_{1zo} z_{t-1} + b_{1o}) \\
z_t &= o1_t \odot \tanh(c1_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{1xf} and W_{1zf} are the weight matrices applied to the input and to the previous output for the forget gate; W_{1xi} and W_{1zi} are the weight matrices applied to the input and to the previous output for the input gate; W_{1xg} and W_{1zg} are the weight matrices applied to the input and to the previous output when computing the new candidate value c1_t; W_{1of} and W_{1zo} are the weight matrices applied to the input and to the previous output for the output gate; b_{1f}, b_{1i}, b_{1g} and b_{1o} are the corresponding biases; f1_t is the forget-gate output, i1_t the input-gate output, and o1_t the output-gate output; c1_t is the cell state value, z_t the output value, and g1_t the new candidate value for c1_t.
The output z_t gives, for each word, a score for each label in the vocabulary. Mean pooling is then applied per label; the pooled value represents the likelihood that each event occurs in the segment. The sigmoid function in formula (6) produces a probability distribution, giving the probability that each action occurs in the corresponding video; each probability is then compared with a given threshold (generally set to 0.5) to decide the 0/1 label of whether the action occurs in the video; finally, the labels of all potential actions are integrated to obtain the prediction result for the whole video.
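A sketch of this decision step, assuming the per-step label scores from z_t have been stacked into an array of shape (n_steps, n_labels); the 0.5 threshold follows the text:

```python
import numpy as np

def predict_events(label_scores, threshold=0.5):
    """Mean-pool each label's scores over time, squash the pooled values
    to probabilities with a sigmoid (formula (6)), and threshold to get
    the multi-label prediction for the whole video."""
    pooled = label_scores.mean(axis=0)        # one score per label
    probs = 1.0 / (1.0 + np.exp(-pooled))     # sigmoid
    return (probs > threshold).astype(int)    # 1 = event predicted to occur
```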
In summary, the embodiment of the present invention processes video information on the basis of multi-source information, the original model can be optimized with new data, and the efficiency and accuracy of video surveillance information processing are greatly improved.
References
[1] Gutchess D, Trajkovics M, Cohen-Solal E, et al. A background model initialization algorithm for video surveillance[C]//Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. IEEE, 2001, 1: 733-740.
[2] Fan C, Crandall D J. Deepdiary: Automatically captioning lifelogging image streams[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 459-473.
[3] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories[C]//Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. IEEE, 2006, 2: 2169-2178.
[4] Ibrahim M S, Muralidharan S, Deng Z, et al. A hierarchical deep temporal model for group activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1971-1980.
[5] Lan T, Wang Y, Yang W, et al. Discriminative latent models for recognizing contextual group activities[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(8): 1549-1562.
[6] Liu A A, Su Y T, Nie W Z, et al. Hierarchical clustering multi-task learning for joint human action grouping and recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 102-114.
[7] Pritch Y, Ratovitch S, Hendel A, et al. Clustered synopsis of surveillance video[C]//Advanced Video and Signal Based Surveillance, 2009. AVSS '09. Sixth IEEE International Conference on. IEEE, 2009: 195-200.
[8] Tu K, Meng M, Lee M W, et al. Joint video and text parsing for understanding events and answering queries[J]. IEEE MultiMedia, 2014, 21(2): 42-70.
[9] He X, Gao M, Kan M Y, et al. BiRank: Towards ranking on bipartite graphs[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1): 57-71.
[10] Hochreiter S, Schmidhuber J. LSTM can solve hard long time lag problems[C]//Advances in Neural Information Processing Systems. 1997: 473-479.
[11] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to Sequence -- Video to Text[C]//IEEE International Conference on Computer Vision. IEEE, 2015: 4534-4542.
Except where otherwise specified, the embodiment of the present invention does not limit the models of the devices involved; any device that can accomplish the above functions may be used.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A video event detection method based on multi-source deep learning, characterized in that the method comprises the following steps:
converting each frame of the video to grayscale and computing inter-frame differences to obtain a three-dimensional array; feeding the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtaining image features with the C3D algorithm as a 4096-dimensional feature vector;
taking the RGB image of the first frame of the original video as input, extracting picture features with a CNN network structure, and using the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector;
concatenating the feature vectors to obtain a 6144-dimensional vector; after all training videos have been processed and converted into vectors, performing dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
starting the model training stage after each video to be processed has been mapped to a 256-dimensional vector; and testing unknown samples with the trained model.
2. The video event detection method based on multi-source deep learning according to claim 1, characterized in that converting each frame of the video to grayscale and computing inter-frame differences to obtain the three-dimensional array is specifically:
for a given image sequence x = {x_1, x_2, …, x_n} with corresponding label set y = {y_1, y_2, …, y_m}, first converting each frame of the video to grayscale, then subtracting the first frame from the second frame, the second frame from the third frame, and so on, thereby converting the original video data from one three-dimensional array into another three-dimensional array.
3. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the C3D network is specifically:
all frames of the video image sequence x are divided into groups of 8 frames, and every 8 frames the fc7-layer data of a C3D network is output as a feature extraction result, yielding k 4096-dimensional feature vectors; finally, the fc7-layer data in the network structure is output as the feature, giving 4096-dimensional feature vectors.
4. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the CNN network structure is specifically: a network structure in which convolutional layers, fully connected layers and pooling layers are interconnected.
5. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the model training stage specifically includes:
including the labels of the actions to be recognized and the <BOS> tag in the candidate words;
learning by maximizing the class index function and predicting output weights by the maximum likelihood method;
computing the loss of the softmax layer and propagating it in the LSTM.
6. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the method further comprises:
in the decoding stage, building the probability distribution of a hidden state by the following formula:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1})$$

where p(y_t | h_{n+t-1}, y_{t-1}) is obtained from a softmax function over all words, and h_t is the hidden state at step t; the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
7. The video event detection method based on multi-source deep learning according to claim 1, characterized in that testing unknown samples with the trained model is specifically:
starting with the <BOS> tag, inputting all hidden state values h_t of the obtained video image sequence x into a second LSTM network system, and computing the following variables separately:
$$
\begin{aligned}
f1_t &= \sigma(W_{1xf} z_t + W_{1zf} z_{t-1} + b_{1f}) \\
i1_t &= \sigma(W_{1xi} h_t + W_{1zi} z_{t-1} + b_{1i}) \\
g1_t &= \tanh(W_{1xg} h_t + W_{1zg} z_{t-1} + b_{1g}) \\
c1_t &= f1_t \odot c1_{t-1} + i1_t \odot g1_t \\
o1_t &= \sigma(W_{1of} h_t + W_{1zo} z_{t-1} + b_{1o}) \\
z_t &= o1_t \odot \tanh(c1_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{1xf} and W_{1zf} are the weight matrices applied to the input and to the previous output for the forget gate; W_{1xi} and W_{1zi} are the weight matrices applied to the input and to the previous output for the input gate; W_{1xg} and W_{1zg} are the weight matrices applied to the input and to the previous output when computing the new candidate value c1_t; W_{1of} and W_{1zo} are the weight matrices applied to the input and to the previous output for the output gate; b_{1f}, b_{1i}, b_{1g} and b_{1o} are the corresponding biases; f1_t is the forget-gate output, i1_t the input-gate output, and o1_t the output-gate output; c1_t is the cell state value, z_t the output value, and g1_t the new candidate value for c1_t.
CN201810290777.4A 2018-04-03 2018-04-03 Video event detection method based on multi-source deep learning Pending CN108764019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810290777.4A CN108764019A (en) 2018-04-03 2018-04-03 Video event detection method based on multi-source deep learning


Publications (1)

Publication Number Publication Date
CN108764019A (en) 2018-11-06

Family

ID=63981088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810290777.4A Pending CN108764019A (en) 2018-04-03 2018-04-03 Video event detection method based on multi-source deep learning

Country Status (1)

Country Link
CN (1) CN108764019A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A * 2017-04-17 2017-08-18 杭州电子科技大学 A video content description method using a spatio-temporal attention model
CN107818307A * 2017-10-31 2018-03-20 天津大学 Multi-label video event detection method based on LSTM networks
CN107832708A * 2017-11-09 2018-03-23 云丁网络技术(北京)有限公司 Human motion recognition method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AN-AN LIU ET AL.: "LSTM-based multi-label video event detection", Springer *
DU TRAN ET AL.: "Learning Spatiotemporal Features with 3D Convolutional Networks", 2015 IEEE International Conference on Computer Vision (ICCV) *
YINGWEI PAN ET AL.: "Jointly Modeling Embedding and Translation to Bridge Video and Language", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112868019A (en) * 2018-11-14 2021-05-28 北京比特大陆科技有限公司 Feature processing method and device, storage medium and program product
CN109697852A (en) * 2019-01-23 2019-04-30 吉林大学 Urban road congestion degree prediction method based on time-series traffic events
CN110443182A (en) * 2019-07-30 2019-11-12 深圳市博铭维智能科技有限公司 Urban drainage pipeline video anomaly detection method based on multiple-instance learning
CN110738128A (en) * 2019-09-19 2020-01-31 天津大学 Repeated video detection method based on deep learning
CN112668366A (en) * 2019-10-15 2021-04-16 华为技术有限公司 Image recognition method, image recognition device, computer-readable storage medium and chip
CN112668366B (en) * 2019-10-15 2024-04-26 华为云计算技术有限公司 Image recognition method, device, computer readable storage medium and chip
CN110826702A (en) * 2019-11-18 2020-02-21 方玉明 Abnormal event detection method for multitask deep network
CN111680660B (en) * 2020-06-17 2023-03-24 郑州大学 Human behavior detection method based on multi-source heterogeneous data stream
CN111680660A (en) * 2020-06-17 2020-09-18 郑州大学 Human behavior detection method based on multi-source heterogeneous data stream
CN111814644A (en) * 2020-07-01 2020-10-23 重庆邮电大学 Video abnormal event detection method based on disturbance visual interpretation
CN111814644B (en) * 2020-07-01 2022-05-03 重庆邮电大学 Video abnormal event detection method based on disturbance visual interpretation
WO2023206532A1 (en) * 2022-04-29 2023-11-02 Oppo广东移动通信有限公司 Prediction method and apparatus, electronic device and computer-readable storage medium
CN116778395A (en) * 2023-08-21 2023-09-19 成都理工大学 Mountain torrent flood video identification monitoring method based on deep learning
CN116778395B (en) * 2023-08-21 2023-10-24 成都理工大学 Mountain torrent flood video identification monitoring method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106

WD01 Invention patent application deemed withdrawn after publication