CN109784280A - Human behavior recognition method based on Bi-LSTM-Attention model - Google Patents

Human behavior recognition method based on Bi-LSTM-Attention model

Info

Publication number
CN109784280A
Authority
CN
China
Prior art keywords
lstm
vector
model
network
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910048015.8A
Other languages
Chinese (zh)
Inventor
卢先领
朱铭康
王骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN201910048015.8A
Publication of CN109784280A
Legal status: Pending


Abstract

The present invention provides a human behavior recognition method based on a Bi-LSTM-Attention model, comprising the following steps: step S1, the extracted video frames are input into an InceptionV3 model, which increases the depth of the convolutional neural network while reducing the number of network parameters, fully extracts the deep features of the video frames, and outputs the corresponding feature vectors; step S2, the feature vectors obtained in step S1 are passed into a Bi-LSTM neural network, which fully learns the temporal features between video frames; step S3, the temporal feature vectors obtained in step S2 are passed into an attention mechanism model, which adaptively perceives the network weights that have the greatest influence on the recognition result, so that the features associated with those weights receive more attention. The present invention improves the recognition rate of human behavior.

Description

Human behavior recognition method based on Bi-LSTM-Attention model
Technical field
The present invention relates to the field of video analysis and recognition, and in particular to a human behavior recognition method based on a Bi-LSTM-Attention model.
Background technique
For human behavior recognition, most early work extracted video features by hand-crafted methods. One scheme extracts human features against complex backgrounds using spatio-temporal interest points: the response strength at each position in the video sequence is computed, and the interest points are located by maximum-value filtering. WANG W et al. learned static features by sparse coding, represented the features as histograms over a max-pooling-based temporal pyramid structure, and finally classified them with an SVM. Another scheme proposes a hierarchical clustering multi-task learning (HC-MTL) method, which realizes human behavior recognition by jointly learning shared behavior relationships and specific behavior features through an objective function. Methods based on hand-crafted feature extraction have achieved many excellent results in behavior recognition, but some intractable problems remain: hand-designed methods often fail to express the essential characteristics of an action and, because of the diversity of actions, easily overlook important features, which significantly degrades recognition.
JI S et al. first proposed a 3D CNN algorithm, which applies 3D convolution kernels to the video frames along the time axis to capture the spatial and temporal information of the video for recognizing human behavior. B. Mahasseni et al. constructed a 3D human skeleton and then used an LSTM to learn the temporal information of the skeleton for behavior recognition. Ullah A et al. used a CNN to extract the deep features of video frames, learned the temporal information in the feature sequence with a bidirectional LSTM, and finally classified with a softmax classifier. J. Donahue et al. proposed a long-term recurrent convolutional network, which extracts features from a 2D CNN and learns the ordering relations between those features with an LSTM network. The use of CNNs and LSTMs in behavior recognition has greatly improved recognition accuracy and reduced manual workload. However, the depth of the CNN strongly affects feature extraction from video frames: a shallow network cannot easily express the deep features of an image and tends to underfit, while a very deep network model is prone to gradient vanishing and is hard to optimize. A plain LSTM also cannot effectively learn the temporal features of an action and lacks adaptive capability.
Terms used herein:
SVM: support vector machine;
3D CNN: 3D convolutional neural network;
LSTM: long short-term memory network;
Attention: attention mechanism.
Summary of the invention
The object of the present invention is to overcome the shortcomings of the prior art and provide a human behavior recognition method based on a Bi-LSTM-Attention model that can learn the temporal information in feature sequences and train the network weights through an attention mechanism, thereby achieving better performance and reducing recognition error.
The technical solution adopted by the present invention is as follows:
A human behavior recognition method based on a Bi-LSTM-Attention model, comprising the following steps:
Step S1: input the extracted video frames into an InceptionV3 model, which increases the depth of the convolutional neural network while reducing the number of network parameters, fully extracts the deep features of the video frames, and outputs the corresponding feature vectors;
Step S2: pass the feature vectors obtained in step S1 into a Bi-LSTM neural network, which fully learns the temporal features between video frames;
Step S3: pass the temporal feature vectors obtained in step S2 into an attention mechanism model, which adaptively perceives the network weights that have the greatest influence on the recognition result, so that the features associated with those weights receive more attention.
Further, in step S1, the InceptionV3 model combines different convolutional layers in parallel, performs convolution operations on the video frames with convolution kernels of different sizes at the same time, finally concatenates the feature vectors produced by the different convolution kernels through a filter concatenation layer, and outputs a deep feature matrix through a fully connected layer for transmission into the Bi-LSTM neural network.
Further, step S2 specifically includes:
w_i (i = 1, …, 6) denotes the weight from one network layer to another; {…, h_{t-1}, h_t, h_{t+1}, …} denotes the forward propagation layer of the LSTM neural network, whose input is the front-to-back feature sequence {…, x_{t-1}, x_t, x_{t+1}, …};
{…, h'_{t+1}, h'_t, h'_{t-1}, …} denotes the backward propagation layer of the LSTM neural network, whose input is the back-to-front feature sequence {…, x_{t+1}, x_t, x_{t-1}, …};
x_t denotes the feature vector obtained after the InceptionV3 model extracts the deep features of the corresponding video frame; as in the following formulas:
h_t = f(w_1 x_t + w_2 h_{t-1} + b_1)   (1)
h'_t = f(w_3 x_t + w_5 h'_{t+1} + b_2)   (2)
o'_t = g(w_4 h_t + b_3)   (3)
o''_t = g(w_6 h'_t + b_4)   (4)
o_t = (o'_t + o''_t) / 2   (5)
In formulas (1)-(4), f and g denote activation functions, and b_1, b_2, b_3, b_4 denote the bias coefficients of the hidden units; o'_t and o''_t are the results of the two LSTM units processing, at the corresponding time step, the feature vector output by the InceptionV3 layer. The two vectors at each time step are summed and averaged to give the output temporal feature vector.
Further, step S3 specifically includes:
o_t denotes the t-th temporal feature vector output by the Bi-LSTM neural network; the temporal feature vectors are passed into the attention mechanism model, whose hidden layer produces the initial state vectors S_t. The weight coefficient α_t denotes the proportion that the initial state vector S_t occupies in the final output state vector Y; the final output state vector Y is the accumulated sum of the products of each initial state vector S_t and its weight coefficient α_t. The calculation formulas are as follows:
e_t = tanh(w_t S_t + b_t)   (6)
α_t = exp(e_t) / Σ_{k=1}^{n} exp(e_k)   (7)
Y = Σ_{t=1}^{n} α_t S_t   (8)
Here tanh denotes the excitation function and n denotes the number of video frames; e_t denotes the energy value determined by the state vector S_t of the t-th temporal feature vector, and w_t and b_t denote the weight and bias. Formula (7) takes e as the base and, for each part, divides its energy term by the accumulated sum of all energy terms, which yields the weight coefficient measuring how much that part influences the classification result and thus realizes the conversion from the initial state to the attention state; finally, formula (8) gives the final output state vector Y.
The present invention has the following advantages: in the feature extraction stage for video frames, the InceptionV3 model is used to extract features, which solves the network depth problem; the Bi-LSTM neural network then fully learns the temporal information between features; finally, the attention mechanism further improves the performance of the network model. Compared on the YouTube Action and KTH human behavior data sets with existing methods such as DB-LSTM and 3D CNN, the experimental results show that the recognition rates of the proposed algorithm reach 94.38% and 95.67%, respectively.
Detailed description of the invention
Fig. 1 is a schematic diagram of the behavior recognition framework of the present invention based on the Bi-LSTM-Attention model.
Fig. 2 is a schematic diagram of the InceptionV3 model of the present invention.
Fig. 3 is a schematic diagram of the Bi-LSTM neural network of the present invention.
Fig. 4 is a schematic diagram of the attention mechanism model of the present invention.
Specific embodiment
The invention will be further described below with reference to the specific drawings and embodiments.
The present invention proposes a human behavior recognition method based on the Bi-LSTM-Attention model (Human Action Recognition Algorithm Based on the Bi-LSTM-Attention Model).
The method first extracts 20 video frames from each video and extracts their deep features with the InceptionV3 model; it then constructs the forward and backward feature vectors in the Bi-LSTM neural network; next, the attention (Attention) mechanism model adaptively perceives the network weights that have the greatest influence on the recognition result, so that the Bi-LSTM-Attention model can recognize more accurately according to the context of the behavior; finally, a fully connected layer followed by a softmax classifier classifies the video.
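The end-to-end flow described above can be traced as a shape sketch in NumPy (random stand-ins replace the trained InceptionV3, Bi-LSTM, and classifier stages; the 20 frames per video and 1024-dimensional features follow the description, while the 11-class output is only an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

S, d, n_classes = 20, 1024, 11                 # frames per video, feature dim; class count assumed
features = rng.standard_normal((S, d))         # step S1: InceptionV3 deep features (stand-in)
temporal = np.tanh(features)                   # step S2: Bi-LSTM temporal features (stand-in), S x 1024

e = np.tanh(temporal @ rng.standard_normal(d) * 0.01)   # step S3: energy per frame
alpha = np.exp(e) / np.exp(e).sum()                     # attention weights over the S frames
Y = alpha @ temporal                                    # attended 1024-dim state vector

logits = Y @ rng.standard_normal((d, n_classes))        # fully connected layer (stand-in)
probs = np.exp(logits) / np.exp(logits).sum()           # softmax over behavior classes

assert probs.shape == (n_classes,) and np.isclose(probs.sum(), 1.0)
```

The sketch only demonstrates the tensor shapes passed between the three stages; in the actual method each stand-in is a trained network component.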
The method mainly comprises three steps:
Step S1: input the extracted video frames into an InceptionV3 model, which increases the depth of the convolutional neural network while reducing the number of network parameters, fully extracts the deep features of the video frames, and outputs the corresponding feature vectors.
The InceptionV3 model mainly performs feature extraction on the input video frames and converts them into the feature-vector form that the Bi-LSTM neural network can directly receive and process. Unlike traditional CNN feature extraction, it combines different convolutional layers in parallel, performing convolution operations on the video frames with convolution kernels of different sizes at the same time, and then concatenates the resulting feature vectors, as shown in Fig. 2.
Here 128 × 128 × 3 is the video frame size (128 × 128 pixels, 3 RGB channels), 1 × 1, 1 × n, and n × 1 are convolution kernel sizes, and pool denotes a pooling operation. Finally, the Filter Concat (filter concatenation) layer stitches together the feature vectors produced by the different convolution kernels, and a fully connected layer outputs an S × 1024 (S is the number of video frames, here 20) deep feature matrix that is transmitted into the Bi-LSTM neural network.
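The parallel-branch idea can be illustrated with a toy NumPy sketch (a hypothetical single-channel frame and random kernels, not the actual InceptionV3 weights): each branch convolves the same input with a different kernel size under 'same' padding, and Filter Concat stacks the branch outputs along a channel axis.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same'-padded 2-D convolution of a single-channel image."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))   # toy stand-in for one channel of a 128 x 128 video frame

# Parallel branches with different kernel sizes, applied to the same input:
branches = [conv2d_same(frame, rng.standard_normal((k, k))) for k in (1, 3, 5)]

# Filter Concat: stitch the branch outputs together along a new channel axis.
features = np.stack(branches, axis=-1)
assert features.shape == (8, 8, 3)    # spatial size preserved, one channel per branch
```

Because every branch preserves the spatial size, the concatenation is well-defined; a deep model then feeds such concatenated maps through further layers before the final fully connected layer.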
Step S2: pass the feature vectors obtained in step S1 into the Bi-LSTM neural network for processing, which fully learns the temporal features between video frames, as shown in Fig. 3.
Here w_i (i = 1, …, 6) denotes the weight from one network layer to another; {…, h_{t-1}, h_t, h_{t+1}, …} denotes the forward propagation layer of the LSTM neural network, whose input is the front-to-back feature sequence {…, x_{t-1}, x_t, x_{t+1}, …};
{…, h'_{t+1}, h'_t, h'_{t-1}, …} denotes the backward propagation layer of the LSTM neural network, whose input is the back-to-front feature sequence {…, x_{t+1}, x_t, x_{t-1}, …};
x_t denotes the feature vector obtained after the InceptionV3 model extracts the deep features of the corresponding video frame; as in the following formulas:
h_t = f(w_1 x_t + w_2 h_{t-1} + b_1)   (1)
h'_t = f(w_3 x_t + w_5 h'_{t+1} + b_2)   (2)
o'_t = g(w_4 h_t + b_3)   (3)
o''_t = g(w_6 h'_t + b_4)   (4)
o_t = (o'_t + o''_t) / 2   (5)
In formulas (1)-(4), f and g denote activation functions, and b_1, b_2, b_3, b_4 denote the bias coefficients of the hidden units; o'_t and o''_t are the results of the two LSTM units processing, at the corresponding time step, the feature vector output by the InceptionV3 layer. The two vectors at each time step are summed and averaged to give the output temporal feature vector, so the output is again an S × 1024 matrix. The temporal feature vectors are then sent to the attention mechanism model for perceiving the network weights. Compared with the traditional one-way LSTM algorithm, the Bi-LSTM algorithm can learn from past and future information simultaneously and thus obtains more robust temporal information.
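Formulas (1)-(5) can be sketched in NumPy as follows (a simplified illustration: the recurrent units here are plain tanh cells, and the scalar weights w1…w6 are random stand-ins for the trained LSTM units):

```python
import numpy as np

def bi_directional_pass(x, f=np.tanh, g=np.tanh, seed=0):
    """Run the feature sequence x (S x d) through a forward and a backward
    recurrent pass per formulas (1)-(5) and average the two outputs."""
    rng = np.random.default_rng(seed)
    w1, w2, w3, w4, w5, w6 = rng.uniform(-0.1, 0.1, size=6)
    b1 = b2 = b3 = b4 = 0.0                    # bias coefficients of the hidden units
    S, d = x.shape
    h = np.zeros((S, d))                       # forward hidden states h_t
    hb = np.zeros((S, d))                      # backward hidden states h'_t
    prev = np.zeros(d)
    for t in range(S):                         # formula (1): front-to-back pass
        prev = f(w1 * x[t] + w2 * prev + b1)
        h[t] = prev
    nxt = np.zeros(d)
    for t in reversed(range(S)):               # formula (2): back-to-front pass
        nxt = f(w3 * x[t] + w5 * nxt + b2)
        hb[t] = nxt
    o_fwd = g(w4 * h + b3)                     # formula (3)
    o_bwd = g(w6 * hb + b4)                    # formula (4)
    return (o_fwd + o_bwd) / 2                 # formula (5): sum and average

x = np.random.default_rng(1).standard_normal((20, 1024))  # S x 1024 feature matrix from step S1
o = bi_directional_pass(x)
assert o.shape == (20, 1024)    # output is again an S x 1024 temporal feature matrix
```

The two passes read the same sequence in opposite directions, so each output o_t mixes information from both past and future frames, which is the property the description attributes to Bi-LSTM.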
Step S3: pass the temporal feature vectors obtained in step S2 into the attention mechanism model, which adaptively perceives the network weights that have the greatest influence on the recognition result, so that the features associated with those weights receive more attention, as shown in Fig. 4.
o_t denotes the t-th temporal feature vector output by the Bi-LSTM neural network; the temporal feature vectors are passed into the attention mechanism model, whose hidden layer produces the initial state vectors S_t. The weight coefficient α_t denotes the proportion that the initial state vector S_t occupies in the final output state vector Y (1024 × 1); the final output state vector Y is the accumulated sum of the products of each initial state vector S_t and its weight coefficient α_t. The calculation formulas are as follows:
e_t = tanh(w_t S_t + b_t)   (6)
α_t = exp(e_t) / Σ_{k=1}^{n} exp(e_k)   (7)
Y = Σ_{t=1}^{n} α_t S_t   (8)
Here tanh denotes the excitation function and n denotes the number of video frames; e_t denotes the energy value determined by the state vector S_t of the t-th temporal feature vector, and w_t and b_t denote the weight and bias. Formula (7) takes e as the base and, for each part, divides its energy term by the accumulated sum of all energy terms, which yields the weight coefficient measuring how much that part influences the classification result and thus realizes the conversion from the initial state to the attention state; then formula (8) gives the final output state vector Y. Finally, Y is combined into an output value by a fully connected layer, which reduces the influence of feature position on classification, and the outputs of the multiple neurons are mapped into the interval (0, 1) by a softmax classifier to perform multi-class classification.
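Formulas (6)-(8) amount to a softmax-weighted sum over the per-frame state vectors, which can be sketched in NumPy (the hidden-layer state vectors S_t and the parameters w_t, b_t are random stand-ins for the trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 1024                        # number of video frames, feature dimension
S = rng.standard_normal((n, d))        # stand-in state vectors S_t from the hidden layer
w = rng.standard_normal(d) * 0.01      # stand-in attention weight
b = 0.0                                # stand-in bias

e = np.tanh(S @ w + b)                 # formula (6): energy value e_t per frame
alpha = np.exp(e) / np.exp(e).sum()    # formula (7): weight coefficients alpha_t (softmax)
Y = alpha @ S                          # formula (8): final output state vector Y

assert np.isclose(alpha.sum(), 1.0)    # the weights form a probability distribution
assert Y.shape == (d,)                 # Y is a single 1024-dimensional state vector
```

Frames with larger energy values e_t receive larger weights alpha_t, so the averaged vector Y is dominated by the frames most relevant to the recognition result.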
The present invention was tested in a GPU-accelerated environment using the Python language and the Keras deep learning framework; the computer was configured with a Win10 system, 16 GB of memory, and a GTX 1080 with 11 GB of video memory. The network parameters of the Bi-LSTM-Attention model were trained under this configuration.
Experiments show that the precision of the proposed network model reaches 94.38% on the YouTube Action data set.
Table 1: Comparison on the YouTube Action data set with other model algorithms
As can be seen from Table 1, on the YouTube Action data set, the human behavior recognition method based on the Bi-LSTM-Attention model proposed by the present invention, combined with the InceptionV3 model, obtains precision better than the three deep-learning-based algorithms Binary CNN-Flow, Discriminative representation, and Proposed DB-LSTM, and also better than three traditional algorithms based on hand-crafted feature extraction: Hierarchical clustering multi-task, Fisher vectors, and 3D spatio-temporal. Meanwhile, the present invention also tested the LSTM and Bi-LSTM algorithms under the same model; the experimental results show that the Bi-LSTM-Attention model brings improvements of 4.85% and 1.57% in recognition precision, respectively.
The present invention also tested three methods, LSTM, Bi-LSTM, and Bi-LSTM-Attention, each combined with the InceptionV3 model, on the KTH data set; the average recognition precision of the proposed method on KTH reaches 95.67%, which is 5.33% and 1% higher than the results of the LSTM and Bi-LSTM algorithms.
Table 2: Comparison on the KTH data set with other model algorithms
As can be seen from Table 2, the human behavior recognition method based on the Bi-LSTM-Attention model proposed by the present invention still performs well on the KTH data set, demonstrating the feasibility of the proposed algorithm.
Finally, it should be noted that the above specific embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to examples, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention, all of which should be covered by the scope of the claims of the present invention.

Claims (4)

1. A human behavior recognition method based on a Bi-LSTM-Attention model, characterized by comprising the following steps:
Step S1: input the extracted video frames into an InceptionV3 model, which increases the depth of the convolutional neural network while reducing the number of network parameters, fully extracts the deep features of the video frames, and outputs the corresponding feature vectors;
Step S2: pass the feature vectors obtained in step S1 into a Bi-LSTM neural network, which fully learns the temporal features between video frames;
Step S3: pass the temporal feature vectors obtained in step S2 into an attention mechanism model, which adaptively perceives the network weights that have the greatest influence on the recognition result, so that the features associated with those weights receive more attention.
2. The human behavior recognition method based on the Bi-LSTM-Attention model according to claim 1, characterized in that,
in step S1, the InceptionV3 model combines different convolutional layers in parallel, performs convolution operations on the video frames with convolution kernels of different sizes at the same time, finally concatenates the feature vectors produced by the different convolution kernels through a filter concatenation layer, and outputs a deep feature matrix through a fully connected layer for transmission into the Bi-LSTM neural network.
3. The human behavior recognition method based on the Bi-LSTM-Attention model according to claim 1, characterized in that,
step S2 specifically includes:
w_i (i = 1, …, 6) denotes the weight from one network layer to another; {…, h_{t-1}, h_t, h_{t+1}, …} denotes the forward propagation layer of the LSTM neural network, whose input is the front-to-back feature sequence {…, x_{t-1}, x_t, x_{t+1}, …};
{…, h'_{t+1}, h'_t, h'_{t-1}, …} denotes the backward propagation layer of the LSTM neural network, whose input is the back-to-front feature sequence {…, x_{t+1}, x_t, x_{t-1}, …};
x_t denotes the feature vector obtained after the InceptionV3 model extracts the deep features of the corresponding video frame; as in the following formulas:
h_t = f(w_1 x_t + w_2 h_{t-1} + b_1)   (1)
h'_t = f(w_3 x_t + w_5 h'_{t+1} + b_2)   (2)
o'_t = g(w_4 h_t + b_3)   (3)
o''_t = g(w_6 h'_t + b_4)   (4)
o_t = (o'_t + o''_t) / 2   (5)
In formulas (1)-(4), f and g denote activation functions, and b_1, b_2, b_3, b_4 denote the bias coefficients of the hidden units; o'_t and o''_t are the results of the two LSTM units processing, at the corresponding time step, the feature vector output by the InceptionV3 layer; the two vectors at each time step are summed and averaged to give the output temporal feature vector.
4. The human behavior recognition method based on the Bi-LSTM-Attention model according to claim 1, characterized in that step S3 specifically includes:
o_t denotes the t-th temporal feature vector output by the Bi-LSTM neural network; the temporal feature vectors are passed into the attention mechanism model, whose hidden layer produces the initial state vectors S_t; the weight coefficient α_t denotes the proportion that the initial state vector S_t occupies in the final output state vector Y; the final output state vector Y is the accumulated sum of the products of each initial state vector S_t and its weight coefficient α_t; the calculation formulas are as follows:
e_t = tanh(w_t S_t + b_t)   (6)
α_t = exp(e_t) / Σ_{k=1}^{n} exp(e_k)   (7)
Y = Σ_{t=1}^{n} α_t S_t   (8)
Here tanh denotes the excitation function and n denotes the number of video frames; e_t denotes the energy value determined by the state vector S_t of the t-th temporal feature vector, and w_t and b_t denote the weight and bias; formula (7) takes e as the base and divides each part's energy term by the accumulated sum of all energy terms, yielding the weight coefficient that measures how much each part influences the classification result and thereby realizing the conversion from the initial state to the attention state; then formula (8) gives the final output state vector Y.
CN201910048015.8A 2019-01-18 2019-01-18 Human behavior recognition method based on Bi-LSTM-Attention model Pending CN109784280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910048015.8A CN109784280A (en) 2019-01-18 2019-01-18 Human behavior recognition method based on Bi-LSTM-Attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910048015.8A CN109784280A (en) 2019-01-18 2019-01-18 Human behavior recognition method based on Bi-LSTM-Attention model

Publications (1)

Publication Number Publication Date
CN109784280A true CN109784280A (en) 2019-05-21

Family

ID=66501453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910048015.8A Pending CN109784280A (en) 2019-01-18 2019-01-18 Human behavior recognition method based on Bi-LSTM-Attention model

Country Status (1)

Country Link
CN (1) CN109784280A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222653A (en) * 2019-06-11 2019-09-10 中国矿业大学(北京) A kind of skeleton data Activity recognition method based on figure convolutional neural networks
CN110245581A (en) * 2019-05-25 2019-09-17 天津大学 A kind of Human bodys' response method based on deep learning and distance-Doppler sequence
CN110287820A (en) * 2019-06-06 2019-09-27 北京清微智能科技有限公司 Activity recognition method, apparatus, equipment and medium based on LRCN network
CN110427834A (en) * 2019-07-10 2019-11-08 上海工程技术大学 A kind of Activity recognition system and method based on skeleton data
CN110664412A (en) * 2019-09-19 2020-01-10 天津师范大学 Human activity recognition method facing wearable sensor
CN110765956A (en) * 2019-10-28 2020-02-07 西安电子科技大学 Double-person interactive behavior recognition method based on component characteristics
CN110929243A (en) * 2019-11-22 2020-03-27 武汉大学 Pedestrian identity recognition method based on mobile phone inertial sensor
CN110991480A (en) * 2019-10-31 2020-04-10 上海交通大学 Attention mechanism-based sparse coding method
CN110990608A (en) * 2019-12-03 2020-04-10 哈尔滨工业大学 Three-dimensional model retrieval method based on Simese structure bidirectional long-time and short-time memory network
CN111079547A (en) * 2019-11-22 2020-04-28 武汉大学 Pedestrian moving direction identification method based on mobile phone inertial sensor
CN111082879A (en) * 2019-12-27 2020-04-28 南京邮电大学 Wifi perception method based on deep space-time model
CN111079599A (en) * 2019-12-06 2020-04-28 浙江工业大学 Human body complex behavior recognition method based on multi-feature fusion CNN-BLSTM
CN111191663A (en) * 2019-12-31 2020-05-22 深圳云天励飞技术有限公司 License plate number recognition method and device, electronic equipment and storage medium
CN111372123A (en) * 2020-03-03 2020-07-03 南京信息工程大学 Video time sequence segment extraction method based on local to global
CN111401209A (en) * 2020-03-11 2020-07-10 佛山市南海区广工大数控装备协同创新研究院 Action recognition method based on deep learning
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111597881A (en) * 2020-04-03 2020-08-28 浙江工业大学 Human body complex behavior identification method based on data separation multi-scale feature combination
CN112037179A (en) * 2020-08-11 2020-12-04 深圳大学 Method, system and equipment for generating brain disease diagnosis model
CN112131981A (en) * 2020-09-10 2020-12-25 山东大学 Driver fatigue detection method based on skeleton data behavior recognition
CN112131972A (en) * 2020-09-07 2020-12-25 重庆邮电大学 Method for recognizing human body behaviors by using WiFi data based on attention mechanism
CN112528733A (en) * 2020-10-29 2021-03-19 西安工程大学 Abnormal behavior identification method of network
CN112528891A (en) * 2020-12-16 2021-03-19 重庆邮电大学 Bidirectional LSTM-CNN video behavior identification method based on skeleton information
CN112597921A (en) * 2020-12-28 2021-04-02 杭州电子科技大学 Human behavior recognition method based on attention mechanism GRU deep learning
CN112686211A (en) * 2021-01-25 2021-04-20 广东工业大学 Fall detection method and device based on attitude estimation
CN112766420A (en) * 2021-03-12 2021-05-07 合肥共达职业技术学院 Human behavior identification method based on time-frequency domain information
CN114120166A (en) * 2021-10-14 2022-03-01 北京百度网讯科技有限公司 Video question and answer method and device, electronic equipment and storage medium
CN114155480A (en) * 2022-02-10 2022-03-08 北京智视数策科技发展有限公司 Vulgar action recognition method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314392A1 (en) * 2015-03-26 2016-10-27 Nokia Technologies Oy Generating using a bidirectional rnn variations to music
CN108363753A (en) * 2018-01-30 2018-08-03 南京邮电大学 Comment text sentiment classification model is trained and sensibility classification method, device and equipment
CN108427670A (en) * 2018-04-08 2018-08-21 重庆邮电大学 A kind of sentiment analysis method based on context word vector sum deep learning
CN108519890A (en) * 2018-04-08 2018-09-11 武汉大学 A kind of robustness code abstraction generating method based on from attention mechanism
US20180288086A1 (en) * 2017-04-03 2018-10-04 Royal Bank Of Canada Systems and methods for cyberbot network detection
CN108763204A (en) * 2018-05-21 2018-11-06 浙江大学 A kind of multi-level text emotion feature extracting method and model
AU2018101514A4 (en) * 2018-10-11 2018-11-15 Chi, Henan Mr An automatic text-generating program for Chinese Hip-hop lyrics
CN108829801A (en) * 2018-06-06 2018-11-16 大连理工大学 A kind of event trigger word abstracting method based on documentation level attention mechanism


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHRISTIAN SZEGEDY et al.: "Rethinking the Inception Architecture for Computer Vision", 2016 IEEE Conference on Computer Vision and Pattern Recognition *
HUANHOU XIAO et al.: "Video Captioning using Hierarchical Multi-Attention Model", ACM *
PENG ZHOU et al.: "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics *
GAO Zhiqiang et al. (eds.): "Deep Learning: From Introduction to Practice" (《深度学习 从入门到实战》), 30 June 2018 *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245581A (en) * 2019-05-25 2019-09-17 天津大学 A kind of Human bodys' response method based on deep learning and distance-Doppler sequence
CN110245581B (en) * 2019-05-25 2023-04-07 天津大学 Human behavior recognition method based on deep learning and distance-Doppler sequence
CN110287820A (en) * 2019-06-06 2019-09-27 北京清微智能科技有限公司 Activity recognition method, apparatus, equipment and medium based on LRCN network
CN110287820B (en) * 2019-06-06 2021-07-23 北京清微智能科技有限公司 Behavior recognition method, device, equipment and medium based on LRCN network
CN110222653B (en) * 2019-06-11 2020-06-16 中国矿业大学(北京) Skeleton data behavior identification method based on graph convolution neural network
CN110222653A (en) * 2019-06-11 2019-09-10 中国矿业大学(北京) A kind of skeleton data Activity recognition method based on figure convolutional neural networks
CN110427834A (en) * 2019-07-10 2019-11-08 上海工程技术大学 A kind of Activity recognition system and method based on skeleton data
CN110664412A (en) * 2019-09-19 2020-01-10 天津师范大学 Human activity recognition method facing wearable sensor
CN110765956A (en) * 2019-10-28 2020-02-07 西安电子科技大学 Double-person interactive behavior recognition method based on component characteristics
CN110765956B (en) * 2019-10-28 2021-10-29 西安电子科技大学 Double-person interactive behavior recognition method based on component characteristics
CN110991480A (en) * 2019-10-31 2020-04-10 上海交通大学 Attention mechanism-based sparse coding method
CN111079547A (en) * 2019-11-22 2020-04-28 武汉大学 Pedestrian moving direction identification method based on mobile phone inertial sensor
CN111079547B (en) * 2019-11-22 2022-07-19 武汉大学 Pedestrian moving direction identification method based on mobile phone inertial sensor
CN110929243B (en) * 2019-11-22 2022-07-22 武汉大学 Pedestrian identity recognition method based on mobile phone inertial sensor
CN110929243A (en) * 2019-11-22 2020-03-27 武汉大学 Pedestrian identity recognition method based on mobile phone inertial sensor
CN110990608A (en) * 2019-12-03 2020-04-10 哈尔滨工业大学 Three-dimensional model retrieval method based on Siamese-structure bidirectional long short-term memory network
CN111079599B (en) * 2019-12-06 2022-04-05 浙江工业大学 Human body complex behavior recognition method based on multi-feature fusion CNN-BLSTM
CN111079599A (en) * 2019-12-06 2020-04-28 浙江工业大学 Human body complex behavior recognition method based on multi-feature fusion CNN-BLSTM
CN111082879B (en) * 2019-12-27 2022-02-01 南京邮电大学 Wifi perception method based on deep space-time model
CN111082879A (en) * 2019-12-27 2020-04-28 南京邮电大学 Wifi perception method based on deep space-time model
CN111191663B (en) * 2019-12-31 2022-01-11 深圳云天励飞技术股份有限公司 License plate number recognition method and device, electronic equipment and storage medium
CN111191663A (en) * 2019-12-31 2020-05-22 深圳云天励飞技术有限公司 License plate number recognition method and device, electronic equipment and storage medium
CN111372123A (en) * 2020-03-03 2020-07-03 南京信息工程大学 Video time sequence segment extraction method based on local to global
CN111372123B (en) * 2020-03-03 2022-08-09 南京信息工程大学 Video time sequence segment extraction method based on local to global
CN111401209B (en) * 2020-03-11 2023-11-07 佛山市南海区广工大数控装备协同创新研究院 Action recognition method based on deep learning
CN111401209A (en) * 2020-03-11 2020-07-10 佛山市南海区广工大数控装备协同创新研究院 Action recognition method based on deep learning
CN111597881B (en) * 2020-04-03 2022-04-05 浙江工业大学 Human body complex behavior identification method based on data separation multi-scale feature combination
CN111597881A (en) * 2020-04-03 2020-08-28 浙江工业大学 Human body complex behavior identification method based on data separation multi-scale feature combination
CN111523410B (en) * 2020-04-09 2022-08-26 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN112037179A (en) * 2020-08-11 2020-12-04 深圳大学 Method, system and equipment for generating brain disease diagnosis model
CN112131972A (en) * 2020-09-07 2020-12-25 重庆邮电大学 Method for recognizing human body behaviors by using WiFi data based on attention mechanism
CN112131981A (en) * 2020-09-10 2020-12-25 山东大学 Driver fatigue detection method based on skeleton data behavior recognition
CN112528733A (en) * 2020-10-29 2021-03-19 西安工程大学 Human body abnormal behavior identification method based on improved Inception v3 network
CN112528733B (en) * 2020-10-29 2024-03-22 西安工程大学 Human body abnormal behavior identification method based on improved Inception v3 network
CN112528891A (en) * 2020-12-16 2021-03-19 重庆邮电大学 Bidirectional LSTM-CNN video behavior identification method based on skeleton information
CN112597921A (en) * 2020-12-28 2021-04-02 杭州电子科技大学 Human behavior recognition method based on attention mechanism GRU deep learning
CN112597921B (en) * 2020-12-28 2024-02-02 杭州电子科技大学 Human behavior recognition method based on attention mechanism GRU deep learning
CN112686211A (en) * 2021-01-25 2021-04-20 广东工业大学 Fall detection method and device based on attitude estimation
CN112766420A (en) * 2021-03-12 2021-05-07 合肥共达职业技术学院 Human behavior identification method based on time-frequency domain information
CN114120166A (en) * 2021-10-14 2022-03-01 北京百度网讯科技有限公司 Video question and answer method and device, electronic equipment and storage medium
CN114120166B (en) * 2021-10-14 2023-09-22 北京百度网讯科技有限公司 Video question-answering method and device, electronic equipment and storage medium
CN114155480A (en) * 2022-02-10 2022-03-08 北京智视数策科技发展有限公司 Vulgar action recognition method

Similar Documents

Publication Publication Date Title
CN109784280A (en) Human behavior recognition method based on Bi-LSTM-Attention model
CN108717568B (en) Image feature extraction and training method based on three-dimensional convolutional neural network
Song et al. Region-based quality estimation network for large-scale person re-identification
Deng et al. Learning to predict crisp boundaries
Zhang et al. Context encoding for semantic segmentation
CN108830157B (en) Human behavior identification method based on attention mechanism and 3D convolutional neural network
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN109543697A (en) RGBD image steganalysis method based on deep learning
CN111832516B (en) Video behavior recognition method based on unsupervised video representation learning
CN108133188A (en) Behavior recognition method based on motion history image and convolutional neural network
CN111950455B (en) Motor imagery electroencephalogram feature recognition method based on LFFCNN-GRU algorithm model
CN112784764A (en) Expression recognition method and system based on local and global attention mechanism
CN109063719B (en) Image classification method combining structure similarity and class information
CN110378208B (en) Behavior identification method based on deep residual network
CN108647599B (en) Human behavior recognition method combining 3D skip-layer connections and recurrent neural network
CN113496217A (en) Method for identifying human face micro expression in video image sequence
CN104809469A (en) Indoor scene image classification method for service robots
CN113920581B (en) Method for identifying actions in video by using space-time convolution attention network
CN108921047A (en) Multi-model voting-average action recognition method based on cross-layer fusion
CN104063721A (en) Human behavior recognition method based on automatic semantic feature learning and selection
CN104408461A (en) Method for identifying local matching window motion based on sliding window
CN110322418A (en) Training method and device for super-resolution image generative adversarial network
CN114241564A (en) Facial expression recognition method based on inter-class difference strengthening network
CN110288603A (en) Semantic segmentation method based on efficient convolutional network and convolutional conditional random field
CN113850182A (en) Action identification method based on DAMR-3DNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190521