CN110059587A

CN110059587A - Human bodys' response method based on space-time attention

Info

Publication number: CN110059587A
Application number: CN201910250775.7A
Authority: CN
Inventors: 田智强; 产文颂; 郑帅; 杜少毅; 兰旭光
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2019-07-26

Abstract

The invention discloses a kind of Human bodys' response methods based on space-time attention, and the present invention extracts picture feature using convolutional neural networks, so that input of the feature vector as shot and long term memory network is obtained, it is more more advantageous than directly inputting picture；Preferably retain and handled the timing information in video using shot and long term memory network；Using space-time attention mechanism, model is allowed to pay close attention to the upper important sequence of spatially important point and time, to improve the efficiency and accuracy rate of identification.

Description

Human bodys' response method based on space-time attention

Technical field

The invention belongs to computer vision, visual classification, deep learning and field in intelligent robotics, and in particular to Yi Zhongji In the Human bodys' response method of space-time attention.

Background technique

Depth learning technology achieves very more research achievements, is more and more leading in fast development recent years Domain plays an important role.

The application prospect of computer vision is very wide, it obtains image using image capture device, then calculates Machine analyzes these images, obtain mutually in requisition for information, as made by the eyes and brain for scheming the mankind and many biologies It works similar.With the development of technology, machine learning and depth learning technology are combined, this field obtained many weights in recent years Quantum jump, while also having more and more problem and demand urgently to be resolved.

With flourishing for internet area and mobile terminal many years, there is a large amount of video collected daily and uploads, How to carry out Classification and Identification to these videos has research significance very much.On the other hand, a kind of carrier of the video as information, is obtained Take information therein that all there is important value many-sided.However, being accomplished manually these tasks due to the substantial amounts of video It is very unpractical, and substitutes that manually to complete task be natural using computer.

All kinds of robots play increasingly important role in today's society, and society is also more next with the demand in market It is bigger.In this case, robot is made to become more intelligent necessary.Robot is exactly intelligence to the Activity recognition of the mankind Change a form, have Human bodys' response robot can preferably carry out human-computer interaction and man-machine collaboration etc. it is many compared with For complicated behavior.

Since AlexNet comes out, extensive concern and application that convolutional neural networks are subject to.Convolutional neural networks are deep One of most representative algorithm of learning areas is spent, is a kind of BP network model comprising convolutional calculation, answers extensively For computer vision field, the network structure represented has VGG, GoogleNet, ResNet etc..

Shot and long term memory network is one of the representative algorithm in a kind of recursive neural network and deep learning.With Convolutional neural networks are compared, it is generally better at processing sequence information, such as machine translation, sentiment analysis etc..

Many behavior recognizers are currently existed, but much the effect is unsatisfactory, mainly due to the following aspects: It is relatively convenient that the Spatial information processing of normal picture gets up, and in video other than comprising spatial information, when further including Sequence information, this part are difficult to handle, and the related information between multiframe is difficult to hold；Since video file is often bigger, so The hardware requirement for handling video is often relatively high, so there is hardware limitation；Many information is not valuable in video Value, concern is not needed, so being highly desirable to carry out the extraction of key point and key frame, but this is one very difficult again It solves the problems, such as.

Summary of the invention

The purpose of the present invention is to overcome the above shortcomings and to provide a kind of Human bodys' response sides based on space-time attention Method, it is intended in solve the problems, such as video identification timing information processing and concern video in key message.

In order to achieve the above object, the present invention the following steps are included:

The video of input is split into picture frame, and uniformly extracts required amount of picture by step 1；

Step 2 carries out feature extraction to the picture of extraction using the convolutional neural networks of completion, to obtain corresponding Feature vector；

Step 3 calculates the corresponding space of every picture to perceptron using preceding using feature vector is extracted as input Attention weight；

Step 4, use space attention weight are weighted picture feature vector to obtain weighted feature vector；

Weighted feature vector is input in shot and long term memory network by step 5, before in shot and long term memory network To propagation, the class probability vector of output is calculated；

It is calculated using the feature vector of each picture and the output of corresponding shot and long term memory network hidden layer corresponding Spatial attention weight；

Step 6, use space attention weight are weighted summation to the class probability vector of every picture, obtain one A class probability vector；

Step 7 is trained model using several marked video datas；Backpropagation is used in training process, When losing larger, model parameter is constantly updated, lesser value is converged to until losing, saves as model；

Take the corresponding classification of maximum value in class probability vector as final classification and output, as model parameter；

Step 8 combines the model of preservation and model parameter, constitutes Human bodys' response model.

In step 2, convolutional neural networks are used for using the VGG19 convolutional neural networks that training is completed on ImageNet The network for carrying out picture classification, using picture as the input of network, and the feature vector for taking it not connect entirely also.

In step 3, the calculation formula of spatial attention weight are as follows:

Wherein e_tFor results of intermediate calculations, l_t,iFor the spatial attention weighted value of t-th of picture ith zone,WithIt for weight parameter, is obtained in training, X_tFor the corresponding feature vector of t picture, h_t-1It is corresponding for t-1 picture The output of hidden layer, K²For the number of regions that each picture is divided into, b is biasing.

In step 5, shot and long term memory network using two layers shot and long term memory network as master network, its calculation formula is:

Wherein Y_tFor the input of t-th of time step short-term memory network, x_t,iFor t picture character pair vector The feature of ith zone.

In step 5, the input needs of shot and long term memory network are weighted to obtain single class by time attention weight Other probability vector, its calculation formula is:

Wherein o is the categorization vector of output, and tanh is activation primitive；

Choose output result of the corresponding classification of maximum probability value as prediction in class probability vector；

The calculation formula of time attention weight are as follows:

β_t=ReLU (W_out(W_XX_t+W_hh_t-1+b))

Wherein β t_ForThe spatial attention weight of t picture, ReLU are linear activation primitive, W_out、W_XAnd W_hIt is weight Parameter obtains in training, X_tFor the corresponding feature vector of t picture, h_t-1For the corresponding hidden layer of t-1 picture Output, b are biasing.

Shot and long term memory network initializes hidden layer, its calculation formula is:

Wherein f_init,hTo be preceding to perceptron.

The cell state c of first time step hidden layer input of shot and long term memory network₀Initialization calculation formula are as follows:

Wherein f_init,cTo be preceding to perceptron.

In step 7, loss function is used in training process, and the adjustment of parameter is carried out when losing backpropagation, damage Lose function calculation formula are as follows:

Wherein C is the sum of classification, y_iFor true tag,For the probability for belonging to the i-th class, T is input total picture Number, λ₁Spatial attention penalty coefficient, λ₂Time attention penalty coefficient.

Compared with prior art, the present invention extracts picture feature using convolutional neural networks, to obtain feature vector work It is more more advantageous than directly inputting picture for the input of shot and long term memory network；Preferably retained using shot and long term memory network With the timing information handled in video；Using space-time attention mechanism, model is paid close attention to spatially important Point and time upper important sequence, to improve the efficiency and accuracy rate of identification；After the video pre-filtering stage reduces Continuous calculation amount alleviates the calculating pressure of hardware.

Detailed description of the invention

Fig. 1 is flow chart of the invention；

Fig. 2 is model structure of the invention.

Specific embodiment

The present invention will be further described with reference to the accompanying drawing.

Referring to Fig. 1, the present invention the following steps are included:

Step 101, camera is used to obtain video data or direct uploaded videos data as video input.

Step 102, to being originally inputted video data pre-processes, and video is split framing, in order to reduce subsequent meter Calculation amount equably extracts wherein certain amount ground picture, and keep originally temporal order arrange these pictures.

Step 103, feature is carried out to every picture using with training completion on ImageNet convolutional neural networks VGG19 It extracts, obtains corresponding feature vector；For convenience subsequent calculating, by feature vector from bivector be stretched as it is one-dimensional to Amount, the feature vector of t picture are X_t={ x_t,1,x_t,2,…,x_t,i,…}。

Step 104, due to the importance of all parts in each picture be not it is identical, some parts are important, to identification It is helpful, and some parts are useless, so introducing spatial attention weight, that is, indicate the important journey of each part of picture Degree, the size of numerical value represent the height of importance,

The calculation formula of spatial attention weight are as follows:

Wherein e_tFor results of intermediate calculations, l_t,iFor the spatial attention weighted value of t-th of picture ith zone,WithIt for weight parameter, is obtained in training, X_tFor the corresponding feature vector of t picture, h_t-1It is corresponding for t-1 picture The output of hidden layer, K²For the number of regions that each picture is divided into, b is biasing；

Step 105, in the spatial attention that every picture is calculated, spatial attention weighting is carried out immediately, to picture Corresponding feature vector is weighted, the formula of weighting are as follows:

Step 106, later feature vector will be weighted and inputs shot and long term memory network, before shot and long term memory network To propagation, the corresponding hidden layer of every picture exports h_t, there are two the effect of aspect, one is as output for the output of hidden layer To next step, the other is calculating spatial attention and time attention weight.

Step 107, since the importance of frame different in video is different, some frames are important, some frames are then not So important, so needing to distinguish these importance for not having to frame, this introduces time attention mechanism, the numerical value of weight is big The small height for representing importance, the wherein calculation formula of time attention are as follows:

β_t=ReLU (W_out(W_XX_t+W_hh_t-1+b))

Wherein β_tFor the spatial attention weight of t picture, ReLU is linear activation primitive, W_out、W_XAnd W_hIt is power Weight parameter, obtains, X in training_tFor the corresponding feature vector of t picture, h_t-1For the corresponding hidden layer of t-1 picture Output, b be biasing

Step 108, it is calculated after time attention weight, needs to add the corresponding categorization vector of each picture Power, and sum and obtain a categorization vector, and input softmax function and obtain final class probability vector, calculation formula Are as follows:

Wherein o is the categorization vector of output, and tanh is activation primitive；For the probability for belonging to i-th of classification, C is classification Sum.

Step 109, after obtaining class probability vector, probability is takenMaximum corresponding classification is as final classification and defeated Result out.

In above-mentioned steps, need to initialize shot and long term memory network hidden layer, its calculation formula is:

Wherein f_init,hTo be preceding to perceptron；

The cellular layer to shot and long term memory network is also needed to initialize, its calculation formula is:

Wherein f_init,cTo be preceding to perceptron.

In addition, the foundation of model is a large amount of marked firstly the need of using as scheming most of deep learning algorithms Video data is trained, and uses backpropagation among these, can adjust model parameter according to loss in this backpropagation, therefore It needs to construct loss function, its calculation formula is:

Referring to fig. 2, which depict the specific structures of model of the present invention, including following part:

Step 201, it is the video data of input, and video is split into framing, uniformly extracts a part of frame therein.

It step 202, is the trained VGG19 network on ImageNet, for carrying out the feature extraction of picture.

Step 203, it is spatial attention weighted portion, carries out spatial attention weighting for the feature vector to picture.

Step 204, it is shot and long term memory network (LSTM), is the master network of model.

Step 205, it is time attention weighted portion, is weighted summation for the output to shot and long term memory network.

Step 206, it is softmax function, the output of front is input to softmax function and obtains class probability vector, And the corresponding classification of value for choosing maximum probability is as final classification.

Claims

1. the Human bodys' response method based on space-time attention, which comprises the following steps:

Step 3 calculates the corresponding space transforms of every picture to perceptron using preceding using feature vector is extracted as input Power weight；

Weighted feature vector is input in shot and long term memory network by step 5, is passed by the forward direction in shot and long term memory network It broadcasts, the class probability vector of output is calculated；

Corresponding space is calculated using the output of the feature vector and corresponding shot and long term memory network hidden layer of each picture Attention weight；

Step 6, use space attention weight are weighted summation to the class probability vector of every picture, obtain a class Other probability vector；

Step 7 is trained model using several marked video datas；Backpropagation is used in training process, works as damage When losing larger, model parameter is constantly updated, lesser value is converged to until losing, saves as model；

2. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that step 2 In, convolutional neural networks are using the VGG19 convolutional neural networks that training is completed on ImageNet, for carrying out picture classification Network, using picture as the input of network, and the feature vector for taking it not connect entirely also.

3. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that step 3 In, the calculation formula of spatial attention weight are as follows:

Wherein e_tFor results of intermediate calculations, l_t,iFor the spatial attention weighted value of t-th of picture ith zone,WithFor Weight parameter obtains in training, X_tFor the corresponding feature vector of t picture, h_t-1It is hidden for t-1 picture is corresponding The output of layer, K²For the number of regions that each picture is divided into, b is biasing.

4. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that step 5 In, shot and long term memory network using two layers shot and long term memory network as master network, its calculation formula is:

Wherein Y_tFor the input of t-th of time step short-term memory network, x_t,iIt is i-th of t picture character pair vector The feature in region.

5. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that step 5 In, the input of shot and long term memory network is weighted to obtain single class probability vector by time attention weight, meter Calculate formula are as follows:

The calculation formula of time attention weight are as follows:

β_t=ReLU (W_out(W_XX_t+W_hh_t-1+b))

Wherein β_tFor the spatial attention weight of t picture, ReLU is linear activation primitive, W_out、W_XAnd W_hIt is weight ginseng Number, obtains, X in training_tFor the corresponding feature vector of t picture, h_t-1For the defeated of the corresponding hidden layer of t-1 picture Out, b is biasing.

6. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that shot and long term note Recall network to initialize hidden layer, its calculation formula is:

Wherein f_init,hTo be preceding to perceptron.

7. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that shot and long term note Recall the cell state c of first time step hidden layer input of network₀Initialization calculation formula are as follows:

Wherein f_init,cTo be preceding to perceptron.

8. the Human bodys' response method according to claim 1 based on space-time attention, which is characterized in that step 7 In, loss function is used in training process, and the adjustment of parameter, loss function calculation formula are carried out when losing backpropagation Are as follows:

Wherein C is the sum of classification, y_iFor true tag,For the probability for belonging to the i-th class, T is the total picture number of input, λ₁ Spatial attention penalty coefficient, λ₂Time attention penalty coefficient.