CN108830157A - Human behavior recognition method based on attention mechanism and 3D convolutional neural network - Google Patents

Human behavior recognition method based on attention mechanism and 3D convolutional neural network

Info

Publication number
CN108830157A
Authority
CN
China
Prior art keywords
frame
convolutional neural
layer
neural networks
layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810463529.5A
Other languages
Chinese (zh)
Other versions
CN108830157B (en)
Inventor
袁和金
牛为华
张颖
崔克彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Power Investment Northeast Energy Technology Co ltd
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN201810463529.5A priority Critical patent/CN108830157B/en
Publication of CN108830157A publication Critical patent/CN108830157A/en
Application granted granted Critical
Publication of CN108830157B publication Critical patent/CN108830157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The invention discloses a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network. The method constructs a 3D convolutional neural network whose input layer comprises two channels: the original grayscale frames and an attention matrix. A 3D CNN model for recognizing human behavior in video is built and an attention mechanism is introduced: the distance between two frames is calculated as an attention matrix, which together with the original human behavior video sequence forms a dual-channel input to the constructed 3D CNN, so that the convolution operation performs feature extraction with emphasis on visually salient regions. At the same time, the 3D CNN structure is optimized: Dropout layers are added to the network to randomly freeze part of the connection weights, and the ReLU activation function is used. These measures improve network sparsity, address the surge in computation and the vanishing of gradients that come with increased dimensionality, and prevent the overfitting on small datasets caused by deeper layers, reducing the time cost while improving the network's recognition accuracy.

Description

Human behavior recognition method based on attention mechanism and 3D convolutional neural network
Technical field
The present invention relates to human behavior recognition methods, and in particular to a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network.
Background technique
Intelligent video analysis has long been a research field of important academic value, and human behavior recognition, as an essential part of this field, has become a new research hotspot with broad application prospects in intelligent video surveillance, advanced human-computer interaction, sports analysis, content-based video retrieval, and other areas. Most mainstream human behavior recognition methods currently use hand-crafted features to characterize human motion in video, such as silhouettes, contours, HOG, Harris, SIFT, and the extensions of these features to three dimensions. Hand-crafted features are a way of applying human wisdom and prior knowledge to target and behavior recognition technology. However, this approach requires manually mining features that can express the motion, and the manually selected features often fail to capture the essential characteristics of the motion, which strongly affects recognition results.
Therefore, how to improve the accuracy of human behavior recognition in video by making better use of the raw information in the video is a direction that those skilled in the art strive to research.
Summary of the invention
In view of this, a primary objective of the present invention is to improve the accuracy of human behavior recognition in video. Considering that a video is a sequence of inter-related images along the time dimension, it can be processed by a convolutional neural network: the original video can be fed directly into the constructed neural network for the training and recognition of human behavior. One objective of the present invention is therefore to propose a 3D convolutional neural network model based on an attention mechanism that makes better use of the raw information in video.
To achieve the above objective, the present invention provides a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network, characterized in that the method constructs a 3D convolutional neural network whose input layer comprises two channels: the original grayscale frames and an attention matrix.
Preferably, the attention matrix is obtained by calculating the difference between two consecutive frames and then normalizing it.
Preferably, the attention matrix is calculated either with a two-frame difference method that computes the difference between two consecutive frames, or with a three-frame difference method that takes three adjacent frame images as a group and differences them again. The traditional three-frame difference method differences the current frame with the preceding and following frames and then differences the two results again; the present invention further improves this by taking the "union" of the two difference results, obtained as the per-pixel maximum of the current frame's differences with the preceding and following frames. This union indicates the regions of greatest change before and after the current frame.
The three-frame difference method computes the difference images of the current frame with the previous frame and of the current frame with the next frame respectively, and then continues by differencing the two resulting frame differences again.
Preferably, in the two-frame difference method, the attention matrix A is calculated by the following formulas:

A(x, y) = (I_D(x, y) − min) / (max − min) (1)

I_D(x, y) = D(x, y), if D(x, y) > T; otherwise 0 (2)

D(x, y) = |I_t(x, y) − I_{t−1}(x, y)| (3)

Wherein, x, y are the coordinates of the target pixel, t is the current frame number, t−1 denotes the previous frame, and I_t is the gray value of the current frame at location (x, y). Formula (3) calculates the distance between two adjacent frames; the threshold T in formula (2) rejects the non-salient change regions, yielding the salient change region I_D; and formula (1) normalizes the distances, finally giving the attention matrix A, where min and max are the minimum and maximum gray values over all pixels of the salient change region I_D. This three-dimensional matrix indicates the salient motion-change regions in the input human behavior video.
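As a concrete illustration, the two-frame difference construction described above can be sketched in NumPy; the threshold value T=15 and the toy frame contents are illustrative assumptions, not values from the patent:

```python
import numpy as np

def attention_matrix(prev_frame, cur_frame, T=15):
    """Two-frame difference attention matrix (sketch).

    Follows the description in the text: the absolute inter-frame
    distance is computed (formula (3)), pixels below a threshold T are
    zeroed out as non-salient regions (formula (2)), and the result is
    min-max normalised (formula (1)). T=15 is an assumed value.
    """
    # Distance between adjacent frames
    D = np.abs(cur_frame.astype(np.float64) - prev_frame.astype(np.float64))
    # Reject non-salient change regions with threshold T
    ID = np.where(D > T, D, 0.0)
    # Min-max normalisation over the salient change region
    lo, hi = ID.min(), ID.max()
    return (ID - lo) / (hi - lo) if hi > lo else np.zeros_like(ID)

# Toy example: a bright patch appears between two otherwise identical frames
prev = np.zeros((120, 160), dtype=np.uint8)
cur = prev.copy()
cur[40:60, 50:70] = 200          # the region that changed between frames
A = attention_matrix(prev, cur)
print(A[50, 60], A[0, 0])        # changed region gets weight 1.0, rest 0.0
```

The resulting matrix is zero everywhere except the changed patch, so convolution over this channel concentrates on the motion region.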
Preferably, the steps of the three-frame difference method are:
1) Select three consecutive frames I_{t−1}(x, y), I_t(x, y), I_{t+1}(x, y) from the video frame sequence and calculate the differences of the adjacent frame pairs, D_{t−1,t}(x, y) and D_{t,t+1}(x, y):

D_{t−1,t}(x, y) = |I_t(x, y) − I_{t−1}(x, y)|
D_{t,t+1}(x, y) = |I_{t+1}(x, y) − I_t(x, y)|

2) Apply a suitable threshold T to the resulting difference images to exclude noise interference and extract the salient change regions B_1(x, y) and B_2(x, y);
3) Take the per-pixel "union" (logical OR) of the two difference images in each group to obtain the union of the change regions between consecutive frames, i.e., the salient change regions before and after the intermediate frame of the three, B(x, y):

B(x, y) = max(B_1(x, y), B_2(x, y))

4) Normalize the finally obtained difference image to obtain the frame-difference channel A(x, y), which indicates the salient motion-change regions in the input human behavior video.
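The four steps above can be sketched as follows; the per-pixel maximum implements the "union" of the two difference images, and the threshold T=15 is an illustrative assumption:

```python
import numpy as np

def three_frame_attention(f_prev, f_cur, f_next, T=15):
    """Three-frame difference channel (sketch of steps 1-4 above).

    Instead of differencing the two difference images again (the
    traditional method), the per-pixel maximum ("union") of the two
    difference images is taken, so the channel captures changes both
    before and after the current frame. T=15 is an assumed threshold.
    """
    f_prev, f_cur, f_next = (f.astype(np.float64) for f in (f_prev, f_cur, f_next))
    # Step 1: differences of adjacent frame pairs
    D1 = np.abs(f_cur - f_prev)
    D2 = np.abs(f_next - f_cur)
    # Step 2: threshold to suppress noise, keeping salient change regions
    B1 = np.where(D1 > T, D1, 0.0)
    B2 = np.where(D2 > T, D2, 0.0)
    # Step 3: per-pixel union (maximum) of the two difference images
    B = np.maximum(B1, B2)
    # Step 4: min-max normalisation -> frame-difference channel A
    lo, hi = B.min(), B.max()
    return (B - lo) / (hi - lo) if hi > lo else np.zeros_like(B)

# A patch appears in the middle frame and vanishes in the next
f0 = np.zeros((120, 160))
f1 = f0.copy(); f1[10:20, 10:20] = 180
f2 = f0.copy()
A = three_frame_attention(f0, f1, f2)
print(A[15, 15])   # region that changed both before and after the middle frame
```

Because the union is taken rather than a second difference, a change that appears on only one side of the middle frame is still retained in the channel.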
Preferably, the 3D convolutional neural network model comprises:
one dual-channel input layer, and multiple 3D convolutional layers and multiple 3D pooling layers arranged alternately, finally followed by fully connected layers that yield the classification result; through the dual-channel input layer, the attention matrix is fed into the neural network model together with the original grayscale video-frame cube.
Preferably, there are two fully connected layers, each preceded by a Dropout layer, with the Dropout probability set to a value between 0.25 and 0.5.
Preferably, the Dropout probabilities are 0.5 and 0.25 respectively.
Preferably, the numbers of 3D convolutional layers and 3D pooling layers are each between 3 and 7, and preferably 5 each.
Preferably, the 3D convolutional neural network model comprises: one dual-channel input layer, and five 3D convolutional layers and five 3D pooling layers arranged alternately, finally followed by two fully connected layers that yield the classification result, with a Dropout operation applied before each of the two fully connected layers,
Wherein:
C1 to C5 are convolutional layers, each with 3 × 3 × 3 convolution kernels; the number of kernels increases from 16 to 256 so as to generate more varied high-level features from combinations of low-level features. At layer C1, the kernels perform dual-channel convolution over the attention matrix and the original video frames.
Layers S1 to S5 are down-sampling layers using max pooling; they reduce the resolution of the feature maps, shrink the feature-map scale, reduce computation, and improve tolerance to distortions of the input image. Layers S2 and S4 use 2 × 2 × 2 windows to down-sample the temporal and spatial dimensions simultaneously, while the other layers use 1 × 2 × 2 windows and down-sample only the spatial dimensions;
Layer D1 is a fully connected layer containing 256 neurons; the feature cube output by S5 is connected to the 256 neurons of D1, and in this layer the 15-frame input video is converted into a 256-dimensional feature vector. A Dropout layer is used between S5 and D1, freezing part of the connections between S5 and D1 with probability 0.25;
Layer D2 is the second fully connected layer and also the output layer; its neuron count of 6 equals the number of target categories. Each D2 neuron is fully connected to the 256 neurons of D1, and finally classification is performed by softmax regression, giving an output that can label the behavior category.
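Under the assumption that the 3 × 3 × 3 convolutions use "same" padding (so only the pooling layers change the feature-map size) and that pooling uses floor division, the feature-map sizes through the S1 to S5 stack can be traced as follows:

```python
# Feature-map shape trace through the C1-S5 stack described above.
# Assumptions: "same" convolution padding and floor division in pooling;
# the 15 x 120 x 160 input size and the pooling windows follow the text.
def pooled(shape, window):
    """Apply one max-pooling window (t, h, w) with floor division."""
    return tuple(s // w for s, w in zip(shape, window))

shape = (15, 120, 160)                 # 15 gray frames of 120 x 160
pools = {"S1": (1, 2, 2), "S2": (2, 2, 2), "S3": (1, 2, 2),
         "S4": (2, 2, 2), "S5": (1, 2, 2)}
for name, window in pools.items():
    shape = pooled(shape, window)      # 3x3x3 "same" conv leaves size unchanged
    print(name, shape)
# The S5 output cube per channel then feeds, across the 256 channels,
# into the 256-neuron fully connected layer D1.
```

Only S2 and S4 shrink the temporal dimension, consistent with the 2 × 2 × 2 windows described for those layers.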
Preferably, the 3D convolutional neural network uses ReLU as the activation function to improve the performance of the deep network. Preferably, the 3D convolutional layers and the fully connected layer D1 therein use ReLU as the activation function, the output layer uses Softmax as the activation function, the optimization function is SGD, and the loss function is the multi-class cross entropy.
Wherein, the log-likelihood cost function is:
C = −∑_k y_k log a_k (5)
Wherein, a_k denotes the output value of the k-th neuron and y_k the corresponding true value for the k-th neuron, taking the value 0 or 1. The gradients of the cost with respect to a neural network weight w and bias b are:

∂C/∂w_{jk}^L = a_k^{L−1} (a_j^L − y_j)
∂C/∂b_j^L = a_j^L − y_j

Wherein j is the index of a neuron in the current layer, k is the index of the connected neuron in the previous layer, and L denotes the layer number of the current neuron. Like the cross-entropy cost function, the log-likelihood function is non-negative, so the goal is to minimize the cost function; when the actual output a is close to the desired output y, the cost function approaches 0. Using the cross-entropy function overcomes the problem of the quadratic cost function updating weights too slowly. The Softmax function paired with the log-likelihood cost function trains neural networks well under multi-class tasks.
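A small numerical sketch of the softmax/log-likelihood pairing described above; the logit values are arbitrary, and the gradient with respect to the pre-activations reduces to a − y, which is the property that avoids slow weight updates:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

# Log-likelihood cost C = -sum_k y_k log a_k for a one-hot target y.
# With a softmax output layer, dC/dz simplifies to (a - y), so the
# gradient stays large even when the output is badly wrong.
z = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0])   # 6 outputs, one per class
y = np.zeros(6); y[0] = 1.0                     # true class: index 0
a = softmax(z)
C = -np.sum(y * np.log(a))
grad_z = a - y                                  # dC/dz for softmax + log-likelihood
print(C > 0, grad_z[0] < 0)                     # gradient pushes a_0 toward 1
```

The test below checks the analytic gradient against a central finite difference.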
Through the above method, the present invention mainly has the following advantages:
The present invention uses an attention channel based on visual perception to assist the neural network in extracting features from the original video frames; the network performs the convolution operation on both channels simultaneously, and the features of the two channels interact. This improves the accuracy of human behavior recognition in video, makes better use of the raw information in the video, and yields the proposed attention-based 3D convolutional neural network model.
The model constructs a deep three-dimensional convolutional neural network and introduces an attention mechanism: an attention matrix describing the regions of human motion is obtained by calculating inter-frame distances, the attention matrix and the original video are combined into a dual-channel input to the convolutional neural network, and convolution operations with 3D kernels extract the temporal and spatial features of human behavior in the video.
To overcome overfitting during network training, Dropout layers are added to the network structure, randomly "freezing" a certain proportion of neurons during training; this improves network sparsity and alleviates network overfitting to some extent.
The ReLU activation function is used, which improves network sparsity and addresses the surge in computation and the vanishing of gradients caused by increased dimensionality and deeper layers; it prevents overfitting on small datasets and reduces the time cost while improving the network's recognition accuracy.
Experimental results on the KTH dataset show that the model achieves good recognition performance.
Brief description of the drawings
Fig. 1: framework of the 3D convolutional neural network based on the attention mechanism, in a specific embodiment of the present invention;
Fig. 2: plots of commonly used activation functions (Sigmoid, tanh, and ReLU);
Fig. 3: visual representation of Dropout;
Fig. 4 (a)(b): examples of visual attention;
Fig. 5: schematic of the attention mechanism in the convolutional neural network;
Fig. 6 (a)(b): comparative experiment curves for the attention mechanism: (a) recognition accuracy curve, (b) error curve;
Fig. 7: flow diagram of the three-frame difference in specific embodiment 2 of the present invention;
Fig. 8: structure of the dual-channel 3D CNN based on the attention mechanism of the present invention;
Fig. 9: flow chart of the experiments in a specific embodiment of the present invention.
Specific embodiment
The technical solution of the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
The present invention provides a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network; its core is to apply a visual attention mechanism to a 3D convolutional neural network model for human action recognition. The method constructs a 3D CNN model for recognizing human behavior in video and introduces an attention mechanism: the distance between two frames is calculated as an attention matrix, which together with the original human behavior video sequence forms a dual-channel input to the constructed 3D CNN, so that the convolution operation performs feature extraction with emphasis on visually salient regions. At the same time, the 3D CNN structure is optimized: Dropout layers are added to the network to randomly freeze part of the connection weights, and the ReLU activation function is used, improving network sparsity and addressing the surge in computation and the vanishing of gradients caused by deeper layers; this prevents overfitting on small datasets and reduces the time cost while improving the network's recognition accuracy.
The present invention introduces the attention mechanism because a key property of human visual perception is that the whole scene is not processed at once; attention is instead focused on a certain part of the visual space, scanning across the image in a certain order and transferring from one region to another. Information about a local region is obtained at a given moment, and the information of these regions is then combined into an overall judgment and impression.
Attention mechanisms are also widely used in RNNs to describe the correlation between elements within or between sequences; different correlations assign different weights in the attention matrix, letting the network pay more attention to heavily weighted elements.
Referring to Fig. 4: Fig. 4(a) shows that in a static state, human vision always focuses on the parts that differ most from the surrounding environment, a principle by which the salient regions of a still image can be extracted. Fig. 4(b) shows that in a moving state, human vision focuses on the parts of the visual field that change and ignores the static, unchanging parts; the changing parts play the more important role in judging the current motion category.
That is, in a video containing human behavior, the human target, as the main part distinct from the background, is the salient region. Unlike image-based target recognition, however, during continuous motion we are more concerned with the parts of the motion that change, as shown in Fig. 4(b). Human behavior in video is a dynamic process; because of the continuity and variability of human actions, the differing parts of human behavior between consecutive video frames serve as salient regions and better guide behavior recognition. For human behavior recognition, we care about the human regions that change during motion, and we want the trained network to pay more attention to these significantly changing parts.
The present invention therefore introduces an attention matrix in the input layer of the constructed 3D convolutional neural network; by calculating the distance between consecutive frames (constructing an inter-frame difference channel), the visual attention mechanism is applied to the 3D convolutional neural network model for human action recognition.
As shown in Fig. 1, in specific embodiment 1 of the present invention, the 3D convolutional neural network framework used is as follows:
1) network structure
The 3D convolutional neural network model of the present invention is shown in Fig. 1, a schematic diagram of the attention-based 3D convolutional neural network framework of the present invention. It comprises one dual-channel input layer, and five 3D convolutional layers and five 3D pooling layers arranged alternately, finally followed by two fully connected layers that yield the classification result, with a Dropout layer before each of the two fully connected layers.
The first layer is the input layer, which comprises two channels: the original grayscale image and the attention matrix. The grayscale data consist of the gray images of 15 consecutive adjacent video frames, and in this embodiment the attention matrix is the normalized inter-frame distance matrix calculated by formula (1).
As shown in Fig. 5, the present invention introduces an attention matrix in the input layer of the constructed 3D convolutional neural network; by calculating the distance between two consecutive frames, or constructing an inter-frame difference channel, the visual attention mechanism is applied to the 3D convolutional neural network model for human action recognition. In this embodiment, the input of the constructed neural network model is expanded to two channels, and the attention matrix is input into the neural network model as a second channel together with the original grayscale video-frame cube.
The attention matrix of this embodiment is obtained by calculating the distance between two consecutive frames and normalizing it; the distance between consecutive frames describes the variation of the human action during motion. The attention matrix describes the regions of the entire frame cube that deserve attention.
Wherein, the attention matrix A can be calculated by formulas (1) to (3):
Formula (3) calculates the distance between two adjacent frames, the threshold T in formula (2) rejects the non-salient change regions, and formula (1) normalizes the distances, finally yielding the attention matrix, which indicates the salient motion-change regions in the input human behavior video.
In this embodiment, C1 to C5 are convolutional layers, each with 3 × 3 × 3 convolution kernels; the number of kernels increases from 16 to 256 so as to generate more varied high-level features from combinations of low-level features. At layer C1, the kernels perform dual-channel convolution over the attention matrix and the original video frames.
Layers S1 to S5 are down-sampling layers; this embodiment uses max pooling to reduce the resolution of the feature maps, shrink the feature-map scale, reduce computation, and improve tolerance to distortions of the input image. Preferably, layers S2 and S4 use 2 × 2 × 2 windows to down-sample the temporal and spatial dimensions simultaneously, while the other layers use 1 × 2 × 2 windows and down-sample only the spatial dimensions.
Layer D1 is a fully connected layer (FC) containing 256 neurons; the feature cube output by S5 is connected to the 256 neurons of D1, and in this layer the 15-frame input video is converted into a 256-dimensional feature vector. A Dropout layer is used between S5 and D1; in this embodiment, part of the connections between S5 and D1 are frozen with probability 0.25.
Layer D2 is the second fully connected layer and also the output layer (Output); its neuron count of 6 equals the number of target categories. Each D2 neuron is fully connected to the 256 neurons of D1, and finally classification is performed by softmax regression, giving an output that can label the behavior category.
2) Activation function
The 3D convolutional neural network of the present invention uses ReLU (Rectified Linear Units) as the activation function. Fig. 2 shows the plots of the Sigmoid, tanh, and ReLU functions. Although the Sigmoid-family functions common in traditional neural networks have large signal gain in the central region and small gain toward the sides, which works well for mapping in the signal's feature space, as the network deepens these activation functions become expensive to compute, and near the saturation regions they change slowly and their derivatives tend to 0, so gradients easily vanish when error gradients are computed in backpropagation, making it impossible to complete the training of deep networks. Therefore, as the network gradually deepens, the simple and fast linear ReLU activation better improves the performance of the deep network.
The ReLU formula is ReLU(x) = max(0, x). It is essentially a piecewise-linear model, and both the forward computation and the backward gradient propagation are simple. Because ReLU has no saturation region, the vanishing-gradient problem is unlikely to occur. Since ReLU is closed to the left of the y-axis, the outputs of some hidden neurons are 0, i.e. the network becomes sparse; the activation paths generated by different structures can learn relatively sparse features, learning better from the training data and alleviating overfitting to some extent. Since the 3D convolutional neural network constructed by the present invention is deep, the training samples are large in scale, and training takes a long time, accelerating training and suppressing gradient vanishing are very important for the practicality of convolutional neural networks.
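The contrast between ReLU and a saturating activation can be illustrated numerically; the sample points are arbitrary:

```python
import numpy as np

# ReLU(x) = max(0, x): piecewise linear, gradient 1 for x > 0, so it has
# no saturation region; the sigmoid derivative decays toward 0 on both
# sides, which is the source of the vanishing-gradient problem above.
def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.5, 10.0])
relu_grad = (x > 0).astype(float)            # derivative of ReLU
sig_grad = sigmoid(x) * (1 - sigmoid(x))     # derivative of sigmoid
print(relu_grad)          # zeros for x <= 0 -> sparse activations
print(sig_grad.round(5))  # tiny at |x| = 10 -> gradient vanishes
```

The zeros in the ReLU gradient for negative inputs are exactly the sparsity the text describes: those neurons output 0 and drop out of the forward path.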
3) Dropout
Referring to the visual representation of Dropout in Fig. 3: the 3D convolutional neural network of the present invention is a deep neural network, and deep neural networks have complex structures; as the layers deepen and the iteration count grows, the network begins to fit the noise contained in the training samples as if it were features. Dropout can alleviate this problem to some extent. As shown in Fig. 3, Dropout temporarily "freezes" neuron connections in the network with a certain probability, so that they do not participate in that round of forward propagation and backward error calculation.
The Dropout process is equivalent to extracting a structurally simpler network from the original network. The neuron connections that are "frozen" differ randomly in each training round; this forces a single neuron to work together with other randomly selected neurons instead of, after many iterations of training, over-relying on the effect of certain specific neurons. This solves the overfitting caused by the output over-relying on certain intermediate-layer neurons.
Therefore, in a specific embodiment of the present invention, two Dropout layers are added to the 3D convolutional neural network model, between pooling layer S5 and fully connected layer D1 and between D1 and D2, with Dropout probabilities of 0.5 and 0.25 respectively.
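A minimal sketch of the Dropout operation described above, using the inverted-dropout convention (rescaling survivors by 1/(1−p), an implementation assumption, so the expected activation is unchanged) and the 0.25 probability used between S5 and D1:

```python
import numpy as np

def dropout(activations, p, rng):
    """Inverted dropout (sketch): "freeze" each neuron with probability p
    during training and rescale the survivors by 1/(1-p), so no rescaling
    is needed at test time."""
    mask = (rng.random(activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10000)                      # stand-in for the S5 -> D1 activations
dropped = dropout(h, p=0.25, rng=rng)   # the 0.25 rate used between S5 and D1
print((dropped == 0).mean())            # roughly 25% of connections frozen
print(dropped.mean())                   # expected activation preserved (~1.0)
```

Each call draws a fresh random mask, matching the text: the set of frozen neurons changes every training round.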
The inventors tested the algorithm of the invention on the KTH human behavior dataset. The KTH database contains 6 classes of actions (walking, jogging, running, boxing, hand waving and hand clapping) performed by 25 people in 4 different scenes, 600 videos in total; the same behavior is repeated 3 to 4 times in each video, so 2391 video samples can be extracted in total, containing scale variations, clothing variations, and illumination variations. In this embodiment, 16 of the 25 people in the dataset were chosen as training samples and 9 as test samples.
The experimental procedure comprises:
1) The human behavior videos in the dataset are first converted to grayscale;
2) 15-frame original video clips, of size 15 × 120 × 160, are extracted as input to the 3D convolutional network constructed by the present invention;
3) The attention matrix is calculated from each extracted 15-frame human behavior video sample according to formula (1);
4) The attention matrix, as the second channel of the 3D convolutional network, is used together with the grayscale video data as input for training.
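The experimental input pipeline above can be sketched as follows; the random frames stand in for real KTH video, and padding the difference sequence back to 15 frames is an assumption made only so the two channels align:

```python
import numpy as np

# Sketch of steps 1-4 above: stack the 15-frame gray cube with its
# attention matrix as a second channel. The attention channel here is a
# simple normalised two-frame difference; shapes follow the text
# (15 frames of 120 x 160).
rng = np.random.default_rng(1)
frames = rng.integers(0, 256, size=(15, 120, 160)).astype(np.float64)

diff = np.abs(np.diff(frames, axis=0))            # 14 inter-frame distances
diff = np.concatenate([diff[:1], diff], axis=0)   # pad to 15 frames (assumption)
attention = (diff - diff.min()) / (diff.max() - diff.min())

gray = frames / 255.0
sample = np.stack([gray, attention], axis=0)      # (channels, frames, H, W)
print(sample.shape)                                # one dual-channel training sample
```

The resulting array is exactly the dual-channel cube that the C1 kernels convolve over both channels at once.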
In the 3D CNN structure constructed in embodiment 1 of the present invention, five 3D convolutional layers and down-sampling layers are interleaved, followed by two fully connected layers that produce the output. The 3D convolutional layers C1 to C5 and fully connected layer D1 use ReLU as the activation function, output layer D2 uses Softmax as the activation function, the optimization function is SGD (stochastic gradient descent), and the loss function is the multi-class cross entropy. During training, one gradient step is computed every 10 samples, for 50 epochs in total.
Table 1 gives the recognition accuracy of some common human behavior recognition methods on the KTH dataset. The recognition accuracy of the 3D convolutional neural network model constructed by the present invention is 91.67%, and with the attention mechanism added the network's accuracy reaches 92.59%, higher than the 3D CNN model built by Ji [3]. It can also be seen that models using hand-crafted features such as HOG, optical flow, and SIFT achieve relatively high accuracy; the reason is that such methods usually require sufficient pre-processing of the video before feature extraction, yet it is difficult to extract features accurate enough to describe complex behaviors in videos under complex environments. The method of the present invention, by contrast, does not rely on various hand-crafted features; using the powerful self-learning capability of deep neural networks, it learns human behavior features by itself from a large number of training samples. As the layers deepen, the learned features become more abstract and better describe different human behaviors in essence. With the attention mechanism added, the network, under the action of the attention matrix, focuses on the changing parts of human behavior, ignores irrelevant background, and attains superior recognition capability.
Table 1. Accuracy of various human behavior recognition algorithms on the KTH dataset
As shown in Fig. 6, a comparative experiment was carried out on the effect of the attention mechanism on the network's recognition ability: the solid line represents the 3D convolutional neural network model constructed by the present invention without the attention mechanism, and the dotted line the model with the attention mechanism added. Fig. 6(a) is the recognition accuracy curve on the test set, and Fig. 6(b) the error curve during training. With the attention mechanism added, the network reaches high accuracy within the first few epochs of training and the error declines quickly, converging early. Under the action of the attention matrix, the network quickly captures human behavior features, whereas the network without the attention mechanism only gradually learns them after dozens of epochs of training. Introducing the attention mechanism into the 3D convolutional network thus improves the accuracy of human behavior recognition.
Another specific embodiment (embodiment 2) of the invention further improves the attention matrix calculation described above. The main difference from the previous embodiment is that the frame-difference channel of this embodiment is computed with a three-frame difference method. The three-frame difference describes the variation between the current frame and both the preceding and the following frame, whereas the two-frame difference used in the previous embodiment only describes the variation between the current frame and a single neighboring frame. Moreover, the three-frame difference method of this embodiment does not proceed as the traditional three-frame difference method does: after computing the difference of the current frame with the previous frame and with the next frame, it does not take a further difference of the two frame differences, but instead takes their union, so that the region changed by a motion within a short time interval can be described completely.
The concrete scheme of embodiment 2 is as follows:
The frame-difference channel of this embodiment is computed with the three-frame difference method; the calculation flow is shown in Fig. 7. By taking three adjacent frame images as one group and differencing them pairwise, the regions that change before and after the intermediate frame can be detected better. The frame difference describes the change of the human action during motion, and the frame-difference matrix indicates the regions of the entire frame cube that deserve attention.
1 Attention matrix of this embodiment (three-frame difference)
1) Select three consecutive frames I_{t-1}(x, y), I_t(x, y), I_{t+1}(x, y) from the video frame sequence and compute the difference of each pair of adjacent frames, D_{t-1,t}(x, y) and D_{t,t+1}(x, y):
D_{t-1,t}(x, y) = |I_t(x, y) − I_{t-1}(x, y)|,  D_{t,t+1}(x, y) = |I_{t+1}(x, y) − I_t(x, y)|  (21)
2) Threshold the resulting difference images with a suitable threshold T to exclude noise interference and extract the salient change regions:
B_1(x, y) = D_{t-1,t}(x, y) if D_{t-1,t}(x, y) > T, otherwise 0; B_2(x, y) is defined likewise from D_{t,t+1}(x, y)  (22)
3) Combine the two difference images of each group with a logical OR to obtain the union of the change regions between the pairs of consecutive frames, which gives the salient change region before and after the intermediate frame of the three images, B(x, y):
B(x, y) = max(B_1(x, y), B_2(x, y))  (23)
4) Normalize the resulting difference image to obtain the frame-difference channel A(x, y):
A(x, y) = (B(x, y) − min) / (max − min)  (24)
where min and max are the minimum and maximum values of B over the salient change region. The channel A can represent the salient motion change regions in the input human behavior video.
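The four steps above can be sketched in NumPy as follows. The function name and the default threshold T = 25 are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def three_frame_attention(prev_f, cur_f, next_f, T=25.0):
    """Three-frame difference attention matrix per formulas (21)-(24).

    prev_f, cur_f, next_f: grayscale frames as 2-D float arrays.
    T: gray-level threshold rejecting non-salient changes (assumed value).
    """
    # (21) absolute differences of the two adjacent frame pairs
    d1 = np.abs(cur_f - prev_f)
    d2 = np.abs(next_f - cur_f)
    # (22) threshold each difference image, keeping only salient changes
    b1 = np.where(d1 > T, d1, 0.0)
    b2 = np.where(d2 > T, d2, 0.0)
    # (23) union of the two change regions: per-pixel maximum
    b = np.maximum(b1, b2)
    # (24) min-max normalization gives the frame-difference channel A
    lo, hi = b.min(), b.max()
    if hi > lo:
        return (b - lo) / (hi - lo)
    return np.zeros_like(b)  # no salient change anywhere
```

In practice the three grayscale frames would come from the OpenCV reading step of the algorithm flow, e.g. via `cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)`.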
2 Network structure
The two-channel 3D convolutional neural network model built in this embodiment is shown in Fig. 8. It likewise comprises 1 two-channel input layer and 5 3D convolutional layers interleaved with 5 3D pooling layers, followed by 2 fully connected layers which yield the classification result; a Dropout operation is applied at each of the 2 fully connected layers.
The first layer is the input layer. The input layer contains two channels: the original grayscale video frames and the inter-frame difference (the attention matrix). The grayscale channel consists of the gray images of 15 consecutive adjacent video frames, and the frame-difference channel is the normalized inter-frame distance matrix computed by formulas (21)-(24). Both the input video images and the frame-difference matrices are resized to 120 × 160 pixels.
C1 to C5 are convolutional layers. Each layer uses 3 × 3 × 3 convolution kernels, and the number of kernels increases from 16 to 256 so that more varied high-level features can be generated from combinations of low-level features. At layer C1, the kernels perform two-channel convolution over the attention matrix and the original video frames.
Layers S1 to S5 are down-sampling layers using max pooling; they reduce the resolution and size of the feature maps, reduce the amount of computation, and improve tolerance to distortions of the input images. Layers S2 and S4 use a 2 × 2 × 2 window and down-sample the temporal and spatial dimensions simultaneously; the other layers use a 1 × 2 × 2 window and down-sample only the spatial dimensions.
Layer D1 is a fully connected layer containing 256 neurons. The feature cubes output by S5 are connected to the 256 neurons of D1, so that the 15-frame input video is converted into a 256-dimensional feature vector at this layer.
Layer D2 is the second fully connected layer and also the output layer; its number of neurons is 6, equal to the number of target categories. Each neuron of D2 is fully connected to the 256 neurons of D1, and classification is finally performed by softmax regression, yielding an output that indicates the behavior category.
Dropout operations with a probability of 0.25 are added between pooling layer S5 and fully connected layer D1, and between D1 and D2, in the 3D convolutional neural network model.
3 Human behavior recognition algorithm flow (experimental analysis)
1) The human behavior videos of the six types in the KTH data set (walking, jogging, running, boxing, hand waving and hand clapping) are input into the computer, read frame by frame with OpenCV, and converted to grayscale;
2) Blank background frames containing no human region are rejected; 15 key frames are extracted from the remaining video containing human behavior by equally spaced sampling, and the extracted images are saved as the original video input of the 3D convolutional network constructed herein, with size 15 × 120 × 160;
3) The inter-frame difference channel is computed from the extracted key-frame samples according to the inter-frame difference channel calculation method;
4) The extracted inter-frame difference images and the original grayscale key frames are combined into the two input channels;
5) The two-channel 3D convolutional neural network model is built as described above;
6) The input is fed into the two-channel 3D convolutional neural network model for training. A single round of computation proceeds as follows:
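The equally spaced sampling of step 2) can be sketched with a small helper (a hypothetical function name; it assumes the remaining video has at least 15 frames):

```python
def keyframe_indices(n_frames, n_keys=15):
    """Indices of n_keys equally spaced frames out of n_frames frames."""
    step = n_frames / n_keys          # fractional stride between key frames
    return [int(i * step) for i in range(n_keys)]
```

For a 150-frame clip this selects every 10th frame (indices 0, 10, ..., 140), giving the 15-frame cube used as network input.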
In the first convolution and down-sampling module, the video frame sequence of size 15 × 120 × 160 passes through the convolution of the 16 3 × 3 × 3 kernels of layer C1, yielding 16 feature cubes of size 15 × 120 × 160; the 1 × 2 × 2 down-sampling of S1 then gives 16 feature cubes of size 15 × 60 × 80.
In the second convolution and down-sampling module, the convolution of the 32 3 × 3 × 3 kernels of layer C2 yields 32 feature cubes of size 15 × 60 × 80; the 2 × 2 × 2 down-sampling of S2 gives 32 feature cubes of size 7 × 30 × 40.
In the third convolution and down-sampling module, the convolution of the 64 3 × 3 × 3 kernels of layer C3 yields 64 feature cubes of size 7 × 30 × 40; the 1 × 2 × 2 down-sampling of S3 gives 64 feature cubes of size 7 × 15 × 20.
In the fourth convolution and down-sampling module, the convolution of the 128 3 × 3 × 3 kernels of layer C4 yields 128 feature cubes of size 7 × 15 × 20; the 2 × 2 × 2 down-sampling of S4 gives 128 feature cubes of size 3 × 7 × 10.
In the fifth convolution and down-sampling module, the convolution of the 256 3 × 3 × 3 kernels of layer C5 yields 256 feature cubes of size 3 × 7 × 10; the 1 × 2 × 2 down-sampling of S5 gives 256 feature cubes of size 3 × 3 × 5.
After the five groups of convolution and down-sampling operations, the resulting feature cubes are flattened into an 11520-dimensional feature vector, which is connected to the fully connected layer with 256 neurons, and finally to the output layer with 6 neurons, producing a 6-dimensional output.
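The feature-map sizes listed above can be checked with a short shape walk-through, using the kernel counts and pooling windows of C1-S5. It assumes, as the sizes imply, that the 3 × 3 × 3 convolutions preserve the cube dimensions ("same" padding) and that max pooling divides each dimension by the window size with truncation:

```python
def forward_shapes(t=15, h=120, w=160):
    """Trace (channels, T, H, W) through modules C1/S1 .. C5/S5."""
    kernels = [16, 32, 64, 128, 256]                        # C1..C5 kernel counts
    pools = [(1, 2, 2), (2, 2, 2), (1, 2, 2), (2, 2, 2), (1, 2, 2)]  # S1..S5 windows
    shapes = []
    for k, (pt, ph, pw) in zip(kernels, pools):
        # 'same'-padded 3x3x3 convolution keeps T,H,W; channel count becomes k
        t, h, w = t // pt, h // ph, w // pw                 # pooling shrinks dims
        shapes.append((k, t, h, w))
    return shapes
```

Running it reproduces the sizes of the five modules, ending at 256 cubes of 3 × 3 × 5, i.e. the 256 × 3 × 3 × 5 = 11520-dimensional flattened vector.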
After one round of forward propagation, the error between the actual output of the network and the output corresponding to the true behavior category is computed and back-propagated.
The network model uses the log-likelihood cost function as its loss function. The variance (squared-error) loss function can make training slow when training a neural network: the farther the initial output value is from the true value, the slower the training. The cross-entropy cost function solves this problem. When the output layer uses the softmax activation function, the log-likelihood cost function is used, whose formula is:
C = −∑_k y_k log a_k  (5)
where a_k is the output value of the k-th neuron and y_k is the corresponding true value, taking the value 0 or 1. The gradient formulas for the network weights w and biases b are:
∂C/∂w_{jk}^L = a_k^{L−1}(a_j^L − y_j),  ∂C/∂b_j^L = a_j^L − y_j  (6)
where j is the index of the neuron in the current layer, k is the index of the connected neuron in the previous layer, and L denotes the layer number of the current neuron. The log-likelihood function, like the cross-entropy cost function, is non-negative, so the objective is to minimize the cost function; when the actual output a is close to the desired output y, the cost function approaches 0. Using the cross-entropy function overcomes the problem of the variance cost function updating the weights too slowly. The softmax function combined with the log-likelihood cost function is well suited to training neural networks for multi-class tasks.
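Formula (5) together with the softmax output can be sketched in NumPy (illustrative helper names):

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of logits, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def log_likelihood_cost(a, y):
    """C = -sum_k y_k log a_k (formula (5)) for one-hot label y."""
    return -np.sum(y * np.log(a))
```

For this pairing the gradient of C with respect to the logits is simply a − y, which is why the update does not slow down when the output is far from the target, unlike the variance loss.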
The network model uses the SGD (stochastic gradient descent) optimizer, which is the batch version of gradient descent. The training data set is divided into n batches, each containing m samples. Each update uses the data of one batch rather than the entire training set, i.e.:
x_{t+1} = x_t + Δx_t  (7)
Δx_t = −η g_t  (8)
where η is the learning rate (η = 0.01 in this experiment) and g_t is the gradient of x at time t. The advantage of the stochastic gradient descent optimization method is that when the amount of training data is large, updating with the entire data set is often impractical in time; the mini-batch method reduces the load on the machine and converges faster.
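The update rule of formulas (7)-(8) over mini-batches can be sketched as follows. The function and the gradient callback are hypothetical; η = 0.01 matches the experiment:

```python
import numpy as np

def sgd_epoch(x, data, grad_fn, lr=0.01, batch_size=4):
    """One epoch of mini-batch SGD: x <- x - lr * g_t per batch
    (formulas (7)-(8)), where g_t is estimated on one batch only."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        g = grad_fn(x, batch)      # gradient from this batch, not the full set
        x = x - lr * g
    return x
```

Each pass over the data performs n updates (one per batch) instead of a single full-batch update, which is what makes convergence faster in practice.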
After 50 training iterations on the KTH data set, a classification model with an accuracy above 90% is obtained;
The human behavior video to be classified is processed according to steps 1)-4) to extract its grayscale and inter-frame difference channels, which are input into the trained two-channel 3D convolutional neural network to obtain the corresponding category of the behavior video, achieving the goal of recognizing the human behavior in the video.
Those skilled in the art will appreciate that in the embodiments above the basic content is common; the main difference is that the attention matrix calculation in embodiment 2 is an improvement of the calculation provided in embodiment 1. The three-frame difference it uses describes the variation between the current frame and both neighboring frames, whereas the two-frame difference in embodiment 1 only describes the variation between the current frame and a single neighboring frame. Moreover, the three-frame difference method used here does not, as the traditional three-frame difference method does, take a further difference of the two frame differences after computing the differences of the current frame with the previous and next frames; instead it takes the union of the two frame differences, so that the region changed by a motion within a short time interval is described completely.
The convolution kernel size of 3 × 3 × 3 was verified experimentally: it converges faster, attains high recognition accuracy, and gives the best overall performance; a 5 × 5 × 5 kernel can reach the highest recognition accuracy but has many parameters and converges more slowly. The number of network layers can be increased or decreased according to the complexity of the data set to be recognized, but should not exceed 7 feature extraction layers, otherwise the number of parameters becomes too large for real-time operation. The size of the input video can be adjusted for different applications but should be neither too large nor too small: if too large, the video contains too much redundant detail, training is time-consuming, and over-fitting occurs easily; if too small, too much information is lost and sufficient features cannot be extracted for recognition. It is generally recommended to down-sample high-resolution videos to within 500 × 500; videos with complex content may retain a larger resolution, and videos with simple content may be down-sampled to a smaller size, but no smaller than 100 × 100. The Dropout probability is recommended to be a decimal between 0.25 and 0.5; setting it to 0.5 produces the largest number of distinct sub-networks.
In conclusion, the method of this patent needs no complicated preprocessing and does not depend on features obtained from traditional manual experience; it uses deep convolutional neural networks to mine the deeper, more abstract motion features contained in the original video, makes full use of the raw information in the video, and can better adapt to human behavior recognition in complex scenes. It also needs no descriptor construction: spatio-temporal features are extracted directly with the 3D convolutional neural network, eliminating the complicated video preprocessing and manual feature extraction steps and improving efficiency.
The above embodiments are merely illustrative of the technical scheme of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the above embodiments, a person of ordinary skill in the art should understand that the specific embodiments of the invention can still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the invention is intended to fall within the scope of the claims of the invention.

Claims (10)

1. A human behavior recognition method based on an attention mechanism and a 3D convolutional neural network, characterized in that the human behavior recognition method constructs a 3D convolutional neural network whose input layer comprises two channels: an original grayscale image channel and an attention matrix channel.
2. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 1, characterized in that the attention matrix is obtained by calculating the difference between two consecutive frames or among three frame images and then normalizing the result.
3. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 2, characterized in that the three-frame difference method separately computes the difference image of the current frame with the previous frame and with the next frame, then takes the "union" of the two difference results; the union is obtained by taking, at each pixel, the larger value of the differences of the current frame with the preceding and following frames, so that the result indicates the region of maximal change before and after the current frame.
4. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 2, characterized in that in the two-frame difference method the attention matrix A is calculated by the following formulas:
D(x, y) = |I_t(x, y) − I_{t-1}(x, y)|  (3)
I_D(x, y) = D(x, y) if D(x, y) > T, otherwise 0  (2)
A(x, y) = (I_D(x, y) − min) / (max − min)  (1)
wherein x, y are the coordinates of the target pixel, t is the current frame number, t−1 denotes the frame preceding the current frame, and I_t is the gray value at position x, y of the current frame; formula (3) computes the distance D between two adjacent frames, the threshold T in formula (2) rejects non-salient change regions and yields the salient change region I_D, and formula (1) normalizes the distance to finally obtain the attention matrix A, where min and max are the minimum and maximum gray values among all pixels of the salient change region I_D; this three-dimensional matrix can represent the salient motion change regions in the input human behavior video.
5. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 3, characterized in that the steps of the three-frame difference method are:
1) selecting three consecutive frames I_{t-1}(x, y), I_t(x, y), I_{t+1}(x, y) from the video frame sequence and computing the difference of each pair of adjacent frames, D_{t-1,t}(x, y) and D_{t,t+1}(x, y):
D_{t-1,t}(x, y) = |I_t(x, y) − I_{t-1}(x, y)|,  D_{t,t+1}(x, y) = |I_{t+1}(x, y) − I_t(x, y)|  (21)
2) thresholding the resulting difference images with a suitable threshold T to exclude noise interference and extract the salient change regions:
B_1(x, y) = D_{t-1,t}(x, y) if D_{t-1,t}(x, y) > T, otherwise 0; B_2(x, y) defined likewise from D_{t,t+1}(x, y)  (22)
3) combining the two difference images of each group with a logical OR to obtain the union of the change regions between the pairs of consecutive frames, which gives the salient change region before and after the intermediate frame of the three images, B(x, y):
B(x, y) = max(B_1(x, y), B_2(x, y))  (23)
4) normalizing the resulting difference image to obtain the frame-difference channel A(x, y), which can represent the salient motion change regions in the input human behavior video.
6. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to any one of claims 1 to 5, characterized in that the 3D convolutional neural network model of the 3D convolutional neural network comprises:
one two-channel input layer, and a plurality of 3D convolutional layers interleaved with a plurality of 3D pooling layers, followed by fully connected layers which yield the classification result; the two-channel input layer inputs the attention matrix together with the original grayscale video frame cube into the neural network model.
7. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 6, characterized in that there are two fully connected layers, with one Dropout layer before each of the two fully connected layers.
8. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 7, characterized in that the Dropout probability is set to a decimal between 0.25 and 0.5.
9. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to claim 6, characterized in that the numbers of the 3D convolutional layers and of the 3D pooling layers are each 3 to 7; in particular, the numbers of the 3D convolutional layers and of the 3D pooling layers are each 5.
10. The human behavior recognition method based on an attention mechanism and a 3D convolutional neural network according to any one of claims 1 to 6, wherein the 3D convolutional neural network model of the 3D convolutional neural network comprises: 1 two-channel input layer, and 5 3D convolutional layers interleaved with 5 3D pooling layers, followed by 2 fully connected layers which yield the classification result, with two Dropout operations applied at the 2 fully connected layers respectively,
wherein:
C1 to C5 are convolutional layers; each layer's kernels are 3 × 3 × 3, and the number of kernels increases from 16 to 256 so that more varied high-level features are generated from combinations of low-level features; at layer C1, the kernels perform two-channel convolution over the attention matrix and the original video frames,
layers S1 to S5 are down-sampling layers using max pooling, which reduce the resolution and size of the feature maps, reduce the amount of computation, and improve tolerance to distortions of the input images; layers S2 and S4 use a 2 × 2 × 2 window and down-sample the temporal and spatial dimensions simultaneously, while the other layers use a 1 × 2 × 2 window and down-sample only the spatial dimensions;
layer D1 is a fully connected layer containing 256 neurons; the feature cubes output by S5 are connected to the 256 neurons of D1, so that the 15-frame input video is converted into a 256-dimensional feature vector at this layer; a Dropout layer is used between S5 and D1, freezing part of the connections between S5 and D1 with a probability of 0.25;
layer D2 is the second fully connected layer and also the output layer, with 6 neurons, equal to the number of target categories; each neuron of D2 is fully connected to the 256 neurons of D1, and classification is finally performed by softmax regression, yielding an output that can mark the behavior category;
wherein the 3D convolutional layers and the fully connected layer D1 use ReLU as the activation function to improve the performance of the deep network, the output layer uses Softmax as the activation function, the optimizer uses the SGD function, and the loss function uses the multi-class cross-entropy function.
CN201810463529.5A 2018-05-15 2018-05-15 Human behavior identification method based on attention mechanism and 3D convolutional neural network Active CN108830157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810463529.5A CN108830157B (en) 2018-05-15 2018-05-15 Human behavior identification method based on attention mechanism and 3D convolutional neural network


Publications (2)

Publication Number Publication Date
CN108830157A true CN108830157A (en) 2018-11-16
CN108830157B CN108830157B (en) 2021-01-22

Family

ID=64148794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810463529.5A Active CN108830157B (en) 2018-05-15 2018-05-15 Human behavior identification method based on attention mechanism and 3D convolutional neural network

Country Status (1)

Country Link
CN (1) CN108830157B (en)

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598225A (en) * 2018-11-29 2019-04-09 浙江大学 Sharp attention network, neural network and pedestrian's recognition methods again
CN109635926A (en) * 2018-11-30 2019-04-16 深圳市商汤科技有限公司 Attention characteristic-acquisition method, device and storage medium for neural network
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN109657571A (en) * 2018-12-04 2019-04-19 北京京东金融科技控股有限公司 A kind of childbirth monitoring method and device
CN109784144A (en) * 2018-11-29 2019-05-21 北京邮电大学 A kind of kinship recognition methods and system
CN109829512A (en) * 2019-03-01 2019-05-31 华东师范大学 A kind of image recognition mould group based on deep neural network
CN109919110A (en) * 2019-03-13 2019-06-21 北京航空航天大学 Video area-of-interest-detection method, device and equipment
CN109961019A (en) * 2019-02-28 2019-07-02 华中科技大学 A kind of time-space behavior detection method
CN110008993A (en) * 2019-03-01 2019-07-12 华东师范大学 A kind of end-to-end image-recognizing method based on deep neural network
CN110110642A (en) * 2019-04-29 2019-08-09 华南理工大学 A kind of pedestrian's recognition methods again based on multichannel attention feature
CN110163302A (en) * 2019-06-02 2019-08-23 东北石油大学 Indicator card recognition methods based on regularization attention convolutional neural networks
CN110166388A (en) * 2019-05-25 2019-08-23 西南电子技术研究所(中国电子科技集团公司第十研究所) The intelligence communication signal modulation mode identification method of CNN joint L1 regularization
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 A kind of video behavior recognition methods based on timing cause and effect convolutional network
CN110197235A (en) * 2019-06-28 2019-09-03 浙江大学城市学院 A kind of physical activity recognition methods based on unique attention mechanism
CN110222653A (en) * 2019-06-11 2019-09-10 中国矿业大学(北京) A kind of skeleton data Activity recognition method based on figure convolutional neural networks
CN110223706A (en) * 2019-03-06 2019-09-10 天津大学 Based on the environment self-adaption voice enhancement algorithm for paying attention to power drive cyclic convolution network
CN110334749A (en) * 2019-06-20 2019-10-15 浙江工业大学 Confrontation attack defending model, construction method and application based on attention mechanism
CN110399847A (en) * 2019-07-30 2019-11-01 北京字节跳动网络技术有限公司 Extraction method of key frame, device and electronic equipment
CN110458085A (en) * 2019-08-06 2019-11-15 中国海洋大学 Video behavior recognition methods based on attention enhancing three-dimensional space-time representative learning
CN110503053A (en) * 2019-08-27 2019-11-26 电子科技大学 Human motion recognition method based on cyclic convolution neural network
CN110502995A (en) * 2019-07-19 2019-11-26 南昌大学 Driver based on subtle facial action recognition yawns detection method
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN110555523A (en) * 2019-07-23 2019-12-10 中建三局智能技术有限公司 short-range tracking method and system based on impulse neural network
CN110638455A (en) * 2019-09-26 2020-01-03 京东方科技集团股份有限公司 Server, system, device and medium for evaluating user rehabilitation status
CN110728183A (en) * 2019-09-09 2020-01-24 天津大学 Human body action recognition method based on attention mechanism neural network
CN110826389A (en) * 2019-09-02 2020-02-21 东华大学 Gait recognition method based on attention 3D frequency convolution neural network
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111222466A (en) * 2020-01-08 2020-06-02 武汉大学 Remote sensing image landslide automatic detection method based on three-dimensional space-channel attention mechanism
CN111275694A (en) * 2020-02-06 2020-06-12 电子科技大学 Attention mechanism guided progressive division human body analytic model and method
CN111325161A (en) * 2020-02-25 2020-06-23 四川翼飞视科技有限公司 Method for constructing human face detection neural network based on attention mechanism
CN111325736A (en) * 2020-02-27 2020-06-23 成都航空职业技术学院 Sight angle estimation method based on human eye difference image
CN111353539A (en) * 2020-02-29 2020-06-30 武汉大学 Cervical OCT image classification method and system based on double-path attention convolutional neural network
CN111710008A (en) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 People stream density generation method and device, electronic device and storage medium
CN111783699A (en) * 2020-07-06 2020-10-16 周书田 Video face recognition method based on efficient decomposition convolution and time pyramid network
CN111932538A (en) * 2020-10-10 2020-11-13 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for analyzing thyroid gland atlas
CN112016400A (en) * 2020-08-04 2020-12-01 香港理工大学深圳研究院 Single-class target detection method and device based on deep learning and storage medium
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112417932A (en) * 2019-08-23 2021-02-26 中移雄安信息通信科技有限公司 Method, device and equipment for identifying target object in video
CN112561918A (en) * 2020-12-31 2021-03-26 中移(杭州)信息技术有限公司 Convolutional neural network training method and focus segmentation method
CN112731309A (en) * 2021-01-06 2021-04-30 哈尔滨工程大学 Active interference identification method based on bilinear efficient neural network
CN112767539A (en) * 2021-01-12 2021-05-07 杭州师范大学 Image three-dimensional reconstruction method and system based on deep learning
CN112818948A (en) * 2021-03-09 2021-05-18 东南大学 Behavior identification method based on visual attention under embedded system
CN112818828A (en) * 2021-01-27 2021-05-18 中国科学技术大学 Weak supervision time domain action positioning method and system based on memory network
CN112862746A (en) * 2019-11-28 2021-05-28 深圳硅基智控科技有限公司 Tissue lesion identification method and system based on artificial neural network
CN112907607A (en) * 2021-03-15 2021-06-04 德鲁动力科技(成都)有限公司 Deep learning, target detection and semantic segmentation method based on differential attention
CN112990144A (en) * 2021-04-30 2021-06-18 德鲁动力科技(成都)有限公司 Data enhancement method and system for pedestrian re-identification
CN113033276A (en) * 2020-12-01 2021-06-25 神思电子技术股份有限公司 Behavior recognition method based on conversion module
CN113111760A (en) * 2021-04-07 2021-07-13 同济大学 Lightweight graph convolution human skeleton action identification method based on channel attention
CN113516028A (en) * 2021-04-28 2021-10-19 南通大学 Human body abnormal behavior identification method and system based on mixed attention mechanism
CN113657534A (en) * 2021-08-24 2021-11-16 北京经纬恒润科技股份有限公司 Classification method and device based on attention mechanism
JP2022521130A (en) * 2020-01-20 2022-04-06 上▲海▼商▲湯▼智能科技有限公司 Network training, image processing methods and electronics, storage media and computer programs

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682616A (en) * 2016-12-28 2017-05-17 南京邮电大学 Newborn-painful-expression recognition method based on dual-channel-characteristic deep learning
CN107103277A (en) * 2017-02-28 2017-08-29 中科唯实科技(北京)有限公司 A kind of gait recognition method based on depth camera and 3D convolutional neural networks
CN107749067A (en) * 2017-09-13 2018-03-02 华侨大学 Fire hazard smoke detecting method based on kinetic characteristic and convolutional neural networks
CN107862275A (en) * 2017-11-01 2018-03-30 电子科技大学 Human bodys' response model and its construction method and Human bodys' response method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANLONG FU 等: "Look Closer to See Better: Recurrent Attention Convolutional Neural Network", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
张颖 等: "基于3D卷积神经网络的人体行为识别方法", 《软件导刊》 *

Cited By (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784144A (en) * 2018-11-29 2019-05-21 北京邮电大学 A kind of kinship recognition methods and system
CN109598225A (en) * 2018-11-29 2019-04-09 浙江大学 Sharp attention network, neural network and pedestrian's recognition methods again
CN109635926B (en) * 2018-11-30 2021-11-05 深圳市商汤科技有限公司 Attention feature acquisition method and device for neural network and storage medium
CN109635926A (en) * 2018-11-30 2019-04-16 深圳市商汤科技有限公司 Attention characteristic-acquisition method, device and storage medium for neural network
CN109657571A (en) * 2018-12-04 2019-04-19 北京京东金融科技控股有限公司 A kind of childbirth monitoring method and device
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN109961019A (en) * 2019-02-28 2019-07-02 华中科技大学 A kind of time-space behavior detection method
CN109961019B (en) * 2019-02-28 2021-03-26 华中科技大学 Space-time behavior detection method
CN109829512A (en) * 2019-03-01 2019-05-31 华东师范大学 Image recognition module based on deep neural network
CN110008993A (en) * 2019-03-01 2019-07-12 华东师范大学 End-to-end image recognition method based on deep neural network
CN110223706A (en) * 2019-03-06 2019-09-10 天津大学 Environment self-adaptive speech enhancement algorithm based on attention-driven cyclic convolution network
CN110223706B (en) * 2019-03-06 2021-05-07 天津大学 Environment self-adaptive speech enhancement algorithm based on attention-driven cyclic convolution network
CN109919110A (en) * 2019-03-13 2019-06-21 北京航空航天大学 Video region-of-interest detection method, device and equipment
CN109919110B (en) * 2019-03-13 2021-06-04 北京航空航天大学 Video attention area detection method, device and equipment
CN110110642A (en) * 2019-04-29 2019-08-09 华南理工大学 Pedestrian re-identification method based on multi-channel attention features
CN110166388A (en) * 2019-05-25 2019-08-23 西南电子技术研究所(中国电子科技集团公司第十研究所) Intelligent communication signal modulation mode identification method combining CNN and L1 regularization
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 Video behavior recognition method based on temporal causal convolutional network
CN110175580B (en) * 2019-05-29 2020-10-30 复旦大学 Video behavior identification method based on time sequence causal convolutional network
CN110163302B (en) * 2019-06-02 2022-03-22 东北石油大学 Indicator diagram identification method based on regularization attention convolution neural network
CN110163302A (en) * 2019-06-02 2019-08-23 东北石油大学 Indicator diagram recognition method based on regularized attention convolutional neural network
CN110222653A (en) * 2019-06-11 2019-09-10 中国矿业大学(北京) Skeleton data behavior recognition method based on graph convolutional neural networks
CN110222653B (en) * 2019-06-11 2020-06-16 中国矿业大学(北京) Skeleton data behavior identification method based on graph convolution neural network
CN110334749A (en) * 2019-06-20 2019-10-15 浙江工业大学 Adversarial attack defense model based on attention mechanism, construction method and application thereof
CN110197235B (en) * 2019-06-28 2021-03-30 浙江大学城市学院 Human body activity recognition method based on unique attention mechanism
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN110197235A (en) * 2019-06-28 2019-09-03 浙江大学城市学院 Human body activity recognition method based on unique attention mechanism
CN110555368B (en) * 2019-06-28 2022-05-03 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN110502995A (en) * 2019-07-19 2019-11-26 南昌大学 Driver yawning detection method based on fine facial action recognition
CN110502995B (en) * 2019-07-19 2023-03-14 南昌大学 Driver yawning detection method based on fine facial action recognition
CN110555523B (en) * 2019-07-23 2022-03-29 中建三局智能技术有限公司 Short-range tracking method and system based on impulse neural network
CN110555523A (en) * 2019-07-23 2019-12-10 中建三局智能技术有限公司 Short-range tracking method and system based on impulse neural network
CN110399847A (en) * 2019-07-30 2019-11-01 北京字节跳动网络技术有限公司 Key frame extraction method and device, and electronic equipment
CN110399847B (en) * 2019-07-30 2021-11-09 北京字节跳动网络技术有限公司 Key frame extraction method and device and electronic equipment
CN110458085A (en) * 2019-08-06 2019-11-15 中国海洋大学 Video behavior recognition method based on attention-enhanced three-dimensional spatiotemporal representation learning
CN110458085B (en) * 2019-08-06 2022-02-08 中国海洋大学 Video behavior identification method based on attention-enhanced three-dimensional space-time representation learning
CN112417932A (en) * 2019-08-23 2021-02-26 中移雄安信息通信科技有限公司 Method, device and equipment for identifying target object in video
CN112417932B (en) * 2019-08-23 2023-04-07 中移雄安信息通信科技有限公司 Method, device and equipment for identifying target object in video
CN110503053B (en) * 2019-08-27 2022-07-08 电子科技大学 Human body action recognition method based on cyclic convolution neural network
CN110503053A (en) * 2019-08-27 2019-11-26 电子科技大学 Human motion recognition method based on cyclic convolution neural network
CN110826389B (en) * 2019-09-02 2022-05-27 东华大学 Gait recognition method based on attention 3D frequency convolution neural network
CN110826389A (en) * 2019-09-02 2020-02-21 东华大学 Gait recognition method based on attention 3D frequency convolution neural network
CN110728183B (en) * 2019-09-09 2023-09-22 天津大学 Human body action recognition method based on attention mechanism neural network
CN110728183A (en) * 2019-09-09 2020-01-24 天津大学 Human body action recognition method based on attention mechanism neural network
CN110638455A (en) * 2019-09-26 2020-01-03 京东方科技集团股份有限公司 Server, system, device and medium for evaluating user rehabilitation status
CN110638455B (en) * 2019-09-26 2022-06-14 京东方科技集团股份有限公司 Server, system, device and medium for evaluating user rehabilitation status
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN112862746A (en) * 2019-11-28 2021-05-28 深圳硅基智控科技有限公司 Tissue lesion identification method and system based on artificial neural network
CN112862746B (en) * 2019-11-28 2022-09-02 深圳硅基智控科技有限公司 Tissue lesion identification method and system based on artificial neural network
CN111222466B (en) * 2020-01-08 2022-04-01 武汉大学 Remote sensing image landslide automatic detection method based on three-dimensional space-channel attention mechanism
CN111222466A (en) * 2020-01-08 2020-06-02 武汉大学 Remote sensing image landslide automatic detection method based on three-dimensional space-channel attention mechanism
JP2022521130A (en) * 2020-01-20 2022-04-06 上海商汤智能科技有限公司 Network training and image processing methods, electronic devices, storage media and computer programs
CN111275694A (en) * 2020-02-06 2020-06-12 电子科技大学 Attention-mechanism-guided progressive-partition human parsing model and method
CN111325161A (en) * 2020-02-25 2020-06-23 四川翼飞视科技有限公司 Method for constructing human face detection neural network based on attention mechanism
CN111325161B (en) * 2020-02-25 2023-04-18 四川翼飞视科技有限公司 Method for constructing human face detection neural network based on attention mechanism
CN111325736A (en) * 2020-02-27 2020-06-23 成都航空职业技术学院 Gaze angle estimation method based on eye difference images
CN111325736B (en) * 2020-02-27 2024-02-27 成都航空职业技术学院 Gaze angle estimation method based on eye difference images
CN111353539A (en) * 2020-02-29 2020-06-30 武汉大学 Cervical OCT image classification method and system based on double-path attention convolutional neural network
CN111710008A (en) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 Crowd density generation method and device, electronic device and storage medium
CN111783699A (en) * 2020-07-06 2020-10-16 周书田 Video face recognition method based on efficient decomposed convolutions and temporal pyramid network
CN112016400B (en) * 2020-08-04 2021-06-29 香港理工大学深圳研究院 Single-class target detection method and device based on deep learning and storage medium
CN112016400A (en) * 2020-08-04 2020-12-01 香港理工大学深圳研究院 Single-class target detection method and device based on deep learning and storage medium
CN111932538A (en) * 2020-10-10 2020-11-13 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for analyzing thyroid gland atlas
CN111932538B (en) * 2020-10-10 2021-01-15 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for analyzing thyroid gland atlas
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112257572B (en) * 2020-10-20 2022-02-01 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN113033276B (en) * 2020-12-01 2022-05-17 神思电子技术股份有限公司 Behavior recognition method based on conversion module
CN113033276A (en) * 2020-12-01 2021-06-25 神思电子技术股份有限公司 Behavior recognition method based on conversion module
CN112561918A (en) * 2020-12-31 2021-03-26 中移(杭州)信息技术有限公司 Convolutional neural network training method and lesion segmentation method
CN112731309A (en) * 2021-01-06 2021-04-30 哈尔滨工程大学 Active interference identification method based on bilinear efficient neural network
CN112767539B (en) * 2021-01-12 2023-08-08 杭州师范大学 Image three-dimensional reconstruction method and system based on deep learning
CN112767539A (en) * 2021-01-12 2021-05-07 杭州师范大学 Image three-dimensional reconstruction method and system based on deep learning
CN112818828B (en) * 2021-01-27 2022-09-09 中国科学技术大学 Weak supervision time domain action positioning method and system based on memory network
CN112818828A (en) * 2021-01-27 2021-05-18 中国科学技术大学 Weak supervision time domain action positioning method and system based on memory network
CN112818948A (en) * 2021-03-09 2021-05-18 东南大学 Behavior identification method based on visual attention under embedded system
CN112818948B (en) * 2021-03-09 2022-03-29 东南大学 Behavior identification method based on visual attention under embedded system
CN112907607A (en) * 2021-03-15 2021-06-04 德鲁动力科技(成都)有限公司 Deep learning, target detection and semantic segmentation method based on differential attention
CN113111760A (en) * 2021-04-07 2021-07-13 同济大学 Lightweight graph convolution human skeleton action identification method based on channel attention
CN113111760B (en) * 2021-04-07 2023-05-02 同济大学 Light-weight graph convolution human skeleton action recognition method based on channel attention
CN113516028B (en) * 2021-04-28 2024-01-19 南通大学 Human body abnormal behavior identification method and system based on mixed attention mechanism
CN113516028A (en) * 2021-04-28 2021-10-19 南通大学 Human body abnormal behavior identification method and system based on mixed attention mechanism
CN112990144B (en) * 2021-04-30 2021-08-17 德鲁动力科技(成都)有限公司 Data enhancement method and system for pedestrian re-identification
CN112990144A (en) * 2021-04-30 2021-06-18 德鲁动力科技(成都)有限公司 Data enhancement method and system for pedestrian re-identification
CN113657534A (en) * 2021-08-24 2021-11-16 北京经纬恒润科技股份有限公司 Classification method and device based on attention mechanism

Also Published As

Publication number Publication date
CN108830157B (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN108830157A (en) Human behavior recognition method based on attention mechanism and 3D convolutional neural networks
CN110458844B (en) Semantic segmentation method for low-illumination scene
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
CN109740419A (en) Video behavior recognition method based on Attention-LSTM network
CN105095833B (en) Network construction method for face recognition, recognition method and system
WO2023185243A1 (en) Expression recognition method based on attention-modulated contextual spatial information
CN113496217B (en) Method for identifying human face micro expression in video image sequence
Cheng et al. Facial expression recognition method based on improved VGG convolutional neural network
CN108133188A (en) Behavior recognition method based on motion history images and convolutional neural networks
CN104217214B (en) RGB-D human behavior recognition method based on configurable convolutional neural networks
CN105678284B (en) Fixed-position human behavior analysis method
CN107341452A (en) Human behavior recognition method based on quaternion spatiotemporal convolutional neural networks
CN109961034A (en) Video object detection method based on convolutional gated recurrent neural unit
CN106485214A (en) Eye and mouth state recognition method based on convolutional neural networks
CN107609460A (en) Human behavior recognition method fusing spatiotemporal dual-stream network and attention mechanism
CN106909938B (en) View-independent behavior recognition method based on deep learning network
CN111652903B (en) Pedestrian target tracking method based on convolutional correlation network in autonomous driving scenes
CN110378208B (en) Behavior identification method based on deep residual error network
CN104933722A (en) Image edge detection method based on Spiking-convolution network model
CN107657204A (en) Construction method of deep network model, and facial expression recognition method and system
CN109817276A (en) Protein secondary structure prediction method based on deep neural network
CN109934158A (en) Video emotion recognition method based on locally enhanced motion history images and recurrent convolutional neural network
Xu et al. Recurrent convolutional neural network for video classification
CN110232361B (en) Human behavior intention identification method and system based on three-dimensional residual dense network
CN108921047A (en) Multi-model vote-averaging action recognition method based on cross-layer fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220810

Address after: No. 619 Yonghua North Street, Baoding, Hebei Province, 071003

Patentee after: NORTH CHINA ELECTRIC POWER University (BAODING)

Patentee after: China Power Investment Northeast Energy Technology Co.,Ltd.

Address before: No. 619 Yonghua North Street, Baoding, Hebei Province, 071003

Patentee before: NORTH CHINA ELECTRIC POWER University (BAODING)
