CN108108699A - Human motion recognition method fusing a deep neural network model and binary hashing - Google Patents

Human motion recognition method fusing a deep neural network model and binary hashing

Info

Publication number
CN108108699A
Authority
CN
China
Prior art keywords
video
frame
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711422702.9A
Other languages
Chinese (zh)
Inventor
李伟生 (Li Weisheng)
冯晨 (Feng Chen)
肖斌 (Xiao Bin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201711422702.9A
Publication of CN108108699A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a human motion recognition method that combines a deep neural network model with binary hashing, and belongs to the technical field of pattern recognition. The method comprises: first, preprocessing the action recognition database by cutting videos into frame sequences, computing optical flow maps, computing the coordinates of human joint points with a pose estimation algorithm, and cropping video region frames using the resulting coordinates; second, extracting FC (fully connected) layer features from the RGB and optical-flow streams of each video with pre-trained VGG-16 network models, selecting key frames from the video frame sequence, and taking differences of the FC features corresponding to these key frames; binarizing the differences; then obtaining a uniform feature representation of each video with a binary hashing method; fusing these with P-CNN features and obtaining the feature representation of the video using several normalization methods such as L1 and L2; and finally training a classifier with a support vector machine algorithm to recognize human action videos. The present invention achieves higher action recognition accuracy.

Description

Human motion recognition method fusing a deep neural network model and binary hashing
Technical field
The invention belongs to the technical field of image/video processing, and more particularly relates to a human motion recognition method that combines a deep neural network model with binary hashing.
Background technology
In recent years, research on human action recognition in fields such as pattern recognition and image processing and analysis has made great progress, and some human motion recognition systems have already been put into practical use. A human action recognition algorithm mainly comprises two steps, action representation and action classification, and how to encode human action information is a crucial step for the subsequent classification. An ideal action representation algorithm must not only be robust to variations in human appearance, scale, complex backgrounds, and action speed, but must also supply enough information for the classifier to discriminate between action types. However, complex backgrounds and the inherent variability of the human body pose great challenges to human action recognition.
Deep learning methods treat a short video as a sequence of input frames. Obviously, a single frame is not enough to effectively capture the dynamics of an action, while a large number of frames requires a large number of parameters, which leads to model over-fitting, requires larger training sets, and increases computational complexity. This problem also exists in other popular CNN architectures, such as the 3D convolutional networks proposed by Tran D. et al. Therefore, state-of-the-art deep action recognition models are usually trained to generate useful features from short video clips, which are then aggregated into a whole-sequence-level descriptor that is used to train a linear classifier with specific action labels. The P-CNN model proposed by Cheron et al. obtains the feature representation of a video by extracting the output features of the FC layers of the RGB and optical-flow streams of the video and combining them with min or max pooling. However, min or max pooling only captures first-order associations between features; aggregation operators can more properly capture the higher-order correlations between CNN features.
Although CNN features at the frame level may be extremely complex, exploiting the associations between changes across video frames can capture features unique to a video, which potentially helps to improve video recognition performance.
Content of the invention
The present invention seeks to address the above problems of the prior art by proposing a human motion recognition method fusing a deep neural network model and binary hashing, with better recognition performance. The technical scheme of the invention is as follows:
A human motion recognition method fusing a deep neural network model and binary hashing, comprising the following steps:
101. Obtain a short video containing a human action, and cut the short video into a video frame sequence;
102. Compute the optical flow maps of adjacent video frames in the frame sequence of step 101 using an optical flow algorithm;
103. Obtain the coordinates of the human joint points from the frame sequence of step 101 using a pose estimation algorithm;
104. Crop the RGB and optical-flow region maps of different human body parts using the joint coordinates obtained in step 103, obtaining the RGB frame sequence and the optical-flow frame sequence of the video;
105. Extract the fully connected (FC) layer features of each frame in the RGB frame sequence and the optical-flow frame sequence obtained in step 104, using the VGG-16 model of the Oxford Visual Geometry Group and the optical-flow network (FlowNet) model; the dimension of this layer's features is 4096;
106. Pool and aggregate the FC features obtained in step 105 to obtain an n × 4096-dimensional video feature representation;
107. Apply l2 normalization to the video features obtained in step 106 and feed them into a linear SVM classifier for classification.
Further, step 102 computes the optical flow maps of the adjacent video frames of step 101 using an optical flow algorithm, and specifically includes the following steps (a code sketch follows the list):
201. Extract the optical flow vectors between each pair of adjacent video frames;
202. Sum the absolute values of the horizontal and vertical components of the resulting optical flow vectors over all pixel points, respectively, obtaining the sums of the optical flow absolute values of the frame in the horizontal and vertical directions;
203. Arrange the optical flow absolute-value sums of all frames in temporal order to generate the horizontal and vertical optical flow sequences of the entire video.
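For illustration only (not part of the original patent text), a minimal Python sketch of steps 201-203 follows; it assumes OpenCV's Farneback method as a stand-in for "an optical flow algorithm", and all function and variable names are illustrative:

import cv2
import numpy as np

def optical_flow_sequences(frames):
    """frames: list of BGR images; returns per-frame horizontal/vertical flow sums."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    h_sums, v_sums = [], []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        # step 201: dense optical flow between two adjacent frames;
        # flow[..., 0] is the horizontal component, flow[..., 1] the vertical
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # step 202: sum the absolute values over all pixel points, per direction
        h_sums.append(float(np.abs(flow[..., 0]).sum()))
        v_sums.append(float(np.abs(flow[..., 1]).sum()))
    # step 203: the lists are already in temporal order
    return h_sums, v_sums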
Further, the step of selecting key frames from the RGB frame sequence and the optical-flow frame sequence of the video in step 104 includes:
Choose a sliding window size h, and dynamically sample frames at a stride S determined by the number of video frames |F|, extracting their features. f_t denotes a frame in the original video frame sequence, where the original video has T frames in total; f_{t_i} denotes a frame in the selected key frame sequence. Key frames are extracted with the method shown in formula (2): one frame is chosen every S frames, h frames in total (a sampling sketch follows).
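A minimal sketch of this key-frame sampling, assuming the stride S = |F|/h used in the embodiment below; the function name is illustrative:

def select_key_frames(frames, h):
    """frames: per-frame items (images or FC features); returns up to h key frames."""
    S = max(len(frames) // h, 1)  # stride: one key frame every S frames
    return [frames[i * S] for i in range(h) if i * S < len(frames)]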
Further, in order to distinguish the RGB sequence from the optical-flow sequence, step 105 uses convolutional network models of two different architectures; each network contains five convolutional layers and three fully connected layers. The output of the second fully connected layer is used as the FC feature, i.e. the video frame feature. The input images are uniformly resized to 224 × 224, so that consistent FC layer features are obtained. After aggregating all frame features of a video with min and max pooling operations, we obtain the feature representation of the video (see the sketch below).
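As an illustrative sketch of this step (not the patent's exact models), the code below uses torchvision's ImageNet-pretrained VGG-16 as an assumed stand-in for the pre-trained networks (the optical-flow network is omitted), takes the activation after the second fully connected layer as the 4096-dimensional frame feature, and aggregates a video's frames with min and max pooling:

import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# classifier: [fc6, ReLU, Dropout, fc7, ReLU, Dropout, fc8];
# truncate after fc7's ReLU to obtain the second FC layer's 4096-d activation
fc7 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                          *list(vgg.classifier.children())[:5])

preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def video_representation(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> concatenated min/max pooled feature."""
    batch = torch.stack([preprocess(f) for f in frames])  # resize to 224 x 224
    feats = fc7(batch)                                    # (num_frames, 4096)
    pooled = torch.cat([feats.min(dim=0).values,          # min pooling over frames
                        feats.max(dim=0).values])         # max pooling over frames
    return pooled                                         # (8192,)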
Further, adjacent differences are computed on the FC features, each of 4096 dimensions, of the selected key frames, and 0 and 1 are used to represent the trend of change of each feature. A matrix of size 4096 × h is thus obtained in which each element is 0 or 1. The binary sequence of each row is extracted as input and its output is computed with formula (3), thus obtaining the 4096-dimensional binary hash feature of the video.
Further, step 106 computes the video feature values, specifically: compare the changes of the feature values of two adjacent key frames f_{t_i} and f_{t_{i+1}} on each dimension of the corresponding frame feature vectors f_t^p; an increase is represented by 1 and a decrease by 0. A 4096 × h eigenvalue matrix M is thus obtained whose elements are only 0 or 1. For each row vector [x_{h-1}, x_{h-2}, ..., x_0] of the matrix, its binary hash mapping is computed with the following formula (3), which converts the numeric string composed of 0s and 1s into an unsigned integer;
The binary hash features of the FC feature changes of the RGB streams and optical-flow frames of the different human body parts are finally obtained (a sketch follows).
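A minimal numpy sketch of this binarization and hashing; formula (3), B2U_w(x) = Σ_{i=0}^{w-1} x_i·2^i, is reconstructed from claim 6 below, and the helper name is illustrative:

import numpy as np

def binary_hash(key_feats):
    """key_feats: (h, 4096) array of key-frame FC features -> (4096,) unsigned hashes."""
    diffs = np.diff(key_feats, axis=0)      # adjacent differences between key frames
    bits = (diffs > 0).astype(np.uint64)    # 1 = feature value increased, 0 = decreased
    w = bits.shape[0]                       # number of bits per feature dimension
    powers = (2 ** np.arange(w)).astype(np.uint64)
    # formula (3): B2U_w(x) = sum_i x_i * 2^i, one unsigned integer per dimension
    return bits.T @ powers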
Further, in addition to l2 normalization, step 107 also uses a fused l1 + β·l2 feature normalization, where l2 denotes second-order normalization of the features, l1 denotes first-order normalization of the features, and β denotes the fusion normalization coefficient. After the features extracted by the deep neural network are finally fused with those obtained by binary hashing, the feature representation p of the video is obtained; since the feature value scales of the different sources differ, all feature values are normalized to one scale before the classifier is applied.
Further, the fused l1 + β·l2 normalization is used, i.e.
p = p / (||p||_1 + β·||p||_2)    (4)
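A one-line sketch of formula (4); the default value of β here is illustrative, not taken from the patent:

import numpy as np

def fused_normalize(p, beta=0.5):
    """p: fused deep + binary-hash feature vector, scaled by ||p||_1 + beta*||p||_2."""
    return p / (np.abs(p).sum() + beta * np.linalg.norm(p))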
The advantages and beneficial effects of the present invention are as follows:
The innovation of the present invention lies in fusing the deep network model with the binary hashing method. In view of the effectiveness and accuracy of deep convolutional neural networks on image object characterization problems in recent years, the VGG-16 network model pre-trained on the ImageNet dataset, which covers more than 20,000 kinds of objects, is chosen to extract features from the RGB frame sequences, and a deep model pre-trained on the UCF101 dataset, which contains 101 kinds of actions, is used to extract features from the optical-flow frame sequences. The simplicity and efficiency of the binary hashing method are exploited to apply further higher-order processing to the extracted static video frame and optical-flow frame features. The combined features are then trained for recognition with different normalization methods. The method thus achieves better recognition performance than traditional human action recognition methods.
Description of the drawings
Fig. 1 is the output result of the pose estimation method in a preferred embodiment of the present invention;
Fig. 2 is the flow chart of the method in a preferred embodiment of the present invention;
Fig. 3 is the flow of the binary hash algorithm;
Fig. 4 is a comparison of different normalization methods;
Fig. 5 is a comparison of hash windows of different sizes;
Fig. 6 is a comparison of fusion coefficients of different sizes.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution of the present invention for solving the above technical problems is:
With reference to Figs. 1-2, a human action recognition method based on a deep network model and the binary hashing method comprises the following steps:
1. Extract the deep features of the video
The samples in the experimental video library are divided into a training set and a test set, and FC layer features are extracted from all samples. The detailed steps of the extraction method are as follows:
1) Cut the input video into frames
In order to extract local feature information of the video, the input video containing a human action is cut into a frame sequence.
2) Compute the optical-flow frames from the RGB frame sequence using an optical flow algorithm.
3) Locate the coordinates of the human joint points using a pose estimation algorithm.
4) Extract the regions where the human joint points are located in the RGB frame sequence and the optical-flow frame sequence according to the above joint coordinates, including the head, shoulders, waist, and elbows.
5) In order to distinguish the RGB sequence from the optical-flow sequence, we use convolutional network models of two different architectures; each network contains five convolutional layers and three fully connected layers. We use the output of the second fully connected layer as the FC feature, i.e. the video frame feature. We uniformly resize the input images to 224 × 224, so that we obtain consistent FC layer features. After aggregating all frame features of a video with min and max pooling operations, we obtain the feature representation of the video.
2. Compute the binary hash feature of the video
Observation shows that the motion characteristics of a video are sometimes distinguished by the transient motions of a few key parts. In order to further capture the motion characteristics of the video, we compute the binary hash feature of the video with the following steps:
1) Similar to the extraction of the deep video features: first cut the video into frames, compute the optical-flow frames, extract the human joint coordinates, and compute the FC features corresponding to the frame sequences of the different joint positions.
2) Different videos have different frame counts |F|. We define the sliding window size as h, and the stride S is |F|/h. A key frame is chosen at every stride, as shown in Fig. 3.
3) For the FC features of the selected key frames, each of 4096 dimensions, we compute adjacent differences and use 0 and 1 to represent the trend of change of each feature. We thus obtain a matrix of size 4096 × h in which each element is 0 or 1. We extract the binary sequence of each row as input and compute its output with formula (3). We have thus obtained the 4096-dimensional binary hash feature of the video.
3. Fuse the deep features and the hash features
For the deep features and binary hash features obtained in steps 1 and 2 above, we first perform feature fusion and then classify with an SVM classifier. The main detailed steps are (a training sketch follows the list):
1) Save the fusion feature obtained by concatenating the deep features and the hash features.
2) Compute the l1 norm and the l2 norm of the feature matrix using the fused features of all action videos.
3) Divide all elements of the feature matrix by the l1 norm and by the l2 norm respectively, obtaining two different normalized features.
4) Define the fusion factor β, and use l1 + β·l2 as the fused normalization norm to obtain another normalized feature.
5) Feed the above normalized features and the corresponding action class labels into the SVM classifier, selecting a linear kernel for training.
6) Train a classifier for each action class: mark the current class as positive samples and all other classes as negative samples, and train multiple classifiers in this one-vs-rest fashion.
7) For the videos of the test set, compute a score with each classifier and select the class with the highest score as the corresponding action class.
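An illustrative scikit-learn sketch of steps 5)-7), using LinearSVC as an assumed stand-in for the linear-kernel SVM; function and variable names are not from the patent:

import numpy as np
from sklearn.svm import LinearSVC

def train_and_predict(train_feats, train_labels, test_feats):
    """Features are the normalized fused vectors; labels are action class ids."""
    classes = np.unique(train_labels)
    classifiers = []
    for cls in classes:
        clf = LinearSVC()  # linear kernel SVM
        clf.fit(train_feats, (train_labels == cls).astype(int))  # one vs rest
        classifiers.append(clf)
    # step 7): score each test video with every classifier, take the arg-max class
    scores = np.stack([clf.decision_function(test_feats) for clf in classifiers],
                      axis=1)
    return classes[scores.argmax(axis=1)]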
An embodiment of the present invention is as follows:
The JHMDB and MPII-Cooking human action databases are used as the experimental databases.
The JHMDB action dataset contains 21 classes of human actions, including combing hair, sitting, standing, running, and waving. Each video contains only one very short clip of 15-40 frames. There are 928 videos and 31,838 annotated frames in total.
The MPII-Cooking action dataset contains a series of high-resolution videos of humans cooking in a kitchen, including actions such as washing dishes, cutting fruit, and washing hands. Each video contains one kind of cooking activity. In total it covers 64 classes of cooking actions, involving 3748 video clips with the same background.
(1) The JHMDB dataset has three different training/test splits at a ratio of 80/20, guaranteed to cover all action classes. The classification accuracy is computed on each test split, and the average over the three splits is used as the evaluation criterion. The specific test results are shown in Figs. 4 and 5. The results obtained with the normalization methods are significantly better than the results of classifying with the original features. Across hash windows of different sizes, l1 normalization is better than l2 normalization in most cases.
We likewise compare the influence of different fusion coefficients β on the l1 + β·l2 normalization under different hash windows. The experimental results are shown in Fig. 6.
(2) We test the classification performance with the same method on the JHMDB and MPII-Cooking datasets. As shown in Table 1, the classification results show that the method fusing deep network features and binary hash features is better than the previous method based on the P-CNN model.
Table 1: Influence of different normalization methods combined with the hash features on the classification results on the JHMDB and MPII-Cooking datasets
The above embodiments should be understood as merely illustrating the present invention rather than limiting its scope. After reading the contents of the present invention, those skilled in the art can make various changes or modifications to it, and these equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.

Claims (8)

1. A human motion recognition method fusing a deep neural network model and binary hashing, characterized by comprising the following steps:
101. Obtain a short video containing a human action, and cut the short video into a video frame sequence;
102. Compute the optical flow maps of adjacent frames in the frame sequence of step 101 using an optical flow algorithm;
103. Obtain the coordinates of the human joint points from the frame sequence of step 101 using a pose estimation algorithm;
104. Crop the RGB and optical-flow region maps of different human body parts using the joint coordinates obtained in step 103, obtaining the RGB frame sequence and the optical-flow frame sequence of the video;
105. Extract the fully connected layer features of each frame in the RGB frame sequence and the optical-flow frame sequence obtained in step 104, using the VGG-16 model of the Oxford Visual Geometry Group and the optical-flow network model; the dimension of this layer's features is 4096;
106. Pool and aggregate the FC features obtained in step 105 to obtain an n × 4096-dimensional video feature representation;
107. Apply l2 normalization to the video features obtained in step 106 and feed them into a linear SVM classifier for classification.
2. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 1, characterized in that step 102 computes the optical flow maps of the adjacent video frames of step 101 using an optical flow algorithm, and specifically includes the steps:
201. Extract the optical flow vectors between each pair of adjacent video frames;
202. Sum the absolute values of the horizontal and vertical components of the resulting optical flow vectors over all pixel points, respectively, obtaining the sums of the optical flow absolute values of the frame in the horizontal and vertical directions;
203. Arrange the optical flow absolute-value sums of all frames in temporal order to generate the horizontal and vertical optical flow sequences of the entire video.
3. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 1, characterized in that the step of selecting key frames from the RGB frame sequence and the optical-flow frame sequence of the video in step 104 includes:
Choose a sliding window size h, and dynamically sample frames at a stride S determined by the number of video frames |F|, extracting their features; f_t denotes a frame in the original video frame sequence, where the original video has T frames in total; f_{t_i} denotes a frame in the selected key frame sequence; key frames are extracted with the method shown in formula (2): one frame is chosen every S frames, h frames in total;
[f_{t_1}, f_{t_2}, ..., f_{t_h}] ⊆ F, where F = [f_1, f_2, ..., f_T]    (1)
4. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 3, characterized in that, in order to distinguish the RGB sequence from the optical-flow sequence, step 105 uses convolutional network models of two different architectures; each network contains five convolutional layers and three fully connected layers; the output of the second fully connected layer is used as the FC feature, i.e. the video frame feature; the input images are uniformly resized to 224 × 224, so that consistent FC layer features are obtained; after all frame features of a video are aggregated with min and max pooling operations, the feature representation of the video is obtained.
5. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 4, characterized in that adjacent differences are computed on the FC features, each of 4096 dimensions, of the selected key frames, and 0 and 1 are used to represent the trend of change of each feature; a matrix of size 4096 × h is thus obtained in which each element is 0 or 1; the binary sequence of each row is extracted as input and its output is computed with formula (3), thus obtaining the 4096-dimensional binary hash feature of the video.
6. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 4, characterized in that step 106 computes the video feature values, specifically: compare the changes of the feature values of two adjacent key frames f_{t_i} and f_{t_{i+1}} on each dimension of the corresponding frame feature vectors f_t^p; an increase is represented by 1 and a decrease by 0; a 4096 × h eigenvalue matrix M is thus obtained whose elements are only 0 or 1; for each row vector [x_{h-1}, x_{h-2}, ..., x_0] of the matrix, its binary hash mapping is computed with the following formula (3), which converts the numeric string composed of 0s and 1s into an unsigned integer;
B2U_w(x) = Σ_{i=0}^{w-1} x_i × 2^i    (3)
The binary hash features of the FC feature changes of the RGB streams and optical-flow frames of the different human body parts are finally obtained.
7. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 6, characterized in that, in addition to l2 normalization, step 107 also uses a fused l1 + β·l2 feature normalization, where l2 denotes second-order normalization of the features, l1 denotes first-order normalization of the features, and β denotes the fusion normalization coefficient; after the features extracted by the deep neural network are finally fused with those obtained by binary hashing, the feature representation p of the video is obtained; since the feature value scales of the different sources differ, all feature values are normalized to one scale before the classifier is applied.
8. The human motion recognition method fusing a deep neural network model and binary hashing according to claim 7, characterized in that the fused l1 + β·l2 normalization is used, i.e.
p = p / (||p||_1 + β·||p||_2)    (4).
CN201711422702.9A 2017-12-25 2017-12-25 Human motion recognition method fusing a deep neural network model and binary hashing Pending CN108108699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711422702.9A CN108108699A (en) 2017-12-25 2017-12-25 Human motion recognition method fusing a deep neural network model and binary hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711422702.9A CN108108699A (en) 2017-12-25 2017-12-25 Human motion recognition method fusing a deep neural network model and binary hashing

Publications (1)

Publication Number Publication Date
CN108108699A true CN108108699A (en) 2018-06-01

Family

ID=62212862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711422702.9A Pending CN108108699A (en) 2017-12-25 2017-12-25 Human motion recognition method fusing a deep neural network model and binary hashing

Country Status (1)

Country Link
CN (1) CN108108699A (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160148391A1 (en) * 2013-06-12 2016-05-26 Agency For Science, Technology And Research Method and system for human motion recognition
CN103384331A (en) * 2013-07-19 2013-11-06 上海交通大学 Video inter-frame forgery detection method based on light stream consistency
CN104469229A (en) * 2014-11-18 2015-03-25 北京恒华伟业科技股份有限公司 Video data storing method and device
CN105989611A (en) * 2015-02-05 2016-10-05 南京理工大学 Blocking perception Hash tracking method with shadow removing
CN105468755A (en) * 2015-11-27 2016-04-06 东方网力科技股份有限公司 Video screening and storing method and device
CN106937114A (en) * 2015-12-30 2017-07-07 株式会社日立制作所 Method and apparatus for being detected to video scene switching
CN105741853A (en) * 2016-01-25 2016-07-06 西南交通大学 Digital speech perception hash method based on formant frequency
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
CN106331524A (en) * 2016-08-18 2017-01-11 无锡天脉聚源传媒科技有限公司 Method and device for recognizing shot cut
CN107169415A (en) * 2017-04-13 2017-09-15 西安电子科技大学 Human motion recognition method based on convolutional neural networks feature coding
CN107403153A (en) * 2017-07-20 2017-11-28 大连大学 A kind of palmprint image recognition methods encoded based on convolutional neural networks and Hash

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GUILHEM CHERON et al.: "P-CNN: Pose-based CNN Features for Action Recognition", 2015 IEEE International Conference on Computer Vision *
RANDAL E. BRYANT et al.: "Computer Systems: A Programmer's Perspective" *
XIUSHAN NIE et al.: "Key-Frame Based Robust Video Hashing Using Isometric Feature Mapping", Journal of Computational Information Systems *
PENG Tianqiang et al.: "Image retrieval method based on deep convolutional neural network and binary hash learning", Journal of Electronics & Information Technology *
WANG Huan: "Encoding and retrieval of motion capture data based on hash learning", Wanfang Database *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086659A (en) * 2018-06-13 2018-12-25 深圳市感动智能科技有限公司 A kind of Human bodys' response method and apparatus based on multimode road Fusion Features
CN109086659B (en) * 2018-06-13 2023-01-31 深圳市感动智能科技有限公司 Human behavior recognition method and device based on multi-channel feature fusion
CN109255284A (en) * 2018-07-10 2019-01-22 西安理工大学 A kind of Activity recognition method of the 3D convolutional neural networks based on motion profile
CN108985223A (en) * 2018-07-12 2018-12-11 天津艾思科尔科技有限公司 A kind of human motion recognition method
CN108985223B (en) * 2018-07-12 2024-05-07 天津艾思科尔科技有限公司 Human body action recognition method
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN108960207B (en) * 2018-08-08 2021-05-11 广东工业大学 Image recognition method, system and related components
CN111104837A (en) * 2018-10-29 2020-05-05 联发科技股份有限公司 Mobile device and related video editing method
CN109858406A (en) * 2019-01-17 2019-06-07 西北大学 A kind of extraction method of key frame based on artis information
CN109858406B (en) * 2019-01-17 2023-04-07 西北大学 Key frame extraction method based on joint point information
CN109918537A (en) * 2019-01-18 2019-06-21 杭州电子科技大学 A kind of method for quickly retrieving of the ship monitor video content based on HBase
CN109918537B (en) * 2019-01-18 2021-05-11 杭州电子科技大学 HBase-based rapid retrieval method for ship monitoring video content
CN109815921A (en) * 2019-01-29 2019-05-28 北京融链科技有限公司 The prediction technique and device of the class of activity in hydrogenation stations
CN110096950B (en) * 2019-03-20 2023-04-07 西北大学 Multi-feature fusion behavior identification method based on key frame
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin
CN110135386A (en) * 2019-05-24 2019-08-16 长沙学院 A kind of human motion recognition method and system based on deep learning
CN112784658A (en) * 2019-11-01 2021-05-11 纬创资通股份有限公司 Method and system for recognizing actions based on atomic gestures and computer readable recording medium
CN111324744A (en) * 2020-02-17 2020-06-23 中山大学 Data enhancement method based on target emotion analysis data set
CN111324744B (en) * 2020-02-17 2023-04-07 中山大学 Data enhancement method based on target emotion analysis data set
CN111666845A (en) * 2020-05-26 2020-09-15 南京邮电大学 Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN111695507A (en) * 2020-06-12 2020-09-22 桂林电子科技大学 Static gesture recognition method based on improved VGGNet network and PCA
CN111695507B (en) * 2020-06-12 2022-08-16 桂林电子科技大学 Static gesture recognition method based on improved VGGNet network and PCA
WO2022012239A1 (en) * 2020-07-16 2022-01-20 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Action recognition method and related device, storage medium
CN112818859A (en) * 2021-02-02 2021-05-18 电子科技大学 Deep hash-based multi-level retrieval pedestrian re-identification method
CN113326724A (en) * 2021-02-07 2021-08-31 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and readable storage medium
CN113326724B (en) * 2021-02-07 2024-02-02 海南长光卫星信息技术有限公司 Remote sensing image change detection method, device, equipment and readable storage medium
CN112560817B (en) * 2021-02-22 2021-07-06 西南交通大学 Human body action recognition method and device, electronic equipment and storage medium
CN112560817A (en) * 2021-02-22 2021-03-26 西南交通大学 Human body action recognition method and device, electronic equipment and storage medium
CN113313030B (en) * 2021-05-31 2023-02-14 华南理工大学 Human behavior identification method based on motion trend characteristics
CN113313030A (en) * 2021-05-31 2021-08-27 华南理工大学 Human behavior identification method based on motion trend characteristics
CN113420612A (en) * 2021-06-02 2021-09-21 深圳中集智能科技有限公司 Production beat calculation method based on machine vision
CN113420612B (en) * 2021-06-02 2022-03-18 深圳中集智能科技有限公司 Production beat calculation method based on machine vision
CN113420719A (en) * 2021-07-20 2021-09-21 北京百度网讯科技有限公司 Method and device for generating motion capture data, electronic equipment and storage medium
CN113326835B (en) * 2021-08-04 2021-10-29 中国科学院深圳先进技术研究院 Action detection method and device, terminal equipment and storage medium
WO2023010758A1 (en) * 2021-08-04 2023-02-09 中国科学院深圳先进技术研究院 Action detection method and apparatus, and terminal device and storage medium
CN113326835A (en) * 2021-08-04 2021-08-31 中国科学院深圳先进技术研究院 Action detection method and device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108108699A (en) Human motion recognition method fusing a deep neural network model and binary hashing
Tu et al. Edge-guided non-local fully convolutional network for salient object detection
CN104143079B (en) The method and system of face character identification
CN108648191B (en) Pest image recognition method based on Bayesian width residual error neural network
CN109523463A (en) A kind of face aging method generating confrontation network based on condition
CN106919951A (en) A kind of Weakly supervised bilinearity deep learning method merged with vision based on click
CN107066973A (en) A kind of video content description method of utilization spatio-temporal attention model
CN104298974B (en) A kind of Human bodys' response method based on deep video sequence
CN109815826A (en) The generation method and device of face character model
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN105469376B (en) The method and apparatus for determining picture similarity
CN108108674A (en) A kind of recognition methods again of the pedestrian based on joint point analysis
CN108520213B (en) Face beauty prediction method based on multi-scale depth
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
Rao et al. Sign Language Recognition System Simulated for Video Captured with Smart Phone Front Camera.
CN103336835B (en) Image retrieval method based on weight color-sift characteristic dictionary
CN111062329B (en) Unsupervised pedestrian re-identification method based on augmented network
CN110378208A (en) A kind of Activity recognition method based on depth residual error network
Gan et al. Facial beauty prediction based on lighted deep convolution neural network with feature extraction strengthened
CN107463954A (en) A kind of template matches recognition methods for obscuring different spectrogram picture
CN106529586A (en) Image classification method based on supplemented text characteristic
CN104063721A (en) Human behavior recognition method based on automatic semantic feature study and screening
CN107563319A (en) Face similarity measurement computational methods between a kind of parent-offspring based on image
CN109034012A (en) First person gesture identification method based on dynamic image and video sequence
CN106203510A (en) A kind of based on morphological feature with the hyperspectral image classification method of dictionary learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180601