Disclosure of Invention
The invention aims to provide an action identification method and system based on gradient boundary graph and multi-mode convolution fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
an action identification method based on gradient boundary graph and multi-mode convolution fusion comprises the following steps:
S1: sample the original video to obtain a representative frame f_p; take f_p, the s frames preceding f_p and the s frames following f_p from the original video to form a continuous frame set S_p = [f_{p-s}, ..., f_p, ..., f_{p+s}]; s is an empirical value, with a value range of 5 to 10; the original video is an original video training sample or an original video to be identified;
S2: calculate the gradient boundary values between every two adjacent frames in S_p to obtain gradient boundary matrices, and obtain a gradient boundary map set from the gradient boundary matrices; the gradient boundary matrices P_t^x and P_t^y respectively denote the gradient boundary matrices between f_t and its subsequent adjacent frame f_{t+1} in the image transverse direction and the image longitudinal direction, t = p-s, p-s+1, ..., p+s-1;
P_t^x is composed of elements P_t^x(u,v), where P_t^x(u,v) = [f_{t+1}(u+1,v) - f_{t+1}(u,v)] - [f_t(u+1,v) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^x(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image transverse direction; f_{t+1}(u+1,v) denotes the gray value of pixel (u+1,v) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u+1,v) denotes the gray value of pixel (u+1,v) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
P_t^y is composed of elements P_t^y(u,v), where P_t^y(u,v) = [f_{t+1}(u,v+1) - f_{t+1}(u,v)] - [f_t(u,v+1) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^y(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image longitudinal direction; f_{t+1}(u,v+1) denotes the gray value of pixel (u,v+1) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u,v+1) denotes the gray value of pixel (u,v+1) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
S3: calculate the inter-frame optical flow between every two adjacent frames in the continuous frame set S_p to obtain an optical flow map set; in the optical flow map set, OF_t^x and OF_t^y denote the inter-frame optical flow between f_t and f_{t+1} in the image transverse direction and the image longitudinal direction respectively, t = p-s, p-s+1, ..., p+s-1;
S4: train a convolutional neural network with the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample; then, taking the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample and of the original video to be identified as input, use the trained convolutional neural network to obtain, for each original video training sample and for the original video to be identified, the representative-frame CNN feature C_rgb, the gradient boundary CNN feature C_gbf and the optical flow CNN feature C_of;
S5: use the C_rgb, C_gbf and C_of of each original video training sample to train the parameters k and b in the fusion formula C_fusion = y_cat * k + b, where k is a convolution kernel parameter, b is a bias parameter and y_cat = [C_gbf, C_rgb, C_of]; then use the trained fusion formula to fuse the C_rgb, C_gbf and C_of of the original video to be identified to obtain the fusion feature C_fusion;
S6: based on the fusion feature C_fusion, perform action recognition with an action classification algorithm.
Secondly, an action recognition system based on gradient boundary graph and multi-mode convolution fusion comprises:
a continuous frame set forming module, configured to sample the original video to obtain a representative frame f_p and to take f_p, the s frames preceding f_p and the s frames following f_p from the original video to form a continuous frame set S_p = [f_{p-s}, ..., f_p, ..., f_{p+s}]; s is an empirical value, with a value range of 5 to 10; the original video is an original video training sample or an original video to be identified;
a gradient boundary map set obtaining module, configured to calculate the gradient boundary values between every two adjacent frames in S_p to obtain gradient boundary matrices and to obtain a gradient boundary map set from the gradient boundary matrices; the gradient boundary matrices P_t^x and P_t^y respectively denote the gradient boundary matrices between f_t and its subsequent adjacent frame f_{t+1} in the image transverse direction and the image longitudinal direction, t = p-s, p-s+1, ..., p+s-1;
P_t^x is composed of elements P_t^x(u,v), where P_t^x(u,v) = [f_{t+1}(u+1,v) - f_{t+1}(u,v)] - [f_t(u+1,v) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^x(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image transverse direction; f_{t+1}(u+1,v) denotes the gray value of pixel (u+1,v) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u+1,v) denotes the gray value of pixel (u+1,v) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
P_t^y is composed of elements P_t^y(u,v), where P_t^y(u,v) = [f_{t+1}(u,v+1) - f_{t+1}(u,v)] - [f_t(u,v+1) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^y(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image longitudinal direction; f_{t+1}(u,v+1) denotes the gray value of pixel (u,v+1) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u,v+1) denotes the gray value of pixel (u,v+1) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
an optical flow map set acquisition module, configured to calculate the inter-frame optical flow between every two adjacent frames in the continuous frame set S_p to obtain an optical flow map set; in the optical flow map set, OF_t^x and OF_t^y denote the inter-frame optical flow between f_t and f_{t+1} in the image transverse direction and the image longitudinal direction respectively, t = p-s, p-s+1, ..., p+s-1;
a CNN feature identification module, configured to train a convolutional neural network with the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample, and then, taking the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample and of the original video to be identified as input, to obtain with the trained convolutional neural network, for each original video training sample and for the original video to be identified, the representative-frame CNN feature C_rgb, the gradient boundary CNN feature C_gbf and the optical flow CNN feature C_of;
a fusion module, configured to use the C_rgb, C_gbf and C_of of each original video training sample to train the parameters k and b in the fusion formula C_fusion = y_cat * k + b, where k is a convolution kernel parameter, b is a bias parameter and y_cat = [C_gbf, C_rgb, C_of], and to fuse the C_rgb, C_gbf and C_of of the original video to be identified with the trained fusion formula to obtain the fusion feature C_fusion;
an action recognition module, configured to perform action recognition with an action classification algorithm based on the fusion feature C_fusion.
Compared with the prior art, the invention has the beneficial effects that:
The gradient boundary map, which carries important spatiotemporal motion information, is introduced, and a multi-mode data convolution fusion method is provided; this ensures the consistency of multi-mode spatiotemporal feature fusion, improves the accuracy with which human motion in video is described, and thereby raises the human action recognition rate.
Detailed Description
To help those of ordinary skill in the art understand and implement the invention, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are only intended to illustrate and explain the invention and are not intended to limit it.
Referring to fig. 1, an action identification method based on a gradient boundary graph and multi-mode convolution fusion provided by the embodiment of the present invention specifically includes the following steps:
step 1: for original video F ═ F1,…,fi,…,fn]Sampling to obtain frame image fpRepresentative frame f as a representative frame of the original videopAnd the former S frame image and the latter S frame image form a continuous frame set Sp=[fp-s,…,fp,…,fp+s]. Wherein f isiThe image of the ith frame in the original video is represented, i is 1,2, …, n is the total frame number of the image in the original video; s is an empirical value, and the preferable value range is 5-10. In this embodiment, S is 5, and the obtained continuous frame set is denoted as Sp=[fp-5,…,fp,…,fp+5]。
The acquisition of the representative frame may be achieved using techniques customary in the art, for example, but not limited to, random sampling.
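For illustration only, a minimal Python sketch of Step 1 is given below; it assumes the frames are decoded with OpenCV, converted to gray scale (the gradient boundary values of Step 2 are defined on gray values), and that the representative frame is chosen by random sampling as mentioned above. The function name and parameter defaults are assumptions introduced here, not part of the claimed method.

```python
import random
import cv2  # assumption: frames are decoded with OpenCV

def build_continuous_frame_set(video_path, s=5):
    """Sample a representative frame f_p and return S_p = [f_{p-s}, ..., f_{p+s}]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        # gray-scale frames, since the gradient boundary values are defined on gray values
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        ok, frame = cap.read()
    cap.release()

    n = len(frames)
    # random sampling of the representative frame, restricted so that s frames
    # exist on both sides of f_p
    p = random.randint(s, n - s - 1)
    S_p = frames[p - s : p + s + 1]  # 2s + 1 consecutive frames
    return p, S_p
```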
Step 2: based on the continuous frame set S_p, calculate the gradient boundary values between every two adjacent frames in S_p to obtain the gradient boundary matrices, where P_t^x and P_t^y respectively denote the gradient boundary matrices between f_t and its subsequent adjacent frame f_{t+1} in the image transverse direction (X direction) and the image longitudinal direction (Y direction), t = p-5, p-4, ..., p+4.
Each element P_t^x(u,v) of P_t^x is calculated as follows:
P_t^x(u,v) = [f_{t+1}(u+1,v) - f_{t+1}(u,v)] - [f_t(u+1,v) - f_t(u,v)]   (1)
where (u,v) denotes pixel coordinates; P_t^x(u,v) denotes the gradient boundary value of pixel (u,v) of frame image f_t in the X direction; f_{t+1}(u+1,v) denotes the gray value of pixel (u+1,v) in frame image f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in frame image f_{t+1}; f_t(u+1,v) denotes the gray value of pixel (u+1,v) in frame image f_t; f_t(u,v) denotes the gray value of pixel (u,v) in frame image f_t.
Correspondingly, each element P_t^y(u,v) of P_t^y is calculated as follows:
P_t^y(u,v) = [f_{t+1}(u,v+1) - f_{t+1}(u,v)] - [f_t(u,v+1) - f_t(u,v)]   (2)
where (u,v) denotes pixel coordinates; P_t^y(u,v) denotes the gradient boundary value of pixel (u,v) of frame image f_t in the Y direction; f_{t+1}(u,v+1) denotes the gray value of pixel (u,v+1) in frame image f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in frame image f_{t+1}; f_t(u,v+1) denotes the gray value of pixel (u,v+1) in frame image f_t; f_t(u,v) denotes the gray value of pixel (u,v) in frame image f_t.
The values of each P_t^x and P_t^y are then linearly scaled to integers in [0, 255], yielding the gradient boundary map set GBF.
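The following NumPy sketch illustrates equations (1) and (2) together with the linear scaling to [0, 255]; it assumes the frames of S_p are stored as gray-scale arrays indexed [v, u] (rows, columns), and the helper names are assumptions introduced for illustration.

```python
import numpy as np

def scale_to_uint8(m):
    """Linearly rescale a real-valued map to integer values in [0, 255]."""
    m = m - m.min()
    peak = m.max()
    if peak > 0:
        m = m * (255.0 / peak)
    return m.astype(np.uint8)

def gradient_boundary_maps(S_p):
    """P_t^x and P_t^y of eqs. (1)-(2) for every adjacent pair in S_p,
    each linearly scaled to [0, 255]."""
    gbf = []
    for f_t, f_t1 in zip(S_p[:-1], S_p[1:]):
        f_t = f_t.astype(np.float32)
        f_t1 = f_t1.astype(np.float32)
        # eq. (1): X (transverse) direction, i.e. differences along the column axis u
        p_x = np.diff(f_t1, axis=1) - np.diff(f_t, axis=1)
        # eq. (2): Y (longitudinal) direction, i.e. differences along the row axis v
        p_y = np.diff(f_t1, axis=0) - np.diff(f_t, axis=0)
        gbf.append((scale_to_uint8(p_x), scale_to_uint8(p_y)))
    return gbf
```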
Step 3: compute the inter-frame optical flow between every two adjacent frames in the continuous frame set S_p and linearly scale it to integers in [0, 255], obtaining the optical flow map set OF, where OF_t^x and OF_t^y respectively denote the inter-frame optical flow between f_t and f_{t+1} in the X direction and the Y direction, t = p-5, p-4, ..., p+4.
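The embodiment does not prescribe a particular optical flow algorithm; purely as an example, the dense Farneback method available in OpenCV could be used, as sketched below (the algorithm choice and its parameter values are assumptions).

```python
import cv2
import numpy as np

def _to_uint8(m):
    """Linearly rescale a real-valued map to integers in [0, 255]."""
    m = m - m.min()
    peak = m.max()
    if peak > 0:
        m = m * (255.0 / peak)
    return m.astype(np.uint8)

def optical_flow_maps(S_p):
    """X/Y inter-frame optical flow for each adjacent pair in S_p, scaled to [0, 255]."""
    of = []
    for f_t, f_t1 in zip(S_p[:-1], S_p[1:]):
        # Farneback dense flow; positional arguments are
        # (prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
        flow = cv2.calcOpticalFlowFarneback(f_t, f_t1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        of.append(_to_uint8(flow[..., 0]))  # OF_t^x: transverse (X) component
        of.append(_to_uint8(flow[..., 1]))  # OF_t^y: longitudinal (Y) component
    return of
```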
Step 4: using the multi-mode data as input, learn features separately with a convolutional neural network (CNN) to obtain the gradient boundary CNN feature C_gbf, the representative-frame CNN feature C_rgb and the optical flow CNN feature C_of. The multi-mode data comprise the gradient boundary map set GBF, the representative frame f_p and the optical flow map set OF.
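No specific CNN architecture is fixed by the embodiment. As one hedged possibility, the three modes could be fed to three independent backbones (ResNet-18 is assumed here), with the 2s pairs of X/Y maps of the GBF and OF sets resized to a common resolution and stacked into the channel dimension; the PyTorch sketch below reflects these assumptions only.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_backbone(in_channels):
    """ResNet-18 whose first convolution accepts `in_channels` input planes and
    whose classifier head is removed, so it returns a 512-dimensional feature."""
    net = models.resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Identity()
    return net

class MultiModeCNN(nn.Module):
    """Three independent streams for the representative frame, the gradient
    boundary map set (GBF) and the optical flow map set (OF)."""
    def __init__(self, s=5):
        super().__init__()
        self.rgb_net = make_backbone(3)          # representative frame f_p (RGB)
        self.gbf_net = make_backbone(2 * 2 * s)  # 2s adjacent pairs x (X, Y) maps
        self.of_net = make_backbone(2 * 2 * s)   # 2s adjacent pairs x (X, Y) flows

    def forward(self, frame, gbf, of):
        # each output is a (batch, 512) feature vector: C_rgb, C_gbf, C_of
        return self.rgb_net(frame), self.gbf_net(gbf), self.of_net(of)
```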
And 5: for the gradient boundary CNN feature CgbfOriginal frame CNN feature CrgbAnd optical flow CNN feature CofPerforming multimode CNN feature fusion to obtain fusion feature Cfusion。
The fusion formula is:
Cfusion=ycat*k+b (3)
wherein k is a convolution kernel parameter; b is a bias parameter; y iscat=[Cgbf,Crgb,Cof]. The convolution kernel parameter k and the bias parameter b are obtained in the CNN parameter training process.
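With vector-valued features, the fusion of equation (3) can be realised as a 1x1 convolution over the concatenated feature y_cat; the PyTorch sketch below assumes 512-dimensional features from each stream and a 512-dimensional fused feature, both of which are assumptions rather than values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Fusion of eq. (3): C_fusion = y_cat * k + b, with y_cat = [C_gbf, C_rgb, C_of]."""
    def __init__(self, feat_dim=512, fused_dim=512):
        super().__init__()
        # the kernel k and bias b are the trainable parameters of this 1x1 convolution
        self.fuse = nn.Conv1d(3 * feat_dim, fused_dim, kernel_size=1, bias=True)

    def forward(self, c_gbf, c_rgb, c_of):
        y_cat = torch.cat([c_gbf, c_rgb, c_of], dim=1)  # (batch, 3*feat_dim)
        y_cat = y_cat.unsqueeze(-1)                     # (batch, 3*feat_dim, 1)
        return self.fuse(y_cat).squeeze(-1)             # C_fusion: (batch, fused_dim)
```

The weight k and bias b of this layer correspond to the parameters obtained in the training stage described below.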
Step 6: based on the fusion feature C_fusion, perform action recognition with an action classification algorithm.
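The action classification algorithm is likewise left open; one possible illustration, assuming a linear SVM from scikit-learn trained on the fusion features of the training samples, is sketched below.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_action_classifier(fusion_features, labels):
    """fusion_features: (num_videos, fused_dim) array of C_fusion vectors from
    the training samples; labels: the corresponding action classes."""
    clf = LinearSVC()
    clf.fit(fusion_features, labels)
    return clf

def recognize_action(clf, c_fusion):
    """Predict the action class of one video to be identified from its C_fusion."""
    return clf.predict(np.asarray(c_fusion).reshape(1, -1))[0]
```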
The method of the invention is divided into two stages: training and action recognition. In the training stage, the weight parameters of the CNN, the convolution kernel parameter k and the bias parameter b are trained with the training samples. In the action recognition stage, the trained CNN and the trained fusion formula are used to extract the fusion feature, and the classification result is given based on that fusion feature.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.