Disclosure of Invention
The invention aims to provide an action identification method and system based on gradient boundary graph and multi-mode convolution fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
an action identification method based on gradient boundary graph and multi-mode convolution fusion comprises the following steps:
S1: sample the original video to obtain a representative frame f_p; take f_p, the s frames preceding f_p and the s frames following f_p from the original video to form a continuous frame set S_p = [f_{p-s}, ..., f_p, ..., f_{p+s}]; s is an empirical value, with a value range of 5 to 10; the original video is an original video training sample or an original video to be identified;
S2: calculate the gradient boundary values between every two adjacent frames in S_p to obtain gradient boundary matrices, and obtain a gradient boundary map set from the gradient boundary matrices; the gradient boundary matrices P_t^x and P_t^y respectively denote the gradient boundary matrices between f_t and its subsequent adjacent frame f_{t+1} in the image transverse direction and the image longitudinal direction, t = p-s, p-s+1, ..., p+s-1;
P_t^x is composed of elements P_t^x(u,v), where P_t^x(u,v) = [f_{t+1}(u+1,v) - f_{t+1}(u,v)] - [f_t(u+1,v) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^x(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image transverse direction; f_{t+1}(u+1,v) denotes the gray value of pixel (u+1,v) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u+1,v) denotes the gray value of pixel (u+1,v) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
P_t^y is composed of elements P_t^y(u,v), where P_t^y(u,v) = [f_{t+1}(u,v+1) - f_{t+1}(u,v)] - [f_t(u,v+1) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^y(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image longitudinal direction; f_{t+1}(u,v+1) denotes the gray value of pixel (u,v+1) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u,v+1) denotes the gray value of pixel (u,v+1) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
S3: calculate the inter-frame optical flow between every two adjacent frames in the continuous frame set S_p to obtain an optical flow map set; in the optical flow map set, OF_t^x and OF_t^y denote the inter-frame optical flow between f_t and f_{t+1} in the image transverse direction and the image longitudinal direction respectively, t = p-s, p-s+1, ..., p+s-1;
S4: train a convolutional neural network with the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample; then, taking the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample and of the original video to be identified as input, use the trained convolutional neural network to obtain, for each original video training sample and for the original video to be identified, the representative-frame CNN feature C_rgb, the gradient boundary CNN feature C_gbf and the optical flow CNN feature C_of;
S5: use the C_rgb, C_gbf and C_of of each original video training sample to train the parameters k and b in the fusion formula C_fusion = y_cat * k + b, where k is a convolution kernel parameter, b is a bias parameter and y_cat = [C_gbf, C_rgb, C_of]; then use the trained fusion formula to fuse the C_rgb, C_gbf and C_of of the original video to be identified to obtain the fusion feature C_fusion;
S6: based on the fusion feature C_fusion, perform action recognition with an action classification algorithm.
Secondly, an action recognition system based on gradient boundary graph and multi-mode convolution fusion comprises:
a continuous frame set forming module, configured to sample the original video to obtain a representative frame f_p and to take f_p, the s frames preceding f_p and the s frames following f_p from the original video to form a continuous frame set S_p = [f_{p-s}, ..., f_p, ..., f_{p+s}]; s is an empirical value, with a value range of 5 to 10; the original video is an original video training sample or an original video to be identified;
a gradient boundary map set obtaining module, configured to calculate the gradient boundary values between every two adjacent frames in S_p to obtain gradient boundary matrices and to obtain a gradient boundary map set from the gradient boundary matrices; the gradient boundary matrices P_t^x and P_t^y respectively denote the gradient boundary matrices between f_t and its subsequent adjacent frame f_{t+1} in the image transverse direction and the image longitudinal direction, t = p-s, p-s+1, ..., p+s-1;
P_t^x is composed of elements P_t^x(u,v), where P_t^x(u,v) = [f_{t+1}(u+1,v) - f_{t+1}(u,v)] - [f_t(u+1,v) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^x(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image transverse direction; f_{t+1}(u+1,v) denotes the gray value of pixel (u+1,v) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u+1,v) denotes the gray value of pixel (u+1,v) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
P_t^y is composed of elements P_t^y(u,v), where P_t^y(u,v) = [f_{t+1}(u,v+1) - f_{t+1}(u,v)] - [f_t(u,v+1) - f_t(u,v)]; (u,v) denotes pixel coordinates; P_t^y(u,v) denotes the gradient boundary value of pixel (u,v) of f_t in the image longitudinal direction; f_{t+1}(u,v+1) denotes the gray value of pixel (u,v+1) in f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in f_{t+1}; f_t(u,v+1) denotes the gray value of pixel (u,v+1) in f_t; f_t(u,v) denotes the gray value of pixel (u,v) in f_t;
an optical flow map set acquisition module, configured to calculate the inter-frame optical flow between every two adjacent frames in the continuous frame set S_p to obtain an optical flow map set; in the optical flow map set, OF_t^x and OF_t^y denote the inter-frame optical flow between f_t and f_{t+1} in the image transverse direction and the image longitudinal direction respectively, t = p-s, p-s+1, ..., p+s-1;
a CNN feature identification module, configured to train a convolutional neural network with the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample, and then, taking the representative frame, the gradient boundary map set and the optical flow map set of each original video training sample and of the original video to be identified as input, to obtain with the trained convolutional neural network, for each original video training sample and for the original video to be identified, the representative-frame CNN feature C_rgb, the gradient boundary CNN feature C_gbf and the optical flow CNN feature C_of;
a fusion module, configured to use the C_rgb, C_gbf and C_of of each original video training sample to train the parameters k and b in the fusion formula C_fusion = y_cat * k + b, where k is a convolution kernel parameter, b is a bias parameter and y_cat = [C_gbf, C_rgb, C_of], and to fuse the C_rgb, C_gbf and C_of of the original video to be identified with the trained fusion formula to obtain the fusion feature C_fusion;
an action recognition module, configured to perform action recognition with an action classification algorithm based on the fusion feature C_fusion.
Compared with the prior art, the invention has the beneficial effects that:
The gradient boundary map, which carries important spatiotemporal motion information, is introduced, and a multi-mode data convolution fusion method is provided; this ensures the consistency of multi-mode spatiotemporal feature fusion, improves the accuracy with which human motion in video is described, and thereby raises the human action recognition rate.
Detailed Description
To help those of ordinary skill in the art understand and implement the invention, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are only intended to illustrate and explain the invention and are not intended to limit it.
Referring to fig. 1, an action identification method based on a gradient boundary graph and multi-mode convolution fusion provided by the embodiment of the present invention specifically includes the following steps:
step 1: for original video F ═ F1,…,fi,…,fn]Sampling to obtain frame image fpRepresentative frame f as a representative frame of the original videopAnd the former S frame image and the latter S frame image form a continuous frame set Sp=[fp-s,…,fp,…,fp+s]. Wherein f isiThe image of the ith frame in the original video is represented, i is 1,2, …, n is the total frame number of the image in the original video; s is an empirical value, and the preferable value range is 5-10. In this embodiment, S is 5, and the obtained continuous frame set is denoted as Sp=[fp-5,…,fp,…,fp+5]。
The acquisition of the representative frame may be achieved using techniques customary in the art, for example, but not limited to, random sampling.
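For illustration only, a minimal Python sketch of Step 1 is given below; it assumes the frames are decoded with OpenCV, converted to gray scale (the gradient boundary values of Step 2 are defined on gray values), and that the representative frame is chosen by random sampling as mentioned above. The function name and parameter defaults are assumptions introduced here, not part of the claimed method.

```python
import random
import cv2  # assumption: frames are decoded with OpenCV

def build_continuous_frame_set(video_path, s=5):
    """Sample a representative frame f_p and return S_p = [f_{p-s}, ..., f_{p+s}]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        # gray-scale frames, since the gradient boundary values are defined on gray values
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        ok, frame = cap.read()
    cap.release()

    n = len(frames)
    # random sampling of the representative frame, restricted so that s frames
    # exist on both sides of f_p
    p = random.randint(s, n - s - 1)
    S_p = frames[p - s : p + s + 1]  # 2s + 1 consecutive frames
    return p, S_p
```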
Step 2: based on the continuous frame set S_p, calculate the gradient boundary values between every two adjacent frames in S_p to obtain the gradient boundary matrices, where P_t^x and P_t^y respectively denote the gradient boundary matrices between f_t and its subsequent adjacent frame f_{t+1} in the image transverse direction (X direction) and the image longitudinal direction (Y direction), t = p-5, p-4, ..., p+4.
Each element P_t^x(u,v) of P_t^x is calculated as follows:
P_t^x(u,v) = [f_{t+1}(u+1,v) - f_{t+1}(u,v)] - [f_t(u+1,v) - f_t(u,v)]   (1)
where (u,v) denotes pixel coordinates; P_t^x(u,v) denotes the gradient boundary value of pixel (u,v) of frame image f_t in the X direction; f_{t+1}(u+1,v) denotes the gray value of pixel (u+1,v) in frame image f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in frame image f_{t+1}; f_t(u+1,v) denotes the gray value of pixel (u+1,v) in frame image f_t; f_t(u,v) denotes the gray value of pixel (u,v) in frame image f_t.
Correspondingly, each element P_t^y(u,v) of P_t^y is calculated as follows:
P_t^y(u,v) = [f_{t+1}(u,v+1) - f_{t+1}(u,v)] - [f_t(u,v+1) - f_t(u,v)]   (2)
where (u,v) denotes pixel coordinates; P_t^y(u,v) denotes the gradient boundary value of pixel (u,v) of frame image f_t in the Y direction; f_{t+1}(u,v+1) denotes the gray value of pixel (u,v+1) in frame image f_{t+1}; f_{t+1}(u,v) denotes the gray value of pixel (u,v) in frame image f_{t+1}; f_t(u,v+1) denotes the gray value of pixel (u,v+1) in frame image f_t; f_t(u,v) denotes the gray value of pixel (u,v) in frame image f_t.
The values of each P_t^x and P_t^y are then linearly scaled to integers in [0, 255], yielding the gradient boundary map set GBF.
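The following NumPy sketch illustrates equations (1) and (2) together with the linear scaling to [0, 255]; it assumes the frames of S_p are stored as gray-scale arrays indexed [v, u] (rows, columns), and the helper names are assumptions introduced for illustration.

```python
import numpy as np

def scale_to_uint8(m):
    """Linearly rescale a real-valued map to integer values in [0, 255]."""
    m = m - m.min()
    peak = m.max()
    if peak > 0:
        m = m * (255.0 / peak)
    return m.astype(np.uint8)

def gradient_boundary_maps(S_p):
    """P_t^x and P_t^y of eqs. (1)-(2) for every adjacent pair in S_p,
    each linearly scaled to [0, 255]."""
    gbf = []
    for f_t, f_t1 in zip(S_p[:-1], S_p[1:]):
        f_t = f_t.astype(np.float32)
        f_t1 = f_t1.astype(np.float32)
        # eq. (1): X (transverse) direction, i.e. differences along the column axis u
        p_x = np.diff(f_t1, axis=1) - np.diff(f_t, axis=1)
        # eq. (2): Y (longitudinal) direction, i.e. differences along the row axis v
        p_y = np.diff(f_t1, axis=0) - np.diff(f_t, axis=0)
        gbf.append((scale_to_uint8(p_x), scale_to_uint8(p_y)))
    return gbf
```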
Step 3: compute the inter-frame optical flow between every two adjacent frames in the continuous frame set S_p and linearly scale it to integers in [0, 255], obtaining the optical flow map set OF, where OF_t^x and OF_t^y respectively denote the inter-frame optical flow between f_t and f_{t+1} in the X direction and the Y direction, t = p-5, p-4, ..., p+4.
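The embodiment does not prescribe a particular optical flow algorithm; purely as an example, the dense Farneback method available in OpenCV could be used, as sketched below (the algorithm choice and its parameter values are assumptions).

```python
import cv2
import numpy as np

def _to_uint8(m):
    """Linearly rescale a real-valued map to integers in [0, 255]."""
    m = m - m.min()
    peak = m.max()
    if peak > 0:
        m = m * (255.0 / peak)
    return m.astype(np.uint8)

def optical_flow_maps(S_p):
    """X/Y inter-frame optical flow for each adjacent pair in S_p, scaled to [0, 255]."""
    of = []
    for f_t, f_t1 in zip(S_p[:-1], S_p[1:]):
        # Farneback dense flow; positional arguments are
        # (prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
        flow = cv2.calcOpticalFlowFarneback(f_t, f_t1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        of.append(_to_uint8(flow[..., 0]))  # OF_t^x: transverse (X) component
        of.append(_to_uint8(flow[..., 1]))  # OF_t^y: longitudinal (Y) component
    return of
```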
Step 4: using the multi-mode data as input, learn features separately with a convolutional neural network (CNN) to obtain the gradient boundary CNN feature C_gbf, the representative-frame CNN feature C_rgb and the optical flow CNN feature C_of. The multi-mode data comprise the gradient boundary map set GBF, the representative frame f_p and the optical flow map set OF.
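No specific CNN architecture is fixed by the embodiment. As one hedged possibility, the three modes could be fed to three independent backbones (ResNet-18 is assumed here), with the 2s pairs of X/Y maps of the GBF and OF sets resized to a common resolution and stacked into the channel dimension; the PyTorch sketch below reflects these assumptions only.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_backbone(in_channels):
    """ResNet-18 whose first convolution accepts `in_channels` input planes and
    whose classifier head is removed, so it returns a 512-dimensional feature."""
    net = models.resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Identity()
    return net

class MultiModeCNN(nn.Module):
    """Three independent streams for the representative frame, the gradient
    boundary map set (GBF) and the optical flow map set (OF)."""
    def __init__(self, s=5):
        super().__init__()
        self.rgb_net = make_backbone(3)          # representative frame f_p (RGB)
        self.gbf_net = make_backbone(2 * 2 * s)  # 2s adjacent pairs x (X, Y) maps
        self.of_net = make_backbone(2 * 2 * s)   # 2s adjacent pairs x (X, Y) flows

    def forward(self, frame, gbf, of):
        # each output is a (batch, 512) feature vector: C_rgb, C_gbf, C_of
        return self.rgb_net(frame), self.gbf_net(gbf), self.of_net(of)
```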
And 5: for the gradient boundary CNN feature CgbfOriginal frame CNN feature CrgbAnd optical flow CNN feature CofPerforming multimode CNN feature fusion to obtain fusion feature Cfusion。
The fusion formula is:
Cfusion=ycat*k+b (3)
wherein k is a convolution kernel parameter; b is a bias parameter; y iscat=[Cgbf,Crgb,Cof]. The convolution kernel parameter k and the bias parameter b are obtained in the CNN parameter training process.
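With vector-valued features, the fusion of equation (3) can be realised as a 1x1 convolution over the concatenated feature y_cat; the PyTorch sketch below assumes 512-dimensional features from each stream and a 512-dimensional fused feature, both of which are assumptions rather than values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Fusion of eq. (3): C_fusion = y_cat * k + b, with y_cat = [C_gbf, C_rgb, C_of]."""
    def __init__(self, feat_dim=512, fused_dim=512):
        super().__init__()
        # the kernel k and bias b are the trainable parameters of this 1x1 convolution
        self.fuse = nn.Conv1d(3 * feat_dim, fused_dim, kernel_size=1, bias=True)

    def forward(self, c_gbf, c_rgb, c_of):
        y_cat = torch.cat([c_gbf, c_rgb, c_of], dim=1)  # (batch, 3*feat_dim)
        y_cat = y_cat.unsqueeze(-1)                     # (batch, 3*feat_dim, 1)
        return self.fuse(y_cat).squeeze(-1)             # C_fusion: (batch, fused_dim)
```

The weight k and bias b of this layer correspond to the parameters obtained in the training stage described below.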
Step 6: based on the fusion feature C_fusion, perform action recognition with an action classification algorithm.
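The action classification algorithm is likewise left open; one possible illustration, assuming a linear SVM from scikit-learn trained on the fusion features of the training samples, is sketched below.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_action_classifier(fusion_features, labels):
    """fusion_features: (num_videos, fused_dim) array of C_fusion vectors from
    the training samples; labels: the corresponding action classes."""
    clf = LinearSVC()
    clf.fit(fusion_features, labels)
    return clf

def recognize_action(clf, c_fusion):
    """Predict the action class of one video to be identified from its C_fusion."""
    return clf.predict(np.asarray(c_fusion).reshape(1, -1))[0]
```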
The method of the invention is divided into two stages: training and action recognition. In the training stage, the weight parameters of the CNN, the convolution kernel parameter k and the bias parameter b are trained with the training samples. In the action recognition stage, the trained CNN and the trained fusion formula are used to extract the fusion feature, and the classification result is given based on that fusion feature.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.