CN111626245B - Human behavior identification method based on video key frame - Google Patents


Info

Publication number
CN111626245B
CN111626245B (application number CN202010482943.8A)
Authority
CN
China
Prior art keywords
video
segment
neural network
double
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010482943.8A
Other languages
Chinese (zh)
Other versions
CN111626245A (en)
Inventor
丁转莲
盛漫锦
石家毓
孙登第
吴心悦
沈心怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202010482943.8A
Publication of CN111626245A
Application granted
Publication of CN111626245B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human behavior recognition method based on video key frames, which comprises the following steps: acquiring a classified video set; dividing the video into segments based on information content; constructing a two-stream convolutional neural network and training it with randomly sampled inputs; extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method; and inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result. The invention improves recognition accuracy by improving the quality of the frames to be examined; the two-stream convolutional neural network greatly improves the utilization of temporal information in the video and effectively improves the accuracy of behavior recognition; the segment division is solved with a greedy algorithm and can be completed with a simple conditional loop, making the method simple and accurate; for the temporal (optical flow) features, the invention uses the 5 frames with the largest contribution, selected via the reconstruction matrix, as optical flow key frames instead of the traditional consecutive frames.

Description

Human behavior recognition method based on video key frame
Technical Field
The invention relates to the technical field of machine learning, in particular to a human behavior identification method based on video key frames.
Background
In the age of rapid Internet development, the amount of video data in daily life grows explosively, and videos carry a large amount of digital information. Compared with behavior recognition on still images, behavior recognition on video suffers from a large computational load and redundant information, but it has greater practical significance and wider application prospects, such as human-computer interaction, intelligent surveillance and video classification. Improving the accuracy of video behavior recognition is therefore an important and difficult problem.
Since video consists of consecutive still images, it contains redundancy, and the video information must be screened. How to screen the video information and which kind of network to use for recognition are important directions for improving the accuracy of video behavior recognition. Current recognition methods either feed the video directly into a neural network or divide the video into equal-length segments and extract key frames for recognition. However, these methods are computationally intensive and the extracted key frames are not representative.
Disclosure of Invention
The invention aims to provide a human behavior recognition method based on video key frames, which divides a video into segments according to the information content of the segments, extracts useful key frames from the segments, and uses these key frames together with a spatio-temporal two-stream network to recognize behaviors more quickly and effectively.
In order to achieve the purpose, the invention adopts the following technical scheme: a human behavior recognition method based on video key frames comprises the following steps:
(1) Acquiring a classified video set: downloading a UCF101 video set, wherein the video set comprises 13320 videos and 101 types of actions as a data set for action behavior recognition, and split1 is an experimental video set; selecting 25 videos with the serial numbers of 1 from each type of videos in the video set as training videos, and selecting 5 videos with the serial numbers of 2 as test videos;
(2) Dividing the video into segments based on information content: detecting the behavior subject of the video, calculating the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, the division into N segments being carried out with a greedy algorithm;
(3) Constructing a two-stream convolutional neural network and training it with randomly sampled inputs;
(4) Extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method;
(5) Inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: calculating the motion information amount M of the whole video, namely:
Figure GDA0004044502380000021
in the formula, flow 2 (x, y, c) represents the corresponding two-channel optical flow image within the segment, c represents the channel of the optical flow;
dividing video into N segments to obtain average information amount
Figure GDA0004044502380000022
For M/N, minimize the variance of the segments
Figure GDA0004044502380000023
In the formula, M i For the motion information amount of the ith segment of the video, based on the comparison result>
Figure GDA0004044502380000024
Is the average segment information content;
approximate solution is solved by minimizing the segment variance using a greedy algorithm: the method comprises the steps of carrying out segment division on a training video (No. 1) and a testing video (No. 2), firstly carrying out frame sampling on the videos to obtain a video frame set, initializing N divided segments, wherein the content of the divided segments is empty, and calculating the motion information amount M of the ith divided segment i If is compared
Figure GDA0004044502380000025
Small, adding the first frame of the video frame set to the segment and deleting it in the video frame set, calculating the information content of the segment again until the motion information content of the segment is greater than the average information content for the first time>
Figure GDA0004044502380000026
And finishing the division of the segment i, calculating the (i + 1) th division segment, repeating the greedy algorithm until the video frame set is empty, and finishing the division of the whole video.
The step (3) specifically comprises the following steps: the double-current convolution neural network comprises a spatial feature network and a time feature network, and the construction of the double-current convolution neural network refers to: the spatial feature network adopts a BN-inclusion structure, the temporal feature network sums convolution kernel parameters of a first convolution layer along channels on the basis of the BN-inclusion structure, the obtained parameters are divided by the number of the target channels, the parameters are copied and superposed along the channels to serve as parameters of a new conv1 layer, the input of the network is 10 fixed channels, and 5 frames of optical flow images are stacked in the xy direction to obtain the network;
the training of the two-stream convolutional neural network means: for each training video, a single video frame and a short frame sequence are randomly selected from each divided segment as inputs to the two-stream convolutional neural network; the spatio-temporal features obtained for each segment, i.e. the convolutional layer outputs, are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back propagation is performed; after all training videos have been processed, the final parameters of the two-stream convolutional neural network are obtained.
The step (4) specifically comprises:
(4a) Extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of the segment;
(4b) Solving for the reconstruction coefficients of the high-level semantic matrix X by an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Summing each row of the obtained reconstruction coefficient matrix W; since the sum of the ith row of W reflects the importance of the ith frame of the video segment, sorting the rows in descending order of their sums, selecting the frame with the largest row sum as the spatial feature of the video segment and the first five frames as the temporal feature of the video segment;
(4d) Repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
The step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the obtained classification results, and obtaining the prediction through an argmax function to complete the behavior recognition.
The step (4 b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as:

$$\min_{W}\ \|X - XW\|_{2,1} + \gamma\,\|W\|_{2,1}\qquad \text{s.t.}\ \mathbf{1}^{T}W = \mathbf{1}^{T}$$

wherein X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two $L_{2,1}$ norm terms. To optimize the formula for W, the relaxation constraints W = C and X − XW = E are added, and the formula is converted into an ALM objective, formula (3), in which the constraints W − C = 0, $\mathbf{1}^{T}W - \mathbf{1}^{T} = 0$ and X − XW − E = 0 are enforced through Lagrange-multiplier terms and quadratic penalty terms weighted by μ;

wherein $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are the Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable is computed and set equal to 0 to obtain the closed-form solution of each variable; the solution for W is computed first as:

$$W = \left(2X^{T}X + \mu\,(I + \mathbf{1}\mathbf{1}^{T})\right)^{-1}\left(2U^{T}X + \mu\,(P + \mathbf{1}Q)\right) \tag{4}$$

wherein U, P and Q are auxiliary matrices defined in terms of the current values of the other variables and the multipliers;

the closed-form solutions for C and E, formulas (5) and (6), are then computed by the same procedure, by setting the corresponding partial derivatives of formula (3) to zero; they take the form of shrinkage (proximal) operators associated with the $L_{2,1}$ norm;

finally the Lagrange multipliers $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are updated as follows:

$$\Lambda_1 = \Lambda_1 + \mu\,(W - C) \tag{7}$$

$$\Lambda_2 = \Lambda_2 + \mu\,(\mathbf{1}^{T}W - \mathbf{1}^{T}) \tag{8}$$

$$\Lambda_3 = \Lambda_3 + \mu\,(X - XW - E) \tag{9}$$

The parameters are initialized as W = C = 0, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\mu = 10^{-6}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, and the closed-form updates are iterated, μ being updated as $\mu \leftarrow \min(\rho\mu,\ \mu_{\max})$ at each iteration. A threshold $\varepsilon = 10^{-8}$ is set; when $|W - C| < \varepsilon$, $|\mathbf{1}^{T}W - \mathbf{1}^{T}| < \varepsilon$ and $|X - XW - E| < \varepsilon$ all hold, the variables have stabilized and the reconstruction coefficient matrix W is obtained.
According to the above technical scheme, the beneficial effects of the invention are as follows: first, the method recognizes behaviors from video key frames and improves recognition accuracy by improving the quality of the frames to be examined; second, the two-stream convolutional neural network greatly improves the utilization of temporal information in the video and effectively improves the accuracy of behavior recognition; third, the method computes dense optical flow for the motion subject and minimizes the segment variance, so that the information content of the divided video segments is uniform; the segment division can be completed with a greedy algorithm using a simple conditional loop, which is simple and accurate; for the temporal (optical flow) features, the invention uses the 5 frames with the largest contribution, selected via the reconstruction matrix, as optical flow key frames instead of the traditional consecutive frames.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
fig. 2 is a flow diagram of video feature extraction.
Detailed Description
As shown in fig. 1, a method for recognizing human body behavior based on video keyframes includes the following steps:
(1) Acquiring a classified video set: downloading the UCF101 video set, which contains 13320 videos and 101 action classes, as the data set for action behavior recognition; split1 is used as the experimental split, the most commonly used grouping into training and test sets; from each class, 25 videos numbered 1 are selected as training videos and 5 videos numbered 2 as test videos. UCF101 is a large human action video data set with great diversity in the captured actions, including camera motion, appearance change, posture change, object scale change, background change, illumination change and the like; the test videos follow split1;
(2) Dividing the video into segments based on information content: detecting the behavior subject of the video, calculating the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, the division into N segments being carried out with a greedy algorithm;
(3) Constructing a two-stream convolutional neural network and training it with randomly sampled inputs;
(4) Extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method;
(5) Inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: the motion information of the behaviors is not uniform in the time domain, the behavior bodies occupy the vast information quantity of the video, and the motion states of the bodies are expressed by dense optical flows. Since the whole information amount of the video is fixed, the average segment information amount under an ideal condition can be obtained, and in order to divide the video into N segments according to the information amount, the difference between the information amount of the divided segments and the average information amount needs to be reduced, namely, the segment variance D is minimized;
calculating the motion information amount M of the whole video, namely:

$$M = \sum_{x}\sum_{y}\sum_{c} \mathrm{flow}^{2}(x, y, c)$$

in the formula, $\mathrm{flow}(x, y, c)$ denotes the corresponding two-channel optical flow image within the segment and $c$ denotes the channel of the optical flow; dividing the video into N segments, the average segment information content is $\bar{M} = M/N$, and the segment variance to be minimized is

$$D = \sum_{i=1}^{N} \left(M_i - \bar{M}\right)^{2}$$

in the formula, $M_i$ is the motion information amount of the ith segment of the video and $\bar{M}$ is the average segment information content;

an approximate solution minimizing the segment variance is obtained with a greedy algorithm: segment division is performed for the training videos (No. 1) and the test videos (No. 2). The video is first sampled into a video frame set, N empty segments are initialized, and the motion information amount $M_i$ of the ith segment is calculated. If $M_i$ is smaller than $\bar{M}$, the first frame of the video frame set is added to the segment and deleted from the set, and the information content of the segment is recalculated, until the motion information amount of the segment exceeds the average information content $\bar{M}$ for the first time; the division of segment i is then finished and the (i+1)th segment is computed in the same way. The greedy procedure is repeated until the video frame set is empty and the division of the whole video is completed.
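For illustration, the greedy division described above can be sketched as follows (Python/NumPy). The sketch assumes the dense optical flow of each consecutive frame pair has already been computed with an off-the-shelf optical flow routine; the function names and the handling of the final segment are illustrative choices, not part of the patent.

```python
import numpy as np

def motion_info(flow):
    """Motion information of one two-channel optical flow field: the sum of
    squared flow values over all pixels and both (x, y) channels."""
    return float(np.sum(flow.astype(np.float64) ** 2))

def greedy_segment(flows, n_segments):
    """Greedily assign consecutive frames to segments so that each segment's
    motion information content stays close to the average M / N.
    flows: list of HxWx2 optical flow arrays, one per consecutive frame pair."""
    per_frame = [motion_info(f) for f in flows]
    avg = sum(per_frame) / n_segments          # average information content per segment
    segments, current, current_info = [], [], 0.0
    for idx, info in enumerate(per_frame):
        current.append(idx)                    # add the next frame to the open segment
        current_info += info
        # close the segment once its information content first exceeds the average;
        # the last segment simply absorbs whatever frames remain
        if current_info >= avg and len(segments) < n_segments - 1:
            segments.append(current)
            current, current_info = [], 0.0
    if current:
        segments.append(current)
    return segments
```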
The two-stream convolutional neural network understands video information by imitating the human visual process: on top of processing the environmental spatial information in the video images, it also understands the temporal information in the video frame sequence, and to understand this information better the behavior classification task is divided into two different parts. The two-stream convolutional neural network constructed by the invention is divided into a spatial feature network, whose input is a single RGB frame of fixed size, and a temporal feature network, whose input is a stack of 5 optical flow frames. The two-stream convolutional neural network is trained with the labelled UCF101 training set.
The step (3) specifically comprises the following steps: the double-current convolutional neural network comprises a spatial feature network and a time feature network, and the constructing of the double-current convolutional neural network is as follows:
constructing a spatial feature network: the spatial feature network adopts a BN-inclusion structure, namely an image network, and the network selects the BN-inclusion structure which has good performance on an image classification task. According to the network, two convolutions of 3x3 are used for replacing a large convolution of 5x5, the number of parameters is reduced, overfitting is reduced, a BN layer is introduced, the BN method is a very effective regularization method, the training speed of the large convolutional network can be accelerated by many times, and meanwhile the classification accuracy rate after convergence can be greatly improved.
Constructing a time characteristic network: the time characteristic network is an optical flow network, and the network adopts a modified BN-inclusion structure. The time characteristic network is based on a BN-inclusion structure, convolution kernel parameters of a first convolution layer are summed along channels, the obtained parameters are divided by the number of the target channels, the sum is copied and superposed along the channels to serve as parameters of a new conv1 layer, the input of the network is a fixed 10-channel, and 5 frames of optical flow images are stacked in the xy direction to obtain the time characteristic network;
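A minimal sketch of the described adaptation of the first convolution layer is given below (PyTorch). It assumes a BN-Inception implementation whose first convolution layer is accessible as an nn.Conv2d; all names are illustrative and not part of the patent.

```python
import torch.nn as nn

def adapt_conv1_for_flow(conv1_rgb: nn.Conv2d, target_channels: int = 10) -> nn.Conv2d:
    """Build the first conv layer of the temporal (optical flow) network:
    sum the RGB kernel over its input channels, divide by the number of target
    channels, then replicate the result along the channel axis (10 channels =
    x/y components of 5 stacked optical flow frames)."""
    w = conv1_rgb.weight.data                                  # (out_c, 3, kH, kW)
    mean_w = w.sum(dim=1, keepdim=True) / target_channels      # (out_c, 1, kH, kW)
    new_w = mean_w.repeat(1, target_channels, 1, 1)            # (out_c, 10, kH, kW)
    conv1_flow = nn.Conv2d(target_channels, conv1_rgb.out_channels,
                           kernel_size=conv1_rgb.kernel_size,
                           stride=conv1_rgb.stride,
                           padding=conv1_rgb.padding,
                           bias=conv1_rgb.bias is not None)
    conv1_flow.weight.data.copy_(new_w)
    if conv1_rgb.bias is not None:
        conv1_flow.bias.data.copy_(conv1_rgb.bias.data)
    return conv1_flow
```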
the training of the double-current convolutional neural network is as follows: for a test video, a single video frame and a video sequence are randomly selected from the divided segments to serve as the input of a double-current convolutional neural network, the output of a convolutional layer, which is the space-time characteristic obtained by each segment, is respectively subjected to mean pooling, the pooling result is used as the input of a loss function to calculate loss, back propagation is carried out, and all training video sets are trained to obtain the final double-current convolutional neural network parameters.
Each training video is divided into N segments; from each segment one frame is randomly extracted as input to the spatial network and 5 frames are randomly extracted as input to the temporal feature network. The output of the last convolutional layer of each network is taken as the network output; the spatio-temporal features obtained for each segment, i.e. the convolutional layer outputs, are mean-pooled, and the pooled result is fed into the loss function to compute the loss for back propagation. The loss function is the softmax cross entropy, and training uses stochastic gradient descent.
The randomly sampled segment features of each video are trained repeatedly until all training samples have been used, completing the parameter tuning of the two-stream convolutional neural network. Training with random sampling lets the two-stream convolutional neural network learn more features and gives it a higher fault tolerance.
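The segment-level training described above (one randomly sampled input per segment, mean pooling of the per-segment outputs, softmax cross entropy, stochastic gradient descent) can be sketched as follows; the network object and data handling are placeholders, not the patent's implementation.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()        # softmax cross entropy, as in the description

def segment_consensus_loss(net, segment_inputs, label):
    """segment_inputs: list with one randomly sampled input tensor per segment
    (a single RGB frame for the spatial stream, or 10 stacked flow channels for
    the temporal stream). Per-segment outputs are mean-pooled before the loss
    is computed, so the gradient reflects the whole video."""
    outputs = torch.stack([net(x.unsqueeze(0)) for x in segment_inputs])  # (N, 1, C)
    consensus = outputs.mean(dim=0)                                       # (1, C)
    return criterion(consensus, label.view(1))

# One training step with stochastic gradient descent (illustrative only):
# optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
# loss = segment_consensus_loss(net, sampled_inputs, label)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```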
The invention proposes a linear reconstruction framework to extract video key frames; the key idea is to represent all frame feature vectors with a linear combination of a small number of basis vectors. High-level semantic information is extracted from each frame of the video segment and combined into a coefficient matrix of the video. A video key frame is characterized by the fact that this frame, or frame sequence, can represent the content of the whole video; the contribution of each frame to the whole video is therefore computed, and the frames with high contribution are taken as video key frames. The framework consists of two parts, a linear reconstruction function and a regularizer, and is solved through the structured sparsity induced by the $L_{2,1}$ norm. As shown in fig. 2, the step (4) specifically includes:
(4a) Extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of the segment;
video clips are pre-sampled to reduce the computational load of subsequent algorithms and to turn the video into successive still picture frames. Suppose that the ith segment v of video v i There are j frames in total, and each frame is encoded using the phog descriptor to obtain 680-dimensional feature quantities. Integrating the characteristic quantity of each frame to obtain a high-level semantic matrix X of the matrix video, wherein the size j of the X is 680, and the ith row of the matrix represents a video segment v i The ith frame characteristic of (2).
(4b) Solving for the reconstruction coefficients of the high-level semantic matrix X by an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Summing each row of the obtained reconstruction coefficient matrix W; since the sum of the ith row of W reflects the importance of the ith frame of the video segment, sorting the rows in descending order of their sums, selecting the frame with the largest row sum as the spatial feature of the video segment and the first five frames as the temporal feature of the video segment;
(4d) Repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
The step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the obtained classification results, and obtaining the prediction through an argmax function to complete the behavior recognition.
The step (4 b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as:

$$\min_{W}\ \|X - XW\|_{2,1} + \gamma\,\|W\|_{2,1}\qquad \text{s.t.}\ \mathbf{1}^{T}W = \mathbf{1}^{T}$$

wherein X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two $L_{2,1}$ norm terms. To optimize the formula for W, the relaxation constraints W = C and X − XW = E are added, and the formula is converted into an ALM objective, formula (3), in which the constraints W − C = 0, $\mathbf{1}^{T}W - \mathbf{1}^{T} = 0$ and X − XW − E = 0 are enforced through Lagrange-multiplier terms and quadratic penalty terms weighted by μ;

wherein $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are the Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable is computed and set equal to 0 to obtain the closed-form solution of each variable; the solution for W is computed first as:

$$W = \left(2X^{T}X + \mu\,(I + \mathbf{1}\mathbf{1}^{T})\right)^{-1}\left(2U^{T}X + \mu\,(P + \mathbf{1}Q)\right) \tag{4}$$

wherein U, P and Q are auxiliary matrices defined in terms of the current values of the other variables and the multipliers;

the closed-form solutions for C and E, formulas (5) and (6), are then computed by the same procedure, by setting the corresponding partial derivatives of formula (3) to zero; they take the form of shrinkage (proximal) operators associated with the $L_{2,1}$ norm;

finally the Lagrange multipliers $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are updated as follows:

$$\Lambda_1 = \Lambda_1 + \mu\,(W - C) \tag{7}$$

$$\Lambda_2 = \Lambda_2 + \mu\,(\mathbf{1}^{T}W - \mathbf{1}^{T}) \tag{8}$$

$$\Lambda_3 = \Lambda_3 + \mu\,(X - XW - E) \tag{9}$$

The parameters are initialized as W = C = 0, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\mu = 10^{-6}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, and the closed-form updates are iterated, μ being updated as $\mu \leftarrow \min(\rho\mu,\ \mu_{\max})$ at each iteration. A threshold $\varepsilon = 10^{-8}$ is set; when $|W - C| < \varepsilon$, $|\mathbf{1}^{T}W - \mathbf{1}^{T}| < \varepsilon$ and $|X - XW - E| < \varepsilon$ all hold, the variables have stabilized and the reconstruction coefficient matrix W is obtained.
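The iteration can be sketched as follows (Python/NumPy). This is a reconstruction under the standard augmented Lagrangian treatment of the objective above: X is used here with columns as frames (the transpose of the j x 680 matrix described earlier) so that W is j x j and its rows index frames, the W update is the closed-form least-squares step, and row-wise / column-wise $L_{2,1}$ shrinkage is assumed for C and E; the patent's exact closed forms (4)-(6) may differ in detail.

```python
import numpy as np

def row_shrink(A, tau):
    """Row-wise L2,1 proximal operator: shrink the L2 norm of every row."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return A * scale

def col_shrink(A, tau):
    """Column-wise L2,1 proximal operator: shrink the L2 norm of every column."""
    return row_shrink(A.T, tau).T

def solve_reconstruction_coefficients(X, gamma=1.0, eps=1e-8,
                                      mu=1e-6, rho=1.1, max_mu=1e10, max_iter=500):
    """ALM sketch for  min ||X - XW||_{2,1} + gamma ||W||_{2,1}  s.t. 1^T W = 1^T,
    with X of shape (d, n): columns are per-frame descriptors, so W is n x n and
    the i-th row of W measures the contribution of the i-th frame."""
    d, n = X.shape
    one = np.ones((n, 1))
    W = np.zeros((n, n)); C = np.zeros((n, n)); E = np.zeros((d, n))
    L1 = np.zeros((n, n)); L2 = np.zeros((1, n)); L3 = np.zeros((d, n))
    XtX = X.T @ X
    I = np.eye(n)
    for _ in range(max_iter):
        # W update: closed-form least-squares step of the augmented Lagrangian
        lhs = XtX + I + one @ one.T
        rhs = X.T @ (X - E + L3 / mu) + (C - L1 / mu) + one @ (one.T - L2 / mu)
        W = np.linalg.solve(lhs, rhs)
        # C update: row-wise shrinkage (proximal operator of (gamma/mu)*||.||_{2,1})
        C = row_shrink(W + L1 / mu, gamma / mu)
        # E update: column-wise shrinkage (proximal operator of (1/mu)*||.||_{2,1})
        E = col_shrink(X - X @ W + L3 / mu, 1.0 / mu)
        # Lagrange multiplier updates, formulas (7)-(9)
        r1 = W - C
        r2 = one.T @ W - one.T
        r3 = X - X @ W - E
        L1 += mu * r1; L2 += mu * r2; L3 += mu * r3
        mu = min(rho * mu, max_mu)
        if max(np.abs(r1).max(), np.abs(r2).max(), np.abs(r3).max()) < eps:
            break
    return W
```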
The sum of the ith row of W represents the importance of the ith frame to the entire video; W is a sparse matrix, so each row of W is summed and the rows are sorted from large to small. The frame with the largest row sum, k1, is extracted as the image (spatial) key frame of the segment, and the five frames with the largest row sums, k1 to k5, as the optical flow key frames.
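Selecting the key frames from W then reduces to ranking the row sums, for example:

```python
import numpy as np

def select_key_frames(W, n_flow=5):
    """Rank frames by the sum of their row of W and return the spatial key
    frame index (largest row sum) and the n_flow optical flow key frames."""
    importance = W.sum(axis=1)                 # per-frame contribution to the segment
    order = np.argsort(importance)[::-1]       # descending order of importance
    return int(order[0]), order[:n_flow].tolist()
```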
Steps (4a) to (4c) are repeated for each segment of the test video, and the spatio-temporal features of each video segment are computed and numbered.
The obtained spatio-temporal feature pairs are input into the trained two-stream convolutional neural network, the obtained classification results are mean-pooled, and the prediction is finally obtained through an argmax function, completing the human behavior recognition. If videos to be tested remain, the procedure returns to step (4).
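A minimal sketch of this final fusion step is given below; equal weighting of the two streams is an assumption here, since the patent only specifies mean pooling followed by argmax.

```python
import numpy as np

def predict_behavior(spatial_scores, temporal_scores):
    """spatial_scores / temporal_scores: arrays of shape (N_segments, num_classes)
    with the per-segment outputs of the two streams. Segment scores are
    mean-pooled and the two streams are fused before the argmax."""
    fused = (np.mean(spatial_scores, axis=0) + np.mean(temporal_scores, axis=0)) / 2.0
    return int(np.argmax(fused))
```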
In summary, video features can be extracted in many ways, and the invention extracts video key frames as features. The invention improves recognition accuracy by improving the quality of the frames to be examined; the two-stream convolutional neural network greatly improves the utilization of temporal information in the video and effectively improves the accuracy of behavior recognition; the segment division is solved with a greedy algorithm and can be completed with a simple conditional loop, making the method simple and accurate; for the temporal (optical flow) features, the invention uses the 5 frames with the largest contribution, selected via the reconstruction matrix, as optical flow key frames instead of the traditional consecutive frames.

Claims (4)

1. A human behavior identification method based on video key frames is characterized in that: the method comprises the following steps in sequence:
(1) Acquiring a classified video set: downloading a UCF101 video set, wherein the video set comprises 13320 videos and 101 types of actions as a data set for action behavior recognition, and split1 is an experimental video set; selecting 25 videos with the serial numbers of 1 from each type of videos in the video set as training videos, and selecting 5 videos with the serial numbers of 2 as test videos;
(2) Dividing the video into segments based on information content: detecting the behavior subject of the video, calculating the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, the division into N segments being carried out with a greedy algorithm;
(3) Constructing a two-stream convolutional neural network and training it with randomly sampled inputs;
(4) Extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method;
(5) Inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result;
the step (4) specifically comprises:
(4a) Extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of the segment;
(4b) Solving for the reconstruction coefficients of the high-level semantic matrix X by an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Summing each row of the obtained reconstruction coefficient matrix W; since the sum of the ith row of W reflects the importance of the ith frame of the video segment, sorting the rows in descending order of their sums, selecting the frame with the largest row sum as the spatial feature of the video segment and the first five frames as the temporal feature of the video segment;
(4d) Repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features;
the step (4 b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as:

$$\min_{W}\ \|X - XW\|_{2,1} + \gamma\,\|W\|_{2,1}\qquad \text{s.t.}\ \mathbf{1}^{T}W = \mathbf{1}^{T}$$

wherein X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two $L_{2,1}$ norm terms. To optimize the formula for W, the relaxation constraints W = C and X − XW = E are added, and the formula is converted into an ALM objective, formula (3), in which the constraints W − C = 0, $\mathbf{1}^{T}W - \mathbf{1}^{T} = 0$ and X − XW − E = 0 are enforced through Lagrange-multiplier terms and quadratic penalty terms weighted by μ;

wherein $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are the Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable is computed and set equal to 0 to obtain the closed-form solution of each variable; the solution for W is computed first as:

$$W = \left(2X^{T}X + \mu\,(I + \mathbf{1}\mathbf{1}^{T})\right)^{-1}\left(2U^{T}X + \mu\,(P + \mathbf{1}Q)\right) \tag{4}$$

wherein U, P and Q are auxiliary matrices defined in terms of the current values of the other variables and the multipliers;

the closed-form solutions for C and E, formulas (5) and (6), are then computed by the same procedure, by setting the corresponding partial derivatives of formula (3) to zero; they take the form of shrinkage (proximal) operators associated with the $L_{2,1}$ norm;

finally the Lagrange multipliers $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are updated as follows:

$$\Lambda_1 = \Lambda_1 + \mu\,(W - C) \tag{7}$$

$$\Lambda_2 = \Lambda_2 + \mu\,(\mathbf{1}^{T}W - \mathbf{1}^{T}) \tag{8}$$

$$\Lambda_3 = \Lambda_3 + \mu\,(X - XW - E) \tag{9}$$

The parameters are initialized as W = C = 0, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\mu = 10^{-6}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, and the closed-form updates are iterated, μ being updated as $\mu \leftarrow \min(\rho\mu,\ \mu_{\max})$ at each iteration. A threshold $\varepsilon = 10^{-8}$ is set; when $|W - C| < \varepsilon$, $|\mathbf{1}^{T}W - \mathbf{1}^{T}| < \varepsilon$ and $|X - XW - E| < \varepsilon$ all hold, the variables have stabilized and the reconstruction coefficient matrix W is obtained.
2. The human behavior recognition method based on video key frames according to claim 1, characterized in that: the step (2) specifically comprises the following steps: calculating the motion information amount M of the whole video, namely:

$$M = \sum_{x}\sum_{y}\sum_{c} \mathrm{flow}^{2}(x, y, c)$$

in the formula, $\mathrm{flow}(x, y, c)$ denotes the corresponding two-channel optical flow image within the segment and $c$ denotes the channel of the optical flow;

dividing the video into N segments, the average segment information content is $\bar{M} = M/N$, and the segment variance to be minimized is

$$D = \sum_{i=1}^{N} \left(M_i - \bar{M}\right)^{2}$$

in the formula, $M_i$ is the motion information amount of the ith segment of the video and $\bar{M}$ is the average segment information content;

an approximate solution minimizing the segment variance is obtained with a greedy algorithm: segment division is performed for the training videos (No. 1) and the test videos (No. 2); the video is first sampled into a video frame set, N empty segments are initialized, and the motion information amount $M_i$ of the ith segment is calculated; if $M_i$ is smaller than $\bar{M}$, the first frame of the video frame set is added to the segment and deleted from the set, and the information content of the segment is recalculated, until the motion information amount of the segment exceeds the average information content $\bar{M}$ for the first time; the division of segment i is then finished and the (i+1)th segment is computed in the same way; the greedy procedure is repeated until the video frame set is empty and the division of the whole video is completed.
3. The human behavior recognition method based on video key frames according to claim 1, characterized in that: the step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, and constructing the two-stream convolutional neural network means: the spatial feature network adopts the BN-Inception structure; the temporal feature network, on the basis of the BN-Inception structure, sums the convolution kernel parameters of the first convolution layer along the channel dimension, divides the result by the number of target channels, and replicates and stacks the result along the channel dimension as the parameters of the new conv1 layer; the input of the temporal network is a fixed 10 channels, obtained by stacking the x and y components of 5 optical flow frames;
the training of the two-stream convolutional neural network means: for each training video, a single video frame and a short frame sequence are randomly selected from the divided segments as inputs to the two-stream convolutional neural network; the spatio-temporal features obtained for each segment, i.e. the convolutional layer outputs, are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back propagation is performed; after all training videos have been processed, the final parameters of the two-stream convolutional neural network are obtained.
4. The human behavior recognition method based on video key frames according to claim 1, characterized in that: the step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the obtained classification results, and obtaining the prediction through an argmax function to complete the behavior recognition.
CN202010482943.8A 2020-06-01 2020-06-01 Human behavior identification method based on video key frame Active CN111626245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010482943.8A CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010482943.8A CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Publications (2)

Publication Number Publication Date
CN111626245A CN111626245A (en) 2020-09-04
CN111626245B true CN111626245B (en) 2023-04-07

Family

ID=72271841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010482943.8A Active CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Country Status (1)

Country Link
CN (1) CN111626245B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016506B (en) * 2020-09-07 2022-10-11 重庆邮电大学 Classroom attitude detection model parameter training method capable of quickly adapting to new scene
CN112329738B (en) * 2020-12-01 2024-08-16 厦门大学 Long video motion recognition method based on significant segment sampling
CN112528823B (en) * 2020-12-04 2022-08-19 燕山大学 Method and system for analyzing batcharybus movement behavior based on key frame detection and semantic component segmentation
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network
CN112733695B (en) * 2021-01-04 2023-04-25 电子科技大学 Unsupervised keyframe selection method in pedestrian re-identification field
CN113469142B (en) * 2021-03-12 2022-01-14 山西长河科技股份有限公司 Classification method, device and terminal for monitoring video time-space information fusion
CN113239869B (en) * 2021-05-31 2023-08-11 西安电子科技大学 Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN113642499B (en) * 2021-08-23 2024-05-24 中国人民解放军火箭军工程大学 Human body behavior recognition method based on computer vision
CN114373194A (en) * 2022-01-14 2022-04-19 南京邮电大学 Human behavior identification method based on key frame and attention mechanism
CN114550047B (en) * 2022-02-22 2024-04-05 西安交通大学 Behavior rate guided video behavior recognition method
CN114973020A (en) * 2022-06-15 2022-08-30 北京鹏鹄物宇科技发展有限公司 Abnormal behavior analysis method based on satellite monitoring video
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9510787B2 (en) * 2014-12-11 2016-12-06 Mitsubishi Electric Research Laboratories, Inc. Method and system for reconstructing sampled signals

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
田曼; 张艺. Research on action recognition with multi-model fusion. Electronic Measurement Technology, 2018, (20), full text. *
贾迪; 朱宁丹; 杨宁华; 吴思; 李玉秀; 赵明远. A survey of image matching methods. Journal of Image and Graphics, 2019, (05), full text. *

Also Published As

Publication number Publication date
CN111626245A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN111626245B (en) Human behavior identification method based on video key frame
Oh et al. Crowd counting with decomposed uncertainty
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
Zhou et al. Anomalynet: An anomaly detection network for video surveillance
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN108133188A (en) A kind of Activity recognition method based on motion history image and convolutional neural networks
CN108446589B (en) Face recognition method based on low-rank decomposition and auxiliary dictionary in complex environment
CN111526434B (en) Converter-based video abstraction method
CN113239869B (en) Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN111738363A (en) Alzheimer disease classification method based on improved 3D CNN network
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
Song et al. A new recurrent plug-and-play prior based on the multiple self-similarity network
CN115393231B (en) Defect image generation method and device, electronic equipment and storage medium
CN115131558B (en) Semantic segmentation method in environment with few samples
Choo et al. Multi-scale recurrent encoder-decoder network for dense temporal classification
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
Sun et al. Video snapshot compressive imaging using residual ensemble network
Jaisurya et al. Attention-based single image dehazing using improved cyclegan
CN111898614A (en) Neural network system, image signal and data processing method
CN117834852A (en) Space-time video quality evaluation method based on cross-attention multi-scale visual transformer
CN113111945A (en) Confrontation sample defense method based on transform self-encoder
CN112347965A (en) Video relation detection method and system based on space-time diagram
CN113762007A (en) Abnormal behavior detection method based on appearance and action characteristic double prediction
CN111401209A (en) Action recognition method based on deep learning
CN113963421B (en) Dynamic sequence unconstrained expression recognition method based on hybrid feature enhanced network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant