CN111626245B - Human behavior identification method based on video key frame - Google Patents


Info

Publication number
CN111626245B
CN111626245B (application number CN202010482943.8A)
Authority
CN
China
Prior art keywords
video
segment
neural network
double
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010482943.8A
Other languages
Chinese (zh)
Other versions
CN111626245A (en)
Inventor
丁转莲
盛漫锦
石家毓
孙登第
吴心悦
沈心怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202010482943.8A
Publication of CN111626245A
Application granted
Publication of CN111626245B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human behavior recognition method based on video key frames, which comprises the following steps: acquiring a classified video set; dividing the video into segments based on information content; constructing a two-stream convolutional neural network and training it with randomly sampled inputs; extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method; and inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result. The invention improves recognition accuracy by improving the quality of the frames to be examined; the two-stream convolutional neural network greatly improves the utilization of temporal information in the video and effectively improves the accuracy of behavior recognition; the segment division is solved with a greedy algorithm and can be completed with a simple conditional loop, making the method simple and accurate; for the temporal (optical flow) features, the invention uses the 5 frames with the largest contribution, selected via the reconstruction matrix, as optical flow key frames instead of the traditional consecutive frames.

Description

Human behavior recognition method based on video key frame
Technical Field
The invention relates to the technical field of machine learning, in particular to a human behavior identification method based on video key frames.
Background
In the age of rapid Internet development, the amount of video data in daily life grows explosively, and videos carry a large amount of digital information. Compared with behavior recognition on still images, behavior recognition on video suffers from a large computational load and redundant information, but it has greater practical significance and wider application prospects, such as human-computer interaction, intelligent surveillance and video classification. Improving the accuracy of video behavior recognition is therefore an important and difficult problem.
Since video consists of consecutive still images, it contains redundancy, and the video information must be screened. How to screen the video information and which kind of network to use for recognition are important directions for improving the accuracy of video behavior recognition. Current recognition methods either feed the video directly into a neural network or divide the video into equal-length segments and extract key frames for recognition. However, these methods are computationally intensive and the extracted key frames are not representative.
Disclosure of Invention
The invention aims to provide a human behavior recognition method based on video key frames, which divides a video into segments according to the information content of the segments, extracts useful key frames from the segments, and uses these key frames together with a spatio-temporal two-stream network to recognize behaviors more quickly and effectively.
In order to achieve the purpose, the invention adopts the following technical scheme: a human behavior recognition method based on video key frames comprises the following steps:
(1) Acquiring a classified video set: downloading a UCF101 video set, wherein the video set comprises 13320 videos and 101 types of actions as a data set for action behavior recognition, and split1 is an experimental video set; selecting 25 videos with the serial numbers of 1 from each type of videos in the video set as training videos, and selecting 5 videos with the serial numbers of 2 as test videos;
(2) Dividing the video into segments based on information content: detecting the behavior subject of the video, calculating the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, the division into N segments being carried out with a greedy algorithm;
(3) Constructing a two-stream convolutional neural network and training it with randomly sampled inputs;
(4) Extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method;
(5) Inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: calculating the motion information amount M of the whole video, namely:
Figure GDA0004044502380000021
in the formula, flow 2 (x, y, c) represents the corresponding two-channel optical flow image within the segment, c represents the channel of the optical flow;
dividing video into N segments to obtain average information amount
Figure GDA0004044502380000022
For M/N, minimize the variance of the segments
Figure GDA0004044502380000023
In the formula, M i For the motion information amount of the ith segment of the video, based on the comparison result>
Figure GDA0004044502380000024
Is the average segment information content;
approximate solution is solved by minimizing the segment variance using a greedy algorithm: the method comprises the steps of carrying out segment division on a training video (No. 1) and a testing video (No. 2), firstly carrying out frame sampling on the videos to obtain a video frame set, initializing N divided segments, wherein the content of the divided segments is empty, and calculating the motion information amount M of the ith divided segment i If is compared
Figure GDA0004044502380000025
Small, adding the first frame of the video frame set to the segment and deleting it in the video frame set, calculating the information content of the segment again until the motion information content of the segment is greater than the average information content for the first time>
Figure GDA0004044502380000026
And finishing the division of the segment i, calculating the (i + 1) th division segment, repeating the greedy algorithm until the video frame set is empty, and finishing the division of the whole video.
The step (3) specifically comprises the following steps: the double-current convolution neural network comprises a spatial feature network and a time feature network, and the construction of the double-current convolution neural network refers to: the spatial feature network adopts a BN-inclusion structure, the temporal feature network sums convolution kernel parameters of a first convolution layer along channels on the basis of the BN-inclusion structure, the obtained parameters are divided by the number of the target channels, the parameters are copied and superposed along the channels to serve as parameters of a new conv1 layer, the input of the network is 10 fixed channels, and 5 frames of optical flow images are stacked in the xy direction to obtain the network;
the training of the two-stream convolutional neural network means: for each training video, a single video frame and a short frame sequence are randomly selected from each divided segment as inputs to the two-stream convolutional neural network; the spatio-temporal features obtained for each segment, i.e. the convolutional layer outputs, are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back propagation is performed; after all training videos have been processed, the final parameters of the two-stream convolutional neural network are obtained.
The step (4) specifically comprises:
(4a) Extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of the segment;
(4b) Solving for the reconstruction coefficients of the high-level semantic matrix X by an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Summing each row of the obtained reconstruction coefficient matrix W; since the sum of the ith row of W reflects the importance of the ith frame of the video segment, sorting the rows in descending order of their sums, selecting the frame with the largest row sum as the spatial feature of the video segment and the first five frames as the temporal feature of the video segment;
(4d) Repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
The step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the obtained classification results, and obtaining the prediction through an argmax function to complete the behavior recognition.
The step (4 b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as:

$$\min_{W}\ \|X - XW\|_{2,1} + \gamma\,\|W\|_{2,1}\qquad \text{s.t.}\ \mathbf{1}^{T}W = \mathbf{1}^{T}$$

wherein X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two $L_{2,1}$ norm terms. To optimize the formula for W, the relaxation constraints W = C and X − XW = E are added, and the formula is converted into an ALM objective, formula (3), in which the constraints W − C = 0, $\mathbf{1}^{T}W - \mathbf{1}^{T} = 0$ and X − XW − E = 0 are enforced through Lagrange-multiplier terms and quadratic penalty terms weighted by μ;

wherein $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are the Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable is computed and set equal to 0 to obtain the closed-form solution of each variable; the solution for W is computed first as:

$$W = \left(2X^{T}X + \mu\,(I + \mathbf{1}\mathbf{1}^{T})\right)^{-1}\left(2U^{T}X + \mu\,(P + \mathbf{1}Q)\right) \tag{4}$$

wherein U, P and Q are auxiliary matrices defined in terms of the current values of the other variables and the multipliers;

the closed-form solutions for C and E, formulas (5) and (6), are then computed by the same procedure, by setting the corresponding partial derivatives of formula (3) to zero; they take the form of shrinkage (proximal) operators associated with the $L_{2,1}$ norm;

finally the Lagrange multipliers $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are updated as follows:

$$\Lambda_1 = \Lambda_1 + \mu\,(W - C) \tag{7}$$

$$\Lambda_2 = \Lambda_2 + \mu\,(\mathbf{1}^{T}W - \mathbf{1}^{T}) \tag{8}$$

$$\Lambda_3 = \Lambda_3 + \mu\,(X - XW - E) \tag{9}$$

The parameters are initialized as W = C = 0, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\mu = 10^{-6}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, and the closed-form updates are iterated, μ being updated as $\mu \leftarrow \min(\rho\mu,\ \mu_{\max})$ at each iteration. A threshold $\varepsilon = 10^{-8}$ is set; when $|W - C| < \varepsilon$, $|\mathbf{1}^{T}W - \mathbf{1}^{T}| < \varepsilon$ and $|X - XW - E| < \varepsilon$ all hold, the variables have stabilized and the reconstruction coefficient matrix W is obtained.
According to the above technical scheme, the beneficial effects of the invention are as follows: first, the method recognizes behaviors from video key frames and improves recognition accuracy by improving the quality of the frames to be examined; second, the two-stream convolutional neural network greatly improves the utilization of temporal information in the video and effectively improves the accuracy of behavior recognition; third, the method computes dense optical flow for the motion subject and minimizes the segment variance, so that the information content of the divided video segments is uniform; the segment division can be completed with a greedy algorithm using a simple conditional loop, which is simple and accurate; for the temporal (optical flow) features, the invention uses the 5 frames with the largest contribution, selected via the reconstruction matrix, as optical flow key frames instead of the traditional consecutive frames.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
fig. 2 is a flow diagram of video feature extraction.
Detailed Description
As shown in fig. 1, a method for recognizing human body behavior based on video keyframes includes the following steps:
(1) Acquiring a classified video set: downloading the UCF101 video set, which contains 13320 videos and 101 action classes, as the data set for action behavior recognition; split1 is used as the experimental split, the most commonly used grouping into training and test sets; from each class, 25 videos numbered 1 are selected as training videos and 5 videos numbered 2 as test videos. UCF101 is a large human action video data set with great diversity in the captured actions, including camera motion, appearance change, posture change, object scale change, background change, illumination change and the like; the test videos follow split1;
(2) Dividing the video into segments based on information content: detecting the behavior subject of the video, calculating the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, the division into N segments being carried out with a greedy algorithm;
(3) Constructing a two-stream convolutional neural network and training it with randomly sampled inputs;
(4) Extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method;
(5) Inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: the motion information of the behaviors is not uniform in the time domain, the behavior bodies occupy the vast information quantity of the video, and the motion states of the bodies are expressed by dense optical flows. Since the whole information amount of the video is fixed, the average segment information amount under an ideal condition can be obtained, and in order to divide the video into N segments according to the information amount, the difference between the information amount of the divided segments and the average information amount needs to be reduced, namely, the segment variance D is minimized;
calculating the motion information amount M of the whole video, namely:

$$M = \sum_{x}\sum_{y}\sum_{c} \mathrm{flow}^{2}(x, y, c)$$

in the formula, $\mathrm{flow}(x, y, c)$ denotes the corresponding two-channel optical flow image within the segment and $c$ denotes the channel of the optical flow; dividing the video into N segments, the average segment information content is $\bar{M} = M/N$, and the segment variance to be minimized is

$$D = \sum_{i=1}^{N} \left(M_i - \bar{M}\right)^{2}$$

in the formula, $M_i$ is the motion information amount of the ith segment of the video and $\bar{M}$ is the average segment information content;

an approximate solution minimizing the segment variance is obtained with a greedy algorithm: segment division is performed for the training videos (No. 1) and the test videos (No. 2). The video is first sampled into a video frame set, N empty segments are initialized, and the motion information amount $M_i$ of the ith segment is calculated. If $M_i$ is smaller than $\bar{M}$, the first frame of the video frame set is added to the segment and deleted from the set, and the information content of the segment is recalculated, until the motion information amount of the segment exceeds the average information content $\bar{M}$ for the first time; the division of segment i is then finished and the (i+1)th segment is computed in the same way. The greedy procedure is repeated until the video frame set is empty and the division of the whole video is completed.
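For illustration, the greedy division described above can be sketched as follows (Python/NumPy). The sketch assumes the dense optical flow of each consecutive frame pair has already been computed with an off-the-shelf optical flow routine; the function names and the handling of the final segment are illustrative choices, not part of the patent.

```python
import numpy as np

def motion_info(flow):
    """Motion information of one two-channel optical flow field: the sum of
    squared flow values over all pixels and both (x, y) channels."""
    return float(np.sum(flow.astype(np.float64) ** 2))

def greedy_segment(flows, n_segments):
    """Greedily assign consecutive frames to segments so that each segment's
    motion information content stays close to the average M / N.
    flows: list of HxWx2 optical flow arrays, one per consecutive frame pair."""
    per_frame = [motion_info(f) for f in flows]
    avg = sum(per_frame) / n_segments          # average information content per segment
    segments, current, current_info = [], [], 0.0
    for idx, info in enumerate(per_frame):
        current.append(idx)                    # add the next frame to the open segment
        current_info += info
        # close the segment once its information content first exceeds the average;
        # the last segment simply absorbs whatever frames remain
        if current_info >= avg and len(segments) < n_segments - 1:
            segments.append(current)
            current, current_info = [], 0.0
    if current:
        segments.append(current)
    return segments
```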
The two-stream convolutional neural network understands video information by imitating the human visual process: on top of processing the environmental spatial information in the video images, it also understands the temporal information in the video frame sequence, and to understand this information better the behavior classification task is divided into two different parts. The two-stream convolutional neural network constructed by the invention is divided into a spatial feature network, whose input is a single RGB frame of fixed size, and a temporal feature network, whose input is a stack of 5 optical flow frames. The two-stream convolutional neural network is trained with the labelled UCF101 training set.
The step (3) specifically comprises the following steps: the double-current convolutional neural network comprises a spatial feature network and a time feature network, and the constructing of the double-current convolutional neural network is as follows:
constructing a spatial feature network: the spatial feature network adopts a BN-inclusion structure, namely an image network, and the network selects the BN-inclusion structure which has good performance on an image classification task. According to the network, two convolutions of 3x3 are used for replacing a large convolution of 5x5, the number of parameters is reduced, overfitting is reduced, a BN layer is introduced, the BN method is a very effective regularization method, the training speed of the large convolutional network can be accelerated by many times, and meanwhile the classification accuracy rate after convergence can be greatly improved.
Constructing a time characteristic network: the time characteristic network is an optical flow network, and the network adopts a modified BN-inclusion structure. The time characteristic network is based on a BN-inclusion structure, convolution kernel parameters of a first convolution layer are summed along channels, the obtained parameters are divided by the number of the target channels, the sum is copied and superposed along the channels to serve as parameters of a new conv1 layer, the input of the network is a fixed 10-channel, and 5 frames of optical flow images are stacked in the xy direction to obtain the time characteristic network;
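A minimal sketch of the described adaptation of the first convolution layer is given below (PyTorch). It assumes a BN-Inception implementation whose first convolution layer is accessible as an nn.Conv2d; all names are illustrative and not part of the patent.

```python
import torch.nn as nn

def adapt_conv1_for_flow(conv1_rgb: nn.Conv2d, target_channels: int = 10) -> nn.Conv2d:
    """Build the first conv layer of the temporal (optical flow) network:
    sum the RGB kernel over its input channels, divide by the number of target
    channels, then replicate the result along the channel axis (10 channels =
    x/y components of 5 stacked optical flow frames)."""
    w = conv1_rgb.weight.data                                  # (out_c, 3, kH, kW)
    mean_w = w.sum(dim=1, keepdim=True) / target_channels      # (out_c, 1, kH, kW)
    new_w = mean_w.repeat(1, target_channels, 1, 1)            # (out_c, 10, kH, kW)
    conv1_flow = nn.Conv2d(target_channels, conv1_rgb.out_channels,
                           kernel_size=conv1_rgb.kernel_size,
                           stride=conv1_rgb.stride,
                           padding=conv1_rgb.padding,
                           bias=conv1_rgb.bias is not None)
    conv1_flow.weight.data.copy_(new_w)
    if conv1_rgb.bias is not None:
        conv1_flow.bias.data.copy_(conv1_rgb.bias.data)
    return conv1_flow
```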
the training of the double-current convolutional neural network is as follows: for a test video, a single video frame and a video sequence are randomly selected from the divided segments to serve as the input of a double-current convolutional neural network, the output of a convolutional layer, which is the space-time characteristic obtained by each segment, is respectively subjected to mean pooling, the pooling result is used as the input of a loss function to calculate loss, back propagation is carried out, and all training video sets are trained to obtain the final double-current convolutional neural network parameters.
Each training video is divided into N segments; from each segment one frame is randomly extracted as input to the spatial network and 5 frames are randomly extracted as input to the temporal feature network. The output of the last convolutional layer of each network is taken as the network output; the spatio-temporal features obtained for each segment, i.e. the convolutional layer outputs, are mean-pooled, and the pooled result is fed into the loss function to compute the loss for back propagation. The loss function is the softmax cross entropy, and training uses stochastic gradient descent.
The randomly sampled segment features of each video are trained repeatedly until all training samples have been used, completing the parameter tuning of the two-stream convolutional neural network. Training with random sampling lets the two-stream convolutional neural network learn more features and gives it a higher fault tolerance.
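The segment-level training described above (one randomly sampled input per segment, mean pooling of the per-segment outputs, softmax cross entropy, stochastic gradient descent) can be sketched as follows; the network object and data handling are placeholders, not the patent's implementation.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()        # softmax cross entropy, as in the description

def segment_consensus_loss(net, segment_inputs, label):
    """segment_inputs: list with one randomly sampled input tensor per segment
    (a single RGB frame for the spatial stream, or 10 stacked flow channels for
    the temporal stream). Per-segment outputs are mean-pooled before the loss
    is computed, so the gradient reflects the whole video."""
    outputs = torch.stack([net(x.unsqueeze(0)) for x in segment_inputs])  # (N, 1, C)
    consensus = outputs.mean(dim=0)                                       # (1, C)
    return criterion(consensus, label.view(1))

# One training step with stochastic gradient descent (illustrative only):
# optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
# loss = segment_consensus_loss(net, sampled_inputs, label)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```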
The invention proposes a linear reconstruction framework to extract video key frames; the key idea is to represent all frame feature vectors with a linear combination of a small number of basis vectors. High-level semantic information is extracted from each frame of the video segment and combined into a coefficient matrix of the video. A video key frame is characterized by the fact that this frame, or frame sequence, can represent the content of the whole video; the contribution of each frame to the whole video is therefore computed, and the frames with high contribution are taken as video key frames. The framework consists of two parts, a linear reconstruction function and a regularizer, and is solved through the structured sparsity induced by the $L_{2,1}$ norm. As shown in fig. 2, the step (4) specifically includes:
(4a) Extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of the segment;
video clips are pre-sampled to reduce the computational load of subsequent algorithms and to turn the video into successive still picture frames. Suppose that the ith segment v of video v i There are j frames in total, and each frame is encoded using the phog descriptor to obtain 680-dimensional feature quantities. Integrating the characteristic quantity of each frame to obtain a high-level semantic matrix X of the matrix video, wherein the size j of the X is 680, and the ith row of the matrix represents a video segment v i The ith frame characteristic of (2).
(4b) Solving for the reconstruction coefficients of the high-level semantic matrix X by an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Summing each row of the obtained reconstruction coefficient matrix W; since the sum of the ith row of W reflects the importance of the ith frame of the video segment, sorting the rows in descending order of their sums, selecting the frame with the largest row sum as the spatial feature of the video segment and the first five frames as the temporal feature of the video segment;
(4d) Repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
The step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the obtained classification results, and obtaining the prediction through an argmax function to complete the behavior recognition.
The step (4 b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as:

$$\min_{W}\ \|X - XW\|_{2,1} + \gamma\,\|W\|_{2,1}\qquad \text{s.t.}\ \mathbf{1}^{T}W = \mathbf{1}^{T}$$

wherein X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two $L_{2,1}$ norm terms. To optimize the formula for W, the relaxation constraints W = C and X − XW = E are added, and the formula is converted into an ALM objective, formula (3), in which the constraints W − C = 0, $\mathbf{1}^{T}W - \mathbf{1}^{T} = 0$ and X − XW − E = 0 are enforced through Lagrange-multiplier terms and quadratic penalty terms weighted by μ;

wherein $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are the Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable is computed and set equal to 0 to obtain the closed-form solution of each variable; the solution for W is computed first as:

$$W = \left(2X^{T}X + \mu\,(I + \mathbf{1}\mathbf{1}^{T})\right)^{-1}\left(2U^{T}X + \mu\,(P + \mathbf{1}Q)\right) \tag{4}$$

wherein U, P and Q are auxiliary matrices defined in terms of the current values of the other variables and the multipliers;

the closed-form solutions for C and E, formulas (5) and (6), are then computed by the same procedure, by setting the corresponding partial derivatives of formula (3) to zero; they take the form of shrinkage (proximal) operators associated with the $L_{2,1}$ norm;

finally the Lagrange multipliers $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are updated as follows:

$$\Lambda_1 = \Lambda_1 + \mu\,(W - C) \tag{7}$$

$$\Lambda_2 = \Lambda_2 + \mu\,(\mathbf{1}^{T}W - \mathbf{1}^{T}) \tag{8}$$

$$\Lambda_3 = \Lambda_3 + \mu\,(X - XW - E) \tag{9}$$

The parameters are initialized as W = C = 0, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\mu = 10^{-6}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, and the closed-form updates are iterated, μ being updated as $\mu \leftarrow \min(\rho\mu,\ \mu_{\max})$ at each iteration. A threshold $\varepsilon = 10^{-8}$ is set; when $|W - C| < \varepsilon$, $|\mathbf{1}^{T}W - \mathbf{1}^{T}| < \varepsilon$ and $|X - XW - E| < \varepsilon$ all hold, the variables have stabilized and the reconstruction coefficient matrix W is obtained.
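The iteration can be sketched as follows (Python/NumPy). This is a reconstruction under the standard augmented Lagrangian treatment of the objective above: X is used here with columns as frames (the transpose of the j x 680 matrix described earlier) so that W is j x j and its rows index frames, the W update is the closed-form least-squares step, and row-wise / column-wise $L_{2,1}$ shrinkage is assumed for C and E; the patent's exact closed forms (4)-(6) may differ in detail.

```python
import numpy as np

def row_shrink(A, tau):
    """Row-wise L2,1 proximal operator: shrink the L2 norm of every row."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return A * scale

def col_shrink(A, tau):
    """Column-wise L2,1 proximal operator: shrink the L2 norm of every column."""
    return row_shrink(A.T, tau).T

def solve_reconstruction_coefficients(X, gamma=1.0, eps=1e-8,
                                      mu=1e-6, rho=1.1, max_mu=1e10, max_iter=500):
    """ALM sketch for  min ||X - XW||_{2,1} + gamma ||W||_{2,1}  s.t. 1^T W = 1^T,
    with X of shape (d, n): columns are per-frame descriptors, so W is n x n and
    the i-th row of W measures the contribution of the i-th frame."""
    d, n = X.shape
    one = np.ones((n, 1))
    W = np.zeros((n, n)); C = np.zeros((n, n)); E = np.zeros((d, n))
    L1 = np.zeros((n, n)); L2 = np.zeros((1, n)); L3 = np.zeros((d, n))
    XtX = X.T @ X
    I = np.eye(n)
    for _ in range(max_iter):
        # W update: closed-form least-squares step of the augmented Lagrangian
        lhs = XtX + I + one @ one.T
        rhs = X.T @ (X - E + L3 / mu) + (C - L1 / mu) + one @ (one.T - L2 / mu)
        W = np.linalg.solve(lhs, rhs)
        # C update: row-wise shrinkage (proximal operator of (gamma/mu)*||.||_{2,1})
        C = row_shrink(W + L1 / mu, gamma / mu)
        # E update: column-wise shrinkage (proximal operator of (1/mu)*||.||_{2,1})
        E = col_shrink(X - X @ W + L3 / mu, 1.0 / mu)
        # Lagrange multiplier updates, formulas (7)-(9)
        r1 = W - C
        r2 = one.T @ W - one.T
        r3 = X - X @ W - E
        L1 += mu * r1; L2 += mu * r2; L3 += mu * r3
        mu = min(rho * mu, max_mu)
        if max(np.abs(r1).max(), np.abs(r2).max(), np.abs(r3).max()) < eps:
            break
    return W
```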
The sum of the ith row of W represents the importance of the ith frame to the entire video; W is a sparse matrix, so each row of W is summed and the rows are sorted from large to small. The frame with the largest row sum, k1, is extracted as the image (spatial) key frame of the segment, and the five frames with the largest row sums, k1 to k5, as the optical flow key frames.
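Selecting the key frames from W then reduces to ranking the row sums, for example:

```python
import numpy as np

def select_key_frames(W, n_flow=5):
    """Rank frames by the sum of their row of W and return the spatial key
    frame index (largest row sum) and the n_flow optical flow key frames."""
    importance = W.sum(axis=1)                 # per-frame contribution to the segment
    order = np.argsort(importance)[::-1]       # descending order of importance
    return int(order[0]), order[:n_flow].tolist()
```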
Steps (4a) to (4c) are repeated for each segment of the test video, and the spatio-temporal features of each video segment are computed and numbered.
The obtained spatio-temporal feature pairs are input into the trained two-stream convolutional neural network, the obtained classification results are mean-pooled, and the prediction is finally obtained through an argmax function, completing the human behavior recognition. If videos to be tested remain, the procedure returns to step (4).
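A minimal sketch of this final fusion step is given below; equal weighting of the two streams is an assumption here, since the patent only specifies mean pooling followed by argmax.

```python
import numpy as np

def predict_behavior(spatial_scores, temporal_scores):
    """spatial_scores / temporal_scores: arrays of shape (N_segments, num_classes)
    with the per-segment outputs of the two streams. Segment scores are
    mean-pooled and the two streams are fused before the argmax."""
    fused = (np.mean(spatial_scores, axis=0) + np.mean(temporal_scores, axis=0)) / 2.0
    return int(np.argmax(fused))
```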
In summary, video features can be extracted in many ways, and the invention extracts video key frames as features. The invention improves recognition accuracy by improving the quality of the frames to be examined; the two-stream convolutional neural network greatly improves the utilization of temporal information in the video and effectively improves the accuracy of behavior recognition; the segment division is solved with a greedy algorithm and can be completed with a simple conditional loop, making the method simple and accurate; for the temporal (optical flow) features, the invention uses the 5 frames with the largest contribution, selected via the reconstruction matrix, as optical flow key frames instead of the traditional consecutive frames.

Claims (4)

1. A human behavior identification method based on video key frames is characterized in that: the method comprises the following steps in sequence:
(1) Acquiring a classified video set: downloading a UCF101 video set, wherein the video set comprises 13320 videos and 101 types of actions as a data set for action behavior recognition, and split1 is an experimental video set; selecting 25 videos with the serial numbers of 1 from each type of videos in the video set as training videos, and selecting 5 videos with the serial numbers of 2 as test videos;
(2) Dividing the video into segments based on information content: detecting the behavior subject of the video, calculating the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, the division into N segments being carried out with a greedy algorithm;
(3) Constructing a two-stream convolutional neural network and training it with randomly sampled inputs;
(4) Extracting the spatio-temporal features of the test video based on a reconstruction coefficient matrix method;
(5) Inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result;
the step (4) specifically comprises:
(4a) Extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of the segment;
(4b) Solving for the reconstruction coefficients of the high-level semantic matrix X by an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Summing each row of the obtained reconstruction coefficient matrix W; since the sum of the ith row of W reflects the importance of the ith frame of the video segment, sorting the rows in descending order of their sums, selecting the frame with the largest row sum as the spatial feature of the video segment and the first five frames as the temporal feature of the video segment;
(4d) Repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features;
the step (4 b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as:

$$\min_{W}\ \|X - XW\|_{2,1} + \gamma\,\|W\|_{2,1}\qquad \text{s.t.}\ \mathbf{1}^{T}W = \mathbf{1}^{T}$$

wherein X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two $L_{2,1}$ norm terms. To optimize the formula for W, the relaxation constraints W = C and X − XW = E are added, and the formula is converted into an ALM objective, formula (3), in which the constraints W − C = 0, $\mathbf{1}^{T}W - \mathbf{1}^{T} = 0$ and X − XW − E = 0 are enforced through Lagrange-multiplier terms and quadratic penalty terms weighted by μ;

wherein $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are the Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable is computed and set equal to 0 to obtain the closed-form solution of each variable; the solution for W is computed first as:

$$W = \left(2X^{T}X + \mu\,(I + \mathbf{1}\mathbf{1}^{T})\right)^{-1}\left(2U^{T}X + \mu\,(P + \mathbf{1}Q)\right) \tag{4}$$

wherein U, P and Q are auxiliary matrices defined in terms of the current values of the other variables and the multipliers;

the closed-form solutions for C and E, formulas (5) and (6), are then computed by the same procedure, by setting the corresponding partial derivatives of formula (3) to zero; they take the form of shrinkage (proximal) operators associated with the $L_{2,1}$ norm;

finally the Lagrange multipliers $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ are updated as follows:

$$\Lambda_1 = \Lambda_1 + \mu\,(W - C) \tag{7}$$

$$\Lambda_2 = \Lambda_2 + \mu\,(\mathbf{1}^{T}W - \mathbf{1}^{T}) \tag{8}$$

$$\Lambda_3 = \Lambda_3 + \mu\,(X - XW - E) \tag{9}$$

The parameters are initialized as W = C = 0, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\mu = 10^{-6}$, $\rho = 1.1$, $\mu_{\max} = 10^{10}$, and the closed-form updates are iterated, μ being updated as $\mu \leftarrow \min(\rho\mu,\ \mu_{\max})$ at each iteration. A threshold $\varepsilon = 10^{-8}$ is set; when $|W - C| < \varepsilon$, $|\mathbf{1}^{T}W - \mathbf{1}^{T}| < \varepsilon$ and $|X - XW - E| < \varepsilon$ all hold, the variables have stabilized and the reconstruction coefficient matrix W is obtained.
2. The human behavior recognition method based on video key frames according to claim 1, characterized in that: the step (2) specifically comprises the following steps: calculating the motion information amount M of the whole video, namely:

$$M = \sum_{x}\sum_{y}\sum_{c} \mathrm{flow}^{2}(x, y, c)$$

in the formula, $\mathrm{flow}(x, y, c)$ denotes the corresponding two-channel optical flow image within the segment and $c$ denotes the channel of the optical flow;

dividing the video into N segments, the average segment information content is $\bar{M} = M/N$, and the segment variance to be minimized is

$$D = \sum_{i=1}^{N} \left(M_i - \bar{M}\right)^{2}$$

in the formula, $M_i$ is the motion information amount of the ith segment of the video and $\bar{M}$ is the average segment information content;

an approximate solution minimizing the segment variance is obtained with a greedy algorithm: segment division is performed for the training videos (No. 1) and the test videos (No. 2); the video is first sampled into a video frame set, N empty segments are initialized, and the motion information amount $M_i$ of the ith segment is calculated; if $M_i$ is smaller than $\bar{M}$, the first frame of the video frame set is added to the segment and deleted from the set, and the information content of the segment is recalculated, until the motion information amount of the segment exceeds the average information content $\bar{M}$ for the first time; the division of segment i is then finished and the (i+1)th segment is computed in the same way; the greedy procedure is repeated until the video frame set is empty and the division of the whole video is completed.
3. The human behavior recognition method based on video key frames according to claim 1, characterized in that: the step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, and constructing the two-stream convolutional neural network means: the spatial feature network adopts the BN-Inception structure; the temporal feature network, on the basis of the BN-Inception structure, sums the convolution kernel parameters of the first convolution layer along the channel dimension, divides the result by the number of target channels, and replicates and stacks the result along the channel dimension as the parameters of the new conv1 layer; the input of the temporal network is a fixed 10 channels, obtained by stacking the x and y components of 5 optical flow frames;
the training of the two-stream convolutional neural network means: for each training video, a single video frame and a short frame sequence are randomly selected from the divided segments as inputs to the two-stream convolutional neural network; the spatio-temporal features obtained for each segment, i.e. the convolutional layer outputs, are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back propagation is performed; after all training videos have been processed, the final parameters of the two-stream convolutional neural network are obtained.
4. The human behavior recognition method based on video key frames according to claim 1, characterized in that: the step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the obtained classification results, and obtaining the prediction through an argmax function to complete the behavior recognition.
CN202010482943.8A 2020-06-01 2020-06-01 Human behavior identification method based on video key frame Active CN111626245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010482943.8A CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010482943.8A CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Publications (2)

Publication Number Publication Date
CN111626245A CN111626245A (en) 2020-09-04
CN111626245B true CN111626245B (en) 2023-04-07

Family

ID=72271841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010482943.8A Active CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Country Status (1)

Country Link
CN (1) CN111626245B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016506B (en) * 2020-09-07 2022-10-11 重庆邮电大学 Classroom attitude detection model parameter training method capable of quickly adapting to new scene
CN112329738B (en) * 2020-12-01 2024-08-16 厦门大学 Long video motion recognition method based on significant segment sampling
CN112528823B (en) * 2020-12-04 2022-08-19 燕山大学 Method and system for analyzing batcharybus movement behavior based on key frame detection and semantic component segmentation
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network
CN112733695B (en) * 2021-01-04 2023-04-25 电子科技大学 Unsupervised keyframe selection method in pedestrian re-identification field
CN113469142B (en) * 2021-03-12 2022-01-14 山西长河科技股份有限公司 Classification method, device and terminal for monitoring video time-space information fusion
CN113239869B (en) * 2021-05-31 2023-08-11 西安电子科技大学 Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN113642499B (en) * 2021-08-23 2024-05-24 中国人民解放军火箭军工程大学 Human body behavior recognition method based on computer vision
CN114373194A (en) * 2022-01-14 2022-04-19 南京邮电大学 Human behavior identification method based on key frame and attention mechanism
CN114550047B (en) * 2022-02-22 2024-04-05 西安交通大学 Behavior rate guided video behavior recognition method
CN114973020A (en) * 2022-06-15 2022-08-30 北京鹏鹄物宇科技发展有限公司 Abnormal behavior analysis method based on satellite monitoring video
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9510787B2 (en) * 2014-12-11 2016-12-06 Mitsubishi Electric Research Laboratories, Inc. Method and system for reconstructing sampled signals

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
田曼; 张艺. Research on action recognition with multi-model fusion. Electronic Measurement Technology, 2018, (20), full text. *
贾迪; 朱宁丹; 杨宁华; 吴思; 李玉秀; 赵明远. A survey of image matching methods. Journal of Image and Graphics, 2019, (05), full text. *

Also Published As

Publication number Publication date
CN111626245A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN111626245B (en) Human behavior identification method based on video key frame
Oh et al. Crowd counting with decomposed uncertainty
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
Zhou et al. Anomalynet: An anomaly detection network for video surveillance
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN108133188A (en) A kind of Activity recognition method based on motion history image and convolutional neural networks
CN108446589B (en) Face recognition method based on low-rank decomposition and auxiliary dictionary in complex environment
CN111526434B (en) Converter-based video abstraction method
CN113239869B (en) Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN111738363A (en) Alzheimer disease classification method based on improved 3D CNN network
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
Song et al. A new recurrent plug-and-play prior based on the multiple self-similarity network
CN115393231B (en) Defect image generation method and device, electronic equipment and storage medium
CN115131558B (en) Semantic segmentation method in environment with few samples
Choo et al. Multi-scale recurrent encoder-decoder network for dense temporal classification
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
Sun et al. Video snapshot compressive imaging using residual ensemble network
Jaisurya et al. Attention-based single image dehazing using improved cyclegan
CN111898614A (en) Neural network system, image signal and data processing method
CN117834852A (en) Space-time video quality evaluation method based on cross-attention multi-scale visual transformer
CN113111945A (en) Confrontation sample defense method based on transform self-encoder
CN112347965A (en) Video relation detection method and system based on space-time diagram
CN113762007A (en) Abnormal behavior detection method based on appearance and action characteristic double prediction
CN111401209A (en) Action recognition method based on deep learning
CN113963421B (en) Dynamic sequence unconstrained expression recognition method based on hybrid feature enhanced network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant