CN111626245B - Human behavior identification method based on video key frame - Google Patents
Human behavior identification method based on video key frames
- Publication number
- CN111626245B · Application CN202010482943.8A
- Authority
- CN
- China
- Prior art keywords
- video
- segment
- neural network
- double
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a human behavior recognition method based on video key frames, which comprises the following steps: acquiring a classified video set; dividing the videos into clips according to their information content; constructing a two-stream convolutional neural network and training it with random sampling; extracting the spatio-temporal features of the test videos with a coefficient-reconstruction-matrix method; and feeding the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result. The invention improves recognition accuracy by improving the quality of the frames to be detected; the two-stream convolutional neural network makes far better use of the temporal information in the video, effectively improving the accuracy of behavior recognition; the greedy algorithm allows the video clip division to be completed with a simple conditional loop, which is simple and accurate; and for the optical-flow temporal feature, the invention uses the 5 frames with the largest contribution, selected by the reconstruction matrix, as optical-flow key frames instead of the conventional consecutive frames.
Description
Technical Field
The invention relates to the technical field of machine learning, and in particular to a human behavior recognition method based on video key frames.
Background
In the era of rapid Internet development, video data in daily life has grown explosively, and videos carry a large amount of digital information. Compared with behavior recognition on images, behavior recognition on videos suffers from drawbacks such as a large computation load and redundant information, but it has greater practical significance and broader application prospects, for example in human-computer interaction, intelligent surveillance and video classification; improving the accuracy of video behavior recognition is therefore an important and difficult problem.
Since a video is composed of consecutive still images, it contains redundancy, so the video information must be screened. How to screen the video information and which network to use for recognition are important directions for improving the accuracy of video behavior recognition. Current recognition methods either feed the video directly into a neural network or divide the video into equal segments and extract key frames from them for recognition. However, these methods are computationally intensive, and the extracted key frames are not representative.
Disclosure of Invention
The invention aims to provide a human behavior recognition method based on video key frames, which divides a video into clips according to the information content of the clips, extracts useful key frames from them, and uses these key frames together with a spatio-temporal two-stream network to recognize behaviors more quickly and effectively.
To achieve this aim, the invention adopts the following technical scheme: a human behavior recognition method based on video key frames, comprising the following steps:
(1) Acquiring a classified video set: the UCF101 video set, which contains 13320 videos covering 101 action classes, is downloaded as the data set for action behavior recognition, with split1 as the experimental video set; from each action class, 25 videos numbered 1 are selected as training videos and 5 videos numbered 2 as test videos;
(2) Dividing the video into clips based on information content: the behavior subject of the video is detected, the motion information amount M of the whole video is calculated, and the video is divided into N segments with a greedy algorithm so that the segment variance D is minimized;
(3) Constructing a two-stream convolutional neural network and training it with random sampling;
(4) Extracting the spatio-temporal features of the test video based on a coefficient reconstruction matrix method;
(5) Feeding the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: the motion information amount M of the whole video is calculated as

M = Σ_x Σ_y Σ_c flow^2(x, y, c)

where flow(x, y, c) denotes the two-channel optical flow image of the segment and c indexes the optical-flow channel;

the video is divided into N segments, giving an average information amount of M/N per segment, and the segment variance D = (1/N) Σ_{i=1..N} (M_i − M/N)^2 is minimized, where M_i is the motion information amount of the i-th segment of the video and M/N is the average segment information amount;

an approximate solution is obtained by minimizing the segment variance with a greedy algorithm: segment division is performed on the training videos (No. 1) and the test videos (No. 2); the video is first sampled into frames to obtain a video frame set, and N empty division segments are initialized; the motion information amount M_i of the i-th division segment is calculated, and if it is smaller than the average M/N, the first frame of the video frame set is appended to the segment and deleted from the frame set, and the segment information amount is recalculated, until the motion information amount of the segment first exceeds the average information amount; the division of segment i is then finished, the (i+1)-th division segment is calculated, and the greedy procedure is repeated until the video frame set is empty and the division of the whole video is complete.
The step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, and constructing the two-stream convolutional neural network means: the spatial feature network adopts a BN-Inception structure; the temporal feature network, starting from the same BN-Inception structure, sums the convolution kernel parameters of the first convolutional layer along the channel dimension, divides the result by the number of target channels, and replicates and stacks it along the channel dimension as the parameters of the new conv1 layer; the input of this network is a fixed 10 channels, obtained by stacking 5 optical-flow frames in the x and y directions;

the training of the two-stream convolutional neural network is as follows: for each training video, a single video frame and a frame sequence are randomly selected from the divided segments as inputs to the two-stream convolutional neural network; the spatio-temporal features obtained for each segment, namely the convolutional-layer outputs, are mean-pooled, the pooled result is fed to the loss function to compute the loss, and back-propagation is performed; after all training videos have been processed, the final parameters of the two-stream convolutional neural network are obtained.
The step (4) specifically comprises:
(4a) For each divided segment of the test video, 680-dimensional PHOG features are extracted from every frame, and all frame features are combined into the high-level semantic matrix X of the segment;
(4b) The reconstruction coefficients of the high-level semantic matrix X are solved with an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Each row of the obtained reconstruction coefficient matrix W is summed; since the sum of the i-th row of W indicates the importance of the i-th frame of the video clip, the rows are sorted from large to small, the frame with the largest row sum is taken as the spatial feature of the video clip, and the first five frames in this ordering are taken as its temporal feature;
(4d) Steps (4a) to (4c) are repeated to obtain the corresponding spatio-temporal features for each segment of the test video.
The step (5) specifically comprises the following steps: the obtained spatio-temporal features are fed into the trained two-stream convolutional neural network, the resulting classification scores are mean-pooled, and the prediction is obtained with an argmax function, completing behavior recognition.
The step (4b) specifically comprises the following steps:

the coefficient reconstruction formula is constructed as

min_W ||X − XW||_{2,1} + γ ||W||_{2,1}  subject to  1^T W = 1^T

where X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two L2,1-norm terms. To optimize it for W, the relaxation constraints W = C and X − XW = E are introduced, and the formula is converted into the following ALM objective:

L(W, C, E) = ||E||_{2,1} + γ||C||_{2,1} + <Λ1, W − C> + <Λ2, 1^T W − 1^T> + <Λ3, X − XW − E> + (μ/2)(||W − C||_F^2 + ||1^T W − 1^T||_2^2 + ||X − XW − E||_F^2)   (3)

where Λ1, Λ2 and Λ3 are the Lagrange multipliers and μ > 0 is a penalty parameter; the partial derivative of formula (3) with respect to each variable is computed and set to 0, giving a closed-form solution for each variable. The solution for W is obtained first as

W = (2X^T X + μ(I + 11^T))^{-1} (2U^T X + μ(P + 1Q))   (4)

where U, P and Q are auxiliary matrices assembled from the current values of C, E and the Lagrange multipliers; the closed-form solutions for C and E are then obtained by the same calculation, and finally the Lagrange multipliers Λ1, Λ2 and Λ3 are updated as:

Λ1 = Λ1 + μ(W − C)   (7)
Λ2 = Λ2 + μ(1^T W − 1^T)   (8)
Λ3 = Λ3 + μ(X − XW − E)   (9)

The parameters are initialized as W = C = 0, Λ1 = Λ2 = Λ3 = 0, μ = 10^-6, ρ = 1.1, max_μ = 10^10, and the closed-form solutions are iterated, with μ updated to min(ρμ, max_μ) at each iteration; a threshold ε = 10^-8 is set, and when |W − C| < ε, |1^T W − 1^T| < ε and |X − XW − E| < ε all hold, the parameters have stabilized and the reconstruction coefficient matrix W is obtained.
According to the above technical scheme, the beneficial effects of the invention are as follows: first, the video key frame method improves recognition accuracy by improving the quality of the frames to be detected; second, the two-stream convolutional neural network makes far better use of the temporal information in the video, effectively improving the accuracy of behavior recognition; third, the method computes dense optical flow for the motion subject and minimizes the segment variance, so that the information content of the divided video segments is uniform; the greedy algorithm completes the division of the video segments with a simple conditional loop, which is simple and accurate; and for the optical-flow temporal feature, the invention uses the 5 frames with the largest contribution, selected by the reconstruction matrix, as optical-flow key frames instead of the conventional consecutive frames.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of the video feature extraction.
Detailed Description
As shown in FIG. 1, a human behavior recognition method based on video key frames comprises the following steps:
(1) Acquiring a classified video set: the UCF101 video set, which contains 13320 videos covering 101 action classes, is downloaded as the data set for action behavior recognition; split1, the most commonly used training/test grouping, is the experimental video set. From each action class, 25 videos numbered 1 are selected as training videos and 5 videos numbered 2 as test videos. UCF101 is a large human action video data set with great diversity in how the actions were captured, including camera motion, appearance change, pose change, object scale change, background change and illumination change; the test videos come from split1. A minimal sketch of this train/test selection is given below.
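A minimal sketch of this selection, assuming the standard UCF101 file naming pattern v_<Action>_g<group>_c<clip>.avi and reading "numbered 1" / "numbered 2" as clip numbers c01 / c02 (this reading of the numbering, and the helper below, are assumptions rather than the patent's own procedure):

```python
import os
import re
from collections import defaultdict

def split_ucf101(video_dir):
    """Pick training (clip c01) and test (clip c02) videos for each action class."""
    pattern = re.compile(r"v_(?P<action>\w+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi")
    train, test = defaultdict(list), defaultdict(list)
    for name in sorted(os.listdir(video_dir)):
        m = pattern.match(name)
        if not m:
            continue
        action, clip = m.group("action"), int(m.group("clip"))
        if clip == 1 and len(train[action]) < 25:   # 25 training videos per class
            train[action].append(name)
        elif clip == 2 and len(test[action]) < 5:   # 5 test videos per class
            test[action].append(name)
    return train, test
```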
(2) Dividing the video into clips based on information content: the behavior subject of the video is detected, the motion information amount M of the whole video is calculated, and the video is divided into N segments with a greedy algorithm so that the segment variance D is minimized;
(3) Constructing a two-stream convolutional neural network and training it with random sampling;
(4) Extracting the spatio-temporal features of the test video based on a coefficient reconstruction matrix method;
(5) Feeding the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: the motion information of a behavior is not uniform in the time domain; the behavior subject accounts for the vast majority of the information in the video, and the motion state of the subject is expressed by dense optical flow. Since the total information amount of the video is fixed, the average segment information amount in the ideal case can be obtained; to divide the video into N segments according to information content, the difference between each divided segment's information amount and the average must be reduced, i.e. the segment variance D must be minimized.

The motion information amount M of the whole video is calculated as

M = Σ_x Σ_y Σ_c flow^2(x, y, c)

where flow(x, y, c) denotes the two-channel optical flow image of the segment and c indexes the optical-flow channel. The video is divided into N segments, giving an average information amount of M/N per segment, and the segment variance D = (1/N) Σ_{i=1..N} (M_i − M/N)^2 is minimized, where M_i is the motion information amount of the i-th segment of the video and M/N is the average segment information amount.

An approximate solution is obtained by minimizing the segment variance with a greedy algorithm: segment division is performed on the training videos (No. 1) and the test videos (No. 2); the video is first sampled into frames to obtain a video frame set, and N empty division segments are initialized; the motion information amount M_i of the i-th division segment is calculated, and if it is smaller than the average M/N, the first frame of the video frame set is appended to the segment and deleted from the frame set, and the segment information amount is recalculated, until the motion information amount of the segment first exceeds the average information amount; the division of segment i is then finished, the (i+1)-th division segment is calculated, and the greedy procedure is repeated until the video frame set is empty and the division of the whole video is complete. A minimal sketch of this greedy loop follows below.
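A minimal sketch of this greedy clip division, assuming the per-frame motion information has already been computed (for example as the per-frame sum of squared optical-flow values); the helper name frame_motion is illustrative and not taken from the patent:

```python
import numpy as np

def divide_video(frame_motion, n_segments):
    """Greedily divide a video into n_segments clips of roughly equal motion information.

    frame_motion: 1-D array with the motion information amount of each sampled frame
    returns: a list of frame-index lists, one per segment
    """
    frame_motion = np.asarray(frame_motion, dtype=float)
    target = frame_motion.sum() / n_segments      # average segment information amount M/N
    frames = list(range(len(frame_motion)))       # the video frame set
    segments = [[] for _ in range(n_segments)]

    for seg in segments:
        seg_info = 0.0
        # append frames until the segment's information first exceeds the average
        while frames and seg_info < target:
            f = frames.pop(0)                     # take the first frame of the frame set
            seg.append(f)
            seg_info += frame_motion[f]
    segments[-1].extend(frames)                   # any leftover frames go to the last segment
    return segments
```

With roughly uniform per-frame motion this yields segments of similar length; where the motion is concentrated, the segments covering the action become shorter.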
The two-stream convolutional neural network understands video information by simulating the human visual process: on top of processing the environmental spatial information in the video images, it understands the temporal information in the video frame sequence, and to better exploit both kinds of information the behavior classification task is split into two different parts. The two-stream convolutional neural network constructed by the invention is divided into a spatial feature network, whose input is a single RGB frame of fixed size, and a temporal feature network, whose input is a 5-frame optical-flow stack. The two-stream convolutional neural network is trained with the labeled UCF101 training set.
The step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, and the two-stream convolutional neural network is constructed as follows:
constructing a spatial feature network: the spatial feature network adopts a BN-inclusion structure, namely an image network, and the network selects the BN-inclusion structure which has good performance on an image classification task. According to the network, two convolutions of 3x3 are used for replacing a large convolution of 5x5, the number of parameters is reduced, overfitting is reduced, a BN layer is introduced, the BN method is a very effective regularization method, the training speed of the large convolutional network can be accelerated by many times, and meanwhile the classification accuracy rate after convergence can be greatly improved.
Constructing the temporal feature network: the temporal feature network is the optical-flow network and adopts a modified BN-Inception structure. Starting from BN-Inception, the convolution kernel parameters of the first convolutional layer are summed along the channel dimension, the result is divided by the number of target channels, and it is replicated and stacked along the channel dimension as the parameters of the new conv1 layer; the input of the network is a fixed 10 channels, obtained by stacking 5 optical-flow frames in the x and y directions. A minimal sketch of this first-layer initialization follows below.
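A minimal sketch of this cross-modality initialization of the temporal network's first layer, assuming a PyTorch backbone whose first layer is a standard Conv2d (the function and variable names are illustrative, not the patent's):

```python
import torch.nn as nn

def adapt_conv1_for_flow(rgb_conv1: nn.Conv2d, flow_channels: int = 10) -> nn.Conv2d:
    """Build the temporal network's conv1 from a pretrained RGB conv1.

    The RGB kernels are summed over their 3 input channels, divided by the
    target channel count, and replicated along the channel dimension.
    """
    w = rgb_conv1.weight.data                            # (out_channels, 3, kH, kW)
    mean_w = w.sum(dim=1, keepdim=True) / flow_channels  # sum over channels, divide by target count
    new_w = mean_w.repeat(1, flow_channels, 1, 1)        # replicate to (out_channels, 10, kH, kW)

    flow_conv1 = nn.Conv2d(flow_channels, rgb_conv1.out_channels,
                           kernel_size=rgb_conv1.kernel_size,
                           stride=rgb_conv1.stride,
                           padding=rgb_conv1.padding,
                           bias=rgb_conv1.bias is not None)
    flow_conv1.weight.data.copy_(new_w)
    if rgb_conv1.bias is not None:
        flow_conv1.bias.data.copy_(rgb_conv1.bias.data)
    return flow_conv1
```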
the training of the double-current convolutional neural network is as follows: for a test video, a single video frame and a video sequence are randomly selected from the divided segments to serve as the input of a double-current convolutional neural network, the output of a convolutional layer, which is the space-time characteristic obtained by each segment, is respectively subjected to mean pooling, the pooling result is used as the input of a loss function to calculate loss, back propagation is carried out, and all training video sets are trained to obtain the final double-current convolutional neural network parameters.
The video is divided into N segments; from each segment, one frame is randomly extracted as the input of the spatial network and 5 frames as the input of the temporal feature network. The output of the last convolutional layer of each network serves as the network output; the spatio-temporal features obtained for each segment, i.e. the convolutional-layer outputs, are mean-pooled, and the pooled result is fed to the loss function to compute the loss and perform back-propagation. The loss function is the softmax cross entropy, and training uses stochastic gradient descent.
The randomly sampled segment features of each video are trained on repeatedly until all training samples have been used, completing the parameter tuning of the two-stream convolutional neural network. Random-sampling training lets the two-stream convolutional neural network learn more features and gives it a higher fault tolerance. A minimal training-step sketch follows below.
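A minimal single-video training-step sketch under the construction above; spatial_net and temporal_net stand for the two BN-Inception-style streams (placeholders, not code from the patent), and summing the two streams' losses in one step is an assumption — the patent is equally consistent with training the two streams separately:

```python
import torch
import torch.nn.functional as F

def train_step(spatial_net, temporal_net, optimizer, rgb_frames, flow_stacks, label):
    """One SGD step on one training video.

    rgb_frames:  tensor (N, 3, H, W)  - one random frame per divided segment
    flow_stacks: tensor (N, 10, H, W) - 5 stacked x/y optical-flow frames per segment
    label:       tensor of shape (1,) - action class index
    """
    optimizer.zero_grad()
    # per-segment outputs are mean-pooled across the N segments before the loss
    logits_s = spatial_net(rgb_frames).mean(dim=0, keepdim=True)    # (1, num_classes)
    logits_t = temporal_net(flow_stacks).mean(dim=0, keepdim=True)  # (1, num_classes)
    loss = F.cross_entropy(logits_s, label) + F.cross_entropy(logits_t, label)  # softmax cross entropy
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here optimizer would typically be torch.optim.SGD over the parameters of both streams.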
The invention provides a linear reconstruction framework to extract video key frames; the key idea is to represent all frame feature vectors as linear combinations of a small number of basis vectors. High-level semantic information is extracted from every frame of the video segment and assembled into a segment feature matrix. A video key frame is a frame (or frame sequence) whose content can represent the whole segment, so the contribution of each frame to the whole segment is computed, and frames with high contribution are taken as key frames. The framework consists of two parts: a linear reconstruction function and a regularizer, solved through a structured sparse representation based on the L2,1 norm. As shown in FIG. 2, the step (4) specifically includes:
(4a) For each divided segment of the test video, 680-dimensional PHOG features are extracted from every frame, and all frame features are combined into the high-level semantic matrix X of the segment;

the video clips are pre-sampled to reduce the computation load of the subsequent algorithms and to turn the video into consecutive still picture frames. Suppose the i-th segment v_i of video v contains j frames in total; each frame is encoded with the PHOG descriptor into a 680-dimensional feature vector. The feature vectors of all frames are stacked to form the high-level semantic matrix X of the segment, of size j x 680, whose i-th row is the feature of the i-th frame of segment v_i. A minimal sketch of assembling X follows below.
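A minimal sketch of assembling X, assuming a helper phog that returns the 680-dimensional PHOG descriptor of a frame (this helper is hypothetical; any PHOG implementation with that output dimension could be substituted):

```python
import numpy as np

def segment_feature_matrix(frames, phog):
    """Stack per-frame PHOG descriptors into the segment's high-level semantic matrix X.

    frames: list of j images (e.g. numpy arrays) belonging to one divided segment
    phog:   callable mapping a frame to a 680-dimensional descriptor
    returns: X of shape (j, 680); row i is the descriptor of frame i
    """
    X = np.stack([np.asarray(phog(f), dtype=float) for f in frames], axis=0)
    assert X.shape[1] == 680, "the PHOG descriptor is expected to be 680-dimensional"
    return X
```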
(4b) The reconstruction coefficients of the high-level semantic matrix X are solved with an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Each row of the obtained reconstruction coefficient matrix W is summed; since the sum of the i-th row of W indicates the importance of the i-th frame of the video clip, the rows are sorted from large to small, the frame with the largest row sum is taken as the spatial feature of the video clip, and the first five frames in this ordering are taken as its temporal feature;
(4d) Steps (4a) to (4c) are repeated to obtain the corresponding spatio-temporal features for each segment of the test video.
The step (5) specifically comprises the following steps: the obtained spatio-temporal features are fed into the trained two-stream convolutional neural network, the resulting classification scores are mean-pooled, and the prediction is obtained with an argmax function, completing behavior recognition.
The step (4b) specifically comprises the following steps:

the coefficient reconstruction formula is constructed as

min_W ||X − XW||_{2,1} + γ ||W||_{2,1}  subject to  1^T W = 1^T

where X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two L2,1-norm terms. To optimize it for W, the relaxation constraints W = C and X − XW = E are introduced, and the formula is converted into the following ALM objective:

L(W, C, E) = ||E||_{2,1} + γ||C||_{2,1} + <Λ1, W − C> + <Λ2, 1^T W − 1^T> + <Λ3, X − XW − E> + (μ/2)(||W − C||_F^2 + ||1^T W − 1^T||_2^2 + ||X − XW − E||_F^2)   (3)

where Λ1, Λ2 and Λ3 are the Lagrange multipliers and μ > 0 is a penalty parameter; the partial derivative of formula (3) with respect to each variable is computed and set to 0, giving a closed-form solution for each variable. The solution for W is obtained first as

W = (2X^T X + μ(I + 11^T))^{-1} (2U^T X + μ(P + 1Q))   (4)

where U, P and Q are auxiliary matrices assembled from the current values of C, E and the Lagrange multipliers; the closed-form solutions for C and E are then obtained by the same calculation, and finally the Lagrange multipliers Λ1, Λ2 and Λ3 are updated as:

Λ1 = Λ1 + μ(W − C)   (7)
Λ2 = Λ2 + μ(1^T W − 1^T)   (8)
Λ3 = Λ3 + μ(X − XW − E)   (9)

The parameters are initialized as W = C = 0, Λ1 = Λ2 = Λ3 = 0, μ = 10^-6, ρ = 1.1, max_μ = 10^10, and the closed-form solutions are iterated, with μ updated to min(ρμ, max_μ) at each iteration; a threshold ε = 10^-8 is set, and when |W − C| < ε, |1^T W − 1^T| < ε and |X − XW − E| < ε all hold, the parameters have stabilized and the reconstruction coefficient matrix W is obtained. A minimal sketch of the outer ALM iteration follows below.
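A minimal sketch of the outer ALM iteration described above; the per-variable closed-form updates are left as placeholder callables (update_W, update_C and update_E are assumptions standing in for the closed-form solutions of W, C and E), and X is taken here with one frame descriptor per column so that X − XW and 1^T W = 1^T are dimensionally consistent:

```python
import numpy as np

def alm_reconstruction(X, update_W, update_C, update_E,
                       mu=1e-6, rho=1.1, max_mu=1e10, eps=1e-8, max_iter=1000):
    """Outer ALM loop returning the reconstruction coefficient matrix W.

    update_W/update_C/update_E: callables implementing the closed-form solution
    of each variable given the others and the current multipliers.
    """
    n = X.shape[1]                                   # number of frames in the segment
    W, C, E = np.zeros((n, n)), np.zeros((n, n)), np.zeros_like(X)
    L1, L2, L3 = np.zeros((n, n)), np.zeros((1, n)), np.zeros_like(X)
    one = np.ones((1, n))

    for _ in range(max_iter):
        W = update_W(X, C, E, L1, L2, L3, mu)
        C = update_C(W, L1, mu)
        E = update_E(X, W, L3, mu)
        # multiplier updates, equations (7)-(9)
        L1 = L1 + mu * (W - C)
        L2 = L2 + mu * (one @ W - one)
        L3 = L3 + mu * (X - X @ W - E)
        mu = min(rho * mu, max_mu)                   # penalty parameter update
        # convergence test with threshold eps
        if (np.abs(W - C).max() < eps and np.abs(one @ W - one).max() < eps
                and np.abs(X - X @ W - E).max() < eps):
            break
    return W
```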
The sum of the i-th row of W represents the importance of the i-th frame to the entire video clip; W is a sparse matrix, and the rows of W are summed and then sorted from large to small. The frame with the largest row sum, k1, is extracted as the image (spatial) key frame, and the five largest frames k1 to k5 as the optical-flow key frames, as sketched below.
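A minimal sketch of this selection from the reconstruction coefficient matrix W (row sums as frame importance, the top frame as the spatial key frame and the top five as the optical-flow key frames):

```python
import numpy as np

def select_key_frames(W, n_temporal=5):
    """Pick the key frames of one segment from its reconstruction coefficient matrix W."""
    importance = W.sum(axis=1)                   # sum of row i = importance of frame i
    order = np.argsort(importance)[::-1]         # sort frames from large to small
    spatial_key = int(order[0])                  # k1: image (spatial) key frame
    temporal_keys = [int(k) for k in order[:n_temporal]]  # k1..k5: optical-flow key frames
    return spatial_key, temporal_keys
```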
Steps (4a) to (4c) are repeated for each segment of the test video, and the spatio-temporal features of every video segment are obtained and numbered.
The obtained spatio-temporal feature pairs are fed into the trained two-stream convolutional neural network, the resulting classification results are mean-pooled, and the prediction is finally obtained with an argmax function, completing the visual human behavior recognition. If videos to be tested remain, the procedure returns to step (4). A minimal sketch of this fusion step follows below.
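A minimal sketch of this final step; fusing the two streams by adding their softmax scores before the segment-level mean pooling is an assumption — the patent only specifies mean pooling of the classification results followed by argmax:

```python
import torch

@torch.no_grad()
def predict(spatial_net, temporal_net, rgb_keys, flow_keys):
    """Fuse the two streams' per-segment classification results and return the action class.

    rgb_keys:  tensor (N, 3, H, W)  - spatial key frame of each of the N segments
    flow_keys: tensor (N, 10, H, W) - stacked optical-flow key frames of each segment
    """
    scores = spatial_net(rgb_keys).softmax(dim=1) + temporal_net(flow_keys).softmax(dim=1)
    mean_scores = scores.mean(dim=0)     # mean-pool the per-segment classification results
    return int(mean_scores.argmax())     # argmax gives the predicted behavior class
```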
In summary, there are many ways to extract video features; the invention extracts video key frames as the features. The invention improves recognition accuracy by improving the quality of the frames to be detected; the two-stream convolutional neural network makes far better use of the temporal information in the video, effectively improving the accuracy of behavior recognition; the greedy algorithm completes the video segment division with a simple conditional loop, which is simple and accurate; and for the optical-flow temporal feature, the invention uses the 5 frames with the largest contribution, selected by the reconstruction matrix, as optical-flow key frames instead of the conventional consecutive frames.
Claims (4)
1. A human behavior recognition method based on video key frames, characterized in that the method comprises the following steps in sequence:
(1) Acquiring a classified video set: the UCF101 video set, which contains 13320 videos covering 101 action classes, is downloaded as the data set for action behavior recognition, with split1 as the experimental video set; from each action class, 25 videos numbered 1 are selected as training videos and 5 videos numbered 2 as test videos;
(2) Dividing the video into clips based on information content: the behavior subject of the video is detected, the motion information amount M of the whole video is calculated, and the video is divided into N segments with a greedy algorithm so that the segment variance D is minimized;
(3) Constructing a two-stream convolutional neural network and training it with random sampling;
(4) Extracting the spatio-temporal features of the test video based on a coefficient reconstruction matrix method;
(5) Feeding the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result;
the step (4) specifically comprises:
(4a) For each divided segment of the test video, 680-dimensional PHOG features are extracted from every frame, and all frame features are combined into the high-level semantic matrix X of the segment;
(4b) The reconstruction coefficients of the high-level semantic matrix X are solved with an iterative closed-form solution method to obtain the reconstruction coefficient matrix W of the video segment;
(4c) Each row of the obtained reconstruction coefficient matrix W is summed; since the sum of the i-th row of W indicates the importance of the i-th frame of the video clip, the rows are sorted from large to small, the frame with the largest row sum is taken as the spatial feature of the video clip, and the first five frames in this ordering are taken as its temporal feature;
(4d) Steps (4a) to (4c) are repeated to obtain the corresponding spatio-temporal features for each segment of the test video;
the step (4b) specifically comprises the following steps:
the coefficient reconstruction formula is constructed as

min_W ||X − XW||_{2,1} + γ ||W||_{2,1}  subject to  1^T W = 1^T

where X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the formula is the sum of two L2,1-norm terms; to optimize it for W, the relaxation constraints W = C and X − XW = E are introduced, and the formula is converted into the following ALM objective:

L(W, C, E) = ||E||_{2,1} + γ||C||_{2,1} + <Λ1, W − C> + <Λ2, 1^T W − 1^T> + <Λ3, X − XW − E> + (μ/2)(||W − C||_F^2 + ||1^T W − 1^T||_2^2 + ||X − XW − E||_F^2)   (3)

where Λ1, Λ2 and Λ3 are the Lagrange multipliers and μ > 0 is a penalty parameter; the partial derivative of formula (3) with respect to each variable is computed and set to 0, giving a closed-form solution for each variable; the solution for W is obtained first as

W = (2X^T X + μ(I + 11^T))^{-1} (2U^T X + μ(P + 1Q))   (4)

where U, P and Q are auxiliary matrices assembled from the current values of C, E and the Lagrange multipliers; the closed-form solutions for C and E are then obtained by the same calculation, and finally the Lagrange multipliers Λ1, Λ2 and Λ3 are updated as:

Λ1 = Λ1 + μ(W − C)   (7)
Λ2 = Λ2 + μ(1^T W − 1^T)   (8)
Λ3 = Λ3 + μ(X − XW − E)   (9)

the parameters are initialized as W = C = 0, Λ1 = Λ2 = Λ3 = 0, μ = 10^-6, ρ = 1.1, max_μ = 10^10, and the closed-form solutions are iterated, with μ updated to min(ρμ, max_μ) at each iteration; a threshold ε = 10^-8 is set, and when |W − C| < ε, |1^T W − 1^T| < ε and |X − XW − E| < ε all hold, the parameters have stabilized and the reconstruction coefficient matrix W is obtained.
2. The human behavior recognition method based on video key frames according to claim 1, characterized in that the step (2) specifically comprises the following steps: the motion information amount M of the whole video is calculated as

M = Σ_x Σ_y Σ_c flow^2(x, y, c)

where flow(x, y, c) denotes the two-channel optical flow image of the segment and c indexes the optical-flow channel;

the video is divided into N segments, giving an average information amount of M/N per segment, and the segment variance D = (1/N) Σ_{i=1..N} (M_i − M/N)^2 is minimized, where M_i is the motion information amount of the i-th segment of the video and M/N is the average segment information amount;

an approximate solution is obtained by minimizing the segment variance with a greedy algorithm: segment division is performed on the training videos (No. 1) and the test videos (No. 2); the video is first sampled into frames to obtain a video frame set, and N empty division segments are initialized; the motion information amount M_i of the i-th division segment is calculated, and if it is smaller than the average M/N, the first frame of the video frame set is appended to the segment and deleted from the frame set, and the segment information amount is recalculated, until the motion information amount of the segment first exceeds the average information amount; the division of segment i is then finished, the (i+1)-th division segment is calculated, and the greedy procedure is repeated until the video frame set is empty and the division of the whole video is complete.
3. The human behavior recognition method based on video key frames according to claim 1, characterized in that the step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, and constructing the two-stream convolutional neural network means: the spatial feature network adopts a BN-Inception structure; the temporal feature network, starting from the same BN-Inception structure, sums the convolution kernel parameters of the first convolutional layer along the channel dimension, divides the result by the number of target channels, and replicates and stacks it along the channel dimension as the parameters of the new conv1 layer; the input of this network is a fixed 10 channels, obtained by stacking 5 optical-flow frames in the x and y directions;

the training of the two-stream convolutional neural network is as follows: for each training video, a single video frame and a frame sequence are randomly selected from the divided segments as inputs to the two-stream convolutional neural network; the spatio-temporal features obtained for each segment, namely the convolutional-layer outputs, are mean-pooled, the pooled result is fed to the loss function to compute the loss, and back-propagation is performed; after all training videos have been processed, the final parameters of the two-stream convolutional neural network are obtained.
4. The human behavior recognition method based on video key frames according to claim 1, characterized in that the step (5) specifically comprises the following steps: the obtained spatio-temporal features are fed into the trained two-stream convolutional neural network, the resulting classification scores are mean-pooled, and the prediction is obtained with an argmax function, completing behavior recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010482943.8A CN111626245B (en) | 2020-06-01 | 2020-06-01 | Human behavior identification method based on video key frame |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010482943.8A CN111626245B (en) | 2020-06-01 | 2020-06-01 | Human behavior identification method based on video key frame |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111626245A CN111626245A (en) | 2020-09-04 |
CN111626245B true CN111626245B (en) | 2023-04-07 |
Family
ID=72271841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010482943.8A Active CN111626245B (en) | 2020-06-01 | 2020-06-01 | Human behavior identification method based on video key frame |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111626245B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016506B (en) * | 2020-09-07 | 2022-10-11 | 重庆邮电大学 | Classroom attitude detection model parameter training method capable of quickly adapting to new scene |
CN112329738B (en) * | 2020-12-01 | 2024-08-16 | 厦门大学 | Long video motion recognition method based on significant segment sampling |
CN112528823B (en) * | 2020-12-04 | 2022-08-19 | 燕山大学 | Method and system for analyzing batcharybus movement behavior based on key frame detection and semantic component segmentation |
CN113239822A (en) * | 2020-12-28 | 2021-08-10 | 武汉纺织大学 | Dangerous behavior detection method and system based on space-time double-current convolutional neural network |
CN112733695B (en) * | 2021-01-04 | 2023-04-25 | 电子科技大学 | Unsupervised keyframe selection method in pedestrian re-identification field |
CN113469142B (en) * | 2021-03-12 | 2022-01-14 | 山西长河科技股份有限公司 | Classification method, device and terminal for monitoring video time-space information fusion |
CN113239869B (en) * | 2021-05-31 | 2023-08-11 | 西安电子科技大学 | Two-stage behavior recognition method and system based on key frame sequence and behavior information |
CN113642499B (en) * | 2021-08-23 | 2024-05-24 | 中国人民解放军火箭军工程大学 | Human body behavior recognition method based on computer vision |
CN114373194A (en) * | 2022-01-14 | 2022-04-19 | 南京邮电大学 | Human behavior identification method based on key frame and attention mechanism |
CN114550047B (en) * | 2022-02-22 | 2024-04-05 | 西安交通大学 | Behavior rate guided video behavior recognition method |
CN114973020A (en) * | 2022-06-15 | 2022-08-30 | 北京鹏鹄物宇科技发展有限公司 | Abnormal behavior analysis method based on satellite monitoring video |
CN115393660B (en) * | 2022-10-28 | 2023-02-24 | 松立控股集团股份有限公司 | Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598598A (en) * | 2019-08-30 | 2019-12-20 | 西安理工大学 | Double-current convolution neural network human behavior identification method based on finite sample set |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9510787B2 (en) * | 2014-12-11 | 2016-12-06 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for reconstructing sampled signals |
-
2020
- 2020-06-01 CN CN202010482943.8A patent/CN111626245B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598598A (en) * | 2019-08-30 | 2019-12-20 | 西安理工大学 | Double-current convolution neural network human behavior identification method based on finite sample set |
Non-Patent Citations (2)
Title |
---|
田曼; 张艺. Multi-model fusion action recognition research. Electronic Measurement Technology, 2018, (20), full text. *
贾迪; 朱宁丹; 杨宁华; 吴思; 李玉秀; 赵明远. A survey of image matching methods. Journal of Image and Graphics, 2019, (05), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN111626245A (en) | 2020-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111626245B (en) | Human behavior identification method based on video key frame | |
Oh et al. | Crowd counting with decomposed uncertainty | |
CN107341452B (en) | Human behavior identification method based on quaternion space-time convolution neural network | |
Zhou et al. | Anomalynet: An anomaly detection network for video surveillance | |
CN113221641B (en) | Video pedestrian re-identification method based on generation of antagonism network and attention mechanism | |
CN108133188A (en) | A kind of Activity recognition method based on motion history image and convolutional neural networks | |
CN108446589B (en) | Face recognition method based on low-rank decomposition and auxiliary dictionary in complex environment | |
CN111526434B (en) | Converter-based video abstraction method | |
CN113239869B (en) | Two-stage behavior recognition method and system based on key frame sequence and behavior information | |
CN111738363A (en) | Alzheimer disease classification method based on improved 3D CNN network | |
CN112200096B (en) | Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video | |
Song et al. | A new recurrent plug-and-play prior based on the multiple self-similarity network | |
CN115393231B (en) | Defect image generation method and device, electronic equipment and storage medium | |
CN115131558B (en) | Semantic segmentation method in environment with few samples | |
Choo et al. | Multi-scale recurrent encoder-decoder network for dense temporal classification | |
CN111242068A (en) | Behavior recognition method and device based on video, electronic equipment and storage medium | |
Sun et al. | Video snapshot compressive imaging using residual ensemble network | |
Jaisurya et al. | Attention-based single image dehazing using improved cyclegan | |
CN111898614A (en) | Neural network system, image signal and data processing method | |
CN117834852A (en) | Space-time video quality evaluation method based on cross-attention multi-scale visual transformer | |
CN113111945A (en) | Confrontation sample defense method based on transform self-encoder | |
CN112347965A (en) | Video relation detection method and system based on space-time diagram | |
CN113762007A (en) | Abnormal behavior detection method based on appearance and action characteristic double prediction | |
CN111401209A (en) | Action recognition method based on deep learning | |
CN113963421B (en) | Dynamic sequence unconstrained expression recognition method based on hybrid feature enhanced network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |