CN111626245A - Human behavior identification method based on video key frame - Google Patents

Human behavior identification method based on video key frame

Info

Publication number
CN111626245A
Authority
CN
China
Prior art keywords: video, segment, neural network, double, convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010482943.8A
Other languages
Chinese (zh)
Other versions
CN111626245B (en)
Inventor
丁转莲
盛漫锦
石家毓
孙登第
吴心悦
沈心怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202010482943.8A
Publication of CN111626245A
Application granted
Publication of CN111626245B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human behavior identification method based on video key frames, which comprises the following steps: acquiring a classified video set; dividing the video into clips based on information content; constructing a two-stream convolutional neural network and training it with random sampling; extracting the spatio-temporal features of the test video with a coefficient reconstruction matrix method; and inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result. The invention improves recognition accuracy by improving the quality of the frames to be examined; the two-stream convolutional neural network makes much fuller use of the temporal information in the video and effectively improves the accuracy of behavior recognition; a greedy algorithm completes the video segment division with a simple conditional loop, which is simple and accurate; and for the optical-flow temporal feature, the invention uses the 5 frames with the largest contribution selected by the reconstruction matrix, rather than the traditional consecutive frames, as the optical-flow key frames.

Description

Human behavior identification method based on video key frame
Technical Field
The invention relates to the technical field of machine learning, in particular to a human behavior identification method based on video key frames.
Background
In the era of rapid internet development, video data in daily life has grown explosively, and videos carry a large amount of digital information. Compared with behavior recognition on still images, behavior recognition on videos suffers from drawbacks such as heavy computation and redundant information, yet it has greater practical significance and broader application prospects, for example in human-computer interaction, intelligent surveillance and video classification. Improving the accuracy of video behavior recognition is therefore an important and difficult problem.
Since a video is composed of consecutive still images, it contains redundancy, so the video information must be filtered. How to filter the video information and which network to use for recognition are important questions for improving the accuracy of video behavior recognition. Current methods either feed the video directly into a neural network or divide the video into equal-length segments and extract key frames for recognition. However, these methods are computationally expensive and the extracted key frames are not representative.
Disclosure of Invention
The invention aims to provide a human behavior recognition method based on video key frames, which divides a video into segments according to their information content, extracts useful key frames from these segments, and uses the key frames together with a spatio-temporal two-stream network to recognize behaviors more quickly and effectively.
In order to achieve the purpose, the invention adopts the following technical scheme: a human behavior recognition method based on video key frames comprises the following steps:
(1) Acquire a classified video set: download the UCF101 video set, which contains 13,320 videos covering 101 action classes, as the data set for action recognition, with split1 as the experimental split; from each action class, select 25 videos numbered 1 as training videos and 5 videos numbered 2 as test videos;
(2) Divide the video into clips based on information content: detect the behavior subject of the video, compute the motion information amount M of the whole video, divide the video into N segments while minimizing the segment variance D, and carry out the division into N segments with a greedy algorithm;
(3) Construct a two-stream convolutional neural network and train it with random sampling;
(4) Extract the spatio-temporal features of the test video with a coefficient reconstruction matrix method;
(5) Input the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps: calculate the motion information amount M of the whole video, namely

M = Σ_{x,y,c} |flow_2(x, y, c)|    (1)

where flow_2(x, y, c) denotes the corresponding two-channel optical flow image within the segment and c denotes the optical flow channel;

divide the video into N segments, so that the average segment information amount is M̄ = M/N, and minimize the segment variance

D = (1/N) Σ_{i=1}^{N} (M_i - M̄)^2

where M_i is the motion information amount of the i-th segment of the video and M̄ is the average segment information amount;
An approximate solution minimizing the segment variance is obtained with a greedy algorithm: both the training videos (No. 1) and the test videos (No. 2) are divided into segments. The video is first sampled into frames to obtain a video frame set, and N division segments are initialized with empty content. The motion information amount M_i of the i-th division segment is computed; while M_i < M̄, the first frame of the video frame set is added to the segment, that frame is deleted from the frame set, and the segment information amount is recomputed, until the motion information amount of the segment exceeds the average information amount M̄ for the first time. The division of segment i is then finished, the (i+1)-th division segment is computed, and the greedy procedure is repeated until the video frame set is empty, completing the division of the whole video.
The step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, constructed as follows: the spatial feature network adopts a BN-Inception structure; the temporal feature network, also based on the BN-Inception structure, sums the convolution kernel parameters of the first convolution layer along the channel dimension, divides the result by the number of target channels, and replicates and stacks it along the channel dimension as the parameters of the new conv1 layer; the input of this network is fixed at 10 channels, obtained by stacking 5 optical flow frames in the x and y directions;
the training of the two-stream convolutional neural network is as follows: for each training video, a single video frame and a frame sequence are randomly selected from each divided segment as the inputs of the two-stream convolutional neural network; the spatio-temporal features, i.e., the convolutional layer outputs, obtained from the segments are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back-propagation is performed; all training videos are processed in this way to obtain the final parameters of the two-stream convolutional neural network.
The step (4) specifically comprises:
(4a) extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of that segment;
(4b) solving for the reconstruction coefficients of the high-level semantic matrix X by iterating closed-form solutions, obtaining the reconstruction coefficient matrix W of the video segment;
(4c) summing each row of the obtained reconstruction coefficient matrix W, the i-th row of W reflecting the importance of the i-th frame in representing the video segment; sorting the rows by their sums from large to small, selecting the frame with the largest row sum as the spatial feature of the video segment, and selecting the frames with the five largest row sums as the temporal feature of the video segment;
(4d) repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
The step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the resulting classification scores, obtaining the prediction through an argmax function, and completing the behavior recognition.
The step (4b) specifically comprises the following steps:

The coefficient reconstruction formula is constructed as

min_W ||X - XW||_{2,1} + γ||W||_{2,1}   s.t. 1^T W = 1^T    (2)

where X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the objective is the sum of two L_{2,1} matrix norms. To optimize this formula for W, relaxation constraints W = C and X - XW = E are introduced, and the formula is converted into the following ALM objective:

L(W, C, E, Λ_1, Λ_2, Λ_3) = ||E||_{2,1} + γ||C||_{2,1} + ⟨Λ_1, W - C⟩ + ⟨Λ_2, 1^T W - 1^T⟩ + ⟨Λ_3, X - XW - E⟩ + (μ/2)(||W - C||_F^2 + ||1^T W - 1^T||^2 + ||X - XW - E||_F^2)    (3)

where Λ_1, Λ_2 and Λ_3 are Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable and parameter is taken and set equal to zero to obtain closed-form solutions. The solution for W is obtained first:

W = (2X^T X + μ(I + 11^T))^{-1} (2U^T X + μ(P + 1Q))    (4)

where U, P and Q are auxiliary variables that collect the terms of formula (3) not depending on W. The closed-form solutions for C and E (formulas (5) and (6)) are then computed in the same way. Finally, the Lagrange multipliers Λ_1, Λ_2 and Λ_3 are updated as follows:

Λ_1 = Λ_1 + μ(W - C)    (7)

Λ_2 = Λ_2 + μ(1^T W - 1^T)    (8)

Λ_3 = Λ_3 + μ(X - XW - E)    (9)

The parameters are initialized as W = C = 0, Λ_1 = Λ_2 = Λ_3 = 0, μ = 10^{-6}, ρ = 1.1 and max_μ = 10^{10}, and the closed-form updates are iterated, with μ updated as μ = min(ρμ, max_μ). A convergence threshold of 10^{-8} is set; when ||W - C||, ||1^T W - 1^T|| and ||X - XW - E|| all fall below this threshold, the parameters are stable and the reconstruction coefficient matrix W has been solved.
According to the above technical scheme, the beneficial effects of the invention are as follows. First, by identifying video key frames, the method improves recognition accuracy by improving the quality of the frames to be examined. Second, the two-stream convolutional neural network makes much fuller use of the temporal information in the video and effectively improves the accuracy of behavior recognition. Third, the method computes dense optical flow for the motion subject and minimizes the segment variance, so that the divided video segments carry a uniform amount of information; the segment division can be completed with a greedy algorithm using a simple conditional loop, which is simple and accurate. Finally, for the optical-flow temporal feature, the invention uses the 5 frames with the largest contribution selected by the reconstruction matrix, rather than the traditional consecutive frames, as the optical-flow key frames.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow diagram of video feature extraction.
Detailed Description
As shown in FIG. 1, a human behavior recognition method based on video key frames includes the following steps:
(1) Acquire a classified video set: download the UCF101 video set, which contains 13,320 videos covering 101 action classes, as the data set for action recognition; split1, the most commonly used training/test split of this data set, is taken as the experimental split. From each action class, 25 videos numbered 1 are selected as training videos and 5 videos numbered 2 are selected as test videos. UCF101 is a large data set of realistic human action videos with great diversity in how the actions were captured, including camera motion, appearance variation, pose variation, object scale variation, background variation, illumination variation and the like; the test videos come from split1;
(2) Divide the video into clips based on information content: detect the behavior subject of the video, compute the motion information amount M of the whole video, divide the video into N segments while minimizing the segment variance D, and carry out the division into N segments with a greedy algorithm;
(3) Construct a two-stream convolutional neural network and train it with random sampling;
(4) Extract the spatio-temporal features of the test video with a coefficient reconstruction matrix method;
(5) Input the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
The step (2) specifically comprises the following steps. The motion information of a behavior is not uniform over time, the behavior subject carries most of the information in the video, and the motion state of the subject is expressed by dense optical flow. Since the total information amount of the video is fixed, the average segment information amount under ideal conditions can be obtained; to divide the video into N segments according to information content, the difference between the information amount of each divided segment and the average information amount must be reduced, i.e., the segment variance D is minimized;
Calculate the motion information amount M of the whole video, namely

M = Σ_{x,y,c} |flow_2(x, y, c)|    (1)

where flow_2(x, y, c) denotes the corresponding two-channel optical flow image within the segment and c denotes the optical flow channel;

divide the video into N segments, so that the average segment information amount is M̄ = M/N, and minimize the segment variance

D = (1/N) Σ_{i=1}^{N} (M_i - M̄)^2

where M_i is the motion information amount of the i-th segment of the video and M̄ is the average segment information amount;
An approximate solution minimizing the segment variance is obtained with a greedy algorithm: both the training videos (No. 1) and the test videos (No. 2) are divided into segments. The video is first sampled into frames to obtain a video frame set, and N division segments are initialized with empty content. The motion information amount M_i of the i-th division segment is computed; while M_i < M̄, the first frame of the video frame set is added to the segment, that frame is deleted from the frame set, and the segment information amount is recomputed, until the motion information amount of the segment exceeds the average information amount M̄ for the first time. The division of segment i is then finished, the (i+1)-th division segment is computed, and the greedy procedure is repeated until the video frame set is empty, completing the division of the whole video, as in the sketch below.
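The following is a minimal Python sketch of this greedy division, assuming the per-frame motion information amounts have already been computed from the dense optical flow; the variable name frame_info and the example values are illustrative assumptions only.

```python
import numpy as np

def greedy_segments(frame_info: list, n_segments: int) -> list:
    """Greedily divide frames into n_segments so each segment's motion information
    amount approaches the average M/N, as in the conditional-loop division above."""
    total = float(sum(frame_info))                 # M: motion information of the whole video
    avg = total / n_segments                       # average segment information amount M/N
    segments = [[] for _ in range(n_segments)]     # initialize N empty division segments
    remaining = list(range(len(frame_info)))       # the video frame set (frame indices)

    for i in range(n_segments):
        seg_info = 0.0
        # Move the first frame of the frame set into segment i until the segment's
        # information amount first exceeds the average.
        while remaining and seg_info < avg:
            idx = remaining.pop(0)
            segments[i].append(idx)
            seg_info += frame_info[idx]

    segments[-1].extend(remaining)                 # any leftover frames join the last segment
    return segments

# Example: illustrative per-frame motion information of a 30-frame video.
frame_info = np.abs(np.random.default_rng(0).normal(size=30)).tolist()
segs = greedy_segments(frame_info, n_segments=3)
```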
The two-stream convolutional neural network understands video information by imitating the human visual process: on top of processing the spatial information of the environment in the video images, it also understands the temporal information in the video frame sequence, and to better exploit this information the behavior classification task is split into two different parts. The two-stream convolutional neural network constructed by the invention is divided into a spatial feature network, whose input is a single RGB frame of fixed size, and a temporal feature network, whose input is a stack of 5 optical flow frames. The two-stream convolutional neural network is trained on the labeled UCF101 training set.
The step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, constructed as follows:
constructing a spatial feature network: the spatial feature network adopts a BN-inclusion structure, namely an image network, and the network selects the BN-inclusion structure which has good performance on an image classification task. The network uses two convolutions of 3x3 to replace a large convolution of 5x5, the number of parameters is reduced, overfitting is reduced, a BN layer is introduced, the BN method is an effective regularization method, the training speed of the large-scale convolutional network can be accelerated by many times, and meanwhile, the classification accuracy after convergence can be greatly improved.
Constructing the temporal feature network: the temporal feature network is an optical flow network and adopts a modified BN-Inception structure. Starting from the BN-Inception structure, the convolution kernel parameters of the first convolution layer are summed along the channel dimension, the result is divided by the number of target channels, and this averaged kernel is replicated and stacked along the channel dimension to serve as the parameters of the new conv1 layer; the input of the network is fixed at 10 channels, obtained by stacking 5 optical flow frames along the x and y directions, yielding the temporal feature network, as sketched below;
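A minimal sketch of this conv1 adaptation, assuming a PyTorch-style backbone whose first layer is an ordinary nn.Conv2d; the helper name adapt_conv1_for_flow and the stand-in conv1 layer are assumptions, not the implementation actually used by the invention.

```python
import torch
import torch.nn as nn

def adapt_conv1_for_flow(conv1_rgb: nn.Conv2d, target_channels: int = 10) -> nn.Conv2d:
    """Build a new first conv layer for stacked optical-flow input: the RGB conv1
    kernels are summed over input channels, divided by the number of target
    channels, and replicated along the channel axis, as described above."""
    w = conv1_rgb.weight.data                                 # (out_ch, 3, kH, kW)
    w_mean = w.sum(dim=1, keepdim=True) / target_channels     # (out_ch, 1, kH, kW)
    w_flow = w_mean.repeat(1, target_channels, 1, 1)          # (out_ch, 10, kH, kW)

    conv1_flow = nn.Conv2d(
        in_channels=target_channels,
        out_channels=conv1_rgb.out_channels,
        kernel_size=conv1_rgb.kernel_size,
        stride=conv1_rgb.stride,
        padding=conv1_rgb.padding,
        bias=conv1_rgb.bias is not None,
    )
    conv1_flow.weight.data.copy_(w_flow)
    if conv1_rgb.bias is not None:
        conv1_flow.bias.data.copy_(conv1_rgb.bias.data)
    return conv1_flow

# Example: a stand-in conv1 with a typical 7x7/stride-2 shape; the real layer
# would come from the pretrained spatial (BN-Inception) network.
conv1_rgb = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv1_flow = adapt_conv1_for_flow(conv1_rgb, target_channels=10)
flow_stack = torch.randn(1, 10, 224, 224)   # 5 flow frames x (x, y) channels
out = conv1_flow(flow_stack)
```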
The training of the two-stream convolutional neural network is as follows: for each training video, a single video frame and a frame sequence are randomly selected from each divided segment as the inputs of the two-stream convolutional neural network; the spatio-temporal features, i.e., the convolutional layer outputs, obtained from the segments are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back-propagation is performed; all training videos are processed in this way to obtain the final parameters of the two-stream convolutional neural network.
Each training video is divided into N segments; from each segment, one frame is randomly drawn as the input of the spatial network and 5 frames are randomly drawn as the input of the temporal feature network. The output of the last convolution layer of the corresponding network is taken as the network output; the spatio-temporal features, i.e., the convolution layer outputs, obtained from the segments are mean-pooled, and the pooled result is fed into the loss function to compute the loss for back-propagation. The loss function is the softmax cross entropy, and training uses stochastic gradient descent.
This random-segment training is repeated for each video until all training samples have been used, completing the parameter tuning of the two-stream convolutional neural network. Training with random sampling lets the two-stream convolutional neural network learn more features and gives it a higher fault tolerance. A sketch of one such training step is given below.
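A minimal sketch of one training step, assuming PyTorch models spatial_net and temporal_net that map a segment sample to class scores; the pooling here is applied to the per-segment network outputs, and the data layout, helper names and tiny stand-in networks are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()   # softmax cross entropy

def train_step(spatial_net, temporal_net, segments_rgb, segments_flow, label, optimizer):
    """One training step on a single video split into N segments.

    segments_rgb:  list of N tensors, each (1, 3, H, W)   - one random frame per segment
    segments_flow: list of N tensors, each (1, 10, H, W)  - 5 random flow frames (x/y stacked)
    label:         tensor of shape (1,) with the action class index
    """
    optimizer.zero_grad()

    # Per-segment outputs from each stream, then mean pooling over segments.
    rgb_scores = torch.stack([spatial_net(seg) for seg in segments_rgb]).mean(dim=0)
    flow_scores = torch.stack([temporal_net(seg) for seg in segments_flow]).mean(dim=0)

    # Loss on the pooled outputs of both streams, then back-propagation.
    loss = criterion(rgb_scores, label) + criterion(flow_scores, label)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with tiny stand-in networks (the real ones are BN-Inception based).
spatial_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 101))
temporal_net = nn.Sequential(nn.Flatten(), nn.Linear(10 * 224 * 224, 101))
optimizer = optim.SGD(list(spatial_net.parameters()) + list(temporal_net.parameters()),
                      lr=0.001, momentum=0.9)
segments_rgb = [torch.randn(1, 3, 224, 224) for _ in range(3)]
segments_flow = [torch.randn(1, 10, 224, 224) for _ in range(3)]
loss = train_step(spatial_net, temporal_net, segments_rgb, segments_flow,
                  torch.tensor([5]), optimizer)
```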
The invention provides a linear reconstruction framework to extract video key frames; the key idea is to represent all frame feature vectors by linear combinations of a small number of basis vectors. High-level semantic information is extracted from each frame of the video segment and combined into a video coefficient matrix. A video key frame is characterized by the fact that the content of the whole video can be represented by that frame or frame sequence; therefore, the contribution of each frame to the whole video is computed, and frames with a high contribution are taken as video key frames. The framework is made up of two parts, a linear reconstruction function and a regularizer, and is solved through a structured sparse representation based on the L_{2,1} norm. As shown in FIG. 2, the step (4) specifically includes:
(4a) extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of that segment;
Video clips are pre-sampled to reduce the computational load of the subsequent algorithms and to turn the video into a sequence of still frames. Suppose the i-th segment v_i of a video v contains j frames in total; each frame is encoded with the PHOG descriptor to obtain a 680-dimensional feature vector. The feature vectors of all frames are assembled into the high-level semantic matrix X of the segment, where X has size j x 680 and the i-th row of the matrix represents the feature of the i-th frame of the video segment v_i, as in the sketch below.
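A minimal sketch of assembling X, in which phog_680 is a hypothetical placeholder for whichever 680-dimensional PHOG implementation is actually used.

```python
import numpy as np

def phog_680(frame: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder: encode one frame as a 680-dimensional PHOG vector.
    A real implementation would compute a pyramid of oriented-gradient histograms."""
    return np.zeros(680)

def semantic_matrix(segment_frames: list) -> np.ndarray:
    """Stack per-frame PHOG features into the j x 680 high-level semantic matrix X."""
    return np.stack([phog_680(f) for f in segment_frames], axis=0)

# Example: a segment of 12 dummy grayscale frames.
frames = [np.zeros((240, 320), dtype=np.uint8) for _ in range(12)]
X = semantic_matrix(frames)   # shape (12, 680)
```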
(4b) solving for the reconstruction coefficients of the high-level semantic matrix X by iterating closed-form solutions, obtaining the reconstruction coefficient matrix W of the video segment;
(4c) summing each row of the obtained reconstruction coefficient matrix W, the i-th row of W reflecting the importance of the i-th frame in representing the video segment; sorting the rows by their sums from large to small, selecting the frame with the largest row sum as the spatial feature of the video segment, and selecting the frames with the five largest row sums as the temporal feature of the video segment;
(4d) repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
The step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the resulting classification scores, obtaining the prediction through an argmax function, and completing the behavior recognition.
The step (4b) specifically comprises the following steps:

The coefficient reconstruction formula is constructed as

min_W ||X - XW||_{2,1} + γ||W||_{2,1}   s.t. 1^T W = 1^T    (2)

where X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the objective is the sum of two L_{2,1} matrix norms. To optimize this formula for W, relaxation constraints W = C and X - XW = E are introduced, and the formula is converted into the following ALM objective:

L(W, C, E, Λ_1, Λ_2, Λ_3) = ||E||_{2,1} + γ||C||_{2,1} + ⟨Λ_1, W - C⟩ + ⟨Λ_2, 1^T W - 1^T⟩ + ⟨Λ_3, X - XW - E⟩ + (μ/2)(||W - C||_F^2 + ||1^T W - 1^T||^2 + ||X - XW - E||_F^2)    (3)

where Λ_1, Λ_2 and Λ_3 are Lagrange multipliers and μ > 0 is a penalty parameter. The partial derivative of formula (3) with respect to each variable and parameter is taken and set equal to zero to obtain closed-form solutions. The solution for W is obtained first:

W = (2X^T X + μ(I + 11^T))^{-1} (2U^T X + μ(P + 1Q))    (4)

where U, P and Q are auxiliary variables that collect the terms of formula (3) not depending on W. The closed-form solutions for C and E (formulas (5) and (6)) are then computed in the same way. Finally, the Lagrange multipliers Λ_1, Λ_2 and Λ_3 are updated as follows:

Λ_1 = Λ_1 + μ(W - C)    (7)

Λ_2 = Λ_2 + μ(1^T W - 1^T)    (8)

Λ_3 = Λ_3 + μ(X - XW - E)    (9)

The parameters are initialized as W = C = 0, Λ_1 = Λ_2 = Λ_3 = 0, μ = 10^{-6}, ρ = 1.1 and max_μ = 10^{10}, and the closed-form updates are iterated, with μ updated as μ = min(ρμ, max_μ). A convergence threshold of 10^{-8} is set; when ||W - C||, ||1^T W - 1^T|| and ||X - XW - E|| all fall below this threshold, the parameters are stable and the reconstruction coefficient matrix W has been solved.
The sum of the i-th row of W represents the importance of the i-th frame to the whole video; W is a sparse matrix, so each row of W is summed and the rows are sorted from large to small. The frame k1 with the largest row sum is extracted as the image key frame of the segment, and the five frames k1 to k5 with the largest row sums are extracted as the optical-flow key frames, as in the sketch below.
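A minimal numpy sketch of this selection, assuming the reconstruction coefficient matrix W (of size j x j) has already been solved for the segment; the example matrix is illustrative only.

```python
import numpy as np

def select_key_frames(W: np.ndarray, n_flow: int = 5):
    """Rank frames by the row sums of the reconstruction coefficient matrix W.

    Returns the index of the image key frame (largest row sum) and the indices
    of the n_flow optical-flow key frames (largest row sums, in descending order)."""
    importance = W.sum(axis=1)            # contribution of each frame to the segment
    order = np.argsort(importance)[::-1]  # sort from large to small
    return int(order[0]), order[:n_flow]

# Example with a random sparse-ish coefficient matrix for a 20-frame segment.
rng = np.random.default_rng(0)
W = rng.random((20, 20)) * (rng.random((20, 20)) > 0.8)
k1, flow_keys = select_key_frames(W)
```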
Steps (4a) to (4c) are repeated for each segment of the test video, and the spatio-temporal features of every video segment are solved and numbered.
The obtained spatio-temporal feature pairs are input into the trained two-stream convolutional neural network, the resulting classification scores are mean-pooled, and the prediction is finally obtained through an argmax function, completing the visual human behavior recognition; a sketch of this fusion step follows. If there are still videos to be tested, the method returns to step (4).
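A minimal sketch of this prediction fusion, assuming per-segment class-score arrays from the two trained streams; the shapes and variable names are illustrative assumptions.

```python
import numpy as np

def predict(rgb_scores: np.ndarray, flow_scores: np.ndarray) -> int:
    """Fuse per-segment class scores of both streams and pick the predicted class.

    rgb_scores:  (N_segments, n_classes) spatial-stream scores for the image key frames
    flow_scores: (N_segments, n_classes) temporal-stream scores for the flow key frames
    """
    pooled = np.concatenate([rgb_scores, flow_scores], axis=0).mean(axis=0)  # mean pooling
    return int(np.argmax(pooled))                                            # argmax prediction

# Example: a 3-segment video over the 101 UCF101 classes.
rng = np.random.default_rng(1)
pred = predict(rng.random((3, 101)), rng.random((3, 101)))
```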
In summary, among the many ways of extracting video features, the invention extracts video key frames as the features; it improves recognition accuracy by improving the quality of the frames to be examined; the two-stream convolutional neural network makes much fuller use of the temporal information in the video and effectively improves the accuracy of behavior recognition; the greedy algorithm completes the video segment division with a simple conditional loop, which is simple and accurate; and for the optical-flow temporal feature, the invention uses the 5 frames with the largest contribution selected by the reconstruction matrix, rather than the traditional consecutive frames, as the optical-flow key frames.

Claims (6)

1. A human behavior recognition method based on video key frames, characterized in that the method comprises the following steps in sequence:
(1) acquiring a classified video set: downloading the UCF101 video set, which contains 13,320 videos covering 101 action classes, as the data set for action recognition, with split1 as the experimental split; from each action class, selecting 25 videos numbered 1 as training videos and 5 videos numbered 2 as test videos;
(2) dividing the video into clips based on information content: detecting the behavior subject of the video, computing the motion information amount M of the whole video, dividing the video into N segments while minimizing the segment variance D, and carrying out the division into N segments with a greedy algorithm;
(3) constructing a two-stream convolutional neural network and training it with random sampling;
(4) extracting the spatio-temporal features of the test video with a coefficient reconstruction matrix method;
(5) inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network to obtain the behavior recognition result.
2. The human behavior recognition method based on video key frames according to claim 1, wherein the step (2) specifically comprises the following steps: calculating the motion information amount M of the whole video, namely

M = Σ_{x,y,c} |flow_2(x, y, c)|    (1)

where flow_2(x, y, c) denotes the corresponding two-channel optical flow image within the segment and c denotes the optical flow channel;

dividing the video into N segments, so that the average segment information amount is M̄ = M/N, and minimizing the segment variance

D = (1/N) Σ_{i=1}^{N} (M_i - M̄)^2

where M_i is the motion information amount of the i-th segment of the video and M̄ is the average segment information amount;

obtaining an approximate solution that minimizes the segment variance with a greedy algorithm: dividing both the training videos (No. 1) and the test videos (No. 2) into segments; first sampling the video into frames to obtain a video frame set and initializing N division segments with empty content; computing the motion information amount M_i of the i-th division segment, and while M_i < M̄, adding the first frame of the video frame set to the segment, deleting that frame from the frame set, and recomputing the segment information amount, until the motion information amount of the segment exceeds the average information amount M̄ for the first time; then finishing the division of segment i, computing the (i+1)-th division segment, and repeating the greedy procedure until the video frame set is empty, completing the division of the whole video.
3. The human behavior recognition method based on video key frames according to claim 1, wherein the step (3) specifically comprises the following steps: the two-stream convolutional neural network comprises a spatial feature network and a temporal feature network, constructed as follows: the spatial feature network adopts a BN-Inception structure; the temporal feature network, based on the BN-Inception structure, sums the convolution kernel parameters of the first convolution layer along the channel dimension, divides the result by the number of target channels, and replicates and stacks it along the channel dimension as the parameters of the new conv1 layer; the input of this network is fixed at 10 channels, obtained by stacking 5 optical flow frames in the x and y directions;
the training of the two-stream convolutional neural network is as follows: for each training video, a single video frame and a frame sequence are randomly selected from each divided segment as the inputs of the two-stream convolutional neural network; the spatio-temporal features, i.e., the convolutional layer outputs, obtained from the segments are mean-pooled, the pooled result is fed into the loss function to compute the loss, and back-propagation is performed; all training videos are processed in this way to obtain the final parameters of the two-stream convolutional neural network.
4. The human behavior recognition method based on video key frames according to claim 1, wherein the step (4) specifically comprises:
(4a) extracting a 680-dimensional PHOG feature from each frame of a divided segment of the test video, and combining all frame features into the high-level semantic matrix X of that segment;
(4b) solving for the reconstruction coefficients of the high-level semantic matrix X by iterating closed-form solutions, obtaining the reconstruction coefficient matrix W of the video segment;
(4c) summing each row of the obtained reconstruction coefficient matrix W, the i-th row of W reflecting the importance of the i-th frame in representing the video segment; sorting the rows by their sums from large to small, selecting the frame with the largest row sum as the spatial feature of the video segment, and selecting the frames with the five largest row sums as the temporal feature of the video segment;
(4d) repeating steps (4a) to (4c) for each segment of the test video to obtain the corresponding spatio-temporal features.
5. The human behavior recognition method based on video key frames according to claim 1, wherein the step (5) specifically comprises the following steps: inputting the obtained spatio-temporal features into the trained two-stream convolutional neural network, mean-pooling the resulting classification scores, obtaining the prediction through an argmax function, and completing the behavior recognition.
6. The human behavior recognition method based on video key frames according to claim 4, wherein the step (4b) specifically comprises the following steps:

the coefficient reconstruction formula is constructed as

min_W ||X - XW||_{2,1} + γ||W||_{2,1}   s.t. 1^T W = 1^T    (2)

where X is the high-level semantic matrix, W denotes the reconstruction coefficient matrix, and γ is a sparsity control parameter; the objective is the sum of two L_{2,1} matrix norms; to optimize this formula for W, relaxation constraints W = C and X - XW = E are introduced, and the formula is converted into the following ALM objective:

L(W, C, E, Λ_1, Λ_2, Λ_3) = ||E||_{2,1} + γ||C||_{2,1} + ⟨Λ_1, W - C⟩ + ⟨Λ_2, 1^T W - 1^T⟩ + ⟨Λ_3, X - XW - E⟩ + (μ/2)(||W - C||_F^2 + ||1^T W - 1^T||^2 + ||X - XW - E||_F^2)    (3)

where Λ_1, Λ_2 and Λ_3 are Lagrange multipliers and μ > 0 is a penalty parameter; the partial derivative of formula (3) with respect to each variable and parameter is taken and set equal to zero to obtain closed-form solutions, and the solution for W is obtained first:

W = (2X^T X + μ(I + 11^T))^{-1} (2U^T X + μ(P + 1Q))    (4)

where U, P and Q are auxiliary variables that collect the terms of formula (3) not depending on W; the closed-form solutions for C and E (formulas (5) and (6)) are then computed in the same way; finally, the Lagrange multipliers Λ_1, Λ_2 and Λ_3 are updated as follows:

Λ_1 = Λ_1 + μ(W - C)    (7)

Λ_2 = Λ_2 + μ(1^T W - 1^T)    (8)

Λ_3 = Λ_3 + μ(X - XW - E)    (9)

the parameters are initialized as W = C = 0, Λ_1 = Λ_2 = Λ_3 = 0, μ = 10^{-6}, ρ = 1.1 and max_μ = 10^{10}, and the closed-form updates are iterated, with μ updated as μ = min(ρμ, max_μ); a convergence threshold of 10^{-8} is set, and when ||W - C||, ||1^T W - 1^T|| and ||X - XW - E|| all fall below this threshold, the parameters are stable and the reconstruction coefficient matrix W has been solved.
CN202010482943.8A 2020-06-01 2020-06-01 Human behavior identification method based on video key frame Active CN111626245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010482943.8A CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010482943.8A CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Publications (2)

Publication Number Publication Date
CN111626245A true CN111626245A (en) 2020-09-04
CN111626245B CN111626245B (en) 2023-04-07

Family

ID=72271841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010482943.8A Active CN111626245B (en) 2020-06-01 2020-06-01 Human behavior identification method based on video key frame

Country Status (1)

Country Link
CN (1) CN111626245B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160173736A1 (en) * 2014-12-11 2016-06-16 Mitsubishi Electric Research Laboratories, Inc. Method and System for Reconstructing Sampled Signals
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
田曼; 张艺: "Research on action recognition based on multi-model fusion" (多模型融合动作识别研究) *
贾迪; 朱宁丹; 杨宁华; 吴思; 李玉秀; 赵明远: "A survey of image matching methods" (图像匹配方法研究综述) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016506A (en) * 2020-09-07 2020-12-01 重庆邮电大学 Classroom attitude detection model parameter training method capable of rapidly adapting to new scene
CN112016506B (en) * 2020-09-07 2022-10-11 重庆邮电大学 Classroom attitude detection model parameter training method capable of quickly adapting to new scene
CN112329738A (en) * 2020-12-01 2021-02-05 厦门大学 Long video motion recognition method based on significant segment sampling
CN112528823A (en) * 2020-12-04 2021-03-19 燕山大学 Striped shark movement behavior analysis method and system based on key frame detection and semantic component segmentation
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network
CN112733695A (en) * 2021-01-04 2021-04-30 电子科技大学 Unsupervised key frame selection method in pedestrian re-identification field
CN112733695B (en) * 2021-01-04 2023-04-25 电子科技大学 Unsupervised keyframe selection method in pedestrian re-identification field
CN113469142B (en) * 2021-03-12 2022-01-14 山西长河科技股份有限公司 Classification method, device and terminal for monitoring video time-space information fusion
CN113469142A (en) * 2021-03-12 2021-10-01 山西长河科技股份有限公司 Classification method, device and terminal for monitoring video time-space information fusion
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113239869B (en) * 2021-05-31 2023-08-11 西安电子科技大学 Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN113642499A (en) * 2021-08-23 2021-11-12 中国人民解放军火箭军工程大学 Human behavior recognition method based on computer vision
CN113642499B (en) * 2021-08-23 2024-05-24 中国人民解放军火箭军工程大学 Human body behavior recognition method based on computer vision
CN114550047A (en) * 2022-02-22 2022-05-27 西安交通大学 Behavior rate guided video behavior identification method
CN114550047B (en) * 2022-02-22 2024-04-05 西安交通大学 Behavior rate guided video behavior recognition method
CN115393660A (en) * 2022-10-28 2022-11-25 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism

Also Published As

Publication number Publication date
CN111626245B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111626245B (en) Human behavior identification method based on video key frame
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN111526434B (en) Converter-based video abstraction method
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN113239869B (en) Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN111738363A (en) Alzheimer disease classification method based on improved 3D CNN network
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
Song et al. A new recurrent plug-and-play prior based on the multiple self-similarity network
CN110930378A (en) Emphysema image processing method and system based on low data demand
Cai et al. Video based emotion recognition using CNN and BRNN
CN115131558B (en) Semantic segmentation method in environment with few samples
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
Tan et al. DC programming for solving a sparse modeling problem of video key frame extraction
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN111898614A (en) Neural network system, image signal and data processing method
Sun et al. Video snapshot compressive imaging using residual ensemble network
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
Liu et al. Combined CNN/RNN video privacy protection evaluation method for monitoring home scene violence
CN112347965A (en) Video relation detection method and system based on space-time diagram
CN111401209A (en) Action recognition method based on deep learning
CN113963421B (en) Dynamic sequence unconstrained expression recognition method based on hybrid feature enhanced network
CN109800719B (en) Low-resolution face recognition method based on sparse representation of partial component and compression dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant