CN112949544A - Action time sequence detection method based on 3D convolutional network

Info

Publication number
CN112949544A
CN112949544A
Authority
CN
China
Prior art keywords
action
network
video
time
prediction
Prior art date
Legal status
Pending
Application number
CN202110285908.1A
Other languages
Chinese (zh)
Inventor
马世伟 (Ma Shiwei)
刘燕燕 (Liu Yanyan)
刘望 (Liu Wang)
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202110285908.1A
Publication of CN112949544A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The invention relates to an action time sequence detection method based on a 3D convolutional network. Key frames in which the action changes significantly are extracted by K-means clustering; action features are extracted with a 3D convolutional network; a 3D convolution-deconvolution network is fused with a spatio-temporal feature pyramid structure to produce multi-scale, frame-level action predictions; and the prediction results are fused by Kalman filtering to predict the action time sequence. The method predicts, at the frame level, actions occurring at any position and lasting any duration, achieving real-time operation. K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves. The multi-scale fusion of the 3D convolution-deconvolution network and the spatio-temporal feature pyramid network overcomes the low prediction accuracy of a single scale: the prediction results preserve both the action as a whole and its details, and detection accuracy is significantly improved.

Description

Action time sequence detection method based on 3D convolutional network
Technical Field
The invention relates to the technical field of human action feature extraction and classification prediction in video, and in particular to an action time sequence detection method based on a 3D convolutional network.
Background
With the rapid development of visual sensor acquisition and computer image processing, a computer can acquire image and video information through a visual sensor and understand the human actions in it by analyzing the content with image processing, pattern recognition, machine learning, and other artificial intelligence techniques. Analyzing and understanding action behavior in large-scale video data requires an effective human action time sequence detection technique. Action time sequence detection is a video processing method that locates multiple action segments in an original video and predicts the start time, end time, and category of each action. As a technique by which a computer automatically detects, classifies, and recognizes human actions in video, it must process two-dimensional image information and three-dimensional spatio-temporal information simultaneously, and it has important application value in security monitoring, intelligent surveillance, medical care, video retrieval, human-computer interaction, intelligent robotics, and other fields.
Action time sequence detection comprises two stages: action feature extraction and action time sequence proposal. Existing methods depend heavily on the ability to understand and recognize actions, and their time sequence proposal stages struggle to locate the target action region because video data are structurally complex and target actions vary widely in duration. Two problems therefore remain to be solved: effective extraction of action features from large-scale video data, and high-precision time sequence detection that meets frame-level boundary requirements.
Disclosure of Invention
The invention provides an action time sequence detection method based on a 3D convolutional network, which extracts features from, and classifies, recognizes, and predicts, human actions in video. The method provides a foundation for technologies such as security monitoring, intelligent surveillance, human-computer interaction, and intelligent robotics.
To achieve the purpose of the invention, the following inventive concept is adopted:
To detect the timing of actions of arbitrary, unrestricted duration in video and to determine the action category, a key-frame-based action extraction method is designed, and a 3D convolutional network is combined with a spatio-temporal feature pyramid structure for multi-scale fusion, producing predictions of the whole action and of its details.
First, key frames in which the action changes significantly are extracted by K-means clustering, action features are extracted with a 3D convolutional network, and the 3D convolution-deconvolution network is fused with the spatio-temporal feature pyramid structure to obtain multi-scale frame-level action predictions; the prediction results are then fused by Kalman filtering to predict the action time sequence.
According to the inventive concept, the invention adopts the following technical scheme:
a motion time sequence detection method based on a 3D convolutional network is characterized in that: extracting key frames with obviously changed actions through K-means clustering, and extracting action characteristics by utilizing a 3D (three-dimensional) convolutional network;
then fusing the 3D convolution deconvolution network with the space-time characteristic pyramid structure to perform multi-scale action frame level prediction;
and finally, fusing the prediction results by Kalman filtering, and predicting the action time sequence to generate a proposal.
Preferably, the action feature extraction comprises the following steps:
1) dividing the video clips into training videos and test videos, used as input in the training stage and the test stage respectively;
2) clustering similar action frames in the video by K-means and selecting one video frame from each cluster as a key frame (a minimal sketch of this step follows the list);
3) inputting the resulting action key frame sequence into a 3D convolutional network and extracting spatio-temporal action features.
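As an illustration of step 2), the following is a minimal sketch of key frame selection. It assumes the frames of a clip arrive as a numpy array of shape (num_frames, H, W, C) and takes, from each cluster, the frame closest to the cluster centroid; the cluster count k and the use of raw pixels as the clustering feature are illustrative assumptions, not values fixed by the invention.

```python
# Minimal sketch of K-means key frame extraction; k and the raw-pixel
# clustering feature are assumptions made for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def extract_key_frames(frames: np.ndarray, k: int = 16) -> np.ndarray:
    """Cluster similar frames and keep one representative frame per cluster."""
    n = frames.shape[0]
    k = min(k, n)                                      # guard short clips
    feats = frames.reshape(n, -1).astype(np.float32)   # one vector per frame
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    key_indices = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # The member nearest the centroid stands in for the whole cluster.
        d = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        key_indices.append(members[np.argmin(d)])
    # Sort so the key frame sequence keeps the original temporal order.
    return frames[np.sort(np.asarray(key_indices))]
```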
Preferably, the action time sequence proposal comprises the following steps:
1) inputting the feature data obtained by action feature extraction into a 3D convolution-deconvolution network, and restoring the features to the original input length by temporal upsampling so that frame-level prediction is possible;
2) using the multi-scale property of the spatio-temporal pyramid to output action predictions of different scales independently from intermediate stages of the 3D convolution-deconvolution network, realizing prediction of the action as a whole;
3) applying Kalman filtering to the features obtained from each sliding window to filter over time, improving the continuity of predicted actions across adjacent windows and generating the time sequence detection action proposal.
Compared with the prior art, the invention has the following prominent substantive features and remarkable advantages:
1. The 3D convolutional network-based action time sequence detection method predicts, at the frame level, actions occurring at any position and lasting any duration, achieving real-time operation;
2. K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves;
3. The method fuses the frame-level prediction of the 3D convolution-deconvolution network with the multi-scale property of the spatio-temporal pyramid network, combining the frame-level prediction results with the whole-action prediction results, so the temporal position of an action is detected accurately and detection accuracy improves significantly over single-scale prediction.
Drawings
FIG. 1 is a block diagram of the 3D convolutional network-based action time sequence detection method of the invention.
FIG. 2 is a schematic diagram of key frame extraction in the method of the invention.
FIG. 3 is a schematic diagram of action feature extraction in the method of the invention.
FIG. 4 is a schematic diagram of multi-scale frame-level action prediction in the method of the invention.
FIG. 5 is a schematic diagram of time sequence action detection proposal generation in the method of the invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Example one
In this embodiment, referring to FIG. 1, an action time sequence detection method based on a 3D convolutional network extracts key frames in which the action changes significantly by K-means clustering, and extracts action features with the 3D convolutional network;
the 3D convolution-deconvolution network is then fused with the spatio-temporal feature pyramid structure to perform multi-scale frame-level action prediction;
finally, the prediction results are fused by Kalman filtering, and the action time sequence is predicted to generate a proposal.
The method extracts features from, and classifies, recognizes, and predicts, human actions in video, and can support security monitoring, intelligent surveillance, and human-computer interaction.
Example two
This embodiment is substantially the same as the first embodiment, with the following features:
In this embodiment, the action feature extraction comprises the following steps:
1) dividing the video clips into training videos and test videos, used as input in the training stage and the test stage respectively;
2) clustering similar action frames in the video by K-means and selecting one video frame from each cluster as a key frame;
3) inputting the resulting action key frame sequence into a 3D convolutional network and extracting spatio-temporal action features.
In this embodiment, the action time sequence proposal comprises the following steps:
1) inputting the feature data obtained by action feature extraction into a 3D convolution-deconvolution network, and restoring the features to the original input length by temporal upsampling so that frame-level prediction is possible;
2) using the multi-scale property of the spatio-temporal pyramid to output action predictions of different scales independently from intermediate stages of the 3D convolution-deconvolution network, realizing prediction of the action as a whole;
3) applying Kalman filtering to the features obtained from each sliding window to filter over time, improving the continuity of predicted actions across adjacent windows and generating the time sequence detection action proposal.
In this embodiment, key frames in which the action changes significantly are extracted by K-means clustering, action features are extracted with a 3D convolutional network, the 3D convolution-deconvolution network is fused with the spatio-temporal feature pyramid structure for multi-scale frame-level action prediction, and the prediction results are fused by Kalman filtering to predict the action time sequence. The method predicts, at the frame level, actions occurring at any position and lasting any duration, achieving real-time operation; K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves; and the multi-scale fusion of the 3D convolution-deconvolution network and the spatio-temporal feature pyramid network overcomes the low prediction accuracy of a single scale, so the prediction results preserve both the action as a whole and its details and detection accuracy is significantly improved.
Example three
This embodiment is substantially the same as the embodiments above, with the following features:
In this embodiment, as shown in FIG. 1, an action time sequence detection method based on a 3D convolutional network comprises the following steps:
step 1: a video clip is generated for the input video sliding window. The length of a real natural video in a time dimension is very long, so for detecting an action time sequence of a video with unlimited length, a sliding window with a fixed length needs to be performed on the video, so as to perform subsequent operation on each sliding window.
Step 2: extract video action key frames. After all frames of the video sequence are extracted at the video sampling rate, similar action frames are clustered by K-means and one video frame is selected from each cluster as a key frame, yielding the key frame sequence of the video. Key frame extraction removes redundancy from a lengthy video: it eliminates similar, redundant frames and adjusts the video length while preserving the completeness of the action.
Step 3: extract action features with the 3D convolutional network. The action key frame sequence is input into a 3D convolutional network to extract spatio-temporal action features. The network is initialized from a pre-trained model and fine-tuned before features are extracted. In the training stage, the loss computed at the softmax output layer is back-propagated layer by layer and the network parameters are adjusted by gradient descent, so that the 3D convolutional network adaptively learns features of the actions in the input video. In the test stage, the input action key frame sequence is passed through the network up to the fifth pooling layer to obtain the action features used in the subsequent classification and prediction tasks.
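The sketch below illustrates Step 3 with a C3D-style 3D convolutional network in PyTorch. The layer widths follow the common C3D layout and are assumptions; the description only requires a pre-trained, fine-tuned 3D network whose fifth pooling layer supplies the features. In training, `logits` would feed a softmax cross-entropy loss back-propagated by gradient descent; at test time, `feat` is kept as the spatio-temporal action feature.

```python
# Minimal C3D-style sketch of the Step 3 feature extractor; layer sizes are
# assumptions for illustration, not the invention's fixed architecture.
import torch
import torch.nn as nn

class C3DFeatures(nn.Module):
    def __init__(self, num_classes: int = 20):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=pool))        # stride defaults to pool
        self.features = nn.Sequential(
            block(3, 64, (1, 2, 2)),     # pool1: spatial pooling only
            block(64, 128, (2, 2, 2)),   # pool2
            block(128, 256, (2, 2, 2)),  # pool3
            block(256, 512, (2, 2, 2)),  # pool4
            block(512, 512, (2, 2, 2)))  # pool5: test-time features taken here
        self.classifier = nn.Linear(512, num_classes)  # softmax head (training)

    def forward(self, x):                # x: (batch, 3, frames, height, width)
        feat = self.features(x)          # fifth-pooling-layer feature map
        logits = self.classifier(feat.mean(dim=(2, 3, 4)))
        return feat, logits              # feat for detection, logits for loss
```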
Step 4: multi-scale frame-level prediction based on 3D convolution-deconvolution and the spatio-temporal feature pyramid. The feature data obtained in the preceding steps are input into a 3D convolution-deconvolution network, which downsamples in the spatial dimensions while upsampling in the time dimension, restoring the temporal resolution. On this basis, because frame prediction from a single-scale network may lose information about the action as a whole, a spatio-temporal feature pyramid structure is introduced: action predictions at different scales are output independently from intermediate stages of the 3D convolution-deconvolution network, and the multi-scale features are fused to obtain the final frame-level action prediction.
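A minimal sketch of Step 4 follows: temporal deconvolution stages restore temporal resolution, each pyramid level emits its own prediction, and the per-scale predictions are aligned to frame level and fused. The channel widths, the number of levels, and averaging as the fusion rule are illustrative assumptions.

```python
# Minimal sketch of multi-scale frame-level prediction via temporal
# deconvolution and per-scale heads; widths and fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFramePredictor(nn.Module):
    def __init__(self, cin: int = 512, num_classes: int = 21):
        super().__init__()
        # Each deconvolution stage doubles the temporal length only.
        self.up1 = nn.ConvTranspose3d(cin, 256, kernel_size=(4, 1, 1),
                                      stride=(2, 1, 1), padding=(1, 0, 0))
        self.up2 = nn.ConvTranspose3d(256, 128, kernel_size=(4, 1, 1),
                                      stride=(2, 1, 1), padding=(1, 0, 0))
        # One prediction head per pyramid level.
        self.heads = nn.ModuleList([nn.Conv3d(c, num_classes, kernel_size=1)
                                    for c in (cin, 256, 128)])

    def forward(self, feat, out_len):
        # feat: (batch, cin, T', h, w) features from the 3D convolutional net.
        levels = [feat]
        levels.append(F.relu(self.up1(levels[0])))
        levels.append(F.relu(self.up2(levels[1])))
        fused = []
        for level, head in zip(levels, self.heads):
            p = head(level).mean(dim=(3, 4))  # pool away space: (B, classes, T)
            p = F.interpolate(p, size=out_len, mode='linear',
                              align_corners=False)  # align scale to frame level
            fused.append(p)
        return torch.stack(fused).mean(dim=0)  # fused frame-level class scores
```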
Step 5: generate the time sequence action prediction proposal. In the frame-level prediction results generated in the preceding steps, an action spanning adjacent windows is split by the sliding window, which harms the completeness of the generated action proposals. Kalman filtering is therefore applied to the frame-level predictions over time: the optimal estimate of the current frame is formed by combining the state from the historical sequence with the observation of the current frame, yielding the best action proposal generation result.
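The sketch below illustrates Step 5 with a simple one-dimensional Kalman filter over a per-frame action score sequence, using a constant-state model; the process and observation noise parameters q and r are illustrative assumptions.

```python
# Minimal sketch of Kalman smoothing of per-frame action scores across
# window boundaries; q and r are illustrative noise parameters.
import numpy as np

def kalman_smooth(scores: np.ndarray, q: float = 1e-3, r: float = 1e-1) -> np.ndarray:
    """Fuse the historical state with the current observation, frame by frame."""
    x, p = float(scores[0]), 1.0        # initial state estimate and covariance
    out = np.empty_like(scores, dtype=np.float64)
    out[0] = x
    for t in range(1, len(scores)):
        p = p + q                       # predict step (state assumed constant)
        k = p / (p + r)                 # Kalman gain
        x = x + k * (scores[t] - x)     # update with the current frame's score
        p = (1.0 - k) * p
        out[t] = x
    return out
```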
As shown in FIG. 2, the video action key frame extraction of this embodiment proceeds as follows:
similar action frames of the video sequence are clustered by K-means, and one video frame is selected from each cluster as a key frame, yielding the key frame sequence of the video.
As shown in FIG. 3, the action feature extraction of this embodiment proceeds as follows:
the resulting action key frame sequence is input into a 3D convolutional network to extract spatio-temporal action features.
As shown in FIG. 4, the multi-scale frame-level action prediction of this embodiment proceeds as follows:
the action feature data are input into the 3D convolution-deconvolution network and the spatio-temporal feature pyramid structure to obtain action predictions at different scales, and the multi-scale features are fused to obtain the final frame-level action prediction.
As shown in FIG. 5, the time sequence action detection proposal generation of this embodiment proceeds as follows:
Kalman filtering is applied over time to the frame-level action predictions to form the optimal estimate of the current frame, yielding the best action proposal generation result.
This embodiment adopts the action time sequence detection method based on the 3D convolutional network, so actions occurring at any position and lasting any duration can be predicted at the frame level, achieving real-time operation. K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves. The method fuses the frame-level prediction of the 3D convolution-deconvolution network with the multi-scale property of the spatio-temporal pyramid network, combining the frame-level and whole-action prediction results, so the temporal position of an action is detected accurately and detection accuracy improves significantly over single-scale prediction.
The embodiments of the invention have been described above with reference to the accompanying drawings, but the invention is not limited to these embodiments. Various changes, modifications, substitutions, combinations, and simplifications may be made according to the purpose of the invention; as long as they meet that purpose and do not depart from the technical principle and inventive concept of the invention, they are equivalent substitutions and fall within the scope of protection of the invention.

Claims (3)

1. An action time sequence detection method based on a 3D convolutional network, characterized in that: key frames in which the action changes significantly are extracted by K-means clustering, and action features are extracted with a 3D convolutional network;
the 3D convolution-deconvolution network is then fused with the spatio-temporal feature pyramid structure to perform multi-scale frame-level action prediction;
finally, the prediction results are fused by Kalman filtering, and the action time sequence is predicted to generate a proposal.
2. The 3D convolutional network-based action time sequence detection method according to claim 1, characterized in that the action feature extraction comprises the following steps:
1) dividing the video clips into training videos and test videos, used as input in the training stage and the test stage respectively;
2) clustering similar action frames in the video by K-means and selecting one video frame from each cluster as a key frame;
3) inputting the resulting action key frame sequence into a 3D convolutional network and extracting spatio-temporal action features.
3. The 3D convolutional network-based action time sequence detection method according to claim 1, characterized in that the action time sequence proposal comprises the following steps:
1) inputting the feature data obtained by action feature extraction into a 3D convolution-deconvolution network, and restoring the features to the original input length by temporal upsampling so that frame-level prediction is possible;
2) using the multi-scale property of the spatio-temporal pyramid to output action predictions of different scales independently from intermediate stages of the 3D convolution-deconvolution network, realizing prediction of the action as a whole;
3) applying Kalman filtering to the features obtained from each sliding window to filter over time, improving the continuity of predicted actions across adjacent windows and generating the time sequence detection action proposal.
CN202110285908.1A 2021-03-17 2021-03-17 Action time sequence detection method based on 3D convolutional network Pending CN112949544A (en)

Priority Applications (1)

Application Number: CN202110285908.1A | Priority Date: 2021-03-17 | Filing Date: 2021-03-17 | Title: Action time sequence detection method based on 3D convolutional network (CN112949544A)

Applications Claiming Priority (1)

Application Number: CN202110285908.1A | Priority Date: 2021-03-17 | Filing Date: 2021-03-17 | Title: Action time sequence detection method based on 3D convolutional network (CN112949544A)

Publications (1)

Publication Number: CN112949544A | Publication Date: 2021-06-11

Family

ID=76229361

Family Applications (1)

Application Number: CN202110285908.1A | Title: Action time sequence detection method based on 3D convolutional network | Priority Date: 2021-03-17 | Filing Date: 2021-03-17 | Status: Pending (CN112949544A)

Country Status (1)

Country Link
CN (1) CN112949544A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109541583A (en) * 2018-11-15 2019-03-29 众安信息技术服务有限公司 A kind of leading vehicle distance detection method and system
CN109947986A (en) * 2019-03-18 2019-06-28 东华大学 Infrared video timing localization method based on structuring sectional convolution neural network
CN110688927A (en) * 2019-09-20 2020-01-14 湖南大学 Video action detection method based on time sequence convolution modeling
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
CN111898514A (en) * 2020-07-24 2020-11-06 燕山大学 Multi-target visual supervision method based on target detection and action recognition
CN112101243A (en) * 2020-09-17 2020-12-18 四川轻化工大学 Human body action recognition method based on key posture and DTW

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU WANG et al., "Video action recognition based on improved 3D convolutional network and sparse representation classification", Proceedings of SPIE. *
刘望 (LIU Wang) et al., "基于时空特征金字塔网络的动作时序检测方法" (Action time sequence detection method based on spatio-temporal feature pyramid network), 《系统仿真学报》 (Journal of System Simulation). *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345061A (en) * 2021-08-04 2021-09-03 成都市谛视科技有限公司 Training method and device for motion completion model, completion method and device, and medium

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20210611)