CN112949544A - Action time sequence detection method based on 3D convolutional network

Info

Publication number
CN112949544A
CN112949544A
Authority
CN
China
Prior art keywords
action
network
video
time
prediction
Prior art date
Legal status
Pending
Application number
CN202110285908.1A
Other languages
Chinese (zh)
Inventor
马世伟 (Ma Shiwei)
刘燕燕 (Liu Yanyan)
刘望 (Liu Wang)
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202110285908.1A
Publication of CN112949544A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The invention relates to an action time sequence detection method based on a 3D convolutional network. Key frames in which the action changes significantly are extracted by K-means clustering; action features are extracted with a 3D convolutional network; a 3D convolution-deconvolution network is fused with a spatio-temporal feature pyramid structure to produce multi-scale, frame-level action predictions; and the prediction results are fused by Kalman filtering to predict the action time sequence. The method predicts, at the frame level, actions occurring at any position and lasting any duration, achieving real-time operation. K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves. The multi-scale fusion of the 3D convolution-deconvolution network and the spatio-temporal feature pyramid network overcomes the low prediction accuracy of a single scale: the prediction results preserve both the action as a whole and its details, and detection accuracy is significantly improved.

Description

Action time sequence detection method based on 3D convolutional network
Technical Field
The invention relates to the technical field of human action feature extraction and classification prediction in video, and in particular to an action time sequence detection method based on a 3D convolutional network.
Background
With the rapid development of visual sensor acquisition and computer image processing, a computer can acquire image and video information through a visual sensor and understand the human actions in it by analyzing the content with image processing, pattern recognition, machine learning, and other artificial intelligence techniques. Analyzing and understanding action behavior in large-scale video data requires an effective human action time sequence detection technique. Action time sequence detection is a video processing method that locates multiple action segments in an original video and predicts the start time, end time, and category of each action. As a technique by which a computer automatically detects, classifies, and recognizes human actions in video, it must process two-dimensional image information and three-dimensional spatio-temporal information simultaneously, and it has important application value in security monitoring, intelligent surveillance, medical care, video retrieval, human-computer interaction, intelligent robotics, and other fields.
Action time sequence detection comprises two stages: action feature extraction and action time sequence proposal. Existing methods depend heavily on the ability to understand and recognize actions, and their time sequence proposal stages struggle to locate the target action region because video data are structurally complex and target actions vary widely in duration. Two problems therefore remain to be solved: effective extraction of action features from large-scale video data, and high-precision time sequence detection that meets frame-level boundary requirements.
Disclosure of Invention
The invention provides an action time sequence detection method based on a 3D convolutional network, which extracts features from, and classifies, recognizes, and predicts, human actions in video. The method provides a foundation for technologies such as security monitoring, intelligent surveillance, human-computer interaction, and intelligent robotics.
To achieve the purpose of the invention, the following inventive concept is adopted:
To detect the timing of actions of arbitrary, unrestricted duration in video and to determine the action category, a key-frame-based action extraction method is designed, and a 3D convolutional network is combined with a spatio-temporal feature pyramid structure for multi-scale fusion, producing predictions of the whole action and of its details.
First, key frames in which the action changes significantly are extracted by K-means clustering, action features are extracted with a 3D convolutional network, and the 3D convolution-deconvolution network is fused with the spatio-temporal feature pyramid structure to obtain multi-scale frame-level action predictions; the prediction results are then fused by Kalman filtering to predict the action time sequence.
According to the inventive concept, the invention adopts the following technical scheme:
a motion time sequence detection method based on a 3D convolutional network is characterized in that: extracting key frames with obviously changed actions through K-means clustering, and extracting action characteristics by utilizing a 3D (three-dimensional) convolutional network;
then fusing the 3D convolution deconvolution network with the space-time characteristic pyramid structure to perform multi-scale action frame level prediction;
and finally, fusing the prediction results by Kalman filtering, and predicting the action time sequence to generate a proposal.
Preferably, the action feature extraction comprises the following steps:
1) dividing the video clips into training videos and test videos, used as input in the training stage and the test stage respectively;
2) clustering similar action frames in the video by K-means and selecting one video frame from each cluster as a key frame (a minimal sketch of this step follows the list);
3) inputting the resulting action key frame sequence into a 3D convolutional network and extracting spatio-temporal action features.
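As an illustration of step 2), the following is a minimal sketch of key frame selection. It assumes the frames of a clip arrive as a numpy array of shape (num_frames, H, W, C) and takes, from each cluster, the frame closest to the cluster centroid; the cluster count k and the use of raw pixels as the clustering feature are illustrative assumptions, not values fixed by the invention.

```python
# Minimal sketch of K-means key frame extraction; k and the raw-pixel
# clustering feature are assumptions made for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def extract_key_frames(frames: np.ndarray, k: int = 16) -> np.ndarray:
    """Cluster similar frames and keep one representative frame per cluster."""
    n = frames.shape[0]
    k = min(k, n)                                      # guard short clips
    feats = frames.reshape(n, -1).astype(np.float32)   # one vector per frame
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    key_indices = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # The member nearest the centroid stands in for the whole cluster.
        d = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        key_indices.append(members[np.argmin(d)])
    # Sort so the key frame sequence keeps the original temporal order.
    return frames[np.sort(np.asarray(key_indices))]
```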
Preferably, the action time sequence proposal comprises the following steps:
1) inputting the feature data obtained by action feature extraction into a 3D convolution-deconvolution network, and restoring the features to the original input length by temporal upsampling so that frame-level prediction is possible;
2) using the multi-scale property of the spatio-temporal pyramid to output action predictions of different scales independently from intermediate stages of the 3D convolution-deconvolution network, realizing prediction of the action as a whole;
3) applying Kalman filtering to the features obtained from each sliding window to filter over time, improving the continuity of predicted actions across adjacent windows and generating the time sequence detection action proposal.
Compared with the prior art, the invention has the following prominent substantive features and remarkable advantages:
1. The 3D convolutional network-based action time sequence detection method predicts, at the frame level, actions occurring at any position and lasting any duration, achieving real-time operation;
2. K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves;
3. The method fuses the frame-level prediction of the 3D convolution-deconvolution network with the multi-scale property of the spatio-temporal pyramid network, combining the frame-level prediction results with the whole-action prediction results, so the temporal position of an action is detected accurately and detection accuracy improves significantly over single-scale prediction.
Drawings
FIG. 1 is a block diagram of the 3D convolutional network-based action time sequence detection method of the invention.
FIG. 2 is a schematic diagram of key frame extraction in the method of the invention.
FIG. 3 is a schematic diagram of action feature extraction in the method of the invention.
FIG. 4 is a schematic diagram of multi-scale frame-level action prediction in the method of the invention.
FIG. 5 is a schematic diagram of time sequence action detection proposal generation in the method of the invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Example one
In this embodiment, referring to FIG. 1, an action time sequence detection method based on a 3D convolutional network extracts key frames in which the action changes significantly by K-means clustering, and extracts action features with the 3D convolutional network;
the 3D convolution-deconvolution network is then fused with the spatio-temporal feature pyramid structure to perform multi-scale frame-level action prediction;
finally, the prediction results are fused by Kalman filtering, and the action time sequence is predicted to generate a proposal.
The method extracts features from, and classifies, recognizes, and predicts, human actions in video, and can support security monitoring, intelligent surveillance, and human-computer interaction.
Example two
This embodiment is substantially the same as the first embodiment, with the following features:
In this embodiment, the action feature extraction comprises the following steps:
1) dividing the video clips into training videos and test videos, used as input in the training stage and the test stage respectively;
2) clustering similar action frames in the video by K-means and selecting one video frame from each cluster as a key frame;
3) inputting the resulting action key frame sequence into a 3D convolutional network and extracting spatio-temporal action features.
In this embodiment, the action time sequence proposal comprises the following steps:
1) inputting the feature data obtained by action feature extraction into a 3D convolution-deconvolution network, and restoring the features to the original input length by temporal upsampling so that frame-level prediction is possible;
2) using the multi-scale property of the spatio-temporal pyramid to output action predictions of different scales independently from intermediate stages of the 3D convolution-deconvolution network, realizing prediction of the action as a whole;
3) applying Kalman filtering to the features obtained from each sliding window to filter over time, improving the continuity of predicted actions across adjacent windows and generating the time sequence detection action proposal.
In this embodiment, key frames in which the action changes significantly are extracted by K-means clustering, action features are extracted with a 3D convolutional network, the 3D convolution-deconvolution network is fused with the spatio-temporal feature pyramid structure for multi-scale frame-level action prediction, and the prediction results are fused by Kalman filtering to predict the action time sequence. The method predicts, at the frame level, actions occurring at any position and lasting any duration, achieving real-time operation; K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves; and the multi-scale fusion of the 3D convolution-deconvolution network and the spatio-temporal feature pyramid network overcomes the low prediction accuracy of a single scale, so the prediction results preserve both the action as a whole and its details and detection accuracy is significantly improved.
Example three
This embodiment is substantially the same as the embodiments above, with the following features:
In this embodiment, as shown in FIG. 1, an action time sequence detection method based on a 3D convolutional network comprises the following steps:
step 1: a video clip is generated for the input video sliding window. The length of a real natural video in a time dimension is very long, so for detecting an action time sequence of a video with unlimited length, a sliding window with a fixed length needs to be performed on the video, so as to perform subsequent operation on each sliding window.
Step 2: extract video action key frames. After all frames of the video sequence are extracted at the video sampling rate, similar action frames are clustered by K-means and one video frame is selected from each cluster as a key frame, yielding the key frame sequence of the video. Key frame extraction removes redundancy from a lengthy video: it eliminates similar, redundant frames and adjusts the video length while preserving the completeness of the action.
Step 3: extract action features with the 3D convolutional network. The action key frame sequence is input into a 3D convolutional network to extract spatio-temporal action features. The network is initialized from a pre-trained model and fine-tuned before features are extracted. In the training stage, the loss computed at the softmax output layer is back-propagated layer by layer and the network parameters are adjusted by gradient descent, so that the 3D convolutional network adaptively learns features of the actions in the input video. In the test stage, the input action key frame sequence is passed through the network up to the fifth pooling layer to obtain the action features used in the subsequent classification and prediction tasks.
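The sketch below illustrates Step 3 with a C3D-style 3D convolutional network in PyTorch. The layer widths follow the common C3D layout and are assumptions; the description only requires a pre-trained, fine-tuned 3D network whose fifth pooling layer supplies the features. In training, `logits` would feed a softmax cross-entropy loss back-propagated by gradient descent; at test time, `feat` is kept as the spatio-temporal action feature.

```python
# Minimal C3D-style sketch of the Step 3 feature extractor; layer sizes are
# assumptions for illustration, not the invention's fixed architecture.
import torch
import torch.nn as nn

class C3DFeatures(nn.Module):
    def __init__(self, num_classes: int = 20):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=pool))        # stride defaults to pool
        self.features = nn.Sequential(
            block(3, 64, (1, 2, 2)),     # pool1: spatial pooling only
            block(64, 128, (2, 2, 2)),   # pool2
            block(128, 256, (2, 2, 2)),  # pool3
            block(256, 512, (2, 2, 2)),  # pool4
            block(512, 512, (2, 2, 2)))  # pool5: test-time features taken here
        self.classifier = nn.Linear(512, num_classes)  # softmax head (training)

    def forward(self, x):                # x: (batch, 3, frames, height, width)
        feat = self.features(x)          # fifth-pooling-layer feature map
        logits = self.classifier(feat.mean(dim=(2, 3, 4)))
        return feat, logits              # feat for detection, logits for loss
```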
Step 4: multi-scale frame-level prediction based on 3D convolution-deconvolution and the spatio-temporal feature pyramid. The feature data obtained in the preceding steps are input into a 3D convolution-deconvolution network, which downsamples in the spatial dimensions while upsampling in the time dimension, restoring the temporal resolution. On this basis, because frame prediction from a single-scale network may lose information about the action as a whole, a spatio-temporal feature pyramid structure is introduced: action predictions at different scales are output independently from intermediate stages of the 3D convolution-deconvolution network, and the multi-scale features are fused to obtain the final frame-level action prediction.
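A minimal sketch of Step 4 follows: temporal deconvolution stages restore temporal resolution, each pyramid level emits its own prediction, and the per-scale predictions are aligned to frame level and fused. The channel widths, the number of levels, and averaging as the fusion rule are illustrative assumptions.

```python
# Minimal sketch of multi-scale frame-level prediction via temporal
# deconvolution and per-scale heads; widths and fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFramePredictor(nn.Module):
    def __init__(self, cin: int = 512, num_classes: int = 21):
        super().__init__()
        # Each deconvolution stage doubles the temporal length only.
        self.up1 = nn.ConvTranspose3d(cin, 256, kernel_size=(4, 1, 1),
                                      stride=(2, 1, 1), padding=(1, 0, 0))
        self.up2 = nn.ConvTranspose3d(256, 128, kernel_size=(4, 1, 1),
                                      stride=(2, 1, 1), padding=(1, 0, 0))
        # One prediction head per pyramid level.
        self.heads = nn.ModuleList([nn.Conv3d(c, num_classes, kernel_size=1)
                                    for c in (cin, 256, 128)])

    def forward(self, feat, out_len):
        # feat: (batch, cin, T', h, w) features from the 3D convolutional net.
        levels = [feat]
        levels.append(F.relu(self.up1(levels[0])))
        levels.append(F.relu(self.up2(levels[1])))
        fused = []
        for level, head in zip(levels, self.heads):
            p = head(level).mean(dim=(3, 4))  # pool away space: (B, classes, T)
            p = F.interpolate(p, size=out_len, mode='linear',
                              align_corners=False)  # align scale to frame level
            fused.append(p)
        return torch.stack(fused).mean(dim=0)  # fused frame-level class scores
```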
Step 5: generate the time sequence action prediction proposal. In the frame-level prediction results generated in the preceding steps, an action spanning adjacent windows is split by the sliding window, which harms the completeness of the generated action proposals. Kalman filtering is therefore applied to the frame-level predictions over time: the optimal estimate of the current frame is formed by combining the state from the historical sequence with the observation of the current frame, yielding the best action proposal generation result.
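The sketch below illustrates Step 5 with a simple one-dimensional Kalman filter over a per-frame action score sequence, using a constant-state model; the process and observation noise parameters q and r are illustrative assumptions.

```python
# Minimal sketch of Kalman smoothing of per-frame action scores across
# window boundaries; q and r are illustrative noise parameters.
import numpy as np

def kalman_smooth(scores: np.ndarray, q: float = 1e-3, r: float = 1e-1) -> np.ndarray:
    """Fuse the historical state with the current observation, frame by frame."""
    x, p = float(scores[0]), 1.0        # initial state estimate and covariance
    out = np.empty_like(scores, dtype=np.float64)
    out[0] = x
    for t in range(1, len(scores)):
        p = p + q                       # predict step (state assumed constant)
        k = p / (p + r)                 # Kalman gain
        x = x + k * (scores[t] - x)     # update with the current frame's score
        p = (1.0 - k) * p
        out[t] = x
    return out
```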
As shown in FIG. 2, the video action key frame extraction of this embodiment proceeds as follows:
similar action frames of the video sequence are clustered by K-means, and one video frame is selected from each cluster as a key frame, yielding the key frame sequence of the video.
As shown in FIG. 3, the action feature extraction of this embodiment proceeds as follows:
the resulting action key frame sequence is input into a 3D convolutional network to extract spatio-temporal action features.
As shown in FIG. 4, the multi-scale frame-level action prediction of this embodiment proceeds as follows:
the action feature data are input into the 3D convolution-deconvolution network and the spatio-temporal feature pyramid structure to obtain action predictions at different scales, and the multi-scale features are fused to obtain the final frame-level action prediction.
As shown in FIG. 5, the time sequence action detection proposal generation of this embodiment proceeds as follows:
Kalman filtering is applied over time to the frame-level action predictions to form the optimal estimate of the current frame, yielding the best action proposal generation result.
This embodiment adopts the action time sequence detection method based on the 3D convolutional network, so actions occurring at any position and lasting any duration can be predicted at the frame level, achieving real-time operation. K-means clustering maximizes the information difference between action key frames, so the 3D convolutional network extracts rich action feature information more effectively and classification accuracy improves. The method fuses the frame-level prediction of the 3D convolution-deconvolution network with the multi-scale property of the spatio-temporal pyramid network, combining the frame-level and whole-action prediction results, so the temporal position of an action is detected accurately and detection accuracy improves significantly over single-scale prediction.
The embodiments of the invention have been described above with reference to the accompanying drawings, but the invention is not limited to these embodiments. Various changes, modifications, substitutions, combinations, and simplifications may be made according to the purpose of the invention; as long as they meet that purpose and do not depart from the technical principle and inventive concept of the invention, they are equivalent substitutions and fall within the scope of protection of the invention.

Claims (3)

1. An action time sequence detection method based on a 3D convolutional network, characterized in that: key frames in which the action changes significantly are extracted by K-means clustering, and action features are extracted with a 3D convolutional network;
the 3D convolution-deconvolution network is then fused with the spatio-temporal feature pyramid structure to perform multi-scale frame-level action prediction;
finally, the prediction results are fused by Kalman filtering, and the action time sequence is predicted to generate a proposal.
2. The 3D convolutional network-based action time sequence detection method according to claim 1, characterized in that the action feature extraction comprises the following steps:
1) dividing the video clips into training videos and test videos, used as input in the training stage and the test stage respectively;
2) clustering similar action frames in the video by K-means and selecting one video frame from each cluster as a key frame;
3) inputting the resulting action key frame sequence into a 3D convolutional network and extracting spatio-temporal action features.
3. The 3D convolutional network-based action time sequence detection method according to claim 1, characterized in that the action time sequence proposal comprises the following steps:
1) inputting the feature data obtained by action feature extraction into a 3D convolution-deconvolution network, and restoring the features to the original input length by temporal upsampling so that frame-level prediction is possible;
2) using the multi-scale property of the spatio-temporal pyramid to output action predictions of different scales independently from intermediate stages of the 3D convolution-deconvolution network, realizing prediction of the action as a whole;
3) applying Kalman filtering to the features obtained from each sliding window to filter over time, improving the continuity of predicted actions across adjacent windows and generating the time sequence detection action proposal.
CN202110285908.1A 2021-03-17 2021-03-17 Action time sequence detection method based on 3D convolutional network Pending CN112949544A (en)

Priority Applications (1)

Application Number: CN202110285908.1A | Priority Date: 2021-03-17 | Filing Date: 2021-03-17 | Title: Action time sequence detection method based on 3D convolutional network (CN112949544A)

Applications Claiming Priority (1)

Application Number: CN202110285908.1A | Priority Date: 2021-03-17 | Filing Date: 2021-03-17 | Title: Action time sequence detection method based on 3D convolutional network (CN112949544A)

Publications (1)

Publication Number: CN112949544A | Publication Date: 2021-06-11

Family

ID=76229361

Family Applications (1)

Application Number: CN202110285908.1A | Title: Action time sequence detection method based on 3D convolutional network | Priority Date: 2021-03-17 | Filing Date: 2021-03-17 | Status: Pending (CN112949544A)

Country Status (1)

Country Link
CN (1) CN112949544A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109541583A (en) * 2018-11-15 2019-03-29 众安信息技术服务有限公司 A kind of leading vehicle distance detection method and system
CN109947986A (en) * 2019-03-18 2019-06-28 东华大学 Infrared video timing localization method based on structuring sectional convolution neural network
CN110688927A (en) * 2019-09-20 2020-01-14 湖南大学 Video action detection method based on time sequence convolution modeling
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
CN111898514A (en) * 2020-07-24 2020-11-06 燕山大学 Multi-target visual supervision method based on target detection and action recognition
CN112101243A (en) * 2020-09-17 2020-12-18 四川轻化工大学 Human body action recognition method based on key posture and DTW

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU WANG et al., "Video action recognition based on improved 3D convolutional network and sparse representation classification", Proceedings of SPIE. *
刘望 (LIU Wang) et al., "基于时空特征金字塔网络的动作时序检测方法" (Action time sequence detection method based on spatio-temporal feature pyramid network), 《系统仿真学报》 (Journal of System Simulation). *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345061A (en) * 2021-08-04 2021-09-03 成都市谛视科技有限公司 Training method and device for motion completion model, completion method and device, and medium

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20210611)