CN109977773B - Human behavior identification method and system based on multi-target detection 3D CNN - Google Patents

Human behavior identification method and system based on multi-target detection 3D CNN

Info

Publication number
CN109977773B
Authority
CN
China
Prior art keywords
model
image frame
video
data
behavior
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910136442.1A
Other languages
Chinese (zh)
Other versions
CN109977773A (en)
Inventor
董敏
李永发
毕盛
聂宏蓄
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201910136442.1A
Publication of CN109977773A
Application granted
Publication of CN109977773B
Legal status: Expired - Fee Related


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human behavior recognition method and system based on multi-target detection 3D CNN. The method comprises the following steps: 1) preprocessing a video and converting the video stream into image frames; 2) calibrating and cropping the target object in the video using the relatively mature SSD detection technique; 3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data; 4) establishing a feature fusion model and fusing the two features extracted in step 3); 5) classifying with a Softmax regression classifier; 6) fine-tuning the trained model according to the actual application scene or a public data set. The method compensates for the information that current deep neural network models lose when convolving along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition performance of the model, and enables the model to better understand human behavior.

Description

Human behavior identification method and system based on multi-target detection 3D CNN
Technical Field
The invention relates to the technical field of human behavior recognition analysis, in particular to a human behavior recognition method and system based on multi-target detection 3D CNN.
Background
Human behavior recognition refers to recognizing human behaviors or actions in a real environment, and may be applied in various fields. Common application scenarios at present include intelligent surveillance, smart homes, human-computer interaction, and the analysis and prediction of human behavior attributes. However, improving the accuracy and efficiency of recognition remains a very challenging task, and one that receives a great deal of attention from researchers.
In the past decades, the extraction and representation of human behavior features mainly remained at the hand-crafted stage, and manually designing and extracting features often depends on the experience of the designer. Common hand-crafted feature extraction methods include: space-time interest points (STIP), bag of visual words (BoVW), histogram of oriented gradients (HOG), motion history images (MHI), motion energy images (MEI), and so on. Hand-crafted features are usually designed for a specific portion of the data, so the generalization ability of the resulting model is poor, the model cannot be rapidly migrated to other applications, and the labor cost rises sharply. Traditional methods can be said to have entered a bottleneck period.
The application of deep learning to human behavior recognition largely remedies the defects of traditional recognition approaches. Its main advantages are: (1) it avoids the trouble of manual feature design and simplifies feature extraction; (2) the deep neural network has a feedback-driven adjustment capability, which greatly enhances the generalization ability of the model; (3) complex features can be automatically reduced in dimensionality; (4) when processing big data, computational overhead can be greatly reduced and overall execution efficiency improved; (5) it performs better on the recognition and classification of unlabeled data; (6) modality-based behavior recognition is easy to realize: one only needs to design a separate deep learning model to extract features for each modality and then fuse the features of two or more network models, which greatly improves recognition accuracy.
One of the biggest differences between human behavior recognition and image classification or detection is whether information in the time dimension is involved. Human behavior analysis therefore needs to extract not only behavior features in the spatial dimension but also continuous information along the temporal dimension of the behavior; only then can a continuous behavioral action be described correctly.
Disclosure of Invention
The invention aims to overcome the weakness of current deep neural network models in capturing time-dimension information for human behavior recognition. It provides a human behavior recognition method and system based on multi-target detection 3D CNN that compensates for information lost during convolution along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition performance of the model, and enables the model to better understand human behavior.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the human behavior identification method based on multi-target detection 3D CNN comprises the following steps:
1) preprocessing a video, and converting a video stream into image frames;
2) adopting the SSD (Single Shot MultiBox Detector) detection technique to calibrate and crop the target object in the video;
3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data;
4) establishing a feature fusion model, and fusing the two features extracted in the step 3);
5) classifying by using a Softmax regression model classifier;
6) fine-tuning the trained model according to the actual application scene or a public data set, thereby enhancing the generalization and popularization ability of the model.
In step 1), preprocessing a video and converting a video stream into image frames, comprising the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
in the step 2), the SSD detection technology is adopted to perform calibration cutting on the target object in the video, and the method comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set.
In step 3), a feature extraction network structure for the image frame data and the calibrated cropped data is established, specifically comprising the following steps:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
in step 4), a feature fusion model is established to perform feature fusion, and the method comprises the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers.
In step 5), classifying by using a Softmax classifier, comprising the steps of:
5.1) after the feature fusion of step 4) is completed, the features pass through three fully connected layers as the input of a Softmax classifier and are then classified;
5.2) setting a threshold for the early-warning report; the system gives an early-warning prompt once it judges that the recognition score of a certain behavior action has reached the corresponding threshold.
In step 6), the trained model is fine-tuned according to the actual application scene or a public data set to enhance its generalization and popularization ability, comprising the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
Human behavior recognition system based on multi-target detection 3D CNN includes:
the data acquisition module is used for acquiring original video data information of human behavior analysis, wherein the original video data information comprises a public behavior data set and a video data set in an actual scene;
the data preprocessing module is used for preprocessing the original video data: classification and calibration, target detection, cropping, and video frame conversion;
the method for preprocessing the video and converting the video stream into the image frame comprises the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
the method for calibrating and cutting the target object in the video by adopting the SSD detection technology comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set;
the feature extraction module is used for sending the preprocessed data into the constructed 3D CNN network model and extracting, respectively, the behavior feature information of the video stream and the feature information of the calibrated, cropped behavior subject, specifically as follows:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
the feature fusion module is used for fusing the feature information acquired by the feature extraction module and comprises the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers;
the model training module is used for learning and modeling the preprocessed training set to obtain a trained 3D CNN human body behavior recognition model for multi-target detection;
and the human body behavior recognition module is used for classifying and recognizing the behavior actions of the human body by utilizing a multi-target detection 3D CNN human body behavior recognition model.
Further, the data acquisition module acquires video data in actual scenes through monocular and binocular cameras and downloads public human behavior data sets; the data preprocessing module processes the video data with the FFmpeg tool, converts it into image frame sets, and calibrates and crops the video with the SSD detection algorithm, generating the image frame set of step 1.3) and the calibrated image frame set of step 2.3); the feature extraction module adopts the 3D CNN model, takes 16 consecutive frames of data as the model input, and adopts 5 3D convolution layers and 5 max 3D pooling layers; the feature fusion module adopts a 1-layer 3D feature fusion structure to fuse the two kinds of behavior feature information, after which 3 fully connected layers further extract and classify the features; the model training module combines the public human behavior data sets UCF-101 and HMDB51 with actual data sets collected by the system itself to form the training data set; the human behavior recognition module performs classification and recognition with a Softmax classifier.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The video data is converted into an image frame set, and the persons in the video stream are calibrated and cropped with the SSD (Single Shot MultiBox Detector) detection algorithm, so behavior feature information can be extracted globally from the video while local features are extracted for the behavior subject. This compensates for the weakening of global features and enhances the learning ability of the model.
2. The 3D CNN model extracts features from the two preprocessed data sets, overcoming the limitation that a traditional 2D CNN can only extract video features spatially. No separate extraction and fusion of temporal behavior features is needed; the image frame data are simply fed in batches, and the model automatically extracts behavior features in both the time and space dimensions, which greatly reduces the difficulty of feature extraction along the time dimension.
3. The behavior features learned by the model can be used not only for classification and recognition but also for early-warning reports: the model can pre-judge and report special behaviors according to a set early-warning threshold, which broadens the practical application scenes of the model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the 3D convolution operation structure in the present invention.
FIG. 3 is a structural design diagram of a 3D convolutional neural network model in the present invention.
Fig. 4 is a structural diagram of a 3D CNN model based on multi-target detection.
Detailed Description
The present invention will be further described with reference to the following specific examples.
Referring to fig. 1, the human behavior recognition method based on multi-target detection 3D CNN provided in this embodiment includes the following steps:
1) establishing a human behavior recognition data acquisition system and acquiring a human behavior video data set, where public data sets are mainly used for model training and the test data set is captured by a camera in a real environment;
2) converting the acquired video data set into an image frame set, and using the SSD (Single Shot MultiBox Detector) detection algorithm to produce a calibrated, cropped data set;
3) establishing a 3D CNN learning model, respectively learning the data sets, and fusing the learned characteristics;
4) classifying and identifying the fused features by using a Softmax classifier;
5) labeling and recognizing the classified behavior results, or issuing early-warning reports on them;
6) fine-tuning the model according to the specific application scene to enhance its popularization and generalization ability.
In step 2), the video data set acquired in step 1) is preprocessed. Because the model performs fusion recognition over multiple targets, preprocessing is divided into the following two independent processes:
2.1) directly splitting the video data set into frames to establish the first image frame set, comprising the following steps (an illustrative sketch follows these steps):
2.1.1) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
2.1.2) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
2.1.3) splitting the image frame set obtained in step 2.1.2) by a cross-validation method, for training the model.
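The conversion script of step 2.1.2) is not reproduced in the patent; the following is a minimal sketch of how it might look, assuming FFmpeg is available (the tool this embodiment names later) and an illustrative frame rate of 25 fps, with the label-per-folder layout of step 2.1.1):

```python
import os
import subprocess

def video_to_frames(video_path, out_dir, fps=25):
    """Split one video into JPEG frames via FFmpeg (fps is an assumption)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(out_dir, "frame_%05d.jpg")],
        check=True)

def convert_dataset(dataset_root, frames_root):
    """Walk the behavior-label folder layout of step 2.1.1) and convert each video."""
    for label in os.listdir(dataset_root):
        for video in os.listdir(os.path.join(dataset_root, label)):
            name = os.path.splitext(video)[0]
            video_to_frames(os.path.join(dataset_root, label, video),
                            os.path.join(frames_root, label, name))
```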
2.2) using the SSD (Single Shot MultiBox Detector) algorithm to detect the subject of the behavior action, extract targeted action features, and establish the second image frame set, comprising the following steps (an illustrative sketch follows these steps):
2.2.1) loading the trained SSD detection model;
2.2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 2.1.3), then converting all videos and storing them as the calibrated image frame set.
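As an illustration of steps 2.2.1) to 2.2.3), the sketch below loads a pre-trained SSD through OpenCV's dnn module and crops each frame to the highest-confidence detection at half the original frame size; the model file names and the 0.5 confidence cut-off are assumptions, not values fixed by the patent:

```python
import cv2
import numpy as np

# Hypothetical model files; any OpenCV-loadable SSD would serve.
net = cv2.dnn.readNetFromCaffe("ssd_deploy.prototxt", "ssd.caffemodel")

def calibrate_and_crop(frame, conf_threshold=0.5):
    """Detect the behavior subject, crop to its box, resize to half the frame size."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    det = net.forward()                      # SSD output shape: [1, 1, N, 7]
    best = int(np.argmax(det[0, 0, :, 2]))   # index of highest-confidence box
    if det[0, 0, best, 2] < conf_threshold:
        return None                          # no subject found in this frame
    x1, y1, x2, y2 = (det[0, 0, best, 3:7] * np.array([w, h, w, h])).astype(int)
    crop = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(crop, (w // 2, h // 2))
```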
Referring to fig. 2, a schematic structural diagram of extracting behavior characteristics for performing convolution operation on the 3D CNN model designed in the present invention is shown. The 3D CNN can extract behavior feature information from two dimensions, namely space and time, and as can be seen from fig. 2, the time dimension for performing convolution operation is N, that is, the convolution operation is performed on consecutive N frames of images. The 3D convolution in the figure is performed by stacking N successive image frames into a cube and then applying a 3D convolution kernel to the cube. In this configuration, each feature map in the convolutional layer is connected to a number of adjacent consecutive frames in the previous layer, thus capturing motion information.
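As a shape-level illustration of this operation, the sketch below applies one 3D convolution to a 16-frame cube; the 3x3x3 kernel and the 112x112 resolution are illustrative assumptions (the patent fixes only the 16-frame input):

```python
import numpy as np
from tensorflow.keras.layers import Conv3D

clip = np.random.rand(1, 16, 112, 112, 3).astype("float32")  # (batch, frames, H, W, C)
conv = Conv3D(64, kernel_size=(3, 3, 3), padding="same", activation="relu")
out = conv(clip)
print(out.shape)  # (1, 16, 112, 112, 64): the time dimension is preserved
```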
Referring to fig. 3, in step 3), a 3D CNN model is built, and feature learning is performed, including the following steps:
3.1) building one 3D convolutional neural network model based on the image frame set of step 2.1.3) and another based on the calibrated image frame set of step 2.2.3). With 16 consecutive frames of data as the input of each model, 5 3D convolution layers (with 64, 128, 256 and 512 convolution kernels in order), 5 max 3D pooling layers and a fully connected layer (of width 2048) are adopted, and the obtained features are used as the input of the model fusion layer, as shown in fig. 4. This includes the following steps:
3.1.1) respectively obtaining the 3D convolution characteristics extracted by the two models, and carrying out Flatten () operation on the obtained characteristics as the input of a fusion layer;
3.1.2) completing the fusion of the intermediate features as the input of the full connection layer.
3.2) to prevent overfitting during training, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers.
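A minimal Keras sketch of this two-stream architecture follows. The patent fixes the 16-frame input, the 5 convolution layers (64, 128, 256 and 512 kernels are listed), the 5 pooling layers, a 2048-wide per-stream fully connected layer, one fusion layer, three fully connected layers, L2 regularization and dropout (0.5); the 3x3x3 kernels, the two input resolutions, the fifth convolution width, the post-fusion widths and the 101 output classes are assumptions made for illustration:

```python
from tensorflow.keras import Model, Input, regularizers
from tensorflow.keras.layers import (Conv3D, MaxPooling3D, Flatten,
                                     Concatenate, Dense, Dropout)

def stream(shape):
    """One branch: 5 L2-regularized 3D conv layers, 5 max-pool layers,
    then the per-stream 2048-wide fully connected layer of step 3.1)."""
    inp = Input(shape=shape)
    x = inp
    for filters in (64, 128, 256, 512, 512):   # 5th width is an assumption
        x = Conv3D(filters, (3, 3, 3), padding="same", activation="relu",
                   kernel_regularizer=regularizers.l2(1e-4))(x)
        x = MaxPooling3D((2, 2, 2), padding="same")(x)
    return inp, Dense(2048, activation="relu")(Flatten()(x))

full_in, full_feat = stream((16, 112, 112, 3))  # whole-frame stream
crop_in, crop_feat = stream((16, 56, 56, 3))    # SSD-calibrated stream

x = Concatenate()([full_feat, crop_feat])       # the 1 fusion layer
for units in (4096, 4096, 2048):                # 3 FC layers (widths assumed)
    x = Dropout(0.5)(Dense(units, activation="relu")(x))
out = Dense(101, activation="softmax")(x)       # e.g. 101 UCF-101 classes

model = Model([full_in, crop_in], out)
```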
Referring to fig. 4, in step 4), classification and identification are performed on the features fused in step 3.1) by using a Softmax classifier, and the method comprises the following steps:
4.1) after the feature fusion is completed, the features pass through three fully connected layers as the input of a Softmax classifier and are then classified;
4.2) setting a threshold for the early-warning report; the system gives an early-warning prompt once it judges that the recognition score of a certain behavior action has reached the corresponding threshold.
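The early-warning check of step 4.2) reduces to comparing the classifier's softmax output against per-behavior thresholds; a sketch follows, in which the behavior labels, threshold values and alert action are illustrative assumptions:

```python
import numpy as np

ALERT_THRESHOLDS = {"fall_down": 0.80, "fight": 0.90}  # hypothetical labels

def check_alert(probs, class_names):
    """Emit an early-warning prompt when a flagged behavior's score
    reaches its threshold (step 4.2)."""
    i = int(np.argmax(probs))
    label, score = class_names[i], float(probs[i])
    # Behaviors without a configured threshold never trigger an alert.
    if score >= ALERT_THRESHOLDS.get(label, float("inf")):
        print(f"EARLY WARNING: {label} recognized with score {score:.2f}")
    return label, score
```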
In step 6), the model is fine-tuned according to the specific application scene to enhance its popularization and generalization ability, comprising the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
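Steps 6.1) to 6.3) amount to standard transfer learning; a sketch under the assumption of the Keras `model` built above, with `n_new_classes` and `new_train_data` supplied by the new scene, is:

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Dense

# 6.1) freeze the convolution and pooling layer parameters
for layer in model.layers:
    if isinstance(layer, (Conv3D, MaxPooling3D)):
        layer.trainable = False

# 6.2) replace the output layer for the new scene's behavior classes
new_out = Dense(n_new_classes, activation="softmax")(model.layers[-2].output)
finetuned = Model(model.input, new_out)

# 6.3) retrain only the fully connected parameters on the new data set
finetuned.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
finetuned.fit(new_train_data, epochs=5)  # epoch count is an assumption
```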
The following is the human behavior recognition system based on multi-target detection 3D CNN provided in this embodiment, including:
a data acquisition module: the method is used for collecting original video data information of human body behavior analysis, and the original video data information comprises a public behavior data set and a video data set in an actual scene. In this embodiment, a monocular camera and a binocular camera are used to capture video data in an actual scene and download a public human behavior data set as a total data set to be captured.
A data preprocessing module: used for preprocessing the original video data, including classification and calibration, target detection, cropping, and video frame conversion. In this embodiment, the FFmpeg tool is used to process the video data and convert it into image frame sets, and the SSD (Single Shot MultiBox Detector) detection algorithm is used to calibrate and crop the video, generating two image frame sets as follows:
2.1) directly splitting the video data set into frames to establish the first image frame set, comprising the following steps:
2.1.1) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
2.1.2) preprocessing a video data set, and converting all videos into corresponding image frame sets by using an FFmpeg tool;
2.1.3) splitting the image frame set obtained in step 2.1.2) by a cross-validation method, for training the model.
2.2) using the SSD (Single Shot MultiBox Detector) algorithm to detect the subject of the behavior action, extract targeted action features, and establish the second image frame set, comprising the following steps:
2.2.1) loading the trained SSD detection model;
2.2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 2.1.3), then converting all videos and storing them as the calibrated image frame set.
A feature extraction module: used for sending the preprocessed data into the constructed 3D CNN network model and extracting, respectively, the behavior feature information of the video stream and the feature information of the calibrated, cropped behavior subject. In the present embodiment, a 3D CNN model is used: 16 consecutive frames of data serve as the input of the model, and 5 3D convolution layers and 5 max 3D pooling layers extract the two kinds of feature information as the input of the feature fusion module.
A feature fusion module: used for fusing the feature information acquired by the feature extraction module. In this embodiment, a 1-layer 3D feature fusion structure is adopted to fuse the two kinds of behavior feature information, and 3 fully connected layers further extract and classify the features.
A model training module: used for learning and modeling the preprocessed training set to obtain the trained multi-target detection 3D CNN human behavior recognition model. In this embodiment, the training data set is formed by combining public human behavior data sets such as UCF-101 and HMDB51 with actual data sets collected by ourselves.
A human behavior recognition module: used for classifying and recognizing human behavior actions with the multi-target detection 3D CNN human behavior recognition model. In the present embodiment, classification and recognition are performed by a Softmax classifier.
In the above embodiments, the included modules are divided only according to functional logic; the division is not limited to the above, as long as the corresponding functions can be implemented, and is not intended to limit the scope of the present invention.
In conclusion, the human behavior recognition method and system based on multi-target detection 3D CNN provided by the invention not only remedy the deficiency of 2D neural networks in extracting features along the time dimension; they also adopt a multi-target detection approach, introducing the SSD (Single Shot MultiBox Detector) target detection algorithm to calibrate the behavior subject in the video stream, acquiring finer local features that are fused into the model to compensate for the weakening of its global features. Meanwhile, the behavior features learned by the model can be used both for classification and recognition and for early-warning reports: the model can pre-judge and report special behaviors according to a set early-warning threshold, which broadens its practical application scenes. The model can also be migrated to Internet-of-Things platforms such as smart homes, intelligent monitoring and intelligent anti-theft, and thus has broad research and practical value and strong potential for wider adoption.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention should be covered within its protection scope.

Claims (5)

1. The human behavior identification method based on multi-target detection 3D CNN is characterized by comprising the following steps:
1) preprocessing the video and converting the video stream into image frames, comprising the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
2) calibrating and cropping the target object in the video by the SSD detection technique, comprising the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set;
3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data, comprising the following steps:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
4) establishing a feature fusion model, fusing the two features extracted in the step 3), and comprising the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers;
5) classifying by using a Softmax regression model classifier;
6) fine-tuning the trained model according to the actual application scene or a public data set, thereby enhancing the generalization and popularization ability of the model.
2. The human behavior recognition method based on multi-target detection 3D CNN of claim 1, wherein in step 5), classification is performed by using a Softmax classifier, and the method comprises the following steps:
5.1) after the feature fusion of step 4) is completed, the features pass through three fully connected layers as the input of a Softmax classifier and are then classified;
5.2) setting a threshold for the early-warning report; the system gives an early-warning prompt once it judges that the recognition score of a certain behavior action has reached the corresponding threshold.
3. The human behavior recognition method based on multi-target detection 3D CNN of claim 1, wherein in step 6), the trained model is fine-tuned according to an actual application scene or a public data set to enhance the generalization and popularization ability of the model, comprising the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
4. A human behavior recognition system based on multi-target detection 3D CNN, characterized by comprising:
the data acquisition module is used for acquiring original video data information of human behavior analysis, wherein the original video data information comprises a public behavior data set and a video data set in an actual scene;
the data preprocessing module is used for preprocessing the original video data: classification and calibration, target detection, cropping, and video frame conversion;
wherein preprocessing the video and converting the video stream into image frames comprises the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
and calibrating and cropping the target object in the video by the SSD detection technique comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set;
the feature extraction module is used for sending the preprocessed data into the constructed 3D CNN network model and extracting, respectively, the behavior feature information of the video stream and the feature information of the calibrated, cropped behavior subject, specifically as follows:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
the feature fusion module is used for fusing the feature information acquired by the feature extraction module and comprises the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers;
the model training module is used for learning and modeling the preprocessed training set to obtain a trained 3D CNN human body behavior recognition model for multi-target detection;
and the human body behavior recognition module is used for classifying and recognizing the behavior actions of the human body by utilizing a multi-target detection 3D CNN human body behavior recognition model.
5. The human behavior recognition system based on multi-target detection 3D CNN of claim 4, wherein: the data acquisition module acquires video data in actual scenes through monocular and binocular cameras and downloads public human behavior data sets; the data preprocessing module processes the video data with the FFmpeg tool, converts it into image frame sets, and calibrates and crops the video with the SSD detection algorithm, generating the image frame set of step 1.3) and the calibrated image frame set of step 2.3); the feature extraction module adopts the 3D CNN model, takes 16 consecutive frames of data as the model input, and adopts 5 3D convolution layers and 5 max 3D pooling layers; the feature fusion module adopts a 1-layer 3D feature fusion structure to fuse the two kinds of behavior feature information, after which 3 fully connected layers further extract and classify the features; the model training module combines the public human behavior data sets UCF-101 and HMDB51 with actual data sets collected by itself to form the training data set; the human behavior recognition module performs classification and recognition with a Softmax classifier.
CN201910136442.1A 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN Expired - Fee Related CN109977773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910136442.1A CN109977773B (en) 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910136442.1A CN109977773B (en) 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN

Publications (2)

Publication Number Publication Date
CN109977773A CN109977773A (en) 2019-07-05
CN109977773B true CN109977773B (en) 2021-01-19

Family

ID=67077264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910136442.1A Expired - Fee Related CN109977773B (en) 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN

Country Status (1)

Country Link
CN (1) CN109977773B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420B (en) 2019-07-18 2022-03-18 腾讯科技(深圳)有限公司 Sign language recognition method and device, computer readable storage medium and computer equipment
CN110414415A (en) * 2019-07-24 2019-11-05 北京理工大学 Human bodys' response method towards classroom scene
CN110414421B (en) * 2019-07-25 2023-04-07 电子科技大学 Behavior identification method based on continuous frame images
CN110532909B (en) * 2019-08-16 2023-04-14 成都电科慧安科技有限公司 Human behavior identification method based on three-dimensional UWB positioning
CN111259838B (en) * 2020-01-20 2023-02-03 山东大学 Method and system for deeply understanding human body behaviors in service robot service environment
CN111382677B (en) * 2020-02-25 2023-06-20 华南理工大学 Human behavior recognition method and system based on 3D attention residual error model
CN113536847A (en) * 2020-04-17 2021-10-22 天津职业技术师范大学(中国职业培训指导教师进修中心) Industrial scene video analysis system and method based on deep learning
CN113515986A (en) * 2020-07-02 2021-10-19 阿里巴巴集团控股有限公司 Video processing method, data processing method and equipment
CN112016461B (en) * 2020-08-28 2024-06-11 深圳市信义科技有限公司 Multi-target behavior recognition method and system
CN112232190B (en) * 2020-10-15 2023-04-18 南京邮电大学 Method for detecting abnormal behaviors of old people facing home scene
CN112613428B (en) * 2020-12-28 2024-03-22 易采天成(郑州)信息技术有限公司 Resnet-3D convolution cattle video target detection method based on balance loss
CN112766151B (en) * 2021-01-19 2022-07-12 北京深睿博联科技有限责任公司 Binocular target detection method and system for blind guiding glasses
CN113052059A (en) * 2021-03-22 2021-06-29 中国石油大学(华东) Real-time action recognition method based on space-time feature fusion
CN113221658A (en) * 2021-04-13 2021-08-06 卓尔智联(武汉)研究院有限公司 Training method and device of image processing model, electronic equipment and storage medium
CN113420703B (en) * 2021-07-03 2023-04-18 西北工业大学 Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN115601714B (en) * 2022-12-16 2023-03-10 广东汇通信息科技股份有限公司 Campus violent behavior identification method based on multi-modal data analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method
CN108108652A (en) * 2017-03-29 2018-06-01 广东工业大学 A kind of across visual angle Human bodys' response method and device based on dictionary learning
CN108647591A (en) * 2018-04-25 2018-10-12 长沙学院 Activity recognition method and system in a kind of video of view-based access control model-semantic feature
CN108985173A (en) * 2018-06-19 2018-12-11 奕通信息科技(上海)股份有限公司 Towards the depth network migration learning method for having the label apparent age data library of noise
CN109002808A (en) * 2018-07-27 2018-12-14 高新兴科技集团股份有限公司 A kind of Human bodys' response method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification


Also Published As

Publication number Publication date
CN109977773A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN109977773B (en) Human behavior identification method and system based on multi-target detection 3D CNN
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN110363140B (en) Human body action real-time identification method based on infrared image
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
Xie et al. Detecting trees in street images via deep learning with attention module
CN109670405B (en) Complex background pedestrian detection method based on deep learning
Chen et al. An improved Yolov3 based on dual path network for cherry tomatoes detection
CN111582122B (en) System and method for intelligently analyzing behaviors of multi-dimensional pedestrians in surveillance video
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN111382677A (en) Human behavior identification method and system based on 3D attention residual error model
CN108875555B (en) Video interest area and salient object extracting and positioning system based on neural network
CN105590099A (en) Multi-user behavior identification method based on improved convolutional neural network
CN113705445B (en) Method and equipment for recognizing human body posture based on event camera
CN110929685A (en) Pedestrian detection network structure based on mixed feature pyramid and mixed expansion convolution
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
CN113516102A (en) Deep learning parabolic behavior detection method based on video
CN113255464A (en) Airplane action recognition method and system
Tsutsui et al. Distantly supervised road segmentation
CN103500456A (en) Object tracking method and equipment based on dynamic Bayes model network
CN114359578A (en) Application method and system of pest and disease damage identification intelligent terminal
Al-Shakarchy et al. Detecting abnormal movement of driver's head based on spatial-temporal features of video using deep neural network DNN
CN113887272A (en) Violent behavior intelligent safety detection system based on edge calculation
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210119