CN109977773B - Human behavior identification method and system based on multi-target detection 3D CNN - Google Patents

Human behavior identification method and system based on multi-target detection 3D CNN

Info

Publication number
CN109977773B
Authority
CN
China
Prior art keywords
model
image frame
video
data
behavior
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910136442.1A
Other languages
Chinese (zh)
Other versions
CN109977773A (en)
Inventor
董敏
李永发
毕盛
聂宏蓄
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201910136442.1A
Publication of CN109977773A
Application granted
Publication of CN109977773B
Legal status: Expired - Fee Related


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human behavior recognition method and system based on multi-target detection 3D CNN. The method comprises the following steps: 1) preprocessing a video and converting the video stream into image frames; 2) calibrating and cropping the target object in the video using the relatively mature SSD detection technique; 3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data; 4) establishing a feature fusion model and fusing the two features extracted in step 3); 5) classifying with a Softmax regression classifier; 6) fine-tuning the trained model according to the actual application scene or a public data set. The method compensates for the information that current deep neural network models lose when convolving along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition performance of the model, and enables the model to better understand human behavior.

Description

Human behavior identification method and system based on multi-target detection 3D CNN
Technical Field
The invention relates to the technical field of human behavior recognition analysis, in particular to a human behavior recognition method and system based on multi-target detection 3D CNN.
Background
Human behavior recognition refers to recognizing human behaviors or actions in a real environment, and may be applied in various fields. Common application scenarios at present include intelligent surveillance, smart homes, human-computer interaction, and the analysis and prediction of human behavior attributes. However, improving the accuracy and efficiency of recognition remains a very challenging task, and one that receives a great deal of attention from researchers.
In the past decades, the extraction and representation of human behavior features mainly remained at the hand-crafted stage, and manually designing and extracting features often depends on the experience of the designer. Common hand-crafted feature extraction methods include: space-time interest points (STIP), bag of visual words (BoVW), histogram of oriented gradients (HOG), motion history images (MHI), motion energy images (MEI), and so on. Hand-crafted features are usually designed for a specific portion of the data, so the generalization ability of the resulting model is poor, the model cannot be rapidly migrated to other applications, and the labor cost rises sharply. Traditional methods can be said to have entered a bottleneck period.
The application of deep learning to human behavior recognition largely remedies the defects of traditional recognition approaches. Its main advantages are: (1) it avoids the trouble of manual feature design and simplifies feature extraction; (2) the deep neural network has a feedback-driven adjustment capability, which greatly enhances the generalization ability of the model; (3) complex features can be automatically reduced in dimensionality; (4) when processing big data, computational overhead can be greatly reduced and overall execution efficiency improved; (5) it performs better on the recognition and classification of unlabeled data; (6) modality-based behavior recognition is easy to realize: one only needs to design a separate deep learning model to extract features for each modality and then fuse the features of two or more network models, which greatly improves recognition accuracy.
One of the biggest differences between human behavior recognition and image classification or detection is whether information in the time dimension is involved. Human behavior analysis therefore needs to extract not only behavior features in the spatial dimension but also continuous information along the temporal dimension of the behavior; only then can a continuous behavioral action be described correctly.
Disclosure of Invention
The invention aims to overcome the weakness of current deep neural network models in capturing time-dimension information for human behavior recognition. It provides a human behavior recognition method and system based on multi-target detection 3D CNN that compensates for information lost during convolution along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition performance of the model, and enables the model to better understand human behavior.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the human behavior identification method based on multi-target detection 3D CNN comprises the following steps:
1) preprocessing a video, and converting a video stream into image frames;
2) adopting the SSD (Single Shot MultiBox Detector) detection technique to calibrate and crop the target object in the video;
3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data;
4) establishing a feature fusion model, and fusing the two features extracted in the step 3);
5) classifying by using a Softmax regression model classifier;
6) fine-tuning the trained model according to the actual application scene or a public data set, thereby enhancing the generalization and popularization ability of the model.
In step 1), preprocessing a video and converting a video stream into image frames, comprising the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
in the step 2), the SSD detection technology is adopted to perform calibration cutting on the target object in the video, and the method comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set.
In step 3), a feature extraction network structure for the image frame data and the calibrated cropped data is established, specifically comprising the following steps:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
in step 4), a feature fusion model is established to perform feature fusion, and the method comprises the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers.
In step 5), classifying by using a Softmax classifier, comprising the steps of:
5.1) after the feature fusion of step 4) is completed, the features pass through three fully connected layers as the input of a Softmax classifier and are then classified;
5.2) setting a threshold for the early-warning report; the system gives an early-warning prompt once it judges that the recognition score of a certain behavior action has reached the corresponding threshold.
In step 6), the trained model is fine-tuned according to the actual application scene or a public data set to enhance its generalization and popularization ability, comprising the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
Human behavior recognition system based on multi-target detection 3D CNN includes:
the data acquisition module is used for acquiring original video data information of human behavior analysis, wherein the original video data information comprises a public behavior data set and a video data set in an actual scene;
the data preprocessing module is used for preprocessing the original video data: classification and calibration, target detection, cropping, and video frame conversion;
the method for preprocessing the video and converting the video stream into the image frame comprises the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
the method for calibrating and cutting the target object in the video by adopting the SSD detection technology comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set;
the feature extraction module is used for sending the preprocessed data into the constructed 3D CNN network model and extracting, respectively, the behavior feature information of the video stream and the feature information of the calibrated, cropped behavior subject, specifically as follows:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
the feature fusion module is used for fusing the feature information acquired by the feature extraction module and comprises the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers;
the model training module is used for learning and modeling the preprocessed training set to obtain a trained 3D CNN human body behavior recognition model for multi-target detection;
and the human body behavior recognition module is used for classifying and recognizing the behavior actions of the human body by utilizing a multi-target detection 3D CNN human body behavior recognition model.
Further, the data acquisition module acquires video data in actual scenes through monocular and binocular cameras and downloads public human behavior data sets; the data preprocessing module processes the video data with the FFmpeg tool, converts it into image frame sets, and calibrates and crops the video with the SSD detection algorithm, generating the image frame set of step 1.3) and the calibrated image frame set of step 2.3); the feature extraction module adopts the 3D CNN model, takes 16 consecutive frames of data as the model input, and adopts 5 3D convolution layers and 5 max 3D pooling layers; the feature fusion module adopts a 1-layer 3D feature fusion structure to fuse the two kinds of behavior feature information, after which 3 fully connected layers further extract and classify the features; the model training module combines the public human behavior data sets UCF-101 and HMDB51 with actual data sets collected by the system itself to form the training data set; the human behavior recognition module performs classification and recognition with a Softmax classifier.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The video data is converted into an image frame set, and the persons in the video stream are calibrated and cropped with the SSD (Single Shot MultiBox Detector) detection algorithm, so behavior feature information can be extracted globally from the video while local features are extracted for the behavior subject. This compensates for the weakening of global features and enhances the learning ability of the model.
2. The 3D CNN model extracts features from the two preprocessed data sets, overcoming the limitation that a traditional 2D CNN can only extract video features spatially. No separate extraction and fusion of temporal behavior features is needed; the image frame data are simply fed in batches, and the model automatically extracts behavior features in both the time and space dimensions, which greatly reduces the difficulty of feature extraction along the time dimension.
3. The behavior features learned by the model can be used not only for classification and recognition but also for early-warning reports: the model can pre-judge and report special behaviors according to a set early-warning threshold, which broadens the practical application scenes of the model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the 3D convolution operation structure in the present invention.
FIG. 3 is a structural design diagram of a 3D convolutional neural network model in the present invention.
Fig. 4 is a structural diagram of a 3D CNN model based on multi-target detection.
Detailed Description
The present invention will be further described with reference to the following specific examples.
Referring to fig. 1, the human behavior recognition method based on multi-target detection 3D CNN provided in this embodiment includes the following steps:
1) establishing a human behavior recognition data acquisition system and acquiring a human behavior video data set, where public data sets are mainly used for model training and the test data set is captured by a camera in a real environment;
2) converting the acquired video data set into an image frame set, and using the SSD (Single Shot MultiBox Detector) detection algorithm to produce a calibrated, cropped data set;
3) establishing a 3D CNN learning model, respectively learning the data sets, and fusing the learned characteristics;
4) classifying and identifying the fused features by using a Softmax classifier;
5) labeling and recognizing the classified behavior results, or issuing early-warning reports on them;
6) fine-tuning the model according to the specific application scene to enhance its popularization and generalization ability.
In step 2), the video data set acquired in step 1) is preprocessed. Because the model performs fusion recognition over multiple targets, preprocessing is divided into the following two independent processes:
2.1) directly splitting the video data set into frames to establish the first image frame set, comprising the following steps (an illustrative sketch follows these steps):
2.1.1) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
2.1.2) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
2.1.3) splitting the image frame set obtained in step 2.1.2) by a cross-validation method, for training the model.
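The conversion script of step 2.1.2) is not reproduced in the patent; the following is a minimal sketch of how it might look, assuming FFmpeg is available (the tool this embodiment names later) and an illustrative frame rate of 25 fps, with the label-per-folder layout of step 2.1.1):

```python
import os
import subprocess

def video_to_frames(video_path, out_dir, fps=25):
    """Split one video into JPEG frames via FFmpeg (fps is an assumption)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(out_dir, "frame_%05d.jpg")],
        check=True)

def convert_dataset(dataset_root, frames_root):
    """Walk the behavior-label folder layout of step 2.1.1) and convert each video."""
    for label in os.listdir(dataset_root):
        for video in os.listdir(os.path.join(dataset_root, label)):
            name = os.path.splitext(video)[0]
            video_to_frames(os.path.join(dataset_root, label, video),
                            os.path.join(frames_root, label, name))
```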
2.2) using the SSD (Single Shot MultiBox Detector) algorithm to detect the subject of the behavior action, extract targeted action features, and establish the second image frame set, comprising the following steps (an illustrative sketch follows these steps):
2.2.1) loading the trained SSD detection model;
2.2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 2.1.3), then converting all videos and storing them as the calibrated image frame set.
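As an illustration of steps 2.2.1) to 2.2.3), the sketch below loads a pre-trained SSD through OpenCV's dnn module and crops each frame to the highest-confidence detection at half the original frame size; the model file names and the 0.5 confidence cut-off are assumptions, not values fixed by the patent:

```python
import cv2
import numpy as np

# Hypothetical model files; any OpenCV-loadable SSD would serve.
net = cv2.dnn.readNetFromCaffe("ssd_deploy.prototxt", "ssd.caffemodel")

def calibrate_and_crop(frame, conf_threshold=0.5):
    """Detect the behavior subject, crop to its box, resize to half the frame size."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    det = net.forward()                      # SSD output shape: [1, 1, N, 7]
    best = int(np.argmax(det[0, 0, :, 2]))   # index of highest-confidence box
    if det[0, 0, best, 2] < conf_threshold:
        return None                          # no subject found in this frame
    x1, y1, x2, y2 = (det[0, 0, best, 3:7] * np.array([w, h, w, h])).astype(int)
    crop = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(crop, (w // 2, h // 2))
```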
Referring to fig. 2, a schematic structural diagram of extracting behavior characteristics for performing convolution operation on the 3D CNN model designed in the present invention is shown. The 3D CNN can extract behavior feature information from two dimensions, namely space and time, and as can be seen from fig. 2, the time dimension for performing convolution operation is N, that is, the convolution operation is performed on consecutive N frames of images. The 3D convolution in the figure is performed by stacking N successive image frames into a cube and then applying a 3D convolution kernel to the cube. In this configuration, each feature map in the convolutional layer is connected to a number of adjacent consecutive frames in the previous layer, thus capturing motion information.
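As a shape-level illustration of this operation, the sketch below applies one 3D convolution to a 16-frame cube; the 3x3x3 kernel and the 112x112 resolution are illustrative assumptions (the patent fixes only the 16-frame input):

```python
import numpy as np
from tensorflow.keras.layers import Conv3D

clip = np.random.rand(1, 16, 112, 112, 3).astype("float32")  # (batch, frames, H, W, C)
conv = Conv3D(64, kernel_size=(3, 3, 3), padding="same", activation="relu")
out = conv(clip)
print(out.shape)  # (1, 16, 112, 112, 64): the time dimension is preserved
```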
Referring to fig. 3, in step 3), a 3D CNN model is built, and feature learning is performed, including the following steps:
3.1) building one 3D convolutional neural network model based on the image frame set of step 2.1.3) and another based on the calibrated image frame set of step 2.2.3). With 16 consecutive frames of data as the input of each model, 5 3D convolution layers (with 64, 128, 256 and 512 convolution kernels in order), 5 max 3D pooling layers and a fully connected layer (of width 2048) are adopted, and the obtained features are used as the input of the model fusion layer, as shown in fig. 4. This includes the following steps:
3.1.1) respectively obtaining the 3D convolution characteristics extracted by the two models, and carrying out Flatten () operation on the obtained characteristics as the input of a fusion layer;
3.1.2) completing the fusion of the intermediate features as the input of the full connection layer.
3.2) to prevent overfitting during training, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers.
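A minimal Keras sketch of this two-stream architecture follows. The patent fixes the 16-frame input, the 5 convolution layers (64, 128, 256 and 512 kernels are listed), the 5 pooling layers, a 2048-wide per-stream fully connected layer, one fusion layer, three fully connected layers, L2 regularization and dropout (0.5); the 3x3x3 kernels, the two input resolutions, the fifth convolution width, the post-fusion widths and the 101 output classes are assumptions made for illustration:

```python
from tensorflow.keras import Model, Input, regularizers
from tensorflow.keras.layers import (Conv3D, MaxPooling3D, Flatten,
                                     Concatenate, Dense, Dropout)

def stream(shape):
    """One branch: 5 L2-regularized 3D conv layers, 5 max-pool layers,
    then the per-stream 2048-wide fully connected layer of step 3.1)."""
    inp = Input(shape=shape)
    x = inp
    for filters in (64, 128, 256, 512, 512):   # 5th width is an assumption
        x = Conv3D(filters, (3, 3, 3), padding="same", activation="relu",
                   kernel_regularizer=regularizers.l2(1e-4))(x)
        x = MaxPooling3D((2, 2, 2), padding="same")(x)
    return inp, Dense(2048, activation="relu")(Flatten()(x))

full_in, full_feat = stream((16, 112, 112, 3))  # whole-frame stream
crop_in, crop_feat = stream((16, 56, 56, 3))    # SSD-calibrated stream

x = Concatenate()([full_feat, crop_feat])       # the 1 fusion layer
for units in (4096, 4096, 2048):                # 3 FC layers (widths assumed)
    x = Dropout(0.5)(Dense(units, activation="relu")(x))
out = Dense(101, activation="softmax")(x)       # e.g. 101 UCF-101 classes

model = Model([full_in, crop_in], out)
```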
Referring to fig. 4, in step 4), classification and identification are performed on the features fused in step 3.1) by using a Softmax classifier, and the method comprises the following steps:
4.1) after the feature fusion is completed, the features pass through three fully connected layers as the input of a Softmax classifier and are then classified;
4.2) setting a threshold for the early-warning report; the system gives an early-warning prompt once it judges that the recognition score of a certain behavior action has reached the corresponding threshold.
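The early-warning check of step 4.2) reduces to comparing the classifier's softmax output against per-behavior thresholds; a sketch follows, in which the behavior labels, threshold values and alert action are illustrative assumptions:

```python
import numpy as np

ALERT_THRESHOLDS = {"fall_down": 0.80, "fight": 0.90}  # hypothetical labels

def check_alert(probs, class_names):
    """Emit an early-warning prompt when a flagged behavior's score
    reaches its threshold (step 4.2)."""
    i = int(np.argmax(probs))
    label, score = class_names[i], float(probs[i])
    # Behaviors without a configured threshold never trigger an alert.
    if score >= ALERT_THRESHOLDS.get(label, float("inf")):
        print(f"EARLY WARNING: {label} recognized with score {score:.2f}")
    return label, score
```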
In step 6), the model is fine-tuned according to the specific application scene to enhance its popularization and generalization ability, comprising the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
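Steps 6.1) to 6.3) amount to standard transfer learning; a sketch under the assumption of the Keras `model` built above, with `n_new_classes` and `new_train_data` supplied by the new scene, is:

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Dense

# 6.1) freeze the convolution and pooling layer parameters
for layer in model.layers:
    if isinstance(layer, (Conv3D, MaxPooling3D)):
        layer.trainable = False

# 6.2) replace the output layer for the new scene's behavior classes
new_out = Dense(n_new_classes, activation="softmax")(model.layers[-2].output)
finetuned = Model(model.input, new_out)

# 6.3) retrain only the fully connected parameters on the new data set
finetuned.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
finetuned.fit(new_train_data, epochs=5)  # epoch count is an assumption
```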
The following is the human behavior recognition system based on multi-target detection 3D CNN provided in this embodiment, including:
a data acquisition module: the method is used for collecting original video data information of human body behavior analysis, and the original video data information comprises a public behavior data set and a video data set in an actual scene. In this embodiment, a monocular camera and a binocular camera are used to capture video data in an actual scene and download a public human behavior data set as a total data set to be captured.
A data preprocessing module: used for preprocessing the original video data, including classification and calibration, target detection, cropping, and video frame conversion. In this embodiment, the FFmpeg tool is used to process the video data and convert it into image frame sets, and the SSD (Single Shot MultiBox Detector) detection algorithm is used to calibrate and crop the video, generating two image frame sets as follows:
2.1) directly splitting the video data set into frames to establish the first image frame set, comprising the following steps:
2.1.1) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
2.1.2) preprocessing a video data set, and converting all videos into corresponding image frame sets by using an FFmpeg tool;
2.1.3) splitting the image frame set obtained in step 2.1.2) by a cross-validation method, for training the model.
2.2) using the SSD (Single Shot MultiBox Detector) algorithm to detect the subject of the behavior action, extract targeted action features, and establish the second image frame set, comprising the following steps:
2.2.1) loading the trained SSD detection model;
2.2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 2.1.3), then converting all videos and storing them as the calibrated image frame set.
A feature extraction module: used for sending the preprocessed data into the constructed 3D CNN network model and extracting, respectively, the behavior feature information of the video stream and the feature information of the calibrated, cropped behavior subject. In the present embodiment, a 3D CNN model is used: 16 consecutive frames of data serve as the input of the model, and 5 3D convolution layers and 5 max 3D pooling layers extract the two kinds of feature information as the input of the feature fusion module.
A feature fusion module: used for fusing the feature information acquired by the feature extraction module. In this embodiment, a 1-layer 3D feature fusion structure is adopted to fuse the two kinds of behavior feature information, and 3 fully connected layers further extract and classify the features.
A model training module: used for learning and modeling the preprocessed training set to obtain the trained multi-target detection 3D CNN human behavior recognition model. In this embodiment, the training data set is formed by combining public human behavior data sets such as UCF-101 and HMDB51 with actual data sets collected by ourselves.
A human behavior recognition module: used for classifying and recognizing human behavior actions with the multi-target detection 3D CNN human behavior recognition model. In the present embodiment, classification and recognition are performed by a Softmax classifier.
In the above embodiments, the included modules are divided only according to functional logic; the division is not limited to the above, as long as the corresponding functions can be implemented, and is not intended to limit the scope of the present invention.
In conclusion, the human behavior recognition method and system based on multi-target detection 3D CNN provided by the invention not only remedy the deficiency of 2D neural networks in extracting features along the time dimension; they also adopt a multi-target detection approach, introducing the SSD (Single Shot MultiBox Detector) target detection algorithm to calibrate the behavior subject in the video stream, acquiring finer local features that are fused into the model to compensate for the weakening of its global features. Meanwhile, the behavior features learned by the model can be used both for classification and recognition and for early-warning reports: the model can pre-judge and report special behaviors according to a set early-warning threshold, which broadens its practical application scenes. The model can also be migrated to Internet-of-Things platforms such as smart homes, intelligent monitoring and intelligent anti-theft, and thus has broad research and practical value and strong potential for wider adoption.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention should be covered within its protection scope.

Claims (5)

1. The human behavior identification method based on multi-target detection 3D CNN is characterized by comprising the following steps:
1) preprocessing the video and converting the video stream into image frames, comprising the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
2) calibrating and cropping the target object in the video by the SSD detection technique, comprising the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set;
3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data, comprising the following steps:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
4) establishing a feature fusion model, fusing the two features extracted in the step 3), and comprising the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers;
5) classifying by using a Softmax regression model classifier;
6) fine-tuning the trained model according to the actual application scene or a public data set, thereby enhancing the generalization and popularization ability of the model.
2. The human behavior recognition method based on multi-target detection 3D CNN of claim 1, wherein in step 5), classification is performed by using a Softmax classifier, and the method comprises the following steps:
5.1) after the feature fusion of step 4) is completed, the features pass through three fully connected layers as the input of a Softmax classifier and are then classified;
5.2) setting a threshold for the early-warning report; the system gives an early-warning prompt once it judges that the recognition score of a certain behavior action has reached the corresponding threshold.
3. The human behavior recognition method based on multi-target detection 3D CNN of claim 1, wherein in step 6), the trained model is fine-tuned according to an actual application scene or a public data set to enhance the generalization and popularization ability of the model, comprising the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
4. A human behavior recognition system based on multi-target detection 3D CNN, characterized by comprising:
the data acquisition module is used for acquiring original video data information of human behavior analysis, wherein the original video data information comprises a public behavior data set and a video data set in an actual scene;
the data preprocessing module is used for preprocessing the original video data: classification and calibration, target detection, cropping, and video frame conversion;
wherein preprocessing the video and converting the video stream into image frames comprises the following steps:
1.1) acquiring a video data set, where public data sets are mainly used for training the model and the test data set is captured by a camera in a real environment;
1.2) archiving the video data set: video data with the same action behavior are filed under the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video data set by converting all videos into corresponding image frame sets through a video conversion script;
1.4) splitting the image frame set obtained in step 1.3) by a cross-validation method, for training the model;
and calibrating and cropping the target object in the video by the SSD detection technique comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the cropping size of the calibration data to half the size of each frame in the image frame set of step 1.3), then converting all videos and storing them as the calibrated image frame set;
the feature extraction module is used for sending the preprocessed data into the constructed 3D CNN network model and extracting, respectively, the behavior feature information of the video stream and the feature information of the calibrated, cropped behavior subject, specifically as follows:
firstly, a 3D convolutional neural network model is built for the image frame set of step 1.3) and another for the calibrated image frame set of step 2.3); then, taking 16 consecutive frames of data as the input of each model, 5 3D convolution layers, 5 max 3D pooling layers, 1 feature fusion layer and 3 fully connected layers are adopted; to prevent the model from overfitting, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
the feature fusion module is used for fusing the feature information acquired by the feature extraction module and comprises the following steps:
4.1) obtaining, respectively, the 3D convolution features extracted by the 3D convolutional neural network model of the image frame set of step 1.3) and by that of the calibrated image frame set of step 2.3), and applying the Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features, which serve as the input of the fully connected layers;
the model training module is used for learning and modeling the preprocessed training set to obtain a trained 3D CNN human body behavior recognition model for multi-target detection;
and the human body behavior recognition module is used for classifying and recognizing the behavior actions of the human body by utilizing a multi-target detection 3D CNN human body behavior recognition model.
5. The human behavior recognition system based on multi-target detection 3D CNN of claim 4, wherein: the data acquisition module acquires video data in actual scenes through monocular and binocular cameras and downloads public human behavior data sets; the data preprocessing module processes the video data with the FFmpeg tool, converts it into image frame sets, and calibrates and crops the video with the SSD detection algorithm, generating the image frame set of step 1.3) and the calibrated image frame set of step 2.3); the feature extraction module adopts the 3D CNN model, takes 16 consecutive frames of data as the model input, and adopts 5 3D convolution layers and 5 max 3D pooling layers; the feature fusion module adopts a 1-layer 3D feature fusion structure to fuse the two kinds of behavior feature information, after which 3 fully connected layers further extract and classify the features; the model training module combines the public human behavior data sets UCF-101 and HMDB51 with actual data sets collected by itself to form the training data set; the human behavior recognition module performs classification and recognition with a Softmax classifier.
CN201910136442.1A 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN Expired - Fee Related CN109977773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910136442.1A CN109977773B (en) 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910136442.1A CN109977773B (en) 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN

Publications (2)

Publication Number Publication Date
CN109977773A CN109977773A (en) 2019-07-05
CN109977773B true CN109977773B (en) 2021-01-19

Family

ID=67077264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910136442.1A Expired - Fee Related CN109977773B (en) 2019-02-18 2019-02-18 Human behavior identification method and system based on multi-target detection 3D CNN

Country Status (1)

Country Link
CN (1) CN109977773B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420B (en) 2019-07-18 2022-03-18 腾讯科技(深圳)有限公司 Sign language recognition method and device, computer readable storage medium and computer equipment
CN110414415A (en) * 2019-07-24 2019-11-05 北京理工大学 Human bodys' response method towards classroom scene
CN110414421B (en) * 2019-07-25 2023-04-07 电子科技大学 Behavior identification method based on continuous frame images
CN110532909B (en) * 2019-08-16 2023-04-14 成都电科慧安科技有限公司 Human behavior identification method based on three-dimensional UWB positioning
CN111259838B (en) * 2020-01-20 2023-02-03 山东大学 Method and system for deeply understanding human body behaviors in service robot service environment
CN111382677B (en) * 2020-02-25 2023-06-20 华南理工大学 Human behavior recognition method and system based on 3D attention residual error model
CN113536847A (en) * 2020-04-17 2021-10-22 天津职业技术师范大学(中国职业培训指导教师进修中心) Industrial scene video analysis system and method based on deep learning
CN113515986A (en) * 2020-07-02 2021-10-19 阿里巴巴集团控股有限公司 Video processing method, data processing method and equipment
CN112016461B (en) * 2020-08-28 2024-06-11 深圳市信义科技有限公司 Multi-target behavior recognition method and system
CN112232190B (en) * 2020-10-15 2023-04-18 南京邮电大学 Method for detecting abnormal behaviors of old people facing home scene
CN112613428B (en) * 2020-12-28 2024-03-22 易采天成(郑州)信息技术有限公司 Resnet-3D convolution cattle video target detection method based on balance loss
CN112766151B (en) * 2021-01-19 2022-07-12 北京深睿博联科技有限责任公司 Binocular target detection method and system for blind guiding glasses
CN113052059A (en) * 2021-03-22 2021-06-29 中国石油大学(华东) Real-time action recognition method based on space-time feature fusion
CN113221658A (en) * 2021-04-13 2021-08-06 卓尔智联(武汉)研究院有限公司 Training method and device of image processing model, electronic equipment and storage medium
CN113420703B (en) * 2021-07-03 2023-04-18 西北工业大学 Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN115601714B (en) * 2022-12-16 2023-03-10 广东汇通信息科技股份有限公司 Campus violent behavior identification method based on multi-modal data analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method
CN108108652A (en) * 2017-03-29 2018-06-01 广东工业大学 A kind of across visual angle Human bodys' response method and device based on dictionary learning
CN108647591A (en) * 2018-04-25 2018-10-12 长沙学院 Activity recognition method and system in a kind of video of view-based access control model-semantic feature
CN108985173A (en) * 2018-06-19 2018-12-11 奕通信息科技(上海)股份有限公司 Towards the depth network migration learning method for having the label apparent age data library of noise
CN109002808A (en) * 2018-07-27 2018-12-14 高新兴科技集团股份有限公司 A kind of Human bodys' response method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification


Also Published As

Publication number Publication date
CN109977773A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN109977773B (en) Human behavior identification method and system based on multi-target detection 3D CNN
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN110363140B (en) Human body action real-time identification method based on infrared image
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
Xie et al. Detecting trees in street images via deep learning with attention module
CN109670405B (en) Complex background pedestrian detection method based on deep learning
Chen et al. An improved Yolov3 based on dual path network for cherry tomatoes detection
CN111582122B (en) System and method for intelligently analyzing behaviors of multi-dimensional pedestrians in surveillance video
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN111382677A (en) Human behavior identification method and system based on 3D attention residual error model
CN108875555B (en) Video interest area and salient object extracting and positioning system based on neural network
CN105590099A (en) Multi-user behavior identification method based on improved convolutional neural network
CN113705445B (en) Method and equipment for recognizing human body posture based on event camera
CN110929685A (en) Pedestrian detection network structure based on mixed feature pyramid and mixed expansion convolution
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
CN113516102A (en) Deep learning parabolic behavior detection method based on video
CN113255464A (en) Airplane action recognition method and system
Tsutsui et al. Distantly supervised road segmentation
CN103500456A (en) Object tracking method and equipment based on dynamic Bayes model network
CN114359578A (en) Application method and system of pest and disease damage identification intelligent terminal
Al-Shakarchy et al. Detecting abnormal movement of driver's head based on spatial-temporal features of video using deep neural network DNN
CN113887272A (en) Violent behavior intelligent safety detection system based on edge calculation
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210119