CN109977773B - Human behavior identification method and system based on multi-target detection 3D CNN - Google Patents
- Publication number
- Publication number: CN109977773B (application CN201910136442.1A)
- Authority
- CN
- China
- Prior art keywords
- model
- image frame
- video
- data
- behavior
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a human behavior recognition method and system based on multi-target detection 3D CNN. The method comprises the following steps: 1) preprocessing a video and converting the video stream into image frames; 2) calibrating and cropping the target object in the video using the currently mature SSD detection technique; 3) establishing a feature extraction network structure for the image frame data and the calibrated crop data; 4) establishing a feature fusion model and fusing the two kinds of features extracted in step 3); 5) classifying with a Softmax regression classifier; 6) fine-tuning the trained model for the actual application scenario or a public data set. The method compensates for the information lost by current deep neural network models when convolving along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition efficiency of the model, and enables the model to better understand human behavior.
Description
Technical Field
The invention relates to the technical field of human behavior recognition analysis, in particular to a human behavior recognition method and system based on multi-target detection 3D CNN.
Background
Human behavior recognition refers to recognizing human behaviors or actions in a real environment, and can be applied in many fields. Common application scenarios currently include intelligent surveillance, smart homes, human-computer interaction, and the analysis and prediction of human behavioral attributes. However, improving the accuracy and efficiency of recognition remains a very challenging task, and one that receives wide attention from researchers.
In the past decades, the extraction and representation of human behavior features mainly remained at the manual stage, and hand-crafted features often depend on the experience of their designers. Common hand-crafted feature extraction methods include: spatio-temporal interest points (STIP), bag of visual words (BoVW), histograms of oriented gradients (HOG), motion history images (MHI), motion energy images (MEI), etc. Hand-crafted features are usually designed only for a specific portion of the data, so the generalization ability of the resulting model is poor, the model cannot be migrated quickly to other applications, and labor costs rise sharply. Traditional methods can be said to have entered a bottleneck period.
The application of deep learning to human behavior recognition largely remedies the defects of traditional recognition approaches. Its main characteristics are: (1) it avoids the trouble of manual feature extraction and simplifies the feature extraction process; (2) the deep neural network has a certain feedback-adjustment capability, which greatly enhances the generalization ability of the model; (3) it can automatically reduce the dimensionality of complex features; (4) when processing big data, it can greatly reduce computational overhead and improve overall execution efficiency; (5) it performs better on the recognition and classification of unlabeled data; (6) modality-based behavior recognition is easy to realize: one only needs to design a separate deep learning model to extract features for each modality and then fuse the features of two or more network models, which greatly improves recognition accuracy.
One of the biggest differences between human behavior recognition and image classification or detection is whether information in the time dimension is involved. Analysis for human behavior recognition therefore needs to extract not only behavior features in the spatial dimension but also continuous information along the temporal dimension of the behavior; only then can a continuous behavioral action be described correctly.
Disclosure of Invention
The invention aims to overcome the shortcoming of current deep neural network models in capturing time-dimension information for human behavior recognition. It provides a human behavior recognition method and system based on multi-target detection 3D CNN, which compensates for the information lost when convolving along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition efficiency of the model, and enables the model to better understand human behavior.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the human behavior identification method based on multi-target detection 3D CNN comprises the following steps:
1) preprocessing a video, and converting a video stream into image frames;
2) adopting the SSD (Single Shot MultiBox Detector) detection technique to calibrate and crop the target object in the video;
3) establishing a feature extraction network structure of image frame data and calibration cutting data;
4) establishing a feature fusion model, and fusing the two features extracted in the step 3);
5) classifying by using a Softmax regression model classifier;
6) and according to the actual application scene or the public data set, the trained model is finely adjusted, so that the generalization and popularization capability of the model is enhanced.
In step 1), preprocessing a video and converting a video stream into image frames, comprising the following steps:
1.1) acquiring a video data set, wherein a public data set is mainly adopted for training a model, and a test data set is acquired by a camera in a real environment;
1.2) carrying out filing operation on the video data set, filing the video data with the same action behavior under the same folder, and naming the folder by using the behavior tag of the folder;
1.3) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
1.4) cutting and dividing the image frame set obtained in the step 1.3) by adopting a cross verification method for training a model;
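As a minimal sketch of steps 1.2)–1.4), the filing and cross-validation split can be illustrated in Python with the standard library only (clip names below are hypothetical; the actual video-to-frame conversion of step 1.3) would use an external tool such as FFmpeg, which the embodiment mentions, e.g. `ffmpeg -i clip.avi frames/%04d.jpg`):

```python
import random

def kfold_split(clips, k=5, seed=0):
    """Shuffle the clip list and split it into k folds (step 1.4)."""
    clips = list(clips)
    random.Random(seed).shuffle(clips)
    return [clips[i::k] for i in range(k)]

def train_val_pairs(folds):
    """Yield (train, val) pairs, holding out one fold at a time."""
    for i, val in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, val

# Hypothetical clips filed under a folder named after their behavior tag.
clips = [f"walking/clip_{i:03d}" for i in range(10)]
folds = kfold_split(clips, k=5)
for train, val in train_val_pairs(folds):
    # every clip appears exactly once across train + val
    assert sorted(train + val) == sorted(clips)
```
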
in the step 2), the SSD detection technology is adopted to perform calibration cutting on the target object in the video, and the method comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the crop size of the calibration data to half the size of each frame in the image frame set of step 1.3), converting all videos and saving them as the calibrated image frame set.
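A hedged sketch of the cropping rule in step 2.3): the SSD model itself is assumed to be already loaded and to return a person bounding box as `(x1, y1, x2, y2)` in pixels; only the half-size crop centered on that box is shown.

```python
import numpy as np

def crop_around_box(frame, box, scale=0.5):
    """Crop a window of `scale` times the frame size, centered on the
    detected box (step 2.3); the crop is clamped to stay inside the frame."""
    h, w = frame.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    cx = (box[0] + box[2]) // 2
    cy = (box[1] + box[3]) // 2
    x1 = min(max(cx - cw // 2, 0), w - cw)
    y1 = min(max(cy - ch // 2, 0), h - ch)
    return frame[y1:y1 + ch, x1:x1 + cw]

frame = np.zeros((240, 320, 3), dtype=np.uint8)   # one 240x320 video frame
crop = crop_around_box(frame, (100, 80, 180, 200))
assert crop.shape == (120, 160, 3)                # half of each dimension
```
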
In step 3), a feature extraction network structure of the image frame data and the calibration cutting data is established, which specifically comprises the following steps:
firstly, a 3D convolutional neural network model based on the image frame set of step 1.3) and a 3D convolutional neural network model based on the calibrated image frame set of step 2.3) are built respectively; then, taking 16 consecutive frames of data as the model input, five 3D convolution layers, five 3D max-pooling layers, one feature fusion layer, and three fully connected layers are applied; to prevent the model from overfitting, L2 regularization is applied to the five convolutional layers and dropout (0.5) is added to the fully connected layers;
in step 4), a feature fusion model is established to perform feature fusion, and the method comprises the following steps:
4.1) respectively obtaining 3D convolution characteristics extracted based on the 3D convolution neural network model of the image frame set in the step 1.3) and the 3D convolution neural network model of the image frame set calibrated in the step 2.3), and carrying out Flatten () operation on the obtained characteristics to be used as the input of a fusion layer;
4.2) completing the fusion of the intermediate features as the input of the full connection layer.
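The Flatten-and-concatenate fusion of steps 4.1)–4.2) can be sketched in numpy (the feature-map shapes below are assumed for illustration, not taken from the patent):

```python
import numpy as np

# Hypothetical feature maps from the two 3D CNN branches after the last
# pooling layer, laid out as (depth, height, width, channels).
feat_global = np.ones((2, 4, 4, 512))   # full-frame branch
feat_local  = np.ones((2, 4, 4, 512))   # SSD-cropped branch

# Flatten each branch, then concatenate: this vector is the fusion
# layer's output and the input of the fully connected layers.
fused = np.concatenate([feat_global.ravel(), feat_local.ravel()])
assert fused.shape == (2 * 2 * 4 * 4 * 512,)
```
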
In step 5), classifying by using a Softmax classifier, comprising the steps of:
5.1) after the feature fusion of step 4) is completed, the result passes through three fully connected layers as the input of the Softmax classifier and is then classified;
5.2) a threshold for the early-warning report is set; when the system judges that the recognition confidence of a certain behavior action reaches the corresponding threshold, it gives an early-warning prompt.
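Steps 5.1)–5.2) can be illustrated with a plain numpy Softmax followed by a threshold check; the class labels, logit values, and threshold values below are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical scores from the last fully connected layer.
labels = ["walk", "run", "fall", "fight"]
probs = softmax(np.array([1.0, 0.5, 3.2, 0.1]))
pred = labels[int(np.argmax(probs))]

# Step 5.2): emit an early-warning prompt when a flagged behavior's
# recognition confidence reaches its (assumed) threshold.
ALERT_THRESHOLD = {"fall": 0.8, "fight": 0.8}
alerts = [c for c, th in ALERT_THRESHOLD.items()
          if probs[labels.index(c)] >= th]
assert abs(float(probs.sum()) - 1.0) < 1e-9   # Softmax output sums to 1
```
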
In step 6), according to an actual application scenario or a public data set, the trained model is finely adjusted, and generalization and popularization capabilities of the model are enhanced, including the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
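The freeze-and-retrain procedure of steps 6.1)–6.3) can be sketched without any particular framework; the layer names below are hypothetical, and in a framework such as Keras or PyTorch freezing corresponds to setting a layer's `trainable` flag or `requires_grad` to False.

```python
# Hypothetical layer table: layer name -> updated during fine-tuning?
trainable = {
    "conv3d_1": True, "pool3d_1": True,
    "conv3d_2": True, "pool3d_2": True,
    "fusion":   True,
    "fc_1": True, "fc_2": True, "fc_3": True,
}

# Step 6.1): freeze all convolution and pooling layer parameters.
for name in trainable:
    if name.startswith(("conv", "pool")):
        trainable[name] = False

# Step 6.3): only fusion/fully-connected parameters are retrained
# on the data set of the new scene.
to_retrain = [n for n, t in trainable.items() if t]
assert to_retrain == ["fusion", "fc_1", "fc_2", "fc_3"]
```
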
Human behavior recognition system based on multi-target detection 3D CNN includes:
the data acquisition module is used for acquiring original video data information of human behavior analysis, wherein the original video data information comprises a public behavior data set and a video data set in an actual scene;
the data preprocessing module is used for preprocessing the original video data, classifying and calibrating, detecting a target, cutting and converting a video frame;
the method for preprocessing the video and converting the video stream into the image frame comprises the following steps:
1.1) acquiring a video data set, wherein a public data set is mainly adopted for training a model, and a test data set is acquired by a camera in a real environment;
1.2) carrying out filing operation on the video data set, filing the video data with the same action behavior under the same folder, and naming the folder by using the behavior tag of the folder;
1.3) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
1.4) cutting and dividing the image frame set obtained in the step 1.3) by adopting a cross verification method for training a model;
the method for calibrating and cutting the target object in the video by adopting the SSD detection technology comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the crop size of the calibration data to half the size of each frame in the image frame set of step 1.3), converting all videos and saving them as the calibrated image frame set;
the feature extraction module is used for sending the preprocessed data into the constructed 3D CNN network model, and respectively extracting the behavior feature information of the video stream and the behavior main body feature information of the calibration cutting, and the feature extraction module specifically comprises the following steps:
firstly, a 3D convolutional neural network model based on the image frame set of step 1.3) and a 3D convolutional neural network model based on the calibrated image frame set of step 2.3) are built respectively; then, taking 16 consecutive frames of data as the model input, five 3D convolution layers, five 3D max-pooling layers, one feature fusion layer, and three fully connected layers are applied; to prevent the model from overfitting, L2 regularization is applied to the five convolutional layers and dropout (0.5) is added to the fully connected layers;
the feature fusion module is used for fusing the feature information acquired by the feature extraction module and comprises the following steps:
4.1) respectively obtaining 3D convolution characteristics extracted based on the 3D convolution neural network model of the image frame set in the step 1.3) and the 3D convolution neural network model of the image frame set calibrated in the step 2.3), and carrying out Flatten () operation on the obtained characteristics to be used as the input of a fusion layer;
4.2) completing the fusion of the intermediate features as the input of the full connection layer;
the model training module is used for learning and modeling the preprocessed training set to obtain a trained 3D CNN human body behavior recognition model for multi-target detection;
and the human body behavior recognition module is used for classifying and recognizing the behavior actions of the human body by utilizing a multi-target detection 3D CNN human body behavior recognition model.
Further, the data acquisition module acquires video data in an actual scene through a monocular camera and a binocular camera and downloads public human behavior data sets. The data preprocessing module processes the video data with the FFmpeg tool to convert it into an image frame set, and uses the SSD detection algorithm to calibrate and crop the video, generating the image frame set of step 1.3) and the calibrated image frame set of step 2.3). The feature extraction module adopts the 3D CNN model, takes 16 consecutive frames of data as input, and applies five 3D convolution layers and five 3D max-pooling layers. The feature fusion module adopts a single 3D feature fusion layer to fuse the two kinds of behavior feature information, and three fully connected layers further extract and classify the features. The model training module combines the public human behavior data sets UCF-101 and HMDB51 with actual data sets collected in-house to form the training data set. The human behavior recognition module performs classification and recognition with a Softmax classifier.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The video data is converted into an image frame set, and persons in the video stream are calibrated and cropped with the SSD (Single Shot MultiBox Detector) detection algorithm. Behavior feature information in the video can thus be extracted globally while local features are extracted for the behavior subject, overcoming the weakening of global features and enhancing the model's learning ability.
2. The 3D CNN model extracts features from the two preprocessed data sets, overcoming the limitation that a traditional 2D CNN can only extract video features spatially. No separate extraction and fusion of the temporal features of a behavior is required; the image frame data is simply fed in batches, and the model automatically extracts behavior features along both the time and space dimensions, greatly reducing the difficulty of feature extraction in the time dimension.
3. The behavior characteristics learned by the model can be used for classification and identification and can also be used as the function of an early warning report, the model can pre-judge and report special behaviors according to a set early warning threshold value, and the scenes of the model in practical application are increased.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the 3D convolution operation structure in the present invention.
FIG. 3 is a structural design diagram of a 3D convolutional neural network model in the present invention.
Fig. 4 is a structural diagram of a 3D CNN model based on multi-target detection.
Detailed Description
The present invention will be further described with reference to the following specific examples.
Referring to fig. 1, the human behavior recognition method based on multi-target detection 3D CNN provided in this embodiment includes the following steps:
1) establishing a human behavior recognition data acquisition system, and acquiring a human behavior video data set, wherein a public data set is mainly used for model training, and a test data set is acquired by a camera in a real environment;
2) converting the acquired video data sets, respectively, into an image frame set and into a data set calibrated and cropped by the SSD (Single Shot MultiBox Detector) detection algorithm;
3) establishing a 3D CNN learning model, respectively learning the data sets, and fusing the learned characteristics;
4) classifying and identifying the fused features by using a Softmax classifier;
5) classifying, calibrating and identifying or early warning reports for classified results and behaviors;
6) and the model is finely adjusted according to a specific application scene, so that the popularization and generalization capability of the model are enhanced.
In step 2), the video data set acquired in step 1) is preprocessed. Since the model performs fusion recognition over multiple targets, the preprocessing is divided into the following two independent processes:
2.1) directly performing frame cropping on a video data set to establish a first image frame set, and the method comprises the following steps:
2.1.1) carrying out filing operation on the video data set, filing the video data with the same action behavior under the same folder, and naming the folder by using a behavior tag of the folder;
2.1.2) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
2.1.3) cutting and dividing the image frame set obtained in the step 2.1.2) by adopting a cross-validation method for training a model.
2.2) using the SSD (Single Shot MultiBox Detector) algorithm to detect the subject of a behavior action, extract targeted action features, and establish a second image frame set, comprising the following steps:
2.2.1) loading the trained SSD detection model;
2.2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.2.3) setting the crop size of the calibration data to half the size of each frame in the image frame set of 2.1.3), converting all videos and saving them as the calibrated image frame set.
Referring to fig. 2, a schematic diagram of the convolution operation by which the 3D CNN model designed in the present invention extracts behavior features is shown. The 3D CNN extracts behavior feature information from two dimensions, space and time; as can be seen from fig. 2, the temporal extent of the convolution operation is N, that is, the convolution is performed over N consecutive image frames. The 3D convolution in the figure is performed by stacking N successive image frames into a cube and then applying a 3D convolution kernel to that cube. In this configuration, each feature map in the convolutional layer is connected to several adjacent consecutive frames in the previous layer, thereby capturing motion information.
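The stacking-and-convolving operation of fig. 2 can be sketched as a naive single-channel "valid" 3D convolution in numpy; the shapes are chosen for illustration, and a real model would of course use an optimized framework implementation.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """'Valid' 3D convolution of a clip (T, H, W) with a kernel
    (kt, kh, kw): one channel, stride 1, no padding."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

clip = np.ones((16, 8, 8))          # 16 consecutive frames, as in the model
kernel = np.ones((3, 3, 3)) / 27.0  # averaging kernel spanning 3 frames
out = conv3d_valid(clip, kernel)
assert out.shape == (14, 6, 6)      # temporal extent shrinks by kt - 1
```

Because the kernel spans three consecutive frames, each output value mixes information across time as well as space, which is exactly the motion-capturing property the description attributes to the 3D convolution.
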
Referring to fig. 3, in step 3), a 3D CNN model is built, and feature learning is performed, including the following steps:
3.1) respectively building a 3D convolutional neural network model based on the image frame set of 2.1.3) and a 3D convolutional neural network model based on the calibrated image frame set of 2.2.3). Taking 16 consecutive frames of data as the model input, each branch applies five 3D convolution layers (with 64, 128, 256, 512 convolution kernels in order), five 3D max-pooling layers, and one fully connected layer (2048 units), and the obtained features are used as the input of the model's fusion layer, as shown in fig. 4. This comprises the following steps:
3.1.1) respectively obtaining the 3D convolution characteristics extracted by the two models, and carrying out Flatten () operation on the obtained characteristics as the input of a fusion layer;
3.1.2) completing the fusion of the intermediate features as the input of the full connection layer.
3.2) to prevent the model from overfitting during training, L2 regularization is applied to the five convolutional layers and dropout (0.5) is added to the fully connected layers.
Referring to fig. 4, in step 4), classification and identification are performed on the features fused in step 3.1) by using a Softmax classifier, and the method comprises the following steps:
4.1) after completing the fusion of the characteristics, passing through three full-connection layers to be used as the input of a Softmax classifier, and then classifying;
4.2) setting a threshold value of the early warning report, and giving an early warning prompt by the system after judging that the recognition rate of a certain behavior action reaches the corresponding threshold value.
In step 6), the model is finely adjusted according to the specific application scene, and the popularization and generalization capability of the model is enhanced, including the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
The following is the human behavior recognition system based on multi-target detection 3D CNN provided in this embodiment, comprising:
a data acquisition module: the method is used for collecting original video data information of human body behavior analysis, and the original video data information comprises a public behavior data set and a video data set in an actual scene. In this embodiment, a monocular camera and a binocular camera are used to capture video data in an actual scene and download a public human behavior data set as a total data set to be captured.
A data preprocessing module: used for preprocessing the original video data, including classification and calibration, target detection, cropping, and video frame conversion. In this embodiment, the FFmpeg tool is used to process the video data and convert it into an image frame set, and the SSD (Single Shot MultiBox Detector) detection algorithm is used to calibrate and crop the video, generating two image frame sets as follows:
2.1) directly performing frame cropping on a video data set to establish a first image frame set, and the method comprises the following steps:
2.1.1) carrying out filing operation on the video data set, filing the video data with the same action behavior under the same folder, and naming the folder by using a behavior tag of the folder;
2.1.2) preprocessing a video data set, and converting all videos into corresponding image frame sets by using an FFmpeg tool;
2.1.3) cutting and dividing the image frame set obtained in the step 2.1.2) by adopting a cross-validation method for training a model.
2.2) using the SSD (Single Shot MultiBox Detector) algorithm to detect the subject of a behavior action, extract targeted action features, and establish a second image frame set, comprising the following steps:
2.2.1) loading the trained SSD detection model;
2.2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.2.3) setting the crop size of the calibration data to half the size of each frame in the image frame set of 2.1.3), converting all videos and saving them as the calibrated image frame set.
A feature extraction module: and the method is used for sending the preprocessed data into the constructed 3D CNN network model and respectively extracting the behavior characteristic information of the video stream and the behavior main body characteristic information of the calibration cutting. In the present embodiment, a 3D CNN model is used. And taking continuous 16 frames of data as the input of the model, and extracting two kinds of feature information as the input of the feature fusion module by adopting 5-layer 3D convolution operation and 5-layer maximum 3D pooling operation.
A feature fusion module: and the system is used for fusing the feature information acquired by the feature extraction module. In the embodiment, a 1-layer 3D feature fusion layer structure is adopted, two kinds of behavior feature information are fused, and a 3-layer full-connection layer further extracts and classifies features.
A model training module: learns and models the preprocessed training set to obtain a trained multi-target detection 3D CNN human behavior recognition model. In this embodiment, the training data set is formed by combining public human behavior data sets such as UCF-101 and HMDB51 with actual data sets collected in-house.
Human behavior recognition module: and classifying and identifying the behavior actions of the human body by using a multi-target detection 3D CNN human body behavior identification model. In the present embodiment, classification recognition is performed by a Softmax classifier.
In the above embodiments, the modules are divided only according to the functional logic of the present invention; the division is not limited thereto as long as the corresponding functions can be implemented, and it is not intended to limit the scope of the present invention.
In conclusion, the human behavior recognition method and system based on multi-target detection 3D CNN provided by the invention not only compensate for the inability of 2D neural networks to extract features in the time dimension, but also adopt a multi-target detection approach: an SSD (Single Shot MultiBox Detector) target detection algorithm calibrates the behavior subject in the video stream to obtain finer local features, which are fused into the model to offset the weakening of its global features. Moreover, the behavior features learned by the model can serve not only for classification and recognition but also for early-warning reports: the model can pre-judge and report special behaviors according to a set early-warning threshold, which broadens its practical application scenarios. The model can also be migrated to Internet-of-Things platforms such as smart homes, intelligent monitoring and intelligent anti-theft, and thus has broad research and application value and is well suited for popularization.
The above-mentioned embodiments are merely preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; changes made according to the shape and principle of the present invention should be covered within its protection scope.
Claims (5)
1. A human behavior recognition method based on multi-target detection 3D CNN, characterized by comprising the following steps:
1) the method for preprocessing the video and converting the video stream into the image frame comprises the following steps:
1.1) acquiring a video data set, wherein a public data set is mainly adopted for training a model, and a test data set is acquired by a camera in a real environment;
1.2) carrying out filing operation on the video data set, filing the video data with the same action behavior under the same folder, and naming the folder by using the behavior tag of the folder;
1.3) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
1.4) cutting and dividing the image frame set obtained in the step 1.3) by adopting a cross verification method for training a model;
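The partition of step 1.4) can be illustrated with a minimal sketch of a k-fold cross-validation split over clip identifiers. The patent does not specify the fold count or interface, so the function name `kfold_split` and the default `k=5` are assumptions for illustration.

```python
def kfold_split(clips, k=5):
    """Partition a list of clip identifiers into k cross-validation folds.

    Yields (train, val) pairs: fold i serves as the validation set while
    the remaining folds form the training set. Illustrative sketch; the
    patent only states that a cross-validation method is used.
    """
    # Round-robin assignment keeps the folds balanced in size.
    folds = [clips[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, val
```

Each clip appears in exactly one validation fold across the k splits, so every sample is used for both training and validation over a full cross-validation run.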
2) the method for calibrating and cutting the target object in the video by adopting the SSD detection technology comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the size of the calibration-data crop to half the size of each frame in the image frame set of step 1.3), converting all videos and storing them as a calibrated image frame set;
3) establishing a feature extraction network structure of image frame data and calibration cutting data, which comprises the following steps:
firstly, building a 3D convolutional neural network model based on the image frame set of step 1.3) and a 3D convolutional neural network model based on the calibrated image frame set of step 2.3), respectively; then, taking 16 consecutive frames of data as the model input and applying 5 layers of 3D convolution, 5 layers of 3D max pooling, 1 feature fusion layer and 3 fully connected layers; to prevent over-fitting of the model, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
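The shape bookkeeping behind the 5-conv/5-pool stack of step 3) can be traced without any deep-learning framework. The sketch below assumes C3D-style hyper-parameters, which the patent does not specify: 'same'-padded 3x3x3 convolutions (which preserve spatial/temporal extent) and max pooling of 1x2x2 at the first stage, 2x2x2 thereafter, so that 16 input frames collapse to a single temporal slice after five stages.

```python
def c3d_shapes(depth=16, height=112, width=112):
    """Trace the (depth, height, width) feature-map shape through five
    3D conv + max-pool stages.

    Assumption (C3D-style, not stated in the patent): convolutions are
    'same'-padded and leave the shape unchanged; the first pool is 1x2x2
    to preserve early temporal information, the remaining four are 2x2x2.
    """
    pools = [(1, 2, 2)] + [(2, 2, 2)] * 4
    shape = (depth, height, width)
    shapes = [shape]
    for pd, ph, pw in pools:
        d, h, w = shape
        # Conv keeps the shape; pooling divides each dimension (floor),
        # never shrinking the temporal depth below 1.
        shape = (max(d // pd, 1), h // ph, w // pw)
        shapes.append(shape)
    return shapes
```

Under these assumptions a 16x112x112 clip passes through 16x56x56, 8x28x28, 4x14x14 and 2x7x7 to a final 1x3x3 map, which is then flattened for the fusion and fully connected layers.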
4) establishing a feature fusion model, fusing the two features extracted in the step 3), and comprising the following steps:
4.1) respectively obtaining 3D convolution characteristics extracted based on the 3D convolution neural network model of the image frame set in the step 1.3) and the 3D convolution neural network model of the image frame set calibrated in the step 2.3), and carrying out Flatten () operation on the obtained characteristics to be used as the input of a fusion layer;
4.2) completing the fusion of the intermediate features as the input of the full connection layer;
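Steps 4.1)-4.2) amount to flattening each feature tensor and joining the results as one vector for the fully connected layers. A minimal plain-Python sketch, using nested lists as stand-ins for feature tensors (the function names are hypothetical):

```python
def flatten(tensor):
    """Flatten a nested list (a stand-in for a feature tensor) into a
    1-D list, mirroring the Flatten() operation of step 4.1)."""
    if isinstance(tensor, list):
        return [x for sub in tensor for x in flatten(sub)]
    return [tensor]

def fuse(global_feat, subject_feat):
    """Concatenate the flattened video-stream features and the flattened
    calibrated-crop (behavior subject) features, forming the input of
    the fully connected layers as in step 4.2)."""
    return flatten(global_feat) + flatten(subject_feat)
```

Concatenation is the simplest fusion choice consistent with the claim text; the fused vector length is just the sum of the two flattened feature lengths.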
5) classifying by using a Softmax regression model classifier;
6) fine-tuning the trained model according to the actual application scenario or a public data set, so as to enhance the generalization and popularization capability of the model.
2. The human behavior recognition method based on multi-target detection 3D CNN of claim 1, wherein in step 5), classification is performed by using a Softmax classifier, and the method comprises the following steps:
5.1) after completing the fusion of the characteristics in the step 4), passing through three full-connection layers to be used as the input of a Softmax classifier, and then classifying;
5.2) setting a threshold for the early-warning report; after the recognition rate of a certain behavior action is judged to reach the corresponding threshold, the system gives an early-warning prompt.
3. The human behavior recognition method based on multi-target detection 3D CNN of claim 1, wherein in step 6), according to an actual application scenario or a common data set, a trained model is finely tuned to enhance generalization and popularization capabilities of the model, and the method comprises the following steps:
6.1) transferring the model to a specific application scene, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading a data set under a new scene, and retraining parameters of the full connection layer.
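The transfer-learning recipe of steps 6.1)-6.3) — freeze the convolution and pooling parameters, swap the input/output layers, retrain only the fully connected layers — can be sketched framework-free with a list of layer records. The dict-based layer representation is a hypothetical stand-in for a real model object.

```python
def freeze_for_finetune(layers):
    """Mark convolution/pooling layers as frozen and fully connected
    layers as trainable, mirroring steps 6.1) and 6.3).

    Illustrative sketch: each layer is a dict with a 'type' field; in a
    real framework this would toggle per-layer trainable flags instead.
    """
    for layer in layers:
        layer["trainable"] = layer["type"] == "fc"
    return layers
```

After this pass, loading a data set from the new scene (step 6.3) updates only the fully connected parameters, so the generic spatio-temporal features learned earlier are preserved.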
4. A human behavior recognition system based on multi-target detection 3D CNN, characterized by comprising:
the data acquisition module is used for acquiring original video data information of human behavior analysis, wherein the original video data information comprises a public behavior data set and a video data set in an actual scene;
the data preprocessing module is used for preprocessing the original video data, classifying and calibrating, detecting a target, cutting and converting a video frame;
the method for preprocessing the video and converting the video stream into the image frame comprises the following steps:
1.1) acquiring a video data set, wherein a public data set is mainly adopted for training a model, and a test data set is acquired by a camera in a real environment;
1.2) carrying out filing operation on the video data set, filing the video data with the same action behavior under the same folder, and naming the folder by using the behavior tag of the folder;
1.3) preprocessing a video data set, and converting all videos into corresponding image frame sets through a video conversion script program;
1.4) cutting and dividing the image frame set obtained in the step 1.3) by adopting a cross verification method for training a model;
the method for calibrating and cutting the target object in the video by adopting the SSD detection technology comprises the following steps:
2.1) loading the trained SSD detection model;
2.2) reading video stream data, sending the video stream data into an SSD detection model, and carrying out calibration detection on each frame of the video;
2.3) setting the size of the calibration-data crop to half the size of each frame in the image frame set of step 1.3), converting all videos and storing them as a calibrated image frame set;
the feature extraction module is used for sending the preprocessed data into the constructed 3D CNN network model, and respectively extracting the behavior feature information of the video stream and the behavior main body feature information of the calibration cutting, and the feature extraction module specifically comprises the following steps:
firstly, building a 3D convolutional neural network model based on the image frame set of step 1.3) and a 3D convolutional neural network model based on the calibrated image frame set of step 2.3), respectively; then, taking 16 consecutive frames of data as the model input and applying 5 layers of 3D convolution, 5 layers of 3D max pooling, 1 feature fusion layer and 3 fully connected layers; to prevent over-fitting of the model, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
the feature fusion module is used for fusing the feature information acquired by the feature extraction module and comprises the following steps:
4.1) respectively obtaining 3D convolution characteristics extracted based on the 3D convolution neural network model of the image frame set in the step 1.3) and the 3D convolution neural network model of the image frame set calibrated in the step 2.3), and carrying out Flatten () operation on the obtained characteristics to be used as the input of a fusion layer;
4.2) completing the fusion of the intermediate features as the input of the full connection layer;
the model training module is used for learning and modeling the preprocessed training set to obtain a trained 3D CNN human body behavior recognition model for multi-target detection;
and the human body behavior recognition module is used for classifying and recognizing the behavior actions of the human body by utilizing a multi-target detection 3D CNN human body behavior recognition model.
5. The human behavior recognition system based on multi-target detection 3D CNN of claim 4, characterized in that: the data acquisition module acquires video data in an actual scene through a monocular camera and a binocular camera and downloads public human behavior data sets; the data preprocessing module processes the video data with the FFmpeg tool, converts it into an image frame set, and calibrates and crops the video with the SSD detection algorithm, generating the image frame set of step 1.3) together with the calibrated image frame set of step 2.3); the feature extraction module adopts a 3D CNN model, takes 16 consecutive frames of data as the model input, and applies 5 layers of 3D convolution and 5 layers of 3D max pooling; the feature fusion module adopts a single 3D feature fusion layer to fuse the two kinds of behavior feature information, and 3 fully connected layers further extract and classify the features; the model training module combines the public human behavior data sets UCF-101 and HMDB51 with an actual data set acquired by itself to form the training data set; the human behavior recognition module performs classification and recognition with a Softmax classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910136442.1A CN109977773B (en) | 2019-02-18 | 2019-02-18 | Human behavior identification method and system based on multi-target detection 3D CNN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977773A CN109977773A (en) | 2019-07-05 |
CN109977773B true CN109977773B (en) | 2021-01-19 |
Family
ID=67077264
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910136442.1A Expired - Fee Related CN109977773B (en) | 2019-02-18 | 2019-02-18 | Human behavior identification method and system based on multi-target detection 3D CNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977773B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348420B (en) | 2019-07-18 | 2022-03-18 | 腾讯科技(深圳)有限公司 | Sign language recognition method and device, computer readable storage medium and computer equipment |
CN110414415A (en) * | 2019-07-24 | 2019-11-05 | 北京理工大学 | Human bodys' response method towards classroom scene |
CN110414421B (en) * | 2019-07-25 | 2023-04-07 | 电子科技大学 | Behavior identification method based on continuous frame images |
CN110532909B (en) * | 2019-08-16 | 2023-04-14 | 成都电科慧安科技有限公司 | Human behavior identification method based on three-dimensional UWB positioning |
CN111259838B (en) * | 2020-01-20 | 2023-02-03 | 山东大学 | Method and system for deeply understanding human body behaviors in service robot service environment |
CN111382677B (en) * | 2020-02-25 | 2023-06-20 | 华南理工大学 | Human behavior recognition method and system based on 3D attention residual error model |
CN113536847A (en) * | 2020-04-17 | 2021-10-22 | 天津职业技术师范大学(中国职业培训指导教师进修中心) | Industrial scene video analysis system and method based on deep learning |
CN113515986A (en) * | 2020-07-02 | 2021-10-19 | 阿里巴巴集团控股有限公司 | Video processing method, data processing method and equipment |
CN112016461B (en) * | 2020-08-28 | 2024-06-11 | 深圳市信义科技有限公司 | Multi-target behavior recognition method and system |
CN112232190B (en) * | 2020-10-15 | 2023-04-18 | 南京邮电大学 | Method for detecting abnormal behaviors of old people facing home scene |
CN112613428B (en) * | 2020-12-28 | 2024-03-22 | 易采天成(郑州)信息技术有限公司 | Resnet-3D convolution cattle video target detection method based on balance loss |
CN112766151B (en) * | 2021-01-19 | 2022-07-12 | 北京深睿博联科技有限责任公司 | Binocular target detection method and system for blind guiding glasses |
CN113052059A (en) * | 2021-03-22 | 2021-06-29 | 中国石油大学(华东) | Real-time action recognition method based on space-time feature fusion |
CN113221658A (en) * | 2021-04-13 | 2021-08-06 | 卓尔智联(武汉)研究院有限公司 | Training method and device of image processing model, electronic equipment and storage medium |
CN113420703B (en) * | 2021-07-03 | 2023-04-18 | 西北工业大学 | Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling |
CN115601714B (en) * | 2022-12-16 | 2023-03-10 | 广东汇通信息科技股份有限公司 | Campus violent behavior identification method based on multi-modal data analysis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104899561A (en) * | 2015-05-27 | 2015-09-09 | 华南理工大学 | Parallelized human body behavior identification method |
CN108108652A (en) * | 2017-03-29 | 2018-06-01 | 广东工业大学 | A kind of across visual angle Human bodys' response method and device based on dictionary learning |
CN108647591A (en) * | 2018-04-25 | 2018-10-12 | 长沙学院 | Activity recognition method and system in a kind of video of view-based access control model-semantic feature |
CN108985173A (en) * | 2018-06-19 | 2018-12-11 | 奕通信息科技(上海)股份有限公司 | Towards the depth network migration learning method for having the label apparent age data library of noise |
CN109002808A (en) * | 2018-07-27 | 2018-12-14 | 高新兴科技集团股份有限公司 | A kind of Human bodys' response method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10402697B2 (en) * | 2016-08-01 | 2019-09-03 | Nvidia Corporation | Fusing multilayer and multimodal deep neural networks for video classification |
- 2019-02-18 CN CN201910136442.1A patent/CN109977773B/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN109977773A (en) | 2019-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977773B (en) | Human behavior identification method and system based on multi-target detection 3D CNN | |
CN108830252B (en) | Convolutional neural network human body action recognition method fusing global space-time characteristics | |
CN110363140B (en) | Human body action real-time identification method based on infrared image | |
CN110929593B (en) | Real-time significance pedestrian detection method based on detail discrimination | |
CN111079646A (en) | Method and system for positioning weak surveillance video time sequence action based on deep learning | |
Xie et al. | Detecting trees in street images via deep learning with attention module | |
CN109670405B (en) | Complex background pedestrian detection method based on deep learning | |
Chen et al. | An improved Yolov3 based on dual path network for cherry tomatoes detection | |
CN111582122B (en) | System and method for intelligently analyzing behaviors of multi-dimensional pedestrians in surveillance video | |
CN111582095B (en) | Light-weight rapid detection method for abnormal behaviors of pedestrians | |
CN111582092B (en) | Pedestrian abnormal behavior detection method based on human skeleton | |
CN111382677A (en) | Human behavior identification method and system based on 3D attention residual error model | |
CN108875555B (en) | Video interest area and salient object extracting and positioning system based on neural network | |
CN105590099A (en) | Multi-user behavior identification method based on improved convolutional neural network | |
CN113705445B (en) | Method and equipment for recognizing human body posture based on event camera | |
CN110929685A (en) | Pedestrian detection network structure based on mixed feature pyramid and mixed expansion convolution | |
CN116721458A (en) | Cross-modal time sequence contrast learning-based self-supervision action recognition method | |
CN113516102A (en) | Deep learning parabolic behavior detection method based on video | |
CN113255464A (en) | Airplane action recognition method and system | |
Tsutsui et al. | Distantly supervised road segmentation | |
CN103500456A (en) | Object tracking method and equipment based on dynamic Bayes model network | |
CN114359578A (en) | Application method and system of pest and disease damage identification intelligent terminal | |
Al-Shakarchy et al. | Detecting abnormal movement of driver's head based on spatial-temporal features of video using deep neural network DNN | |
CN113887272A (en) | Violent behavior intelligent safety detection system based on edge calculation | |
CN113361475A (en) | Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210119 |