CN109977773A - Human behavior recognition method and system based on multi-target detection 3D CNN - Google Patents
Human behavior recognition method and system based on multi-target detection 3D CNN
- Publication number
- CN109977773A CN109977773A CN201910136442.1A CN201910136442A CN109977773A CN 109977773 A CN109977773 A CN 109977773A CN 201910136442 A CN201910136442 A CN 201910136442A CN 109977773 A CN109977773 A CN 109977773A
- Authority
- CN
- China
- Prior art keywords
- model
- data
- video
- cnn
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a human behavior recognition method and system based on a multi-target detection 3D CNN. The method comprises: 1) preprocessing the video to convert the video stream into image frames; 2) calibrating and cropping the target objects in the video using the currently mature SSD detection technique; 3) building the feature extraction network structures for the image frame data and the calibrated cropped data; 4) building a feature fusion model to fuse the two kinds of features extracted in step 3); 5) classifying with a Softmax regression classifier; 6) fine-tuning the trained model according to the actual application scenario or public datasets. The present invention compensates for the information loss caused by convolution along the time dimension in current deep neural network models, strengthens the expression of features in the time dimension, improves the overall recognition efficiency of the model, and enables the model to better understand human behaviors and actions.
Description
Technical field
The present invention relates to the technical field of human behavior recognition and analysis, and in particular to a human behavior recognition method and system based on multi-target detection 3D CNN.
Background
Human behavior recognition refers to identifying the behaviors or movements of humans in real environments, and can be applied in many fields. Common application scenarios at present include intelligent surveillance, smart homes, human-computer interaction, and the analysis and prediction of human behavior attributes. However, improving the accuracy and efficiency of recognition remains a very challenging task, and it has received extensive attention from researchers.
In the past few decades, the extraction and representation of human behavior features largely remained at the manual stage, where the design and extraction of features often depended on the experience of the designer. Common hand-crafted feature extraction methods include space-time interest points (STIP), the bag of visual words (BOVW), histograms of oriented gradients (HOG), motion history images (MHI), and motion energy images (MEI). Hand-crafted features are usually designed for one specific portion of the data, which results in poor generalization ability: the model cannot be quickly migrated to other applications, greatly increasing labor costs. Traditional methods can be said to have entered a bottleneck period.
The application of deep learning to human behavior recognition can be described as a great remedy for the shortcomings of traditional recognition approaches. This is mainly reflected in the following aspects: (1) it avoids the trouble of hand-crafted feature extraction and simplifies the feature extraction process; (2) since deep neural networks have a certain feedback adjustment effect, the generalization ability of the model is strengthened to a large extent; (3) complex features can be automatically reduced in dimensionality; (4) when handling big data, computational overhead can be greatly reduced and overall execution efficiency improved; (5) for the recognition and classification of unlabeled data, performance is superior; (6) for modality-based behavior recognition, implementation is relatively simple: one only needs to design a corresponding deep learning model to extract features, and then fuse the features of two or more network models, which greatly improves recognition accuracy.
One of the biggest differences between human behavior recognition and image classification or detection is whether information in the time dimension is involved. Therefore, human behavior recognition must not only extract behavior features from the spatial dimension, but also mine continuity information from the time dimension of the behavior. Only in this way can a correct description of continuous behaviors and actions be guaranteed.
Summary of the invention
The object of the present invention is to overcome the deficiency of current deep neural network models in capturing time-dimension information for human behavior recognition, and to propose a human behavior recognition method and system based on multi-target detection 3D CNN, which compensates for the information loss caused by convolution along the time dimension, strengthens the expression of features in the time dimension, improves the overall recognition efficiency of the model, and enables the model to better understand human behaviors and actions.
To achieve the above object, the technical solution provided by the present invention is as follows:
A human behavior recognition method based on multi-target detection 3D CNN, comprising the following steps:
1) preprocessing the video to convert the video stream into image frames;
2) calibrating and cropping the target objects in the video using the SSD (Single Shot MultiBox Detector) detection technique;
3) building the feature extraction network structures for the image frame data and the calibrated cropped data;
4) building a feature fusion model to fuse the two kinds of features extracted in step 3);
5) classifying with a Softmax regression classifier;
6) fine-tuning the trained model according to the actual application scenario or public datasets, to enhance the generalization and transfer ability of the model.
In step 1), the video is preprocessed and the video stream is converted into image frames, comprising the following steps:
1.1) obtaining the video datasets: public datasets are mainly used for model training, while the test dataset is collected by cameras in real environments;
1.2) archiving the video datasets: video data of the same action behavior are filed into the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video datasets: all videos are converted into their corresponding image frame sets by a video conversion shell script;
1.4) splitting the image frame sets obtained in step 1.3) using cross-validation, for model training.
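Step 1.3)'s shell-script conversion can be sketched as building one FFmpeg command per video, with the output folder mirroring step 1.2)'s behavior-label archiving rule. This is a minimal sketch: the `frames` output root, the sampling rate, and the `frame_%05d.jpg` naming pattern are illustrative assumptions, not given in the patent.

```python
import os

def ffmpeg_frame_command(video_path: str, out_root: str, fps: int = 25) -> list:
    """Build the FFmpeg command that converts one video into an image frame set.

    The behavior label is taken from the video's parent folder name, per the
    archiving rule of step 1.2). Output root and fps are assumed values.
    """
    label = os.path.basename(os.path.dirname(video_path))  # behavior label = folder name
    stem = os.path.splitext(os.path.basename(video_path))[0]
    frame_dir = os.path.join(out_root, label, stem)
    # -vf fps=... samples the stream at a fixed rate; %05d numbers the frames
    return ["ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",
            os.path.join(frame_dir, "frame_%05d.jpg")]

cmd = ffmpeg_frame_command("dataset/walking/v001.avi", "frames")
print(cmd)
```

A real script would loop this command over every archived video and invoke it with `subprocess.run`.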
In step 2), the target objects in the video are calibrated and cropped using the SSD detection technique, comprising the following steps:
2.1) loading the trained SSD detection model;
2.2) reading the video stream data, feeding it into the SSD detection model, and performing calibration detection on each frame of the video;
2.3) setting the crop size of the calibrated data to half the size of each frame in the frame set of step 1.3), and converting and saving all videos as calibrated image frame sets.
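The half-frame crop of step 2.3) can be sketched as centring a fixed-size window on the detector's box and clamping it to the frame. The box format and the centring/clamping policy are assumptions for illustration; the patent only fixes the crop size at half the frame size.

```python
def calibrated_crop(frame_w: int, frame_h: int, box: tuple) -> tuple:
    """Return a crop window of half the frame size, centred on an SSD box.

    `box` is an assumed (x_min, y_min, x_max, y_max) pixel box for the
    behavior subject. The window is clamped so it stays inside the frame.
    """
    crop_w, crop_h = frame_w // 2, frame_h // 2
    cx = (box[0] + box[2]) // 2          # box centre
    cy = (box[1] + box[3]) // 2
    x0 = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    y0 = min(max(cy - crop_h // 2, 0), frame_h - crop_h)
    return (x0, y0, x0 + crop_w, y0 + crop_h)
```

For a 320x240 frame, every returned window is exactly 160x120 regardless of where the detection lands.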
In step 3), the feature extraction network structures for the image frame data and the calibrated cropped data are built as follows: first, a 3D convolutional neural network model based on the image frame dataset and a 3D convolutional neural network model based on the human detection module dataset are built respectively; then, with 16 consecutive frames as the input of each model, 5 3D convolution layers, 5 3D max-pooling layers, 1 feature fusion layer, and 3 fully connected layers are adopted. To prevent overfitting during model training, L2 regularization is applied to the 5 convolution layers, and dropout (0.5) is added to the fully connected layers.
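The layer stack just described can be traced shape by shape. The patent fixes 16 input frames and 5 conv/pool stages; the 112x112 spatial input, 'same'-padded 3x3x3 kernels, and C3D-style pool strides (1x2x2 for the first pool, 2x2x2 afterwards) are illustrative assumptions, not claimed by the patent.

```python
def c3d_like_shapes(frames: int = 16, height: int = 112, width: int = 112):
    """Trace (depth, h, w, channels) through 5 conv + 5 max-pool stages.

    Kernel counts 64..512 follow the embodiment; spatial size and pool
    strides are assumed for the sake of a concrete walk-through.
    """
    channels = [64, 128, 256, 512, 512]
    d, h, w = frames, height, width
    shapes = []
    for i, c in enumerate(channels):
        # 'same'-padded 3x3x3 convolution keeps d, h, w; only channels change
        pd, ph, pw = (1, 2, 2) if i == 0 else (2, 2, 2)  # pooling strides
        d, h, w = max(d // pd, 1), h // ph, w // pw
        shapes.append((d, h, w, c))
    return shapes

for s in c3d_like_shapes():
    print(s)
```

Under these assumptions the 16-frame clip shrinks to a single 3x3x512 temporal slice before flattening, which is what the fusion and fully connected layers would consume.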
In step 4), a feature fusion model is built and the features are fused, comprising the following steps:
4.1) obtaining the 3D convolution features extracted by the 3D convolutional neural network model based on the image frame dataset and by the 3D convolutional neural network model based on the human detection module dataset respectively, and applying a Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features as the input of the fully connected layers.
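Steps 4.1) and 4.2) amount to flattening each branch's feature map and concatenating the two vectors. A minimal sketch, with small nested lists standing in for real feature tensors:

```python
def flatten(feature):
    """Recursively flatten a nested feature map into a 1-D list (the Flatten() of step 4.1)."""
    if isinstance(feature, list):
        return [v for item in feature for v in flatten(item)]
    return [feature]

def fuse(global_feat, subject_feat):
    """Concatenate the flattened global-branch and subject-branch features,
    forming the fusion-layer output fed to the fully connected layers."""
    return flatten(global_feat) + flatten(subject_feat)

fused = fuse([[1, 2], [3, 4]], [[5], [6]])
print(fused)
```

In a framework implementation this would be a `Flatten` on each branch followed by a concatenation layer; the toy values above are placeholders.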
In step 5), classification with the Softmax classifier comprises the following steps:
5.1) after the feature fusion of step 4) is completed, the result passes through three fully connected layers as the input of the Softmax classifier, and is then classified;
5.2) setting an early-warning threshold: after the recognition confidence of a certain behavior reaches its corresponding threshold, the system issues an early warning.
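The Softmax classification plus per-behavior warning threshold of steps 5.1)-5.2) can be sketched in a few lines. The behavior labels and threshold values here are invented for illustration; the patent does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_with_warning(logits, labels, warn_thresholds):
    """Pick the top behavior class; fire an early warning when its softmax
    confidence reaches that behavior's threshold (per step 5.2).
    Behaviors without a threshold never trigger a warning."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = labels[best]
    warned = probs[best] >= warn_thresholds.get(label, 1.1)  # 1.1 = unreachable
    return label, probs[best], warned
```

For example, with hypothetical labels `["walk", "fall", "run"]` and a 0.6 threshold on "fall", a confident "fall" prediction would trigger the warning.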
In step 6), the trained model is fine-tuned according to the actual application scenario or public datasets to enhance the generalization and transfer ability of the model, comprising the following steps:
6.1) migrating the model to the specific application scenario and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading the dataset of the new scenario and retraining the parameters of the fully connected layers.
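The freeze-and-retrain policy of steps 6.1)-6.3) can be expressed as a simple plan over named layers: conv and pool layers are frozen, everything else is retrained. The layer names are illustrative; in a real framework this would set each layer's trainable flag.

```python
def fine_tune_plan(layer_names):
    """Mark which layers are retrained when migrating to a new scenario:
    convolution and pooling layers are frozen (step 6.1), the fully
    connected head is retrained on the new-scenario data (step 6.3)."""
    plan = {}
    for name in layer_names:
        frozen = name.startswith(("conv", "pool"))
        plan[name] = "frozen" if frozen else "retrain"
    return plan

layers = ["conv1", "pool1", "conv2", "pool2", "fc1", "fc2", "softmax_out"]
print(fine_tune_plan(layers))
```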
A human behavior recognition system based on multi-target detection 3D CNN, comprising:
a data collection module, for collecting the original video data for human behavior analysis, including public behavior datasets and video datasets from actual scenarios;
a data preprocessing module, for preprocessing the original video data: classification calibration, target detection, cropping, and video frame conversion;
a feature extraction module, for feeding the preprocessed data into the built 3D CNN network models and extracting the behavior feature information of the video stream and the behavior-subject feature information of the calibrated crops respectively;
a feature fusion module, for fusing the feature information obtained by the feature extraction module;
a model training module, for building a learning model on the preprocessed training set to obtain the trained multi-target detection 3D CNN human behavior recognition model;
a human behavior recognition module, for classifying and recognizing human behaviors and actions using the multi-target detection 3D CNN human behavior recognition model.
Further, the data collection module collects video data in actual scenarios through monocular and binocular cameras, and downloads public human behavior datasets; the data preprocessing module processes the video data with the "FFmpeg" tool, converting it into image frame sets, and at the same time calibrates and crops the videos with the SSD detection algorithm to generate the multi-target frame dataset; the feature extraction module uses the 3D CNN models, with 16 consecutive frames as the model input, adopting 5 3D convolution layers and 5 3D max-pooling layers; the feature fusion module uses a 1-layer 3D feature fusion structure to fuse the two kinds of behavior feature information, with 3 fully connected layers further extracting and classifying the features; the model training module combines the public human behavior datasets "UCF-101" and "HMDB51" with the self-collected real dataset to compose the training dataset; the human behavior recognition module performs classification and recognition with the Softmax classifier.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The video data is converted into image frame sets, and the SSD (Single Shot MultiBox Detector) detection algorithm is used to calibrate and crop the persons in the video stream. Not only can global behavior feature information be extracted from the video, but local features of the behavior subject can also be extracted, compensating for the weakening of global features and strengthening the learning ability of the model.
2. Using 3D CNN models to extract features from the two preprocessed datasets compensates for the shortcoming that traditional 2D CNNs can only extract video features spatially. No separate extraction or fusion of the temporal features of the behavior is needed; it suffices to input the image frame data in batches, and the model automatically extracts behavior features from both the time and space dimensions, greatly reducing the difficulty of feature extraction in the time dimension.
3. The behavior features learned by the model can not only be used for classification and recognition, but can also serve for early-warning reporting: the model prejudges and reports special behaviors according to the set warning thresholds, which increases the practical application scenarios of the model.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Fig. 2 is a structural schematic diagram of the 3D convolution operation in the present invention.
Fig. 3 is a structural design drawing of the 3D convolutional neural network model in the present invention.
Fig. 4 is a structural schematic diagram of the model based on multi-target detection 3D CNN.
Detailed description of the embodiments
The present invention is further described below with reference to specific embodiments.
As shown in Fig. 1, the human behavior recognition method based on multi-target detection 3D CNN provided by this embodiment comprises the following steps:
1) establishing a human behavior recognition data collection system and obtaining the human behavior video datasets: public datasets are mainly used for model training, and the test dataset is collected by cameras in real environments;
2) converting the collected video datasets into a frame dataset and a dataset calibrated and cropped with the SSD (Single Shot MultiBox Detector) detection algorithm, respectively;
3) building the 3D CNN learning models, learning on the two datasets respectively, and fusing the separately learned features;
4) classifying and recognizing the fused features with the Softmax classifier;
5) classifying and calibrating the recognized behaviors, or issuing early-warning reports;
6) fine-tuning the model according to the specific application scenario to enhance its generalization and transfer ability.
In step 2), the video datasets collected in step 1) are preprocessed. Since the model performs multi-target fusion recognition, this is divided into the following two independent processes:
2.1) directly cutting the video datasets into frames to establish the first frame dataset, comprising the following steps:
2.1.1) archiving the video datasets: video data of the same action behavior are filed into the same folder, and the folder is named with its behavior label;
2.1.2) preprocessing the video datasets: all videos are converted into their corresponding image frame sets by a video conversion shell script;
2.1.3) splitting the image frame sets obtained in 2.1.2) using cross-validation, for model training.
2.2) detecting the subject of the action behavior with the SSD (Single Shot MultiBox Detector) algorithm, extracting targeted motion features, and establishing the second frame dataset, comprising the following steps:
2.2.1) loading the trained SSD detection model;
2.2.2) reading the video stream data, feeding it into the SSD detection model, and performing calibration detection on each frame of the video;
2.2.3) setting the crop size of the calibrated data to half the size of each frame in the frame set of 2.1.3), and converting and saving all videos as calibrated image frame sets.
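The cross-validation split of step 2.1.3) can be sketched with a simple k-fold partition of the frame-clip list. The fold count and seed are illustrative assumptions; the patent does not fix them.

```python
import random

def kfold_split(items, k: int = 5, seed: int = 0):
    """Shuffle the clip list deterministically and partition it into k folds;
    each fold in turn serves as validation while the rest trains the model."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

folds = kfold_split(range(10), k=5)
print(folds)
```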
As shown in Fig. 2, this is a structural schematic of the convolution operation performed by the 3D CNN model designed in the present invention to extract behavior features. A 3D CNN can extract behavior feature information from the two dimensions of space and time. As can be seen from Fig. 2, the time dimension of the convolution operation is N, i.e., the convolution is performed over N consecutive frames. The 3D convolution in the figure stacks N consecutive image frames into a cube, and the 3D convolution kernel is then applied within the cube. In this structure, each feature map in the convolution layer is connected to multiple adjacent consecutive frames in the previous layer, thereby capturing motion information.
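The Fig. 2 operation can be made concrete with a minimal single-channel 3D convolution ('valid' padding, stride 1), written in pure Python for clarity. It shows how one kernel spans several consecutive frames, so each output value mixes information across time as well as space.

```python
def conv3d_valid(volume, kernel):
    """Convolve a stack of N frames (depth x height x width nested lists)
    with a small 3D kernel; each output value sums over a spatio-temporal
    window, which is how motion information is captured."""
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(D - d + 1):          # slide along the time dimension
        plane = []
        for y in range(H - h + 1):
            row = []
            for x in range(W - w + 1):
                row.append(sum(
                    volume[z + i][y + j][x + k] * kernel[i][j][k]
                    for i in range(d) for j in range(h) for k in range(w)))
            plane.append(row)
        out.append(plane)
    return out
```

A kernel whose depth is 2 sums each pixel with the same pixel one frame later, so a static scene and a moving one produce different responses even when single frames look alike.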
As shown in Fig. 3, in step 3), the 3D CNN models are established and feature learning is performed, comprising the following steps:
3.1) building the 3D convolutional neural network model based on the image frame dataset and the 3D convolutional neural network model based on the human detection module dataset, respectively. With 16 consecutive frames as the model input, 5 3D convolution layers (with 64, 128, 256, 512, and 512 convolution kernels in turn), 5 3D max-pooling layers, and 1 fully connected layer (of 2048 units) are adopted, and the obtained features serve as the input of the model fusion layer, as shown in Fig. 4, comprising the following steps:
3.1.1) obtaining the 3D convolution features extracted by the two models respectively, and applying a Flatten() operation to the obtained features as the input of the fusion layer;
3.1.2) completing the fusion of the intermediate features as the input of the fully connected layers.
3.2) To prevent overfitting during model training, L2 regularization is applied to the 5 convolution layers, and dropout (0.5) is added to the fully connected layers.
As shown in Fig. 4, in step 4), the fused features of step 3.1) are classified and recognized with the Softmax classifier, comprising the following steps:
4.1) after the feature fusion is completed, the result passes through three fully connected layers as the input of the Softmax classifier, and is then classified;
4.2) setting an early-warning threshold: after the recognition confidence of a certain behavior reaches its corresponding threshold, the system issues an early warning.
In step 6), the model is fine-tuned according to the specific application scenario to enhance its generalization and transfer ability, comprising the following steps:
6.1) migrating the model to the specific application scenario and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading the dataset of the new scenario and retraining the parameters of the fully connected layers.
The following is the human behavior recognition system based on multi-target detection 3D CNN provided by this embodiment, comprising:
a data collection module: for collecting the original video data for human behavior analysis, including public behavior datasets and video datasets from actual scenarios. In this embodiment, monocular and binocular cameras are used to collect video data in actual scenarios, and public human behavior datasets are downloaded, together forming the total collected dataset.
a data preprocessing module: for preprocessing the original video data: classification calibration, target detection, cropping, and video frame conversion. In this embodiment, the video data is processed with the "FFmpeg" tool and converted into image frame sets, and at the same time the videos are calibrated and cropped with the SSD (Single Shot MultiBox Detector) detection algorithm to generate the multi-target frame dataset.
a feature extraction module: for feeding the preprocessed data into the built 3D CNN network models and extracting the behavior feature information of the video stream and the behavior-subject feature information of the calibrated crops, respectively. In this embodiment, 3D CNN models are used: with 16 consecutive frames as the model input, 5 3D convolution layers and 5 3D max-pooling layers extract the two kinds of feature information as the input of the feature fusion module.
a feature fusion module: for fusing the feature information obtained by the feature extraction module. In this embodiment, a 1-layer 3D feature fusion structure fuses the two kinds of behavior feature information, and 3 fully connected layers further extract and classify the features.
a model training module: for building a learning model on the preprocessed training set to obtain the trained multi-target detection 3D CNN human behavior recognition model. In this embodiment, the public human behavior datasets "UCF-101" and "HMDB51" and the self-collected real dataset are combined to compose the training dataset.
a human behavior recognition module: for classifying and recognizing human behaviors and actions using the multi-target detection 3D CNN human behavior recognition model. In this embodiment, classification and recognition is performed with the Softmax classifier.
In the above embodiments, the included modules are divided according to the functional logic of the present invention, but the division is not limited to the above, as long as the corresponding functions can be realized; the division is not intended to restrict the protection scope of the present invention.
In conclusion, the human behavior recognition method and system based on multi-target detection 3D CNN provided by the present invention not only compensate for the deficiency of 2D neural networks in extracting features in the time dimension, but also adopt a multi-target detection approach: the SSD (Single Shot MultiBox Detector) target detection algorithm is introduced to calibrate the behavior subjects in the video stream, so as to obtain more detailed local features that are fused into the model, compensating for the weakening of the model's global features. At the same time, the behavior features learned by the model can not only be used for classification and recognition, but can also serve for early-warning reporting: the model prejudges and reports special behaviors according to the set warning thresholds, increasing the practical application scenarios of the model. The model of the present invention can also be migrated and applied on Internet-of-Things platforms such as smart homes, intelligent surveillance, and intelligent anti-theft, and has extensive research and use value as well as value for popularization.
The embodiments described above are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Therefore, all changes made according to the shapes and principles of the present invention should be included within the protection scope of the present invention.
Claims (7)
1. A human behavior recognition method based on multi-target detection 3D CNN, characterized by comprising the following steps:
1) preprocessing the video to convert the video stream into image frames;
2) calibrating and cropping the target objects in the video using the SSD detection technique;
3) building the feature extraction network structures for the image frame data and the calibrated cropped data;
4) building a feature fusion model to fuse the two kinds of features extracted in step 3);
5) classifying with a Softmax regression classifier;
6) fine-tuning the trained model according to the actual application scenario or public datasets, to enhance the generalization and transfer ability of the model.
2. The human behavior recognition method based on multi-target detection 3D CNN according to claim 1, characterized in that in step 1), the video is preprocessed and the video stream is converted into image frames, comprising the following steps:
1.1) obtaining the video datasets: public datasets are mainly used for model training, while the test dataset is collected by cameras in real environments;
1.2) archiving the video datasets: video data of the same action behavior are filed into the same folder, and the folder is named with its behavior label;
1.3) preprocessing the video datasets: all videos are converted into their corresponding image frame sets by a video conversion shell script;
1.4) splitting the image frame sets obtained in step 1.3) using cross-validation, for model training;
in step 2), the target objects in the video are calibrated and cropped using the SSD detection technique, comprising the following steps:
2.1) loading the trained SSD detection model;
2.2) reading the video stream data, feeding it into the SSD detection model, and performing calibration detection on each frame of the video;
2.3) setting the crop size of the calibrated data to half the size of each frame in the frame set of step 1.3), and converting and saving all videos as calibrated image frame sets.
3. The human behavior recognition method based on multi-target detection 3D CNN according to claim 1, wherein in step 3) the feature extraction network structures for the image frame data and the calibrated cropped data are established as follows:
firstly, a 3D convolutional neural network model based on the image frame dataset and a 3D convolutional neural network model based on the human-detection module dataset are built respectively; then, taking 16 consecutive frames as the input of each model, 5 layers of 3D convolution, 5 layers of 3D max pooling, 1 feature fusion layer and 3 fully connected layers are applied; to prevent overfitting during training, L2 regularization is applied to the 5 convolutional layers and dropout (0.5) is added to the fully connected layers;
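The layer schedule above can be traced with a small shape calculator. This is a sketch: the 112x112 spatial input size, 'same'-padded 3x3x3 kernels, and the C3D-style pooling schedule (1x2x2 in the first stage, 2x2x2 afterwards) are assumptions the claim does not fix.

```python
def conv3d_shape(shape):
    """A 'same'-padded 3x3x3 convolution keeps the (frames, H, W) extent."""
    return shape

def pool3d_shape(shape, pool):
    """3D max pooling floors each of (frames, H, W) by its pool size."""
    return tuple(s // p for s, p in zip(shape, pool))

# 16 consecutive frames as claimed; spatial size and pooling schedule assumed.
shape = (16, 112, 112)
for pool in [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]:
    shape = pool3d_shape(conv3d_shape(shape), pool)  # 5 x (3D conv + 3D max pool)
print(shape)  # extent of the volume entering the fusion layer
```

Under these assumptions the five pooling stages collapse the 16-frame temporal axis to 1, which is why a single fusion layer and three fully connected layers suffice after the convolutional stack.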
In step 4), the feature fusion model is established and the features are fused, comprising the following steps:
4.1) obtaining the 3D convolutional features extracted respectively by the 3D convolutional neural network model based on the image frame dataset and by the 3D convolutional neural network model based on the human-detection module dataset, and applying a Flatten() operation to the obtained features as the input of the fusion layer;
4.2) completing the fusion of the intermediate features as the input of the fully connected layers.
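Steps 4.1) and 4.2) can be sketched in NumPy. This is an illustration: the per-branch feature shape (including the 512 channels) is an assumption, and concatenation is used here as one common fusion choice, which the claim does not specify.

```python
import numpy as np

# Features from the two 3D CNN branches: the full-frame stream and the
# SSD-cropped human stream. Shapes are illustrative assumptions.
stream_feat = np.random.rand(1, 3, 3, 512)  # full-frame branch
human_feat = np.random.rand(1, 3, 3, 512)   # calibrated-crop branch

# Step 4.1: Flatten() each branch; step 4.2: fuse the intermediate
# features (concatenation shown) as the fully-connected-layer input.
fused = np.concatenate([stream_feat.reshape(-1), human_feat.reshape(-1)])
print(fused.shape)
```

Fusing after flattening means the fully connected layers see a single vector carrying both whole-scene motion context and person-centered detail, which is the stated point of the two-branch design.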
4. The human behavior recognition method based on multi-target detection 3D CNN according to claim 1, wherein in step 5) the classification using the Softmax classifier comprises the following steps:
5.1) after the feature fusion of step 4) is completed, passing the fused features through three fully connected layers as the input of the Softmax classifier, and then performing classification;
5.2) setting alert thresholds, wherein after the recognition rate of a certain behavior action reaches its corresponding threshold, the system issues an early warning.
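The Softmax classification and per-behavior alert threshold of steps 5.1)-5.2) can be sketched as follows; the behavior classes, the threshold values, and the example logits are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the final fully-connected output."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Per-behavior alert thresholds (step 5.2); names and values are assumed.
thresholds = {"walk": 0.9, "fall": 0.6, "fight": 0.6}
classes = list(thresholds)

logits = np.array([0.2, 2.5, 0.1])  # output of the third FC layer (assumed)
probs = softmax(logits)
label = classes[int(probs.argmax())]
if probs.max() >= thresholds[label]:
    print(f"ALERT: {label} ({probs.max():.2f})")
```

Keeping one threshold per class lets benign behaviors (like walking) demand higher confidence before triggering an alert than safety-critical ones (like falling).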
5. The human behavior recognition method based on multi-target detection 3D CNN according to claim 1, wherein in step 6) the trained model is fine-tuned according to the actual application scenario or a public dataset to enhance the generalization and transfer ability of the model, comprising the following steps:
6.1) migrating the model into the specific application scenario, and freezing the convolution and pooling layer parameters of the model;
6.2) changing the input and output layers of the model;
6.3) loading the dataset of the new scene, and retraining the parameters of the fully connected layers.
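Steps 6.1)-6.3) amount to a standard transfer-learning recipe: freeze the feature extractor, retrain the head. A framework-agnostic sketch (the layer names mirror the claimed architecture but are illustrative):

```python
# Layer list matching the claimed stack: 5 conv3d, 5 maxpool3d,
# 1 fusion layer, 3 fully connected layers. Names are assumptions.
layers = (["conv3d_%d" % i for i in range(1, 6)]
          + ["maxpool3d_%d" % i for i in range(1, 6)]
          + ["fusion", "fc1", "fc2", "fc3"])

def fine_tune_plan(layers):
    """Return {layer_name: trainable?} for fine-tuning in a new scene.

    Step 6.1 freezes convolution and pooling parameters; step 6.3
    retrains the fully connected layers (the fusion layer itself
    carries no weights, so marking it trainable is harmless).
    """
    frozen_prefixes = ("conv3d", "maxpool3d")
    return {name: not name.startswith(frozen_prefixes) for name in layers}

plan = fine_tune_plan(layers)
trainable = [name for name, t in plan.items() if t]
print(trainable)
```

Step 6.2's swap of input and output layers corresponds to re-shaping the first and last layers for the new scene's input resolution and class count before this retraining pass.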
6. A human behavior recognition system based on multi-target detection 3D CNN, comprising:
a data collection module, for collecting raw video data for human behavior analysis, including public behavior datasets and video datasets from actual scenes;
a data preprocessing module, for preprocessing the raw video data, including classification calibration, target detection, cropping and video frame conversion;
a feature extraction module, for feeding the preprocessed data into the constructed 3D CNN network models respectively, to extract the behavior feature information of the video stream and the behavior-subject feature information of the calibrated crops;
a feature fusion module, for fusing the feature information obtained by the feature extraction module;
a model training module, for building a learning model on the preprocessed training set to obtain the trained multi-target detection 3D CNN human behavior recognition model; and
a human behavior recognition module, for classifying and recognizing human behavior actions using the multi-target detection 3D CNN human behavior recognition model.
7. The human behavior recognition system based on multi-target detection 3D CNN according to claim 6, wherein: the data collection module collects video data in actual scenes through a monocular camera and a binocular camera, and downloads public human behavior datasets; the data preprocessing module processes the video data with the "FFmpeg" tool to convert it into image frame sets, and meanwhile calibrates and crops the videos with the SSD detection algorithm to generate the multi-target frame dataset; the feature extraction module uses a 3D CNN model, taking 16 consecutive frames as the model input, with 5 layers of 3D convolution and 5 layers of 3D max pooling; the feature fusion module uses a 1-layer 3D feature fusion structure to fuse the two kinds of behavior feature information, and 3 fully connected layers further extract and classify the features; the model training module combines the public human behavior datasets "UCF-101" and "HMDB51" with a self-collected real dataset to compose the training dataset; and the human behavior recognition module performs classification and recognition using a Softmax classifier.
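The module decomposition of claims 6 and 7 can be wired together as a pipeline. Every module below is reduced to a stub so only the claimed data flow is illustrated; all names and values are assumptions.

```python
def data_collection():
    # monocular/binocular camera footage plus downloaded public datasets
    return ["scene_cam0.avi"]

def data_preprocessing(videos):
    # FFmpeg frame sets plus SSD-calibrated crops for each video
    return [{"frames": v + ":frames", "crops": v + ":crops"} for v in videos]

def feature_extraction(samples):
    # two 3D CNN branches: full-frame stream and cropped human stream
    return [(s["frames"] + ">feat", s["crops"] + ">feat") for s in samples]

def feature_fusion(features):
    # fuse the two branches' feature information
    return [a + "+" + b for a, b in features]

def behavior_recognition(fused):
    # Softmax classification over the fused features
    return ["walking" for _ in fused]

labels = behavior_recognition(feature_fusion(
    feature_extraction(data_preprocessing(data_collection()))))
print(labels)
```

Each module consumes exactly the previous module's output, matching the system claim's strictly sequential structure: collection, preprocessing, extraction, fusion, training/recognition.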
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910136442.1A CN109977773B (en) | 2019-02-18 | 2019-02-18 | Human behavior identification method and system based on multi-target detection 3D CNN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977773A true CN109977773A (en) | 2019-07-05 |
CN109977773B CN109977773B (en) | 2021-01-19 |
Family
ID=67077264
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910136442.1A Expired - Fee Related CN109977773B (en) | 2019-02-18 | 2019-02-18 | Human behavior identification method and system based on multi-target detection 3D CNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977773B (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348420A (en) * | 2019-07-18 | 2019-10-18 | 腾讯科技(深圳)有限公司 | Sign language recognition method and apparatus, computer-readable storage medium, and computer device |
CN110414415A (en) * | 2019-07-24 | 2019-11-05 | 北京理工大学 | Human behavior recognition method for classroom scenes |
CN110414421A (en) * | 2019-07-25 | 2019-11-05 | 电子科技大学 | Behavior recognition method based on sequential frame images |
CN110532909A (en) * | 2019-08-16 | 2019-12-03 | 成都电科慧安科技有限公司 | Human behavior recognition method based on three-dimensional UWB positioning |
CN111259838A (en) * | 2020-01-20 | 2020-06-09 | 山东大学 | Method and system for deeply understanding human body behaviors in service robot service environment |
CN111382677A (en) * | 2020-02-25 | 2020-07-07 | 华南理工大学 | Human behavior recognition method and system based on a 3D attention residual model |
CN112016461A (en) * | 2020-08-28 | 2020-12-01 | 深圳市信义科技有限公司 | Multi-target behavior recognition method and system |
CN112232190A (en) * | 2020-10-15 | 2021-01-15 | 南京邮电大学 | Method for detecting abnormal behaviors of elderly people in home scenes |
CN112613428A (en) * | 2020-12-28 | 2021-04-06 | 杭州电子科技大学 | ResNet-3D convolution cattle video target detection method based on balanced loss |
CN112766151A (en) * | 2021-01-19 | 2021-05-07 | 北京深睿博联科技有限责任公司 | Binocular target detection method and system for blind-guiding glasses |
CN113052059A (en) * | 2021-03-22 | 2021-06-29 | 中国石油大学(华东) | Real-time action recognition method based on spatio-temporal feature fusion |
CN113221658A (en) * | 2021-04-13 | 2021-08-06 | 卓尔智联(武汉)研究院有限公司 | Training method and device of image processing model, electronic device and storage medium |
CN113420703A (en) * | 2021-07-03 | 2021-09-21 | 西北工业大学 | Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling |
CN113515986A (en) * | 2020-07-02 | 2021-10-19 | 阿里巴巴集团控股有限公司 | Video processing method, data processing method and device |
CN113536847A (en) * | 2020-04-17 | 2021-10-22 | 天津职业技术师范大学(中国职业培训指导教师进修中心) | Industrial scene video analysis system and method based on deep learning |
CN115601714A (en) * | 2022-12-16 | 2023-01-13 | 广东汇通信息科技股份有限公司 | Campus violent behavior identification method based on multi-modal data analysis |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104899561A (en) * | 2015-05-27 | 2015-09-09 | 华南理工大学 | Parallelized human behavior recognition method |
US20180032846A1 (en) * | 2016-08-01 | 2018-02-01 | Nvidia Corporation | Fusing multilayer and multimodal deep neural networks for video classification |
CN108108652A (en) * | 2017-03-29 | 2018-06-01 | 广东工业大学 | Cross-view human behavior recognition method and device based on dictionary learning |
CN108647591A (en) * | 2018-04-25 | 2018-10-12 | 长沙学院 | Video behavior recognition method and system based on visual-semantic features |
CN108985173A (en) * | 2018-06-19 | 2018-12-11 | 奕通信息科技(上海)股份有限公司 | Deep network transfer learning method for apparent-age databases with noisy labels |
CN109002808A (en) * | 2018-07-27 | 2018-12-14 | 高新兴科技集团股份有限公司 | Human behavior recognition method and system |
Also Published As
Publication number | Publication date |
---|---|
CN109977773B (en) | 2021-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977773A (en) | Human behavior recognition method and system based on multi-target detection 3D CNN | |
Mees et al. | Choosing smartly: Adaptive multimodal fusion for object detection in changing environments |
Yin et al. | Recurrent convolutional network for video-based smoke detection |
US10083378B2 (en) | Automatic detection of objects in video images |
CN109598268B (en) | RGB-D salient object detection method based on a single-stream deep network |
CN103871079B (en) | Vehicle tracking method based on machine learning and optical flow |
CN107247956B (en) | Rapid target detection method based on grid judgment |
CN109543697A (en) | RGBD image steganalysis method based on deep learning |
CN111382677B (en) | Human behavior recognition method and system based on a 3D attention residual model |
CN113255443B (en) | Graph attention network temporal action localization method based on a pyramid structure |
CN110929593A (en) | Real-time salient pedestrian detection method based on detail discrimination |
CN104504395A (en) | Method and system for achieving classification of pedestrians and vehicles based on neural network |
Chen et al. | An improved Yolov3 based on dual path network for cherry tomatoes detection |
CN108345894A (en) | Traffic incident detection method based on deep learning and entropy model |
Lu et al. | Multi-object detection method based on YOLO and ResNet hybrid networks |
CN110852190A (en) | Driving behavior recognition method and system integrating target detection and gesture recognition |
CN111723600B (en) | Pedestrian re-recognition feature descriptor based on multi-task learning |
CN110619268A (en) | Pedestrian re-identification method and device based on space-time analysis and depth features |
CN110688938A (en) | Pedestrian re-identification method integrated with attention mechanism |
Yang et al. | Counting crowds using a scale-distribution-aware network and adaptive human-shaped kernel |
CN103500456B (en) | Object tracking method and device based on a dynamic Bayesian network |
CN105469050A (en) | Video behavior recognition method based on local spatio-temporal feature description and pyramid vocabulary tree |
Tsutsui et al. | Distantly supervised road segmentation |
Mathur et al. | A brief survey of deep learning techniques for person re-identification |
CN113205060A (en) | Human action detection method using a recurrent neural network to judge from skeleton morphology |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210119 |