CN107808376A - A hand-raising detection method based on deep learning - Google Patents
A hand-raising detection method based on deep learning
- Publication number: CN107808376A (application CN201711044722.7A)
- Authority: CN (China)
- Prior art keywords: hand, raising, frame, sample, detection
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/0002 — Image analysis; inspection of images, e.g. flaw detection
- G06F18/23 — Pattern recognition; clustering techniques
- G06F18/24147 — Classification by distances to closest patterns, e.g. nearest-neighbour classification
- G06T7/254 — Analysis of motion involving subtraction of images
- G06V20/46 — Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
- G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06T2207/10016 — Video; image sequence
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30232 — Surveillance
- G06T2207/30242 — Counting objects in image
Abstract
The present invention relates to a hand-raising detection method based on deep learning, comprising the following steps: 1) collecting samples, the samples being complex-environment samples; 2) establishing a hand-raising detection model, the model being built on a convolutional neural network structure and trained on the samples with the R-FCN object detection algorithm; 3) using the trained hand-raising detection model to detect hand-raising in the video under test, obtaining the positions of hand-raising boxes. Compared with the prior art, the present invention can detect hand-raising actions in complex environments and achieves high accuracy and recall.
Description
Technical field
The present invention relates to a video detection method, and in particular to a hand-raising detection method based on deep learning.
Background technology
Moving-human detection and activity recognition in video sequences is a research topic spanning computer vision, pattern recognition, artificial intelligence and other fields. Because of its wide application value in commerce, medicine, the military and elsewhere, it has long been a focus of research. However, owing to the diversity and non-rigidity of human behaviour and the intrinsic complexity of video images, proposing a robust, real-time and accurate method remains difficult.
Noisy and highly dynamic backgrounds, varying illumination conditions, and the small size and many possible matching targets make detecting people's hand-raising actions in a typical classroom environment a challenging task.
Document " Haar-Feature Based Gesture Detection of Hand-Raising for Mobile
Robot in HRI Environments " disclose a kind of detection technique of raising one's hand based on Haar features, and this method is trained first
Then two graders, all positions of this method human-face detector scanning input picture raise one's hand to examine to search people with one
Survey device and scan the specific region around face to have detected whether to raise one's hand.This method is divided into training stage and detection-phase.Training
Stage specifically includes:(1) sample is created, training sample is divided into positive sample and negative sample, and wherein positive sample refers to target sample to be checked
This, negative sample refers to other any images;(2) feature extraction, including edge feature, linear feature and central feature;(3)
Cascaded Adaboost are trained, and are completed by calling OpenCV opencv_traincascade programs.Training terminates
A .xml model file is generated afterwards, and the adaboost cascade classifiers of generation, which can detect, raises one's hand to act, and this is also entirely to examine
The key of survey technology.Detection-phase specifically includes:(1) video cuts frame and carries out Face datection;(2) sense based on face constraint is emerging
Interesting regional choice;(3) carry out raising one's hand to detect in the region of interest using the cascade classifier trained.
Although the above method can obtain testing result, still have several drawbacks:(1) need to carry out Face datection, face
The effect quality of detection will directly affect the effect of final detection of raising one's hand;(2) selection of area-of-interest needs to continuously attempt to, right
New detection environment needs to reformulate selection scheme, as testing result not robust;(3) raise one's hand to detect based on Haar features
Ineffective, accuracy rate and recall ratio are relatively low.
Summary of the invention
The object of the present invention is to overcome the above drawbacks of the prior art and provide a hand-raising detection method based on deep learning.
A first object of the present invention is to detect hand-raising actions in complex environments (such as classrooms).
A second object of the present invention is to improve the accuracy of hand-raising detection.
A third object of the present invention is to improve the recall of hand-raising detection.
A fourth object of the present invention is to merge the same hand-raising action across different frames, obtaining a truer count of hand-raises.
The objects of the present invention can be achieved through the following technical solutions:
A hand-raising detection method based on deep learning comprises the following steps:
1) collecting samples, the samples being complex-environment samples;
2) establishing a hand-raising detection model, the model being built on a convolutional neural network structure and trained on the samples with the R-FCN object detection algorithm;
3) using the trained hand-raising detection model to detect hand-raising in the video under test, obtaining the positions of hand-raising boxes.
Further, in step 1), the sample size is greater than 30,000.
Further, step 1) also includes: preserving sample information, the sample information including the video key-frame images, the key-frame image information, and the bounding-box coordinates of the hand-raising targets in the key-frame image information.
Further, step 1) also includes: clustering the sample sizes to obtain the template sizes needed during training.
Further, the convolutional neural network structure includes an intermediate-level fusion layer.
Further, the method also includes the step:
4) merging the same hand-raising action across different frames using a tracking algorithm.
Further, step 4) specifically comprises:
401) obtaining the first image frame and the detected hand-raising box coordinates, establishing one tracklet array for each hand-raising box, and initializing its state to ALIVE;
402) obtaining the next image frame and judging whether a shot (camera-view) change has occurred; if so, changing the state of all tracklet arrays to DEAD, establishing new tracklet arrays and returning to step 402); if not, performing step 403);
403) traversing all hand-raising boxes detected in the current image frame and, using the tracking algorithm, selecting the best-matching tracklet array for each hand-raising box;
404) for each tracklet array not matched in the current image frame, judging whether its state is ALIVE; if so, changing its state to WAIT; if not, changing its state to DEAD; then returning to step 402) until all image frames have been processed.
Further, judging whether a shot change has occurred specifically comprises:
obtaining two adjacent image frames and counting the pixels whose rate of change between corresponding pixels of the two frames exceeds a first threshold; judging whether the number of changed pixels exceeds a second threshold; if so, a shot change is deemed to have occurred; if not, no shot change has occurred.
Further, the method also includes the step:
5) counting the hand-raising actions after detection and merging.
Compared with the prior art, the invention has the following advantages:
1. The present invention uses video images from complex environments as samples for training the hand-raising detection model, so the method is suitable for hand-raising detection in complex environments and adapts well to cluttered backgrounds.
2. The proposed hand-raising detection model is a deep learning model trained on a large number of samples (more than 30,000 hand-raising samples); its accuracy is high, and extensive testing shows an accuracy above 90%.
3. The template sizes required during training are obtained by clustering the sample sizes rather than chosen manually, which effectively improves the model.
4. The template-size clustering and the fusion of intermediate network levels safeguard the model's recall; extensive testing shows a recall above 70%.
5. The tracking algorithm used by the present invention can effectively track the same hand-raising action across frames, so a true count of hand-raises can be obtained, providing a basis for further analysis and evaluation.
Brief description of the drawings
Fig. 1 is the flow diagram of the present invention;
Fig. 2 is the flow diagram of the sample-size clustering;
Fig. 3 is the schematic diagram of the intermediate-level network fusion;
Fig. 4 is the network structure diagram of the hand-raising detection model of the invention;
Fig. 5 is the flow diagram of the merging of hand-raising actions;
Fig. 6 is the flow diagram of the shot-boundary judgment;
Fig. 7 shows detection results from the embodiment.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the present invention, with a detailed implementation and a concrete operating process, but the protection scope of the present invention is not limited to the following embodiment.
As shown in Fig. 1, the present invention provides a hand-raising detection method based on deep learning, comprising the following steps:
1) Collect samples; the samples are complex-environment samples and the sample size exceeds 30,000.
After collection, the sample information must be preserved, including the video key-frame images, the key-frame image information, and the bounding-box coordinates of the hand-raising targets in the key-frame image information.
The sample information can be stored in the format of the PASCAL VOC dataset. PASCAL VOC provides a full set of standardized, high-quality datasets for image recognition and classification. Files stored in this format include JPEGImages, Annotations and so on: JPEGImages holds the video key frames, while Annotations holds the details of the corresponding images and the bounding-box coordinates of the hand-raising targets, where a hand-raising box position is marked by its top-left and bottom-right corner coordinates.
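For concreteness, a minimal sketch of what one such VOC-style annotation could look like, and how it might be read back. The class name "handup", the file names and the coordinate values are illustrative assumptions; the patent fixes only the VOC directory layout and the corner-coordinate box format.

```python
import xml.etree.ElementTree as ET

# A minimal PASCAL VOC-style annotation for one hand-raising box.
# "handup", the file name and the numbers are illustrative, not from the patent.
VOC_XML = """<annotation>
  <folder>JPEGImages</folder>
  <filename>frame_000123.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>handup</name>
    <bndbox>
      <xmin>412</xmin><ymin>96</ymin>
      <xmax>449</xmax><ymax>155</ymax>
    </bndbox>
  </object>
</annotation>"""

def read_boxes(xml_text):
    """Return a list of (name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

print(read_boxes(VOC_XML))  # [('handup', 412, 96, 449, 155)]
```

One annotation file per key frame, named after the image, is the usual VOC convention.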
Templates (anchors) are needed during model training; in the present invention the template sizes are obtained by clustering the sample sizes. In certain embodiments, the sample sizes are clustered with the k-means algorithm and the 9 most representative sizes are selected as templates.
The distance metric in k-means is redefined here as:
d(box, centroid) = 1 - IOU(box, centroid)
where d(box, centroid) denotes the distance between a bounding box box and a centroid centroid, and IOU(box, centroid) denotes the corresponding overlap rate.
In the formula above, IOU (Intersection over Union) denotes the overlap rate between a template anchor (the box) and a pre-labelled hand-raising box ground truth (the centroid), defined as:
IOU(box, centroid) = area(box ∩ centroid) / area(box ∪ centroid)
As shown in Fig. 2 the input detailed process false code of cluster can be described as:
Require:The pre- bounding box for demarcating frame of raising one's hand of input
Ensure:9 kinds of most typical sizes are exported as template size
1:K=9
2:K point is selected as initial barycenter
3:repeat
4:According to range formula:D (box, centroid)=1-IOU (box, centroid)
5:Each bounding box are assigned to nearest barycenter, form k cluster
6:Recalculate the barycenter of each cluster
7:Until clusters do not change
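The clustering steps above can be sketched in Python. This is a minimal illustration under an assumption the patent does not state: the IoU is computed on (width, height) pairs anchored at a common origin, a common simplification when clustering anchor sizes.

```python
import random

def iou_wh(a, b):
    """IoU of two boxes given as (w, h), both anchored at the origin."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(sizes, k=9, seed=0):
    """Cluster (w, h) box sizes with d = 1 - IoU, returning k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(sizes, k)           # step 2: initial centroids
    while True:
        clusters = [[] for _ in range(k)]
        for s in sizes:                        # step 4: nearest centroid = max IoU
            best = max(range(k), key=lambda i: iou_wh(s, centroids[i]))
            clusters[best].append(s)
        new = [                                # step 5: mean (w, h) per cluster
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:                   # step 6: stop when stable
            return sorted(centroids)
        centroids = new

sizes = [(30, 60), (32, 58), (31, 62), (100, 200), (98, 205), (102, 198)]
print(kmeans_anchors(sizes, k=2))
```

With real data one would feed in all labelled box sizes and keep k = 9, as in the embodiment.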
2) Establish the hand-raising detection model; the model is built on a convolutional neural network structure and trained on the samples with the R-FCN object detection algorithm. The convolutional neural network structure includes an intermediate-level fusion layer to enrich the features the network extracts and thereby improve detection accuracy.
In certain embodiments, the convolutional neural network structure uses a revised ResNet-101, with C1, C2, C3, C4 and C5 denoting the outputs of ResNet-101's conv1, conv2, conv3, conv4 and conv5 respectively. As convolutional layers stack up, the receptive field of each convolution kernel grows and the semantic features learned become more advanced, but subtle features are more easily lost. In some environments the resolution of a hand-raising region can be small, so to detect small targets correctly we superimpose the outputs of C3 and C5, making the features the network learns at the C5 level carry high-level semantics and low-level details at the same time. As shown in Fig. 3, res5c_relu is the output of C5; C5_topdown is an upsampling layer that brings C5 up to the same size as C3; finally C5_topdown is superimposed with C3 to obtain the P3 layer, and P3 replaces res5c_relu as the output of C5. This enriches the features the convolutional neural network extracts.
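A toy numerical sketch of this fusion step, assuming nearest-neighbour upsampling and already-matching channel counts (the real network would align channels first, e.g. with a 1x1 convolution, before adding):

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_c3_c5(c3, c5):
    """Intermediate-level fusion: upsample C5 to C3's spatial size and
    add the maps elementwise to form P3 (channel counts assumed equal)."""
    factor = c3.shape[1] // c5.shape[1]
    return c3 + upsample_nn(c5, factor)

c3 = np.ones((8, 16, 16), dtype=np.float32)      # low-level, high resolution
c5 = np.full((8, 4, 4), 2.0, dtype=np.float32)   # high-level semantics
p3 = fuse_c3_c5(c3, c5)
print(p3.shape)  # (8, 16, 16)
```

The essential point is only that P3 keeps C3's spatial resolution while carrying C5's semantics.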
With ResNet-101 as the feature extraction network and the intermediate-level feature-map fusion in place, the model is trained with the R-FCN object detection algorithm. First, a group of basic conv+relu+pooling layers extracts the image's feature maps, which are shared by the subsequent RPN network and detection network. The RPN network generates region proposals: a softmax layer judges whether each anchor belongs to the foreground or the background, and bounding-box regression then corrects the anchors to obtain accurate proposals. The RoI pooling layer collects the input feature maps and proposals, integrates them to extract proposal feature maps, and computes position-sensitive score maps, which are fed into the subsequent detection network to judge the target class. Finally, the proposal feature maps are used to classify each proposal and obtain the final exact position of each detection box.
ResNet-101 contains 5 convolution blocks and 101 layers in total. The standard R-FCN uses the first 4 convolution blocks as the weight-sharing network of the RPN and detection networks, with the 5th convolution block as the feature extraction network of the detection network. The present invention instead uses all 101 layers as the shared network of the RPN and detection networks: the feature map output by the 5th convolution block is shared by both. This arrangement greatly reduces computation while preserving accuracy.
The network of the hand-raising detection model is shown in Fig. 4.
3) Detect hand-raising in the video under test using the trained hand-raising detection model, obtaining the positions of hand-raising boxes.
In certain embodiments, the method also includes the step: 4) tracking each hand-raising action from its position in the previous frame into the next frame, and merging the same hand-raising action across different frames with a tracking algorithm. As long as the camera view does not change, the same hand-raising action can be tracked across frames. The tracking algorithm may use a backtracking-pruning method that optimally matches the hand-raising actions of the previous frame with those of the next frame.
Step 4) specifically comprises:
401) obtaining the first image frame and the detected hand-raising box coordinates, establishing one tracklet array for each hand-raising box, and initializing its state to ALIVE;
402) obtaining the next image frame and judging whether a shot change has occurred; if so, changing the state of all tracklet arrays to DEAD, establishing new tracklet arrays and returning to step 402); if not, performing step 403);
403) traversing all hand-raising boxes detected in the current image frame and selecting the best-matching tracklet array for each hand-raising box with the backtracking-pruning method;
404) for each tracklet array not matched in the current image frame, judging whether its state is ALIVE; if so, changing its state to WAIT; if not, changing its state to DEAD; then returning to step 402) until all image frames have been processed.
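Steps 401)-404) can be sketched as follows. Since the text does not spell out the backtracking-pruning matcher, a greedy IoU match stands in for it here; the tracklet arrays, the ALIVE/WAIT/DEAD states and the shot-change reset follow the steps above.

```python
ALIVE, WAIT, DEAD = "ALIVE", "WAIT", "DEAD"

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_tracklets(frames, shot_change, iou_thr=0.3):
    """frames: per-frame lists of detected boxes; shot_change[i]: whether
    frame i starts a new shot.  Greedy IoU matching is an assumption."""
    tracklets = []  # each: {"boxes": [...], "state": ...}
    for i, boxes in enumerate(frames):
        if i == 0 or shot_change[i]:           # steps 401 / 402 (reset)
            for t in tracklets:
                t["state"] = DEAD
            tracklets += [{"boxes": [b], "state": ALIVE} for b in boxes]
            continue
        live = [t for t in tracklets if t["state"] != DEAD]
        matched = set()
        for b in boxes:                        # step 403: best match per box
            best = max(live, key=lambda t: iou(t["boxes"][-1], b),
                       default=None)
            if (best and id(best) not in matched
                    and iou(best["boxes"][-1], b) >= iou_thr):
                best["boxes"].append(b)
                best["state"] = ALIVE
                matched.add(id(best))
            else:                              # no match: start a new tracklet
                nt = {"boxes": [b], "state": ALIVE}
                tracklets.append(nt)
                live.append(nt)
                matched.add(id(nt))
        for t in live:                         # step 404: demote the unmatched
            if id(t) not in matched:
                t["state"] = WAIT if t["state"] == ALIVE else DEAD

    return tracklets

frames = [[(0, 0, 10, 10)], [(1, 0, 11, 10)], [(50, 50, 60, 60)]]
tracks = merge_tracklets(frames, [False, False, False])
print(len(tracks))  # 2: one box followed for two frames then lost, one new
```

Each surviving tracklet then corresponds to one merged hand-raising action, which is what step 5) counts.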
The pseudocode of the above process can be summarized as:
Require: a set of N input images, each with its detected hand-raising bounding boxes
Ensure: the output tracklets
The merging process for the hand-raising actions of a single image frame is shown in Fig. 5.
Video captured by a camera may contain shot (camera-view) changes. The present invention solves this with the frame-difference method, i.e. subtracting consecutive frames. As shown in Fig. 6, judging whether a shot change has occurred specifically comprises:
obtaining two adjacent image frames and counting the pixels whose rate of change between corresponding pixels of the two frames exceeds a first threshold; judging whether the number of changed pixels exceeds a second threshold; if so, a shot change is deemed to have occurred; if not, no shot change has occurred.
The concrete criterion is whether the white region (i.e. the moving part) exceeds 20% of all pixels; if it does, a shot switch has occurred.
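A minimal sketch of this frame-difference test, assuming grayscale frames. The per-pixel change-rate threshold value is illustrative, while the 20% area criterion comes from the text:

```python
import numpy as np

def shot_changed(prev, cur, change_thr=0.15, area_thr=0.20):
    """Frame-difference shot-boundary test: count the pixels whose relative
    change exceeds the first threshold, and report a shot change when they
    cover more than area_thr of the frame (20% in the text)."""
    prev = prev.astype(np.float32)
    cur = cur.astype(np.float32)
    rate = np.abs(cur - prev) / (prev + 1e-6)  # per-pixel change rate
    moving = rate > change_thr                 # the "white" motion mask
    return moving.mean() > area_thr

a = np.full((10, 10), 100, dtype=np.uint8)
b = a.copy()
b[:5, :] = 200                                 # half the frame changes
print(shot_changed(a, b))  # True
```

In the tracking loop this test runs on each adjacent frame pair before step 403), triggering the DEAD reset of step 402) when it fires.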
Based on the above merging process, the method may also include the step: 5) counting the hand-raising actions after detection and merging.
Embodiment 1
This embodiment illustrates the above method with a primary- and middle-school classroom environment. 40,000 samples were collected, and the hand-raising samples were made in the PASCAL VOC dataset format. Clustering the sample sizes yielded the following 9 anchor box sizes:
(37,59) (44,72) (53,80) (56,96) (67,105) (75,128) (91,150) (115,184) (177,283).
Training in this embodiment ran for 20,000 iterations in total, producing a well-performing hand-raising detection model. Part of the trained model's detection results is shown in Fig. 7.
After the hand-raising actions of different frames were merged with the tracking algorithm, the counts were tallied: the hand-raising frequency over the whole lesson was recorded, completing a count of hand-raises for one lesson. This is used to assess the classroom atmosphere and provides a basis for its intelligent analysis.
Experiments show that the above method achieves high hand-raising detection accuracy and recall: accuracy above 90% and recall above 70%.
The preferred embodiment of the invention is described in detail above. It should be appreciated that one of ordinary skill in the art can make many modifications and variations according to the design of the present invention without creative work. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning or limited experiment on the basis of the prior art under this invention's idea shall fall within the protection scope defined by the claims.
Claims (9)
1. A hand-raising detection method based on deep learning, characterised by comprising the following steps:
1) collecting samples, the samples being complex-environment samples;
2) establishing a hand-raising detection model, the model being built on a convolutional neural network structure and trained on the samples with the R-FCN object detection algorithm;
3) using the trained hand-raising detection model to detect hand-raising in the video under test, obtaining the positions of hand-raising boxes.
2. The hand-raising detection method based on deep learning according to claim 1, characterised in that in step 1) the sample size is greater than 30,000.
3. The hand-raising detection method based on deep learning according to claim 1, characterised in that step 1) further includes: preserving sample information, the sample information including the video key-frame images, the key-frame image information, and the bounding-box coordinates of the hand-raising targets in the key-frame image information.
4. The hand-raising detection method based on deep learning according to claim 1, characterised in that step 1) further includes: clustering the sample sizes to obtain the template sizes needed during training.
5. The hand-raising detection method based on deep learning according to claim 1, characterised in that the convolutional neural network structure includes an intermediate-level fusion layer.
6. The hand-raising detection method based on deep learning according to claim 1, characterised in that the method further includes the step:
4) merging the same hand-raising action across different frames using a tracking algorithm.
7. The hand-raising detection method based on deep learning according to claim 6, characterised in that step 4) specifically comprises:
401) obtaining the first image frame and the detected hand-raising box coordinates, establishing one tracklet array for each hand-raising box, and initializing its state to ALIVE;
402) obtaining the next image frame and judging whether a shot change has occurred; if so, changing the state of all tracklet arrays to DEAD, establishing new tracklet arrays and returning to step 402); if not, performing step 403);
403) traversing all hand-raising boxes detected in the current image frame and, using the tracking algorithm, selecting the best-matching tracklet array for each hand-raising box;
404) for each tracklet array not matched in the current image frame, judging whether its state is ALIVE; if so, changing its state to WAIT; if not, changing its state to DEAD; then returning to step 402) until all image frames have been processed.
8. The hand-raising detection method based on deep learning according to claim 6, characterised in that judging whether a shot change has occurred specifically comprises:
obtaining two adjacent image frames and counting the pixels whose rate of change between corresponding pixels of the two frames exceeds a first threshold; judging whether the number of changed pixels exceeds a second threshold; if so, a shot change is deemed to have occurred; if not, no shot change has occurred.
9. The hand-raising detection method based on deep learning according to claim 6, characterised in that the method further includes the step:
5) counting the hand-raising actions after detection and merging.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711044722.7A CN107808376B (en) | 2017-10-31 | 2017-10-31 | Hand raising detection method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711044722.7A CN107808376B (en) | 2017-10-31 | 2017-10-31 | Hand raising detection method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107808376A true CN107808376A (en) | 2018-03-16 |
CN107808376B CN107808376B (en) | 2022-03-11 |
Family
ID=61591064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711044722.7A Expired - Fee Related CN107808376B (en) | 2017-10-31 | 2017-10-31 | Hand raising detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107808376B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104112121A (en) * | 2014-07-01 | 2014-10-22 | 深圳市欢创科技有限公司 | Face recognition method and device for an interactive game device, and interactive game system |
CN106651765A (en) * | 2016-12-30 | 2017-05-10 | 深圳市唯特视科技有限公司 | Method for automatically generating thumbnails using a deep neural network |
CN107122736A (en) * | 2017-04-26 | 2017-09-01 | 北京邮电大学 | Human body orientation prediction method and device based on deep learning |
CN107145908A (en) * | 2017-05-08 | 2017-09-08 | 江南大学 | Small target detection method based on R-FCN |
CN107273828A (en) * | 2017-05-29 | 2017-10-20 | 浙江师范大学 | Guideboard detection method based on region-based fully convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
TIAGO S. NAZARÉ et al.: "Hand-Raising Gesture Detection with Lienhart-Maydt Method in Videoconference and Distance Learning", 《SPRING》 * |
SANG Nong et al.: "Gesture recognition based on R-FCN in complex scenes", 《Journal of Huazhong University of Science and Technology (Natural Science Edition)》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921748B (en) * | 2018-07-17 | 2022-02-01 | 郑州大学体育学院 | Teaching planning method based on big data analysis and computer readable medium |
CN108921748A (en) * | 2018-07-17 | 2018-11-30 | 郑州大学体育学院 | Teaching planning method and computer-readable medium based on big data analysis |
CN110941976A (en) * | 2018-09-24 | 2020-03-31 | 天津大学 | Student classroom behavior identification method based on convolutional neural network |
CN109508661A (en) * | 2018-10-31 | 2019-03-22 | 上海交通大学 | Hand-raiser detection method based on object detection and pose estimation |
CN109508661B (en) * | 2018-10-31 | 2021-07-09 | 上海交通大学 | Method for detecting hand raisers based on object detection and posture estimation |
CN110163836B (en) * | 2018-11-14 | 2021-04-06 | 宁波大学 | Excavator detection method for high-altitude inspection based on deep learning |
CN110163836A (en) * | 2018-11-14 | 2019-08-23 | 宁波大学 | Deep-learning-based excavator detection method for high-altitude inspection |
CN110414380A (en) * | 2019-07-10 | 2019-11-05 | 上海交通大学 | Student behavior detection method based on object detection |
CN110399822A (en) * | 2019-07-17 | 2019-11-01 | 思百达物联网科技(北京)有限公司 | Hand-raising action recognition method, device and storage medium based on deep learning |
CN112686128A (en) * | 2020-12-28 | 2021-04-20 | 南京览众智能科技有限公司 | Classroom desk detection method based on machine learning |
CN116739859A (en) * | 2023-08-15 | 2023-09-12 | 创而新(北京)教育科技有限公司 | Method and system for online teaching question-and-answer interaction |
CN117670259A (en) * | 2024-01-31 | 2024-03-08 | 天津师范大学 | Sample detection information management method |
CN117670259B (en) * | 2024-01-31 | 2024-04-19 | 天津师范大学 | Sample detection information management method |
Also Published As
Publication number | Publication date |
---|---|
CN107808376B (en) | 2022-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107808376A (en) | Hand-raising detection method based on deep learning | |
Wu et al. | Recent advances in video-based human action recognition using deep learning: A review | |
CN108334848B | Tiny face recognition method based on generative adversarial networks |
Li et al. | Robust visual tracking based on convolutional features with illumination and occlusion handing | |
CN108062525B (en) | Deep learning hand detection method based on hand region prediction | |
CN107316058A | Method for improving object detection performance by improving object classification and localization accuracy |
CN105160310A | 3D (three-dimensional) convolutional neural network based human body behavior recognition method |
CN107481264A | Scale-adaptive video target tracking method |
CN114220035A | Rapid pest detection method based on improved YOLO V4 |
CN109816689A | Moving target tracking method with adaptive fusion of multi-layer convolutional features |
CN107808143A | Dynamic gesture recognition method based on computer vision |
CN107945153A | Road surface crack detection method based on deep learning |
CN105740758A | Internet video face recognition method based on deep learning |
CN107292246A | Infrared human body target recognition method based on HOG-PCA and transfer learning |
Li et al. | Sign language recognition based on computer vision | |
CN106650619A (en) | Human action recognition method | |
CN114241548A (en) | Small target detection algorithm based on improved YOLOv5 | |
CN103400122A | Rapid living-body face recognition method |
CN108664838A | End-to-end pedestrian detection method for surveillance scenes based on an improved RPN deep network |
CN110163567A | Classroom roll-call system based on multi-task cascaded convolutional neural networks |
Yang et al. | Facial expression recognition based on dual-feature fusion and improved random forest classifier | |
CN108171133A | Dynamic gesture recognition method based on feature covariance matrices |
CN114241422A (en) | Student classroom behavior detection method based on ESRGAN and improved YOLOv5s | |
Tanisik et al. | Facial descriptors for human interaction recognition in still images | |
CN109360179A | Image fusion method, device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220311 |