CN110110665A - Detection method for hand regions in a driving environment - Google Patents

Detection method for hand regions in a driving environment

Info

Publication number
CN110110665A
CN110110665A (application CN201910378179.7A)
Authority
CN
China
Prior art keywords
frame
hand region
hand
driving environment
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910378179.7A
Other languages
Chinese (zh)
Other versions
CN110110665B (en)
Inventor
林相波
史明明
李一博
戴佐俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuang Yuan Micro Software Co Ltd
Dalian University of Technology
Original Assignee
Beijing Chuang Yuan Micro Software Co Ltd
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuang Yuan Micro Software Co Ltd, Dalian University of Technology filed Critical Beijing Chuang Yuan Micro Software Co Ltd
Priority to CN201910378179.7A priority Critical patent/CN110110665B/en
Publication of CN110110665A publication Critical patent/CN110110665A/en
Application granted granted Critical
Publication of CN110110665B publication Critical patent/CN110110665B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm

Abstract

The invention discloses a method for detecting hand regions in a driving environment, comprising the following steps. Step 1) prepares a data set: images are captured by camera devices mounted at different positions in the cab under real driving conditions; the data set is divided into a training image set and a test image set, data augmentation is applied, and new hand-region labels are generated. Step 2) constructs a hand-detection convolutional neural network that uses a multi-scale architecture to extract and fuse feature information at different scales. Step 3) trains the network end-to-end with the ADAM optimization algorithm, sampling randomly from the training image set and stopping once the loss function L stabilizes. Step 4) applies non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box. Step 5) reports the detection results. The method is easy to implement and is suitable for labeling hand regions in cab environments.

Description

Detection method for hand regions in a driving environment
Technical field
The invention belongs to the field of object detection in computer vision, and in particular relates to a method for detecting hand regions in a driving environment.
Background technique
Hand detection, classification, and tracking have been studied for many years and can be applied in many fields, such as virtual reality, human-computer interaction, and driving-behavior monitoring. Hand regions in natural images are subject to many interfering factors, such as illumination variation, occlusion, hand-shape variation, viewpoint change, and low hand resolution. As a result, hand-region detection in natural images is still far from the accuracy of human recognition, yet many applications cannot depend on inefficient manual inspection. Studying accurate detection methods for human hand regions in natural environments is therefore of great significance. The goal of this work is to detect hand regions in still images of a motor-vehicle cab environment and to study a new method based on deep-learning techniques, providing technical means for driving-behavior detection and similar applications.
Using skin-color information is a common strategy by which many methods obtain better results in hand detection. For example, reference [1] [A. Mittal, A. Zisserman, and P. H. S. Torr. Hand detection using multiple proposals. In British Machine Vision Conference, 2011] proposed a two-stage method in which three complementary detectors based on context, skin color, and sliding-window shape provide hand-region candidate boxes, and a classifier then gives the confidence probability of each candidate box. The drawback of such methods is that, when detecting hand regions in natural images, variations of skin color caused by complex illumination greatly degrade detection performance. Methods using multi-modal information can also obtain good results in certain applications. For example, reference [2] [E. Ohn-Bar, S. Martin, A. Tawari, and M. M. Trivedi. Head, eye, and hand patterns for driver activity recognition. In ICPR, pages 660-665, 2014] extracts HOG features from RGB and depth images simultaneously and combines them with an SVM to detect hand regions and recognize driving behavior; however, because of the limitations of the chosen HOG features, its detection accuracy for hand regions is not high. Reference [3] [X. Zhu, X. Jia, and K. Wong, "Pixel-level hand detection with shape-aware structured forests," in Proceedings of the Asian Conference on Computer Vision. Springer Press, 2014, pp. 64-78] detects hand regions pixel by pixel with a shape-aware structured-forest algorithm; although it performs well on first-person-view hand detection, scanning the entire image pixel by pixel is too time-consuming. Obtaining hand regions indirectly through human-body parsing [4] [L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman, "The chains model for detecting parts by their context," in Proceedings of Computer Vision and Pattern Recognition. IEEE Press, 2010, pp. 25-32] is another hand-region detection scheme, determining hand regions by segmenting the human body into different parts; when occlusion is severe, however, such methods have difficulty detecting hands. With the rapid development of deep-learning techniques, object detection based on convolutional neural networks has improved greatly, for example the region-proposal-based convolutional network series (RCNN, Fast-RCNN, Faster-RCNN, R-FCN) and single-shot detection networks such as YOLO. Although these achieve good results on objects such as cats, dogs, pedestrians, cars, and sofas, their original structures are not accurate enough when the target occupies a relatively small region of the image (such as a human hand) or is occluded, so more effective structures need to be designed. Reference [5] [Lu Ding, Yong Wang, et al. Multi-scale predictions for robust hand detection and classification, arXiv:1804.08220v1 [cs.CV], 2018] proposes a multi-scale R-FCN network structure with 5 convolutional layers that provides hand-region candidate boxes from different scales, extracts and fuses feature maps from different layers, and then outputs the detected hand bounding boxes. Reference [6] [T. Hoang Ngan Le, Kha Gia Quach, Chenchen Zhu, et al. Robust Hand Detection and Classification in Vehicles and in the Wild, CVPRW 2018, pp. 39-46] also uses the R-FCN structure as its basic framework, fuses features of different layers in a multi-scale manner, and screens hand regions from the candidate boxes. Reference [7] [Xiaoming Deng, Ye Yuan, Yinda Zhang, et al., Joint Hand Detection and Rotation Estimation by Using CNN, arXiv:1612.02742v1 [cs.CV], 2016] designs a joint network for hand-region detection and hand-rotation estimation, completing the final hand-region detection through feature sharing.
Summary of the invention
The object of the present invention is to provide a method for detecting hand regions in a driving environment. As a new hand-detection network structure, it requires neither a skin-color model nor an additional feature extractor; the network model is trained on an RGB data set collected in a cab environment, realizing detection of human hand regions and making the method suitable for hand-region labeling in cab environments.
The technical scheme of the invention is a method for detecting hand regions in a driving environment, comprising the following steps:
Step 1) Prepare the data set. The data set is obtained by camera devices mounted at different positions in the cab, shooting inside the cab under real driving conditions; the data set is divided into a training image set and a test image set, data augmentation is then applied to it, and new hand-region labels are generated;
Step 2) Construct the hand-detection convolutional neural network, using a multi-scale architecture to extract and fuse feature information at different scales;
Step 3) Train the network end-to-end with the ADAM optimization algorithm, sampling randomly from the training image set, and stop training once the loss function L stabilizes;
The loss function L is defined as:
L = L_c + L_r (1)
where L_c evaluates whether pixels inside and outside the bounding box are correctly classified, and L_r evaluates whether the bounding-box vertex positions are correctly regressed;
L_c = -α p* (1-p)^γ log p - (1-α)(1-p*) p^γ log(1-p) (2)
where p* is the true pixel classification, p is the network's estimate of the probability that the pixel lies inside the bounding box, α is a balance factor between positive and negative samples, and γ is an empirical value; setting γ = 2 gave the best experimental results;
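The classification term L_c in equation (2) has the form of a focal loss. Below is a minimal NumPy sketch; the value of α is not given in the text, so the common focal-loss default of 0.25 is assumed here:

```python
import numpy as np

def pixel_class_loss(p, p_star, alpha=0.25, gamma=2.0, eps=1e-7):
    """Eq. (2): L_c = -alpha * p* * (1-p)^gamma * log(p)
                      - (1-alpha) * (1-p*) * p^gamma * log(1-p),
    averaged over pixels. alpha = 0.25 is an assumed default."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    pos = -alpha * p_star * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - alpha) * (1.0 - p_star) * p ** gamma * np.log(1.0 - p)
    return float((pos + neg).mean())

# The (1-p)^gamma factor down-weights easy, confident predictions:
good = pixel_class_loss(np.array([0.9]), np.array([1.0]))  # small loss
bad = pixel_class_loss(np.array([0.1]), np.array([1.0]))   # large loss
```

With γ = 2 as in the text, well-classified pixels contribute little to the loss, so training focuses on hard pixels near the box borders.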
where C_i and Ĉ_i denote the regression result and the ground truth of the hand bounding-box coordinates, respectively;
Step 4) Apply non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box;
Step 5) Report the detection results.
As a preferred technical solution, the training image set in step 1) is randomly divided into a training subset and a validation subset at a 9:1 ratio.
As a preferred technical solution, the data-augmentation methods applied to the data set in step 1) include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blurring, and sharpening; after augmentation, the training data grows by at least 22000 images.
As a preferred technical solution, data augmentation in step 1) follows these rules:
Augmentation rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by a factor of 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Augmentation rule 2: random cropping with a margin of 0-16 pixels, horizontal flipping with 50% probability;
Augmentation rule 3: vertical flipping with 100% probability, Gaussian blur with mean 0 and variance 3;
Augmentation rule 4: random rotation with an upper limit of 45°, additive white Gaussian noise at a 20% noise level, and random sharpening with 50% probability.
As a preferred technical solution, the new hand-region labels in step 1) are generated as follows: starting from the four edges of the original bounding box, the box is shrunk inward by a specified length d = 0.2 l_min, where l_min is the shortest box side; pixels inside the shrunk box are labeled 1 and pixels outside it are labeled 0.
As a preferred technical solution, feature extraction and fusion in step 2) comprise three convolution modules and one upsampling feature-fusion stage, as follows:
The input layer takes 256 × 256 images. The first convolution module ConvB_1 contains two convolutional layers and one max-pooling layer, with 3 × 3 kernels and 64 channels; the second convolution module ConvB_2 contains two convolutional layers and one max-pooling layer, with 3 × 3 kernels and 128 channels; the third convolution module ConvB_3 contains three convolutional layers and one max-pooling layer, with 3 × 3 kernels and 256 channels. All pooling layers use a 2 × 2 kernel with stride 2.
The feature map output by ConvB_3 is upsampled to double its size; the feature map output by ConvB_2 has 20% of its channels randomly removed by a Dropout mechanism, and the two are concatenated. After normalization, the fused feature map FusF_1 is fed into the cascaded 1 × 1 and 3 × 3 convolution group ConvC_1, with 128 channels in total; its output passes through one 3 × 3 convolutional layer with 32 kernels before being fed to the output layer. The output layer contains two branches: branch 1 predicts, through a single-channel 1 × 1 convolution, the probability that each pixel lies in the target region; branch 2 predicts, through a 4-channel 1 × 1 convolution, the coordinate values of the bounding-box vertices.
As a preferred technical solution, the detection results in step 5) are evaluated with the following objective quantitative indices: average precision AP, average recall AR, comprehensive evaluation index F1-score, and detection speed FPS.
Assume TP denotes a real target that is correctly detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed; then
FPS is expressed as frames per second.
The invention has the following advantages:
1. The hand-region detection method in a driving environment of the present invention is highly accurate and widely applicable, with low computational complexity, short running time, and a simple, efficient training process; the detection speed reaches 42 fps.
2. The present invention builds its hand-detection model with a deep convolutional neural network structure, which can extract comprehensive hand-related features and is therefore more robust to occlusion, uneven illumination, scale variation, and shape variation.
Brief description of the drawings
The invention will be further described with reference to the accompanying drawings and embodiments:
Fig. 1 shows example detection results under different illumination, different hand shapes, hands of different sizes, and different numbers of hands.
Specific embodiment
Embodiment: Because hand regions vary considerably in size across images, feature maps of different depths are used to express hands of different sizes: deeper features focus on larger hand regions, while shallower features focus on smaller ones. To reduce computational cost, the present invention adopts the idea of a U-shaped convolutional neural network structure and fuses the feature maps step by step. The method comprises the following steps:
Step 1) Prepare the data set. The data set is obtained by camera devices mounted at different positions in the cab, shooting inside the cab under real driving conditions; the aim is to study the performance of hand-region detection methods under cluttered backgrounds, complex lighting conditions, and frequent occlusion. The data set is divided into a training image set and a test image set, data augmentation is then applied to it, and new hand-region labels are generated.
The data set contains 5500 training images and 5500 test images in total; image sizes are uniformly adjusted to 256 × 256 for training and testing. The training image set is randomly divided into a training subset and a validation subset at a 9:1 ratio, so the training subset contains 4950 images and the validation subset 550 images; the test image set contains 5500 images. Camera viewpoints include: mobile shooting, fixed at the front left shooting the driver, fixed at the front right shooting the driver, fixed behind the driver, fixed on the driver's right, fixed overhead, and worn on the driver's head.
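The 9:1 split described above can be sketched as follows; the index-based shuffle is illustrative, since the patent does not specify the sampling procedure:

```python
import random

def split_train_val(items, ratio=0.9, seed=0):
    """Randomly split a list of images (or indices) into training and
    validation subsets at the given ratio (9:1 here)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * ratio)
    return items[:cut], items[cut:]

# 5500 training images -> 4950 for training, 550 for validation
train, val = split_train_val(range(5500))
```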
A deep neural network needs massive amounts of training data to obtain a good model. Therefore, the original data set must be augmented. The augmentation methods applied to the data set include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blurring, and sharpening; after augmentation, the training data grows by at least 22000 images.
Data augmentation follows these rules:
Augmentation rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by a factor of 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Augmentation rule 2: random cropping with a margin of 0-16 pixels, horizontal flipping with 50% probability;
Augmentation rule 3: vertical flipping with 100% probability, Gaussian blur with mean 0 and variance 3;
Augmentation rule 4: random rotation with an upper limit of 45°, additive white Gaussian noise at a 20% noise level, and random sharpening with 50% probability.
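Parts of these augmentation rules can be sketched with NumPy alone. Rotation, blur, and sharpening would require an imaging library (e.g. Pillow or OpenCV); additive Gaussian noise is used below only as a simple stand-in for the blur, so this is an illustrative pipeline rather than the patent's implementation:

```python
import numpy as np

def augment(img, rng):
    """Apply a NumPy-only subset of the augmentation rules to an
    HxWx3 uint8 image: brightness scaling, random crop margin,
    probabilistic horizontal flip, vertical flip, Gaussian noise."""
    out = img.astype(np.float32)
    out *= rng.uniform(1.2, 1.5)              # rule 1: brightness x1.2-1.5
    m = int(rng.integers(0, 17))              # rule 2: crop margin 0-16 px
    if m:
        out = out[m:-m, m:-m]
    if rng.random() < 0.5:                    # rule 2: 50% horizontal flip
        out = out[:, ::-1]
    out = out[::-1]                           # rule 3: vertical flip
    out = out + rng.normal(0.0, np.sqrt(3.0), out.shape)  # noise stand-in
    return np.clip(out, 0.0, 255.0).astype(np.uint8)

rng = np.random.default_rng(0)
aug = augment(np.full((256, 256, 3), 128, dtype=np.uint8), rng)
```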
The hand-region labels provided with the original data set are in bounding-box form, i.e., the vertex coordinates of the boxes. The network output in this patent uses the probability that a pixel falls inside a bounding box, so the original labels must be processed to generate new labels. The new hand-region labels are generated as follows: starting from the four edges of the original bounding box, the box is shrunk inward by a specified length d = 0.2 l_min, where l_min is the shortest box side; pixels inside the shrunk box are labeled 1 and pixels outside it are labeled 0.
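The label-generation rule can be sketched as follows; the array layout and the (x1, y1, x2, y2) coordinate convention are assumptions:

```python
import numpy as np

def make_label(h, w, box):
    """Generate a pixel label map from a bounding box (x1, y1, x2, y2):
    shrink the box inward by d = 0.2 * l_min on every side, then mark
    pixels inside the shrunk box 1 and all other pixels 0."""
    x1, y1, x2, y2 = box
    l_min = min(x2 - x1, y2 - y1)    # shortest box side
    d = int(round(0.2 * l_min))      # inward shrink distance
    label = np.zeros((h, w), dtype=np.uint8)
    label[y1 + d:y2 - d, x1 + d:x2 - d] = 1
    return label

# 100x100 box -> d = 20, so a 60x60 interior is labeled 1
label = make_label(256, 256, (40, 40, 140, 140))
```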
Step 2) Construct the hand-detection convolutional neural network, using a multi-scale architecture to extract and fuse feature information at different scales.
Feature extraction and fusion comprise three convolution modules and one upsampling feature-fusion stage, as follows:
The input layer takes 256 × 256 images. The first convolution module ConvB_1 contains two convolutional layers and one max-pooling layer, with 3 × 3 kernels and 64 channels; the second convolution module ConvB_2 contains two convolutional layers and one max-pooling layer, with 3 × 3 kernels and 128 channels; the third convolution module ConvB_3 contains three convolutional layers and one max-pooling layer, with 3 × 3 kernels and 256 channels. All pooling layers use a 2 × 2 kernel with stride 2.
The feature map output by ConvB_3 is upsampled to double its size; the feature map output by ConvB_2 has 20% of its channels randomly removed by a Dropout mechanism, and the two are concatenated. After normalization, the fused feature map FusF_1 is fed into the cascaded 1 × 1 and 3 × 3 convolution group ConvC_1, with 128 channels in total; its output passes through one 3 × 3 convolutional layer with 32 kernels before being fed to the output layer. The output layer contains two branches: branch 1 predicts, through a single-channel 1 × 1 convolution, the probability that each pixel lies in the target region; branch 2 predicts, through a 4-channel 1 × 1 convolution, the coordinate values of the bounding-box vertices.
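The architecture above can be sketched in PyTorch. This is a reconstruction from the text: the activation functions, padding, and type of normalization (ReLU, padding 1, and batch normalization below) are not stated in the patent and are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 convolutions followed by 2x2 max pooling, stride 2."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

class HandNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convb1 = conv_block(3, 64, 2)     # ConvB_1: 2 convs, 64 ch
        self.convb2 = conv_block(64, 128, 2)   # ConvB_2: 2 convs, 128 ch
        self.convb3 = conv_block(128, 256, 3)  # ConvB_3: 3 convs, 256 ch
        self.drop = nn.Dropout2d(0.2)          # drop 20% of ConvB_2 channels
        self.norm = nn.BatchNorm2d(384)        # FusF_1 normalization (assumed)
        self.convc1 = nn.Sequential(           # ConvC_1: 1x1 then 3x3, 128 ch
            nn.Conv2d(384, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True))
        self.reduce = nn.Conv2d(128, 32, 3, padding=1)  # 32-kernel 3x3 conv
        self.branch1 = nn.Conv2d(32, 1, 1)     # per-pixel in-box probability
        self.branch2 = nn.Conv2d(32, 4, 1)     # box-vertex coordinates

    def forward(self, x):
        f2 = self.convb2(self.convb1(x))          # 64x64, 128 channels
        f3 = self.convb3(f2)                      # 32x32, 256 channels
        up = F.interpolate(f3, scale_factor=2.0)  # upsample back to 64x64
        fus = self.norm(torch.cat([self.drop(f2), up], dim=1))  # FusF_1
        h = self.reduce(self.convc1(fus))
        return torch.sigmoid(self.branch1(h)), self.branch2(h)

model = HandNet().eval()
with torch.no_grad():
    probs, coords = model(torch.zeros(1, 3, 256, 256))
```

For a 256 × 256 input, both output branches have 64 × 64 spatial resolution, matching the scale of ConvB_2 after fusion.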
Step 3) Train the network end-to-end with the ADAM optimization algorithm, sampling randomly from the training image set, and stop training once the loss function L stabilizes.
The loss function L is defined as:
L = L_c + L_r (1)
where L_c evaluates whether pixels inside and outside the bounding box are correctly classified, and L_r evaluates whether the bounding-box vertex positions are correctly regressed;
L_c = -α p* (1-p)^γ log p - (1-α)(1-p*) p^γ log(1-p) (2)
where p* is the true pixel classification, p is the network's estimate of the probability that the pixel lies inside the bounding box, α is a balance factor between positive and negative samples, and γ is an empirical value; setting γ = 2 gave the best experimental results;
where C_i and Ĉ_i denote the regression result and the ground truth of the hand bounding-box coordinates, respectively;
Step 4) During object detection, a large number of overlapping candidate boxes are generated at the same target position, each with a different confidence. Non-maximum suppression is applied to eliminate the redundant candidate boxes and obtain the optimal hand bounding box.
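Step 4 can be sketched as standard greedy non-maximum suppression. The IoU threshold (0.5 below) is an assumption, as the patent does not specify one:

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, discard boxes that
    overlap it above iou_thresh, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two heavily overlapping candidates collapse to the higher-scoring one.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # -> [0, 2]
```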
Step 5) Report the detection results. The detection results are evaluated with the following objective quantitative indices: average precision AP, average recall AR, comprehensive evaluation index F1-score, and detection speed FPS.
Assume TP denotes a real target that is correctly detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed; then
FPS is expressed as frames per second.
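The text elides the exact formulas for these indices, so the standard definitions of precision, recall, and F1 (the harmonic mean of the two) are assumed in this sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = 2PR/(P+R): the harmonic mean of precision and recall."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2.0 * p * r / (p + r)

# The harmonic mean of the reported AP = 98.3 and AR = 86.7 is about 92.1,
# consistent up to rounding with the F value of 92.2 reported in Table 1.
f = 2.0 * 98.3 * 86.7 / (98.3 + 86.7)
```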
The performance of this network in detecting hand regions in RGB still images of the cab environment is evaluated both by subjective visual inspection and by objective quantitative indices. Fig. 1 shows hand-detection results for several typical cases; the method performs well under different illumination, different hand shapes, hands of different sizes, and different numbers of hands.
The quantitative evaluation results of this method on the test set are shown in Table 1 and compared with the best contest result on the VIVA data set, reported in background-art reference [6].
Table 1. Quantitative evaluation indices of hand-region detection on the test set

Method                       AP (%)   AR (%)   F      FPS
This patent                  98.3     86.7     92.2   42
Background-art reference [6] 94.8     74.7     -      4.65
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (7)

1. A method for detecting hand regions in a driving environment, characterized by comprising the following steps:
Step 1) Prepare the data set. The data set is obtained by camera devices mounted at different positions in the cab, shooting inside the cab under real driving conditions; the data set is divided into a training image set and a test image set, data augmentation is then applied to it, and new hand-region labels are generated;
Step 2) Construct the hand-detection convolutional neural network, using a multi-scale architecture to extract and fuse feature information at different scales;
Step 3) Train the network end-to-end with the ADAM optimization algorithm, sampling randomly from the training image set, and stop training once the loss function L stabilizes;
The loss function L is defined as:
L = L_c + L_r (1)
where L_c evaluates whether pixels inside and outside the bounding box are correctly classified, and L_r evaluates whether the bounding-box vertex positions are correctly regressed;
L_c = -α p* (1-p)^γ log p - (1-α)(1-p*) p^γ log(1-p) (2)
where p* is the true pixel classification, p is the network's estimate of the probability that the pixel lies inside the bounding box, α is a balance factor between positive and negative samples, and γ is an empirical value; setting γ = 2 gave the best experimental results;
where C_i and Ĉ_i denote the regression result and the ground truth of the hand bounding-box coordinates, respectively;
Step 4) Apply non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box;
Step 5) Report the detection results.
2. The method for detecting hand regions in a driving environment according to claim 1, characterized in that the training image set in step 1) is randomly divided into a training subset and a validation subset at a 9:1 ratio.
3. The method for detecting hand regions in a driving environment according to claim 1, characterized in that the data-augmentation methods applied to the data set in step 1) include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blurring, and sharpening, and the training data grows by at least 22000 images after augmentation.
4. The method for detecting hand regions in a driving environment according to claim 1, characterized in that data augmentation in step 1) follows these rules:
Augmentation rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by a factor of 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Augmentation rule 2: random cropping with a margin of 0-16 pixels, horizontal flipping with 50% probability;
Augmentation rule 3: vertical flipping with 100% probability, Gaussian blur with mean 0 and variance 3;
Augmentation rule 4: random rotation with an upper limit of 45°, additive white Gaussian noise at a 20% noise level, and random sharpening with 50% probability.
5. The method for detecting hand regions in a driving environment according to claim 1, characterized in that the new hand-region labels in step 1) are generated as follows: starting from the four edges of the original bounding box, the box is shrunk inward by a specified length d = 0.2 l_min, where l_min is the shortest box side; pixels inside the shrunk box are labeled 1 and pixels outside it are labeled 0.
6. The method for detecting hand regions in a driving environment according to claim 1, characterized in that feature extraction and fusion in step 2) comprise three convolution modules and one upsampling feature-fusion stage, as follows:
The input layer takes 256 × 256 images. The first convolution module ConvB_1 contains two convolutional layers and one max-pooling layer, with 3 × 3 kernels and 64 channels; the second convolution module ConvB_2 contains two convolutional layers and one max-pooling layer, with 3 × 3 kernels and 128 channels; the third convolution module ConvB_3 contains three convolutional layers and one max-pooling layer, with 3 × 3 kernels and 256 channels. All pooling layers use a 2 × 2 kernel with stride 2.
The feature map output by ConvB_3 is upsampled to double its size; the feature map output by ConvB_2 has 20% of its channels randomly removed by a Dropout mechanism, and the two are concatenated. After normalization, the fused feature map FusF_1 is fed into the cascaded 1 × 1 and 3 × 3 convolution group ConvC_1, with 128 channels in total; its output passes through one 3 × 3 convolutional layer with 32 kernels before being fed to the output layer. The output layer contains two branches: branch 1 predicts, through a single-channel 1 × 1 convolution, the probability that each pixel lies in the target region; branch 2 predicts, through a 4-channel 1 × 1 convolution, the coordinate values of the bounding-box vertices.
7. The method for detecting hand regions in a driving environment according to claim 1, characterized in that the detection results in step 5) are evaluated with the following objective quantitative indices: average precision AP, average recall AR, comprehensive evaluation index F1-score, and detection speed FPS;
Assume TP denotes a real target that is correctly detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed; then
FPS is expressed as frames per second.
CN201910378179.7A 2019-05-08 2019-05-08 Detection method for hand area in driving environment Active CN110110665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910378179.7A CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment


Publications (2)

Publication Number Publication Date
CN110110665A true CN110110665A (en) 2019-08-09
CN110110665B CN110110665B (en) 2021-05-04

Family

ID=67488704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910378179.7A Active CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Country Status (1)

Country Link
CN (1) CN110110665B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364805A (en) * 2020-11-21 2021-02-12 西安交通大学 Rotary palm image detection method
CN112686888A (en) * 2021-01-27 2021-04-20 上海电气集团股份有限公司 Method, system, equipment and medium for detecting cracks of concrete sleeper

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129673A (en) * 2011-04-19 2011-07-20 大连理工大学 Color digital image enhancing and denoising method under random illumination
CN108875732A (en) * 2018-01-11 2018-11-23 北京旷视科技有限公司 Model training and example dividing method, device and system and storage medium
CN109086779A (en) * 2018-07-28 2018-12-25 天津大学 A kind of attention target identification method based on convolutional neural networks
US20190064389A1 (en) * 2017-08-25 2019-02-28 Huseyin Denli Geophysical Inversion with Convolutional Neural Networks
CN109635750A (en) * 2018-12-14 2019-04-16 广西师范大学 A kind of compound convolutional neural networks images of gestures recognition methods under complex background
CN109711288A (en) * 2018-12-13 2019-05-03 西安电子科技大学 Remote sensing ship detecting method based on feature pyramid and distance restraint FCN


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIDAN ZHOU et al.: "HBE: Hand Branch Ensemble Network for Real-time 3D Hand Pose Estimation", 《ECCV 2018》 *
刘万军 et al.: "Adaptive enhancement convolutional neural network image recognition", 《中国图象图形学报》 (Journal of Image and Graphics) *


Also Published As

Publication number Publication date
CN110110665B (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN106548182B (en) Pavement crack detection method and device based on deep learning and main cause analysis
CN109165623B (en) Rice disease spot detection method and system based on deep learning
CN108154102B (en) Road traffic sign identification method
US9639748B2 (en) Method for detecting persons using 1D depths and 2D texture
CN109711264B (en) Method and device for detecting occupation of bus lane
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN107945153A (en) A kind of road surface crack detection method based on deep learning
CN105046197A (en) Multi-template pedestrian detection method based on cluster
CN103390164A (en) Object detection method based on depth image and implementing device thereof
CN103310194A (en) Method for detecting head and shoulders of pedestrian in video based on overhead pixel gradient direction
CN105069807A (en) Punched workpiece defect detection method based on image processing
CN106023257A (en) Target tracking method based on rotor UAV platform
CN102722712A (en) Multiple-scale high-resolution image object detection method based on continuity
CN103778435A (en) Pedestrian fast detection method based on videos
CN104036284A (en) Adaboost algorithm based multi-scale pedestrian detection method
CN111915583B (en) Vehicle and pedestrian detection method based on vehicle-mounted thermal infrared imager in complex scene
CN104657717A (en) Pedestrian detection method based on layered kernel sparse representation
CN104268598A (en) Human leg detection method based on two-dimensional scanning lasers
Liu et al. Multi-type road marking recognition using adaboost detection and extreme learning machine classification
CN105893971A (en) Traffic signal lamp recognition method based on Gabor and sparse representation
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method
CN110110665A (en) The detection method of hand region under a kind of driving environment
CN105354547A (en) Pedestrian detection method in combination of texture and color features
CN106886754A (en) Object identification method and system under a kind of three-dimensional scenic based on tri patch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant