CN107729838A - A head pose estimation method based on deep learning - Google Patents

A head pose estimation method based on deep learning

Info

Publication number
CN107729838A
CN107729838A (application CN201710947730.6A)
Authority
CN
China
Prior art keywords
face
network
tinyposenet
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710947730.6A
Other languages
Chinese (zh)
Inventor
李珊如
刘昕
袁基睿
山世光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seetatech Beijing Technology Co ltd
Original Assignee
Seetatech Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seetatech Beijing Technology Co ltd filed Critical Seetatech Beijing Technology Co ltd
Priority to CN201710947730.6A priority Critical patent/CN107729838A/en
Publication of CN107729838A publication Critical patent/CN107729838A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a head pose estimation method based on deep learning. Its steps are: obtain an image data set for training and annotate each image with the head deflection angles of the face; supplement the data set with additional samples and pre-process it, cropping out the face region; scale all pre-processed face pictures to a resolution of 90 × 90 pixels; use the resulting data set as training samples to train the deep network TinyPoseNet; load the trained TinyPoseNet network model, crop the face from a test picture as in the steps above, cut out the central 80 × 80 pixel region of the image, and run a forward pass of the TinyPoseNet network model to estimate the deflection angles of the subject's head pose in the test picture. The invention has the advantages of a very small computational cost, strong robustness, high precision, fast computation, simple operation, and wide applicability.

Description

A head pose estimation method based on deep learning
Technical field
The present invention relates to an estimation method, and more particularly to a head pose estimation method based on deep learning, belonging to the technical field of computer vision.
Background technology
Head pose is an intrinsic attribute of a person and has important application value in fields such as emotion recognition, fatigue-state monitoring, and liveness verification. Generally, head pose estimation is based on the three directions pitch, yaw, and roll: head movement is treated as rigid motion, with the nose tip as the origin, the horizontal direction as the x-axis, the vertical direction as the y-axis, and the z-axis perpendicular to the plane formed by the x- and y-axes; the clockwise rotation angles around the x-, y-, and z-axes are then defined as the head pose deviation angles in the pitch, yaw, and roll directions. Owing to factors such as illumination, occlusion, and resolution, multi-dimensional head pose estimation has always been a challenging task.
Current head pose estimation methods, classified by the input information of the pose estimation algorithm, fall broadly into the following three classes:
(1) Pose estimation methods based on geometry, which judge the head pose (yaw, pitch, roll) from the relative positions of facial key points such as the mouth corners, nose tip, and eye centers, together with the shape prior and symmetry of the face. Under large poses, however, individual facial key points become invisible, and geometry-based methods then cannot estimate the head pose effectively.
(2) Pose estimation methods based on 2D images, whose classical form extracts face features and then estimates the head pose by regression or classification. Classical local feature descriptors such as Gabor wavelets, HOG features, and LBP features are used to extract face appearance features, on top of which the pose is estimated by a classifier or regressor; this approach therefore suffers from a relatively large computational cost.
(3) Pose estimation methods based on RGB-D images, which introduce depth information and thereby increase the amount of input information for pose estimation. Adding depth information improves robustness to illumination and occlusion, but such methods require a specific input device, so their versatility is limited.
Summary of the invention
To overcome the above shortcomings, the invention provides a head pose estimation method based on deep learning.
To solve the above technical problems, the technical solution adopted by the invention is a head pose estimation method based on deep learning whose overall steps are as follows:
Step 1: obtain an image data set for training and annotate the images; the annotation contains the head deflection angles of the face in the three dimensions pitch, yaw, and roll; rare samples, for which the data set's distribution is unbalanced, are supplemented with additional generated samples;
Step 2: pre-process the pictures of the augmented data set by cropping out the face region and removing all details irrelevant to the face, including hair, everything below the neck, and the background;
Step 3: scale all pre-processed face pictures to a resolution of 90 × 90 pixels;
Step 4: use the uniformly sized data set from step 3 as training samples to train the deep network TinyPoseNet; the optimal network model is obtained after 12 epochs of training.
Step 5: load the trained TinyPoseNet network model; a test picture is put through the operations of steps 2 and 3 to obtain a cropped face image, the central 80 × 80 pixel region is then cut out, and a forward pass of the TinyPoseNet network model is performed, estimating the deflection angles of the subject's head pose in the test picture in an end-to-end manner.
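The test-time steps above (scale to 90 × 90, take the central 80 × 80 crop, run a forward pass, map the outputs back to degrees) can be sketched as follows. The `forward` argument is a stand-in for the trained TinyPoseNet forward pass, assumed here to return three sigmoid outputs in [0, 1] for pitch, yaw, and roll; the 180 × (v − 0.5) mapping is the inverse of the angle normalization described in the loss-function section.

```python
import numpy as np

def center_crop(img, size):
    """Cut out the central size x size patch (here 80 x 80 out of 90 x 90)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def estimate_head_pose(face_90x90, forward):
    """Test-time pipeline sketch; `forward` stands in for the trained
    TinyPoseNet forward pass and is assumed to return three sigmoid
    outputs in [0, 1] for pitch, yaw, and roll."""
    crop = center_crop(face_90x90, 80)
    out = forward(crop)
    # undo the [0, 1] normalization back to degrees in [-90, 90]
    return [180.0 * (v - 0.5) for v in out]

# usage with a dummy "network" that always predicts a frontal pose
face = np.zeros((90, 90, 3), dtype=np.uint8)
print(estimate_head_pose(face, lambda x: [0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0]
```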
The specific sample-augmentation method in step 1 is: using a 3D-based data augmentation method, the face is rotated to the required angles after 3D modeling from its 68 feature points, and the missing samples are then generated by image mapping, so that the angular deflections of the three dimensions are finally balanced across the data set.
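The rotate-and-project idea can be illustrated on landmark coordinates alone; a real implementation would also warp the image pixels. The point cloud and the orthographic projection below are simplifying assumptions, not the patent's exact procedure.

```python
import numpy as np

def yaw_rotate_and_project(points_3d, yaw_deg):
    """Rotate a 3D face point cloud (e.g. obtained by fitting the 68
    landmarks) about the vertical axis by a chosen yaw angle, then
    project orthographically onto the image plane to synthesize the
    landmark layout of a new yaw pose."""
    a = np.radians(yaw_deg)
    rot_y = np.array([[np.cos(a), 0.0, np.sin(a)],
                      [0.0,       1.0, 0.0],
                      [-np.sin(a), 0.0, np.cos(a)]])
    rotated = points_3d @ rot_y.T
    return rotated[:, :2]  # drop depth: orthographic projection

# a point straight ahead of the face moves fully sideways under a 90-degree yaw
print(yaw_rotate_and_project(np.array([[0.0, 0.0, 1.0]]), 90.0))
```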
The picture pre-processing tool used in step 2 is the VIPLFaceDetector face detector.
TinyPoseNet is an 8-layer deep network designed on the basis of the VIPLFaceNet convolutional neural network, comprising 5 convolutional layers and 3 fully connected layers; it is lightweight, with a very small computational cost and robust performance. During training, TinyPoseNet randomly crops 80 × 80 pixel regions from the 90 × 90 pixel training data and trains on those crops, thereby improving the network model's robustness to slight shifts in face location.
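The random-crop augmentation just described can be sketched as below; the generator-based random API is an implementation choice, not something the patent specifies.

```python
import numpy as np

def random_crop(img, size=80, rng=None):
    """Random size x size crop of a training image (80 x 80 out of the
    90 x 90 samples); shifting the crop window during training makes
    the model tolerant to slight face-localization shifts."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]
```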
The present invention realizes TinyPoseNet, a pose estimation deep network model that combines a very small computational cost with robust performance and is suitable for real-time head pose estimation; it solves the problems of weak prediction robustness, low precision, cumbersome pre-processing, poor versatility, and slow speed that conventional head pose estimation methods face.
Brief description of the drawings
Fig. 1 is a schematic diagram of the head rotation poses of a person in three-dimensional coordinates.
Fig. 2 is a schematic diagram of the 3D data augmentation process.
Fig. 3 is a schematic diagram of the TinyPoseNet network structure.
Specific embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
A head pose estimation method based on deep learning, whose concrete steps are:
(1) Data preparation:
The UmdFaces data set is chosen as the source of training samples; the angle changes in UmdFaces are continuous, and it contains 367,920 face photos of 8,501 different subjects, so the data volume is large and the samples are plentiful. The head rotation poses of a person in three-dimensional coordinates are shown in Fig. 1.
Because large-angle data are missing, especially in the yaw and pitch directions, large-angle pictures account for less than 5% of the data; in tests of the first trained model, the results reached at most 53° in the yaw direction and only 30° in the pitch direction. To address this problem, data augmentation of large-angle poses is carried out as follows:
A. As shown in Fig. 2, portrait photos with a pitch deviation angle of 0° are picked out of the UmdFaces data set; a 3D model is built from the 68 facial feature points of each picture, and augmented data at pitch angles of ±30°, ±35°, ±40°, and ±45° are then generated by projection, adding on average 3,000 face images per direction.
B. Large-angle pictures in the yaw direction (±30°, ±45°, ±60°, ±75°, ±90°) are selected from the Multipie data set, with on average 3,700 photos per direction chosen at random.
(2) Image pre-processing:
A portrait picture contains too much information unrelated to the face, such as the background, clothing, body, and picture size, so the network cannot learn useful head pose information well, and the training loss may even fail to converge. The training pictures therefore need to be pre-processed: details irrelevant to the face are removed, and the face pictures are normalized to a fixed size for training.
The present invention uses the VIPLFaceDetector pre-processor to detect the faces in a picture; the detected face region is cropped, resized to 90 × 90 pixels, and saved in picture format for training. When multiple faces are encountered, the face most similar to the face box provided by the UmdFaces data set is selected as the pre-processing target. In the parameter settings, MinFaceSize (the minimum face scale parameter) is set to 28 pixels, ScoreThresh (the score thresholds) is set to (0.55, 0.43, 0.95), and ImagePyramidScaleFactor (the image pyramid scale factor) is set to 1.414.
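The "most similar face" selection can be sketched as below. The VIPLFaceDetector API itself is not reproduced here; `detections` stands in for the boxes it would return, and box overlap (IoU) is an assumed stand-in for the unspecified similarity measure.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter)

def pick_face(detections, reference_box):
    """When several faces are detected, keep the one most similar to the
    face box provided by the data set annotation."""
    return max(detections, key=lambda d: iou(d, reference_box))
```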
(3) Network design of the depth model
The network structure design is the core of representing head pose estimation with deep learning methods. Considering the advantages of AlexNet (a classical convolutional neural network comprising eight learned layers: five convolutional layers and three fully connected layers) in computational efficiency and accuracy when processing image data in the computer vision field, VIPLFaceNet (a deep convolutional neural network, a DCNN comprising 7 convolutional layers and 2 fully connected layers), which evolved from the AlexNet network, was first selected as the deep network. Judged by its face recognition results, VIPLFaceNet is clearly superior to AlexNet; in terms of computation, the layer structure of VIPLFaceNet (7 convolutional and 2 fully connected layers) alone amounts to about 90% of AlexNet's computation, and VIPLFaceNet additionally reduces the number of feature maps in each convolutional layer, so that its final computational cost is about 60% of AlexNet's.
Compared with VIPLFaceNet, TinyPoseNet (a lightweight convolutional network) does similar work and goes further: it reduces the number of convolutional layers and the number of feature maps per convolutional layer, cutting the computational cost, and was therefore finally selected as the deep network for training. A specific comparison of AlexNet, VIPLFaceNet, and TinyPoseNet is shown in Table 1, where S denotes the stride, G denotes the convolution group, and Pad denotes the padding operation; for brevity, the ReLU layers are omitted from the table.
Compared with VIPLFaceNet, the principal features of TinyPoseNet are as follows:
I. Head pose estimation is a multi-label regression problem; within each rough pose class the feature variation is small, so the task is simpler than face recognition and the computation can be reduced accordingly. Two convolutional layers are therefore removed, the number of convolution kernels is reduced to a quarter, and the kernel sizes are re-adjusted, so that the forward computation time is reduced by a factor of 20.
II. Following the 1 × 1 convolutions introduced in Network In Network, two 1 × 1 convolution kernels are employed, reducing the computation of the network while deepening it.
III. The number of nodes of the network's first fully connected layer is reduced from 4096 to 256, and the number of nodes of the feature layer the network finally learns, i.e. the second fully connected layer, is reduced from 4096 to 128; the model size shrinks from 194 MB to 1.8 MB. Experiments show that the accuracy does not decrease noticeably after this pruning; on the contrary, the generalization ability in real scenes improves.
IV. The Dropout layers are removed: since the number of output nodes is inherently small, no Dropout operation is needed.
V. The last fully connected layer is followed by a sigmoid layer. Head pose deviations are in most cases concentrated within ±45 degrees, so this operation compresses the useless "long tail" and extends the resolution of the core range.
Table 1. Network structure comparison of AlexNet, VIPLFaceNet, and TinyPoseNet
(4) Design of the loss function:
The structural characteristics of convolutional neural networks make it possible to provide a unified underlying layer structure for a variety of different face analysis tasks; the difference lies in the design of the loss function.
The TinyPoseNet head pose estimation model treats pose estimation as an end-to-end deep regression problem; the TinyPoseNet network structure is shown in Fig. 3. The pose angles of the three dimensions are each normalized to [-90°, 90°]. A sigmoid activation layer is placed before the loss function, as shown in Equation 1:
S(x) = 1 / (1 + e^(-x))    (Equation 1)
where x denotes the data input for each of the three dimensions and S(x) denotes the output of x after the sigmoid function.
On the one-dimensional output layer, the angle values are normalized to between 0 and 1: during network iteration, small angles are converted to values close to 0.5, and the closer x is to 90°, the closer the function value is to 1. The sigmoid function changes rapidly in its core region and flattens out for large values, which matches the pose data, where small-angle samples are abundant and large-angle samples are rare.
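The normalization just described can be written out directly. The linear mapping below is an assumed reconstruction consistent with the text (0° maps to 0.5, ±90° to the ends of the range); the patent does not give it explicitly.

```python
import math

def sigmoid(x):
    """Equation 1: S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def angle_to_target(angle_deg):
    """Normalize a pose angle in [-90, 90] degrees to the [0, 1] range
    the sigmoid output layer regresses to: 0 deg -> 0.5, 90 deg -> 1."""
    return angle_deg / 180.0 + 0.5

print(angle_to_target(0.0), angle_to_target(90.0))  # 0.5 1.0
```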
In TinyPoseNet, the EuclideanLoss loss function is used to perform multi-task regression on the angles of the three directions pitch, yaw, and roll, as shown in Equation 2:
E(W) = 1/(2N) Σ_{n=1..N} ||ŷ_n - y_n||²    (Equation 2)
where W denotes the parameters of the neural network, E(W) denotes the Euclidean loss, y_n denotes the ground truth (the value on the data label), ŷ_n denotes the node output of the network, and N denotes the batch size (the unit of one loss computation). Finally, the pose estimate is expressed as Equation 3:
R = f(ŷ)    (Equation 3)
where R denotes the pose-estimation regressor, ŷ the network output, and f the transformation from the network output to the angle prediction.
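The multi-task EuclideanLoss regression of Equation 2 can be sketched as below; the halved-mean-squared-error form is Caffe's standard EuclideanLoss, with each row of the batch holding the three pose targets.

```python
import numpy as np

def euclidean_loss(pred, target):
    """E(W) = 1/(2N) * sum_n ||yhat_n - y_n||^2 over a batch of N
    samples; each row holds the three pose outputs (pitch, yaw, roll)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n = pred.shape[0]
    return float(np.sum((pred - target) ** 2) / (2.0 * n))

print(euclidean_loss([[1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]))  # 0.5
```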
Error back-propagation is carried out using stochastic gradient descent, as shown in Equation 4:
w_{t+1} = w_t - η_t ∇_w L(w_t, x_i, y_i)    (Equation 4)
where w denotes the parameters of the neural network, w_{t+1} the updated parameters, w_t the current parameters, η_t the learning rate, x_i the feature values of the i-th sample, y_i the label value of the i-th sample, and L(w_t, x_i, y_i) the loss of the network.
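A minimal sketch of one update of Equation 4 follows; momentum and weight decay, which the training procedure also uses, are omitted here for clarity.

```python
def sgd_step(weights, grads, lr):
    """One step of Equation 4: w_{t+1} = w_t - eta_t * dL/dw."""
    return [w - lr * g for w, g in zip(weights, grads)]

print(sgd_step([1.0, 2.0], [0.5, -0.5], 0.02))  # [0.99, 2.01]
```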
(5) Training of the depth model:
In the Caffe (convolutional neural network framework) implementation, the Data and ImageData layers (data layers for different formats in Caffe) do not by default support multi-dimensional labels, so either a multi-label conversion tool or an HDF5 layer is needed for data training. During training, the base learning rate is set to 0.02 and decayed along a polynomial curve, with the power value (the exponent of the learning-rate polynomial) set to 0.5; the momentum value is set to 0.9 and the weight decay parameter to 0.0002. All experiments are carried out on a Titan-X graphics card with 12 GB of video memory, using a modified open-source Caffe platform.
(6) Depth feature extraction:
Using the trained depth model, depth features are extracted from the pre-processed 80 × 80 face picture; the depth feature has dimension 3, and after the data transformation the angular deflections of the head pose in the three dimensions pitch, yaw, and roll are obtained.
The key innovations of the present invention are:
1) Picture features are extracted end to end with a deep model for head pose estimation, and the deep neural network is trained on a large amount of labelled face data. Advantage: the deep model performs very well; by learning from abundant data samples it obtains deep features of the head pose and predicts the deviation angles of the human head better than hand-crafted features. On the UmdFaces test set, the mean absolute error of the three directions is only 1.9°.
2) The network structure is adjusted to develop a new lightweight network model suitable for head pose estimation, realizing a pose estimation deep network with a very small computational cost and robust performance. By continually adjusting the network structure to shrink the trained model, an 8-layer TinyPoseNet network comprising 5 convolutional layers and 3 fully connected layers was designed on the basis of the VIPLFaceNet convolutional neural network. Advantage: as part of face attribute analysis, head pose estimation often serves as a pre-processing or auxiliary step, so in practical use it needs not only high accuracy but, even more, as little computation and as short a processing time as possible. Through the adjustment of the network structure, and without losing precision, the TinyPoseNet model is only 2 MB, compared with the roughly 200 MB models trained with the widely used AlexNet network; in terms of speed, TinyPoseNet reaches 330 FPS on a 3.5 GHz CPU in x64 Release mode.
3) Missing data are supplemented by data augmentation to equalize the data distribution. Investigation of the existing data sets showed that large-angle head pose photos account for only 1% of the data, a serious distribution imbalance: there are almost no photos beyond 65° in the yaw direction, and the maximum angle in the pitch direction reaches only ±30°. To improve the robustness of pose estimation to poses that are rare in the training set, a 3D-model-based data augmentation mechanism is proposed: a batch of frontal portrait photos is first sorted out, a 3D model is built from the 68 facial feature points, deviation angles are set manually on the 3D shape, and data augmentation is performed to balance the data distribution of the three dimensions. Generating a corresponding number of large-angle training photos improves the robustness of the head pose estimation model to rare large-angle poses, with very good results. Advantage: experiments show that after data augmentation the yaw direction can be recognized up to ±89°, compared with the [-53°, 53°] range of the earlier model, an improvement of 67.9%; the pitch results also improve considerably, the range growing from [-29°, 29°] to [-38°, 38°], a 27.5% improvement in recognition; the precision and effect in the roll direction improve slightly.
The overall technical effect of the present invention: the model size is finally reduced to 2 MB, the computation time for each 80 × 80 pixel picture on a 3.5 GHz CPU is only 2.1 milliseconds, and the mean error of the three directions on the UmdFaces test set is only 1.9°.
The above embodiments do not limit the present invention, and the present invention is not restricted to the above examples; variations, modifications, additions, or substitutions made by those skilled in the art within the scope of the technical solution also fall within the protection scope of the present invention.

Claims (4)

  1. A head pose estimation method based on deep learning, characterized in that the overall steps of the method are as follows:
    Step 1: obtain an image data set for training and annotate the images; the annotation contains the head deflection angles of the face in the three dimensions pitch, yaw, and roll; rare samples, for which the data set's distribution is unbalanced, are supplemented with additional generated samples;
    Step 2: pre-process the pictures of the augmented data set by cropping out the face region and removing all details irrelevant to the face, including hair, everything below the neck, and the background;
    Step 3: scale all pre-processed face pictures to a resolution of 90 × 90 pixels;
    Step 4: use the uniformly sized data set from step 3 as training samples to train the deep network TinyPoseNet;
    Step 5: load the trained TinyPoseNet network model; a test picture is put through the operations of steps 2 and 3 to obtain a cropped face image, the central 80 × 80 pixel region is then cut out, and a forward pass of the TinyPoseNet network model is performed, estimating the deflection angles of the subject's head pose in the test picture in an end-to-end manner.
  2. The head pose estimation method based on deep learning according to claim 1, characterized in that the specific sample-augmentation method in step 1 is: using a 3D-based data augmentation method, the face is rotated to the required angles after 3D modeling from its 68 feature points, and the missing samples are then generated by image mapping, so that the angular deflections of the three dimensions are finally balanced across the data set.
  3. The head pose estimation method based on deep learning according to claim 1, characterized in that the picture pre-processing tool used in step 2 is the VIPLFaceDetector face detector.
  4. The head pose estimation method based on deep learning according to claim 1, characterized in that TinyPoseNet is an 8-layer deep network designed on the basis of the VIPLFaceNet convolutional neural network, comprising 5 convolutional layers and 3 fully connected layers; during training, TinyPoseNet randomly crops 80 × 80 pixel regions from the 90 × 90 pixel training data and trains on those crops, thereby improving the network model's robustness to slight shifts in face location.
CN201710947730.6A 2017-10-12 2017-10-12 A kind of head pose evaluation method based on deep learning Pending CN107729838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710947730.6A CN107729838A (en) 2017-10-12 2017-10-12 A kind of head pose evaluation method based on deep learning


Publications (1)

Publication Number Publication Date
CN107729838A true CN107729838A (en) 2018-02-23

Family

ID=61210992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710947730.6A Pending CN107729838A (en) 2017-10-12 2017-10-12 A kind of head pose evaluation method based on deep learning

Country Status (1)

Country Link
CN (1) CN107729838A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257139A (en) * 2018-02-26 2018-07-06 中国科学院大学 RGB-D three-dimension object detection methods based on deep learning
CN108446661A (en) * 2018-04-01 2018-08-24 桂林电子科技大学 A kind of deep learning parallelization face identification method
CN108510061A (en) * 2018-03-19 2018-09-07 华南理工大学 The method that more positive faces of monitor video human face segmentation of confrontation network are generated based on condition
CN108596839A (en) * 2018-03-22 2018-09-28 中山大学 A kind of human-face cartoon generation method and its device based on deep learning
CN108920999A (en) * 2018-04-16 2018-11-30 深圳市深网视界科技有限公司 A kind of head angle prediction model training method, prediction technique, equipment and medium
CN108921070A (en) * 2018-06-22 2018-11-30 北京旷视科技有限公司 Image processing method, model training method and corresponding intrument
CN109285216A (en) * 2018-08-24 2019-01-29 太平洋未来科技(深圳)有限公司 Three-dimensional face images method, apparatus and electronic equipment are generated based on shielded image
CN109359526A (en) * 2018-09-11 2019-02-19 深圳大学 A kind of face pose estimation, device and equipment
CN109697413A (en) * 2018-12-13 2019-04-30 合肥工业大学 Personality analysis method, system and storage medium based on head pose
CN109711239A (en) * 2018-09-11 2019-05-03 重庆邮电大学 Based on the visual attention detection method for improving mixing increment dynamic bayesian network
CN109829898A (en) * 2019-01-17 2019-05-31 柳州康云互联科技有限公司 One kind is for measurement detection system and method neural network based in internet detection
CN109948672A (en) * 2019-03-05 2019-06-28 张智军 A kind of wheelchair control method and system
CN109977781A (en) * 2019-02-26 2019-07-05 上海上湖信息技术有限公司 Method for detecting human face and device, readable storage medium storing program for executing
CN110097021A (en) * 2019-05-10 2019-08-06 电子科技大学 Face pose estimation based on MTCNN
CN110647865A (en) * 2019-09-30 2020-01-03 腾讯科技(深圳)有限公司 Face gesture recognition method, device, equipment and storage medium
CN110826402A (en) * 2019-09-27 2020-02-21 深圳市华付信息技术有限公司 Multi-task-based face quality estimation method
CN111046707A (en) * 2018-10-15 2020-04-21 天津大学青岛海洋技术研究院 Face restoration network in any posture based on facial features
CN111083438A (en) * 2019-12-04 2020-04-28 广东康云科技有限公司 Unmanned inspection method, system and device based on video fusion and storage medium
CN111222469A (en) * 2020-01-09 2020-06-02 浙江工业大学 Coarse-to-fine human face posture quantitative estimation method
CN111259802A (en) * 2020-01-16 2020-06-09 东北大学 Head posture estimation-based auxiliary aphasia paralytic patient demand expression method
CN111291655A (en) * 2020-01-21 2020-06-16 杭州微洱网络科技有限公司 Head pose matching method for 2d image measured in E-commerce image
CN111753596A (en) * 2019-03-29 2020-10-09 商汤集团有限公司 Neural network training method and device, electronic equipment and storage medium
CN112132058A (en) * 2020-09-25 2020-12-25 山东大学 Head posture estimation method based on multi-level image feature refining learning, implementation system and storage medium thereof
CN112634363A (en) * 2020-12-10 2021-04-09 上海零眸智能科技有限公司 Shelf attitude estimation method
CN113362353A (en) * 2020-03-04 2021-09-07 上海分众软件技术有限公司 Method for identifying advertising player frame by utilizing synthesis training picture
JP6936452B1 (en) * 2020-12-04 2021-09-15 茨城県 X-ray photography learning device, X-ray photography learning program and X-ray photography learning system
CN113705440A (en) * 2021-08-27 2021-11-26 华中师范大学 Head posture estimation method and system for visual understanding of educational robot
CN114821705A (en) * 2022-03-17 2022-07-29 南京邮电大学 Human head posture estimation method based on classification before regression

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320484A (en) * 2008-07-17 2008-12-10 清华大学 Three-dimensional face recognition method based on fully automatic face localization
CN101561874A (en) * 2008-07-17 2009-10-21 清华大学 Method for recognizing face images
CN102254177A (en) * 2011-04-22 2011-11-23 哈尔滨工程大学 Bearing fault detection method using an SVM (support vector machine) on imbalanced data
CN103198330A (en) * 2013-03-19 2013-07-10 东南大学 Real-time face pose estimation method based on depth video streams
CN104091073A (en) * 2014-07-11 2014-10-08 中国人民解放军国防科学技术大学 Sampling method for imbalanced virtual-asset transaction data
CN104732203A (en) * 2015-03-05 2015-06-24 中国科学院软件研究所 Emotion recognition and tracking method based on video information
CN105487526A (en) * 2016-01-04 2016-04-13 华南理工大学 FastRVM (fast relevance vector machine) fault diagnosis method for wastewater treatment
CN105631439A (en) * 2016-02-18 2016-06-01 北京旷视科技有限公司 Face image collection method and device
CN105760836A (en) * 2016-02-17 2016-07-13 厦门美图之家科技有限公司 Multi-angle face alignment method and system based on deep learning, and photographing terminal
CN106251294A (en) * 2016-08-11 2016-12-21 西安理工大学 Virtual multi-pose generation method from a single frontal face image
CN106384098A (en) * 2016-09-23 2017-02-08 北京小米移动软件有限公司 Image-based head pose detection method, device and terminal
CN107038429A (en) * 2017-05-03 2017-08-11 四川云图睿视科技有限公司 Multi-task cascaded face alignment method based on deep learning

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257139B (en) * 2018-02-26 2020-09-08 中国科学院大学 RGB-D three-dimensional object detection method based on deep learning
CN108257139A (en) * 2018-02-26 2018-07-06 中国科学院大学 RGB-D three-dimensional object detection method based on deep learning
CN108510061A (en) * 2018-03-19 2018-09-07 华南理工大学 Method for synthesizing frontal faces from multiple surveillance videos based on a conditional generative adversarial network
CN108510061B (en) * 2018-03-19 2022-03-29 华南理工大学 Method for synthesizing frontal faces from multiple surveillance videos based on a conditional generative adversarial network
CN108596839A (en) * 2018-03-22 2018-09-28 中山大学 Face cartoon generation method and device based on deep learning
CN108446661A (en) * 2018-04-01 2018-08-24 桂林电子科技大学 Parallelized deep learning face recognition method
CN108446661B (en) * 2018-04-01 2021-11-19 桂林电子科技大学 Parallelized deep learning face recognition method
CN108920999A (en) * 2018-04-16 2018-11-30 深圳市深网视界科技有限公司 Head angle prediction model training method, prediction method, device and medium
CN108921070A (en) * 2018-06-22 2018-11-30 北京旷视科技有限公司 Image processing method, model training method and corresponding apparatus
CN109285216B (en) * 2018-08-24 2023-06-09 太平洋未来科技(深圳)有限公司 Method, apparatus and electronic device for generating three-dimensional face images based on occluded images
CN109285216A (en) * 2018-08-24 2019-01-29 太平洋未来科技(深圳)有限公司 Method, apparatus and electronic device for generating three-dimensional face images based on occluded images
CN109359526B (en) * 2018-09-11 2022-09-27 深圳大学 Face pose estimation method, device and equipment
CN109711239A (en) * 2018-09-11 2019-05-03 重庆邮电大学 Visual attention detection method based on an improved hybrid incremental dynamic Bayesian network
CN109359526A (en) * 2018-09-11 2019-02-19 深圳大学 Face pose estimation method, device and equipment
CN111046707A (en) * 2018-10-15 2020-04-21 天津大学青岛海洋技术研究院 Facial-feature-based face restoration network for arbitrary poses
CN109697413A (en) * 2018-12-13 2019-04-30 合肥工业大学 Personality analysis method, system and storage medium based on head pose
CN109829898A (en) * 2019-01-17 2019-05-31 柳州康云互联科技有限公司 Neural-network-based measurement and detection system and method for Internet-based testing
CN109829898B (en) * 2019-01-17 2023-05-16 柳州康云互联科技有限公司 Neural-network-based measurement and detection system and method for Internet-based testing
CN109977781A (en) * 2019-02-26 2019-07-05 上海上湖信息技术有限公司 Face detection method and device, and readable storage medium
CN109948672A (en) * 2019-03-05 2019-06-28 张智军 Wheelchair control method and system
CN111753596A (en) * 2019-03-29 2020-10-09 商汤集团有限公司 Neural network training method and device, electronic equipment and storage medium
CN110097021A (en) * 2019-05-10 2019-08-06 电子科技大学 Face pose estimation method based on MTCNN
CN110826402A (en) * 2019-09-27 2020-02-21 深圳市华付信息技术有限公司 Multi-task-based face quality estimation method
CN110826402B (en) * 2019-09-27 2024-03-29 深圳市华付信息技术有限公司 Multi-task-based face quality estimation method
CN110647865B (en) * 2019-09-30 2023-08-08 腾讯科技(深圳)有限公司 Face pose recognition method, device, equipment and storage medium
CN110647865A (en) * 2019-09-30 2020-01-03 腾讯科技(深圳)有限公司 Face pose recognition method, device, equipment and storage medium
CN111083438A (en) * 2019-12-04 2020-04-28 广东康云科技有限公司 Unmanned inspection method, system and device based on video fusion and storage medium
CN111083438B (en) * 2019-12-04 2021-05-25 广东康云科技有限公司 Unmanned inspection method, system and device based on video fusion and storage medium
CN111222469A (en) * 2020-01-09 2020-06-02 浙江工业大学 Coarse-to-fine quantitative face pose estimation method
CN111259802A (en) * 2020-01-16 2020-06-09 东北大学 Head-pose-estimation-based method for helping aphasic and paralyzed patients express their needs
CN111291655A (en) * 2020-01-21 2020-06-16 杭州微洱网络科技有限公司 Head pose matching method for 2D measurement images in e-commerce imagery
CN111291655B (en) * 2020-01-21 2023-06-06 杭州微洱网络科技有限公司 Head pose matching method for 2D measurement images in e-commerce imagery
CN113362353A (en) * 2020-03-04 2021-09-07 上海分众软件技术有限公司 Method for recognizing advertising display frames using synthesized training images
CN112132058A (en) * 2020-09-25 2020-12-25 山东大学 Head pose estimation method based on multi-level image feature refinement learning, and implementation system and storage medium therefor
CN112132058B (en) * 2020-09-25 2022-12-27 山东大学 Head pose estimation method, implementation system thereof and storage medium
JP2022089206A (en) * 2020-12-04 2022-06-16 茨城県 Radiography learning device, radiography learning program, and radiography learning system
JP6936452B1 (en) * 2020-12-04 2021-09-15 茨城県 X-ray photography learning device, X-ray photography learning program and X-ray photography learning system
CN112634363B (en) * 2020-12-10 2023-10-03 上海零眸智能科技有限公司 Shelf pose estimation method
CN112634363A (en) * 2020-12-10 2021-04-09 上海零眸智能科技有限公司 Shelf pose estimation method
CN113705440A (en) * 2021-08-27 2021-11-26 华中师范大学 Head pose estimation method and system for educational robot visual understanding
CN113705440B (en) * 2021-08-27 2023-09-01 华中师范大学 Head pose estimation method and system for educational robot visual understanding
CN114821705A (en) * 2022-03-17 2022-07-29 南京邮电大学 Head pose estimation method based on classification-then-regression

Similar Documents

Publication Publication Date Title
CN107729838A (en) Head pose evaluation method based on deep learning
US10417775B2 (en) Method for implementing human skeleton tracking system based on depth data
CN110059522B (en) Human body contour key point detection method, image processing method, device and equipment
CN104463117B (en) Video-based face recognition sample collection method and system
CN108898610A (en) Object contour extraction method based on Mask R-CNN
Viraktamath et al. Face detection and tracking using OpenCV
CN107767335A (en) Image fusion method and system based on facial feature point localization
CN109919013A (en) Face detection method and device for video images based on deep learning
CN112131985B (en) Real-time light human body posture estimation method based on OpenPose improvement
WO2021139557A1 (en) Portrait stick figure generation method and system, and drawing robot
CN105118023B (en) Real-time video face cartoon generation method based on facial feature points
CN107038422A (en) Fatigue state recognition method based on spatial-geometry-constrained deep learning
CN113553979B (en) Safety clothing detection method and system based on improved YOLO V5
CN110348496A (en) Face image fusion method and system
CN109948490A (en) Method for recording specific employee behaviors based on pedestrian re-identification
CN106919899A (en) Method and system for imitating and outputting human facial expressions based on an intelligent robot
CN112784736A (en) Human interaction behavior recognition method based on multi-modal feature fusion
CN110490052A (en) Face detection and facial attribute analysis method and system based on cascaded multi-task learning
CN112529768A (en) Garment editing and generation method based on a generative adversarial network
CN109241810A (en) Virtual character image construction method and device, and storage medium
CN112836680A (en) Vision-based facial expression recognition method
CN108960076A (en) Ear recognition and tracking based on convolutional neural networks
CN105069430B (en) Design method for a multi-pose face detector based on MSNRD features
CN109993135A (en) Gesture recognition method, system and device based on augmented reality
CN106940792A (en) Facial expression sequence truncation method based on feature point motion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180223