CN106599810A - Head pose estimation method based on stacked auto-encoding - Google Patents


Info

Publication number
CN106599810A
CN106599810A
Authority
CN
China
Prior art keywords
layer, stack, head, parameter, represent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611100343.0A
Other languages
Chinese (zh)
Other versions
CN106599810B (en)
Inventor
潘力立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201611100343.0A priority Critical patent/CN106599810B/en
Publication of CN106599810A publication Critical patent/CN106599810A/en
Application granted granted Critical
Publication of CN106599810B publication Critical patent/CN106599810B/en
Expired - Fee Related
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition

Abstract

The invention discloses a head pose estimation method based on stacked auto-encoding, belonging to the technical field of computer vision. The main idea is to establish a nonlinear mapping relation between head depth images and pose by means of a stacked auto-encoder. The method comprises: first acquiring a large number of head depth images as training samples, extracting their histogram of oriented gradients (HOG) features, and recording the corresponding head poses; then designing the stacked auto-encoder and learning the parameters of each of its layers from the training samples and the calibrated pose data by gradient descent; and finally, for a head image whose pose is to be estimated, extracting its HOG features and estimating the head pose with the learned stacked auto-encoder. Compared with conventional head pose estimation methods, this method can model the complex mapping from input features to head pose and effectively overcomes the low estimation accuracy of shallow models.

Description

Head pose estimation method based on stacked auto-encoding
Technical field
The invention belongs to the technical field of computer vision and relates to the problem of head pose estimation in images.
Background art
Head pose estimation (see Fig. 1) refers to quickly and accurately estimating, from a digital image of the head and by means of machine learning and computer vision methods, the deflection angles of the head in the image, also referred to as the head pose. In recent years it has become a popular research problem in computer vision and machine learning, with wide applications in human-computer interaction, safe driving and attention analysis. For example, in human-computer interaction the deflection angles of the head can be used to control the direction and position indicated on a computer or machine; in safe driving, the head pose can assist gaze estimation and thus prompt the driver towards the correct viewing direction. Head pose estimation has developed further on the basis of advances in manifold learning and subspace theory. Existing head pose estimation methods fall into three broad categories: 1. appearance-based methods, 2. classification-based methods and 3. regression-based methods.
The basic principle of appearance-based head pose estimation methods is to compare the input head image one by one with the images already in a database, and to take the angle corresponding to the most similar image found as the head pose (i.e. the angle) of the image to be estimated. The main drawbacks of such methods are that they can only output discrete head deflection angles and that, because the input must be compared with every existing image in turn, the computational cost is enormous. See: D. J. Beymer, Face Recognition under Varying Pose, IEEE Conference on Computer Vision and Pattern Recognition, pp. 756-761, 1994, and J. Sherrah, S. Gong, and E. J. Ong, Face Distributions in Similarity Space under Varying Head Pose, Image and Vision Computing, vol. 19, no. 12, pp. 807-819, 2001.
Classification-based head pose estimation methods train a classifier from the features of the input images and the corresponding head deflection angles, and use the learned classifier to decide which class the deflection angle of the image to be estimated belongs to, thereby determining the approximate range of the head pose. Classifiers commonly used in such methods include the support vector machine (SVM), linear discriminant analysis (LDA) and kernel linear discriminant analysis (KLDA). The main drawback of this kind of method is that it cannot output continuous head pose estimates. See: J. Huang, X. Shao, and H. Wechsler, Face Pose Discrimination using Support Vector Machines (SVM), International Conference on Pattern Recognition, pp. 154-156, 1998.
Regression-based head pose estimation methods are currently the most common. Their basic principle is to build a mapping function from existing image features and the corresponding head angles, and to use this mapping function to estimate the head pose of the image to be processed. Such methods solve the problem that the two preceding categories cannot output continuous poses, while also reducing the computational complexity. See G. Fanelli, J. Gall, and L. Van Gool, Real Time Head Pose Estimation with Random Regression Forests, IEEE Conference on Computer Vision and Pattern Recognition, pp. 617-624, 2011, and H. Ji, R. Liu, F. Su, Z. Su, and Y. Tian, Convex Regularized Sparse Regression for Head Pose Estimation, IEEE International Conference on Image Processing, pp. 3617-3620, 2011.
Summary of the invention
The task of the present invention is to provide a head pose estimation method based on stacked auto-encoding. The method takes a depth image as input and uses a stacked auto-encoder to find the mapping relation between the depth image and the corresponding head pose. Through this modeling approach, the complex mapping relation between depth image and head pose can be found accurately, which both improves the accuracy of head pose estimation and ensures the efficiency of the estimation.
To facilitate the description of the invention, several terms are defined first.
Definition 1: Head pose. The rotation of the head in three-dimensional space is usually represented by a vector of three elements: the first element is the pitch angle, the second the yaw angle, and the third the roll angle.
Definition 2: Pitch angle. In the x-y-z coordinate system shown in Fig. 2(b), the pitch angle is the angle θ of rotation about the x-axis.
Definition 3: Yaw angle. In the x-y-z coordinate system shown in Fig. 2(a), the yaw angle is the angle φ of rotation about the z-axis.
Definition 4: Roll angle. In the x-y-z coordinate system shown in Fig. 2(c), the roll angle is the angle Ψ of rotation about the z'-axis.
Definition 5: Histogram of oriented gradients (HOG) feature. A visual feature extraction method that describes the appearance and shape of objects in an image by the distribution of intensity gradients or edge directions. It is implemented by first dividing the image into small connected regions called cells; then collecting a histogram of the gradient directions or edge orientations of the pixels within each cell; and finally combining these histograms into the feature descriptor. To improve accuracy, the local histograms can be contrast-normalized over larger regions of the image called blocks, by first computing the density of the histograms within a block and then normalizing each cell in the block by this density value. This normalization gives better robustness to illumination changes and shadows.
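For illustration only, the HOG extraction of Definition 5 might be sketched with scikit-image as below; the image size, cell size and block size are assumptions chosen for the sketch (the embodiment later uses 96×128 images and a 1440-dimensional descriptor, whose exact HOG parameters are not specified here), not values fixed by this definition.

# Sketch of HOG feature extraction (assumed parameters, not the patent's exact settings).
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def extract_hog(depth_image, size=(96, 128)):
    """Resize a single-channel head depth image and return its HOG descriptor."""
    img = resize(depth_image, size, anti_aliasing=True)
    # Cell/block parameters are illustrative; the descriptor length depends on them.
    return hog(img, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), block_norm='L2-Hys', feature_vector=True)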
Definition 6: Back-propagation algorithm. A supervised learning algorithm often used to train multilayer neural networks. It generally comprises two phases: (1) the forward-propagation phase feeds the training input into the network to obtain the activations; (2) the back-propagation phase takes the difference between these activations and the target output corresponding to the training input, thereby obtaining the response errors of the hidden layers and the output layer.
Definition 7: Gradient descent method. An unconstrained optimization method: to minimize an objective function, the gradient direction is computed and a search is carried out along the opposite direction of the gradient until a local minimum is reached.
The head pose estimation method based on stacked auto-encoding according to the present invention comprises the following steps:
Step 1: Collect N head depth images containing different poses and, according to the position of the camera when each image is captured, record the pitch, yaw and roll of the head corresponding to each of the N images, obtaining the head pose vector $\tilde{\mathbf{y}}_n$; its 1st dimension is the pitch angle, its 2nd dimension the yaw angle and its 3rd dimension the roll angle, and the subscript n denotes the n-th image;
Step 2: Detect the head region of each image collected in Step 1 and extract the histogram of oriented gradients features of that region, forming the HOG feature vector $\tilde{\mathbf{x}}_n$;
Step 3: Normalize each dimension of the HOG feature vectors $\tilde{\mathbf{x}}_n$ obtained in Step 2, compressing the range of the values to the interval [0, 1], and normalize the range of the poses to the interval [0, 1];
The concrete procedure of Step 3 is as follows.
The feature values are compressed to the interval [0, 1]: for the n-th sample, the value $\tilde{x}_{ni}$ of its i-th dimension is normalized by
$$x_{ni} = \frac{\tilde{x}_{ni} - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}{\max(\tilde{x}_{ni},\, n = 1, \ldots, N) - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}$$
where $\min(\tilde{x}_{ni},\, n = 1, \ldots, N)$ is the minimum of the i-th dimension over all samples and $\max(\tilde{x}_{ni},\, n = 1, \ldots, N)$ the maximum of the i-th dimension over all samples.
The range of the poses is normalized to [0, 1] by
$$y_{nj} = \frac{\tilde{y}_{nj} + 180}{360}$$
where $\tilde{y}_{nj}$ is the j-th component of the calibrated pose of the n-th sample and $y_{nj}$ the value after normalization of that dimension.
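A minimal sketch of the two normalizations of Step 3, assuming the HOG features of the N samples are stacked row-wise in a NumPy array X_tilde of shape (N, d) and the calibrated angles, in degrees, in Y_tilde of shape (N, 3); the small constant guarding against a zero range is an implementation detail, not part of the method:

import numpy as np

def normalize_features(X_tilde):
    """Min-max normalize each feature dimension to [0, 1] over the N training samples."""
    x_min = X_tilde.min(axis=0)
    x_max = X_tilde.max(axis=0)
    # 1e-12 only guards against a constant dimension (zero range).
    return (X_tilde - x_min) / (x_max - x_min + 1e-12), x_min, x_max

def normalize_poses(Y_tilde):
    """Map pitch/yaw/roll angles from [-180, 180] degrees to [0, 1]."""
    return (Y_tilde + 180.0) / 360.0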
Step 4: Build the mapping function of the stacked auto-encoder (see Fig. 3). Let the input be $x \in \mathbb{R}^{s_1}$, where $s_1$ is the feature dimension. The stacked auto-encoder used in this patent has five layers in total: layer 1 is the input layer, whose input is the HOG feature vector, so the number of nodes in layer 1 equals the dimension of the HOG feature vector; layers 2-4 are hidden layers; layer 5 is the output layer. Any node unit of any layer l is denoted by the symbol $a_i^{(l)}$, where the superscript (l) denotes layer l, and is computed as
$$a_i^{(l+1)} = \sigma\!\left(w_{i1}^{(l)} a_1^{(l)} + w_{i2}^{(l)} a_2^{(l)} + \cdots + w_{i s_l}^{(l)} a_{s_l}^{(l)} + b_i^{(l)}\right), \quad i = 1, \ldots, s_{l+1}$$
Here $w_{i1}^{(l)}, \ldots, w_{i s_l}^{(l)}$ are the parameters connecting all $s_l$ units of layer l of the neural network to the i-th unit of layer l+1; specifically, $w_{ij}^{(l)}$ is the parameter between the j-th unit of layer l and the i-th unit of layer l+1, $b_i^{(l)}$ is the bias term associated with hidden unit i of layer l+1, and $s_{l+1}$ is the number of units in layer l+1. $\sigma(\cdot)$ is the sigmoid function, $\sigma(z) = 1/(1 + e^{-z})$. Defining $z_i^{(l+1)} = \sum_{j=1}^{s_l} w_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}$, the formula above can also be written as
$$a_i^{(l+1)} = \sigma\!\left(z_i^{(l+1)}\right), \quad i = 1, \ldots, s_{l+1}$$
The output layer of the stacked auto-encoder has 3 units, denoted $a_1^{(5)}, a_2^{(5)}, a_3^{(5)}$, representing the pitch, yaw and roll angles of the estimated head pose; the whole stacked auto-encoder model function $h_{w,b}(x)$ denotes the head pose estimated when the input is $x$, i.e. $h_{w,b}(x) = (a_1^{(5)}, a_2^{(5)}, a_3^{(5)})^{\mathrm{T}}$.
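A sketch of the five-layer mapping of Step 4; the layer sizes follow the embodiment (1440-80-80-80-3), while the random initialization scale and the (s_{l+1} × s_l) weight-matrix layout are implementation assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(sizes=(1440, 80, 80, 80, 3), seed=0):
    """Random initial weights w[l] of shape (s_{l+1}, s_l) and biases b[l] of shape (s_{l+1},)."""
    rng = np.random.default_rng(seed)
    w = [rng.normal(0.0, 0.01, (sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
    b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
    return w, b

def forward(x, w, b):
    """Return activations a[0..4] and pre-activations z[1..4]; a[4] is the estimate h_{w,b}(x)."""
    a, z = [x], []
    for wl, bl in zip(w, b):
        z.append(wl @ a[-1] + bl)     # z_i^{(l+1)} = sum_j w_ij^{(l)} a_j^{(l)} + b_i^{(l)}
        a.append(sigmoid(z[-1]))      # a^{(l+1)} = sigma(z^{(l+1)})
    return a, z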
Step 5: When the input is x and the corresponding calibrated pose is y, the error of the stacked auto-encoder between the pose estimate and the calibrated pose is $\frac{1}{2}\|y - h_{w,b}(x)\|^2$.
Meanwhile, to express how much each unit of the output layer contributes to the error, an error term is defined as
$$\delta_i^{(5)} = \frac{\partial}{\partial z_i^{(5)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = -\left(y_i - a_i^{(5)}\right)\sigma'\!\left(z_i^{(5)}\right)$$
where $\sigma'(\cdot)$ denotes the derivative of $\sigma(\cdot)$. Using the back-propagation algorithm, the error term corresponding to each node j of layers l = 2, 3, 4 is computed as
$$\delta_j^{(l)} = \left(\sum_{k=1}^{s_{l+1}} w_{kj}^{(l)} \delta_k^{(l+1)}\right)\sigma'\!\left(z_j^{(l)}\right)$$
Finally the following two partial derivatives of the estimation error with respect to $w_{ij}^{(l)}$ and $b_i^{(l)}$ are obtained:
$$\frac{\partial}{\partial w_{ij}^{(l)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = a_j^{(l)}\delta_i^{(l+1)}, \qquad \frac{\partial}{\partial b_i^{(l)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = \delta_i^{(l+1)}$$
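The error terms and per-sample partial derivatives of Step 5, sketched on top of the forward-pass sketch above; this is the standard back-propagation recursion for the squared error 1/2·||y − h_{w,b}(x)||², using σ'(z) = a(1 − a) for the sigmoid:

import numpy as np

def backprop_single(x, y, w, b):
    """Per-sample gradients of 1/2*||y - h_{w,b}(x)||^2 w.r.t. every w[l] and b[l]."""
    a, z = forward(x, w, b)
    sig_prime = [al * (1.0 - al) for al in a[1:]]          # sigma'(z) expressed via a
    delta = [None] * len(w)                                # delta[l] = error terms of layer l+2
    delta[-1] = -(y - a[-1]) * sig_prime[-1]               # output-layer error term (layer 5)
    for l in range(len(w) - 2, -1, -1):                    # propagate errors back to layer 2
        delta[l] = (w[l + 1].T @ delta[l + 1]) * sig_prime[l]
    grad_w = [np.outer(delta[l], a[l]) for l in range(len(w))]   # a_j^{(l)} * delta_i^{(l+1)}
    grad_b = [delta[l] for l in range(len(w))]                   # delta_i^{(l+1)}
    return grad_w, grad_b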
Step 6: Using the stacked auto-encoder model of Step 4, take the normalized HOG features $[x_1, \ldots, x_N]$ of Step 3 as the input of the stacked auto-encoder, with the corresponding calibrated head pose values $[y_1, \ldots, y_N]$, and set up the optimization objective of the stacked auto-encoder:
$$J(w, b) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{2}\left\|y_n - h_{w,b}(x_n)\right\|_2^2 + \frac{\lambda}{2}\left\|w\right\|_2^2$$
where λ controls the strength of the regularization term $\frac{\lambda}{2}\|w\|_2^2$.
Step 7: Compute the partial derivatives of the objective function J(w, b) with respect to the parameters $w_{ij}^{(l)}$ and $b_i^{(l)}$:
$$\frac{\partial J(w,b)}{\partial w_{ij}^{(l)}} = \frac{1}{N}\sum_{n=1}^{N} a_{nj}^{(l)}\,\delta_{ni}^{(l+1)} + \lambda w_{ij}^{(l)}, \qquad \frac{\partial J(w,b)}{\partial b_i^{(l)}} = \frac{1}{N}\sum_{n=1}^{N}\delta_{ni}^{(l+1)}$$
where $a_{nj}^{(l)}$ and $\delta_{ni}^{(l+1)}$ denote, when the input is $x_n$, the output of the j-th unit of layer l and the error term of the i-th unit of layer l+1, respectively; finally the gradients of J(w, b) with respect to the parameter vectors w and b, $\nabla_w J(w,b)$ and $\nabla_b J(w,b)$, are obtained.
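Building on the two sketches above, the objective J(w, b) of Step 6 and its gradients from Step 7, including the λw term contributed by the regularizer, might be computed as follows:

import numpy as np

def objective_and_gradients(X, Y, w, b, lam):
    """J(w,b) = (1/N)*sum_n 1/2*||y_n - h(x_n)||^2 + (lam/2)*sum_l ||w[l]||_F^2, with gradients."""
    N = X.shape[0]
    grad_w = [np.zeros_like(wl) for wl in w]
    grad_b = [np.zeros_like(bl) for bl in b]
    J = 0.0
    for x, y in zip(X, Y):
        a, _ = forward(x, w, b)
        J += 0.5 * np.sum((y - a[-1]) ** 2)
        gw, gb = backprop_single(x, y, w, b)
        for l in range(len(w)):
            grad_w[l] += gw[l]
            grad_b[l] += gb[l]
    J = J / N + 0.5 * lam * sum(np.sum(wl ** 2) for wl in w)
    grad_w = [gw / N + lam * wl for gw, wl in zip(grad_w, w)]   # data term plus lambda*w
    grad_b = [gb / N for gb in grad_b]
    return J, grad_w, grad_b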
Step 8: To obtain the optimal stacked auto-encoder parameters w and b, the parameters are first initialized and then optimized by gradient descent, in the following two steps (both stages are sketched after this step):
(a) Initialization of w and b. First w and b are initialized randomly, with w written as $(w^{(1)}, \ldots, w^{(4)})^{\mathrm{T}}$, where $w^{(l)}$ denotes the parameters of layer l, and b written as $(b^{(1)}, \ldots, b^{(4)})^{\mathrm{T}}$; the parameters of layers 1, 2 and 3 are then corrected layer by layer. When correcting the parameters of layer 1, gradient descent is used to optimize $w^{(1)}$ and $b^{(1)}$ so that the layer-1 network reconstructs the original input features with minimal reconstruction error. When correcting the parameters of layer 2, gradient descent is used to optimize $w^{(2)}$ and $b^{(2)}$, taking the output of layer 1 as the input of layer 2, so that the layer-2 network reconstructs its input with minimal reconstruction error. When correcting the parameters of layer 3, gradient descent is used to optimize $w^{(3)}$ and $b^{(3)}$, taking the output of layer 2 as the input of layer 3, so that the layer-3 network reconstructs its input with minimal reconstruction error. For the parameters of layer 4, the output of layer 3 is used as the input of layer 4, and $w^{(4)}$ and $b^{(4)}$ are optimized so that the sum of squared errors between the output and the calibrated pose is minimal. The networks of layers 1 to 4 are thus initialized.
(b) Gradient descent. Starting from the initialization, the parameter vectors w and b are updated as
$$w^{[t+1]} = w^{[t]} - \alpha\nabla_w J(w, b), \qquad b^{[t+1]} = b^{[t]} - \alpha\nabla_b J(w, b)$$
where the superscripts [t] and [t+1] denote the t-th and (t+1)-th iterations; the iteration stops when w and b satisfy the convergence condition.
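As indicated in Step 8, the two stages can be sketched as follows. The first sketch reads Step 8(a) as greedy layer-wise pre-training: each hidden layer is trained as a one-hidden-layer auto-encoder that reconstructs its own input through a temporary decoder (discarded afterwards); the learning rate and epoch count are assumptions, and sigmoid is reused from the forward-pass sketch.

import numpy as np

def pretrain_layer(H, n_hidden, lr=0.1, epochs=50, seed=0):
    """Train a one-hidden-layer auto-encoder on the rows of H; return encoder weights/bias."""
    rng = np.random.default_rng(seed)
    n_in = H.shape[1]
    w_enc = rng.normal(0.0, 0.01, (n_hidden, n_in)); b_enc = np.zeros(n_hidden)
    w_dec = rng.normal(0.0, 0.01, (n_in, n_hidden)); b_dec = np.zeros(n_in)
    for _ in range(epochs):
        for x in H:
            h = sigmoid(w_enc @ x + b_enc)              # encode
            r = sigmoid(w_dec @ h + b_dec)              # reconstruct the layer's input
            dr = -(x - r) * r * (1.0 - r)               # error term at the reconstruction layer
            dh = (w_dec.T @ dr) * h * (1.0 - h)         # error term at the hidden layer
            w_dec -= lr * np.outer(dr, h); b_dec -= lr * dr
            w_enc -= lr * np.outer(dh, x); b_enc -= lr * dh
    return w_enc, b_enc

The encoder weights of each trained layer would initialize $w^{(l)}$, $b^{(l)}$, and the encoded activations become the training input for the next layer; for the fourth layer, $w^{(4)}$, $b^{(4)}$ would instead be fitted to the normalized poses. The fine-tuning of Step 8(b) is then plain batch gradient descent on J(w, b), stopping when the parameters essentially stop changing; the step size alpha and the tolerance below are assumptions:

def train(X, Y, w, b, lam=1e-4, alpha=0.5, max_iter=1000, tol=1e-6):
    """Batch gradient descent on J(w, b); stops when the parameter update becomes negligible."""
    for t in range(max_iter):
        _, grad_w, grad_b = objective_and_gradients(X, Y, w, b, lam)
        w_new = [wl - alpha * gw for wl, gw in zip(w, grad_w)]
        b_new = [bl - alpha * gb for bl, gb in zip(b, grad_b)]
        change = max(np.max(np.abs(wn - wl)) for wn, wl in zip(w_new, w))
        w, b = w_new, b_new
        if change < tol:                      # convergence: parameters essentially unchanged
            break
    return w, b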
Step 9: For a new head image, determine the head region and extract the HOG features; after numerical normalization, feed them into the trained stacked auto-encoder to obtain the corresponding head pose estimate, and restore the numerical range to -180 to +180.
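Step 9 in sketch form: the new image goes through the same HOG extraction and per-dimension normalization (reusing extract_hog, forward and the training minima and maxima from the earlier sketches), and the network output in [0, 1] is mapped back to degrees in [-180, +180]:

import numpy as np

def estimate_pose(depth_image, w, b, x_min, x_max):
    """Return the estimated (pitch, yaw, roll) in degrees for one head depth image."""
    x_tilde = extract_hog(depth_image)                     # same HOG pipeline as training
    x = (x_tilde - x_min) / (x_max - x_min + 1e-12)        # normalize with the training min/max
    a, _ = forward(x, w, b)
    return a[-1] * 360.0 - 180.0                           # undo the pose normalization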
Further, the concrete procedure of Step 3 is:
the feature values are compressed to the interval [0, 1]: for the n-th sample, the value $\tilde{x}_{ni}$ of its i-th dimension is normalized by
$$x_{ni} = \frac{\tilde{x}_{ni} - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}{\max(\tilde{x}_{ni},\, n = 1, \ldots, N) - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}$$
where $\min(\tilde{x}_{ni},\, n = 1, \ldots, N)$ is the minimum of the i-th dimension over all samples and $\max(\tilde{x}_{ni},\, n = 1, \ldots, N)$ the maximum of the i-th dimension over all samples;
the range of the poses is normalized to [0, 1] by
$$y_{nj} = \frac{\tilde{y}_{nj} + 180}{360}$$
where $\tilde{y}_{nj}$ is the j-th component of the calibrated pose of the n-th sample and $y_{nj}$ the value after normalization of that dimension.
Further, in the stacked auto-encoder mentioned in Step 4, the numbers of units in the layers are $s_1 = 1440$, $s_2 = 80$, $s_3 = 80$ and $s_4 = 80$, and the output layer has only 3 units, i.e. $s_5 = 3$.
Further, when the stacked auto-encoder parameters are solved by gradient descent in Step 8, the convergence condition is that the parameters no longer change between two consecutive iterations, i.e. a local optimum has been reached.
The innovation of the present invention is:
It is proposed to use a stacked auto-encoder to establish the nonlinear mapping relation between head depth images and pose. The invention first collects N head depth images as training samples, normalizes the depth images to a size of 96×128, extracts 1440-dimensional HOG features, and records the corresponding head poses. A stacked auto-encoder is then designed with 3 intermediate layers in addition to the input and output layers. Next, the parameters of each layer of the stacked auto-encoder are learned from the training samples and the calibrated pose data by gradient descent. Finally, for a head image whose pose is to be estimated, the HOG features are extracted and the head pose is estimated with the learned stacked auto-encoder. Compared with traditional head pose estimation methods, this method can model the complex mapping from input features to head pose and effectively overcomes the low estimation accuracy of shallow models.
Description of the drawings
Fig. 1 is a schematic diagram of head pose estimation;
Fig. 2 is a schematic diagram of the pitch, yaw and roll angles;
Fig. 3 is a schematic diagram of the stacked auto-encoder.
Specific embodiment
According to the method of the invention, the training model of the stacked auto-encoder is first written in Matlab or C; the collected training samples are then input to train the stacked auto-encoder parameters; next, HOG features are extracted from the collected images and fed as source data into the trained stacked auto-encoder for processing, yielding the estimated head pose. The method of the invention can be used for head pose estimation problems in natural scenes.
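As an illustration only (the patent itself targets a Matlab or C implementation), the Python sketches given with the steps above could be assembled end to end as follows; the synthetic data below merely stands in for real collected depth images and calibrated angles:

import numpy as np

# Synthetic stand-in data, purely to make the sketch executable end to end.
rng = np.random.default_rng(0)
depth_images = [rng.random((96, 128)) for _ in range(20)]   # would be real head depth images
poses_deg = rng.uniform(-60, 60, size=(20, 3))              # calibrated pitch/yaw/roll in degrees

X_tilde = np.stack([extract_hog(img) for img in depth_images])
X, x_min, x_max = normalize_features(X_tilde)
Y = normalize_poses(poses_deg)

w, b = init_params(sizes=(X.shape[1], 80, 80, 80, 3))
# (Optionally initialize the hidden layers with pretrain_layer as in Step 8(a).)
w, b = train(X, Y, w, b, alpha=0.5, max_iter=100)
print(estimate_pose(depth_images[0], w, b, x_min, x_max))   # estimated pitch/yaw/roll in degrees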
The head pose estimation method based on stacked auto-encoding comprises the following steps:
Step 1: Collect N head depth images containing different poses and, according to the position of the camera when each image is captured, record the pitch, yaw and roll of the head corresponding to each of the N images, obtaining the head pose vector $\tilde{\mathbf{y}}_n$; its 1st dimension is the pitch angle, its 2nd dimension the yaw angle and its 3rd dimension the roll angle, and the subscript n denotes the n-th image;
Step 2: Detect the head region of each image collected in Step 1 and extract the histogram of oriented gradients features of that region, forming the HOG feature vector $\tilde{\mathbf{x}}_n$;
Step 3: Normalize each dimension of the HOG feature vectors $\tilde{\mathbf{x}}_n$ obtained in Step 2, compressing the range of the values to the interval [0, 1], and normalize the range of the poses to the interval [0, 1];
Step 4: Build the mapping function of the stacked auto-encoder (see Fig. 3). Let the input be $x \in \mathbb{R}^{s_1}$, where $s_1$ is the feature dimension. The stacked auto-encoder used in this patent has five layers in total: layer 1 is the input layer, whose input is the HOG feature vector, so the number of nodes in layer 1 equals the dimension of the HOG feature vector; layers 2-4 are hidden layers; layer 5 is the output layer. Any node unit of any layer l is denoted $a_i^{(l)}$, where the superscript (l) denotes layer l, and is computed as
$$a_i^{(l+1)} = \sigma\!\left(w_{i1}^{(l)} a_1^{(l)} + w_{i2}^{(l)} a_2^{(l)} + \cdots + w_{i s_l}^{(l)} a_{s_l}^{(l)} + b_i^{(l)}\right), \quad i = 1, \ldots, s_{l+1}$$
Here $w_{i1}^{(l)}, \ldots, w_{i s_l}^{(l)}$ are the parameters connecting all $s_l$ units of layer l to the i-th unit of layer l+1; specifically, $w_{ij}^{(l)}$ is the parameter between the j-th unit of layer l and the i-th unit of layer l+1, $b_i^{(l)}$ is the bias term associated with hidden unit i of layer l+1, and $s_{l+1}$ is the number of units in layer l+1. $\sigma(\cdot)$ is the sigmoid function, $\sigma(z) = 1/(1 + e^{-z})$. Defining $z_i^{(l+1)} = \sum_{j=1}^{s_l} w_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}$, this can also be written as $a_i^{(l+1)} = \sigma(z_i^{(l+1)})$.
The output layer of the stacked auto-encoder has 3 units, denoted $a_1^{(5)}, a_2^{(5)}, a_3^{(5)}$, representing the pitch, yaw and roll angles of the estimated head pose; the whole stacked auto-encoder model function $h_{w,b}(x)$ denotes the head pose estimated when the input is $x$, i.e. $h_{w,b}(x) = (a_1^{(5)}, a_2^{(5)}, a_3^{(5)})^{\mathrm{T}}$.
In the stacked auto-encoder mentioned in Step 4, the numbers of units in the layers are $s_1 = 1440$, $s_2 = 80$, $s_3 = 80$ and $s_4 = 80$, and the output layer has only 3 units, i.e. $s_5 = 3$.
Step 5: When the input is x and the corresponding calibrated pose is y, the error of the stacked auto-encoder between the pose estimate and the calibrated pose is $\frac{1}{2}\|y - h_{w,b}(x)\|^2$.
Meanwhile, to express how much each unit of the output layer contributes to the error, an error term is defined as
$$\delta_i^{(5)} = \frac{\partial}{\partial z_i^{(5)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = -\left(y_i - a_i^{(5)}\right)\sigma'\!\left(z_i^{(5)}\right)$$
where $\sigma'(\cdot)$ denotes the derivative of $\sigma(\cdot)$. Using the back-propagation algorithm, the error term corresponding to each node j of layers l = 2, 3, 4 is computed as
$$\delta_j^{(l)} = \left(\sum_{k=1}^{s_{l+1}} w_{kj}^{(l)} \delta_k^{(l+1)}\right)\sigma'\!\left(z_j^{(l)}\right)$$
Finally the following two partial derivatives of the estimation error with respect to $w_{ij}^{(l)}$ and $b_i^{(l)}$ are obtained:
$$\frac{\partial}{\partial w_{ij}^{(l)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = a_j^{(l)}\delta_i^{(l+1)}, \qquad \frac{\partial}{\partial b_i^{(l)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = \delta_i^{(l+1)}$$
Step 6: Using the stacked auto-encoder model of Step 4, take the normalized HOG features $x_n$ of Step 3 as the input of the stacked auto-encoder, with the corresponding calibrated head pose values $[y_1, \ldots, y_N]$, and set up the optimization objective of the stacked auto-encoder:
$$J(w, b) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{2}\left\|y_n - h_{w,b}(x_n)\right\|_2^2 + \frac{\lambda}{2}\left\|w\right\|_2^2$$
where λ controls the strength of the regularization term $\frac{\lambda}{2}\|w\|_2^2$.
Step 7: Compute the partial derivatives of the objective function J(w, b) with respect to the parameters $w_{ij}^{(l)}$ and $b_i^{(l)}$:
$$\frac{\partial J(w,b)}{\partial w_{ij}^{(l)}} = \frac{1}{N}\sum_{n=1}^{N} a_{nj}^{(l)}\,\delta_{ni}^{(l+1)} + \lambda w_{ij}^{(l)}, \qquad \frac{\partial J(w,b)}{\partial b_i^{(l)}} = \frac{1}{N}\sum_{n=1}^{N}\delta_{ni}^{(l+1)}$$
where $a_{nj}^{(l)}$ and $\delta_{ni}^{(l+1)}$ denote, when the input is $x_n$, the output of the j-th unit of layer l and the error term of the i-th unit of layer l+1, respectively; finally the gradients of J(w, b) with respect to the parameter vectors w and b, $\nabla_w J(w,b)$ and $\nabla_b J(w,b)$, are obtained.
Step 8: To obtain the optimal stacked auto-encoder parameters w and b, the parameters are first initialized and then optimized by gradient descent, in the following two steps:
(a) Initialization of w and b. First w and b are initialized randomly, with w written as $(w^{(1)}, \ldots, w^{(4)})^{\mathrm{T}}$, where $w^{(l)}$ denotes the parameters of layer l, and b written as $(b^{(1)}, \ldots, b^{(4)})^{\mathrm{T}}$; the parameters of layers 1, 2 and 3 are then corrected layer by layer. When correcting the parameters of layer 1, gradient descent is used to optimize $w^{(1)}$ and $b^{(1)}$ so that the layer-1 network reconstructs the original input features with minimal reconstruction error. When correcting the parameters of layer 2, gradient descent is used to optimize $w^{(2)}$ and $b^{(2)}$, taking the output of layer 1 as the input of layer 2, so that the layer-2 network reconstructs its input with minimal reconstruction error. When correcting the parameters of layer 3, gradient descent is used to optimize $w^{(3)}$ and $b^{(3)}$, taking the output of layer 2 as the input of layer 3, so that the layer-3 network reconstructs its input with minimal reconstruction error. For the parameters of layer 4, the output of layer 3 is used as the input of layer 4, and $w^{(4)}$ and $b^{(4)}$ are optimized so that the sum of squared errors between the output and the calibrated pose is minimal. The networks of layers 1 to 4 are thus initialized.
(b) Gradient descent. Starting from the initialization, the parameter vectors w and b are updated as
$$w^{[t+1]} = w^{[t]} - \alpha\nabla_w J(w, b), \qquad b^{[t+1]} = b^{[t]} - \alpha\nabla_b J(w, b)$$
where the superscripts [t] and [t+1] denote the t-th and (t+1)-th iterations; the iteration stops when w and b satisfy the convergence condition.
When the stacked auto-encoder parameters are solved by gradient descent in Step 8, the convergence condition is that the parameters no longer change between two consecutive iterations, i.e. a local optimum has been reached.
Step 9: For a new head image, determine the head region and extract the HOG features; after numerical normalization, feed them into the trained stacked auto-encoder to obtain the corresponding head pose estimate, and restore the numerical range to -180 to +180.

Claims (4)

1. A head pose estimation method based on stacked auto-encoding, comprising the following steps:
Step 1: collecting N head depth images containing different poses and, according to the position of the camera when each image is captured, recording the pitch, yaw and roll of the head corresponding to each of the N images to obtain the head pose vector $\tilde{\mathbf{y}}_n$, whose 1st dimension is the pitch angle, 2nd dimension the yaw angle and 3rd dimension the roll angle, the subscript n denoting the n-th image;
Step 2: detecting the head region of each image collected in Step 1 and extracting the histogram of oriented gradients features of that region, forming the HOG feature vector $\tilde{\mathbf{x}}_n$;
Step 3: normalizing each dimension of the HOG feature vectors $\tilde{\mathbf{x}}_n$ obtained in Step 2, compressing the range of the values to the interval [0, 1], and normalizing the range of the poses to the interval [0, 1];
the concrete procedure of Step 3 being:
compressing the value range to [0, 1]: for the n-th sample, the value $\tilde{x}_{ni}$ of its i-th dimension is normalized by
$$x_{ni} = \frac{\tilde{x}_{ni} - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}{\max(\tilde{x}_{ni},\, n = 1, \ldots, N) - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}$$
where $\min(\tilde{x}_{ni},\, n = 1, \ldots, N)$ is the minimum of the i-th dimension over all samples and $\max(\tilde{x}_{ni},\, n = 1, \ldots, N)$ the maximum of the i-th dimension over all samples;
normalizing the range of the poses to [0, 1] by
$$y_{nj} = \frac{\tilde{y}_{nj} + 180}{360}$$
where $\tilde{y}_{nj}$ is the j-th component of the calibrated pose of the n-th sample and $y_{nj}$ the value after normalization of that dimension;
Step 4: building the mapping function of the stacked auto-encoder: letting the input be $x \in \mathbb{R}^{s_1}$, where $s_1$ is the feature dimension, the stacked auto-encoder used in this patent having five layers in total, layer 1 being the input layer whose input is the HOG feature vector, so that the number of nodes in layer 1 equals the dimension of the HOG feature vector, layers 2-4 being hidden layers and layer 5 being the output layer; any node unit of any layer l being denoted $a_i^{(l)}$, where the superscript (l) denotes layer l, and being computed as
$$a_i^{(l+1)} = \sigma\!\left(w_{i1}^{(l)} a_1^{(l)} + w_{i2}^{(l)} a_2^{(l)} + \cdots + w_{i s_l}^{(l)} a_{s_l}^{(l)} + b_i^{(l)}\right), \quad i = 1, \ldots, s_{l+1}$$
where $w_{i1}^{(l)}, \ldots, w_{i s_l}^{(l)}$ are the parameters connecting all $s_l$ units of layer l of the neural network to the i-th unit of layer l+1; specifically, $w_{ij}^{(l)}$ is the parameter between the j-th unit of layer l and the i-th unit of layer l+1, $b_i^{(l)}$ is the bias term associated with hidden unit i of layer l+1, and $s_{l+1}$ is the number of units in layer l+1; $\sigma(\cdot)$ is the sigmoid function, $\sigma(z) = 1/(1 + e^{-z})$; defining $z_i^{(l+1)} = \sum_{j=1}^{s_l} w_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}$, the formula above can also be written as
$$a_i^{(l+1)} = \sigma\!\left(z_i^{(l+1)}\right), \quad i = 1, \ldots, s_{l+1}$$
the output layer of the stacked auto-encoder having 3 units, denoted $a_1^{(5)}, a_2^{(5)}, a_3^{(5)}$, representing the pitch, yaw and roll angles of the estimated head pose, and the whole stacked auto-encoder model function $h_{w,b}(x)$ denoting the head pose estimated when the input is x, i.e. $h_{w,b}(x) = (a_1^{(5)}, a_2^{(5)}, a_3^{(5)})^{\mathrm{T}}$;
Step 5: when the input is x and the corresponding calibrated pose is y, the error of the stacked auto-encoder between the pose estimate and the calibrated pose being $\frac{1}{2}\|y - h_{w,b}(x)\|^2$;
meanwhile, to express how much each unit of the output layer contributes to the error, defining an error term
$$\delta_i^{(5)} = \frac{\partial}{\partial z_i^{(5)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = -\left(y_i - a_i^{(5)}\right)\sigma'\!\left(z_i^{(5)}\right)$$
where $\sigma'(\cdot)$ denotes the derivative of $\sigma(\cdot)$; using the back-propagation algorithm, computing the error term corresponding to each node j of layers l = 2, 3, 4 as
$$\delta_j^{(l)} = \left(\sum_{k=1}^{s_{l+1}} w_{kj}^{(l)} \delta_k^{(l+1)}\right)\sigma'\!\left(z_j^{(l)}\right)$$
and finally obtaining the following two partial derivatives of the estimation error with respect to $w_{ij}^{(l)}$ and $b_i^{(l)}$:
$$\frac{\partial}{\partial w_{ij}^{(l)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = a_j^{(l)}\delta_i^{(l+1)}$$
$$\frac{\partial}{\partial b_i^{(l)}} \frac{1}{2}\left\|y - h_{w,b}(x)\right\|^2 = \delta_i^{(l+1)}$$
Step 6: using the stacked auto-encoder model of Step 4, taking the normalized HOG features $x_n$ of Step 3 as the input of the stacked auto-encoder, with the corresponding calibrated head pose values $[y_1, \ldots, y_N]$, and setting up the optimization objective of the stacked auto-encoder:
$$J(w, b) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{2}\left\|y_n - h_{w,b}(x_n)\right\|_2^2 + \frac{\lambda}{2}\left\|w\right\|_2^2$$
where λ controls the strength of the regularization term $\frac{\lambda}{2}\|w\|_2^2$;
Step 7: computing the partial derivatives of the objective function J(w, b) with respect to the parameters $w_{ij}^{(l)}$ and $b_i^{(l)}$:
$$\frac{\partial J(w,b)}{\partial w_{ij}^{(l)}} = \frac{1}{N}\sum_{n=1}^{N} a_{nj}^{(l)}\,\delta_{ni}^{(l+1)} + \lambda w_{ij}^{(l)}$$
$$\frac{\partial J(w,b)}{\partial b_i^{(l)}} = \frac{1}{N}\sum_{n=1}^{N}\delta_{ni}^{(l+1)}$$
where $a_{nj}^{(l)}$ and $\delta_{ni}^{(l+1)}$ denote, when the input is $x_n$, the output of the j-th unit of layer l and the error term of the i-th unit of layer l+1, respectively; and finally obtaining the gradients of J(w, b) with respect to the parameter vectors w and b, $\nabla_w J(w,b)$ and $\nabla_b J(w,b)$;
Step 8: to obtain the optimal stacked auto-encoder parameters w and b, first initializing the parameters and then optimizing them by gradient descent, in the following two steps:
(a) initialization of w and b: first initializing w and b randomly, with w written as $(w^{(1)}, \ldots, w^{(4)})^{\mathrm{T}}$, where $w^{(l)}$ denotes the parameters of layer l, and b written as $(b^{(1)}, \ldots, b^{(4)})^{\mathrm{T}}$, and then correcting the parameters of layers 1, 2 and 3 layer by layer: when correcting the parameters of layer 1, using gradient descent to optimize $w^{(1)}$ and $b^{(1)}$ so that the layer-1 network reconstructs the original input features with minimal reconstruction error; when correcting the parameters of layer 2, using gradient descent to optimize $w^{(2)}$ and $b^{(2)}$, taking the output of layer 1 as the input of layer 2, so that the layer-2 network reconstructs its input with minimal reconstruction error; when correcting the parameters of layer 3, using gradient descent to optimize $w^{(3)}$ and $b^{(3)}$, taking the output of layer 2 as the input of layer 3, so that the layer-3 network reconstructs its input with minimal reconstruction error; for the parameters of layer 4, using the output of layer 3 as the input of layer 4 and optimizing $w^{(4)}$ and $b^{(4)}$ so that the sum of squared errors between the output and the calibrated pose is minimal; the networks of layers 1 to 4 being thus initialized;
(b) gradient descent: starting from the initialization, updating the parameter vectors w and b as
$$w^{[t+1]} = w^{[t]} - \alpha\nabla_w J(w, b)$$
$$b^{[t+1]} = b^{[t]} - \alpha\nabla_b J(w, b)$$
where the superscripts [t] and [t+1] denote the t-th and (t+1)-th iterations, and stopping the iteration when w and b satisfy the convergence condition;
Step 9: for a new head image, determining the head region and extracting its HOG features; after numerical normalization, feeding them into the trained stacked auto-encoder to obtain the corresponding head pose estimate, and restoring the numerical range to -180 to +180.
2. The head pose estimation method based on stacked auto-encoding according to claim 1, characterised in that the concrete procedure of Step 3 is:
compressing the value range to [0, 1]: for the n-th sample, the value $\tilde{x}_{ni}$ of its i-th dimension is normalized by
$$x_{ni} = \frac{\tilde{x}_{ni} - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}{\max(\tilde{x}_{ni},\, n = 1, \ldots, N) - \min(\tilde{x}_{ni},\, n = 1, \ldots, N)}$$
where $\min(\tilde{x}_{ni},\, n = 1, \ldots, N)$ is the minimum of the i-th dimension over all samples and $\max(\tilde{x}_{ni},\, n = 1, \ldots, N)$ the maximum of the i-th dimension over all samples; and
normalizing the range of the poses to [0, 1] by
$$y_{nj} = \frac{\tilde{y}_{nj} + 180}{360}$$
where $\tilde{y}_{nj}$ is the j-th component of the calibrated pose of the n-th sample and $y_{nj}$ the value after normalization of that dimension.
3. The head pose estimation method based on stacked auto-encoding according to claim 1, characterised in that, in the stacked auto-encoder mentioned in Step 4, the numbers of units in the layers are $s_1 = 1440$, $s_2 = 80$, $s_3 = 80$ and $s_4 = 80$, and the output layer has only 3 units, i.e. $s_5 = 3$.
4. The head pose estimation method based on stacked auto-encoding according to claim 1, characterised in that, when the stacked auto-encoder parameters are solved by gradient descent in Step 8, the convergence condition is that the parameters no longer change between two consecutive iterations, i.e. a local optimum has been reached.
CN201611100343.0A 2016-12-05 2016-12-05 Head pose estimation method based on stacked auto-encoding Expired - Fee Related CN106599810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611100343.0A CN106599810B (en) 2016-12-05 2016-12-05 Head pose estimation method based on stacked auto-encoding


Publications (2)

Publication Number Publication Date
CN106599810A true CN106599810A (en) 2017-04-26
CN106599810B CN106599810B (en) 2019-05-14

Family

ID=58596108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611100343.0A Expired - Fee Related CN106599810B (en) 2016-12-05 2016-12-05 Head pose estimation method based on stacked auto-encoding

Country Status (1)

Country Link
CN (1) CN106599810B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9292734B2 (en) * 2011-01-05 2016-03-22 Ailive, Inc. Method and system for head tracking and pose estimation
US20160070966A1 (en) * 2014-09-05 2016-03-10 Ford Global Technologies, Llc Head-mounted display head pose and activity estimation
CN104392241A (en) * 2014-11-05 2015-03-04 电子科技大学 Mixed regression-based head pose estimation method
CN105760809A (en) * 2014-12-19 2016-07-13 联想(北京)有限公司 Method and apparatus for head pose estimation

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11367197B1 (en) * 2014-10-20 2022-06-21 Henry Harlyn Baker Techniques for determining a three-dimensional representation of a surface of an object from a set of images
US11869205B1 (en) 2014-10-20 2024-01-09 Henry Harlyn Baker Techniques for determining a three-dimensional representation of a surface of an object from a set of images
CN107506725A (en) * 2017-08-22 2017-12-22 杭州远鉴信息科技有限公司 High voltage isolator positioning and status image recognizer based on neural network
CN107481292A (en) * 2017-09-05 2017-12-15 百度在线网络技术(北京)有限公司 Attitude error estimation method and device for vehicle-mounted camera
CN107481292B (en) * 2017-09-05 2020-07-28 百度在线网络技术(北京)有限公司 Attitude error estimation method and device for vehicle-mounted camera
CN107749757A (en) * 2017-10-18 2018-03-02 广东电网有限责任公司电力科学研究院 Data compression method and device based on stacked auto-encoding and the PSO algorithm
CN107945161A (en) * 2017-11-21 2018-04-20 重庆交通大学 Road surface defect inspection method based on texture feature extraction
CN107945161B (en) * 2017-11-21 2020-10-23 重庆交通大学 Road surface defect detection method based on textural feature extraction
CN110533065A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Shield attitude prediction method based on auto-encoded features and a deep learning regression model

Also Published As

Publication number Publication date
CN106599810B (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN108345869B (en) Driver posture recognition method based on depth image and virtual data
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN106599810A (en) Head pose estimation method based on stacked auto-encoding
CN108764065B (en) Pedestrian re-recognition feature fusion aided learning method
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN110674741B (en) Gesture recognition method in machine vision based on double-channel feature fusion
CN104392241B (en) Head pose estimation method based on mixed regression
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN112184752A (en) Video target tracking method based on pyramid convolution
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN107808129A (en) Facial multi-feature-point localization method based on a single convolutional neural network
CN108182397B (en) Multi-pose multi-scale human face verification method
CN104268539A (en) High-performance human face recognition method and system
CN106599994A (en) Gaze estimation method based on a deep regression network
CN103324938A (en) Method for training attitude classifier and object classifier and method and device for detecting objects
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
CN105205449A (en) Sign language recognition method based on deep learning
CN103279936A (en) Human face fake photo automatic combining and modifying method based on portrayal
CN104636732A (en) Pedestrian identification method based on sequential deep belief networks
CN105760898A (en) Vision mapping method based on mixed group regression method
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN113361542A (en) Local feature extraction method based on deep learning
CN105488541A (en) Natural feature point identification method based on machine learning in augmented reality system
CN112232263A (en) Tomato identification method based on deep learning
CN103093211B (en) Human body motion tracking method based on deep kernel information image features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190514

Termination date: 20211205