CN110516642A - Lightweight 3D face keypoint detection method and system

Info

Publication number: CN110516642A
Application number: CN201910818443.4A
Authority: CN
Inventors: 王正宁; 赵德明; 何庆东; 曾浩; 曾仪; 刘怡君; 吕侠; 谢镇灿
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2019-11-29

Abstract

The invention discloses a lightweight 3D face keypoint detection method and system. The method comprises: projecting the N 3D reference coordinate vectors of the face keypoints in a database onto three two-dimensional planes for dimensionality reduction; constructing a joint encoding sub-network based on a k-order improved hourglass network, and using the joint encoding sub-network to jointly encode the N 2D reference coordinate vectors under each 2D view into a 2D joint heatmap; stacking the 2D joint heatmaps of the three 2D views into a 3D joint heatmap using a concat method; and constructing a decoding sub-network based on a 2D fully convolutional network, and using the decoding sub-network to decode the 3D joint heatmap into N 3D detection coordinate vectors. The invention designs corresponding lightweight neural networks (the joint encoding sub-network and the decoding sub-network) to generate the joint heatmaps and regress the 3D coordinates. By combining the advantages of existing 2D and 3D face keypoint detection methods, it reduces the number of model parameters and increases the running speed of the model while maintaining relatively high detection accuracy.

Description

Lightweight 3D face keypoint detection method and system

Technical field

The present invention relates to the technical fields of image processing and computer vision, and in particular to a lightweight 3D face keypoint detection method and system.

Background art

With the rapid development of deep learning in computer vision, various face image processing tasks have come into wide use in daily life. Among them, face keypoint detection plays an important role in face recognition, expression recognition, and related tasks.

Face keypoint detection has made great progress over the past decade, especially in 2D face keypoint detection. Among the classical face keypoint detection algorithms, the ASM (Active Shape Model) algorithm proposed by Cootes et al., which is based on a point distribution model, first annotates the training set manually, obtains a shape model through training, and then matches a specific object by matching its keypoints. The CPR (Cascaded Pose Regression) algorithm proposed by Dollar progressively refines a specified initial prediction through a series of regressors; each regressor relies on the output of the previous one to perform simple image operations, and the whole system learns automatically from training samples. In addition, Zhang et al. proposed the multi-task cascaded convolutional neural network MTCNN (Multi-task Cascaded Convolutional Networks) to handle face detection and face keypoint localization simultaneously. However, in complex scenes such as large pose angles and facial occlusion, 2D-based face keypoint detection is difficult to carry out and has limitations. To address these limitations, more and more researchers have turned to 3D face keypoint detection: compared with 2D keypoints, 3D face keypoints carry more information and provide more information about occluded regions.

3D face keypoint detection methods are roughly divided into model-based methods and model-free methods. (1) Model-based methods: the 3D morphable model (3DMM) proposed by Blanz et al. is a common way to perform 3D face keypoint detection. (2) Model-free methods: Tulyakov et al. proposed locating 3D face keypoints by cascaded regression over computed 3D shape features, generalizing cascaded regression to 3D face keypoint detection. In addition, the model-based methods also include methods that perform face keypoint detection with deep learning models, which are mainly divided into two-stage regression methods and volumetric representation methods. A typical two-stage regression method separates the (x, y) coordinates from the z-axis, first regressing (x, y) and then regressing z. Volumetric representation methods extend the traditional 2D heatmap to a 3D volumetric representation and are also widely used in human body keypoint detection.

However, because of the added spatial dimension in 3D, both the processing speed and the accuracy of these algorithms face great challenges. Existing 3D face keypoint detection algorithms all have shortcomings, to varying degrees, in processing speed, model size, complexity, and other respects.

Summary of the invention

At least one object of the present invention is to overcome the above problems in the prior art by providing a lightweight 3D face keypoint detection method and system.

To achieve the above object, the technical solution adopted by the present invention includes the following aspects.

A lightweight 3D face keypoint detection method, comprising:

Step 101: project the N 3D reference coordinate vectors of the face keypoints in a database onto three two-dimensional planes for dimensionality reduction, wherein the three two-dimensional planes are the xy, xz, and yz planes, and x, y, and z are all positive or all negative; each two-dimensional plane contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors;

Step 102: construct a joint encoding sub-network based on a k-order improved hourglass network, and train the joint encoding sub-network until its performance stabilizes; use the trained joint encoding sub-network to jointly encode the N 2D reference coordinate vectors under each 2D view into a 2D joint heatmap, wherein the residual unit of the k-order improved hourglass network uses a Residual+Inception structure;

Step 103: stack the 2D joint heatmaps of the three 2D views into a 3D joint heatmap using a concat method;

Step 104: construct a decoding sub-network based on a 2D fully convolutional network, and train the decoding sub-network until its performance stabilizes; use the decoding sub-network to decode the 3D joint heatmap into N 3D detection coordinate vectors.

Preferably, the joint encoding sub-network is a 2-order improved hourglass network.

Preferably, the joint encoding sub-network and the decoding sub-network are trained using a multi-loss-function fusion training method.

Preferably, the multi-loss-function fusion training method trains the network iteratively for three rounds using three different loss functions, taking the optimal weights obtained in each round as the initial weights of the next round, and stops once all three rounds are complete.

Preferably, the three loss functions are: the mean squared error loss function, the least absolute error loss function, and the smoothed least absolute error loss function.

Preferably, the decoding sub-network comprises 4 2D convolutional layers, with batch normalization and a LeakyReLU activation function between the convolutional layers.

A lightweight 3D face keypoint detection system, comprising at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the above method.

In summary, by adopting the above technical solution, the present invention has at least the following advantages:

1. By combining the advantages of 2D and 3D face keypoint detection, a joint heatmap and coordinate regression method is proposed, and corresponding lightweight neural networks (the joint encoding sub-network and the decoding sub-network) are designed to generate the joint heatmaps and regress the 3D coordinates. The method combines the strengths of existing 2D and 3D face keypoint detection methods; the joint heatmap representation reduces computation and model complexity, so that the number of model parameters is reduced and the running speed is increased while relatively high detection accuracy is maintained. In the joint encoding process, the residual unit of the original neural network is improved, further enhancing the feature extraction capability and detection accuracy of the network.

2. The joint encoding sub-network uses a 2-order improved hourglass structure, which reduces the depth of the network, speeds up convergence, and reduces the number of network parameters.

3. A multi-loss-function fusion training method is proposed, which trains the network iteratively for three rounds using three different loss functions and thereby improves the detection accuracy of the network.

Brief description of the drawings

Fig. 1 is a flowchart of the lightweight 3D face keypoint detection method according to an exemplary embodiment of the present invention.

Fig. 2 is a structural schematic of the residual unit of the original hourglass network.

Fig. 3 is a structural schematic of the residual unit of the improved hourglass network according to an exemplary embodiment of the present invention.

Fig. 4 is a structural schematic of the improved 2-order hourglass network (joint encoding sub-network) according to an exemplary embodiment of the present invention.

Fig. 5 is an example heatmap generated by the joint encoding sub-network according to an exemplary embodiment of the present invention.

Fig. 6 is a schematic of the 3D keypoints generated by the decoding sub-network according to an exemplary embodiment of the present invention.

Fig. 7 is a schematic of the 3D keypoints generated by the decoding sub-network, projected onto the image, according to an exemplary embodiment of the present invention.

Fig. 8 is a structural schematic of the complete network formed by the joint encoding sub-network and the decoding sub-network according to an exemplary embodiment of the present invention.

Fig. 9 is a structural schematic of the lightweight 3D face keypoint detection system according to an exemplary embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments, so that its objects, technical solutions, and advantages become clearer. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

Fig. 1 shows a lightweight 3D face keypoint detection method according to an exemplary embodiment of the present invention. The method of this embodiment is described step by step below.

Specifically, the 3D reference coordinate vectors of N face keypoints are extracted from the ground truth (commonly abbreviated as GT) dataset. A generic face has 68 keypoints in total, so N = 68 is preferred in this embodiment. The N extracted 3D keypoint reference coordinate vectors (x, y, z) are decomposed onto three two-dimensional planes for dimensionality reduction; concretely, each is projected into three 2D reference coordinate vectors (x, y), (y, z), and (x, z). Let V_{x,y,z} = (x, y, z) denote the 3D reference coordinate vector of a keypoint; the three 2D reference coordinate vectors obtained by the decomposition are then:

v_{x,y} = (x, y), v_{y,z} = (y, z), v_{x,z} = (x, z)

For example, decomposing the 3D coordinate point (1, -2, 3) yields (1, -2), (-2, 3), and (1, 3). However, so that the joint 2D heatmaps can be formed later, the dimensionality reduction projects onto the three coordinate planes xy, yz, and xz in a region where x, y, and z have the same sign (all positive or all negative). This guarantees that each 3D coordinate yields three 2D reference coordinates with the same signs after dimensionality reduction. Preferably, the projection is performed onto the three faces of the first octant of the coordinate system (x, y, and z all positive).
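The decomposition described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the translation used in `shift_to_first_octant` is an assumed way of making all coordinates non-negative, since the text does not say how the first-octant condition is enforced.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_to_planes(
    points: List[Point3D],
) -> Tuple[List[Point2D], List[Point2D], List[Point2D]]:
    """Split each (x, y, z) into its (x, y), (y, z) and (x, z) projections."""
    xy = [(x, y) for x, y, _ in points]
    yz = [(y, z) for _, y, z in points]
    xz = [(x, z) for x, _, z in points]
    return xy, yz, xz


def shift_to_first_octant(points: List[Point3D]) -> List[Point3D]:
    """Assumed preprocessing: translate the points so every coordinate is >= 0."""
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)
    min_z = min(p[2] for p in points)
    return [(x - min_x, y - min_y, z - min_z) for x, y, z in points]


# The worked example from the text: (1, -2, 3) -> (1, -2), (-2, 3), (1, 3).
xy, yz, xz = project_to_planes([(1.0, -2.0, 3.0)])

# After shifting, all three projections share non-negative coordinates.
shifted = shift_to_first_octant([(1.0, -2.0, 3.0), (0.0, 0.0, 0.0)])
```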

Specifically, as shown in Fig. 3, the residual sub-unit inside the hourglass network (original structure shown in Fig. 2) is improved by introducing a Residual+Inception structure that extends the network in width. The convolution kernels have size n × n and the pooling kernels have size n × n (n = 2k+1, where k is a positive integer), and the resulting multiple receptive fields are then fused along the channel dimension. The fused feature map carries different receptive fields and different semantic information with respect to the input image. For input images of different scales, the improved hourglass network has a stronger feature extraction capability, which improves detection accuracy. After the residual unit is changed, the hourglass network becomes wider, and this width can be used to improve the representational capacity of the network. If the traditional 4-order hourglass structure were still used, the network would overfit because of its excessive number of parameters. Therefore, to prevent overfitting, only a 2-order hourglass structure is retained. As shown in Fig. 4, the joint encoding sub-network of the present invention uses a 2-order improved hourglass network to extract features from the input image efficiently, avoiding the overfitting that would result from the increased width and excessive parameters introduced by the Inception structure. Using a 2-order improved hourglass network substantially reduces the depth of the network, allowing it to converge faster while reducing the number of parameters. Each green rectangular module in the figure is an improved Residual+Inception sub-unit; the number in the first row inside a green rectangle is the number of input channels, and the number in the second row is the number of output channels. In a first-order hourglass module, the upper branch operates at the original scale, while the lower branch first downsamples and then upsamples; downsampling uses max pooling and upsampling uses nearest-neighbor interpolation. Finally, the outputs of the upper and lower branches are added to obtain the final output. Hourglass modules of different orders give the network different complexity and parameter counts.
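The branch structure of the hourglass module can be illustrated with a small single-channel skeleton. This is a sketch under stated assumptions, not the network of the invention: `identity_unit` is a hypothetical placeholder standing in for the Residual+Inception sub-unit (it has no learned weights), the placement of units along the lower branch follows the common hourglass layout rather than Fig. 4, and the input height and width are assumed to be divisible by 2 to the power of the order.

```python
from typing import Callable, List

Grid = List[List[float]]


def max_pool_2x2(x: Grid) -> Grid:
    """Downsample: keep the maximum of each non-overlapping 2x2 block."""
    return [
        [max(x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1])
         for c in range(0, len(x[0]), 2)]
        for r in range(0, len(x), 2)
    ]


def upsample_nearest_2x(x: Grid) -> Grid:
    """Upsample: repeat every value twice along both axes (nearest neighbor)."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def identity_unit(x: Grid) -> Grid:
    """Placeholder for a Residual+Inception unit (assumption: no learned weights)."""
    return [list(row) for row in x]


def hourglass(x: Grid, order: int,
              unit: Callable[[Grid], Grid] = identity_unit) -> Grid:
    """k-order hourglass skeleton built by nesting lower-order modules."""
    upper = unit(x)                       # upper branch: original scale
    lower = unit(max_pool_2x2(x))         # lower branch: downsample first
    if order > 1:
        lower = hourglass(lower, order - 1, unit)  # nest a lower-order module
    else:
        lower = unit(lower)
    lower = upsample_nearest_2x(unit(lower))       # then upsample back
    # Element-wise sum of the two branches gives the module output.
    return [[a + b for a, b in zip(ru, rl)] for ru, rl in zip(upper, lower)]


small = hourglass([[1.0, 2.0], [3.0, 4.0]], order=1)
```

With the identity placeholder, the 2 × 2 input above is pooled to its maximum 4, upsampled back to 2 × 2, and added to the original, so each order adds one more pooling level to the lower branch.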

The k-order hourglass joint encoding network is trained on face images annotated with coordinate values. The joint heatmap has size w × h × 3; for a face image of size 256 × 256, the encoding resolution is commonly set to 128 × 128 × 3. The encoding sub-network E thus forms a mapping E(I) → H from the input face image I to the joint heatmap H. The network input is a face image of size 128 × 128, and the output is a w × h joint heatmap (the heatmap size of the output layer can be set as needed). Preferably, the generated joint heatmap has size 64 × 64, so that the relative positions of the face keypoints change from sparse to more compact, which reduces the spatial redundancy of the model and the number of network parameters.

Further, the detailed process by which the joint encoding sub-network jointly encodes the N 2D reference coordinate vectors into a 2D joint heatmap is as follows:

For the 2D reference coordinate vector (x, y) of a given keypoint, the vector is first encoded into a series of continuous values. These values are then screened by taking the maximum: the maximum of the encoded values is chosen as the encoded value in the heatmap. Consider the value at position (i_m, j_m) of the m-th heatmap, where m ∈ {1, 2, 3}. For the n-th keypoint on the face image, located at v_{x,y}, v_{y,z}, and v_{x,z} in the three views, the (x, y) coordinate vector is encoded in 2D Gaussian form (the other two coordinate vectors are handled identically), as shown in formula (1), where σ is the variance:

For a face image with N keypoints, each keypoint is encoded into a series of continuous values; the maximum is taken, and the encoded values of the N keypoints are combined on a single map to form the 2D joint heatmap, as shown in formula (2):

The above joint encoding process yields a 2D joint heatmap for each of the three views. Each 2D joint heatmap has size w × h and encodes all N keypoints. Fig. 5 shows an exemplary heatmap generated by the joint encoding sub-network of the present invention.
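To make the encoding target concrete, the sketch below builds one 2D joint heatmap by placing a Gaussian at every keypoint and keeping the per-pixel maximum. It rests on assumptions that should be read as such: the unnormalized form exp(-d² / (2σ²)) is the conventional keypoint-heatmap Gaussian and may differ from formula (1), `sigma` is treated here as a standard deviation, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def joint_heatmap(points: Sequence[Tuple[float, float]],
                  width: int, height: int,
                  sigma: float = 1.0) -> List[List[float]]:
    """Encode all keypoints of one view into a single height x width map."""
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for j in range(height):          # j indexes rows (second coordinate)
            for i in range(width):       # i indexes columns (first coordinate)
                dist_sq = (i - px) ** 2 + (j - py) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                # Joint encoding: keep the maximum over all keypoints.
                if value > heatmap[j][i]:
                    heatmap[j][i] = value
    return heatmap


# Two keypoints on a small 8 x 8 map; each peak has value 1.0.
hm = joint_heatmap([(2.0, 3.0), (5.0, 1.0)], width=8, height=8)
```

Running the same function on the (x, y), (y, z), and (x, z) projections would give the three per-view maps that are stacked in the next step.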

Specifically, the 2D joint heatmaps of the three two-dimensional planes are stacked using the concat method to obtain a 3D heatmap. The concat method is a concatenation operation used to join two or more arrays. It stacks the three 2D joint heatmaps together into a 3D heatmap of size w × h × 3 (where 3 denotes the 3 channels), as shown in formula (3):

H = concat(p₁, p₂, p₃)    (3)
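Formula (3) amounts to stacking three equally sized maps along a new channel axis. A minimal pure-Python sketch follows; the channel-last layout and the function name are assumptions made for illustration, and an array library would normally do this in one call.

```python
from typing import List

Grid = List[List[float]]


def concat_heatmaps(p1: Grid, p2: Grid, p3: Grid) -> List[List[List[float]]]:
    """Stack three h x w maps into one h x w x 3 map (channel-last)."""
    if not (len(p1) == len(p2) == len(p3)):
        raise ValueError("all heatmaps must have the same height")
    if not all(len(a) == len(b) == len(c) for a, b, c in zip(p1, p2, p3)):
        raise ValueError("all heatmaps must have the same width")
    return [
        [[a, b, c] for a, b, c in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(p1, p2, p3)
    ]


# Three 1 x 2 maps become one 1 x 2 x 3 map.
H = concat_heatmaps([[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]])
```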

Specifically, after pre-training, the decoding sub-network forms a mapping D(H) → c from the joint heatmap H to the corresponding 3D coordinate vector c. Because the joint heatmap H has size w × h × 3, the decoding sub-network is built as a 2D fully convolutional network that decodes the heatmap, as shown in Fig. 4 (see Annex 1). The decoding sub-network contains 5 2D convolutional layers in total, with 128, 128, 256, 256, and 512 convolution kernels respectively; every kernel has size 4 × 4 and stride 2, and the last convolutional layer has N × 3 channels. Batch normalization and a LeakyReLU activation function are placed between the convolutional layers, and the final layer is a global average pooling layer. Passing the 3D joint heatmap obtained by the concat method through the decoding sub-network yields the N 3D keypoint coordinate vectors. As shown in Fig. 6, the decoding sub-network thus extracts the N 3D detection coordinate vectors of the face keypoints. Further, for ease of visualization, the 3D keypoint coordinates are projected onto the 2D image, as shown in Fig. 7.
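The effect of the stride-2 convolutions on the heatmap resolution can be checked with simple arithmetic. The sketch below assumes a padding of 1 (the text does not state the padding) and a 64 × 64 input, and it assumes that the pooled N × 3 channels are ordered as consecutive (x, y, z) triples; all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def conv_output_size(size: int, kernel: int = 4, stride: int = 2,
                     padding: int = 1) -> int:
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def decoder_spatial_sizes(input_size: int = 64, num_layers: int = 5) -> List[int]:
    """Spatial size before the first layer and after each convolutional layer."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_output_size(sizes[-1]))
    return sizes


def global_average_pool(channels: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Reduce each 2D channel to its mean value."""
    return [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in channels
    ]


def to_keypoints(vector: Sequence[float], n: int) -> List[Tuple[float, ...]]:
    """Assumed layout: channels 3k, 3k+1, 3k+2 hold x, y, z of keypoint k."""
    return [tuple(vector[3 * k: 3 * k + 3]) for k in range(n)]


sizes = decoder_spatial_sizes()          # 64 -> 32 -> 16 -> 8 -> 4 -> 2
pooled = global_average_pool([[[1.0, 3.0], [5.0, 7.0]]])
keypoints = to_keypoints([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], n=2)
```

Under these assumptions, five layers halve a 64 × 64 map down to 2 × 2, and global average pooling then collapses each of the N × 3 channels to a single coordinate value.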

Further, considering that different loss functions have different convergence speeds and lead to different extrema during network training, the present invention trains the joint encoding sub-network and the decoding sub-network using a multi-loss-function fusion training method. This method trains the network iteratively for three rounds with three different loss functions, taking the optimal weights obtained in each round as the initial weights of the next round, and stops once all three rounds are complete. Each loss function has a different sensitivity to errors of different magnitudes: the mean squared error loss (MSE) is sensitive to large errors and therefore converges quickly when errors are large, while the least absolute error loss (L1) and the smoothed least absolute error loss (SmoothL1) are sensitive to small errors and converge faster when errors are small. Three rounds of training are therefore run in sequence: the first round uses the mean squared error loss function, the second the least absolute error loss function, and the third the smoothed least absolute error loss function, with the optimal weights after each round serving as the initial weights of the next. This training scheme improves the detection accuracy of the network. The corresponding loss functions of the joint encoding sub-network are formulas (4) to (6):

L_hm1 = ∑ |E(I) − H|²    (4)

L_hm2 = ∑ |E(I) − H|    (5)

The corresponding loss functions of the decoding sub-network are formulas (7) to (9):

L_coord1 = ∑ |D(H) − c|²    (7)

L_coord2 = ∑ |D(H) − c|    (8)

Here, D denotes the decoding sub-network, c the 3D detection coordinate vector, H the joint heatmap, E the joint encoding sub-network, and I a face image with coordinate vectors.
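The three loss types can be written out on flat lists of predictions and targets. The squared-error and absolute-error functions follow the summed forms of formulas (4), (5), (7), and (8). The smoothed version is an assumption: it uses the widely used Smooth L1 definition with a threshold `beta`, which may not match the form used in the invention.

```python
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared errors, as in formulas (4) and (7)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of absolute errors, as in formulas (5) and (8)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def smooth_l1_loss(pred: Sequence[float], target: Sequence[float],
                   beta: float = 1.0) -> float:
    """Assumed Smooth L1: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


# One small error (0.5) and one large error (2.0).
pred, target = [0.0, 2.0], [0.5, 0.0]
losses = (mse_loss(pred, target), l1_loss(pred, target),
          smooth_l1_loss(pred, target))
```

In this example the large error dominates the squared loss (4.0 of 4.25), which reflects the statement that the mean squared error loss reacts most strongly to large errors.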

In a further embodiment of the present invention, the joint encoding sub-network and the decoding sub-network are first pre-trained separately, and the two networks are then connected and fine-tuned as a whole (Fig. 8 is a structural schematic of this complete network; in the implementation, a concat operation is added between the two network models to stack the 2D joint heatmaps into the 3D heatmap). This is carried out mainly in two steps:

Step 1: In the pre-training stage, the joint encoding sub-network is trained on face images with coordinate vectors, so that it forms a nonlinear mapping whose input layer is N face images with coordinate vectors and whose output layer is the joint heatmap. In parallel, the decoding sub-network is trained on 3D joint heatmaps, so that it forms a nonlinear mapping whose input layer is the 3D joint heatmap and whose output layer is the 3D detection coordinate vector.

Step 2: In the fine-tuning stage, the pre-trained decoding sub-network is connected after the pre-trained joint encoding sub-network, with a concat operation (implemented in code) added between the two networks, forming a complete joint-heatmap 3D face keypoint detection network model. This complete network model is then fine-tuned: its input is N original face images with coordinate vectors, and its outputs are, in order, the corresponding 2D joint heatmaps and the corresponding 3D keypoint coordinate vectors. Finally, the whole network is trained end to end with the multi-loss-function fusion training method: the first round uses the mean squared error loss function, with loss value L_hm1 + L_coord1; the second round uses the least absolute error loss function, with loss value L_hm2 + L_coord2; the third round uses the smoothed least absolute error loss function, with loss value L_hm3 + L_coord3. The optimal weights obtained in each round serve as the initial weights of the next round, and training stops with the final result once all three rounds are complete.
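The three-round schedule can be illustrated on a toy problem. The sketch below is not the training code of the invention: it fits a single weight w of a hypothetical model y = w · x by numerical gradient descent, uses one loss per round instead of the summed heatmap and coordinate losses, and assumes the standard Smooth L1 form for the third round. What it does show is the handover: each round keeps its best weight and passes it to the next round as the starting point.

```python
from typing import Callable, List, Sequence, Tuple

Loss = Callable[[Sequence[float], Sequence[float]], float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum(abs(p - t) for p, t in zip(pred, target))


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    # Assumed standard form with threshold 1.0.
    return sum(0.5 * d * d if d < 1.0 else d - 0.5
               for d in (abs(p - t) for p, t in zip(pred, target)))


def fit_round(w: float, loss: Loss, xs: Sequence[float], ys: Sequence[float],
              steps: int = 200, lr: float = 0.01, eps: float = 1e-4) -> float:
    """Toy stand-in for one training round; returns the best weight seen."""
    def value(weight: float) -> float:
        return loss([weight * x for x in xs], ys)

    best_w, best_loss = w, value(w)
    for _ in range(steps):
        grad = (value(w + eps) - value(w - eps)) / (2.0 * eps)  # numerical gradient
        w -= lr * grad
        current = value(w)
        if current < best_loss:          # track the optimal weight of this round
            best_w, best_loss = w, current
    return best_w


def multi_loss_training(w0: float, xs: Sequence[float],
                        ys: Sequence[float]) -> Tuple[float, List[float]]:
    """Run MSE, L1, then Smooth L1; each round starts from the previous best."""
    w, history = w0, []
    for loss in (mse, l1, smooth_l1):
        w = fit_round(w, loss, xs, ys)
        history.append(w)
    return w, history


final_w, history = multi_loss_training(0.0, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Returning the best weight rather than the last one matters in the absolute-error round, where a fixed step size makes the weight oscillate around the optimum.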

In a further embodiment of the present invention, the extracted detection coordinate vectors are compared against the reference coordinate vectors, and the algorithm is validated with concrete experimental data. The experimental results of this algorithm are compared for accuracy with algorithms in the prior art; the results are shown in Table 1 and Table 2:

Table 1: GTE performance comparison of 3D-FAN, JVCR, and this algorithm on the AFLW2000-3D dataset

Table 2: Network parameter size (MB) and time to process one image (ms) for 3DDFA, 3D-FAN, JVCR, and this algorithm

Fig. 9 shows a joint-heatmap-based 3D face keypoint detection system according to an exemplary embodiment of the present invention, namely an electronic device 310 (for example, a computer server capable of executing programs), which comprises at least one processor 311, a power supply 314, and a memory 312 and an input/output interface 313 communicatively connected to the at least one processor 311. The memory 312 stores instructions executable by the at least one processor 311, and the instructions, when executed by the at least one processor 311, enable it to perform the method disclosed in any of the foregoing embodiments. The input/output interface 313 may include a display, a keyboard, a mouse, and a USB interface, and is used to input and output data. The power supply 314 supplies electrical energy to the electronic device 310.

Those skilled in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium capable of storing program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.

When the integrated unit of the present invention is implemented in the form of a software functional unit and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solution of the embodiments of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the method of each embodiment of the present invention. The storage medium includes any medium capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.

The above is only a detailed description of specific embodiments of the present invention and does not limit the invention. Any replacement, modification, or improvement made by those skilled in the relevant art without departing from the principle and scope of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A lightweight 3D face keypoint detection method, characterized by comprising:

Step 101: projecting the N 3D reference coordinate vectors of the face keypoints in a database onto three two-dimensional planes for dimensionality reduction, wherein the three two-dimensional planes are the xy, xz, and yz planes, and x, y, and z are all positive or all negative; each two-dimensional plane contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors;

Step 102: constructing a joint encoding sub-network based on a k-order improved hourglass network, and training the joint encoding sub-network until its performance stabilizes; using the trained joint encoding sub-network to jointly encode the N 2D reference coordinate vectors under each 2D view into a 2D joint heatmap, wherein the residual unit of the k-order improved hourglass network uses a Residual+Inception structure;

Step 104: constructing a decoding sub-network based on a 2D fully convolutional network, and training the decoding sub-network until its performance stabilizes; using the decoding sub-network to decode the 3D joint heatmap into N 3D detection coordinate vectors.

2. The method according to claim 1, characterized in that the joint encoding sub-network is a 2-order improved hourglass network.

3. The method according to claim 1, characterized in that the joint encoding sub-network and the decoding sub-network are trained using a multi-loss-function fusion training method.

4. The method according to claim 3, characterized in that the multi-loss-function fusion training method trains the network iteratively for three rounds using three different loss functions, taking the optimal weights obtained in each round as the initial weights of the next round, and stops once all three rounds are complete.

5. The method according to claim 4, characterized in that the three loss functions are: the mean squared error loss function, the least absolute error loss function, and the smoothed least absolute error loss function.

6. The method according to claim 1, characterized in that the decoding sub-network comprises 4 2D convolutional layers, with batch normalization and a LeakyReLU activation function between the convolutional layers.

7. A lightweight 3D face keypoint detection system, characterized by comprising at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1 to 6.