CN109886341B - Method for training and generating human face detection model - Google Patents

Method for training and generating human face detection model

Info

Publication number
CN109886341B
CN109886341B (application CN201910139318.0A)
Authority
CN
China
Prior art keywords
layer
convolution
output
training
face
Prior art date
Legal status
Active
Application number
CN201910139318.0A
Other languages
Chinese (zh)
Other versions
CN109886341A (en)
Inventor
林煜
许清泉
韩演
张伟
余清洲
Current Assignee
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd
Priority to CN201910139318.0A
Publication of CN109886341A
Application granted
Publication of CN109886341B
Status: Active

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for training and generating a face detection model, which comprises the following steps. First, a face image with annotation data is obtained as a training image, wherein the annotation data comprises annotated face feature points. The training image is then input into a pre-trained face detection model to obtain a predicted first feature point and a predicted second feature point. Finally, the face detection model is trained based on the annotation data, the predicted first feature points and the predicted second feature points. According to the scheme, an efficient convolutional neural network structure is designed, real-time 3D face feature point detection can be achieved, and the accuracy and efficiency of face detection are improved.

Description

Method for training and generating human face detection model
Technical Field
The invention relates to the technical field of deep learning, in particular to a method for training and generating a face detection model, a face detection method, computing equipment and a storage medium.
Background
Face detection extracts face key points; the finer the key point detection, the better the subsequent construction of the face. At present, face detection algorithms generally use 5, 17, 68 or similar numbers of points; even though the number of face points is small, the operation is slow, real-time detection cannot be performed, and the user experience in practical application scenarios is poor. In addition, since a two-dimensional image has no depth information, the detection of face points in invisible regions of the face is obviously ambiguous, which may affect subsequent applications, for example the effects of virtual makeup, face retouching and the like.
Therefore, a face detection method is needed, which can perform real-time 3D feature point detection on a face image to reduce ambiguity of 2D feature point detection.
Disclosure of Invention
To this end, the present invention provides a method of training a generated face detection model, a face detection method, a computing device and a storage medium in an attempt to solve or at least alleviate at least one of the problems identified above.
According to one aspect of the invention, a method of training and generating a face detection model is provided, the method being adapted to be executed in a computing device. In the method, a face image with annotation data is first obtained as a training image, wherein the annotation data comprises annotated face feature points. The training image is then input into a pre-trained face detection model to obtain a predicted first feature point and a predicted second feature point. Finally, the face detection model is trained based on the annotation data, the predicted first feature points and the predicted second feature points.
The face detection model comprises a first convolution processing layer, a second convolution processing layer, a first regression processing layer and a second regression processing layer which are mutually coupled. The first regression processing layer is adapted to output the coordinates of the predicted first feature point, the second regression processing layer is adapted to output the coordinates of the predicted second feature point, and the output of the first regression processing layer is applied to the second regression processing layer.
Optionally, the output of the first convolution processing layer is applied to the first regression processing layer, and the output of the second convolution processing layer is used as the input of the first regression processing layer and the second regression processing layer.
Optionally, in the above method, the face image may be preprocessed, and the preprocessed image may be used as a training image. The preprocessing can include operations of cropping and scaling so as to obtain training images with consistent sizes.
Optionally, in the method, the first convolution processing layer is adapted to perform convolution and activation processing on the input image to output the feature map. The second convolution processing layer is suitable for performing pooling, convolution, activation and superposition processing on the feature map.
Optionally, in the above method, the first convolution processing layer includes a first convolution layer, a first active layer, a second convolution layer, and a second active layer, which are connected in sequence.
Optionally, in the above method, the second convolution processing layer includes a first pooling layer, a second pooling layer, a third pooling layer, and a first superposition layer, a second superposition layer, and a third superposition layer. The first superposition layer is suitable for superposing the output of the first pooling layer after one convolution and the output after multiple convolutions. A plurality of convolution layers and activation layers are arranged between the second pooling layer and the second superposition layer, and the second superposition layer is suitable for superposing the output of the second pooling layer after one convolution and the output after multiple convolutions. A plurality of convolution layers and activation layers are likewise arranged between the third pooling layer and the third superposition layer, and the third superposition layer is suitable for superposing the output of the third pooling layer after one convolution and the output after multiple convolutions.
Optionally, in the above method, the second regression processing layer includes a first fully connected layer, a third fully connected layer, a fifth fully connected layer, a sixth fully connected layer, and a fourth superposition layer, and the first regression processing layer includes a second fully connected layer, a fourth fully connected layer, and a tenth fully connected layer. The fourth superposition layer is suitable for superposing the output of the third fully connected layer and the activated output of the fourth fully connected layer.
Optionally, in the method, the first regression processing layer includes an image processing layer and a point alignment layer, the image processing layer is adapted to perform scale and angle transformation on the feature map output by the second activation layer, and the point alignment layer is adapted to perform cropping on the feature map.
Optionally, in the above method, the face detection model further includes a classification processing layer, which includes a seventh fully connected layer, an eighth fully connected layer, a ninth fully connected layer, and a normalization layer. The normalization layer is adapted to output the probabilities that the predicted feature points belong to positive and negative samples, respectively.
Optionally, in the above method, the training of the face detection model may use a deep learning framework, including: and acquiring parameters of a pre-trained face detection model, loading training data, and inputting the training data into the face detection model for training. Then, calculating a value of a first loss function based on the predicted first feature point and the labeled first feature point; calculating a value of a second loss function based on the predicted second feature point and the labeled second feature point; calculating a value of a third loss function based on the positive sample loss function and the negative sample loss function; and adjusting parameters of the face detection model by using a gradient descent method based on the value of the first loss function, the value of the second loss function and the value of the third loss function.
Optionally, in the method, the first feature point is a 2D feature point, and the second feature point is a 3D feature point.
Optionally, in the above method, the training data includes positive samples and negative samples, where the positive samples are face images and the negative samples are non-face images.
Optionally, in the above method, the training data may be converted into data in a format corresponding to the deep learning framework.
Optionally, in the above method, the parameters may include a learning rate, a maximum number of iterations, a batch size, a disturbance angle and a scale in an image preprocessing process.
According to another aspect of the present invention, there is provided a face detection method adapted to be executed in a computing device, comprising: inputting a face image to be detected into a face detection model generated by training, and outputting predicted face feature points, wherein the face detection model is generated by training according to the above method of training and generating a face detection model.
According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
According to a further aspect of the invention there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described above.
According to the scheme of the invention, a 3D feature point detection branch is added to the face detection model, so that the difficulty of training the face detection model can be reduced by reusing part of the information in the 2D feature point detection branch. More feature points describing facial details can be output, the accuracy and efficiency of face feature point detection are improved, subsequent practical applications are facilitated, and user experience is improved.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a computing device 100, according to an embodiment of the invention;
FIG. 2 shows a schematic flow diagram of a method 200 of training a generated face detection model according to one embodiment of the invention;
FIG. 3 illustrates a face image with annotated face feature points, according to one embodiment of the invention;
FIG. 4 is a schematic diagram illustrating the structure of a face detection model according to an embodiment of the invention;
FIG. 5 shows a schematic structural diagram of a first convolution processing layer according to one embodiment of the invention;
FIG. 6 shows a schematic structural diagram of a second convolution processing layer according to one embodiment of the present invention;
FIG. 7 shows a schematic structural diagram of a classification processing layer according to one embodiment of the invention;
FIG. 8 shows a schematic structural diagram of a second regression processing layer according to an embodiment of the invention;
FIG. 9 is a schematic structural diagram of a first regression processing layer according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The detection of face feature points is to locate the key feature points of the face according to the input face image. In practical applications, factors such as different scales, poses, occlusions, and illumination of the face affect the accuracy of face detection. Face feature point detection methods can be divided into generative and discriminative approaches. The generative approach can be cast as an optimization problem, while the discriminative approach locates feature points by learning independent local detectors or a regression model. This scheme provides a method for training and generating a face detection model that can output 2D and 3D face feature points simultaneously and improves the accuracy and efficiency of face feature point detection.
Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. In some embodiments, where computing device 100 is configured to perform method 200 of training a generated face detection model and a face detection method, program data 124 includes instructions for performing method 200.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, image input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media. In some embodiments, one or more programs are stored in a computer readable medium, the one or more programs including instructions for performing certain methods.
Computing device 100 may be implemented as part of a small-form factor portable (or mobile) electronic device such as a cellular telephone, a digital camera, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that include any of the above functions. Of course, the computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations, or as a server having the above-described configuration. The embodiments of the present invention are not limited thereto.
It should be noted that the face detection model is a multilayer network structure based on a convolutional neural network. A convolutional neural network may include an input layer, intermediate layers, and an output layer, the intermediate layers including convolution layers, activation layers, pooling layers, fully connected layers, and the like. The input layer may preprocess the training images. The convolution layer performs feature extraction on the input image and comprises a plurality of convolution kernels; each element of each convolution kernel corresponds to a weight coefficient and a bias value. The convolution operation can be understood as a sliding window in which the convolution kernel is multiplied by the corresponding image pixels and the products are summed. The activation layer adds a non-linear factor and can increase the sparsity of the network. The pooling layer is sandwiched between successive convolution layers and compresses the amount of data and parameters; on the one hand, it reduces the feature map and the computational complexity, and on the other hand, it compresses features and extracts the main ones while reducing overfitting. The fully connected layer connects all the features, and the output values can be sent to a classifier for recognition and classification. The basic structure of a convolutional neural network is known to those skilled in the art and will not be described in detail herein.
FIG. 2 illustrates a flow diagram of a method 200 of training and generating a face detection model according to one embodiment of the invention. As shown in fig. 2, the method 200 starts in step S210, in which a face image with annotation data is obtained as a training image, wherein the annotation data includes annotated face feature points. For example, feature points of the face contour, eyes, nose, mouth, etc., which can represent the features of the face, can be marked.
Fig. 3 illustrates a face image with annotated face feature points according to an embodiment of the present invention. As shown in fig. 3, the right side of the model's face is an invisible area; face feature points annotated with the help of image depth information better reflect the true contour positions.
According to an embodiment of the present invention, the pre-designed face detection model may include a first convolution processing layer, a second convolution processing layer, a first regression processing layer, and a second regression processing layer, which are coupled to each other. The feature map output by the first convolution processing layer is applied to the first regression processing layer, and the feature vector output by the second convolution processing layer is flattened and then used as the input of the classification processing layer and the regression processing layers. The output of the first regression processing layer is applied to the second regression processing layer. The first regression processing layer is adapted to output the predicted 2D feature points, and the second regression processing layer is adapted to output the predicted 3D feature points.
As shown in fig. 2, in step S220, a training image is input into the pre-trained face detection model to obtain a predicted first feature point and a predicted second feature point. Fig. 4 shows a schematic structural diagram of a face detection model according to an embodiment of the present invention. As shown in fig. 4, there is first an input layer adapted to feed a training image into the face detection model. The training images are used for forward calculation; since the sizes of the obtained training images may differ, the face images with annotated feature points may be preprocessed in advance, for example by cropping and scaling, so as to obtain a training image set of consistent size. The image size is not limited to one fixed value at all times; the image can be cropped according to the requirements of the application scene.
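As an illustrative sketch only (the patent does not provide this code), the cropping and scaling preprocessing could be written as follows in Python with OpenCV; the helper name preprocess, the box format, the normalisation, and the use of OpenCV are assumptions, while the 120 x 120 single-channel size matches the input definition given below.

import cv2
import numpy as np

def preprocess(image, box, size=120):
    # Crop the annotated face region (box = [x, y, w, h]) and scale it to a
    # fixed size so that all training images have a consistent size.
    x, y, w, h = box
    face = image[y:y + h, x:x + w]
    face = cv2.resize(face, (size, size), interpolation=cv2.INTER_LINEAR)
    # Convert to a single-channel grayscale image, matching the
    # 1 x 120 x 120 input described below.
    face = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    return face.astype(np.float32) / 255.0   # simple normalisation, assumed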
In order to improve efficiency, the structure and the tunable parameters of the face detection model can be designed using a deep learning framework. Any deep learning framework such as TensorFlow, Caffe, Keras, CNTK, Torch, or PyTorch can be used to set the structure and parameters of the model, and the scheme is not limited in this respect. According to one implementation of the invention, the face detection model of an embodiment of the invention may be defined using PyTorch. The deep learning framework PyTorch is a tensor computation and dynamic neural network framework that integrates with Python and has strong GPU support, and it can handle data loading, model definition, training process definition, test process definition, parameter definition, and the like.
According to one embodiment of the present invention, the size of the input image may be predefined. The first dim:1 indicates the number of data augmentations to be performed on the image data (1 means no augmentation is performed), the second dim:1 indicates a single-channel grayscale image, and dim:120, dim:120 indicate the height and width of the image.
Next is the first convolution processing layer; fig. 5 shows a schematic structural diagram of the first convolution processing layer according to an embodiment of the invention. As shown in fig. 5, the first convolution processing layer includes a first convolution layer, a first activation layer, a second convolution layer, and a second activation layer, which are connected in sequence. The first convolution processing layer is adapted to perform convolution and activation processing on the input image to output a feature map.
Parameters of each convolution layer may be set, including the convolution kernel size kernel_size, the stride stride, the number of convolution kernels num_output, whether padding is applied and the padding amount pad, the dilation coefficient dilation, whether grouped convolution is used and the number of groups group, and whether a bias term bias_term is used (true, i.e. on, by default), and so on. If group is larger than 1, the i-th group of output channels is connected only to the i-th group of input channels. For example, for an input of n × c0 × w0 × h0 the output is n × c1 × w1 × h1, where c1 is the number of convolution kernels, i.e. the number of generated feature maps, and:
w1=(w0+2*pad-kernel_size)/stride+1
h1=(h0+2*pad-kernel_size)/stride+1
If stride is set to 1, adjacent convolution windows overlap. If pad is set to (kernel_size-1)/2, the width and height remain unchanged after the operation.
The activation layer may use the ReLU activation function. The standard ReLU function is max(x, 0): when x > 0, x is output; when x ≤ 0, 0 is output. ReLU sets all negative values to zero, whereas Leaky ReLU assigns negative values a non-zero slope. The negative_slope parameter of the activation function defaults to 0; if it is set, negative input values are multiplied by negative_slope.
According to one implementation of the present invention, the convolution kernel size of the first convolution layer is 5 x 5, the stride is 3, the padding is 1, the dilation coefficient is 1, the number of groups is 1 (i.e., no grouping), and the number of convolution kernels is 16. The convolution kernel size of the second convolution layer is 1 x 1, the stride is 1, the padding is 0 (i.e., no padding), the dilation coefficient is 1, the number of groups is 1, and the number of convolution kernels is 16. Padding extends the feature map symmetrically left-right and top-bottom; for example, with a 5 x 5 convolution kernel, setting the padding to 1 extends each of the four edges by 1 pixel, i.e., the width and height are each extended by 2 pixels, which keeps the feature map from shrinking as much after the convolution operation.
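A minimal PyTorch sketch of the first convolution processing layer with the parameters stated above is given below; the module name FirstConvBlock and the use of plain ReLU activations are assumptions, and the output size follows from the formulas given earlier.

import torch
import torch.nn as nn

class FirstConvBlock(nn.Module):
    # Hypothetical module name; the layer parameters follow the values above.
    def __init__(self):
        super().__init__()
        # 5 x 5 kernel, stride 3, padding 1, 16 kernels:
        # w1 = (120 + 2*1 - 5) / 3 + 1 = 40, so the output is 16 x 40 x 40.
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, stride=3, padding=1)
        self.relu1 = nn.ReLU(inplace=True)
        # 1 x 1 kernel, stride 1, no padding, 16 kernels: size unchanged.
        self.conv2 = nn.Conv2d(16, 16, kernel_size=1, stride=1, padding=0)
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu2(self.conv2(self.relu1(self.conv1(x))))

x = torch.randn(1, 1, 120, 120)          # single-channel 120 x 120 input
print(FirstConvBlock()(x).shape)         # torch.Size([1, 16, 40, 40])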
Fig. 6 shows a schematic structural diagram of a second convolution processing layer according to an embodiment of the present invention. As shown in fig. 6, the second convolution processing layer coupled to the first convolution processing layer includes a first pooling layer, a second pooling layer, a third pooling layer, and a first superposition layer, a second superposition layer, and a third superposition layer. The first superposition layer is suitable for superposing the output of the first pooling layer after one convolution and the output after multiple convolutions. A plurality of convolution layers and activation layers are arranged between the second pooling layer and the second superposition layer, and the second superposition layer is suitable for superposing the output of the second pooling layer after one convolution and the output after multiple convolutions. A plurality of convolution layers and activation layers are likewise arranged between the third pooling layer and the third superposition layer, and the third superposition layer is suitable for superposing the output of the third pooling layer after one convolution and the output after multiple convolutions. The second convolution processing layer is suitable for performing pooling, convolution, activation and superposition processing on the feature map to output a feature vector. The pooling layer can reduce the size of the parameter matrix and may be added periodically in the second convolution processing layer in order to reduce the number of training parameters. The output of the second activation layer may be used as the input to the first pooling layer. The third convolution layer and the fourth convolution layer are connected in parallel, and the fourth convolution layer, the third activation layer, the fifth convolution layer, the fourth activation layer and the sixth convolution layer are connected in sequence.
According to an embodiment of the present invention, the pooling layer may be an average pooling layer, that is, feature points in the neighborhood are averaged to preserve more background information of the image. The pooling kernel size may be set to 2 x 2, the pooling method is AVE, and the stride is 2, i.e., no overlap and no padding. The parameters of each convolution layer can also be set; the fifth convolution layer may be a grouped convolution, which divides the standard convolution into several groups, performs convolution within each group separately, and finally combines the results. Grouped convolution reduces the number of training parameters. It should be noted that the grouping is done only along the channel (depth) dimension, i.e., certain channels are grouped together. Assume there are 4 groups; then each group has 64 input channels and 32 output channels (i.e., 256 input channels and 128 output channels in total). Each group requires 64 x 32 convolution kernels, so the total number of convolution kernels is 4 x 64 x 32 = 8192, a quarter of the parameter amount of the standard convolution. The number of convolution groups of the fifth convolution layer may be set to 8, the convolution kernel size to 3 x 3, and the number of convolution kernels to 32. The number of convolution groups of the tenth convolution layer is 16, the convolution kernel size is 3 x 3, and the number of convolution kernels is 64. The number of convolution groups of the fifteenth convolution layer is 32, the convolution kernel size is 3 x 3, and the number of convolution kernels is 96. The number of convolution groups of the eighteenth convolution layer is 8, the convolution kernel size is 3 x 3, and the number of convolution kernels is 128. Because the output data changes, the number of convolution kernels needs to change correspondingly, while the size of the convolution kernels does not.
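The parameter saving of grouped convolution in the 4-group example above can be checked with a short PyTorch sketch; the 256/128 channel counts are those implied by the example, not values stated elsewhere in the patent.

import torch.nn as nn

# Grouped 3 x 3 convolution: 256 input channels, 128 output channels, 4 groups.
grouped = nn.Conv2d(256, 128, kernel_size=3, groups=4, bias=False)
standard = nn.Conv2d(256, 128, kernel_size=3, bias=False)

# Each group maps 64 input channels to 32 output channels.
print(grouped.weight.shape)      # torch.Size([128, 64, 3, 3])
print(grouped.weight.numel())    # 128 * 64 * 3 * 3 = 73728 weights
print(standard.weight.numel())   # 128 * 256 * 3 * 3 = 294912, four times as many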
The first superposition layer performs superposition processing on the outputs of the third convolution layer and the sixth convolution layer. The second superposition layer performs superposition processing on the outputs of the eighth convolution layer and the eleventh convolution layer. The third superposition layer performs superposition processing on the outputs of the thirteenth convolution layer and the sixteenth convolution layer. The add operation in PyTorch can be used for element-wise value superposition, and the number of channels is unchanged. The superposition processing increases the amount of information describing the image features: the number of dimensions describing the image does not increase, but the amount of information per dimension increases, which is clearly beneficial to the final classification processing. After the processing of the first superposition layer, the operations of convolution, activation, pooling and superposition continue to be performed.
As shown in fig. 4, the output of the second convolution processing layer is flattened by the first flattening layer and then used as the input of the classification processing layer and the regression processing layers, which are connected in parallel. The flattening layer is used to "flatten" the input data, i.e., to collapse the multidimensional input into a single dimension, and is often used in the transition from a convolution layer to a fully connected layer. For example, a flatten operation can be defined in PyTorch: if a is a matrix or array, flattening reduces a to one dimension, by default along the row (lateral) direction.
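A small illustration of the flattening step follows; the feature-map shape used here is assumed, since the actual shape depends on the layer parameters above.

import torch
import torch.nn as nn

flatten = nn.Flatten()              # keeps the batch dimension, flattens the rest
feat = torch.randn(8, 128, 5, 5)    # hypothetical output of the second convolution processing layer
vec = flatten(feat)
print(vec.shape)                    # torch.Size([8, 3200]) = [batch, 128 * 5 * 5]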
Fig. 7 shows a schematic structural diagram of a classification processing layer according to an embodiment of the present invention. As shown in fig. 7, the classification processing layer includes a seventh fully connected layer, an eighth fully connected layer, a ninth fully connected layer, and a normalization layer. The normalization layer is adapted to output the probabilities that the predicted feature points belong to positive and negative samples, respectively. The seventh fully connected layer and the eighth fully connected layer are each followed by a ReLU activation layer, which adds non-linearity. The ReLU layer supports in-place computation, i.e., it directly modifies the tensor passed from the upper layer, meaning that output and input share the same storage so as to avoid extra memory consumption. A fully connected layer connects each unit of the previous layer to the next layer, which greatly reduces the influence of feature positions on classification and makes it convenient to hand over to the final classifier or regressor. The parameters of the fully connected layers that affect the model include the total number of fully connected layers (length), the number of neurons in a single fully connected layer (width), and the activation function. Finally, softmax is used for classification and normalization. Generally, the cross-entropy loss function is used for binary classification, but when the samples are imbalanced, the loss function is biased during training towards the class with more samples, so the training loss is small while the recognition accuracy for the minority class is low. The minority class can therefore be given a larger weight to form a weighted classification loss function.
In the embodiment of the invention, attention is also needed to balance the proportion between positive and negative samples during training. For example, if there are 30 positive samples and 70 negative samples, the weight w+ of the positive samples is 70/(30+70) = 0.7 and the weight w- of the negative samples is 30/(30+70) = 0.3. The positive samples should ensure that the face to be detected lies in the middle of the sample, and the negative samples are obtained on the one hand by randomly generating some samples and on the other hand by manually processing previous false detection results as newly added negative samples.
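The weighted classification loss described above can be sketched as follows in PyTorch; the class-index order (0 = negative, 1 = positive) is an assumption.

import torch
import torch.nn as nn

# Assume class 0 = negative (70 samples) and class 1 = positive (30 samples);
# each class is weighted by the proportion of the other class, as above.
weights = torch.tensor([30.0 / 100.0, 70.0 / 100.0])    # [w-, w+] = [0.3, 0.7]
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(16, 2)                 # output of the classification branch
target = torch.randint(0, 2, (16,))         # 0/1 ground-truth labels
loss_fd = criterion(logits, target)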
FIG. 9 is a schematic structural diagram of a first regression processing layer according to an embodiment of the invention. As shown in fig. 9, the first regression processing layer is the 2D face detection branch and, in addition to the first fully connected layer, the second fully connected layer, the fourth fully connected layer, and the tenth fully connected layer, it includes an image processing layer adapted to perform scale and angle transformation and cropping on the feature map output by the second activation layer. Because image features have scale invariance and rotation invariance, the training image can be constructed into a series of image sets at different scales, and features can be detected at different scales. For example, feature detection may first be performed at a coarse scale, and fine positioning may then be performed at a fine scale. A power operation can be applied to each input element, the output for each input x being computed as (shift + scale * x) ^ power. The optional parameters default to power = 1, scale = 1, and shift = 0. According to one implementation of the invention, the parameters may be set to power: 1.0, scale: 0.00833333376795, shift: 0.0. The point alignment layer then sets the crop size of the image; according to one embodiment of the present invention, the feature map may be cropped to a width of 3 and a height of 3 as the input to the nineteenth convolution layer. The nineteenth convolution layer may be set as a grouped convolution with 118 groups, a convolution kernel size of 3 x 3, a stride of 1, a dilation coefficient of 1, no padding, and 236 convolution kernels. After activation, flattening, full connection and activation processing, the output result is superposed into the second regression processing layer. The output result can also be used as the input of the tenth fully connected layer, which finally outputs the predicted 2D face feature points.
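The power transformation and the crop described above might be sketched as follows; the function names, the coordinate convention of the crop, and the omission of boundary handling are all assumptions.

import torch

def power_transform(x, power=1.0, scale=0.00833333376795, shift=0.0):
    # Output = (shift + scale * x) ** power for each element; scale is
    # approximately 1/120, i.e. values are normalised to the 120 x 120 input.
    return (shift + scale * x) ** power

def crop_feature_map(fmap, cx, cy, size=3):
    # Crop a size x size window (3 x 3, as above) around the aligned point
    # (cx, cy); boundary handling is omitted for brevity.
    half = size // 2
    return fmap[..., cy - half:cy + half + 1, cx - half:cx + half + 1]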
FIG. 8 shows a schematic structural diagram of a second regression processing layer according to an embodiment of the invention. As shown in fig. 8, the second regression processing layer is the 3D face detection branch and includes a first fully connected layer, a third fully connected layer, a fifth fully connected layer, a sixth fully connected layer, and a fourth superposition layer, where the fourth superposition layer is adapted to superpose the output of the third fully connected layer and the activated output of the fourth fully connected layer in the first regression processing layer. After the superposition, the predicted 3D face feature points are output through the fifth fully connected layer and the sixth fully connected layer. A fully connected layer treats the input as a vector and its output is also a simple vector (the width and height of the input data blob are changed to 1). For example, for an input of n × c0 × h × w the output is n × c1 × 1 × 1. A fully connected layer is in fact also a convolution layer whose kernel size equals the size of the input data, so its parameters are substantially the same as those of a convolution layer.
After the structure and parameter settings of the face detection model are completed, the model can be trained. Finally, in step S230, the face detection model is trained based on the annotation data, the predicted first feature points, and the predicted second feature points.
According to one embodiment of the present invention, a deep learning framework may be used to train the face detection model. First, the parameters of the pre-trained face detection model are obtained. The parameters required for training may include the learning rate, weights, biases, the maximum number of iterations, the batch size, the angle and scale used during image preprocessing, the crop size, the address for storing the predicted feature points, the address for storing the optimal feature points, the address for storing the model, and so on. Meanwhile, the learning strategy of the model, the calculation method of the loss functions, and the like can be obtained. The following is partial code for reading the model parameters:
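The listing above is reproduced in the patent only as an image; the hypothetical Python sketch below shows the kind of parameter definitions it might contain. Only maxepoch and batch_size are named in the surrounding text, and every other name and value is an assumption.

# Hypothetical hyperparameter definitions; values are illustrative only.
params = {
    "lr": 1e-3,                  # learning rate
    "maxepoch": 1000,            # maximum number of iterations (epochs)
    "batch_size": 64,            # batch size
    "angle": 30,                 # perturbation angle used in preprocessing
    "scale": 0.1,                # perturbation scale used in preprocessing
    "crop_size": 120,            # input crop size
    "snapshot_dir": "./models",  # address for storing the model
}
maxepoch = params["maxepoch"]
batch_size = params["batch_size"]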
where maxepoch is the maximum number of iterations and batch_size is the batch size. The above code is merely exemplary, and the setting of each parameter is not limited herein.
The image data may then be loaded and converted to the data type required by the deep learning framework. The data can then be brought into a model for training to obtain predicted 2D feature points, 3D feature points, and probabilities of the predicted feature points. The following is a partial code of the read data, including the operations of image pre-processing:
firstly, loading training data and testing data:
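The loading code is likewise reproduced only as an image; a minimal PyTorch-style sketch of loading training and test data might look as follows. FaceLandmarkDataset, train_samples, and test_samples are hypothetical names, not identifiers from the patent.

import torch
from torch.utils.data import Dataset, DataLoader

class FaceLandmarkDataset(Dataset):
    # Hypothetical dataset: each item is a preprocessed 1 x 120 x 120 image
    # tensor with its annotated 2D points, 3D points, and pos/neg label.
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        image, pts2d, pts3d, label = self.samples[idx]
        return image, pts2d, pts3d, label

train_samples, test_samples = [], []   # placeholders; filled from annotated images in practice
batch_size = 64                        # as read from the training parameters above
train_loader = DataLoader(FaceLandmarkDataset(train_samples),
                          batch_size=batch_size, shuffle=True)
test_loader = DataLoader(FaceLandmarkDataset(test_samples),
                         batch_size=batch_size, shuffle=False)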
the data is then converted to the format required by the framework:
output,output3d,output_fd_pos,dxdy=model(input_var)
the train function is the entry to the model training. The following is part of the code for training the model using the train function:
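The train-function listing is again provided as an image; the sketch below shows what such an entry function could look like. The criterion objects, the simple summation of the three losses, and all names other than the four model outputs (which appear in the line above) are assumptions.

def train(train_loader, model, criterion2d, criterion3d, criterion_cls, optimizer):
    # One training epoch: forward pass, compute the three losses, backpropagate.
    model.train()
    for input_var, pts2d, pts3d, label in train_loader:
        output, output3d, output_fd_pos, dxdy = model(input_var)
        loss = criterion2d(output, pts2d)              # first loss (2D points)
        loss3d = criterion3d(output3d, pts3d)          # second loss (3D points)
        loss_fd = criterion_cls(output_fd_pos, label)  # third loss (pos/neg)
        total = loss + loss3d + loss_fd                # combination is assumed
        optimizer.zero_grad()
        total.backward()
        optimizer.step()
    return total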
the value of the first loss function may be calculated based on the predicted 2D feature points and the annotation data; calculating a value of a second loss function based on the predicted 3D feature points and the annotation data; and calculating a value of a third loss function based on the positive sample loss function and the negative sample loss function. And returning the calculated loss value, and training the parameters of the model based on a gradient descent method.
The following is a partial exemplary code for calculating each loss function:
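The original loss-calculation listing is an image; below is a minimal sketch under the assumption that the point losses are L2 losses and the classification loss is a (weighted) cross-entropy, which is not stated explicitly in the patent.

import torch.nn.functional as F

def compute_losses(output, output3d, output_fd_pos, pts2d, pts3d, target, weights):
    # First loss: predicted 2D points vs. annotated 2D points (assumed L2).
    loss = F.mse_loss(output, pts2d)
    # Second loss: predicted 3D points vs. annotated 3D points (assumed L2).
    loss3d = F.mse_loss(output3d, pts3d)
    # Third loss: positive/negative classification; output_fd_pos has shape
    # [batch_size, num_classes] and target has length batch_size.
    loss_fd = F.cross_entropy(output_fd_pos, target, weight=weights)
    return loss, loss3d, loss_fd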
wherein the classification output is the model prediction with size batch_size × num_classes, and target is the ground-truth label of length batch_size. While positive samples are collected, negative samples are collected at the same time, so that the positive-sample loss is minimized while the negative-sample loss is maximized. The scheme can adopt a metric learning loss; the purpose of metric learning is, through training, to reduce or limit the distance between samples of the same class and to increase the distance between samples of different classes.
The loss can be backpropagated and the parameters updated using optimizer.zero_grad(), loss.backward(), and optimizer.step(). If the train loss keeps decreasing and the test loss keeps decreasing, the network is still learning; if the train loss keeps decreasing while the test loss stays flat, the network is overfitting; if the train loss stays flat while the test loss keeps decreasing, there is a problem with the data set; if both the train loss and the test loss stay flat, learning has hit a bottleneck and the learning rate or batch size needs to be reduced; if both the train loss and the test loss keep increasing, the network structure is improperly designed or the training hyperparameters are improperly set.
Finally, the parameters of the face detection model are adjusted by the gradient descent method based on the value loss of the first loss function, the value loss3d of the second loss function, and the value loss_fd of the third loss function.
After the training of the model is completed, the model can be used for face detection, and the face image to be detected is input into the face detection model generated by the training for forward calculation, so that the 3D feature points of the face of the corresponding image can be obtained.
With the trained face detection model, the number of output nodes of the 2D face point branch is about the number of face points × 2, and the number of output nodes of the 3D face point branch is about the number of face points × 3. Compared with traditional face point detection schemes, this scheme outputs 2D face points and at the same time draws on their semantic information to obtain the corresponding 3D information. The face detection model provided by this scheme can run in real time; tests verify that the processing time of a single image frame is less than 5 milliseconds.
With this scheme, the 2D feature point detection model is fused into the 3D feature point detection model, so that real-time 3D face detection can be realized. The accuracy of feature point detection can be improved, benefiting follow-up applications such as 3D makeup and AR special effects and improving user experience.
A5, the method of A4, wherein the first convolution processing layer includes a first convolution layer and a first active layer, a second convolution layer and a second active layer.
A6, the method of A4, wherein the second convolution processing layer includes a first superposition layer, a second superposition layer, a third superposition layer, and a first pooling layer, a second pooling layer, a third pooling layer,
the first stacking layer is suitable for stacking the output of the first pooling layer after one convolution and the output after multiple convolutions;
the second stacking layer is suitable for stacking the output of the second pooling layer after one convolution and the output after multiple convolutions;
a plurality of convolution layers and activation layers are arranged between the third pooling layer and the third overlapping layer, and the third overlapping layer is suitable for overlapping the output of the third pooling layer after one convolution and the output after multiple convolutions.
A10, the method of A3, wherein the step of training a face detection model comprises:
acquiring parameters of a pre-trained face detection model and loading training data;
inputting the training data into a face detection model for training;
calculating a value of a first loss function based on the predicted first feature points and the labeled face feature points;
calculating a value of a second loss function based on the predicted second feature points and the labeled face feature points;
calculating a value of a third loss function based on the positive sample loss function and the negative sample loss function; and
and adjusting parameters of the face detection model by using a gradient descent method based on the value of the first loss function, the value of the second loss function and the value of the third loss function.
A11, the method as recited in A10, wherein the first feature points are 2D feature points and the second feature points are 3D feature points.
A12, the method of A10, wherein the training data includes positive samples and negative samples, the positive samples are human face images, and the negative samples are non-human face images.
A13, the method as recited in A10, wherein the parameters include at least learning rate, maximum number of iterations, batch size, perturbation angle and scale during image pre-processing.
A14, the method of A10, wherein the method further comprises:
and converting the training data into data in a format corresponding to the deep learning framework.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (15)

1. A method of training a generated face detection model, adapted to be executed in a computing device, wherein the method comprises:
acquiring a face image with annotation data as a training image, wherein the annotation data comprises annotated face characteristic points;
inputting the training image into a pre-trained face detection model to obtain a predicted first feature point and a predicted second feature point, wherein the first feature point is a 2D feature point, and the second feature point is a 3D feature point;
training the face detection model based on the annotation data, the predicted first feature points and the predicted second feature points,
the human face detection model comprises a first convolution processing layer, a second convolution processing layer, a first regression processing layer and a second regression processing layer which are coupled with one another, wherein the input of the first convolution processing layer is a face image and the output of the first convolution processing layer is a feature map; the input of the second convolution processing layer is the feature map and the output of the second convolution processing layer is a feature vector; the input of the first regression processing layer is the feature map and the feature vector, and the output of the first regression processing layer is the coordinates of a predicted first feature point; the input of the second regression processing layer is the feature vector and the coordinates of the first feature point, and the output of the second regression processing layer is the coordinates of a predicted second feature point.
2. The method of claim 1, wherein the method further comprises:
and preprocessing the face image, wherein the preprocessed image is used as a training image, and the preprocessing comprises cutting and scaling operations to obtain the training image with consistent size.
3. The method of claim 1, wherein,
the first convolution processing layer is suitable for performing convolution and activation processing on an input image to output a feature map;
the second convolution processing layer is suitable for performing pooling, convolution, activation and superposition processing on the feature map to output the feature vector.
4. The method of claim 3, wherein the first convolutional processing layer comprises a first convolutional layer and a first active layer, a second convolutional layer and a second active layer.
5. The method of claim 3, wherein the second convolution processing layer includes a first superposition layer, a second superposition layer, a third superposition layer, and a first pooling layer, a second pooling layer, a third pooling layer,
the first stacking layer is suitable for stacking the output of the first pooling layer after one convolution and the output after multiple convolutions;
the second stacking layer is suitable for stacking the output of the second pooling layer after one convolution and the output after multiple convolutions;
a plurality of convolution layers and activation layers are arranged between the third pooling layer and the third overlapping layer, and the third overlapping layer is suitable for overlapping the output of the third pooling layer after one convolution and the output after multiple convolutions.
6. The method of claim 1, wherein the second regression processing layer comprises a first fully-connected layer, a third fully-connected layer, a fifth fully-connected layer, a sixth fully-connected layer and a fourth superposition layer, and the first regression processing layer comprises a second fully-connected layer, a fourth fully-connected layer and a tenth fully-connected layer, wherein
the fourth superposition layer is adapted to superpose the output of the third fully-connected layer with the activated output of the fourth fully-connected layer.
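The following sketch illustrates only the superposition idea of claim 6: one fully-connected output is added to the activated output of another before the final regression. The layer sizes and point count are assumptions:

```python
import torch
import torch.nn as nn

class RegressionHeadSketch(nn.Module):
    def __init__(self, in_dim: int = 128, num_points: int = 68):
        super().__init__()
        self.fc_a = nn.Linear(in_dim, 128)
        self.fc_b = nn.Linear(in_dim, 128)
        self.act = nn.ReLU()
        self.fc_out = nn.Linear(128, num_points * 3)  # 3D feature points

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # Superposition of one FC output with the activated output of another FC layer.
        fused = self.fc_a(v) + self.act(self.fc_b(v))
        return self.fc_out(fused)

points_3d = RegressionHeadSketch()(torch.randn(1, 128))
```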
7. The method of claim 4, wherein the first regression processing layer comprises an image processing layer adapted to perform scale and angle transformation on the feature map output by the second activation layer, and a point alignment layer adapted to crop the feature map.
8. The method of claim 7, wherein the face detection model further comprises a classification processing layer comprising a seventh fully-connected layer, an eighth fully-connected layer, a ninth fully-connected layer, and a normalization layer adapted to output probabilities that predicted feature points belong to positive and negative samples, respectively.
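For claim 8, the classification branch can be pictured as a few fully-connected layers ending in a normalization (softmax) layer that yields the probability of a sample being a face (positive) or non-face (negative). This is an illustrative sketch with assumed sizes:

```python
import torch
import torch.nn as nn

classification_head = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),      # two scores: positive / negative sample
    nn.Softmax(dim=1),     # normalization layer -> probabilities
)
probs = classification_head(torch.randn(1, 128))
```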
9. The method of claim 2, wherein the step of training the face detection model comprises:
acquiring parameters of a pre-trained face detection model and loading training data;
inputting the training data into a face detection model for training;
calculating a value of a first loss function based on the predicted first feature points and the labeled face feature points;
calculating a value of a second loss function based on the predicted second feature points and the labeled face feature points;
calculating a value of a third loss function based on the positive sample loss function and the negative sample loss function; and
adjusting the parameters of the face detection model by a gradient descent method based on the value of the first loss function, the value of the second loss function and the value of the third loss function.
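As an illustration of the training step in claim 9, the sketch below combines two regression losses on the predicted 2D and 3D feature points with a classification loss over positive and negative samples, then applies a gradient descent update. PyTorch, the specific loss choices (L2, cross-entropy) and the equal weighting are assumptions, not prescribed by the patent:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()

def train_step(model, optimizer, images, gt_points_2d, gt_points_3d, labels):
    optimizer.zero_grad()
    pred_2d, pred_3d, cls_logits = model(images)   # model outputs are assumed
    loss_1 = mse(pred_2d, gt_points_2d)            # first loss: 2D feature points
    loss_2 = mse(pred_3d, gt_points_3d)            # second loss: 3D feature points
    loss_3 = ce(cls_logits, labels)                # third loss: positive/negative samples
    total = loss_1 + loss_2 + loss_3
    total.backward()                               # gradient descent update
    optimizer.step()
    return total.item()
```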
10. The method of claim 9, wherein the training data comprises positive samples and negative samples, the positive samples being face images and the negative samples being non-face images.
11. The method of claim 9, wherein the parameters include at least a learning rate, a maximum number of iterations, a batch size, and a perturbation angle and perturbation scale used in image preprocessing.
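An illustrative configuration covering the parameters named in claim 11; every numeric value here is an assumption:

```python
config = {
    "learning_rate": 1e-3,
    "max_iterations": 200_000,
    "batch_size": 64,
    "perturb_angle_deg": 15,   # random rotation applied during image preprocessing
    "perturb_scale": 0.1,      # random scale jitter applied during image preprocessing
}
```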
12. The method of claim 9, wherein the method further comprises:
converting the training data into data in a format corresponding to the deep learning framework.
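A hedged sketch of claim 12: serializing training samples into a format the chosen deep learning framework reads natively. TFRecord is used purely as an example; the patent does not name a framework, and the field names are made up:

```python
import tensorflow as tf

def write_tfrecord(samples, path="train.tfrecord"):
    """samples: iterable of (jpeg_bytes, landmark_list) pairs."""
    with tf.io.TFRecordWriter(path) as writer:
        for jpeg_bytes, landmarks in samples:
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
                "landmarks": tf.train.Feature(float_list=tf.train.FloatList(value=landmarks)),
            }))
            writer.write(example.SerializeToString())
```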
13. A face detection method, adapted to be executed in a computing device, the method comprising:
inputting the face image to be detected into the trained face detection model and outputting the predicted face feature points,
wherein the face detection model is trained based on the method as claimed in any one of claims 1-12.
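A minimal inference sketch for claim 13: feed a preprocessed face crop into the trained model and read out the predicted feature points. The TorchScript file name, the two-output signature and the 112x112 input size are assumptions, not from the patent:

```python
import torch

model = torch.jit.load("face_detector.pt").eval()   # trained face detection model (assumed export)
with torch.no_grad():
    points_2d, points_3d = model(torch.randn(1, 3, 112, 112))  # preprocessed face crop
```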
14. A computing device, comprising:
one or more processors; and
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-13.
15. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-13.
CN201910139318.0A 2019-02-25 2019-02-25 Method for training and generating human face detection model Active CN109886341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910139318.0A CN109886341B (en) 2019-02-25 2019-02-25 Method for training and generating human face detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910139318.0A CN109886341B (en) 2019-02-25 2019-02-25 Method for training and generating human face detection model

Publications (2)

Publication Number Publication Date
CN109886341A CN109886341A (en) 2019-06-14
CN109886341B true CN109886341B (en) 2021-03-02

Family

ID=66929245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910139318.0A Active CN109886341B (en) 2019-02-25 2019-02-25 Method for training and generating human face detection model

Country Status (1)

Country Link
CN (1) CN109886341B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11847821B2 (en) 2021-09-03 2023-12-19 Realtek Semiconductor Corp. Face recognition network model with face alignment based on knowledge distillation

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276332B (en) * 2019-06-28 2021-12-24 北京奇艺世纪科技有限公司 Video feature processing method and device
CN112446266B (en) * 2019-09-04 2024-03-29 北京君正集成电路股份有限公司 Face recognition network structure suitable for front end
CN110866471A (en) * 2019-10-31 2020-03-06 Oppo广东移动通信有限公司 Face image quality evaluation method and device, computer readable medium and communication terminal
CN110850020B (en) * 2019-11-11 2022-03-29 中国药科大学 Traditional Chinese medicine identification method based on artificial intelligence
CN111862031B (en) * 2020-07-15 2024-09-03 七腾机器人有限公司 Face synthetic image detection method and device, electronic equipment and storage medium
CN112232505A (en) * 2020-09-10 2021-01-15 北京迈格威科技有限公司 Model training method, model processing method, model training device, electronic equipment and storage medium
CN112329983B (en) * 2020-09-30 2024-07-26 联想(北京)有限公司 Data processing method and device
CN112580435B (en) * 2020-11-25 2024-05-31 厦门美图之家科技有限公司 Face positioning method, face model training and detecting method and device
CN113361588B (en) * 2021-06-03 2024-06-25 北京文安智能技术股份有限公司 Image training set generation method and model training method based on image data enhancement

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268885B (en) * 2017-01-03 2020-06-30 京东方科技集团股份有限公司 Feature point detection method, device, and computer-readable storage medium
CN108182384B (en) * 2017-12-07 2020-09-29 浙江大华技术股份有限公司 Face feature point positioning method and device
CN108399649B (en) * 2018-03-05 2021-07-20 中科视拓(北京)科技有限公司 Single-picture three-dimensional face reconstruction method based on cascade regression network
CN109214282B (en) * 2018-08-01 2019-04-26 中南民族大学 A kind of three-dimension gesture critical point detection method and system neural network based
CN109087261B (en) * 2018-08-03 2020-08-18 上海依图网络科技有限公司 Face correction method based on unlimited acquisition scene

Also Published As

Publication number Publication date
CN109886341A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109886341B (en) Method for training and generating human face detection model
Haider et al. Deepgender: real-time gender classification using deep learning for smartphones
CN110796080B (en) Multi-pose pedestrian image synthesis algorithm based on generation countermeasure network
CN107808147B (en) Face confidence discrimination method based on real-time face point tracking
CN110135406B (en) Image recognition method and device, computer equipment and storage medium
CN108268885B (en) Feature point detection method, device, and computer-readable storage medium
EP2864933B1 (en) Method, apparatus and computer program product for human-face features extraction
Jin et al. Pedestrian detection with super-resolution reconstruction for low-quality image
CN108038823B (en) Training method of image morphing network model, image morphing method and computing device
CN107832794B (en) Convolutional neural network generation method, vehicle system identification method and computing device
CN108875767A (en) Method, apparatus, system and the computer storage medium of image recognition
JP2023528697A (en) Computer-implemented methods, apparatus and computer program products
Fang et al. Aerial-BiSeNet: A real-time semantic segmentation network for high resolution aerial imagery
Ribeiro et al. Exploring deep learning image super-resolution for iris recognition
JP7519127B2 (en) Method for verifying the identity of a user by identifying objects in an image that have a biometric characteristic of the user and isolating portions of the image that contain the biometric characteristic from other portions of the image - Patents.com
Salmam et al. Fusing multi-stream deep neural networks for facial expression recognition
Tsai et al. Frontalization and adaptive exponential ensemble rule for deep-learning-based facial expression recognition system
CN110414516B (en) Single Chinese character recognition method based on deep learning
CN113361378B (en) Human body posture estimation method using adaptive data enhancement
US20230110393A1 (en) System and method for image transformation
Kartheek et al. Modified chess patterns: handcrafted feature descriptors for facial expression recognition
CN109934132A (en) Face identification method, system and storage medium based on random drop convolved data
Liang et al. Facial expression recognition using LBP and CNN networks integrating attention mechanism
Yip et al. Image pre-processing using OpenCV library on MORPH-II face database
CN106326891A (en) Mobile terminal, target detection method and device of mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant