CN109033953A - Training method, device, and storage medium for a multi-task learning deep network
- Publication number: CN109033953A
- Application number: CN201810614856.6A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06V40/168—Feature extraction; Face representation (G06V—Image or video recognition or understanding; G06V40/16—Human faces)
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F18/00—Pattern recognition)
- G06N3/045—Combinations of networks (G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
Abstract
The embodiments of the invention disclose a training method, a device, and a storage medium for a multi-task learning deep network. In the embodiments, a training set is input into the multi-task learning deep network; several convolutional layers and pooling layers in the network output corresponding operation results; feature fusion is performed on the output operation results; each task in the multi-task learning is learned from the feature fusion data; and a prediction result corresponding to each task is output. By fusing the operation results output by multiple convolutional and pooling layers and performing the multi-task learning on the fused features, the training method improves the detection accuracy of each task in the multi-task learning and thus the performance of the multi-task learning network.
Description
Technical Field
The embodiments of the invention relate to the field of biometric recognition, and in particular to a training method, a training device, and a storage medium for a multi-task learning deep network.
Background
Face recognition is an important problem in computer vision; among its more important sub-problems are face detection, facial feature point recognition, facial feature point localization, and the like. Many visual tasks rely on accurate facial feature point localization results, such as face recognition, facial expression analysis, and facial animation. Although facial feature point localization has been widely studied and used in recent years with some degree of success, it still faces many problems and challenges, because face images vary widely and in complex ways with partial occlusion, illumination, large head rotations, and exaggerated changes in expression.
In the prior art, methods for locating facial feature points can be roughly divided into two categories: traditional methods and deep-learning-based methods. Typical traditional methods include model-based methods and regression-based methods. Model-based methods, such as the Active Shape Model (ASM) and the Active Appearance Model (AAM), learn shape increments from a given average initial shape, using statistical models such as Principal Component Analysis (PCA) to capture shape and appearance variations, respectively. However, because a single linear model can hardly characterize the complex non-linear variations of real scene data, traditional model-based methods cannot obtain an accurate shape for face images with large head pose variations and exaggerated facial expressions. Regression-based traditional methods predict the locations of the keypoints by training an appearance model. Some researchers predict shape increments by applying linear regression to Scale-Invariant Feature Transform (SIFT) features. Others propose using pixel intensity differences as feature sequences to learn a series of random fern regressors, gradually refining the shape through a learned cascade and regressing all parameters simultaneously, thereby effectively exploiting shape constraints. In short, regression-based methods iteratively refine the predicted landmark positions starting from an initial estimate, so the final result depends heavily on the initialization.
Several deep-learning-based methods already exist. Sun et al. propose locating facial feature points with a three-level cascaded convolutional neural network framework: a Convolutional Neural Network (CNN) regresses 5 feature points of a human face (the left and right eyes, the nose tip, and the left and right mouth corners), and convolutional networks at subsequent levels fine-tune the combination of feature points. Furthermore, Zhang et al. propose a Coarse-to-Fine Auto-encoder Network (CFAN) for deep nonlinear feature point localization, which implements a nonlinear regression model using successive auto-encoder networks. Both methods use multiple deep networks to locate feature points step by step in a cascaded fashion. They search each image from coarse to fine for the best feature point locations and show higher accuracy than earlier feature point localization methods, but they do not handle the occlusion problem effectively. Moreover, because multiple convolutional neural network structures are employed, the time needed to locate all the points grows as the number of facial feature points increases. In a real unconstrained environment, facial feature point localization is also not really an isolated task; it is disturbed by various factors, such as head pose and gender differences, which affect the accuracy of feature point localization.
Disclosure of Invention
The embodiment of the invention mainly solves the technical problem of providing a training method of a multi-task learning deep network, which can improve the performance of the multi-task learning network.
In order to solve the above technical problem, one technical solution adopted by the embodiments of the present invention is: a training method of a multi-task learning deep network is provided, and comprises the following steps:
inputting a training set into a plurality of cascaded convolutional layers and pooling layers in a multi-task learning deep network step by step, and respectively outputting corresponding first operation results from the convolutional layers and the pooling layers;
inputting the first operation result into a feature fusion full link layer of the multitask learning deep network, and outputting feature fusion data;
inputting the feature fusion data into a full link layer corresponding to each task in the multi-task learning to respectively learn each task, and respectively outputting a prediction result corresponding to each task;
and correcting the multi-task learning deep network by using the prediction results and the feature information labeled in the training set.
In order to solve the above technical problem, another technical solution adopted in the embodiments of the present invention is: there is provided a training apparatus for a multitask learning deep network, the training apparatus comprising:
a memory and a processor connected to each other;
the memory stores a training set, a constructed multi-task learning deep network and program data;
the processor is used for executing the training method according to the program data, and training the multi-task learning deep network by using the training set.
In order to solve the above technical problem, another technical solution adopted in the embodiments of the present invention is: there is provided a storage medium storing program data executable to implement the above-described training method for a multitask learning deep network.
The embodiments of the invention have the following beneficial effects. In the training method of the multi-task learning deep network, a training set is input into the multi-task learning deep network; corresponding first operation results are output from several convolutional layers and pooling layers in the network; feature fusion is performed on the output first operation results; each task in the multi-task learning is learned from the feature fusion data, and a prediction result corresponding to each task is output; and the multi-task learning deep network is corrected using the prediction results and the feature information labeled in the training set. Because the operation results output by several convolutional and pooling layers are feature-fused and the multi-task learning is performed on the fused features, the detection accuracy of each task in the multi-task learning is improved, and so is the performance of the multi-task learning network.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a first embodiment of a training method for a deep network for multitask learning according to the present invention;
FIG. 2 is a schematic flow chart diagram illustrating an embodiment of step S101 in FIG. 1;
FIG. 3 is a schematic flow chart of another embodiment of step S101 in FIG. 1;
FIG. 4 is a schematic structural diagram of an embodiment of the deep network for multitasking learning according to the present invention;
FIG. 5 is a flowchart illustrating a training method for a deep network for multitask learning according to a second embodiment of the present invention;
FIG. 6 is a flowchart illustrating a third embodiment of the training method for the deep network for multitask learning according to the present invention;
FIG. 7 is a schematic flow chart diagram illustrating a further embodiment of step S101 in FIG. 6;
FIG. 8 is a schematic structural diagram of an embodiment of a training apparatus for a deep network for multitask learning according to the present invention;
FIG. 9 is a schematic structural diagram of another embodiment of the training apparatus for the multitask learning deep network according to the present invention;
FIG. 10 is a flowchart illustrating a first exemplary embodiment of a method for testing a deep network for multi-task learning according to the present invention;
FIG. 11 is a schematic flow chart diagram illustrating one embodiment of step S201 in FIG. 10;
FIG. 12 is a schematic structural diagram of an embodiment of a first stage neural network of the two-stage cascaded convolutional neural network of the present invention;
FIG. 13 is a schematic diagram of an embodiment of a second stage neural network of the two-stage cascaded convolutional neural network of the present invention;
FIG. 14 is a flowchart illustrating a second embodiment of the method for testing a multi-task learning deep network of the present invention;
FIG. 15 is a flowchart illustrating a third exemplary embodiment of the deep network testing method for multi-task learning according to the present invention;
FIG. 16 is a flowchart illustrating a fourth exemplary embodiment of the deep network testing method for multi-task learning according to the present invention;
FIG. 17 is a block diagram of an embodiment of a testing apparatus for a multi-task learning deep network of the present invention;
FIG. 18 is a schematic structural diagram of an embodiment of a storage medium according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a training method of a deep multitask learning network according to a first embodiment of the present invention. As shown in fig. 1, the training method of the multi-task learning deep network of the present embodiment at least includes the following steps:
in step S101, the training set is input to the multitask learning deep network to perform multitask learning, and a prediction result of the multitask learning is output.
In this embodiment, the images of the training set are used as the training data source and input into the initially constructed multi-task learning deep network, which performs multi-task learning on the images contained in the training set to obtain the prediction results of the multi-task learning.
In this embodiment, the multi-task learning includes a feature point positioning task, a feature point visibility prediction task, a face detection task, and a gender identification task. Therefore, the initially constructed multi-task learning depth network correspondingly outputs the feature point positioning result, the feature point visibility prediction result, the face detection result and the gender identification result of the face of the image contained in the training set.
In this embodiment, the initially constructed multi-task learning network is trained with the AFLW data set as an example training set. The AFLW data set contains face images in mostly natural conditions and carries a very large amount of information: each face is annotated with 21 feature points, and face boxes, head poses, and gender information are also labeled. The data set contains 25993 manually labeled face images, of which 41% are male and 59% are female; most of the images are color images and only a small part are grayscale. In this embodiment, most of the images in the AFLW data set are used as the training set of the multi-task learning deep network, and a small part can be reserved for testing the trained network to determine whether it meets the required accuracy.
In step S102, the prediction result is compared with the labeling result in the training set, and a loss value corresponding to the multitask learning is obtained according to the comparison result.
Step S101 yields the prediction result corresponding to each of the feature point positioning, feature point visibility prediction, face detection, and gender identification tasks performed by the initially constructed multi-task learning deep network. In this step, the obtained prediction results are compared with the labeled results on the images in the training set, and the loss value corresponding to each task in the multi-task learning is then obtained respectively.
In step S103, the loss value is fed back to the multitask learning deep network, and the multitask learning deep network is corrected.
The loss value corresponding to each task represents the accuracy of that task. The loss values participate in back propagation to obtain the errors of the upper layers of the multi-task learning deep network, so that the network is corrected; the finally corrected multi-task learning deep network is the trained multi-task learning deep network.
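For illustration, steps S101 to S103 can be sketched as a short training loop. The following is a minimal, hypothetical PyTorch-style sketch, assuming a network module with a `total_loss` helper; all names are illustrative assumptions, not the patent's reference implementation:

```python
import torch

def train_epoch(net, loader, optimizer):
    # labels carry the annotated feature information: landmarks, visibility,
    # face/non-face flag, and gender for each training region.
    for images, labels in loader:
        preds = net(images)                    # S101: per-task prediction results
        loss = net.total_loss(preds, labels)   # S102: loss value of the multi-task learning
        optimizer.zero_grad()
        loss.backward()                        # S103: feed the loss back through the network
        optimizer.step()                       # correct (update) the network weights
```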
Referring further to fig. 2, fig. 2 is a schematic flowchart illustrating an embodiment of step S101 in fig. 1. As shown in fig. 2, step S101 may include at least the following steps:
in step S1011, the training set is gradually input into the plurality of cascaded convolutional layers and pooling layers in the multitask learning depth network, and corresponding first operation results are respectively output from a plurality of convolutional layers and/or a plurality of pooling layers in the plurality of convolutional layers and pooling layers.
In this embodiment, the initially constructed multi-task learning deep network is built by improving on the network structure of the AlexNet network.
The multitask learning deep network comprises a plurality of cascaded convolution layers and pooling layers so as to carry out convolution operation or pooling operation on output data. Each convolution layer and each pooling layer can obtain corresponding operation results through corresponding operations, and the operation results are corresponding characteristic information. In this embodiment, a part of the convolutional layers and/or a part of the pooling layers are selected from the plurality of cascaded convolutional layers and pooling layers as output layers of the operation result, that is, corresponding operation results of the plurality of convolutional layers and/or the plurality of pooling layers are extracted from the plurality of convolutional layers and pooling layers, respectively, and the extracted operation result is used as a first operation result.
Further, since the feature information included in the operation result output by each convolution layer and each pooling layer is not exactly the same, the first operation result obtained by extracting the corresponding operation results of the plurality of convolution layers and/or the plurality of pooling layers can satisfy the requirement of the information amount required for the multitask learning.
In the multi-task learning deep network, the operation results output by the shallower convolutional and pooling layers contain more edge and corner information, which benefits the learning of the feature point positioning task, while the operation results output by the deeper convolutional and pooling layers contain more holistic information, which benefits the learning of more complex tasks such as face detection and gender identification. Therefore, in this embodiment, the selected convolutional layers and/or pooling layers include at least several shallower layers and several deeper layers, so that the obtained first operation results contain enough edge and corner information as well as a certain amount of holistic information for multi-task learning. The specific number of extracted layers needs to be tuned according to the final prediction results, to avoid the first operation results containing an excessive amount of information.
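As an illustration of tapping both shallower and deeper layers, the sketch below registers forward hooks on a torchvision AlexNet standing in for the backbone; the tapped layer indices are assumptions chosen for illustration, not the patent's exact layers:

```python
import torch
import torchvision

backbone = torchvision.models.alexnet(weights=None).features
taps = {2: None, 7: None, 12: None}    # shallow pool, mid conv, deep pool (illustrative)

def make_hook(idx):
    def hook(module, inputs, output):
        taps[idx] = output             # store this layer's operation result
    return hook

for idx in taps:
    backbone[idx].register_forward_hook(make_hook(idx))

x = torch.randn(1, 3, 227, 227)
_ = backbone(x)                        # taps now holds the three first operation results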
In step S1012, the first calculation result is input to the feature fusion full link layer, and feature fusion data is output.
After the corresponding first operation results are extracted, because the information content of the plurality of convolution layers and/or pooling layers is large, multi-task learning cannot be directly performed, and the extracted plurality of corresponding first operation results need to be subjected to feature fusion and mapped to one subspace, so that the network performance is improved.
In this embodiment, the plurality of corresponding first operation results obtained in step S1011 are output to the feature fusion full link layer in the multitask learning deep network, feature fusion is performed on the plurality of input corresponding first operation results through the feature fusion full link layer, and feature fusion data is output.
In step S1013, feature fusion data is input to the all-link layer corresponding to each task in the multi-task learning, and learning is performed for each task, and a prediction result corresponding to each task is output.
And further inputting the feature fusion data obtained after feature fusion into a full link layer corresponding to each task in the multi-task learning deep network, performing feature classification on the input feature fusion data by using the full link layer corresponding to each task, and respectively linking to branches corresponding to each task to obtain a prediction result of each task.
Referring to fig. 3, as shown in fig. 3, in another embodiment of the step S101, after obtaining the corresponding first operation results respectively output by the plurality of convolutional layers and/or pooling layers in the step S1011, the method may further include the following steps:
in step S1014, at least a part of the first operation result is input to the corresponding sub convolution layers, and the corresponding second operation result having the same dimension is output.
In this embodiment, the feature data (i.e., feature maps) output by the cascaded convolutional and pooling layers in the multi-task learning deep network have different sizes; because the corresponding first operation results output from the convolutional layers and/or pooling layers in step S1011 differ in size, they cannot be connected directly. Thus, in this embodiment, each convolutional layer and/or pooling layer obtained in step S1011 outputs at least a portion of its corresponding first operation result to a corresponding sub-convolutional layer, and the convolution kernel size of each sub-convolutional layer matches the size of the first operation result input to it, so that second operation results with the same dimensions are obtained.
It can be understood that the spatial size of the operation result output by a deeper convolutional or pooling layer is smaller than that output by a shallower one; the more convolutional and pooling layers the data has passed through, the smaller the spatial size of the output. Thus, the size of the first operation result output by the deepest convolutional or pooling layer that outputs a first operation result can be set as the standard size, and the outputs of the preceding convolutional or pooling layers are adjusted to this standard size. For example, if the size of the first operation result output by the deepest convolutional or pooling layer is 6x6x256, the first operation results output by the layers before it are all adjusted to 6x6x256.
In step S1015, the second operation result having the same dimension is input to the full convolution layer, and the third operation result after the dimension reduction processing is output.
In this embodiment, the second operation result with the same dimension is obtained and output to a full convolution layer, the convolution kernel of the full convolution layer is 1 × 1, and then the dimension reduction processing is performed on the second operation result, the third operation result after the dimension reduction processing is output, and the third operation result obtained after the dimension reduction processing is input to the feature fusion full link layer as the first operation result, and the steps S1012 and S1013 are continued.
Further, please refer to fig. 4, which is a schematic structural diagram of an embodiment of the multi-task learning deep network of the present invention. As shown in fig. 4, the multi-task learning deep network (within the dashed box) of this embodiment includes a plurality of cascaded convolutional and pooling layers; in this embodiment, each pooling layer is also followed by regularization. The cascade order is defined as the first convolutional layer (conv1), the first pooling layer (pool1), the second convolutional layer (conv2), the second pooling layer (pool2), and so on; this embodiment takes a cascade up to the fifth pooling layer (pool5) as an example. The training set is input into the cascaded convolutional and pooling layers, and corresponding first, second, and third pooling operation results are output from the first, third, and fifth pooling layers, respectively. In this embodiment, the pooling kernels of the first, third, and fifth pooling layers are all 3x3, and the sizes of the first, second, and third pooling operation results are 27x27x96, 13x13x384, and 6x6x256, respectively. The size (6x6x256) of the third pooling operation result output by the fifth pooling layer is used as the standard for size adjustment: the first pooling operation result of size 27x27x96 is input into a sub-convolutional layer (conv1a) with a 4x4 convolution kernel, and the second pooling operation result of size 13x13x384 is input into a sub-convolutional layer (conv3a) with a 2x2 convolution kernel, so that both are adjusted to 6x6x256; the adjusted operation results with the same dimensions serve as the second operation results. Still referring to fig. 4, the second operation results are input into a full convolutional layer (conv_all) with a 1x1 convolution kernel for dimension reduction, yielding a third operation result of size 6x6x192. The third operation result is then input into the feature fusion full link layer (fc_full) of 3072-dimensional feature vectors, which is linked to the full link layer corresponding to each task (the feature point positioning task, the feature point visibility prediction task, the face detection task, and the gender identification task); the full link layer corresponding to each task has dimension 512 and is used for learning and training that task.
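The fusion structure just described can be sketched in PyTorch using the stated sizes (first, third, and fifth pooling outputs of 27x27x96, 13x13x384, and 6x6x256, here called pool1, pool3, pool5); the backbone producing them is assumed, and the module and head names are illustrative:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1a = nn.Conv2d(96, 256, kernel_size=4, stride=4)   # 27x27x96 -> 6x6x256
        self.conv3a = nn.Conv2d(384, 256, kernel_size=2, stride=2)  # 13x13x384 -> 6x6x256
        self.conv_all = nn.Conv2d(256 * 3, 192, kernel_size=1)      # 1x1 dimension reduction
        self.relu = nn.ReLU(inplace=True)                           # ReLU after every conv/fc
        self.fc_full = nn.Linear(6 * 6 * 192, 3072)                 # feature fusion full link layer
        # one 512-dim full link layer per task, then the task outputs (logits)
        self.fc_det = nn.Linear(3072, 512); self.det = nn.Linear(512, 2)
        self.fc_lmk = nn.Linear(3072, 512); self.lmk = nn.Linear(512, 42)  # 21 points x (a, b)
        self.fc_vis = nn.Linear(3072, 512); self.vis = nn.Linear(512, 21)
        self.fc_gen = nn.Linear(3072, 512); self.gen = nn.Linear(512, 2)

    def forward(self, pool1, pool3, pool5):
        fused = torch.cat([self.relu(self.conv1a(pool1)),
                           self.relu(self.conv3a(pool3)),
                           pool5], dim=1)                           # second operation results
        x = self.relu(self.conv_all(fused)).flatten(1)              # third operation result
        x = self.relu(self.fc_full(x))                              # feature fusion data
        return {
            'det':    self.det(self.relu(self.fc_det(x))),
            'lmk':    self.lmk(self.relu(self.fc_lmk(x))),
            'vis':    self.vis(self.relu(self.fc_vis(x))),
            'gender': self.gen(self.relu(self.fc_gen(x))),
        }
```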
According to the invention, the multi-task learning deep network learns the feature point positioning task, the feature point visibility prediction task, the face detection task, and the gender identification task together. On the one hand, adding the feature point visibility prediction, face detection, and gender identification tasks, which are related to feature point positioning, improves the feature point positioning accuracy while also allowing the other tasks to be executed. On the other hand, the multi-task learning deep network of this embodiment adopts a feature fusion technique to fuse the feature maps output by several convolutional layers and/or pooling layers, thereby obtaining the sufficient data information required by the feature point positioning task. The multi-task learning deep network is highly robust to complex variations in images, such as pose transformation, extreme illumination, exaggerated expression, and partial occlusion, achieving high precision and good performance.
Further, the training method of the multi-task learning deep network of this embodiment adds a non-linear activation function after every convolutional layer and full link layer; this embodiment takes the Rectified Linear Unit (ReLU) activation function as an example. In addition, the multi-task learning deep network of this embodiment does not add any pooling operation inside the fusion network, because features extracted by pooling are scale-invariant with respect to local information, a property that is not wanted by the feature point positioning task.
Further, referring to fig. 5, fig. 5 is a flowchart illustrating a training method for a deep multitask learning network according to a second embodiment of the present invention, which is obtained by improving the training method shown in fig. 1 to 3 based on the first embodiment, and a structure of the deep multitask learning network is shown in fig. 4. As shown in fig. 5, the present embodiment may further include the following steps before step S101:
in step S104, the AlexNet network is used to train the face detection task, and the weight corresponding to the face detection task is obtained.
In this embodiment, before training the multitask learning deep network, the network needs to be initialized, and the weight used for initialization is obtained by performing a face detection task on an existing AlexNet network. Among them, AlexNet network is a neural network structural model proposed in 2012.
In step S105, the multitask learning deep network is initialized with the weights.
In this embodiment, the multitask learning deep network provided by the present invention may be initialized according to the weight obtained in step S104.
When a deep network is trained from random initial values, the hidden-layer neurons may be in a saturated state; a slight adjustment of the weights then produces only a very slight change in the activation values of the hidden-layer neurons, this slight change barely affects the remaining neurons of the network, and the cost function changes correspondingly little. The final result is that the weights are learned very slowly when the network runs the gradient descent algorithm. Initializing the network with pretrained weights changes the distribution of the weights and thereby improves the network.
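A minimal sketch of this initialization, assuming the face-detection-trained weights are available as an AlexNet state dict (here torchvision's ImageNet AlexNet stands in, and the matching-by-name-and-shape rule is an assumption):

```python
import torchvision

def init_from_alexnet(net):
    # Stand-in source weights; per steps S104-S105 these would come from an
    # AlexNet trained on the face detection task.
    source = torchvision.models.alexnet(weights="IMAGENET1K_V1").state_dict()
    own = net.state_dict()
    compatible = {k: v for k, v in source.items()
                  if k in own and own[k].shape == v.shape}
    own.update(compatible)                 # reuse layers whose shapes match
    net.load_state_dict(own)
```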
Further, referring to fig. 6, fig. 6 is a schematic flowchart of a training method for a deep multitask learning network according to a third embodiment of the present invention, which is obtained by performing an improvement on the basis of the first embodiment of the training method shown in fig. 1 to fig. 3. As shown in fig. 6, the present embodiment may further include the following steps before step S101:
in step S106, a predicted face region of the images in the training set is calculated.
In this embodiment, before the training set is input into the multitask learning deep network, a predicted face region is calculated for the images in the training set through the RCNN network. The algorithm adopted by the embodiment for calculating the predicted face area is a selective search algorithm.
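For illustration, region proposals of this kind can be generated with the selective search implementation in opencv-contrib; a sketch, with the cap on the number of proposals being an assumption:

```python
import cv2

def predicted_regions(image_bgr, max_regions=2000):
    # Selective search proposals standing in for the R-CNN-style region step.
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()
    return ss.process()[:max_regions]      # array of (x, y, w, h) proposals
```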
This embodiment may also be combined with the second embodiment of the training method of the multitask learning deep network shown in fig. 5, and it should be noted that step S106 and step S104 and step S105 have no necessary precedence relationship.
Further, referring to fig. 7, based on the third embodiment of the training method of the deep multitask learning network shown in fig. 6, the step S101 of inputting the training set into the deep multitask learning network for the multitask learning may further include the following steps:
in step S1016, the training set is input into the multitask learning depth network, and the predicted face region is compared with the labeled face region labeled on the image in the training set, so as to obtain a comparison result.
The training set is input into the multi-task learning deep network. As described for the training set, the images it contains have been manually annotated, and the manually annotated face regions serve as the labeled face regions. In this embodiment, after the training set is input into the multi-task learning deep network, when each task of the multi-task learning is learned, the predicted face regions calculated in step S106 need to be compared with the labeled face regions to obtain comparison results, and predicted face regions that meet the preset condition corresponding to each task are screened out according to the comparison results.
In this embodiment, the comparison result is an overlapping degree between the predicted face region and the marked face region, and the overlapping degree can reflect a matching degree between the predicted face region and the marked face region.
In step S1017, a predicted face region satisfying a preset condition is selected as a detected face region according to the comparison result.
The calculated overlap degree between each predicted face region and the corresponding labeled face region can be obtained through step S1016, and this embodiment sets a corresponding preset condition for each task in the multi-task learning, that is, the full link layer corresponding to each task only performs corresponding task learning on the predicted face region meeting the preset condition.
In the embodiment, a face region meeting preset conditions in the predicted face region is used as a detected face region, and as the preset conditions corresponding to each task may be different, the corresponding detected face regions screened by each task may be different.
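The patent does not spell out how the overlap degree is computed; the sketch below assumes the usual intersection-over-union (IoU), with boxes as corner coordinates and the per-task thresholds taken from the examples that follow:

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); returns intersection-over-union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def select_for_task(pred_boxes, gt_box, threshold):
    # keep only predicted regions whose overlap with the labeled region
    # exceeds the task's preset condition
    return [b for b in pred_boxes if iou(b, gt_box) > threshold]
```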
In step S1018, multitask learning is performed on the detected face region.
After the detected face region corresponding to each task is obtained in step S1017, the full link layer corresponding to each task may perform corresponding task learning on the corresponding detected face region that is screened out.
It is to be understood that in this embodiment, the steps shown in fig. 2 and fig. 3 that precede the learning of each task at its corresponding full link layer are performed within step S101.
The training of each task in the multi-task learning of the multi-task learning deep network of the invention is exemplified as follows:
for the face detection task, the preset condition corresponding to the face detection task is that the overlap degree between the predicted face region and the labeled face region is greater than 0.5, or the overlap degree between the predicted face region and the labeled face region is less than 0.35, in other words, in this embodiment, the face detection task is performed on the predicted face region whose overlap degree with the labeled face region is greater than 0.5 or the overlap degree with the labeled face region is less than 0.35. The detected face area with the overlapping degree of the marked face area more than 0.5 is taken as a positive sample, the detected face area with the overlapping degree of the marked face area less than 0.35 is taken as a negative sample, and the formula is as follows:
lossD=-(1-l)·log(1-p)-l·log(p);
where lossD is the loss function; in this embodiment, lossD is computed on a softmax output. For a positive sample, the value of l is 1; for a negative sample, the value of l is 0. p represents the probability that the detected face region belongs to a face. In this embodiment, a face probability threshold can be set and the calculated p value compared with it: a detected face region whose p value is greater than and/or equal to the face probability threshold is considered a face, and one whose p value is less than the threshold is considered a non-face, so that the face detection task is learned.
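A sketch of lossD under these conventions (assuming a 2-way softmax detection head; the clamping is added only for numerical stability):

```python
import torch
import torch.nn.functional as F

def detection_loss(det_logits, l):
    # p: probability that the detected face region belongs to a face
    p = F.softmax(det_logits, dim=1)[:, 1].clamp(1e-7, 1 - 1e-7)
    # lossD = -(1-l)*log(1-p) - l*log(p), averaged over the batch
    return -((1 - l) * torch.log(1 - p) + l * torch.log(p)).mean()
```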
For the feature point localization task, this embodiment uses the 21 facial feature points already labeled in the AFLW data set. The preset condition corresponding to the feature point positioning task is that the overlap between the predicted face region and the labeled face region is greater than 0.35; that is, a predicted face region whose overlap with the labeled face region is greater than 0.35 is used as a detected face region for learning the feature point positioning task. A detected face region is expressed as {x, y, w, h}, where (x, y) are the coordinates of the center of the detected face region and w and h are its width and height. Each feature point is expressed as an offset relative to the center (x, y) of the detected face region, and the coordinates of the feature points are normalized by (w, h):

ai = (xi − x)/w,  bi = (yi − y)/h;

where (xi, yi) are the coordinates of a facial feature point, and (ai, bi) are the relative values of the feature point coordinates after normalization.
In this embodiment, the coordinates of invisible feature points are set to (0, 0); for the visible feature points, a predetermined loss function is used to learn the feature point positioning task, with the following formula:

lossL = (1/(2N)) · Σ(i=1..N) vi · [(âi − ai)² + (b̂i − bi)²];

where lossL is the loss function, which in this embodiment is a Euclidean loss; N is the number of feature points (21 in the AFLW data set); (âi, b̂i) are the normalized relative coordinates of the corresponding predicted feature points; and vi is the visibility factor of a feature point. If vi equals 1, the feature point is visible in the detected face region; if vi equals 0, the feature point is invisible in the detected face region, and in this embodiment invisible feature points do not participate in back propagation.
The final coordinate values of the feature points are calculated from the predicted normalized relative coordinates together with the number of feature points and the center coordinates, width, and height of the detected face region.
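The normalization and its inverse can be sketched as plain helpers (with the region given in the {x, y, w, h} center/width/height convention above):

```python
def normalize_points(points, region):
    # absolute (xi, yi) -> relative (ai, bi) offsets from the region center
    x, y, w, h = region
    return [((xi - x) / w, (yi - y) / h) for xi, yi in points]

def denormalize_points(rel_points, region):
    # relative (ai, bi) -> absolute (xi, yi) image coordinates
    x, y, w, h = region
    return [(ai * w + x, bi * h + y) for ai, bi in rel_points]
```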
For feature point visibility, this embodiment predicts whether a feature point is visible by learning the visibility factor of the feature point. The preset condition corresponding to the feature point visibility prediction task is that the overlap between the predicted face region and the labeled face region is greater than 0.35; that is, a predicted face region whose overlap with the labeled face region is greater than 0.35 is used as a detected face region for learning the feature point visibility prediction task. The formula is as follows:

lossV = (1/N) · Σ(i=1..N) (v̂i − vi)²;

where lossV is the loss function, which in this embodiment is a Euclidean loss, and N is the number of feature points (21 in the AFLW data set). If a feature point is visible, its visibility factor vi is 1; if it is not visible, the visibility factor is 0. The network thereby calculates the predicted visibility value v̂i of each feature point.
For the gender identification task, in this embodiment the preset condition is that the overlap between the predicted face region and the labeled face region is greater than 0.5; that is, a predicted face region whose overlap with the labeled face region is greater than 0.5 is used as a detected face region, and the gender identification task is learned using the following formula:

lossG = −(1−g)·log(p0) − g·log(p1);

where lossG is the loss function, for which this embodiment can adopt the cross-entropy loss; (p0, p1) is the two-dimensional probability vector calculated by the network; and g is 0 if the gender is male and 1 if the gender is female.
Further, the global loss function of the multi-task learning deep network of this embodiment is the weighted sum of the individual loss values of the tasks, calculated as:

lossall = Σt λt · losst;

where losst is the loss value of the corresponding t-th task, and the weighting parameter λt is determined by the importance of each task in the total loss. In this embodiment, λD = 1, λL = 5, λV = 0.5, and λG = 2, where D, L, V, and G denote the face detection task, the facial feature point positioning task, the feature point visibility prediction task, and the gender identification task, respectively.
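Putting the four losses together under the stated weights, a sketch (the tensor shapes are assumptions: landmarks as (batch, N, 2) after reshaping, visibility as (batch, N); the detection term uses 2-class cross-entropy, which equals lossD above):

```python
import torch
import torch.nn.functional as F

LAMBDAS = {'D': 1.0, 'L': 5.0, 'V': 0.5, 'G': 2.0}

def total_loss(preds, labels):
    v = labels['vis'].float()                                # visibility factors, 0/1
    n = v.shape[1]                                           # N = 21 for AFLW
    lmk = preds['lmk'].view(v.shape[0], n, 2)                # (batch, N, 2)
    diff = (lmk - labels['lmk']) ** 2
    # invisible points (vi = 0) are masked out and do not back-propagate
    loss_l = (v.unsqueeze(-1) * diff).sum(dim=(1, 2)).div(2 * n).mean()
    loss_v = ((preds['vis'] - v) ** 2).sum(dim=1).div(n).mean()
    loss_d = F.cross_entropy(preds['det'], labels['face'])   # equivalent to lossD
    loss_g = F.cross_entropy(preds['gender'], labels['gender'])
    return (LAMBDAS['D'] * loss_d + LAMBDAS['L'] * loss_l +
            LAMBDAS['V'] * loss_v + LAMBDAS['G'] * loss_g)
```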
It is understood that each task is learned in its corresponding full link layer, and the loss function linked to the full link layer of each task participates only in the learning of that task.
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a training apparatus for a deep multitask learning network according to an embodiment of the present invention. As shown in fig. 8, the training apparatus 100 of the multitask learning deep network of the embodiment includes a memory 101 and a processor 102 which are connected to each other, wherein the memory 101 stores the multitask learning deep network which has been constructed and the corresponding program data, and furthermore, the memory 101 may also store a training set for training the multitask learning deep network. The processor 102 is configured to execute any of the first to third embodiments of the training methods for a deep multitask learning network shown in fig. 1 to 7 according to the program data, and complete training of the deep multitask learning network.
Further, as shown in fig. 9, in another embodiment, the training device 200 may further include a communication circuit 103 connected to the memory 101 and/or the processor 102 via a bus, and the communication circuit 103 is configured to obtain the training set and input the training set to the processor, in which case the training set may not be stored in the memory 101.
Furthermore, the invention also provides a testing method of the multitask learning deep network. Referring to fig. 10, fig. 10 is a flowchart illustrating a testing method of a multi-task learning deep network according to a first embodiment of the present invention. As shown in fig. 10, the testing method of the multitask learning deep network of the embodiment at least includes the following steps:
in step S201, the image to be detected is input to the two-stage cascade convolutional neural network, and the first face area to be detected included in the image to be detected is output.
In this embodiment, the image to be tested may be an image from the training set that was not used for training the multi-task learning deep network, or an image from another data set; for example, when the AFLW data set described above is used for training, 25000 of its images are used for training and the 993 images not used for training serve as images to be tested.
Further, in the embodiment, before the image to be detected is input into the trained multitask learning depth network, the input image to be detected is processed through the two-stage cascaded convolutional neural network, so that a first face region to be detected included in the image to be detected is obtained. It should be noted that the first face region to be detected is obtained from the two-stage cascaded convolutional neural network during the testing process of the multi-task learning deep network, and is different from the predicted face region calculated by the multi-task learning deep network during the training process in step S106 shown in fig. 6.
In step S202, the first face region to be detected is input into the multitask learning depth network, a second face region to be detected satisfying a preset condition is selected from the first face region to be detected, and detection results of face detection, feature point positioning, feature point visibility prediction, and gender identification for the second face region to be detected are output.
Inputting the image to be detected of the first face area to be detected into the trained multi-task learning depth network, enabling the multi-task learning depth network to select a second face area to be detected meeting preset conditions from the first face area to be detected, further testing each task in the multi-task on the second face area to be detected, and finally outputting detection results of face detection, feature point positioning, feature point visibility prediction and gender identification.
Further, in this embodiment, after receiving the first face regions to be tested obtained by the two-stage cascaded convolutional neural network, the multi-task learning deep network computes a detection score for each first face region to be tested and screens out the second face regions to be tested according to the detection scores. The screening compares the detection score of each first face region to be tested with a preset score threshold, keeps the first face regions whose detection scores are greater than the preset score threshold, and uses them as the second face regions to be tested that are input into the multi-task learning deep network. The preset score threshold may be adjusted according to actual requirements; in this embodiment, it may be 0.4, 0.5, or 0.6.
In this embodiment, after an image to be tested carrying the first face regions to be tested is input into the trained multi-task learning deep network, each task of the multi-task learning is tested on the second face regions to be tested. What the multi-task learning deep network executes is similar to what it executes during training; for the specifics of inputting the image to be tested with the first face regions into the trained network for multi-task detection, refer to any of the first to third embodiments of the training method shown in fig. 1 to fig. 6.
Further, when testing the feature point positioning and visibility prediction tasks, the coordinates of the feature points need to be transformed back into coordinates in the original image. The transformation formula used is:

xi = âi·w + x,  yi = b̂i·h + y;

where (âi, b̂i) is the predicted relative position of the i-th feature point, and {x, y, w, h} describes the detected face region as above.
In this embodiment, when the trained multi-task learning deep network is tested, a two-stage cascaded convolutional neural network is added before the multi-task learning deep network. The face detection regions of the input image to be tested are determined by the two-stage cascaded convolutional neural network, yielding the first face regions to be tested, so that the multi-task learning deep network can perform more accurate multi-task detection based on the first face regions, improving the detection accuracy of each task in the multi-task learning. The multi-task learning deep network is highly robust to complex variations in images, such as pose transformation, extreme illumination, exaggerated expression, and partial occlusion, achieving high precision and good performance.
Further, referring to fig. 11, fig. 11 is a flowchart illustrating an embodiment of step S201 in fig. 10. As shown in fig. 11, step S201 may include the steps of:
in step S2011, the image to be detected is input to the first-stage neural network of the two-stage cascaded convolutional neural network, and a plurality of candidate detection windows respectively labeled as a face region and a non-face region are output.
In this embodiment, this step is performed by the first-stage neural network of the two-stage cascaded convolutional neural network. The image to be tested is input into the first-stage neural network, which comprises a plurality of cascaded convolutional and pooling layers; each convolutional and pooling layer progressively performs its corresponding operation on the image. The outputs are finally divided into two classes and labeled accordingly, i.e., a number of candidate detection windows labeled as face regions and as non-face regions are output, and these candidate detection windows are input into the second-stage neural network for subsequent processing.
Referring to fig. 12, the first-stage neural network of this embodiment includes a plurality of cascaded convolutional and pooling layers, which may include: a first-layer convolutional layer (conv1), a second-layer pooling layer (pool1), a third-layer convolutional layer (conv2), and a fourth-layer convolutional layer (conv3). The convolution kernel size of the first convolutional layer is 3x3. Compared with other classification and multi-target detection tasks, determining face candidate regions is in essence a challenging binary classification task, so each layer may need fewer convolution kernels; using 3x3 convolution kernels therefore reduces the amount of computation while increasing the depth of the neural network, further improving its performance. The pooling kernel of the second-layer pooling layer is 2x2, with a max pooling operation. The convolution kernel size of the third-layer convolutional layer is 3x3. The convolution kernel size of the fourth-layer convolutional layer is 1x1; setting the kernel size to 1x1 lets the neural network perform cross-channel information interaction and integration, and allows dimension reduction and/or expansion of the number of channels.
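A sketch of this first-stage network; the kernel sizes follow the text, while the channel counts (and the fully convolutional face/non-face output) are assumptions:

```python
import torch.nn as nn

stage1 = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3), nn.ReLU(inplace=True),   # layer 1: conv1, 3x3 kernel
    nn.MaxPool2d(kernel_size=2),                              # layer 2: pool1, 2x2 max pooling
    nn.Conv2d(10, 16, kernel_size=3), nn.ReLU(inplace=True),  # layer 3: conv2, 3x3 kernel
    nn.Conv2d(16, 2, kernel_size=1),                          # layer 4: conv3, 1x1 kernel
)                                                             # 2 channels: face / non-face scores
```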
In other embodiments, the first stage neural network of the two-stage cascaded convolutional neural network may output bounding box regression vectors while outputting a number of candidate detection windows labeled as face regions and non-face regions.
In step S2012, the candidate detection windows are input to a second-level neural network in the second-level cascaded convolutional neural network, the candidate detection windows marked as non-face regions are discarded by the second-level neural network, the candidate detection windows marked as face regions are subjected to bounding box regression processing, a first candidate face region after the bounding box regression processing is output, and the first candidate face region is used as a first face region to be detected.
In this embodiment, this step is performed by the second-stage neural network of the two-stage cascaded convolutional neural network. The candidate detection windows obtained in step S2011, labeled as face regions and non-face regions, are input into the second-stage neural network, which discards the candidate detection windows labeled as non-face regions and retains those labeled as face regions. Bounding box regression processing is then performed on the retained candidate detection windows to obtain the first candidate face regions, which are input into the multi-task learning deep network as the first face regions to be tested, so as to test the multi-task learning deep network. In this embodiment, each output first candidate face region includes the position information of the region in the image.
It can be understood that the marked face regions obtained in the first-stage neural network may include several, even dozens or more, regions for the same face. In the second-stage neural network, bounding box regression is then performed on these multiple regions of the same face, which reduces their number and improves the matching precision between the obtained face regions and the face in the image; the bounding box regression vectors output by the first-stage neural network can be used in this bounding box regression processing.
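When the regression vectors are applied to refine the candidate windows, the refinement can be sketched as below; the convention that offsets are normalized by window width and height is an assumption, as the embodiment does not specify it:

```python
import numpy as np

def apply_box_regression(boxes, reg):
    """Refine candidate windows with predicted regression vectors.
    boxes: (N, 4) array of (x1, y1, x2, y2); reg: (N, 4) offsets
    assumed to be normalized by window width and height."""
    w = (boxes[:, 2] - boxes[:, 0])[:, None]
    h = (boxes[:, 3] - boxes[:, 1])[:, None]
    return boxes + reg * np.hstack([w, h, w, h])
```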
Referring to fig. 13, the second-stage neural network may also include several cascaded convolutional layers and pooling layers, such as a first convolutional layer, a second pooling layer, a third convolutional layer, a fourth pooling layer, a fifth convolutional layer and a fully connected layer. The convolution kernels of the first and third convolutional layers have a size of 3×3; the convolution kernel of the fifth convolutional layer has a size of 2×2; the pooling kernels of the second and fourth pooling layers have a size of 3×3 and use the maximum pooling operation; the fully connected layer outputs a 128-dimensional feature vector.
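Similarly, a hedged sketch of the second-stage network: the kernel sizes, pooling sizes and the 128-dimensional fully connected layer follow the description above, while the channel counts and input size are assumptions:

```python
import torch
import torch.nn as nn

class SecondStageNet(nn.Module):
    """Sketch of the second-stage network: conv 3x3 -> max-pool 3x3 ->
    conv 3x3 -> max-pool 3x3 -> conv 2x2 -> fully connected 128-d.
    Channel counts (28, 48, 64) are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 28, kernel_size=3), nn.ReLU(),   # first convolutional layer, 3x3
            nn.MaxPool2d(kernel_size=3, stride=2),        # second pooling layer, 3x3 max pooling
            nn.Conv2d(28, 48, kernel_size=3), nn.ReLU(),  # third convolutional layer, 3x3
            nn.MaxPool2d(kernel_size=3, stride=2),        # fourth pooling layer, 3x3 max pooling
            nn.Conv2d(48, 64, kernel_size=2), nn.ReLU(),  # fifth convolutional layer, 2x2
        )
        self.fc = nn.LazyLinear(128)                      # fully connected 128-d feature vector
        self.cls = nn.Linear(128, 2)                      # face / non-face
        self.box = nn.Linear(128, 4)                      # bounding box regression

    def forward(self, x):
        x = self.features(x).flatten(1)
        x = torch.relu(self.fc(x))
        return self.cls(x), self.box(x)
```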
As shown in figs. 12 and 13, since the first-stage neural network and the second-stage neural network take input images of different sizes, the size of the image to be measured and the size of the first candidate face region need to be adjusted respectively before they are input into the first-stage and second-stage neural networks.
Further, referring to fig. 14, fig. 14 is a flowchart illustrating a testing method of a multi-task learning deep network according to a second embodiment of the present invention. As shown in fig. 14, in this embodiment, the following steps may further be included after step S201 of fig. 10:
in step S203, the first face regions to be detected whose mutual overlap is higher than a preset overlap degree are merged to obtain the final face region to be detected.
It can be understood that, among the first face regions to be detected obtained from the two-stage cascaded convolutional neural network, several, tens or even more regions may be obtained for the same face. In this embodiment, first face regions to be detected with a high degree of mutual overlap can therefore be considered to originate from the same face, so merging them reduces the number of first face regions to be detected and improves the detection accuracy.
Further, in this embodiment, the first face regions to be detected obtained by the two-stage cascaded convolutional neural network are compared with one another to obtain their mutual degrees of overlap, and two or more first face regions to be detected whose overlap is higher than the preset overlap degree are merged to obtain the final face region to be detected. The obtained final face region to be detected is then input into the multi-task learning deep network for the subsequent testing steps.
In this embodiment, the first face regions to be detected may be merged through a non-maximum suppression (NMS) algorithm: the region with the highest score is selected from the first face regions to be detected, all other overlapping regions whose overlap with it exceeds a specific threshold are discarded, and the selected region is scaled to a preset size, which is 227×227 in this embodiment. In addition, the preset overlap degree can be adjusted according to actual needs.
In another embodiment, step S203 may also be executed after step S2012; that is, after the first candidate face regions subjected to bounding box regression processing are obtained in step S2012, the first candidate face regions are merged by the non-maximum suppression (NMS) algorithm. The processing flow for obtaining the final face region to be detected through NMS in this embodiment is as follows:
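A sketch of the greedy NMS procedure described above, assuming axis-aligned candidate windows with confidence scores:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression sketch.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of the retained face regions."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]                        # select the highest-scoring region
        keep.append(i)
        # intersection of the selected region with the remaining regions
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # hard threshold: discard regions overlapping more than the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```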
In the above-described processing flow, NMS resets the score $s_i$ of each adjacent first candidate face region with a hard threshold in order to determine whether that region can be retained:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases}$$

where $M$ is the currently selected highest-scoring region, $b_i$ an adjacent candidate region, and $N_t$ the overlap threshold; regions whose score is reset to zero are discarded. The merged final face region to be detected is finally obtained.
Further, referring to fig. 15, fig. 15 is a flowchart illustrating a third embodiment of the testing method of the multi-task learning deep network according to the present invention. As shown in fig. 15, in this embodiment, the following steps may further be included after step S201 of fig. 10:
in step S204, the size of the first face region to be detected is adjusted to a preset face region size allowed by the multi-task learning deep network.
Since the multi-task learning deep network has a requirement on the size of the input face region to be detected, in this embodiment, after the first face region to be detected is obtained from the two-stage cascaded convolutional neural network, its size is adjusted to the preset face region size allowed by the multi-task learning deep network. From this point on, the first face region to be detected refers to the resized region.
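As a minimal sketch, assuming OpenCV, the size adjustment can be performed as follows; the 227×227 value is the preset size mentioned in the NMS step above:

```python
import cv2

def resize_face_region(region, size=227):
    """Resize a cropped face region to the preset input size
    allowed by the multi-task learning deep network (227x227 here)."""
    return cv2.resize(region, (size, size), interpolation=cv2.INTER_LINEAR)
```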
Further, step S204 may be executed after step S203; that is, the size of the merged final face region to be detected is adjusted to the preset face region size allowed by the multi-task learning deep network.
Further, referring to fig. 16, fig. 16 is a flowchart illustrating a fourth embodiment of the testing method of the multi-task learning deep network according to the present invention. As shown in fig. 16, in this embodiment, the following step may also be included before step S201 of fig. 10:
in step S205, the size of the image to be measured is adjusted to the input image size allowed by the two-stage cascaded convolutional neural network.
In this embodiment, before the image to be measured is input into the first-stage neural network of the two-stage cascaded convolutional neural network, the image to be measured is scaled to a series of different sizes, with an initial scaling of

$$scale = \frac{12}{S}$$

where $S$ is the minimum size of the first face region to be detected, and 12 is the minimum size that can be received by the first-level neural network. In this embodiment, the processing flow of the image to be measured may be as follows:
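A sketch of this flow as the usual image pyramid, using the initial scaling above; the per-level scaling factor of 0.709 is an illustrative assumption, as the embodiment does not state it:

```python
import cv2

def build_image_pyramid(image, min_face_size, net_min_size=12, factor=0.709):
    """Scale the image to a series of sizes for the first-stage network.
    The initial scale is net_min_size / min_face_size, per the formula
    above; the per-level factor 0.709 is an illustrative assumption."""
    scales = []
    scale = net_min_size / float(min_face_size)   # initial scaling: 12 / S
    min_side = min(image.shape[:2]) * scale
    while min_side >= net_min_size:               # stop once the image falls below 12 px
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return [cv2.resize(image, None, fx=s, fy=s) for s in scales]
```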
In the above networks, the loss function is divided into two parts, relating to face classification and regression of the face region, respectively. A cross-entropy loss function is used to classify each sample $x_i$ as a face or non-face region:

$$L_i^{det} = -\left(y_i^{det}\log(p_i) + (1-y_i^{det})\log(1-p_i)\right)$$

where $y_i^{det} \in \{0,1\}$ represents the actual (ground-truth) label of sample $x_i$, and $p_i$ represents the probability, predicted by the network, that sample $x_i$ is a face.
Bounding box regression processing is performed with a square loss function; the regression loss is in fact the Euclidean distance between prediction and ground truth:

$$L_i^{box} = \left\|\hat{y}_i^{box} - y_i^{box}\right\|_2^2$$

where $\hat{y}_i^{box}$ represents the coordinates obtained by network prediction and $y_i^{box}$ represents the actual ground-truth coordinates. $y^{box}$ is a four-tuple formed by the abscissa of the upper-left corner, the ordinate of the upper-left corner, the length and the width.
The multi-task learning deep network of this embodiment can actually be considered a three-level network, and its loss function comprises two parts, face classification and bounding box regression. A loss function therefore needs to be trained for each of the two parts, and the two losses are combined with different weights to form the final objective function of this embodiment:

$$\min \sum_{i=1}^{N}\sum_{j\in\{det,\,box\}} \alpha_j\,\beta_i^{j}\,L_i^{j}$$

The whole training process is essentially the minimization of the above function, where $\alpha_j$ represents the importance of the corresponding task, $N$ represents the number of training samples, and $\beta_i^{j} \in \{0,1\}$ is the sample-type indicator that marks whether sample $i$ contributes to task $j$. In the first-stage and second-stage neural networks, $\alpha_{det}=1$ and $\alpha_{box}=0.5$.
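A minimal PyTorch sketch of this objective; the det/box masks realize the $\beta$ indicators, and representing them as explicit 0/1 tensors is an assumption of the sketch:

```python
import torch
import torch.nn.functional as F

def multitask_loss(p_face, y_det, box_pred, box_true, det_mask, box_mask,
                   alpha_det=1.0, alpha_box=0.5):
    """Weighted sum of the face classification and bounding box losses.
    det_mask / box_mask play the role of the beta indicators: 1 when a
    sample contributes to the corresponding task, 0 otherwise.
    p_face must already be a probability in [0, 1]."""
    # cross-entropy loss for face / non-face classification, per sample
    l_det = F.binary_cross_entropy(p_face, y_det, reduction="none")
    # square (Euclidean) loss for bounding box regression, per sample
    l_box = ((box_pred - box_true) ** 2).sum(dim=1)
    # final objective: sum over samples and tasks of alpha_j * beta_i^j * L_i^j
    return (alpha_det * (det_mask * l_det).sum()
            + alpha_box * (box_mask * l_box).sum())
```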
Further, the detection accuracy of the multi-task learning deep network of the present invention for face detection, feature point positioning, feature point visibility prediction, and gender identification will be described:
The face detection accuracy in this embodiment is mainly evaluated on the Face Detection Data Set and Benchmark (FDDB). The FDDB database, offered by the University of Massachusetts, consists of 2845 images with 5171 labeled faces and provides a uniform evaluation code for fairness. According to the test results, when the number of false detections is 100, the test accuracy of the multi-task learning deep network of this embodiment reaches 86.61%, only slightly lower than the best reported accuracy of 88.53% (obtained by the Deep Pyramid Deformable Parts Model for Face Detection, DP2MFD). As the number of false detections increases, the face detection accuracy of the multi-task learning deep network increases correspondingly, reaching 90.1% at 250 false detections. The FDDB data set is very challenging for face detection in a multi-task learning deep network, because it contains many small and blurry faces, and adjusting the image to the 227×227 input size distorts such faces, reducing the detection score. Despite these problems, the multi-task learning deep network of this embodiment still achieves a good testing result.
The feature point localization performance of the multi-task learning deep network of the present embodiment is evaluated on the AFLW data set, which here consists of 1000 pictures containing 1132 face samples. A face region to be detected is used for the feature point test only when its overlap with the ground truth is greater than a preset threshold (which may be set to 0.5 in this embodiment), and the average position error of the predicted feature points corresponding to that face region is calculated. A subset of 450 samples was randomly created from the AFLW dataset and divided by deflection angle into three groups, [0°, 30°], [30°, 60°] and [60°, 90°], each accounting for 1/3. The normalized mean error is used to evaluate the positioning accuracy; since the method of the present invention involves the visibility of feature points, the normalized mean error of the estimated feature points is:

$$NME = \frac{1}{N_t}\sum_{i=1}^{N_t}\frac{1}{d_i\,|v_i|_1}\sum_{j} v_i(j)\,\left\|\hat{U}_i(j) - U_i(j)\right\|_2$$

where $U_i$ represents the actual feature point coordinates, $v_i$ is the corresponding feature point visibility, $\hat{U}_i$ is the predicted feature point coordinates, and $N_t$ represents the number of test pictures. $|v_i|_1$ is the number of visible feature points of the $i$-th picture, $U_i(j)$ is the $j$-th column of $U_i$, and $d_i$ is the square root of the face bounding box size. It is worth noting that when the face image is close to frontal, $d_i$ would in most cases be the inter-pupil distance; however, considering that the AFLW dataset contains invisible feature points, $d_i$ here uses the face bounding box size. Compared with existing methods, the testing method of the multi-task learning deep network of this embodiment still obtains better results.
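A direct transcription of the normalized mean error into Python follows; the array layout is an assumption of the sketch:

```python
import numpy as np

def normalized_mean_error(pred, truth, vis, d):
    """pred, truth: (N_t, 2, M) arrays of predicted / actual feature point
    coordinates; vis: (N_t, M) 0/1 visibility; d: (N_t,) square root of
    the face bounding box size. Implements the NME formula above."""
    n_t = pred.shape[0]
    total = 0.0
    for i in range(n_t):
        err = np.linalg.norm(pred[i] - truth[i], axis=0)  # per-point Euclidean error
        visible = vis[i].sum()                            # |v_i|_1, number of visible points
        total += (vis[i] * err).sum() / (d[i] * visible)
    return total / n_t
```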
Gender identification is assessed on the CelebA and LFWA datasets, both of which contain gender labels. The CelebA and LFWA datasets contain labeled images selected from the CelebFaces and LFW datasets, respectively. The CelebA dataset contains 10000 identities, for a total of about 200,000 images. The LFWA dataset contains 5327 identities, for a total of 13233 images. The multi-task learning deep network of this embodiment achieves 97% accuracy on the CelebA dataset and 93% accuracy on the LFWA dataset.
Further, referring to fig. 17, fig. 17 is a schematic structural diagram of a testing apparatus of a multi-task learning deep network according to an embodiment of the present invention. As shown in fig. 17, the testing apparatus 300 of this embodiment may include at least a memory 301, a communication circuit 303 and a processor 302 connected to each other. The memory 301 stores the two-stage cascaded convolutional neural network, the multi-task learning deep network and program data; the communication circuit 303 is used for acquiring the image to be detected; the processor 302 is configured to execute the above-described testing method of the multi-task learning deep network according to the program data, performing face detection, feature point positioning, feature point visibility prediction and gender identification on the image to be detected by using the two-stage cascaded convolutional neural network and the multi-task learning deep network. In another embodiment, the training set may also be stored directly in the memory 301.
On the other hand, please refer to fig. 18, which is a schematic structural diagram of a storage medium according to an embodiment of the present invention. As shown in fig. 18, the storage medium 400 of this embodiment stores at least one program or instruction 401, and the program or instruction 401 is configured to execute any of the embodiments of the training method for a multi-task learning deep network shown in figs. 1 to 7 and/or any of the embodiments of the testing method for a multi-task learning deep network shown in figs. 10 to 16.
In an embodiment, the storage medium 400 may be the memory in fig. 8, fig. 9 or fig. 17. The storage medium 400 of this embodiment may be a storage chip, a hard disk, a removable hard disk, or another readable and writable storage device such as a flash drive or an optical disk; furthermore, the storage medium may also be a server or the like.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A training method of a multi-task learning deep network is characterized by comprising the following steps:
inputting a training set into a plurality of cascaded convolutional layers and pooling layers in a multi-task learning deep network step by step, and respectively outputting corresponding first operation results from the convolutional layers and the pooling layers;
inputting the first operation result into a feature fusion fully connected layer of the multi-task learning deep network, and outputting feature fusion data;
inputting the feature fusion data into a fully connected layer corresponding to each task in the multi-task learning so as to learn each task respectively, and respectively outputting a prediction result corresponding to each task;
and correcting the multi-task learning deep network by using the prediction result and the characteristic information marked by the training set.
2. The training method of claim 1, wherein the multi-task learning comprises a feature point localization task, a feature point visibility prediction task, a face detection task, and a gender identification task.
3. The training method according to claim 1, wherein the step of inputting the training set into a plurality of convolutional layers and pooling layers cascaded in the multitask learning deep network step by step, and outputting corresponding first operation results from the plurality of convolutional layers and pooling layers, respectively, comprises:
and inputting the training set into a plurality of cascaded convolutional layers and pooling layers in a multi-task learning deep network step by step, respectively outputting corresponding second operation results from a plurality of convolutional layers and/or a plurality of pooling layers in the plurality of convolutional layers and pooling layers, and taking the second operation results as the first operation results.
4. The training method according to claim 3, further comprising, after the step of respectively outputting the corresponding second operation results from a plurality of convolutional layers and/or a plurality of pooling layers among the plurality of convolutional layers and pooling layers:
and inputting the second operation results into corresponding sub convolution layers respectively, and outputting corresponding third operation results with the same dimensionality.
5. The training method according to claim 4, wherein after the step of outputting the corresponding third operation result with the same dimension, the method further comprises:
and inputting the corresponding third operation result with the same dimensionality into a full convolution layer, outputting a fourth operation result after dimensionality reduction processing, and taking the fourth operation result as the first operation result.
6. The training method according to claim 3, wherein the step of outputting the corresponding second operation result from each of the plurality of convolutional layers and pooling layers comprises:
and respectively outputting a corresponding first pooling operation result, a second pooling operation result and a third pooling operation result from a first pooling layer, a third pooling layer and a fifth pooling layer among the plurality of convolutional layers and pooling layers, and taking the first pooling operation result, the second pooling operation result and the third pooling operation result as the second operation results.
7. The training method according to claim 1, wherein before the step of inputting the training set stage by stage into a plurality of cascaded convolutional layers and pooling layers in the multitask learning deep network, further comprising:
training a face detection task by using an AlexNet network to obtain a weight corresponding to the face detection task;
initializing the multitask learning deep network with the weights.
8. The training method according to claim 1, wherein before the step of inputting the training set stage by stage into a plurality of cascaded convolutional layers and pooling layers in the multitask learning deep network, further comprising:
calculating a predicted face region of the images in the training set;
comparing the predicted face region with a marked face region marked on the images in the training set to obtain a comparison result;
selecting a face region meeting preset conditions as a detected face region according to the comparison result;
the step-by-step inputting of the training set into the plurality of cascaded convolutional layers and pooling layers in the multi-task learning deep network and outputting of a plurality of first operation results respectively comprises the following steps:
and the training set is input into a plurality of cascaded convolutional layers and pooling layers in the multi-task learning depth network step by step, and a plurality of first operation results are output based on the detected face region.
9. A training device of a multitask learning deep network is characterized by comprising a memory and a processor which are connected with each other;
the memory stores a training set, a constructed multi-task learning deep network and program data;
the processor is configured to perform the training method of any one of claims 1 to 8 according to the program data, and train the multi-task learning deep network with the training set.
10. A storage medium storing program data executable to implement the method of training a multitask learning deep network according to any one of claims 1-8.