CN113239847A - Training method, device, equipment and storage medium of face detection network



Publication number
CN113239847A
Authority
CN
China
Prior art keywords
face
image
key point
detection
tracking
Prior art date
Legal status
Pending
Application number
CN202110581131.3A
Other languages
Chinese (zh)
Inventor
邹昆
黄迪
董帅
李文生
Current Assignee
University of Electronic Science and Technology of China Zhongshan Institute
Original Assignee
University of Electronic Science and Technology of China Zhongshan Institute
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China Zhongshan Institute filed Critical University of Electronic Science and Technology of China Zhongshan Institute

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application provides a training method, apparatus, device, and storage medium for a face detection network. Based on the face detection key points of a first image, the method tracks the positions of those key points into a second image to obtain face tracking key points in the second image, which are treated by default as face labeling key points. A network loss is then generated from the face tracking key points and the face detection key points in the second image and used to train an already-trained original detection network (the preliminary detection network), so as to adjust the face detection key points in the second image until the adjusted key points no longer jitter, yielding a target detection network. This reduces, to a certain extent, the probability that the face key points of a target face jitter around the true labeled positions of the target face across consecutive frames.

Description

Training method, device, equipment and storage medium of face detection network
Technical Field
The present application relates to face detection, and in particular, to a method, an apparatus, a device, and a storage medium for training a face detection network.
Background
Existing video face detection methods are mainly based on a face detection neural network, whose processing flow is generally as follows: a face image is input into the face detection neural network to obtain the detection result output by the network, where the detection result usually includes face key points.
The existing video face detection methods have the following defect: since a person in a video can move, when the existing face detection neural network is used to detect the face of a moving person, the face key points in consecutive frames are prone to jitter, that is, the face key points of the same person in consecutive frames jitter around the true labeled positions of the corresponding facial parts.
Disclosure of Invention
In view of the above, it is necessary to provide a training method, apparatus, device, and storage medium for a face detection network that can, to some extent, prevent the detected face key points from jittering.
In a first aspect, a training method for a face detection network is provided, including:
acquiring a first image and a second image which are acquired successively, wherein both the first image and the second image comprise a target face;
inputting the first image and the second image into a preliminary detection network respectively to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained;
performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image;
calculating network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image;
and training the preliminary detection network according to the network loss to obtain a target detection network.
In the above training method for a face detection network, the positions of the face key points are tracked into the second image based on the face detection key points of the first image, giving face tracking key points in the second image that are treated by default as face labeling key points. A network loss is then generated based on the face tracking key points and the face detection key points in the second image and used to train the already-trained original detection network (the preliminary detection network), thereby adjusting the face detection key points in the second image so that the adjusted key points no longer jitter. The resulting target detection network reduces, to a certain extent, the probability that the face key points of the target face jitter around its true labeled positions across consecutive frames.
In one embodiment, the face tracking keypoints and the face detection keypoints each comprise keypoint coordinates; the calculating the network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image comprises: for each face detection key point of the target face in the second image, calculating a coordinate absolute difference value between the face detection key point and a face tracking key point of the face detection key point in the second image according to a key point coordinate; determining a tracking confidence corresponding to the coordinate absolute difference value according to each coordinate absolute difference value; and calculating the network loss according to each coordinate absolute difference value and the tracking confidence corresponding to the coordinate absolute difference value.
In the above embodiment, when the difference between a face detection key point and its face tracking key point in the second image is small, the key point tracking is effective, and the model can be trained with the loss between the two. When the difference is very large, however, the tracking is invalid, and training the model with that loss would reduce model accuracy. The tracking confidence is therefore determined from the difference between the face detection key point and the face tracking key point, and the loss is adjusted by the tracking confidence, improving the training accuracy of the model.
In one embodiment, the calculating the network loss according to each coordinate absolute difference value and the tracking confidence corresponding to the coordinate absolute difference value includes: if the coordinate absolute difference is smaller than a preset difference, calculating a key point distance between a face detection key point and a face tracking key point corresponding to the coordinate absolute difference according to the coordinates of the key points, and determining the loss corresponding to the coordinate absolute difference according to the product of the key point distance and the tracking confidence corresponding to the coordinate absolute difference; if the coordinate absolute difference is larger than or equal to the preset difference, obtaining a difference correction value, and determining the loss corresponding to the coordinate absolute difference according to the coordinate absolute difference, the difference correction value and the tracking confidence corresponding to the coordinate absolute difference; and calculating the network loss according to the loss corresponding to each coordinate absolute difference value.
According to the above embodiment, different methods are used to calculate the network loss depending on whether the difference between the face detection key point and the face tracking key point is small or large. This improves the calculation precision of the network loss and prevents an excessively large calculated loss from causing gradient explosion, which would in turn reduce the precision of the target detection network.
In one embodiment, the determining, according to each absolute difference value of the coordinates, a tracking confidence corresponding to the absolute difference value of the coordinates includes: and when the coordinate absolute difference is larger than a threshold difference, setting the tracking confidence degree corresponding to the coordinate absolute difference to be 0.
In this embodiment, when the difference between the face detection key point and the face tracking key point is particularly large and exceeds the threshold difference, the key point tracking is judged invalid. In that case the preliminary detection network is not trained on the loss between the face detection key point and the face tracking key point; the tracking confidence is directly set to 0, so that the accuracy of the network is not degraded.
In one embodiment, the face detection result further includes a classification detection result and a frame detection result; the calculating the network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image comprises: acquiring a classification labeling result and a frame labeling result of the target face in the second image and face labeling key points corresponding to each face detection key point; calculating a classification loss according to a classification labeling result and a classification detection result of the target face in the second image, calculating a detection frame loss according to a frame labeling result and a frame detection result of the target face in the second image, calculating a key point prediction loss according to each face detection key point of the target face in the second image and a face labeling key point corresponding to the face detection key point, and calculating a key point tracking loss according to each face detection key point of the target face in the second image and a face tracking key point of the face detection key point in the second image; and calculating the network loss according to the classification loss, the detection frame loss, the key point prediction loss and the key point tracking loss.
The above embodiment combines multiple losses: the network loss is calculated from the classification loss, the detection frame loss, the key point prediction loss, and the key point tracking loss, so that the preliminary detection network is trained on a joint loss, improving both the training speed and the precision of the preliminary detection network.
In one embodiment, before acquiring the first image and the second image which are acquired successively, the method further includes: inputting the second image into the preliminary detection network to obtain a classification labeling result, a frame labeling result, and a plurality of face labeling key points of the target face in the second image.
In the above embodiment, the classification labeling result, the frame labeling result, and the face labeling key points corresponding to each face detection key point are generated in advance, so that they can be obtained quickly when the network loss is calculated. Labeling them manually would be time-consuming and laborious, so producing them directly with the preliminary detection network reduces labor cost and shortens acquisition time.
In one embodiment, the performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image includes: and performing key point tracking on each face detection key point of the target face in the first image by adopting a Lucas-Kanade algorithm to obtain a face tracking key point of the face detection key point in the second image.
In this embodiment, the Lucas-Kanade algorithm is robust to noise and tracks well in application scenes satisfying brightness constancy, small motion, and spatial coherence, so the Lucas-Kanade algorithm is adopted for key point tracking in scenes meeting these conditions to achieve a good key point tracking effect.
In a second aspect, there is provided a training apparatus for a face detection network, including:
the image acquisition module is used for acquiring a first image and a second image which are acquired successively, wherein the first image and the second image both comprise a target face;
the network detection module is used for respectively inputting the first image and the second image into a preliminary detection network to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained;
the face tracking module is used for performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image;
the loss calculation module is used for calculating network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image;
and the target training module is used for training the preliminary detection network according to the network loss to obtain a target detection network.
In one embodiment, the face tracking keypoints and the face detection keypoints each comprise keypoint coordinates; the loss calculation module is specifically configured to: for each face detection key point of the target face in the second image, calculating a coordinate absolute difference value between the face detection key point and a face tracking key point of the face detection key point in the second image according to a key point coordinate; determining a tracking confidence corresponding to the coordinate absolute difference value according to each coordinate absolute difference value; and calculating the network loss according to each coordinate absolute difference value and the tracking confidence corresponding to the coordinate absolute difference value.
In one embodiment, the loss calculating module is specifically configured to: if the coordinate absolute difference is smaller than a preset difference, calculating a key point distance between a face detection key point and a face tracking key point corresponding to the coordinate absolute difference according to the coordinates of the key points, and determining the loss corresponding to the coordinate absolute difference according to the product of the key point distance and the tracking confidence corresponding to the coordinate absolute difference; if the coordinate absolute difference is larger than or equal to the preset difference, obtaining a difference correction value, and determining the loss corresponding to the coordinate absolute difference according to the coordinate absolute difference, the difference correction value and the tracking confidence corresponding to the coordinate absolute difference; and calculating the network loss according to the loss corresponding to each coordinate absolute difference value.
In one embodiment, the loss calculating module is specifically configured to: and when the coordinate absolute difference is larger than a threshold difference, setting the tracking confidence degree corresponding to the coordinate absolute difference to be 0.
In one embodiment, the face detection result further includes a classification detection result and a frame detection result; the loss calculation module is specifically configured to: acquiring a classification labeling result and a frame labeling result of the target face in the second image and face labeling key points corresponding to each face detection key point; calculating a classification loss according to a classification labeling result and a classification detection result of the target face in the second image, calculating a detection frame loss according to a frame labeling result and a frame detection result of the target face in the second image, calculating a key point prediction loss according to each face detection key point of the target face in the second image and a face labeling key point corresponding to the face detection key point, and calculating a key point tracking loss according to each face detection key point of the target face in the second image and a face tracking key point of the face detection key point in the second image; and calculating the network loss according to the classification loss, the detection frame loss, the key point prediction loss and the key point tracking loss.
In one embodiment, the apparatus further comprises: and the labeling module is used for inputting the second image into the preliminary detection network to obtain a classification labeling result, a frame labeling result and a plurality of face labeling key points of the target face in the second image.
In one embodiment, the face tracking module is specifically configured to: and performing key point tracking on each face detection key point of the target face in the first image by adopting a Lucas-Kanade algorithm to obtain a face tracking key point of the face detection key point in the second image.
In a third aspect, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the training method for a face detection network described above.
In a fourth aspect, a computer-readable storage medium is provided, in which computer program instructions are stored, which, when read and executed by a processor, perform the steps of the training method for a face detection network as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram illustrating an implementation flow of a training method for a face detection network in an embodiment of the present application;
FIG. 2 is a schematic diagram of a target area provided by an embodiment of the present application;
fig. 3 is a schematic diagram of a face detection key point of a target detection network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a training apparatus of a face detection network according to an embodiment of the present application;
fig. 5 is a block diagram of an internal structure of a computer device in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment, an execution subject of the training method for a face detection network according to the embodiment of the present invention is a device capable of implementing the training method for a face detection network according to the embodiment of the present invention, and the device may include, but is not limited to, a terminal and a server. The terminal comprises a desktop terminal and a mobile terminal, wherein the desktop terminal comprises but is not limited to a desktop computer and a vehicle-mounted computer; mobile terminals include, but are not limited to, cell phones, tablets, laptops, and smartwatches. The server includes a high performance computer and a cluster of high performance computers.
As shown in fig. 1, the training method for a face detection network according to the embodiment of the present invention specifically includes:
step S100, a first image and a second image which are collected successively are obtained, and the first image and the second image both comprise a target face.
The first image is the image corresponding to the current frame in a video sequence; the second image is the image whose frame index in the video sequence is greater than that of the first image by 1. For example, if the frame index of the first image is M, the frame index of the second image is M+1.
The target face is a face contained in both the first image and the second image.
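For concreteness, the pairing of first and second images can be sketched as below. This is a minimal illustration assuming OpenCV video decoding; the helper name and the step of exactly one frame are illustrative, not part of the application.

```python
import cv2

def iter_frame_pairs(video_path):
    """Hypothetical helper: yield (first_image, second_image) pairs of
    successive frames, i.e. frame M and frame M+1 of the video sequence."""
    cap = cv2.VideoCapture(video_path)
    ok, prev_frame = cap.read()
    while ok:
        ok, next_frame = cap.read()
        if not ok:
            break
        yield prev_frame, next_frame  # first image (frame M), second image (frame M+1)
        prev_frame = next_frame
    cap.release()
```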
Step S200, inputting the first image and the second image into a preliminary detection network respectively to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained.
The face detection result is the detection result for the target face output by the preliminary detection network. Illustratively, it includes a face detection frame, a face classification result, and face detection key points. The face detection frame consists of frame coordinates: the detection frame can be constructed from the frame coordinates and then used to mark the face in the image. The face classification result is a classification label that classifies the content marked by the detection frame: for example, the label is 1 when the marked content is a face and 0 otherwise, e.g. when the marked content is an animal. The face detection key points are key point coordinates: according to the coordinates recorded for each key point, the face key points of the target face can be marked and displayed in the image containing the target face. The face key points are particular points of the target face from which face recognition can be achieved, for example the two eye corners and the two mouth corners. There may be more than one face key point, for example 3, 5, or 1000, and the number can be determined by the specific application scene.
When the input of the preliminary detection network is an image containing a human face, its output is a face detection result; when the input is an image without a face, the face detection result output by the preliminary detection network is empty, that is, no frame coordinates of a face detection frame, no classification label, and no face detection key points are detected.
To obtain the preliminary detection network, an original detection network is first constructed and then trained. Exemplarily, the original detection network comprises a MobileNet backbone, a feature pyramid network, and a context connection network. The MobileNet backbone extracts features from the input image to obtain feature maps; the feature pyramid network and the context connection network connect feature maps extracted from different network layers, forming a larger receptive field, from which the face detection result is obtained. After the original detection network is built, it is trained as follows: acquire a current training image (a face image used for training the original detection network) and input it into the original detection network to obtain the corresponding face detection result; acquire the face labeling result corresponding to the current training image, i.e. the result of labeling the face in that image; calculate the training loss, i.e. the loss for training the original detection network, from the face labeling result and the face detection result; and train the original detection network according to the training loss to obtain the preliminary detection network. Illustratively, the dataset used to train the original detection network is the WIDER FACE dataset.
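The description names the building blocks but not their exact wiring; the sketch below is one plausible reading, assuming PyTorch and the torchvision MobileNetV2 backbone, with the feature pyramid and context connections collapsed into plain convolutional heads for brevity. The class name, head shapes, and the choice of MobileNetV2 are assumptions.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2  # assumes torchvision >= 0.13

class OriginalDetectionNet(nn.Module):
    """Illustrative sketch: a MobileNet feature extractor feeding a
    classification head, a box head, and a keypoint head."""
    def __init__(self, num_keypoints=5):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features  # feature maps, 1280 channels
        self.cls_head = nn.Conv2d(1280, 2, kernel_size=1)    # face / not face
        self.box_head = nn.Conv2d(1280, 4, kernel_size=1)    # frame coordinates
        self.kpt_head = nn.Conv2d(1280, num_keypoints * 2, kernel_size=1)  # (x, y) per keypoint

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.box_head(feat), self.kpt_head(feat)
```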
Because the preliminary detection network is obtained by training the original detection network, it already has high detection precision and real-time performance. However, the existing metric for detection precision is the aggregate difference between all face detection key points in an image and their corresponding labeled key points. Even when this aggregate difference satisfies the network training requirement, an individual face detection key point may still differ in position from the true face key point, i.e. jitter may occur. The preliminary detection network therefore needs to be retrained into a target detection network to improve the detection accuracy of each key point and prevent the jitter phenomenon.
Step S300, carrying out key point tracking on the face detection key points of the target face in the first image to obtain face tracking key points of the target face in the second image.
The face tracking key point is a face key point in the second image obtained by key point tracking, specifically the coordinate position in the second image of a face key point of the target face obtained by tracking. Illustratively, an optical flow method is used for key point tracking to obtain the face tracking key points. Optical flow methods evaluate the deformation between two images by computing the optical flow, i.e. the change in coordinate position of the same object point (a point in the real scene) between the two images. They assume that the appearance of an object does not change greatly or obviously between two adjacent frames; a constraint equation is derived from this assumption, and the optical flow is computed from that constraint equation. It should be noted that key point tracking is not limited to optical flow; other methods can be used, as long as they can determine the relationship between the coordinate positions of the same pixel point in different images.
Because key point tracking is performed on the face detection key points of the target face in the first image, the positions on the target face of the tracked face tracking key points in the second image do not change much from the positions of the face detection key points in the first image. Therefore, when training the preliminary detection network, the face tracking key points in the second image can be regarded as face labeling key points, the network loss can be calculated from the face detection key points and the face tracking key points in the second image, and the preliminary detection network can be trained, so that the trained target detection network resolves the key point jitter phenomenon to a certain extent.
Since the key point tracking starts from the face detection key points of the target face in the first image, and the tracked face tracking key points in the second image are regarded as face labeling key points, it must be ensured that the positions of the face detection key points in the first image are not excessively offset or jittered relative to the true face key points of the target face. In other words, the face detection key points that the preliminary detection network outputs for the first image must not differ too much from the true face key point positions. It can be understood that if they did differ greatly, the face tracking key points obtained by key point tracking would inevitably also differ greatly from the true face key point positions, and training the preliminary detection network with such face tracking key points as face labeling key points would reduce the training precision to a certain extent.
Therefore, the first images used to train the preliminary detection network need to be screened to some extent, and problematic images eliminated, i.e. images in which the positions of the face detection key points differ greatly from the positions of the true face key points, so as to improve the training precision of the preliminary detection network. A method of acquiring the first image is provided: acquire a candidate image set comprising a plurality of candidate images, each containing the target face; input the candidate images into the preliminary detection network to obtain the face detection key points output for each candidate image; compare the positions of the face detection key points of each candidate image with the positions of the face key points of the target face to obtain a comparison result for each candidate image, the result being either a small difference or a large difference; and take the candidate images whose comparison result is a small difference as first images.
A method for judging the size of the difference from the comparison is provided: set target regions. Specifically, target regions are constructed on the target face according to the positions of its face key points, for example the 5 elliptical target regions shown in Fig. 2, each constructed around the corresponding face key point. When at least a certain number of the face detection key points in a candidate image fall inside their corresponding target regions on the target face, the comparison result of that candidate image is determined to be a small difference; when fewer than that number fall inside their corresponding target regions, the comparison result is determined to be a large difference. A first image for training the preliminary detection network is thereby selected from the plurality of candidate images.
Illustratively, the target area is obtained by means of manual labeling.
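A minimal sketch of this screening test, assuming axis-aligned elliptical regions; the ellipse parameterization and the required number of hits are assumptions made for illustration.

```python
def inside_ellipse(point, center, semi_axes):
    """True if `point` lies inside the axis-aligned ellipse with the given
    center and semi-axes (a, b)."""
    (x, y), (cx, cy), (a, b) = point, center, semi_axes
    return ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0

def comparison_is_small(detected_kpts, target_regions, required_hits=4):
    """Candidate passes (small difference) when at least `required_hits` of
    its face detection keypoints fall inside the corresponding regions."""
    hits = sum(inside_ellipse(p, c, ab)
               for p, (c, ab) in zip(detected_kpts, target_regions))
    return hits >= required_hits
```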
In one embodiment, considering that the Lucas-Kanade algorithm is robust to noise and tracks well in application scenes with brightness constancy, small motion, and spatial coherence, the Lucas-Kanade algorithm is adopted for key point tracking in scenes meeting these conditions to achieve a good key point tracking effect. Specifically, step S300 of performing key point tracking on the face detection key points of the target face in the first image to obtain the face tracking key points of the target face in the second image includes: performing key point tracking on each face detection key point of the target face in the first image by using the Lucas-Kanade algorithm to obtain the face tracking key point of each face detection key point in the second image.
Specifically, the first image and the second image are used as the input of the Lucas-Kanade algorithm, and the optical flow between the first image and the second image output by the Lucas-Kanade algorithm is obtained; and obtaining a face tracking key point (coordinate) of the target face in the second image according to the optical flow between the first image and the second image and the face detection key point (coordinate) of the target face in the first image.
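A minimal sketch of this step using OpenCV's pyramidal Lucas-Kanade implementation; the grayscale conversion and BGR color order are OpenCV conventions, not requirements stated in the application.

```python
import cv2
import numpy as np

def track_keypoints(first_image, second_image, detection_kpts):
    """Track the face detection keypoints of the first image into the
    second image with Lucas-Kanade optical flow, returning the face
    tracking keypoint coordinates plus a per-point success flag."""
    prev_gray = cv2.cvtColor(first_image, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(second_image, cv2.COLOR_BGR2GRAY)
    pts = np.asarray(detection_kpts, dtype=np.float32).reshape(-1, 1, 2)
    tracked, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    return tracked.reshape(-1, 2), status.ravel()
```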
Step S400, calculating network loss according to the face detection key points and the face tracking key points of the target face in the second image.
The network loss is the loss used to train the preliminary detection network. Since the face tracking key points are regarded as face labeling key points, the network loss can be calculated based on the face detection key points and the face tracking key points in the second image, and the preliminary detection network can be trained with it.
Step S500, training the preliminary detection network according to the network loss to obtain a target detection network.
Here the network loss is back-propagated, and the network parameters of the preliminary detection network are adjusted by a gradient descent algorithm until the calculated network loss is less than a preset loss, yielding the target detection network. The target detection network is then used to detect input face images with consecutive frame indices, and the face key points obtained in consecutive frames no longer jitter severely. As shown in Fig. 3, the face detection key points obtained by the target detection network of the embodiment of the present invention (the second row in Fig. 3) stay close to the positions of the true face key points of the target face, without excessive jitter or offset.
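The retraining loop can be sketched as follows, assuming PyTorch; `compute_network_loss` stands in for the joint loss described in this application, and the optimizer, learning rate, and stopping threshold are illustrative assumptions.

```python
import torch

def retrain(preliminary_net, pair_loader, compute_network_loss,
            preset_loss=1e-3, lr=1e-4):
    """Back-propagate the network loss and adjust the parameters of the
    preliminary detection network by gradient descent until the loss
    drops below the preset loss, yielding the target detection network."""
    optimizer = torch.optim.SGD(preliminary_net.parameters(), lr=lr)
    for first_img, second_img, labels in pair_loader:
        loss = compute_network_loss(preliminary_net(first_img),
                                    preliminary_net(second_img), labels)
        optimizer.zero_grad()
        loss.backward()   # back-propagate the network loss
        optimizer.step()  # gradient descent update
        if loss.item() < preset_loss:
            break
    return preliminary_net  # now the target detection network
```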
In the training method for a face detection network described above, the positions of the face key points are tracked into the second image based on the face detection key points of the first image, giving face tracking key points in the second image that are treated by default as face labeling key points. A network loss is then generated based on the face tracking key points and the face detection key points in the second image and used to train the already-trained original detection network (the preliminary detection network), thereby adjusting the face detection key points in the second image so that the adjusted key points no longer jitter. The resulting target detection network reduces, to a certain extent, the probability that the face key points of the target face jitter around its true labeled positions across consecutive frames.
In one embodiment, when the difference between a face detection key point and its face tracking key point in the second image is small, the key point tracking is effective and the model can be trained with the loss between the two; when the difference is particularly large, the tracking is invalid, and training the model with that loss would reduce model accuracy. The tracking confidence is therefore determined from the difference between the face detection key point and the face tracking key point, and the loss is adjusted by the tracking confidence to improve the training accuracy of the model. Specifically, the face tracking key points and the face detection key points each comprise key point coordinates, and step S400 of calculating the network loss according to each face detection key point of the target face in the second image and its face tracking key point in the second image includes:
step 401, for each face detection key point of the target face in the second image, calculating a coordinate absolute difference between the face detection key point and a face tracking key point of the face detection key point in the second image according to the key point coordinates.
The coordinate absolute difference is the absolute value of the coordinate difference and reflects the magnitude of the position difference between a face tracking key point and a face detection key point. For example, suppose the number of face key points is 5, the key point coordinates of the i-th face detection key point in the second image are (x_i, y_i), where 1 <= i <= 5 and i is a positive integer, and the key point coordinates of the i-th face tracking key point in the second image are (X_i, Y_i). Then the coordinate absolute difference between the i-th face detection key point and the i-th face tracking key point is |x_i - X_i| + |y_i - Y_i|. In this way, 5 coordinate absolute differences are obtained.
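A minimal sketch of step 401 in plain Python; the sample coordinates are made up for illustration.

```python
detected = [(10.0, 12.0), (30.0, 11.5)]  # (x_i, y_i): face detection keypoints
tracked = [(10.5, 12.25), (33.0, 11.0)]  # (X_i, Y_i): face tracking keypoints

# coordinate absolute difference |x_i - X_i| + |y_i - Y_i| per keypoint
abs_diffs = [abs(x - X) + abs(y - Y)
             for (x, y), (X, Y) in zip(detected, tracked)]
print(abs_diffs)  # [0.75, 3.5]
```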
Step 402, determining a tracking confidence corresponding to the coordinate absolute difference value according to each coordinate absolute difference value.
The tracking confidence reflects whether the tracking is effective. When the difference between the face detection key point and the face tracking key point is small, the tracking is considered effective and the tracking confidence can be set larger; conversely, when the difference is large, the tracking is considered progressively invalid and the tracking confidence can be set smaller.
Two methods of determining tracking confidence are provided.
First, acquire a preset difference; if the coordinate absolute difference is smaller than the preset difference, determine the tracking confidence as a first confidence; if the coordinate absolute difference is greater than or equal to the preset difference, determine the tracking confidence as a second confidence, where the first confidence is greater than the second confidence.
Illustratively, the preset difference is 1, the first confidence is 1, and the second confidence is 0.1.
Second, acquire a preset difference; divide the coordinate absolute difference by the preset difference to obtain a difference ratio; acquire a first conversion coefficient and a second conversion coefficient, where the first conversion coefficient is negative and the second conversion coefficient is positive; and obtain the tracking confidence corresponding to the coordinate absolute difference from the difference ratio, the first conversion coefficient, and the second conversion coefficient.
To ensure that the difference ratio and the tracking confidence are negatively correlated, i.e. a smaller difference ratio means a smaller difference between the face detection key point and the face tracking key point and thus a larger tracking confidence, while a larger difference ratio means a larger difference and thus a smaller tracking confidence, the first conversion coefficient is set negative and the second conversion coefficient positive.
For example, if the first conversion coefficient is -0.8 and the second conversion coefficient is 1.14, then with a difference ratio of 0.3 the tracking confidence is 0.3 × (-0.8) + 1.14 = 0.9; with a difference ratio of 0.8 it is 0.8 × (-0.8) + 1.14 = 0.5; and with a difference ratio of 1.2 it is 1.2 × (-0.8) + 1.14 = 0.18.
It can be seen that the larger the difference ratio, the smaller the tracking confidence, possibly even below 0. To prevent a negative tracking confidence, when the value calculated from the difference ratio and the two conversion coefficients is less than or equal to a first preset confidence, the tracking confidence is set to the first preset confidence. Likewise, the smaller the difference ratio, the larger the tracking confidence, possibly even above 1; to prevent the tracking confidence from exceeding 1, when the calculated value is greater than or equal to a second preset confidence, the tracking confidence is set to the second preset confidence. For example, the first preset confidence is 0.1, i.e. when the tracking confidence is less than or equal to 0.1 it is set to 0.1; the second preset confidence is 1, i.e. when the tracking confidence is greater than or equal to 1 it is set to 1.
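A minimal sketch of this second method, with the example values above as defaults; the threshold-difference rule of a later embodiment (confidence forced to 0 when tracking is deemed invalid) is folded in, taking the threshold value 2 from the example further below.

```python
def tracking_confidence(abs_diff, preset_diff=1.0, k1=-0.8, k2=1.14,
                        conf_lo=0.1, conf_hi=1.0, threshold_diff=2.0):
    """Map the difference ratio linearly to a tracking confidence and clamp
    it to [conf_lo, conf_hi]; differences beyond threshold_diff mean the
    tracking is invalid and get confidence 0."""
    if abs_diff > threshold_diff:
        return 0.0
    ratio = abs_diff / preset_diff          # difference ratio
    conf = ratio * k1 + k2                  # k1 negative, k2 positive
    return min(max(conf, conf_lo), conf_hi)

print(tracking_confidence(0.3))  # ~0.9, the worked example above
```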
Step 403, calculating the network loss according to each coordinate absolute difference and the tracking confidence corresponding to that coordinate absolute difference.
Illustratively, the network loss is obtained by multiplying the coordinate absolute difference by the tracking confidence corresponding to the coordinate absolute difference. For example, if the coordinate absolute difference is 3 and the corresponding tracking confidence is 0.9, then the network loss is 3 × 0.9 = 2.7.
In one embodiment, different methods are used to calculate the network loss depending on whether the difference between the face detection key point and the face tracking key point is small or large, which improves the calculation precision of the network loss and prevents an excessively large calculated loss from causing gradient explosion, which would in turn reduce the precision of the target detection network. Specifically, step 403 of calculating the network loss according to each coordinate absolute difference and its corresponding tracking confidence includes:
step 403A, if the coordinate absolute difference is smaller than a preset difference, calculating a key point distance between a face detection key point and a face tracking key point corresponding to the coordinate absolute difference according to the key point coordinates, and determining a loss corresponding to the coordinate absolute difference according to a product of the key point distance and a tracking confidence corresponding to the coordinate absolute difference.
For example, suppose the number of face key points is 5, the coordinates of the i-th face detection key point in the second image (where 1 <= i <= 5) are (x_i, y_i), and the coordinates of the i-th face tracking key point in the second image are (X_i, Y_i). The key point distance between the i-th face detection key point and the i-th face tracking key point is (x_i - X_i)^2 + (y_i - Y_i)^2. Multiplying the key point distance corresponding to the i-th face detection key point by the tracking confidence a_i gives the i-th key point tracking sub-loss (i.e. the loss corresponding to the i-th coordinate absolute difference): a_i × ((x_i - X_i)^2 + (y_i - Y_i)^2). Of course, to further correct the loss, a weight p (a decimal between 0 and 1) can be set and multiplied into the loss, giving the corrected loss p × a_i × ((x_i - X_i)^2 + (y_i - Y_i)^2). For example, p = 0.5 and a_i = 1.
Step 403B, if the coordinate absolute difference is greater than or equal to the preset difference, obtaining a difference correction value, and determining a loss corresponding to the coordinate absolute difference according to the coordinate absolute difference, the difference correction value, and a tracking confidence corresponding to the coordinate absolute difference.
For example, if the coordinate absolute difference is |x_i - X_i| + |y_i - Y_i|, the difference correction value is m, and the tracking confidence corresponding to the coordinate absolute difference is a_i, then the i-th key point tracking sub-loss (i.e. the loss corresponding to the i-th coordinate absolute difference) is a_i × (|x_i - X_i| + |y_i - Y_i| - m), e.g. with m = 0.5 and a_i = 1.
Step 403C, calculating the network loss according to the loss corresponding to each coordinate absolute difference.
The losses corresponding to the calculated coordinate absolute differences are added to obtain the key point tracking loss, and the network loss is obtained from the key point tracking loss. For example, suppose the target face has 2 face key points, the 1st key point corresponds to coordinate absolute difference A, and the 2nd key point corresponds to coordinate absolute difference B. If A is smaller than the preset difference, the 1st key point tracking sub-loss is obtained by step 403A; if B is greater than or equal to the preset difference, the 2nd key point tracking sub-loss is obtained by step 403B; the two sub-losses are then added to obtain the key point tracking loss.
The larger the network loss, the more it hinders gradient descent and the more easily gradient explosion occurs, reducing network precision. Therefore, when the coordinate absolute difference is small, computing the key point distance keeps the network loss small; when the coordinate absolute difference is large, computing the loss from the key point distance would inevitably make the network loss large, so the coordinate absolute difference is corrected by the difference correction value to obtain a relatively smaller network loss.
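Putting steps 403A to 403C together, a minimal sketch of the piecewise key point tracking loss, with m = 0.5 and p = 0.5 from the examples above; treating the per-keypoint confidences as given inputs is an assumption.

```python
def keypoint_tracking_loss(detected, tracked, confidences,
                           preset_diff=1.0, m=0.5, p=0.5):
    """Sum of per-keypoint sub-losses: weighted squared distance when the
    coordinate absolute difference is small (step 403A), confidence-scaled
    corrected absolute difference otherwise (step 403B)."""
    total = 0.0
    for (x, y), (X, Y), a_i in zip(detected, tracked, confidences):
        abs_diff = abs(x - X) + abs(y - Y)
        if abs_diff < preset_diff:
            total += p * a_i * ((x - X) ** 2 + (y - Y) ** 2)  # step 403A
        else:
            total += a_i * (abs_diff - m)                     # step 403B
    return total                                              # step 403C: sum
```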
In one embodiment, when the difference between the face detection key point and the face tracking key point is particularly large and exceeds a threshold difference, the key point tracking is definitely invalid; for example, in scenes with fast motion, optical flow tracking easily fails. In that case the preliminary detection network is not trained on the loss between the face detection key point and the face tracking key point, and the tracking confidence is directly set to 0 to avoid reducing network accuracy. Specifically, step 402 of determining the tracking confidence corresponding to each coordinate absolute difference includes: when the coordinate absolute difference is greater than the threshold difference, setting the corresponding tracking confidence to 0.
For example, suppose the coordinate absolute difference is |x_i - X_i| + |y_i - Y_i|, the preset difference is 1, and the threshold difference is 2. When the coordinate absolute difference is greater than the preset difference, the formula a_i × (|x_i - X_i| + |y_i - Y_i| - m) is used to calculate the loss corresponding to the coordinate absolute difference. When the coordinate absolute difference is greater than the preset difference and also greater than the threshold difference, the loss is still calculated by the formula a_i × (|x_i - X_i| + |y_i - Y_i| - m), but since the tracking confidence a_i is 0 at this time, the calculated loss will also be 0.
In one embodiment, a training method is provided that combines a plurality of penalties: the network loss is calculated according to the classification loss, the detection frame loss, the key point prediction loss and the key point tracking loss, so that the preliminary detection network is trained based on the joint loss, and the training speed and precision of the preliminary detection network are improved. Specifically, the face detection result further includes a classification detection result and a frame detection result; the step 400 of calculating a network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image includes:
step 400A, obtaining the classification labeling result, the frame labeling result and the face labeling key point corresponding to each face detection key point of the target face in the second image.
The classification labeling result is the result of classifying the target face in the second image in advance; the frame labeling result is the detection frame (detection frame coordinates) determined in advance for the target face in the second image; and the face labeling key points are the key points (key point coordinates) obtained by labeling the face key points of the target face in the second image in advance.
Step 400B, calculating a classification loss according to the classification labeling result and the classification detection result of the target face in the second image, calculating a detection frame loss according to the frame labeling result and the frame detection result of the target face in the second image, calculating a key point prediction loss according to each face detection key point of the target face in the second image and a face labeling key point corresponding to the face detection key point, and calculating a key point tracking loss according to each face detection key point of the target face in the second image and a face tracking key point of the face detection key point in the second image.
The classification loss is the absolute value of the classification detection result minus the classification labeling result. The detection frame loss is the distance between the frame labeling result and the frame detection result, taken as a distance between coordinates. The key point prediction loss is the distance between each face detection key point and its face labeling key point, likewise taken as a distance between coordinates. The calculation of the key point tracking loss follows steps 401 to 403 and is not repeated here.
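A minimal sketch of the component losses of step 400B; the text only says "distance", so Euclidean distance is assumed here, and the key point tracking loss reuses the function sketched after step 403C.

```python
def classification_loss(cls_label, cls_pred):
    # absolute value of the difference between labeling and detection results
    return abs(cls_label - cls_pred)

def detection_frame_loss(box_label, box_pred):
    # distance between labeled and detected frame coordinates (Euclidean assumed)
    return sum((a - b) ** 2 for a, b in zip(box_label, box_pred)) ** 0.5

def keypoint_prediction_loss(label_kpts, detected_kpts):
    # summed distance between detected keypoints and their labeled keypoints
    return sum(((x - u) ** 2 + (y - v) ** 2) ** 0.5
               for (x, y), (u, v) in zip(detected_kpts, label_kpts))
```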
Step 400C, calculating the network loss according to the classification loss, the detection frame loss, the key point prediction loss, and the key point tracking loss.
A first weight, a second weight, a third weight, and a fourth weight are obtained, and the classification loss, the detection frame loss, the key point prediction loss, and the key point tracking loss are weighted and summed based on these four weights to obtain the network loss.
It should be noted that, to obtain higher training accuracy, the preliminary detection network is trained with multiple pairs of first and second images. Specifically, for each pair, the classification loss, detection frame loss, key point prediction loss, and key point tracking loss corresponding to the second image of that pair are determined; these four losses are weighted and summed with the first, second, third, and fourth weights to obtain the group loss corresponding to that second image; and the group losses of the multiple second images are summed to obtain the final network loss.
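A minimal sketch of the weighted joint loss over multiple image pairs; the four weights are unspecified in the application, so the defaults below are placeholders.

```python
def network_loss(group_losses, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    """Each element of group_losses holds the (classification, detection
    frame, keypoint prediction, keypoint tracking) losses of one
    (first image, second image) pair; group losses are summed to give
    the final network loss."""
    return sum(w1 * cls_l + w2 * box_l + w3 * pred_l + w4 * track_l
               for cls_l, box_l, pred_l, track_l in group_losses)
```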
In one embodiment, the classification labeling result, the frame labeling result, and the face labeling key points corresponding to each face detection key point are generated in advance, so that they can be obtained quickly when the network loss is calculated; labeling them manually would be time-consuming and laborious, so producing them directly with the preliminary detection network reduces labor cost and shortens acquisition time. Specifically, before step S100 of acquiring the first image and the second image which are acquired successively, the method further includes:
and step 000, inputting the second image into the preliminary detection network to obtain a classification labeling result, a frame labeling result and a plurality of face labeling key points of the target face in the second image.
The initial detection network is obtained by training the original detection network, so that the initial detection network has higher detection precision and detection real-time performance, a plurality of labeling results can be generated by adopting the initial detection network, and the labeling efficiency is improved.
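A minimal sketch of step 000 follows, assuming a preliminary network whose forward pass returns the three labeling outputs; the three-tuple output layout is an assumption made for illustration.

```python
# Step 000 sketch: the trained preliminary network generates labels for
# the second image, avoiding manual annotation. The three-tuple output
# layout of `preliminary_net` is an assumption for illustration.
import torch

@torch.no_grad()
def generate_labels(preliminary_net, second_image):
    preliminary_net.eval()
    # The detections are used directly as the classification labeling
    # result, frame labeling result and face labeling key points.
    cls_label, box_label, kpt_labels = preliminary_net(second_image)
    return cls_label, box_label, kpt_labels
```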
In one embodiment, as shown in fig. 4, there is provided an apparatus 400 for training a face detection network, including:
the image acquisition module 410 is configured to acquire a first image and a second image which are acquired sequentially, where the first image and the second image both include a target face;
a network detection module 420, configured to input the first image and the second image into a preliminary detection network respectively, to obtain face detection results of the target face in the first image and the second image, where the face detection results include a plurality of face detection key points, and the preliminary detection network is obtained by training an original detection network;
a face tracking module 430, configured to perform key point tracking on each face detection key point of the target face in the first image, to obtain a face tracking key point of the face detection key point in the second image;
a loss calculation module 440, configured to calculate a network loss according to each face detection key point of the target face in the second image and a face tracking key point of the face detection key point in the second image;
and the target training module 450 is configured to train the preliminary detection network according to the network loss to obtain a target detection network.
The training device of the face detection network tracks the positions of the face key points into the second image based on the face detection key points of the first image to obtain the face tracking key points in the second image, which are taken by default as face labeling key points. A network loss is then generated from the face tracking key points and the face detection key points in the second image to train the already-trained original detection network (the preliminary detection network), adjusting the face detection key points in the second image so that they no longer jitter, thereby obtaining a target detection network. This reduces, to a certain extent, the probability that the face key points of the target face in consecutive frames jitter near the real labeling positions of the target face.
In one embodiment, the face tracking keypoints and the face detection keypoints each comprise keypoint coordinates; the loss calculating module 440 is specifically configured to: for each face detection key point of the target face in the second image, calculating a coordinate absolute difference value between the face detection key point and a face tracking key point of the face detection key point in the second image according to a key point coordinate; determining a tracking confidence corresponding to the coordinate absolute difference value according to each coordinate absolute difference value; and calculating the network loss according to each coordinate absolute difference value and the tracking confidence corresponding to the coordinate absolute difference value.
In an embodiment, the loss calculating module 440 is specifically configured to: if the coordinate absolute difference is smaller than a preset difference, calculating a key point distance between a face detection key point and a face tracking key point corresponding to the coordinate absolute difference according to the coordinates of the key points, and determining the loss corresponding to the coordinate absolute difference according to the product of the key point distance and the tracking confidence corresponding to the coordinate absolute difference; if the coordinate absolute difference is larger than or equal to the preset difference, obtaining a difference correction value, and determining the loss corresponding to the coordinate absolute difference according to the coordinate absolute difference, the difference correction value and the tracking confidence corresponding to the coordinate absolute difference; and calculating the network loss according to the loss corresponding to each coordinate absolute difference value.
In an embodiment, the loss calculating module 440 is specifically configured to: and when the coordinate absolute difference is larger than a threshold difference, setting the tracking confidence degree corresponding to the coordinate absolute difference to be 0.
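The three embodiments above can be read together as a confidence-gated, smooth-L1-style tracking loss. The sketch below assumes the tracking confidence is 1 wherever it is not forced to 0, and the preset difference, threshold difference and difference correction values are illustrative hyperparameters, not values prescribed by this application.

```python
# A confidence-gated tracking loss in the style described above.
# preset_diff, threshold_diff and correction are hyperparameters whose
# values here are assumptions; confidence defaults to 1 where not gated.
import torch

def keypoint_tracking_loss(kpts_det, kpts_track,
                           preset_diff=1.0, threshold_diff=5.0,
                           correction=0.5):
    # Coordinate absolute difference per key point, summed over x and y.
    abs_diff = torch.abs(kpts_det - kpts_track).sum(dim=-1)
    # Tracking confidence: 0 where the difference exceeds the threshold
    # (a likely tracking failure), else 1 in this sketch.
    confidence = (abs_diff <= threshold_diff).float()
    # Euclidean distance between detected and tracked key points.
    dist = torch.norm(kpts_det - kpts_track, dim=-1)
    # Below the preset difference: distance times confidence; otherwise:
    # corrected absolute difference times confidence.
    loss = torch.where(abs_diff < preset_diff,
                       dist * confidence,
                       (abs_diff - correction) * confidence)
    return loss.mean()
```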
In one embodiment, the face detection result further includes a classification detection result and a frame detection result; the loss calculating module 440 is specifically configured to: acquiring a classification labeling result and a frame labeling result of the target face in the second image and face labeling key points corresponding to each face detection key point; calculating a classification loss according to a classification labeling result and a classification detection result of the target face in the second image, calculating a detection frame loss according to a frame labeling result and a frame detection result of the target face in the second image, calculating a key point prediction loss according to each face detection key point of the target face in the second image and a face labeling key point corresponding to the face detection key point, and calculating a key point tracking loss according to each face detection key point of the target face in the second image and a face tracking key point of the face detection key point in the second image; and calculating the network loss according to the classification loss, the detection frame loss, the key point prediction loss and the key point tracking loss.
In one embodiment, the apparatus 400 further includes a labeling module, configured to input the second image into the preliminary detection network to obtain a classification labeling result, a frame labeling result and a plurality of face labeling key points of the target face in the second image.
In one embodiment, the face tracking module 430 is specifically configured to: and performing key point tracking on each face detection key point of the target face in the first image by adopting a Lucas-Kanade algorithm to obtain a face tracking key point of the face detection key point in the second image.
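A sketch of this module using OpenCV's pyramidal Lucas-Kanade optical flow is shown below; the window size and pyramid depth are illustrative choices.

```python
# Lucas-Kanade key point tracking with OpenCV. winSize and maxLevel are
# illustrative settings, not values prescribed by this application.
import cv2
import numpy as np

def track_keypoints(first_gray, second_gray, kpts_first):
    """first_gray/second_gray: 8-bit grayscale frames; kpts_first: (N, 2)
    array of face detection key points in the first image. Returns the
    tracked positions in the second image and per-point status flags."""
    pts = kpts_first.reshape(-1, 1, 2).astype(np.float32)
    tracked, status, err = cv2.calcOpticalFlowPyrLK(
        first_gray, second_gray, pts, None,
        winSize=(21, 21), maxLevel=3)
    return tracked.reshape(-1, 2), status.ravel()
```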
In one embodiment, as shown in fig. 5, a computer device is provided, which may specifically be a terminal or a server. The computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a nonvolatile storage medium and an internal memory; the nonvolatile storage medium stores an operating system and a computer program which, when executed by the processor, causes the processor to implement the training method of the face detection network.

Nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the training method of the face detection network. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
The training method of the face detection network provided by the application can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in fig. 5. The memory of the computer device may store the various program modules that make up the training apparatus of the face detection network, such as the image acquisition module and the network detection module.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a first image and a second image which are acquired successively, wherein the first image and the second image both comprise a target face;
inputting the first image and the second image into a preliminary detection network respectively to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained;
performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image;
calculating network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image;
and training the preliminary detection network according to the network loss to obtain a target detection network.
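Read together, the steps above amount to one training iteration. The sketch below stitches the earlier sketches into such an iteration; all names are illustrative, `track_to_second_image` is a hypothetical wrapper around the Lucas-Kanade sketch, and the network's output layout is again an assumption.

```python
# One training iteration following the steps above. All names are
# illustrative; track_to_second_image is a hypothetical wrapper around
# the Lucas-Kanade sketch, and keypoint_tracking_loss is sketched above.
import torch

def train_step(preliminary_net, optimizer, first_image, second_image):
    # Detect key points (and other results) in both images.
    _, _, kpts_first = preliminary_net(first_image)
    cls_pred, box_pred, kpts_second = preliminary_net(second_image)
    # Track the first-image key points into the second image; detached,
    # so the tracked points act as fixed labels rather than gradients.
    kpts_tracked = track_to_second_image(first_image, second_image,
                                         kpts_first.detach())
    # Network loss from the detected and tracked key points; the full
    # loss would also add the classification, frame and prediction terms.
    loss = keypoint_tracking_loss(kpts_second, kpts_tracked)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```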
In one embodiment, a computer readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of:
acquiring a first image and a second image which are acquired successively, wherein the first image and the second image both comprise a target face;
inputting the first image and the second image into a preliminary detection network respectively to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained;
performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image;
calculating network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image;
and training the preliminary detection network according to the network loss to obtain a target detection network.
It should be noted that the above training method of the face detection network, the training apparatus of the face detection network, the computer device and the computer readable storage medium belong to the same general inventive concept, and the contents of their respective embodiments are mutually applicable.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A training method of a face detection network is characterized by comprising the following steps:
acquiring a first image and a second image which are acquired successively, wherein the first image and the second image both comprise a target face;
inputting the first image and the second image into a preliminary detection network respectively to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained;
performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image;
calculating network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image;
and training the preliminary detection network according to the network loss to obtain a target detection network.
2. The training method of claim 1, wherein the face tracking keypoints and the face detection keypoints each comprise keypoint coordinates;
the calculating the network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image comprises:
for each face detection key point of the target face in the second image, calculating a coordinate absolute difference value between the face detection key point and a face tracking key point of the face detection key point in the second image according to a key point coordinate;
determining a tracking confidence corresponding to the coordinate absolute difference value according to each coordinate absolute difference value;
and calculating the network loss according to each coordinate absolute difference value and the tracking confidence corresponding to the coordinate absolute difference value.
3. The training method of claim 2, wherein said calculating a network loss based on each of said coordinate absolute difference values and a tracking confidence level associated with said coordinate absolute difference values comprises:
if the coordinate absolute difference is smaller than a preset difference, calculating a key point distance between a face detection key point and a face tracking key point corresponding to the coordinate absolute difference according to the coordinates of the key points, and determining the loss corresponding to the coordinate absolute difference according to the product of the key point distance and the tracking confidence corresponding to the coordinate absolute difference;
if the coordinate absolute difference is larger than or equal to the preset difference, obtaining a difference correction value, and determining the loss corresponding to the coordinate absolute difference according to the coordinate absolute difference, the difference correction value and the tracking confidence corresponding to the coordinate absolute difference;
and calculating the network loss according to the loss corresponding to each coordinate absolute difference value.
4. The training method of claim 3, wherein said determining a tracking confidence level for each of said coordinate absolute difference values comprises:
and when the coordinate absolute difference is larger than a threshold difference, setting the tracking confidence degree corresponding to the coordinate absolute difference to be 0.
5. The training method of claim 1, wherein the face detection result further comprises a classification detection result and a frame detection result;
the calculating the network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image comprises:
acquiring a classification labeling result and a frame labeling result of the target face in the second image and face labeling key points corresponding to each face detection key point;
calculating a classification loss according to a classification labeling result and a classification detection result of the target face in the second image, calculating a detection frame loss according to a frame labeling result and a frame detection result of the target face in the second image, calculating a key point prediction loss according to each face detection key point of the target face in the second image and a face labeling key point corresponding to the face detection key point, and calculating a key point tracking loss according to each face detection key point of the target face in the second image and a face tracking key point of the face detection key point in the second image;
and calculating the network loss according to the classification loss, the detection frame loss, the key point prediction loss and the key point tracking loss.
6. The method of claim 5, further comprising, prior to said acquiring the first image and the second image acquired sequentially:
and inputting the second image into the preliminary detection network to obtain a classification labeling result, a frame labeling result and a plurality of face labeling key points of the target face in the second image.
7. The method of claim 1, wherein performing keypoint tracking on each face detection keypoint of the target face in the first image to obtain a face tracking keypoint of the face detection keypoint in the second image comprises:
and performing key point tracking on each face detection key point of the target face in the first image by adopting a Lucas-Kanade algorithm to obtain a face tracking key point of the face detection key point in the second image.
8. An apparatus for training a face detection network, comprising:
the image acquisition module is used for acquiring a first image and a second image which are acquired successively, wherein the first image and the second image both comprise a target face;
the network detection module is used for respectively inputting the first image and the second image into a preliminary detection network to obtain face detection results of the target face in the first image and the second image, wherein the face detection results comprise a plurality of face detection key points, and the preliminary detection network is obtained after an original detection network is trained;
the face tracking module is used for performing key point tracking on each face detection key point of the target face in the first image to obtain a face tracking key point of the face detection key point in the second image;
the loss calculation module is used for calculating network loss according to each face detection key point of the target face in the second image and the face tracking key point of the face detection key point in the second image;
and the target training module is used for training the preliminary detection network according to the network loss to obtain a target detection network.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the training method as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, having stored thereon computer program instructions, which, when read and executed by a processor, perform the steps of the training method of any one of claims 1 to 7.
