CN108664893B

CN108664893B - Face detection method and storage medium

Info

Publication number: CN108664893B
Application number: CN201810290187.1A
Authority: CN
Inventors: 黄海清; 王金桥; 陈盈盈; 刘智勇; 郑碎武; 杨旭; 黄志明; 谢德坤; 田�健
Original assignee: Fujian Haijing Technology Development Co ltd
Current assignee: Fujian Haijing Technology Development Co ltd
Priority date: 2018-04-03
Filing date: 2018-04-03
Publication date: 2022-04-29
Anticipated expiration: 2038-04-03
Also published as: CN108664893A

Abstract

A face detection method and a storage medium, the method includes the following steps: 102, respectively inputting a batch of same training images to a lightweight network and a complex network; 104, filtering output results of classification graphs of the lightweight network and the complex network by adopting a difficult sample mining method; 106, constructing a comprehensive loss function, wherein the comprehensive loss function comprises a knowledge distillation loss function or a label-based face detection loss function, and the knowledge distillation loss function is obtained according to the output results of the classification images of the lightweight network and the complex network; step 108, updating parameters of the lightweight network based on the loss function, and not updating parameters of the complex network; and step 110, repeating the steps until the lightweight network is trained to be converged. The neural network algorithm for quickly adjusting the parameters is provided, and the problem of face detection of a light neural network is solved.

Description

Face detection method and storage medium

Technical Field

The invention belongs to the technical field of image processing and pattern recognition, and particularly relates to a human face detection method which can be applied to the fields of safety monitoring, human-computer interaction and the like.

Background

Face detection is an important technology that is required in many computer vision applications, such as face tracking, face alignment, face recognition, etc. In recent years, due to the development of convolutional neural networks, the performance of face detection is obviously improved. However, existing face detection models are typically slow to compute because they require a relatively large neural network to maintain good face detection performance. Although detection frameworks based on one-step methods are also proposed to speed up detection (e.g. SSD, YOLO), they are still not fast enough for practical application scenarios, especially in CPU-based environments. On the other hand, if the speed requirement is met by reducing the parameters of the convolutional network, the performance of the detector will be significantly degraded. Therefore, it is a very challenging task to obtain a light-weight face detector with good performance.

Knowledge distillation (knowledge distillation) is a technique that enables small networks to mimic learning large networks, thereby improving the performance of small networks. The effectiveness of knowledge distillation has been validated on classification and metric learning tasks. For the detection task, there is no way to directly use the original knowledge distillation (knowledge distillation) technology, because the output of the detector has the problem of unbalanced classes (the background class is much more than other classes), and if only the analog learning of all the outputs like the classification task is performed, a good performance cannot be obtained. Most lightweight detectors are based on a one-step approach, rather than a two-step approach, because of the speed advantage of the former. Compared with the two-step method, the one-step method lacks a regional nomination network for eliminating negative samples, so the problem of category imbalance is more serious.

Disclosure of Invention

Therefore, a novel neural network algorithm which can be suitable for a one-step method to improve the performance of a lightweight detection model and quickly adjust parameters needs to be provided, and the problem of face detection of a lightweight neural network is solved. In the present invention, the inventor provides a face detection method, which includes the following steps:

102, respectively inputting a batch of same training images to a lightweight network and a complex network;

104, filtering output results of classification graphs of the lightweight network and the complex network by adopting a difficult sample mining method;

106, constructing a comprehensive loss function, wherein the comprehensive loss function comprises a knowledge distillation loss function or a label-based face detection loss function, and the knowledge distillation loss function is obtained according to the output results of the classification images of the lightweight network and the complex network;

step 108, updating parameters of the lightweight network based on the loss function, and not updating parameters of the complex network;

and step 110, repeating the steps until the lightweight network is trained to be converged.

Preferably, the filtering by the difficult sample mining method specifically comprises:

setting a threshold T for judging whether a certain probability in the classification map has enough confidence; t is a hyper-parameter, the value range is 0 to 1, each index in the classification graph is traversed, and when the probability of the index in the lightweight network is greater than T and the probability in the complex network is less than T, the index is added into the set S_m(ii) a Or, when the index has a probability less than T in the lightweight network and greater than T in the complex network, the index is also added to S_m。

Alternatively, the knowledge distillation loss function is:

wherein p is⁽ⁱ⁾Is the ith probability score, q, in the classification map of the complex network⁽ⁱ⁾It is the ith probability score of the lightweight network classification graph.

Further, the air conditioner is provided with a fan,

the label-based face detection loss function is as follows:

L_G＝L_cls+L_reg

wherein L is_clsIs a class two Softmax loss function for classification, L_regIs a robust regression loss function for regression;

the comprehensive loss function is weighted by a knowledge distillation loss function and a face detection loss function based on a label:

L＝L_G+cL_KD

c is the equilibrium coefficient.

Specifically, the method further comprises the steps of constructing a lightweight network and a complex network;

constructing a face detection model based on a convolutional neural network, taking the face detection model as a complex network, and training the complex network to be convergent;

and constructing a face detection model of a convolutional neural network with the same framework as the complex network to serve as a lightweight network, wherein the number of each layer of filter in the framework of the lightweight network is smaller than that of the complex network.

A knowledge-based distillation face detection storage medium storing a computer program which, when executed, performs the steps of:

Specifically, the filtering performed by the difficult sample mining method specifically comprises the following steps:

Alternatively, the knowledge distillation loss function is:

Further, the air conditioner is provided with a fan,

the label-based face detection loss function is as follows:

L_G＝L_cls+L_reg

L＝L_G+cL_KD

c is the equilibrium coefficient.

Specifically, the computer program further performs steps when executed to construct a lightweight network, a complex network;

Different from the prior art, the technology adopts a standardized calculation means, indexes which are not available in the original evaluation system are introduced in the whole evaluation process, and parameters are unified, so that the quantitative standards are relatively unified, and therefore, the method solves the problem of real-time analysis of network public opinion dynamics.

Drawings

Fig. 1 is a flowchart of a face detection method according to an embodiment.

Detailed Description

To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.

In the embodiment shown in fig. 1, we can see a face detection method, which includes the following steps:

and step 100, constructing a face detection model based on a convolutional neural network as a teacher network, and training the model until convergence.

The framework of teacher networks is usually the same as student networks, but the number of filters per layer is several times that of student networks, so the performance is better. In order to make the complexity of the conventional convolutional neural network simple, in this context, the teacher network can be replaced with a complex network, and the student network can also be replaced with a lightweight network, which is characterized by being a face detection model of the convolutional neural network in the same frame as the complex network, and the number of each layer of filter in the frame of the lightweight network is smaller than that of the complex network. The teacher's network training method is the same as the conventional detection model, and taking the invention as an example, the loss function of the teacher's network is L_G＝L_cls+L_reg. Wherein L is_clsIs a class two Softmax loss function for classification, L_regIs a robust regression loss function (smooth L) for regression₁). Constructing a lightweight network and a complex network;

the student network is the detection model to be obtained finally, and the parameters of the student network are initialized randomly by using the Xavier method.

In other specific embodiments, the present invention may perform the above preparation steps in advance and directly start with step 102, and input a batch of the same training images for the lightweight network and the complex network, respectively;

the training image may not be processed, or a data augmentation technique may be performed therein, specifically as follows:

for each training image input, a data augmentation technique is used, thereby increasing the generalization performance of the model. Taking the present invention as an example, data augmentation comprises the steps of:

(1) color dithering operation: parameters such as brightness, contrast, saturation, and the like of the training image are randomly adjusted with a probability of 0.5, respectively.

(2) Random clipping operation: on this training image, 5 square sub-images were randomly cropped. Wherein 1 is the largest square sub-image, and the side length of the other 4 square sub-images is 0.3-1.0 times of the short side of the training image. Of these 5 square sub-images, 1 piece was randomly selected as the final training sample.

(3) And (3) horizontal turning operation: for this selected training sample, the horizontal flipping operation was randomly performed with a probability of 0.5.

(4) Scale transformation operation: and scaling the training samples obtained through the operation to 1024 × 1024 sizes, and sending the training samples into a network for training.

Step 104, filtering by adopting a difficult sample mining method aiming at the output results of the classification maps (classification maps) of the lightweight network and the complex network; thereby solving the problems of unbalanced classification and low fitting efficiency.

The knowledge distillation (knowledge distillation) method is intended for student networks to output the same result as a teacher's network as much as possible by imitating the teacher's network. In the neural network, the information of the later layers is more closely related to the final prediction result, and better supervision information can be provided for simulating learning. Therefore, the last layer is suitable for the students to imitate the network learning. In the single-step method-based face detection framework, the last layer has two modules, namely a classification map and a regression map. The knowledge distillation is effective, and provides the soft label information learned by the teacher through the network to the student network, and compared with the original label, the soft labels have more accurate and smooth information, thereby being more beneficial to network learning. In the face detection, the labeling label of the regression frame is originally a real number and is relatively accurate; the label labels of the classification task are only 0 and 1, and are not very accurate. Therefore, the classification map is more suitable for knowledge distillation (knowledge distillation) learning.

A typical classification map (classification map) based on the one-step method has an output size of 2N × H × W, where N is the number of anchor blocks, 2 indicates the probability that each anchor block needs to predict the positive class and the negative class, H is the height of the classification map, and W is its width. Since the probabilities of the positive and negative classes are normalized and the sum is always 1, only the probability of the positive class can be focused when performing knowledge distillation, and thus the output of the classification map can be simplified to N × H × W. In the training process, the teacher network and the student network output the result of a classification map respectively, and for the two results, it needs to determine which indexes of the classification map should be filtered and which indexes can be used for knowledge distillation (knowledge distillation).

Then, step 106 is further performed to construct a synthetic loss function, which includes a knowledge distillation loss function or a label-based face detection loss function, and the knowledge distillation loss function is obtained according to the output results of the classification diagrams of the lightweight network and the complex network. In some optional embodiments, the knowledge distillation loss function is constructed to make the classification map (classification map) result of the current student network as close as possible to the classification map of the teacher network, and as a specific embodiment, the knowledge distillation loss function is:

Further, during the training process, besides the knowledge distillation (knowledge distillation) loss function, there is also a conventional label-based face detection loss function, which is consistent with the region nomination network in the classical detection framework, fast RCNN:

the label-based face detection loss function is as follows:

L_G＝L_cls+L_reg

wherein L is_clsIs a class two Softmax loss function for classification, L_regIs a robust regression loss function for regression; during training, the loss function based on knowledge distillation (knowledge distillation) and the loss function based on the label are added to form a final comprehensive loss function.

The comprehensive loss function is a knowledge distillation loss function and:

L＝L_G+cL_KD

c is a coefficient for balancing two loss functions, fixed at 50 in the present invention, and the optimal value should be determined by a specific scenario.

As shown in fig. 1, step 108 is then performed to update the parameters of the lightweight network and not to update the parameters of the teacher network based on the loss function. In this step, parameters of the student network are updated by using a back propagation algorithm according to the obtained comprehensive loss function, so that one training is completed. The parameters of the teacher's network do not need to be updated and therefore need to be frozen during training.

Step 110, repeat the above steps 102-108 until the lightweight network training is converged.

By applying the design of the steps, the precision of the lightweight face detector can be effectively improved, so that the face detection can achieve satisfactory detection effect on equipment with limited computing resources. Since the detection model and the classification and metric learning model are different in network structure, a knowledge distillation (knowledge distillation) method cannot be directly used for the detection task. Invention of the inventionPeople find that the regression graph in the face detection model based on the single-step method does not have enough effective information for learning, and the classification map (classification map) can provide effective soft label information, so that the classification map (classification map) is used as a medium for the student network and teacher network to transfer knowledge. In addition, a large number of negative class samples exist in the output result of the classification map (classification map), which causes the problem of class imbalance. The invention provides a method for mining a difficult sample, which is used for filtering a simple negative sample, so that the category is balanced, and meanwhile, a simple positive sample is filtered, so that the knowledge distillation (knowledge distillation) efficiency is higher. During training, the loss function based on knowledge distillation (knowledge distillation) and the loss function based on the label are added in a proper proportion to form a complete loss function. In specific implementation, the method also comprises the following steps of inputting the test image into the trained student network model and outputting a detection result frame. Since the number of output test frames is very large, they need to be screened and combined. In this embodiment, first, most detection frames are screened out by the confidence threshold T being 0.05, and then the top N is selected according to the confidence_a400 detection frames. Then using non-maximum value to inhibit and remove repeated detection frame, and selecting top N according to confidence coefficient_bAnd (5) obtaining the final detection result by 200 detection frames.

Finally, the knowledge distillation (knowledge distillation) -based training method provided by the invention can effectively improve the detection capability of the lightweight face detection model.

Alternatively, the knowledge distillation loss function is:

Further, the air conditioner is provided with a fan,

the label-based face detection loss function is as follows:

L_G＝L_cls+L_reg

the comprehensive loss function is a knowledge distillation loss function and:

L＝L_G+cL_KD

c is the equilibrium coefficient.

and constructing a face detection model of a convolutional neural network with the same framework as the complex neural network to serve as a lightweight network, wherein the number of each layer of filter in the framework of the lightweight network is smaller than that of the complex network.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrases "comprising … …" or "comprising … …" does not exclude the presence of additional elements in a process, method, article, or terminal that comprises the element. Further, herein, "greater than," "less than," "more than," and the like are understood to exclude the present numbers; the terms "above", "below", "within" and the like are to be understood as including the number.

As will be appreciated by one skilled in the art, the above-described embodiments may be provided as a method, apparatus, or computer program product. These embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. All or part of the steps in the methods according to the embodiments may be implemented by a program instructing associated hardware, where the program may be stored in a storage medium readable by a computer device and used to execute all or part of the steps in the methods according to the embodiments. The computer devices, including but not limited to: personal computers, servers, general-purpose computers, special-purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, intelligent home devices, wearable intelligent devices, vehicle-mounted intelligent devices, and the like; the storage medium includes but is not limited to: RAM, ROM, magnetic disk, magnetic tape, optical disk, flash memory, U disk, removable hard disk, memory card, memory stick, network server storage, network cloud storage, etc.

The various embodiments described above are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer apparatus to produce a machine, such that the instructions, which execute via the processor of the computer apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer apparatus to cause a series of operational steps to be performed on the computer apparatus to produce a computer implemented process such that the instructions which execute on the computer apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Although the embodiments have been described, once the basic inventive concept is obtained, other variations and modifications of these embodiments can be made by those skilled in the art, so that the above embodiments are only examples of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes using the contents of the present specification and drawings, or any other related technical fields, which are directly or indirectly applied thereto, are included in the scope of the present invention.

Claims

1. A face detection method is characterized by comprising the following steps:

step 100, constructing a face detection model based on a convolutional neural network as a teacher network, and training the model until convergence;

the framework of the teacher network is the same as that of the student network, but the number of the filters on each layer is several times that of the student network, the teacher network and the complex network can be replaced mutually, the student network can also be replaced with a light weight network, the light weight network is characterized in that the light weight network is a face detection model of a convolutional neural network with the same framework as the complex network, and the number of the filters on each layer in the framework of the light weight network is smaller than that of the complex network;

102, respectively inputting a batch of same training images to a lightweight network and a complex network; 104, aiming at output results of classification graphs of the light-weight network and the complex network, filtering by adopting a difficult sample mining method, so that the problems of unbalanced classification and low fitting efficiency are solved;

the output size of a typical classification map based on a single-step method is 2 NxHxW, wherein N is the number of anchor points, 2 represents the probability that each anchor point needs to predict a positive class and a negative class, H is the height of the classification map, and W is the width of the classification map, because the probabilities of the positive class and the negative class are standardized and are always added to be 1, only the probability of the positive class is concerned during knowledge distillation, so the output of the classification map is simplified to be NxHxW, in the training process, a teacher network and a student network respectively output the result of one classification map, and for the two results, it needs to be decided which indexes of the classification map should be filtered and which indexes are used for knowledge distillation;

step 110, repeating the above steps until the lightweight network is trained to be converged;

through the steps, the precision of the lightweight face detector can be effectively improved, the face detection can also obtain a satisfactory detection effect on equipment with limited computing resources, because the detection model and the classification and measurement learning model are different in network structure, the knowledge distillation method cannot be directly used for detection tasks, the classification map can provide effective soft label information, the classification map is used as a medium for transferring knowledge of a student network and a teacher network, a large number of negative samples exist in the output result of the classification map, and a difficult sample mining method is used for filtering simple negative samples, so that the classification is balanced, and meanwhile, simple positive samples are filtered, and the knowledge distillation efficiency is higher;

during training, adding a loss function based on knowledge distillation and a loss function based on a label in a proper proportion to form a complete loss function, inputting a test image into a trained student network model, outputting detection result frames, screening and combining the output detection frames due to the fact that the number of the output detection frames is very large, screening most detection frames through a confidence threshold value T of 0.05, selecting 400 detection frames before N _ a according to confidence, then removing repeated detection frames through non-maximum inhibition, and selecting 200 detection frames before N _ b according to confidence, so as to obtain a final detection result; finally, the knowledge distillation-based training method can effectively improve the detection capability of the lightweight face detection model.

2. The face detection method of claim 1, wherein the filtering by the hard sample mining method specifically comprises:

setting a threshold T for judging whether a certain probability in the classification map has enough confidence; t is a hyper-parameter, the value range is0 to 1, each index in the classification graph is traversed, and when the probability of the index in the lightweight network is greater than T and the probability in the complex network is less than T, the index is added into the set S_m(ii) a Or, when the index has a probability less than T in the lightweight network and greater than T in the complex network, the index is also added to S_m。

3. The face detection method according to claim 2,

the knowledge distillation loss function is:

4. The face detection method according to claim 2 or 3,

the label-based face detection loss function is as follows:

L_G＝L_cls+L_reg

L＝L_G+cL_KD

c is the equilibrium coefficient.

5. A face detection storage medium, storing a computer program that, when executed, performs the steps of:

6. The storage medium of claim 5, wherein the filtering by the hard sample mining method is specifically:

7. The face detection storage medium of claim 6,

the knowledge distillation loss function is:

8. The face detection storage medium of claim 6 or 7,

the label-based face detection loss function is as follows:

L_G＝L_cls+L_reg

L＝L_G+cL_KD

c is the equilibrium coefficient.