CN108427941B - Method for generating face detection model, face detection method and device - Google Patents

Method for generating face detection model, face detection method and device

Info

Publication number
CN108427941B
CN108427941B (application CN201810307489.5A)
Authority
CN
China
Prior art keywords
face
detection model
loss value
face detection
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810307489.5A
Other languages
Chinese (zh)
Other versions
CN108427941A (en)
Inventor
何泽强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810307489.5A priority Critical patent/CN108427941B/en
Publication of CN108427941A publication Critical patent/CN108427941A/en
Application granted granted Critical
Publication of CN108427941B publication Critical patent/CN108427941B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a method for generating a face detection model, a face detection method, and a face detection device. One embodiment of the method comprises: acquiring an initial face detection model and using the acquired initial face detection model as the current face detection model; acquiring a sample set and selecting samples from the sample set; and performing the following training steps: inputting the sample set into the current face detection model, which has a plurality of convolutional layers, and selecting a plurality of target feature maps; determining a face region loss value in each target feature map; determining a target loss value based on the face region loss value; determining a total loss value; and back-propagating the total loss value in the current face detection model to obtain an updated face detection model. By back-propagating the total loss value through the current face detection model, the embodiment improves the accuracy and the recall rate of the generated face detection model.

Description

Method for generating face detection model, face detection method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method for generating a face detection model, a face detection method and a face detection device.
Background
Face detection is an important link in automatic recognition systems, and the technology is increasingly widely applied. It generally refers to searching any given image with a certain strategy to determine whether the image contains a human face; if a face is contained, its position, size, posture and the like can be returned.
Existing face detection is carried out by a trained neural network: an image is input into the neural network, which outputs the face detection result for that image.
Disclosure of Invention
The embodiment of the application provides a method for generating a face detection model, a face detection method and a face detection device.
In a first aspect, an embodiment of the present application provides a method for generating a face detection model, including: acquiring an initial face detection model, and taking the acquired initial face detection model as a current face detection model; acquiring a sample set, wherein samples in the sample set comprise sample face images and marking information, and the marking information is used for marking faces contained in the sample face images; adopting a sample set, and executing the following training steps on the current face detection model: inputting a sample set into a current face detection model with a plurality of convolution layers, and selecting a plurality of feature maps determined by different convolution layers as a plurality of target feature maps; for each target feature map in the plurality of target feature maps, determining a face region loss value between a face region and a face label in the label information in the target feature map; determining a target loss value based on the face region loss value; determining the weighted sum of a plurality of target loss values corresponding to the plurality of target feature maps as a total loss value; the total loss value is reversely propagated in the current face detection model to update the parameters of the current face detection model, so that an updated face detection model is obtained; and determining the updated face detection model as the generated face detection model in response to the fact that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
In some embodiments, for each target feature map of the plurality of target feature maps, determining a face region loss value between the face region and the face label in the label information in the target feature map comprises: determining the probability that the face position corresponding to the face region in the feature map in the sample face image contains a face; determining the deviation of the face position and the face marked by the marking information of the sample face image; based on the determined probabilities and deviations, a face region loss value between the face region and the face label in the label information is determined.
In some embodiments, determining the target loss value based on the face region loss value comprises: determining a target loss value based on a weighted sum of the face region loss value and at least one of: a head region loss value and a torso region loss value.
In some embodiments, the labeling information is also used for labeling the head included in the sample face image; the head region loss value is determined by the following steps: determining the probability that the head position corresponding to the head region in the sample face image contains the head in the feature image; determining the deviation of the head position and the head marked by the marking information of the sample face image; and determining a head region loss value between the head region and the head label in the label information based on the probability and the deviation corresponding to the determined head position.
In some embodiments, the labeling information is also used for labeling a torso contained in the sample face image; the trunk area loss value is determined by the following steps: determining the probability that the corresponding trunk position of the trunk area in the sample face image contains the trunk in the feature image; determining the deviation between the trunk position and the trunk marked by the marking information of the sample face image; and determining a torso region loss value between the torso region and the torso label in the label information based on the probability and the deviation corresponding to the determined torso position.
In some embodiments, the training step further comprises: and taking the updated face detection model as the current face detection model and continuing to execute the training step in response to the fact that the total loss value corresponding to the updated face detection model is not less than the preset loss value threshold.
In a second aspect, an embodiment of the present application provides a face detection method, including: acquiring a target face image; inputting a target face image into a pre-trained face detection model to obtain a face area; the pre-trained face detection model is a current face detection model generated by the method of any one of the first aspect.
In a third aspect, an embodiment of the present application provides an apparatus for generating a face detection model, including: an acquisition unit configured to acquire an initial face detection model and use the acquired initial face detection model as a current face detection model; a sample acquisition unit configured to acquire a sample set, where samples in the sample set include a sample face image and labeling information, and the labeling information is used for labeling faces contained in the sample face image; and a training unit, the training unit comprising: a selecting subunit configured to input the sample set into the current face detection model having a plurality of convolution layers, and select a plurality of feature maps determined by different convolution layers as a plurality of target feature maps; a loss value determining subunit configured to determine, for each of the plurality of target feature maps, a face region loss value between the face region in the target feature map and the face label in the labeling information, and determine a target loss value based on the face region loss value; a total loss determining subunit configured to determine a weighted sum of the plurality of target loss values corresponding to the plurality of target feature maps as a total loss value; an updating subunit configured to back-propagate the total loss value in the current face detection model to update parameters of the current face detection model, so as to obtain an updated face detection model; and a generating subunit configured to determine the updated face detection model as the generated face detection model in response to the total loss value corresponding to the updated face detection model being smaller than a preset loss value threshold.
In some embodiments, the loss value determining subunit is further configured to: determining the probability that the face position corresponding to the face region in the feature map in the sample face image contains a face; determining the deviation of the face position in the feature map and the face marked by the marking information of the sample face image; based on the determined probabilities and deviations, a face region loss value between the face region and the face label in the label information is determined.
In some embodiments, the loss value determining subunit is further configured to: determining a target loss value based on a weighted sum of the face region loss value and at least one of: a head region loss value and a torso region loss value.
In some embodiments, the labeling information is also used for labeling the head included in the sample face image; the head region loss value is determined by the following steps: determining the probability that the head position corresponding to the head region in the sample face image contains the head in the feature image; determining the deviation of the head position and the head marked by the marking information of the sample face image; and determining a head region loss value between the head region and the head label in the label information based on the probability and the deviation corresponding to the determined head position.
In some embodiments, the labeling information is also used for labeling a torso contained in the sample face image; the trunk area loss value is determined by the following steps: determining the probability that the corresponding trunk position of the trunk area in the sample face image contains the trunk in the feature image; determining the deviation between the trunk position and the trunk marked by the marking information of the sample face image; and determining a torso region loss value between the torso region and the torso label in the label information based on the probability and the deviation corresponding to the determined torso position.
In some embodiments, the training unit further comprises: and the model updating subunit is configured to respond that the total loss value corresponding to the updated face detection model is not less than a preset loss value threshold value, use the updated face detection model as a current face detection model, and input the current face detection model into the training unit.
In a fourth aspect, an embodiment of the present application provides a face detection apparatus, including: an image acquisition unit configured to acquire a target face image; the region determining unit is configured to input a target face image into a pre-trained face detection model to obtain a face region; the pre-trained face detection model is the current face detection model generated by the apparatus for generating a face detection model according to any one of the third aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the embodiments of the method for generating a face detection model.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the method as in any of the embodiments of the method for generating a face detection model.
In a seventh aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement a method such as any of the embodiments of the face detection method.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method as in any of the embodiments of the face detection method.
The method and the device for generating the face detection model provided by the embodiment of the application first obtain an initial face detection model and use the obtained initial face detection model as the current face detection model. A sample set is acquired, where the samples in the sample set comprise sample face images and labeling information, and the labeling information is used for labeling the faces contained in the sample face images. The following training steps are performed: step 1, inputting the sample set into the current face detection model with a plurality of convolutional layers, and selecting a plurality of feature maps determined by different convolutional layers as a plurality of target feature maps; step 2, for each target feature map in the plurality of target feature maps, determining a face region loss value between the face region in the target feature map and the face label in the labeling information; step 3, determining a target loss value based on the face region loss value; step 4, determining the weighted sum of the plurality of target loss values corresponding to the plurality of target feature maps as the total loss value, and back-propagating the total loss value in the current face detection model to update the parameters of the current face detection model, thereby obtaining an updated face detection model; step 5, in response to the total loss value corresponding to the updated face detection model being smaller than a preset loss value threshold, determining the updated face detection model as the generated face detection model. By back-propagating, through the current face detection model, the weighted sum of the target loss values corresponding to the target feature maps, the embodiment of the application improves the accuracy and the recall rate of the generated face detection model.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating a face detection model according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for generating a face detection model according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating a face detection model according to the present application;
FIG. 5 is a flow diagram of one embodiment of a face detection method according to the present application;
FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for generating a face detection model according to the present application;
FIG. 7 is a schematic structural diagram of an embodiment of a face detection apparatus according to the present application;
FIG. 8 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for generating a face detection model or apparatus for generating a face detection model may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminals 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminals 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminals 101, 102, 103 may be installed with various communication client applications, such as a photo processing application, a face recognition application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
Here, the terminals 101, 102, and 103 may be hardware or software. When the terminals 101, 102, 103 are hardware, they may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminals 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module; this is not specifically limited herein.
When the terminals 101, 102, 103 are hardware, an image capturing device may be further mounted thereon. The image acquisition device can be various devices capable of realizing the function of acquiring images, such as a camera, a sensor and the like. The user can use the image acquisition device on the terminal 101, 102, 103 to acquire the face image of the user or other people.
The server 105 may be a server providing various services, such as a background server providing support for pictures displayed on the terminals 101, 102, 103. The background server may analyze and otherwise process the received data such as the sample set, and feed back a processing result (e.g., the generated face detection model) to the terminal device.
It should be noted that the method for generating the face detection model provided in the embodiment of the present application may be executed by the server 105 or the terminals 101, 102, 103, and accordingly, the apparatus for generating the face detection model is generally disposed in the server 105 or the terminals 101, 102, 103.
It should be understood that the number of terminals, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a face detection model according to the present application is shown. The method for generating the face detection model comprises the following steps:
step 201, obtaining an initial face detection model, and using the obtained initial face detection model as a current face detection model.
In this embodiment, an executing subject (e.g., a server shown in fig. 1) of the method for generating a face detection model may acquire an initial face detection model. And, the obtained initial face detection model is taken as the current face detection model. The face detection model is used for detecting faces in the images.
In practice, the initial face detection model may be an existing variety of convolutional neural network models created based on machine learning techniques. The convolutional neural network model may have various existing convolutional neural network structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.). The storage location of the current face detection model is not limited in this application.
Step 202, a sample set is obtained, wherein samples in the sample set include a sample face image and annotation information, and the annotation information is used for annotating a face included in the sample face image.
In this embodiment, the executing entity may obtain a sample set and select a sample from the sample set. The labeling information is information labeling the exact face position contained in the sample face image. For example, a rectangular box may be used to delimit the position of the face. Specifically, the delimited position can be represented by the coordinates of at least one pair of diagonal corners of the box; for example, the face label can be represented as (x1, y1, x2, y2).
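As an illustration only, the following minimal Python sketch shows one way such a sample and its box annotations might be represented; the field names and the (x1, y1, x2, y2) corner convention are assumptions for this sketch, not part of the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A bounding box given by one pair of diagonal corners: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]

@dataclass
class Sample:
    """One element of the sample set: a sample face image plus its labeling information."""
    image_path: str                                        # path to the sample face image
    face_boxes: List[Box] = field(default_factory=list)    # labeled faces
    head_boxes: List[Box] = field(default_factory=list)    # optional head labels
    torso_boxes: List[Box] = field(default_factory=list)   # optional torso labels

# Example: one face labeled by its top-left and bottom-right corners.
sample = Sample(image_path="img_0001.jpg", face_boxes=[(34.0, 52.0, 118.0, 160.0)])
```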
The following training step 203 is performed:
in the present embodiment, step 203 is decomposed into 6 sub-steps, namely sub-step 2031, sub-step 2032, sub-step 2033, sub-step 2034, sub-step 2035, and sub-step 2036.
Substep 2031, inputting the sample set into a current face detection model with multiple convolutional layers, and selecting multiple feature maps determined by different convolutional layers as multiple target feature maps.
In the sub-step, the execution subject inputs the sample set into a current face detection model with a plurality of convolution layers, and can obtain some feature maps determined by different convolution layers through the current face detection model. Then, the execution body may select a plurality of feature maps from the feature maps as a plurality of target feature maps.
The feature map is an image representing features of the face image. For example, the feature map may include location information of the face in the image, such as a face region (x1, y1, x2, y2). Different convolutional layers in the current face detection model are arranged in sequence: the feature map obtained by a preceding convolutional layer serves as the input of the next convolutional layer, which further processes it to obtain a new feature map. The feature maps obtained through successive convolutional layers therefore become smaller and smaller in size, and different feature maps are produced by different convolutional layers. In the feature map obtained by each convolutional layer, the value of each pixel is generated by performing the convolution operation on the image input into that layer, so each pixel of the feature map reflects the features of a certain position of the image input into the face detection model.
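The following minimal PyTorch-style sketch illustrates how feature maps of decreasing size can be collected from successive convolutional layers and how several of them can be kept as target feature maps; the layer count, channel sizes, and the choice of which layers to tap are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stacked convolutional layers; each stride-2 layer halves the feature-map size."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(3, 16, 3, stride=2, padding=1),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
        ])

    def forward(self, x):
        feature_maps = []
        for layer in self.layers:
            x = torch.relu(layer(x))   # the output of one layer feeds the next
            feature_maps.append(x)     # keep every intermediate feature map
        return feature_maps

backbone = TinyBackbone()
images = torch.randn(8, 3, 256, 256)               # a batch of sample face images
maps = backbone(images)                             # spatial sizes: 128, 64, 32, 16
target_feature_maps = [maps[1], maps[2], maps[3]]   # select maps from different layers
```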
Sub-step 2032, for each target feature map of the plurality of target feature maps, determining a face region loss value between the face region in the target feature map and the face label in the label information.
In this substep, for each of the plurality of target feature maps, the execution subject determines a face region loss value between the face region in the target feature map and the face label in the labeling information. For example, the face region and the face label may be used as parameters and input into a specified loss function, so that the loss value between the two can be calculated.
In practice, the loss function is typically used to measure how different the predicted values (e.g., face regions) of the model are from the true values (e.g., face labels). In general, the loss function is a non-negative real-valued function. The loss function may be set according to actual requirements.
In some optional implementations of this embodiment, sub-step 2032 may comprise the steps of:
determining the probability that the face position corresponding to the face region in the feature map in the sample face image contains a face;
determining the deviation of the face position and the face marked by the marking information of the sample face image;
based on the determined probabilities and deviations, a face region loss value between the face region and the face label in the label information is determined.
In this embodiment, the probability may be obtained and output from the current face detection model. The annotation information here is the annotation information of the sample face image.
In particular, the annotated face may be considered the true value. The face position delimited by the face detection model can hardly be completely consistent with the true value, so in general there is a deviation between the face position corresponding to the face region in the sample face image and the true value.
Specifically, the executing body may determine the respective loss values using the loss function described above: the probability may be used to determine a confidence loss, and the deviation may be used to determine a localization loss. The weighted sum of the confidence loss and the localization loss is then determined as the feature loss value of the corresponding part. For example, the execution subject may determine the confidence loss from the probability that the face position corresponding to the face region in the sample face image contains a face, and determine the localization loss from the deviation between the face position and the labeled face. Then, using the preset weights of the confidence loss and the localization loss, the execution subject computes their weighted sum for the face region and takes it as the face feature loss value.
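A minimal sketch of such a face region loss is given below, assuming cross-entropy for the confidence term and a smooth-L1 loss for the localization term; these loss choices and the weights are assumptions, not a required implementation.

```python
import torch
import torch.nn.functional as F

def face_region_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                     w_conf=1.0, w_loc=1.0):
    """Weighted sum of a confidence loss and a localization loss.

    pred_logits: (N, 2) face / non-face scores for N predicted regions
    pred_boxes:  (N, 4) predicted (x1, y1, x2, y2)
    gt_labels:   (N,)   1 if the region matches a labeled face, else 0
    gt_boxes:    (N, 4) matched ground-truth boxes
    """
    # Confidence loss: how likely the predicted position is to contain a face.
    conf_loss = F.cross_entropy(pred_logits, gt_labels)
    # Localization loss: deviation between the predicted position and the labeled face,
    # computed only over regions that actually match a face.
    pos = gt_labels == 1
    if pos.any():
        loc_loss = F.smooth_l1_loss(pred_boxes[pos], gt_boxes[pos])
    else:
        loc_loss = pred_boxes.sum() * 0.0
    return w_conf * conf_loss + w_loc * loc_loss
```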
Sub-step 2033, for each target feature map of the plurality of target feature maps, determining a target loss value based on the face region loss value.
In this sub-step, the execution subject determines a target loss value based on the face region loss value of the target feature map. Specifically, the execution subject may directly determine the face region loss value as the target loss value. Alternatively, the target loss value may be calculated by inputting the loss value of the face region into a preset formula or model, or by multiplying the loss value by a preset coefficient.
Substep 2034 determines a weighted sum of the target loss values corresponding to the target feature maps as a total loss value.
In this substep, the execution agent weights a plurality of target loss values corresponding to the plurality of target feature maps to obtain a weighted sum, and determines the weighted sum as a total loss value. The target loss value corresponding to the target feature map is the target loss value determined based on the face region in the target feature map.
Specifically, different weights may be set in advance for the respective target loss values according to actual conditions.
In some optional implementations of this embodiment, the larger the size of a target feature map, the greater its weight.
In this embodiment, a feature map with a larger size contains richer feature information, while a feature map with a smaller size contains less information because of the repeated processing. By giving a larger weight to the target feature map with the larger size, the richer information can be used to improve the accuracy and the recall rate of the face detection model.
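For illustration, a minimal sketch of the weighted combination of the per-feature-map target losses follows; weighting each loss by its feature map's area is just one assumed way of giving larger maps larger weights.

```python
def total_loss(target_losses, feature_map_sizes):
    """Combine per-feature-map target losses into the total loss.

    target_losses: list of scalar loss tensors, one per target feature map.
    feature_map_sizes: list of (height, width) for the same feature maps.
    The larger the feature map, the larger its weight (here: proportional to area).
    """
    areas = [h * w for h, w in feature_map_sizes]
    weights = [a / sum(areas) for a in areas]      # bigger map, bigger weight
    return sum(w * loss for w, loss in zip(weights, target_losses))
```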
Substep 2035, propagating the total loss value in the current face detection model in the reverse direction to update the parameters of the current face detection model, and obtaining the updated face detection model.
In this sub-step, the executing body may perform back propagation of the total loss value in the current face detection model and update the parameters of the face detection model to obtain an updated face detection model. The parameters may be various parameters in the current face detection model, such as the values of the components of a convolution kernel. When training the current face detection model, the executing body does not consider only a single feature map but uses a plurality of feature maps as influencing factors of the current face detection model. The executing body can train the current face detection model with back propagation so that the loss value corresponding to the trained face detection model is minimized, which minimizes the deviation between the face position delimited by the resulting face detection model and the true value and thus achieves more accurate face detection.
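A minimal sketch of one such back-propagation update is shown below; the stand-in model, the SGD optimizer, and the learning rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

# A minimal stand-in for the current face detection model (assumption).
model = nn.Conv2d(3, 8, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def backprop_step(total_loss_value):
    """One back-propagation update: the total loss is propagated backwards through
    the current face detection model and its parameters (e.g. convolution-kernel
    values) are updated by the optimizer."""
    optimizer.zero_grad()        # clear gradients left from the previous step
    total_loss_value.backward()  # back-propagate the total loss
    optimizer.step()             # update the model parameters
```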
Sub-step 2036, in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold, determining the updated face detection model as the generated face detection model.
In this sub-step, the executing agent may determine that the current face detection model has been trained and determine the updated face detection model as the generated face detection model in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold. The total loss value corresponding to the updated face detection model is the total loss value of the target loss values corresponding to the target feature maps obtained by the updated face detection model.
Specifically, by comparing the total loss value with a preset loss value threshold, it can be determined that the training of the face detection model is completed under various conditions. As an example, if multiple samples are selected in step 202, the subject may determine that the face detection model training is complete if the total loss value of each sample is less than the loss value threshold. For another example, the execution subject may count the proportion of the sample set occupied by the sample with the total loss value smaller than the loss value threshold. And when the proportion reaches a preset sample proportion (such as 95 percent), the face detection model can be determined to be trained.
A preset loss value threshold can generally be used to represent an acceptable degree of inconsistency between a predicted value (i.e., a face region) and a true value (a face label). That is, when the total loss value does not exceed the preset loss value threshold, the predicted value may be considered close to or approximating the true value. The preset loss value threshold can be set according to actual requirements.
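A minimal sketch of such a proportion-based stopping check, under the assumption that the per-sample total losses are already available, might look as follows.

```python
def training_finished(total_losses, loss_threshold, sample_ratio=0.95):
    """Decide whether training of the face detection model is complete.

    total_losses: the total loss value computed for each selected sample.
    Training is considered finished when the proportion of samples whose total
    loss is below the threshold reaches the preset sample proportion (here 95%).
    """
    below = sum(1 for loss in total_losses if loss < loss_threshold)
    return below / max(len(total_losses), 1) >= sample_ratio
```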
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating a face detection model according to the present embodiment. In the application scenario of fig. 3, an electronic device 301 obtains an initial face detection model 302 from a local or other electronic device, and uses the obtained initial face detection model as a current face detection model 303; acquiring a sample set 304, wherein samples in the sample set comprise sample face images and labeling information, and the labeling information is used for labeling faces contained in the sample face images; the following training steps are performed: inputting the sample set into a current face detection model 303 with a plurality of convolution layers, and selecting a plurality of feature maps determined by different convolution layers as a plurality of target feature maps 305; for each target feature map in the plurality of target feature maps, determining a face region loss value 306 between the face region and a face label in the label information in the target feature map; determining a target loss value based on the face region loss value 307; determining the weighted sum of a plurality of target loss values corresponding to a plurality of target feature maps as a total loss value 308, and reversely propagating the total loss value in the current face detection model to update the parameters of the current face detection model to obtain an updated face detection model 309; and determining the updated face detection model as the generated face detection model 310 in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
The multiple target feature maps adopted in the embodiment of the application contain rich information, so back-propagating, through the current face detection model, the weighted sum of the multiple target loss values corresponding to the multiple target feature maps improves the accuracy and the recall rate of the generated face detection model. In practice, when the face occupies a large proportion of the sample face image, a feature map with a small size contains less information, but its face region is more representative, which is beneficial for training the face detection model.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating a face detection model is shown. The process 400 of the method for generating a face detection model comprises the following steps:
step 401, obtaining an initial face detection model, and using the obtained initial face detection model as a current face detection model.
In this embodiment, an electronic device (e.g., a server shown in fig. 1) on which the method for generating a face detection model operates may obtain an initial face detection model. And, the obtained initial face detection model is taken as the current face detection model. The face detection model is used for detecting faces in the images.
In practice, the initial face detection model may be an existing variety of convolutional neural network models created based on machine learning techniques. The convolutional neural network model may have various existing convolutional neural network structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.). The storage location of the current face detection model is not limited in this application.
Step 402, a sample set is obtained, wherein samples in the sample set include sample face images and labeling information, and the labeling information is used for labeling faces, heads and trunks included in the sample face images.
In this embodiment, the executing entity may obtain a sample set and select a sample from the sample set. The labeling information is information labeling the exact face position contained in the sample face image. The execution subject marks the face. For example, the execution subject may use a rectangular frame to define the position of the face. In particular, at least one set of diagonal coordinates of the delineated location may be employed to represent the delineated location.
In practice, samples may be selected from the sample set in a variety of ways, for example randomly or in a preset order. Selection in a preset order may be, for instance, selection according to the sample numbers.
The following training step 403 is performed:
in the present embodiment, step 403 is decomposed into 7 sub-steps, i.e., sub-step 4031, sub-step 4032, sub-step 4033, sub-step 4034, sub-step 4035, sub-step 4036, and sub-step 4037.
Substep 4031, input the sample set into the current face detection model with multiple convolutional layers, and select multiple feature maps determined by different convolutional layers as multiple target feature maps.
In this sub-step, the execution subject inputs the sample face images into the current face detection model with a plurality of convolutional layers, and some feature maps determined by different convolutional layers can be obtained through the current face detection model. Then, the execution body may select a plurality of these feature maps as the target feature maps.
Different convolutional layers in the current face detection model are arranged in sequence: the feature map obtained by a preceding convolutional layer serves as the input of the next convolutional layer, which processes it to obtain a new feature map. The feature maps obtained through successive convolutional layers therefore become smaller and smaller in size. The feature map is an image representing features of the face image; for example, it may include location information of the face in the image, such as a face region (x1, y1, x2, y2). The face detection model in this embodiment is a convolutional neural network, and different feature maps are produced by different convolutional layers. Different regions of a feature map may contain features at different locations of the sample face image.
Substep 4032, for each of the plurality of target feature maps, determining a face region loss value between the face region in the target feature map and the face label in the label information.
In the substep, for each of the plurality of target feature maps, the execution subject determines a face region loss value between the face region and the face label in the label information in the target feature map. For example, the face region and the face label may be used as parameters and input into a specified loss function, so that a loss value between the two may be calculated.
Sub-step 4033 determines a target loss value based on a weighted sum of the face region loss value, the head region loss value, and the torso region loss value.
In this sub-step, the execution subject may calculate a weighted sum of the face region loss value determined in sub-step 4032, a predetermined head region loss value, and a predetermined torso region loss value, using the weights set in advance for each of the three loss values, and then determine the target loss value based on the weighted sum.
Here, the predetermined head region loss value may be determined in various ways. For example, the head region loss value may be calculated using a loss function.
In one specific example, the head region loss value is determined by: determining the probability that the head position corresponding to the head region in the sample face image contains the head in the feature image; determining the deviation of the head position and the head marked by the marking information of the sample face image; and determining a head region loss value between the head region and the head label in the label information based on the probability and the deviation corresponding to the determined head position.
In this example, the head region and the torso region have corresponding head positions and torso positions, respectively, in the sample face image. The execution body may determine a probability that the head position contains a head. In practice, the above probability can be obtained and output from the current face detection model. In addition, human body detection can be carried out on the sample face image, and the probability can be obtained.
In particular, the annotated head may be considered the true value. The head position delimited by the face detection model can hardly be completely consistent with the true value, so in general there is a deviation between the head position corresponding to the head region in the sample face image and the true value.
In addition, the executing entity may determine a confidence loss using the probability and a localization loss using the deviation. The weighted sum of the confidence loss and the localization loss is then determined as the feature loss value of the head.
The predetermined torso-region loss value may also be determined in a variety of ways. For example, the torso region loss value may be calculated using a loss function.
In one specific example, the torso region loss value is determined by: determining the probability that the corresponding trunk position of the trunk area in the sample face image contains the trunk in the feature image; determining the deviation between the trunk position and the trunk marked by the marking information of the sample face image; and determining a torso region loss value between the torso region and the torso label in the label information based on the probability and the deviation corresponding to the determined torso position.
Here, the weights set in advance for the face region loss value, the head region loss value, and the torso region loss value may be obtained from analysis of historical data or set based on the experience of a technician. For example, the weights of the face feature loss value, the head feature loss value, and the torso feature loss value may be set equal. As another example, the weight of the face region loss value may be greater than that of the head region loss value, and the weight of the head region loss value may be greater than that of the torso region loss value. By taking the face, head, and torso in the sample face image fully into account while giving the face feature loss value a larger weight, the influence of the face can be increased to enhance the sensitivity of the current face detection model for detecting faces.
In determining the target loss value based on the weighted sum, the execution body may determine the weighted sum as the target loss value, or may input the weighted sum into a preset formula, a preset model, or multiply the weighted sum by a preset coefficient to obtain the target loss value.
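For illustration, a minimal sketch of this weighted combination follows; the specific weight values, and giving the face term the largest weight, are assumptions consistent with the preceding discussion.

```python
def target_loss_from_parts(face_loss, head_loss=None, torso_loss=None,
                           w_face=1.0, w_head=0.5, w_torso=0.25):
    """Target loss as a weighted sum of the face, head, and torso region loss values.
    The face term gets the largest weight so the model stays most sensitive to faces;
    the specific weight values are illustrative assumptions."""
    loss = w_face * face_loss
    if head_loss is not None:
        loss = loss + w_head * head_loss
    if torso_loss is not None:
        loss = loss + w_torso * torso_loss
    return loss
```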
Substep 4034 determines the weighted sum of the target loss values corresponding to the target feature maps as a total loss value.
In this sub-step, the execution body weights the plurality of target loss values corresponding to the plurality of target feature maps to obtain a weighted sum, and determines the weighted sum as the total loss value.
Specifically, different weights may be set in advance for the respective target loss values according to actual conditions.
And a substep 4035, reversely propagating the total loss value in the current face detection model to update parameters of the current face detection model, so as to obtain an updated face detection model.
In this sub-step, the execution subject may reversely propagate the total loss value in the current face detection model, and update parameters in the face detection model to obtain an updated face detection model. The parameters may be various parameters in the current face detection model, such as values of components of a convolution kernel in the current face detection model. By determining the weighted sum between the target loss values, the executing subject does not consider only a single feature map but uses a plurality of feature maps as the influencing factors of the current face detection model when training the current face detection model. The execution subject can train the current face detection model by using back propagation, so that the model can detect the face more accurately.
Sub-step 4036, in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold, the updated face detection model is determined as the generated face detection model.
In this sub-step, the execution subject may determine that the current face detection model has finished training in response to the total loss value corresponding to the updated face detection model being smaller than the preset loss value threshold, and determine the updated face detection model as the generated face detection model. As an example, if multiple samples are selected in step 402, the execution subject may determine that the face detection model training is complete when the total loss value of each sample is less than the preset loss value threshold. As another example, the execution subject may count the proportion of the sample set occupied by samples whose total loss value is smaller than the preset loss value threshold; when the proportion reaches a preset sample proportion (such as 95%), it can be determined that the face detection model has finished training.
A preset loss value threshold can generally be used to represent an acceptable degree of inconsistency between a predicted value (i.e., a face region) and a true value (a face label). That is, when the total loss value is less than the preset loss value threshold, the predicted value may be considered close to or approximating the true value. The preset loss value threshold can be set according to actual requirements.
Sub-step 4037, in response to that the total loss value corresponding to the updated face detection model is not less than the preset loss value threshold, the updated face detection model is used as the current face detection model, and the training step is continuously executed.
In this sub-step, if the total loss value corresponding to the updated face detection model is not less than the preset loss value threshold, the execution subject may determine that the current face detection model has not finished training and take the updated face detection model as the current face detection model. Thereafter, the executing agent may continue to perform the training step. As an example, if a plurality of samples are selected in step 402, the executing subject may determine that the current face detection model is not yet trained when the total loss value of any sample is not less than the preset loss value threshold. As another example, the executive may count the proportion of samples in the sample set whose total loss value is less than the preset loss value threshold; if the proportion does not reach the preset sample proportion (such as 95%), it can be determined that the current face detection model has not finished training.
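Putting the pieces together, a minimal sketch of this outer training loop might look as follows; training_step is a hypothetical helper standing in for sub-steps 4031 to 4035, and the round limit is added only as a safety guard.

```python
def train(current_model, sample_set, loss_threshold, max_rounds=1000):
    """Outer loop of the training procedure (steps 401-403).

    training_step is a hypothetical helper: it runs the samples through the
    current model, computes and back-propagates the total loss (sub-steps
    4031-4035), and returns the total loss of the updated model.
    """
    for _ in range(max_rounds):                 # round limit: safety guard only
        total = training_step(current_model, sample_set)
        if total < loss_threshold:              # sub-step 4036: training finished
            return current_model
        # sub-step 4037: not converged, keep the updated model as the current model
    return current_model
```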
In this embodiment, the face detection model is generated using not only the face information in the samples but also the head information and the torso information, which can improve the accuracy and the recall rate of the face detection model. In addition, in this embodiment, the total loss value derived from the face region loss value, the head region loss value, and the torso region loss value is back-propagated through the current face detection model and the parameters of the model are adjusted, further improving the accuracy of the face detection model.
With further reference to fig. 5, a flow 500 of one embodiment of a method of face detection is shown. The process 500 of the face detection method includes the following steps:
step 501, obtaining a target face image.
In this embodiment, the execution subject (for example, the server shown in fig. 1) of the method for detecting the face may acquire the target face image from a local or other electronic device. The face image here is an image in which a face is presented.
Step 502, inputting a target face image into a pre-trained face detection model to obtain a face region; the pre-trained face detection model is the current face detection model generated by the method in any one of the embodiments shown in fig. 2 and fig. 4.
In this embodiment, the execution subject inputs the target face image into a pre-trained face detection model to obtain a detection result. The detection result is a face area. The pre-trained face detection model is the current face detection model generated by the method in any one of the embodiments shown in fig. 2 and fig. 4.
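A minimal sketch of this inference step is shown below; the model's (boxes, scores) output format and the confidence cut-off are assumptions for illustration.

```python
import torch

def detect_faces(face_detection_model, target_image):
    """Run a pre-trained face detection model on a target face image and return
    the predicted face regions. The (boxes, scores) output format and the 0.5
    confidence cut-off are assumptions for illustration."""
    face_detection_model.eval()
    with torch.no_grad():
        boxes, scores = face_detection_model(target_image.unsqueeze(0))  # add batch dim
    return boxes[scores > 0.5]   # keep regions likely to contain a face
```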
In the embodiment, the face detection is performed by using the face detection model with the parameters adjusted through back propagation, so that the accuracy and recall rate of the detection result can be improved.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a face detection model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 6, the apparatus for generating a face detection model of the present embodiment includes: an acquisition unit 601, a sample acquisition unit 602, and a training unit 603. The acquiring unit 601 is configured to acquire an initial face detection model, and use the acquired initial face detection model as a current face detection model; a sample obtaining unit 602 configured to obtain a sample set, where samples in the sample set include a sample face image and annotation information, and the annotation information is used to annotate a face included in the sample face image; a training unit, the training unit comprising: a selecting subunit 6031 configured to input the sample set into a current face detection model having a plurality of convolutional layers, and select a plurality of feature maps determined by different convolutional layers as a plurality of target feature maps; a loss value determining subunit 6032 configured to determine, for each of the plurality of target feature maps, a face region loss value between the face region and the face label in the label information in the target feature map; determining a target loss value based on the face region loss value; a total loss determination subunit 6033 configured to determine a weighted sum of a plurality of target loss values corresponding to the plurality of target feature maps as a total loss value; an updating subunit 6034, configured to reversely propagate the total loss value in the current face detection model, so as to update parameters of the current face detection model to obtain an updated face detection model; the training unit further comprises: a generating sub-unit 6035 configured to determine the updated face detection model as the generated face detection model in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
In this embodiment, the acquisition unit 601 may acquire an initial face detection model. And, the obtained initial face detection model is taken as the current face detection model. The face detection model is used for detecting faces in the images.
In this embodiment, the sample acquiring unit 602 may acquire a sample set and select samples from it. The labeling information is information labeling the exact face position contained in the sample face image; for example, a rectangular frame may be used to delimit the position of the face. In particular, at least one pair of diagonal corner coordinates of the delimited location may be employed to represent it.
In this embodiment, the selecting unit 6031 inputs the sample set into a current face detection model with a plurality of convolutional layers, and may obtain some feature maps determined by different convolutional layers through the current face detection model. Then, the execution body may select a plurality of feature maps from the feature maps as a plurality of target feature maps.
In the present embodiment, the loss value determination subunit 6032 determines a face region loss value between the face region and the face label in the label information in the target feature map. For example, the face region and the face label may be used as parameters and input into a specified loss function, so that a loss value between the two may be calculated.
In this embodiment, the total loss determination subunit 6033 weights the plurality of target loss values corresponding to the plurality of target feature maps to obtain a weighted sum, and determines the weighted sum as the total loss value.
In this embodiment, the updating sub-unit 6034 may reversely propagate the total loss value in the current face detection model, and update the parameters in the face detection model to obtain an updated face detection model. The parameters may be various parameters in the current face detection model, such as values of components of a convolution kernel in the current face detection model. By determining the weighted sum between the target loss values, the executing subject does not consider only a single feature map but uses a plurality of feature maps as the influencing factors of the current face detection model when training the current face detection model. The execution subject can train the current face detection model by using back propagation, so that the model can detect the face more accurately.
In this embodiment, the generating sub-unit 6035 may determine that the training of the current face detection model is completed and determine the updated face detection model as the generated face detection model in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
In some optional implementations of this embodiment, the loss value determining subunit is further configured to: determine the probability that the position in the sample face image corresponding to the face region in the feature map contains a face; determine the deviation between that face position and the face labeled by the labeling information of the sample face image; and determine the face region loss value between the face region and the face label in the labeling information based on the determined probability and deviation.
In some optional implementations of this embodiment, the loss value determining subunit is further configured to: determining a target loss value based on a weighted sum of the face region loss value and at least one of: a head region loss value and a torso region loss value.
In some optional implementations of this embodiment, the labeling information is further used to label a head included in the sample face image; the head region loss value is determined by the following steps: determining the probability that the position in the sample face image corresponding to the head region in the feature map contains a head; determining the deviation between the head position and the head labeled by the labeling information of the sample face image; and determining a head region loss value between the head region and the head label in the labeling information based on the determined probability and deviation.
In some optional implementations of this embodiment, the labeling information is further used to label a torso included in the sample face image; the torso region loss value is determined by the following steps: determining the probability that the position in the sample face image corresponding to the torso region in the feature map contains a torso; determining the deviation between the torso position and the torso labeled by the labeling information of the sample face image; and determining a torso region loss value between the torso region and the torso label in the labeling information based on the determined probability and deviation.
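As a sketch of the optional implementations above, the target loss may be a weighted sum of the face region loss and, when available, the head and torso region losses; the weights and scalar values below are illustrative only.

```python
import torch

def target_loss(face_loss, head_loss=None, torso_loss=None,
                w_face=1.0, w_head=0.5, w_torso=0.5):
    """Illustrative target loss: a weighted sum of the face region loss with the
    optional head and torso region losses."""
    loss = w_face * face_loss
    if head_loss is not None:
        loss = loss + w_head * head_loss
    if torso_loss is not None:
        loss = loss + w_torso * torso_loss
    return loss

# Example usage with dummy region losses.
print(target_loss(torch.tensor(0.8),
                  head_loss=torch.tensor(0.4),
                  torso_loss=torch.tensor(0.6)))
```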
In some optional implementations of this embodiment, the training unit further includes: a model updating subunit configured to, in response to the total loss value corresponding to the updated face detection model being not smaller than the preset loss value threshold, use the updated face detection model as the current face detection model and input it into the training unit again.
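A schematic loop matching this behavior follows: training repeats until the total loss falls below the preset threshold. The helper `training_step` and the dummy loss sequence are hypothetical stand-ins for one pass of the training unit.

```python
LOSS_THRESHOLD = 0.05   # example value; the patent only requires "a preset loss value threshold"

def training_step(model, losses):
    """Hypothetical stand-in for one pass of the training unit: in a real system it
    would compute the total loss, reversely propagate it, and return the updated model."""
    return model, losses.pop(0)

current_model, dummy_losses = object(), [0.9, 0.4, 0.1, 0.04]
while True:
    current_model, total_loss = training_step(current_model, dummy_losses)
    if total_loss < LOSS_THRESHOLD:
        generated_model = current_model  # the updated model becomes the generated model
        break
```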
With further reference to fig. 7, the present application provides an embodiment of an apparatus for face detection, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 5, and the apparatus may be applied to various electronic devices.
As shown in fig. 7, the apparatus 700 for detecting a human face of the present embodiment includes: an image acquisition unit 701 and a region determination unit 702. The image acquisition unit 701 is configured to acquire a target face image; a region determining unit 702 configured to input a target face image into a pre-trained face detection model to obtain a face region; the pre-trained face detection model is the current face detection model generated by any one of the embodiments corresponding to fig. 5.
In this embodiment, the image acquisition unit 701 may acquire the target face image locally or from another electronic device. The face image here is an image in which a face is presented.
In this embodiment, the region determining unit 702 inputs the target face image into a pre-trained face detection model to obtain a detection result. The detection result is a face region, such as a face bounding box (x1, y1, x2, y2). The pre-trained face detection model is the current face detection model generated by the method in any one of the embodiments shown in fig. 2 and fig. 4.
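A minimal inference sketch is given below, assuming `trained_model` is a face detection model produced by the training procedure and that it returns one (x1, y1, x2, y2) box per detected face; both the names and the output format are illustrative.

```python
import torch

def detect_faces(trained_model, target_face_image):
    """Run the pre-trained face detection model on a single target face image
    (a CHW tensor) and return the detected face regions."""
    trained_model.eval()
    with torch.no_grad():
        face_regions = trained_model(target_face_image.unsqueeze(0))  # add a batch dimension
    return face_regions   # e.g. a tensor of [x1, y1, x2, y2] rows
```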
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801. It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a sample acquisition unit, a training unit, and a generation subunit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, the acquisition unit may also be described as a "unit to acquire an initial face detection model".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring an initial face detection model, and taking the acquired initial face detection model as a current face detection model; acquiring a sample set, wherein samples in the sample set comprise sample face images and marking information, and the marking information is used for marking faces contained in the sample face images; the following training steps are performed: inputting a sample set into a current face detection model with a plurality of convolution layers, and selecting a plurality of feature maps determined by different convolution layers as a plurality of target feature maps; for each target feature map in the plurality of target feature maps, determining a face region loss value between a face region and a face label in the label information in the target feature map; determining a target loss value based on the face region loss value; determining the weighted sum of a plurality of target loss values corresponding to the plurality of target feature maps as a total loss value; the total loss value is reversely propagated in the current face detection model to update the parameters of the current face detection model, so that an updated face detection model is obtained; and determining the updated face detection model as the generated face detection model in response to the fact that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (16)

1. A method for generating a face detection model, comprising:
acquiring an initial face detection model, and taking the acquired initial face detection model as a current face detection model;
acquiring a sample set, wherein samples in the sample set comprise sample face images and labeling information, and the labeling information is used for labeling faces contained in the sample face images;
adopting the sample set to execute the following training steps on the current face detection model:
inputting the sample set into a current face detection model with a plurality of convolution layers, and selecting a plurality of feature maps determined by different convolution layers as a plurality of target feature maps;
for each target feature map in the plurality of target feature maps, determining a face region loss value between a face region and a face label in the label information in the target feature map;
determining a target loss value based on the face region loss value;
wherein the determining a target loss value based on the face region loss value comprises: determining a target loss value based on the face region loss value and a weighted sum of at least one of: a head region loss value and a torso region loss value;
determining a weighted sum of a plurality of target loss values corresponding to the plurality of target feature maps as a total loss value;
the total loss value is reversely propagated in the current face detection model to update parameters of the current face detection model, so that an updated face detection model is obtained;
and determining the updated face detection model as the generated face detection model in response to the fact that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
2. The method of claim 1, wherein the determining, for each of the plurality of target feature maps, a face region loss value between a face region in the target feature map and a face label in the label information comprises:
determining the probability that the face position corresponding to the face region in the feature map in the sample face image contains a face;
determining the deviation of the face position and the face marked by the marking information of the sample face image;
determining a face region loss value between the face region and a face label in the label information based on the determined probability and deviation.
3. The method of claim 1, wherein the labeling information is further used for labeling a head included in the sample face image;
the head region loss value is determined by the following steps:
determining the probability that the head position corresponding to the head region in the sample face image contains the head in the feature map;
determining the deviation of the head position and the head marked by the marking information of the sample face image;
and determining a head region loss value between the head region and the head label in the label information based on the probability and the deviation corresponding to the determined head position.
4. The method of claim 1, wherein the labeling information is further used for labeling a torso included in the sample face image;
the torso region loss value is determined by:
determining the probability that the corresponding trunk position of the trunk area in the sample face image contains the trunk in the feature map;
determining the deviation between the trunk position and a trunk marked by marking information of the sample face image;
determining a torso region loss value between the torso region and a torso label in the label information based on the probability and the deviation corresponding to the determined torso position.
5. The method of claim 1, wherein the training step further comprises:
and taking the updated face detection model as the current face detection model and continuously executing the training step in response to the fact that the total loss value corresponding to the updated face detection model is not less than a preset loss value threshold value.
6. A face detection method, comprising:
acquiring a target face image;
inputting the target face image into a pre-trained face detection model to obtain a face area; wherein the pre-trained face detection model is the current face detection model generated by the method of any one of claims 1 to 4.
7. An apparatus for generating a face detection model, comprising:
the system comprises an acquisition unit, a judging unit and a judging unit, wherein the acquisition unit is configured to acquire an initial face detection model and take the acquired initial face detection model as a current face detection model;
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is configured to acquire a sample set, samples in the sample set comprise a sample face image and labeling information, and the labeling information is used for labeling a face contained in the sample face image;
a training unit, the training unit comprising:
a selecting subunit configured to input the sample set into a current face detection model having a plurality of convolution layers, and select a plurality of feature maps determined by different convolution layers as a plurality of target feature maps;
a loss value determining subunit, configured to determine, for each of the plurality of target feature maps, a face region loss value between a face region and a face label in the label information in the target feature map; determining a target loss value based on the face region loss value;
wherein the loss value determination subunit is further configured to: determining a target loss value based on the face region loss value and a weighted sum of at least one of: a head region loss value and a torso region loss value;
a total loss determining subunit configured to determine a weighted sum of a plurality of target loss values corresponding to the plurality of target feature maps as a total loss value;
an updating subunit, configured to reversely propagate the total loss value in the current face detection model to update parameters of the current face detection model, so as to obtain an updated face detection model;
and the generating subunit is configured to determine the updated face detection model as the generated face detection model in response to that the total loss value corresponding to the updated face detection model is smaller than a preset loss value threshold.
8. The apparatus of claim 7, wherein the loss value determination subunit is further configured to:
determining the probability that the face position corresponding to the face region in the feature map in the sample face image contains a face;
determining the deviation of the face position and the face marked by the marking information of the sample face image in the feature map;
determining a face region loss value between the face region and a face label in the label information based on the determined probability and deviation.
9. The apparatus of claim 7, wherein the labeling information is further used for labeling a head included in the sample face image;
the head region loss value is determined by the following steps:
determining the probability that the head position corresponding to the head region in the sample face image contains the head in the feature map;
determining the deviation of the head position and the head marked by the marking information of the sample face image;
and determining a head region loss value between the head region and the head label in the label information based on the probability and the deviation corresponding to the determined head position.
10. The apparatus of claim 7, wherein the labeling information is further used for labeling a torso included in the sample face image;
the torso region loss value is determined by:
determining the probability that the corresponding trunk position of the trunk area in the sample face image contains the trunk in the feature map;
determining the deviation between the trunk position and a trunk marked by marking information of the sample face image;
determining a torso region loss value between the torso region and a torso label in the label information based on the probability and the deviation corresponding to the determined torso position.
11. The apparatus of claim 7, wherein the training unit further comprises:
and the model updating subunit is configured to respond that the total loss value corresponding to the updated face detection model is not less than a preset loss value threshold value, use the updated face detection model as a current face detection model, and input the current face detection model into the training unit.
12. A face detection apparatus comprising:
an image acquisition unit configured to acquire a target face image;
the region determining unit is configured to input the target face image into a pre-trained face detection model to obtain a face region; wherein the pre-trained face detection model is the current face detection model generated by the apparatus for generating a face detection model according to any one of claims 7 to 11.
13. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
14. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1-5.
15. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by one or more processors, cause the one or more processors to implement the method of claim 6.
16. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method of claim 6.