CN116665266A - Face detection method and device, electronic equipment and storage medium - Google Patents

Face detection method and device, electronic equipment and storage medium

Info

Publication number: CN116665266A
Application number: CN202310560226.6A
Authority: CN (China)
Prior art keywords: face, face feature, feature map, module, feature
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 廖科华 (Liao Kehua), 梁书举 (Liang Shuju), 张毫 (Zhang Hao)
Current Assignee: Shenzhen KTC Commercial Technology Co Ltd
Original Assignee: Shenzhen KTC Commercial Technology Co Ltd

Classifications

    • G06V40/168 Human faces: Feature extraction; Face representation
    • G06V40/172 Human faces: Classification, e.g. identification
    • G06N3/0464 Neural network architectures: Convolutional networks [CNN, ConvNet]
    • G06N3/048 Neural network architectures: Activation functions
    • G06N3/084 Learning methods: Backpropagation, e.g. using gradient descent
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/764 Recognition using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V10/766 Recognition using pattern recognition or machine learning: regression, e.g. by projecting features on hyperplanes
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Recognition using neural networks
    • Y02T10/40 Engine management systems


Abstract

The invention relates to the technical field of face detection, and discloses a face detection method and apparatus, electronic equipment and a storage medium. The method comprises the following steps: acquiring a face image to be detected, inputting the face image into a feature extraction network in a face detection model, and extracting basic face features to obtain an initial face feature map; inputting the initial face feature map into an improved PANet network, and carrying out feature enhancement processing on the initial face feature map to obtain an enhanced face feature map; respectively inputting the enhanced face feature maps into a CPM detection module for operation to obtain optimized face feature maps; respectively inputting the optimized face feature maps into a Head classification module for classification to obtain classification results; and performing non-maximum suppression processing on the classification results to obtain a face detection result. The embodiment of the invention can improve the detection precision and detection effect of a face detection model deployed on an SoC.

Description

Face detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of face detection, in particular to a face detection method, a face detection device, electronic equipment and a storage medium.
Background
Face detection is a computer vision technology for determining the position and size of a face in a digital image; it is one of the basic technologies of human-computer interaction and a cornerstone of face analysis algorithms. With the rapid development of computer vision and the continuous popularization of deep learning methods, face detection technology based on neural networks is widely applied, and its practical deployment is inseparable from the hardware support of the SoC (System on Chip), so a face detection algorithm running on an SoC needs strong robustness and generalization capability. However, due to the limitations of chip area and power consumption, the computing resources on an SoC are limited, and the existing neural-network-based face detection technology suffers from low detection precision and poor detection effect when applied to an SoC, making it difficult to apply well in actual scenarios.
Disclosure of Invention
The embodiment of the invention provides a face detection method and apparatus, an electronic device and a storage medium, aiming to solve the problems of low detection precision and poor detection effect when existing face detection methods are applied to an SoC.
In a first aspect, an embodiment of the present invention provides a face detection method, including:
acquiring a face image to be detected, and inputting the face image into a feature extraction network in a face detection model to extract basic face features to obtain an initial face feature map, wherein the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network;
inputting the initial face feature map into the improved PANet network, and carrying out feature enhancement processing on the initial face feature map to obtain an enhanced face feature map;
inputting the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map;
inputting the optimized face feature map into the Head classification module for classification to obtain a classification result;
and performing non-maximum suppression processing on the classification result to obtain a face detection result.
In a second aspect, an embodiment of the present invention further provides a face detection apparatus, including:
the extraction unit is used for acquiring a face image to be detected, and inputting the face image into a feature extraction network in a face detection model to extract basic face features to obtain an initial face feature map, wherein the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network;
the enhancement unit is used for inputting the initial face feature map into the improved PANet network, and carrying out feature enhancement processing on the initial face feature map to obtain an enhanced face feature map;
the optimization unit is used for inputting the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map;
the classification unit is used for inputting the optimized face feature map into the Head classification module for classification to obtain a classification result;
and the processing unit is used for performing non-maximum suppression processing on the classification result to obtain a face detection result.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the above method.
The embodiment of the invention provides a face detection method and apparatus, an electronic device and a storage medium. Wherein the method comprises the following steps: acquiring a face image to be detected, and inputting the face image into a feature extraction network in a face detection model to extract basic face features to obtain an initial face feature map, wherein the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network; inputting the initial face feature map into the improved PANet network, and carrying out feature enhancement processing on the initial face feature map to obtain an enhanced face feature map; inputting the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map; inputting the optimized face feature map into the Head classification module for classification to obtain a classification result; and performing non-maximum suppression processing on the classification result to obtain a face detection result. According to the technical scheme, the improved PANet network, the CPM detection module, the Head classification module and the feature extraction network are incorporated in the face detection model, and feature extraction, enhancement, optimization and classification processing are carried out on the face image to be detected, so that the detection precision of the face detection model is improved and a good detection effect can be obtained when the model is applied to an SoC.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a face detection method according to an embodiment of the present invention;
fig. 2 is a network configuration diagram of the MobileNetV3 network in fig. 1;
FIG. 3 is a schematic diagram of feature weighting of the attention module in the MobileNetV3 network of FIG. 1;
fig. 4 is a schematic sub-flowchart of a face detection method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the operation of the improved PANet network of FIG. 1;
FIG. 6 is a schematic diagram of the CPM detection module of FIG. 1;
fig. 7 is a block diagram of a workflow of a face detection method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the feature map attributes of each module of a face detection model according to an embodiment of the present invention;
fig. 9 is a schematic block diagram of a face detection apparatus according to an embodiment of the present invention;
fig. 10 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Referring to fig. 1, fig. 1 is a flowchart of a face detection method according to an embodiment of the present invention. The face detection method is described in detail below. As shown in fig. 1, the method includes the following steps S100 to S140.
S100, acquiring a face image to be detected, and inputting the face image into a feature extraction network in a face detection model to extract basic features of a face to obtain an initial face feature map, wherein the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network.
In the embodiment of the invention, the feature extraction network in the face detection model is a MobileNetV3 network; the face image to be detected is obtained and input into the MobileNetV3 network for basic face feature extraction. Because the face images to be detected may contain faces of different sizes, for example small-size, medium-size and large-size faces in the present embodiment (and, in other embodiments, partially occluded faces), the MobileNetV3 network needs to be improved in order to better detect face targets of different sizes. FIG. 2 is a network configuration diagram of the MobileNetV3 network. In FIG. 2, Input represents the size of the input image, e.g. 480²×3 means that the input face image to be detected has a size of 480×480 with 3 channels; bneck denotes the inverted residual block, and 3×3 denotes the convolution kernel size of the depthwise convolution; #out represents the number of output channels; exp_size represents the output dimension of the first dimension-raising 1×1 convolution in bneck; SE indicates whether the attention mechanism is used; NL indicates which nonlinear activation function is used, where RE denotes the ReLU activation function and HS denotes the h-swish activation function; s is the stride; NBN indicates that the convolutions of the classifier section do not use a BN layer; k denotes num_class, i.e. the number of classes. Notably, in the first bneck structure the exp_size and the output dimension are identical, i.e. the first 1×1 convolution does not perform dimension raising. It should be noted that, in the embodiment of the present invention, the expression of the h-swish activation function used in the MobileNetV3 network is shown in formula (1-1), and the expression of the ReLU6 activation function is shown in formula (1-2):
h-swish(x) = x · ReLU6(x + 3) / 6 (1-1)
ReLU6(x) = min(max(x, 0), 6) (1-2)
wherein x represents an input feature value, max represents taking the maximum, min represents taking the minimum, ReLU6(x) represents the output of the ReLU6 activation function, and h-swish(x) represents the output of the h-swish activation function.
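For illustration only (not part of the original disclosure), a minimal PyTorch sketch of the two activation functions in formulas (1-1) and (1-2) is given below; the function names are illustrative, and PyTorch's built-in nn.ReLU6 and nn.Hardswish modules implement the same operations.

import torch

def relu6(x: torch.Tensor) -> torch.Tensor:
    # ReLU6(x) = min(max(x, 0), 6), formula (1-2)
    return torch.clamp(x, min=0.0, max=6.0)

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # h-swish(x) = x * ReLU6(x + 3) / 6, formula (1-1)
    return x * relu6(x + 3.0) / 6.0

x = torch.linspace(-6, 6, 7)
print(relu6(x))
print(h_swish(x))  # close to x * sigmoid(x) (swish) but cheaper to evaluate on SoC hardware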
In the MobileNetV3 network, bneck (the inverted residual module) is the basic module and also the core module; it mainly implements: depthwise separable convolution + SE channel attention mechanism + residual connection. Specifically, the attention configuration information of the attention module in the inverted residual module is obtained, and whether the attention module is used in the inverted residual module is judged based on the attention configuration information; if the attention module is configured in the inverted residual module and is configured to be used, the channel attention mechanism of the attention module assigns appropriate weights to important channels, and the face image to be detected is then input into the inverted residual module for inverted residual calculation to extract basic face features and obtain the initial face feature map. An SE module (attention module) is added into bneck; its core principle is that pooling is performed for each channel, and an output vector is then obtained through two fully connected layers. This output vector represents the importance of each channel to the original feature matrix (i.e. the input face image to be detected): the more important a channel, the greater the weight given to it. The number of nodes of the first fully connected layer is 1/4 of the number of channels, and the number of nodes of the second fully connected layer is equal to the number of channels. As shown in FIG. 3, FIG. 3 is a schematic diagram of feature weighting of the attention module in the MobileNetV3 network. First, average pooling turns each channel into a single value; then the channel weights are obtained after passing through the two fully connected layers (FC1 and FC2); the channel weights are then multiplied back onto the original feature matrix to obtain a new weighted feature matrix. Notably, the fully connected layer FC2 uses a Hard-Sigmoid activation function instead of a ReLU activation function.
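A minimal sketch of the SE channel attention described above is given below, assuming a reduction ratio of 4 as stated (one value per channel from average pooling, FC1 with ReLU, FC2 with Hard-Sigmoid, then re-weighting); the class name and tensor sizes are illustrative, not the exact bneck configuration of the patent.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: pool -> FC1 (C/4, ReLU) -> FC2 (C, Hard-Sigmoid) -> scale."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # one value per channel
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Hardsigmoid()                 # Hard-Sigmoid instead of ReLU after FC2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # per-channel statistic
        w = self.gate(self.fc2(self.act(self.fc1(w))))
        return x * w.view(b, c, 1, 1)                # re-weight the original feature map

feat = torch.randn(1, 64, 60, 60)
print(SEBlock(64)(feat).shape)  # torch.Size([1, 64, 60, 60])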
In the embodiment of the invention, since the existing face detection model generally uses the MobileNetV1 network in the Retinaface algorithm as the feature extraction network, the present invention replaces the MobileNetV1 network in the Retinaface algorithm with the MobileNetV3 network. Compared with MobileNetV1, MobileNetV3 has the following advantages: 1. MobileNetV3 uses Network Architecture Search (NAS) to optimize the network structure, whereas MobileNetV1 is designed manually, so the searched MobileNetV3 structure is better tuned and faster; 2. a channel attention mechanism module (the SE module) is introduced, which enhances the expression capability of the extracted features; 3. the h-swish activation function improves the nonlinear expression capability; 4. MobileNetV3 modifies the output structure of the last stage, uses average pooling, and reduces the connections of the bottleneck layers, which reduces the computation amount and inference time and lowers recognition delay. Therefore, in theory the MobileNetV3 network is superior to the MobileNetV1 network in feature extraction capability as well as inference and computation speed, and can improve the detection precision and real-time detection performance of the face detection model.
S110, inputting the initial face feature map into the improved PANet network, and carrying out feature enhancement processing on the initial face feature map to obtain an enhanced face feature map.
In the embodiment of the invention, the face image to be detected is input into the MobileNetV3 network to extract basic face features and output initial face feature maps, where the initial face feature maps comprise a first initial face feature map P1, a second initial face feature map P2 and a third initial face feature map P3, which represent initial face feature maps of different sizes. Specifically, in the embodiment of the present invention, the sizes of the first initial face feature map P1, the second initial face feature map P2 and the third initial face feature map P3 are respectively 1/8, 1/16 and 1/32 of the original image size; for example, if the original image size is 480×480, the size of the first initial face feature map P1 output by the MobileNetV3 network is 60×60, the size of the second initial face feature map P2 is 30×30, and the size of the third initial face feature map P3 is 15×15. The first initial face feature map P1, the second initial face feature map P2 and the third initial face feature map P3 are input into the improved PANet network for feature enhancement processing to obtain the enhanced face feature maps.
In the embodiment of the invention, the PANet (Path Aggregation Network), also called a path aggregation network, enhances the feature pyramid with a bottom-up path and uses the transmission of a reverse information flow to further improve the information interaction between the network features of each layer, thereby shortening the information path between low-level features and top-level features. However, the features of the deep output layers of PANet are all obtained by convolving and upsampling lower-level features, and the fusion of the original low-level input feature information is lacking. Therefore, in order to add the original low-level input feature information to the deep features, the present invention improves the PANet network and designs an improved PANet network, which comprises a bottom layer feature network, a middle layer feature network and a top layer feature network. The first initial face feature map P1, the second initial face feature map P2 and the third initial face feature map P3 are respectively input into the bottom layer feature network, the middle layer feature network and the top layer feature network for convolution and sampling processing to obtain the enhanced face feature maps.
Referring to fig. 4, in an embodiment, for example, in the embodiment of the present invention, the step S110 includes the following steps S111-S114.
S111, sampling the third initial face feature map to obtain a first sampled face feature, inputting the second initial face feature map and the first sampled face feature into the middle layer feature network to obtain a convolution face feature, sampling the convolution face feature to obtain a second sampled face feature, and inputting the first initial face feature map and the second sampled face feature into the bottom layer feature network to obtain a first enhanced face feature map.
In the embodiment of the present invention, in order to enhance the feature information of the first initial face feature map P1, the feature information of the second initial face feature map P2 and the third initial face feature map P3 is blended into the processing of the first initial face feature map P1. Specifically, bilinear upsampling is performed on the third initial face feature map P3 to obtain the first sampled face feature; the second initial face feature map P2 and the first sampled face feature are input into the middle layer feature network for convolution operation to obtain the convolution face feature P2_mid; bilinear upsampling is performed on the convolution face feature P2_mid to obtain the second sampled face feature; and the first initial face feature map P1 and the second sampled face feature are input into the bottom layer feature network for convolution operation to obtain the first enhanced face feature map P1_out.
S112, sampling the first initial face feature map and the first enhanced face feature map to obtain a bottom layer sampling feature, and inputting the bottom layer sampling feature and the convolution face feature into the middle layer feature network for convolution operation to obtain a second enhanced face feature map.
In the embodiment of the present invention, bilinear downsampling is performed on the first initial face feature map P1 to obtain a third sampled face feature, and bilinear downsampling is performed on the first enhanced face feature map P1_out to obtain a fourth sampled face feature; the third sampled face feature and the fourth sampled face feature are taken as the bottom layer sampling feature, and the bottom layer sampling feature and the convolution face feature P2_mid are input into the middle layer feature network for convolution operation to obtain the second enhanced face feature map P2_out.
S113, sampling the second initial face feature map and the second enhanced face feature map to obtain a middle layer sampling feature, and inputting the middle layer sampling feature and the third initial face feature map P3 into the top layer feature network for convolution operation to obtain a third enhanced face feature map.
In the embodiment of the present invention, bilinear downsampling is performed on the second initial face feature map P2 to obtain a fifth sampled face feature, and bilinear downsampling is performed on the second enhanced face feature map P2_out to obtain a sixth sampled face feature; the fifth sampled face feature and the sixth sampled face feature are taken as the middle layer sampling feature, and the middle layer sampling feature and the third initial face feature map P3 are input into the top layer feature network for convolution operation to obtain the third enhanced face feature map P3_out.
And S114, taking the first enhanced face feature map, the second enhanced face feature map and the third enhanced face feature map as enhanced face feature maps.
In the embodiment of the invention, the bottom layer feature network outputs the first enhanced face feature map P1_out, the middle layer feature network outputs the second enhanced face feature map P2_out, and the top layer feature network outputs the third enhanced face feature map P3_out; the output of the improved PANet network is the enhanced face feature maps. Specifically, referring to FIG. 5, FIG. 5 is a schematic diagram illustrating the operation of the improved PANet network. In FIG. 5, Bilinear represents bilinear sampling, Scale=2 represents upsampling, and Scale=0.5 represents downsampling: with Bilinear and Scale=2, the size of the initial face feature map is doubled; with Bilinear and Scale=0.5, the size of the initial face feature map is halved. 3×3×64 Conv represents a convolution with kernel size 3×3 and 64 channels. The operation expressions for sampling the initial face feature maps are shown in formulas (1-3) to (1-8):
P3_mid = P3_in (1-3)
P2_mid = Conv(P2_in + Upsample(P3_mid)) (1-4)
P1_mid = Conv(P1_in + Upsample(P2_mid)) (1-5)
P1_out = P1_mid (1-6)
P2_out = Conv(Downsample(P1_in) + Downsample(P1_out) + P2_mid) (1-7)
P3_out = Conv(Downsample(P2_in) + Downsample(P2_out) + P3_mid) (1-8)
wherein P3_in represents the input value of the top layer feature network, P3_mid represents the intermediate value of the top layer feature network, and P3_out represents the output value of the top layer feature network; P2_in represents the input value of the middle layer feature network, P2_mid represents the intermediate value of the middle layer feature network, and P2_out represents the output value of the middle layer feature network; P1_in represents the input value of the bottom layer feature network, P1_mid represents the intermediate value of the bottom layer feature network, and P1_out represents the output value of the bottom layer feature network; Conv represents the convolution operation, Downsample represents downsampling, and Upsample represents upsampling. In the present embodiment, the value of P2_mid is the convolution face feature obtained by inputting the second initial face feature map and the first sampled face feature into the middle layer feature network for convolution operation.
In the embodiment of the invention, compared with the original PANet network, the improved PANet network makes the following specific improvements: 1. the information flow between lower-layer input features and higher-layer output features is increased, i.e. two additional information flows, P2_in to P3_out and P1_in to P2_out, are transmitted; 2. bilinear interpolation sampling is adopted, which overcomes the discontinuity of nearest-neighbor interpolation and reduces the blocking artifacts of linear features, and the upsampling or downsampling operation is controlled by the parameter Scale; 3. a weight is added after each size of feature map is upsampled or downsampled, so as to adjust the contribution degree of each size of feature map. The weighted feature fusion adopts fast normalized fusion, which is defined in formula (1-9):
O = Σ_i ( w_i · I_i ) / ( ε + Σ_j w_j ) (1-9)
wherein I_i represents the i-th input feature, w_i ≥ 0 represents its learnable weight, and ε is a small value set to avoid a zero denominator while normalizing the value of each weight to between [0, 1]. The three-layer feature network of the improved PANet network can respectively extract small-size, medium-size and large-size face features, and the newly added bottom-up information flow transmission fuses large-size feature information into the upper-layer small-target information, which improves the information richness of the feature maps and makes detection more accurate and reliable.
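A condensed sketch of formulas (1-3) to (1-9) is given below for illustration: bilinear up/down-sampling, the two added bottom-up information flows, and fast normalized fusion with learnable weights. The 64-channel 3×3 convolutions follow the description of FIG. 5; the use of one scalar weight per fused input (rather than a full weight matrix) and all other details are assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

def resample(x, scale):
    # Bilinear sampling; Scale=2 doubles the feature map size, Scale=0.5 halves it
    return F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)

class WeightedFuse(nn.Module):
    """Fast normalized fusion (1-9): out = sum_i(w_i * x_i) / (eps + sum_j w_j), w_i >= 0."""
    def __init__(self, n_inputs, channels=64, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # 3x3x64 Conv

    def forward(self, inputs):
        w = F.relu(self.w)                                   # keep weights non-negative
        fused = sum(wi * x for wi, x in zip(w, inputs)) / (self.eps + w.sum())
        return self.conv(fused)

class ImprovedPANet(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fuse_p2_mid = WeightedFuse(2, channels)   # (1-4)
        self.fuse_p1_mid = WeightedFuse(2, channels)   # (1-5)
        self.fuse_p2_out = WeightedFuse(3, channels)   # (1-7): added P1_in -> P2_out flow
        self.fuse_p3_out = WeightedFuse(3, channels)   # (1-8): added P2_in -> P3_out flow

    def forward(self, p1_in, p2_in, p3_in):
        p3_mid = p3_in                                                     # (1-3)
        p2_mid = self.fuse_p2_mid([p2_in, resample(p3_mid, 2)])            # (1-4)
        p1_out = self.fuse_p1_mid([p1_in, resample(p2_mid, 2)])            # (1-5)/(1-6)
        p2_out = self.fuse_p2_out([resample(p1_in, 0.5),
                                   resample(p1_out, 0.5), p2_mid])         # (1-7)
        p3_out = self.fuse_p3_out([resample(p2_in, 0.5),
                                   resample(p2_out, 0.5), p3_mid])         # (1-8)
        return p1_out, p2_out, p3_out

p1, p2, p3 = (torch.randn(1, 64, s, s) for s in (60, 30, 15))
outs = ImprovedPANet()(p1, p2, p3)
print([o.shape[-1] for o in outs])  # [60, 30, 15]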
S120, inputting the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map.
In the embodiment of the invention, the improved PANet network outputs enhanced face feature maps at three scales after feature enhancement, and the enhanced face feature maps of the three scales are respectively input into the CPM detection module for optimization processing to obtain the optimized face feature maps. The CPM (Context-sensitive Predict Module) detection module is a context-sensitive structure that combines the advantages of the SSH and DSSD network models: SSH laterally expands three convolution layers of different depths with branches of different sizes to increase the network receptive field and thus the detection precision, while the DSSD model increases the depth of the network by adding a residual module. Therefore, the CPM detection module uses a wider and deeper network to fuse the context information around the target face, and improves the expression capability of the prediction model by introducing context information. As shown in FIG. 6, FIG. 6 is a schematic diagram of the CPM detection module, where k represents kernel_size, s represents stride, p represents padding, and + represents point-by-point addition. The input X is passed through a Conv_Bn1x1 module to adjust the number of channels, giving a feature map of size w×h×128, which forms the residual branch; the feature map output size after Concat is w×h×64 (64 = 32 + 16 + 16). Notably, the LeakyReLU activation function is enabled inside both the Conv_Bn1x1 and Conv_Bn modules, while the ReLU activation function is used on the feature map after Conv2d. The CPM detection module improves the predictive expression capability of the face detection model and further improves the face detection precision.
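The sketch below shows one plausible reading of such a context module: SSH-style branches whose concatenated output is w×h×64 (64 = 32 + 16 + 16), plus a 1×1 residual shortcut added point-by-point, with LeakyReLU inside the Conv_Bn blocks and ReLU on the combined output. The exact wiring and the channel count of the shortcut (the patent's FIG. 6 mentions a 128-channel residual branch) are assumptions and are not reproduced exactly here.

import torch
import torch.nn as nn

def conv_bn(cin, cout, k=3, s=1, p=1):
    # Conv_Bn block; LeakyReLU is used inside, as in the description above
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p, bias=False),
                         nn.BatchNorm2d(cout),
                         nn.LeakyReLU(0.1, inplace=True))

class CPM(nn.Module):
    """Context-sensitive module: SSH-style multi-branch context + residual shortcut."""
    def __init__(self, cin=64):
        super().__init__()
        self.shortcut = conv_bn(cin, 64, k=1, p=0)   # Conv_Bn1x1 residual branch (channels assumed)
        self.branch3x3 = conv_bn(cin, 32)            # 32-channel context branch
        self.branch5x5_1 = conv_bn(cin, 16)
        self.branch5x5_2 = conv_bn(16, 16)           # stacked 3x3 ~ 5x5 receptive field
        self.branch7x7 = conv_bn(16, 16)             # one more 3x3 ~ 7x7 receptive field
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        c3 = self.branch3x3(x)
        c5 = self.branch5x5_2(self.branch5x5_1(x))
        c7 = self.branch7x7(c5)
        ctx = torch.cat([c3, c5, c7], dim=1)         # w x h x 64 (64 = 32 + 16 + 16)
        return self.relu(ctx + self.shortcut(x))     # point-by-point addition, then ReLU

print(CPM()(torch.randn(1, 64, 60, 60)).shape)  # torch.Size([1, 64, 60, 60])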
S130, inputting the optimized face feature map into the Head classification module for classification to obtain a classification result.
In the embodiment of the invention, the CPM detection module performs feature optimization processing and outputs optimized face feature maps, where the optimized face feature maps comprise a first optimized face feature map, a second optimized face feature map and a third optimized face feature map, and the three optimized face feature maps are input into the Head classification module for classification. Specifically, the Head classification module comprises a first Head classification module, a second Head classification module and a third Head classification module, each of which comprises a face frame regression module (BoxHead), a classification regression module (ClassHead) and a face key point regression module (LandmarkHead), and the attribute parameters (w×h×d, where w is the width, h is the height and d is the dimension) of each Head classification module are different: the attribute parameters of the face frame regression module, the classification regression module and the face key point regression module of the first Head classification module are 60×60×4, 60×60×2 and 60×60×10 respectively; those of the second Head classification module are 30×30×4, 30×30×2 and 30×30×10 respectively; and those of the third Head classification module are 15×15×4, 15×15×2 and 15×15×10 respectively. Specifically, the first optimized face feature map is input into the face frame regression module, the classification regression module and the face key point regression module of the first Head classification module for regression classification to obtain a first face frame regression classification result, a first regression classification result and a first face key point regression classification result; the second optimized face feature map is input into the face frame regression module, the classification regression module and the face key point regression module of the second Head classification module for regression classification to obtain a second face frame regression classification result, a second regression classification result and a second face key point regression classification result; the third optimized face feature map is input into the face frame regression module, the classification regression module and the face key point regression module of the third Head classification module for regression classification to obtain a third face frame regression classification result, a third regression classification result and a third face key point regression classification result; and the first face frame regression classification result, the first regression classification result, the first face key point regression classification result, the second face frame regression classification result, the second regression classification result, the second face key point regression classification result, the third face frame regression classification result, the third regression classification result and the third face key point regression classification result are taken as the classification results output by the Head classification module.
In this embodiment, the output dimension of the classification results output by the Head classification module is w×h×d (d = 2, 4, 10), where w represents the width and h represents the height; d = 2 is the dimension of the classification regression module (ClassHead), which detects whether the target is classified as a face; d = 4 is the dimension of the face frame regression module (BoxHead), which detects the coordinate positions of the upper-left and lower-right corners of the face frame; and d = 10 is the dimension of the face key point regression module (LandmarkHead), representing the 5 x values and 5 y values corresponding to the 5 face key point positions.
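For illustration, a minimal sketch of such regression heads is given below; the patent only specifies the output dimensions, so the use of 1×1 convolutions, the 64-channel input and the omission of per-location anchors are assumptions of this sketch.

import torch
import torch.nn as nn

class Head(nn.Module):
    """Simple head producing a w x h x d output; d = 4 (BoxHead), 2 (ClassHead), 10 (LandmarkHead)."""
    def __init__(self, cin: int, d: int):
        super().__init__()
        self.conv = nn.Conv2d(cin, d, kernel_size=1)

    def forward(self, x):
        out = self.conv(x)                 # N x d x h x w
        return out.permute(0, 2, 3, 1)     # N x h x w x d, matching the w x h x d description

feat = torch.randn(1, 64, 60, 60)          # first optimized face feature map (channels assumed)
box, cls, ldm = Head(64, 4)(feat), Head(64, 2)(feat), Head(64, 10)(feat)
print(box.shape, cls.shape, ldm.shape)     # 60x60x4, 60x60x2 and 60x60x10 outputs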
And S140, performing non-maximum suppression processing on the classification result to obtain a face detection result.
In the embodiment of the invention, after the classification results output by the Head classification module are obtained, classification results of the same kind are spliced (Concat): the first face frame regression classification result, the second face frame regression classification result and the third face frame regression classification result are spliced to obtain a first prediction result; the first regression classification result, the second regression classification result and the third regression classification result are spliced to obtain a second prediction result; and the first face key point regression classification result, the second face key point regression classification result and the third face key point regression classification result are spliced to obtain a third prediction result. Then, prediction results with a high overlap ratio and relatively inaccurate calibration are removed by the post-processing technique of non-maximum suppression (NMS), and the face detection result is finally obtained.
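A minimal sketch of this splice-then-suppress post-processing is given below, using torchvision's NMS operator; the confidence and IoU thresholds are illustrative assumptions, not values from the patent.

import torch
from torchvision.ops import nms

def postprocess(boxes_per_scale, scores_per_scale, score_thresh=0.5, iou_thresh=0.4):
    """Concat the per-scale predictions, drop low scores, then apply non-maximum suppression."""
    boxes = torch.cat(boxes_per_scale, dim=0)      # (N, 4): x1, y1, x2, y2
    scores = torch.cat(scores_per_scale, dim=0)    # (N,): face confidence
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)          # removes highly overlapping, lower-scoring boxes
    return boxes[kept], scores[kept]

# toy example: two overlapping detections of the same face and one distinct face
boxes = [torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.]]),
         torch.tensor([[100., 100., 140., 150.]])]
scores = [torch.tensor([0.9, 0.6]), torch.tensor([0.8])]
final_boxes, final_scores = postprocess(boxes, scores)
print(final_boxes.shape)  # torch.Size([2, 4]) -- the duplicate was suppressed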
For a better understanding of the workflow of the face detection method according to the embodiment of the present invention, please refer to FIG. 7, which is a block diagram of the workflow of the face detection method according to an embodiment of the present invention; the workflow of the face detection model is described below with reference to this block diagram. First, the face image to be detected is input into the backbone feature extraction network MobileNetV3 to extract basic face features, and 3 initial face feature maps of different scales (P1, P2 and P3) are output. Second, the initial face feature maps are used as the input of the improved PANet network for feature enhancement processing; the three-layer feature network structure of the improved PANet network outputs the first enhanced face feature map, the second enhanced face feature map and the third enhanced face feature map respectively, extracting large-size, medium-size and small-size face features. Then, the first enhanced face feature map, the second enhanced face feature map and the third enhanced face feature map are respectively input into the CPM detection module, which enlarges the receptive field and introduces context information to improve the detection effect on small-size and occluded faces. The Head classification module comprises face frame regression (BboxHead), classification regression (ClsHead) and face key point regression (LdmHead); the Head network uses anchor boxes of different sizes and proportions to generate candidate regions, classifies and regresses each anchor box, obtains the classification results from the feature layers and splices them (Concat). Finally, classification results with a high overlap ratio and relatively inaccurate calibration are removed by the post-processing technique of non-maximum suppression (NMS) to obtain the final face detection result. The face detection model is a lightweight model with a small memory footprint; it can be deployed on SoC and edge devices for face detection, and it alleviates the problems of low face detection precision, short recognition distance and poor real-time performance of face detection algorithms ported to an SoC. Practical application data show that the face detection model provided by the invention achieves a face recognition accuracy above 96% at short distance (within 3 m) and above 70% at medium distance (4-8 m). It can be appreciated that the face recognition accuracy decreases as the recognition distance increases, but the decrease is still within an acceptable range, and the requirements of basic scenarios (such as video conferencing and smart home) can be satisfied in terms of detection accuracy, detection speed and detection range.
Referring to FIG. 8, FIG. 8 is a schematic diagram of the feature map attributes of each module of the face detection model according to an embodiment of the present invention. Taking an input picture size of 480×480 as an example, the backbone feature extraction network MobileNetV3 outputs dimensions w×h×c (w, h = 60, 30, 15; c = 64, 128, 256); each scale then passes through a 1×1×64 convolution layer so that the channel numbers are unified to 64, i.e. c = [64, 64, 64]; the improved PANet network and the CPM detection module perform feature enhancement and optimization to enrich the multi-scale information, but do not change the dimension attributes of each scale; after classification by the Head classification module, the output dimension of each scale is w×h×d (d = 2, 4, 10); the same kind of classification results of the three feature maps of different sizes are then spliced (Concat), finally obtaining a face detection result map with output 1×9450×d, where 9450 = 15×15×2 + 30×30×2 + 60×60×2.
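The small arithmetic check below reproduces this dimension flow; the value of 2 anchors per feature map location is inferred from the 9450 = 15×15×2 + 30×30×2 + 60×60×2 decomposition above and should be treated as an assumption.

# Sanity check of the dimension flow described above (plain arithmetic, no framework needed)
scales = [60, 30, 15]          # feature map sizes for a 480x480 input (strides 8, 16, 32)
anchors_per_cell = 2           # assumed: 2 anchors per feature map location
num_anchors = sum(s * s * anchors_per_cell for s in scales)
print(num_anchors)             # 9450 = 60*60*2 + 30*30*2 + 15*15*2
for d, name in [(2, "class"), (4, "box"), (10, "landmark")]:
    print(name, (1, num_anchors, d))   # (1, 9450, 2) / (1, 9450, 4) / (1, 9450, 10)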
In the embodiment of the invention, a Ranger21 optimizer is introduced to replace the SGD optimizer when training the face detection model. The Ranger21 optimizer updates and calculates the network parameters that affect model training and model output so that they approach or reach optimal values, thereby minimizing (or maximizing) the loss function. The Ranger21 optimizer integrates a number of recent optimization ideas and uses the AdamW optimizer as its core (alternatively MadGrad); combined with the other components, it can significantly improve validation accuracy and training speed and produces a smoother training curve. The components of the Ranger21 optimizer include: adaptive gradient clipping, gradient centralization, positive-negative momentum, norm loss, stable weight decay, linear learning rate warm-up, explore-exploit learning rate scheduling, Lookahead, Softplus transformation and gradient normalization. Experimental data show that the Ranger21 optimizer achieves a better training and optimization effect than the SGD optimizer.
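A hypothetical training sketch is given below. It assumes the open-source "ranger21" Python package; the constructor arguments shown (lr, num_epochs, num_batches_per_epoch) follow that package but may differ between versions, and the model and loss are stand-ins for illustration only.

import torch
import torch.nn as nn
from ranger21 import Ranger21   # assumed third-party package, not part of the patent

model = nn.Conv2d(3, 64, 3, padding=1)           # stand-in for the face detection model
num_epochs, batches_per_epoch = 100, 500         # illustrative training schedule
optimizer = Ranger21(model.parameters(), lr=1e-3,
                     num_epochs=num_epochs,
                     num_batches_per_epoch=batches_per_epoch)

loss = model(torch.randn(2, 3, 480, 480)).mean() # placeholder loss for illustration only
loss.backward()
optimizer.step()
optimizer.zero_grad()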
Fig. 9 is a schematic block diagram of a face detection apparatus 200 according to an embodiment of the present invention. As shown in fig. 9, the present invention also provides a face detection apparatus 200 corresponding to the above face detection method. The face detection apparatus 200 includes means for performing the face detection method described above, and may be configured in an electronic device. Specifically, referring to fig. 9, the face detection apparatus 200 includes an extraction unit 201, an enhancement unit 202, an optimization unit 203, a classification unit 204, and a processing unit 205.
The extraction unit 201 is configured to acquire a face image to be detected, and input the face image into a feature extraction network in a face detection model to extract basic face features to obtain an initial face feature map, where the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network; the enhancement unit 202 is configured to input the initial face feature map into the improved PANet network, and perform feature enhancement processing on the initial face feature map to obtain an enhanced face feature map; the optimization unit 203 is configured to input the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map; the classification unit 204 is configured to input the optimized face feature map into the Head classification module for classification to obtain a classification result; and the processing unit 205 is configured to perform non-maximum suppression processing on the classification result to obtain a face detection result.
In some embodiments, for example, the extracting unit 201 includes an acquiring unit and a first calculating subunit.
The acquiring unit is configured to acquire the attention configuration information of the attention module in the inverted residual module; the first calculating subunit is configured to determine, based on the attention configuration information, whether the attention module is used in the inverted residual module, and to input the face image into the inverted residual module for inverted residual calculation to extract basic face features and obtain the initial face feature map.
In some embodiments, for example, the enhancement unit 202 includes a first sampling subunit, a second sampling subunit, a third sampling subunit, and a first serving subunit.
The first sampling subunit is configured to sample the third initial face feature map to obtain a first sampled face feature, input the second initial face feature map and the first sampled face feature into the middle layer feature network for convolution operation to obtain a convolution face feature, sample the convolution face feature to obtain a second sampled face feature, and input the first initial face feature map and the second sampled face feature into the bottom layer feature network for convolution operation to obtain a first enhanced face feature map; the second sampling subunit is configured to sample the first initial face feature map and the first enhanced face feature map to obtain a bottom layer sampling feature, and input the bottom layer sampling feature and the convolution face feature into the middle layer feature network for convolution operation to obtain a second enhanced face feature map; the third sampling subunit is configured to sample the second initial face feature map and the second enhanced face feature map to obtain a middle layer sampling feature, and input the middle layer sampling feature and the third initial face feature map into the top layer feature network for convolution operation to obtain a third enhanced face feature map; the first serving subunit is configured to take the first enhanced face feature map, the second enhanced face feature map and the third enhanced face feature map as the enhanced face feature maps.
In some embodiments, for example, the second sampling subunit includes a fourth sampling subunit, a fifth sampling subunit, a second serving subunit, and a second computing subunit.
The fourth sampling subunit is configured to perform bilinear downsampling on the first initial face feature map to obtain a third sampled face feature; the fifth sampling subunit is configured to perform bilinear downsampling on the first enhanced face feature map to obtain a fourth sampled face feature; the second serving subunit is configured to take the third sampled face feature and the fourth sampled face feature as the bottom layer sampling feature; and the second computing subunit is configured to input the bottom layer sampling feature and the convolution face feature into the middle layer feature network for convolution operation to obtain the second enhanced face feature map.
In some embodiments, for example, the third sampling subunit includes a sixth sampling subunit, a seventh sampling subunit, a third serving subunit, and a third computing subunit.
The sixth sampling subunit is configured to perform bilinear downsampling on the second initial face feature map to obtain a fifth sampled face feature; the seventh sampling subunit is configured to perform bilinear downsampling on the second enhanced face feature map to obtain a sixth sampled face feature; the third serving subunit is configured to take the fifth sampled face feature and the sixth sampled face feature as the middle layer sampling feature; and the third computing subunit is configured to input the middle layer sampling feature and the third initial face feature map into the top layer feature network for convolution operation to obtain the third enhanced face feature map.
In some embodiments, for example, the classification unit 204 includes a first classification subunit, a second classification subunit, a third classification subunit, and a fourth classification subunit.
The first classification subunit is configured to input the first optimized face feature map into the face frame regression module, the classification regression module and the face key point regression module of the first Head classification module for regression classification to obtain a first face frame regression classification result, a first regression classification result and a first face key point regression classification result; the second classification subunit is configured to input the second optimized face feature map into the face frame regression module, the classification regression module and the face key point regression module of the second Head classification module for regression classification to obtain a second face frame regression classification result, a second regression classification result and a second face key point regression classification result; the third classification subunit is configured to input the third optimized face feature map into the face frame regression module, the classification regression module and the face key point regression module of the third Head classification module for regression classification to obtain a third face frame regression classification result, a third regression classification result and a third face key point regression classification result; the fourth classification subunit is configured to take the first face frame regression classification result, the first regression classification result, the first face key point regression classification result, the second face frame regression classification result, the second regression classification result, the second face key point regression classification result, the third face frame regression classification result, the third regression classification result and the third face key point regression classification result as the classification results output by the Head classification module.
In some embodiments, for example, the processing unit 205 includes a first splicing subunit, a second splicing subunit, a third splicing subunit, and a first processing subunit.
The first splicing subunit is configured to splice the first face frame regression classification result, the second face frame regression classification result and the third face frame regression classification result to obtain a first prediction result; the second splicing subunit is configured to splice the first regression classification result, the second regression classification result and the third regression classification result to obtain a second prediction result; the third splicing subunit is configured to splice the first face key point regression classification result, the second face key point regression classification result, and the third face key point regression classification result to obtain a third prediction result; the first processing subunit is configured to perform non-maximum suppression processing on the first prediction result, the second prediction result, and the third prediction result to obtain a face detection result.
The face detection apparatus described above may be implemented in the form of a computer program that is executable on an electronic device as shown in fig. 10.
Referring to fig. 10, fig. 10 is a schematic block diagram of an electronic device according to an embodiment of the present invention. The electronic device 300 is an electronic device having a face detection function.
Referring to fig. 10, the electronic device 300 includes a processor 302, a memory, and a network interface 305, which are connected by a system bus 301, wherein the memory may include a non-volatile storage medium 303 and an internal memory 304.
The non-volatile storage medium 303 may store an operating system 3031 and a computer program 3032. The computer program 3032, when executed, may cause the processor 302 to perform a face detection method.
The processor 302 is used to provide computing and control capabilities to support the operation of the overall electronic device 300.
The internal memory 304 provides an environment for the execution of a computer program 3032 in the non-volatile storage medium 303, which computer program 3032, when executed by the processor 302, causes the processor 302 to perform a face detection method.
The network interface 305 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of the portion of the structure related to the solution of the present invention and does not limit the electronic device 300 to which the solution is applied; a particular electronic device 300 may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components.
It should be appreciated that in embodiments of the present invention, the processor 302 may be a central processing unit (Central Processing Unit, CPU); the processor 302 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, which is a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the steps of the above method embodiments.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program. The computer program, when executed by a processor, causes the processor to perform any of the embodiments of the face detection method described above.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The device embodiments described above are merely illustrative: the division of the units is only a logical function division, and there may be other division manners in actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed.
The order of the steps in the methods of the embodiments of the invention may be adjusted, and steps may be combined or deleted according to actual needs. The units in the devices of the embodiments of the invention may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing an electronic device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (10)

1. A face detection method, comprising:
acquiring a face image to be detected, and inputting the face image into a feature extraction network in a face detection model to extract basic features of a face to obtain an initial face feature map, wherein the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network;
inputting the initial face feature map into the improved PANet network, and carrying out feature enhancement processing on the initial face feature map to obtain an enhanced face feature map;
inputting the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map;
inputting the optimized face feature map into the Head classification module for classification to obtain a classification result;
and performing non-maximum suppression processing on the classification result to obtain a face detection result.
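Read as a data flow, the claimed method reduces to a short forward pass. The sketch below, in Python, assumes the trained sub-networks are available as callables; the function names, the use of three feature scales and the suppression helper are illustrative assumptions rather than details fixed by the claim.

import torch

def detect_faces(image, backbone, panet, cpm, heads, suppress_fn):
    # image: (1, 3, H, W) tensor of the face image to be detected.
    # backbone, panet, cpm and heads are assumed to be the feature extraction network,
    # the improved PANet network, the CPM detection module and the Head classification
    # modules of a trained face detection model; suppress_fn applies non-maximum suppression.
    with torch.no_grad():
        initial_maps = backbone(image)                                  # initial face feature maps
        enhanced_maps = panet(*initial_maps)                            # feature enhancement processing
        optimized_maps = [cpm(m) for m in enhanced_maps]                # CPM refinement per scale
        outputs = [head(m) for head, m in zip(heads, optimized_maps)]   # Head classification
        return suppress_fn(outputs)                                     # non-maximum suppression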
2. The face detection method according to claim 1, wherein the feature extraction network is a MobileNetV3 network, the MobileNetV3 network includes an inverted residual module, and the inputting the face image into a feature extraction network in a face detection model to extract basic features of a face to obtain an initial face feature map comprises:
acquiring attention configuration information of an attention module in the inverted residual module;
determining, based on the attention configuration information, whether the attention module is used in the inverted residual module, and inputting the face image into the inverted residual module for inverted residual calculation to extract the basic features of the face and obtain the initial face feature map.
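As a point of reference, a generic MobileNetV3-style inverted residual block in which the attention (squeeze-and-excitation) module can be switched on or off by a configuration flag might look as follows; the expansion width, activations and reduction ratio are assumptions for this example, not values specified by the claim.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    # A small channel attention module of the squeeze-and-excitation type.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class InvertedResidual(nn.Module):
    # Expand -> depthwise convolution -> optional attention -> project, with a skip
    # connection when the spatial size and channel count are preserved.
    def __init__(self, in_ch, expand_ch, out_ch, stride=1, use_attention=True):
        super().__init__()
        self.use_skip = stride == 1 and in_ch == out_ch
        layers = [
            nn.Conv2d(in_ch, expand_ch, 1, bias=False), nn.BatchNorm2d(expand_ch), nn.Hardswish(),
            nn.Conv2d(expand_ch, expand_ch, 3, stride=stride, padding=1, groups=expand_ch, bias=False),
            nn.BatchNorm2d(expand_ch), nn.Hardswish(),
        ]
        if use_attention:  # decided by the attention configuration information
            layers.append(SqueezeExcite(expand_ch))
        layers += [nn.Conv2d(expand_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out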
3. The face detection method according to claim 2, wherein the initial face feature map comprises a first initial face feature map, a second initial face feature map and a third initial face feature map; the improved PANet network comprises a bottom layer feature network, a middle layer feature network and a top layer feature network; and the inputting the initial face feature map into the improved PANet network and performing feature enhancement processing on the initial face feature map to obtain an enhanced face feature map comprises:
sampling the third initial face feature map to obtain a first sampled face feature, inputting the second initial face feature map and the first sampled face feature into the middle layer feature network for convolution operation to obtain a convolution face feature, sampling the convolution face feature to obtain a second sampled face feature, and inputting the first initial face feature map and the second sampled face feature into the bottom layer feature network for convolution operation to obtain a first enhanced face feature map;
sampling the first initial face feature map and the first enhanced face feature map to obtain a bottom layer sampling feature, and inputting the bottom layer sampling feature and the convolution face feature into the middle layer feature network for convolution operation to obtain a second enhanced face feature map;
sampling the second initial face feature map and the second enhanced face feature map to obtain a middle layer sampling feature, and inputting the middle layer sampling feature and the third initial face feature map into the top layer feature network for convolution operation to obtain a third enhanced face feature map;
and taking the first enhanced face feature map, the second enhanced face feature map and the third enhanced face feature map as the enhanced face feature map.
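One way the enhancement path of this claim could be realized is sketched below, assuming all three initial face feature maps share the same channel count; the fusion by channel concatenation, the interpolation modes and the 3x3 convolutions are assumptions for this example (the bilinear downsampling of the re-injected maps follows claims 4 and 5 below).

import torch
import torch.nn as nn
import torch.nn.functional as F

class PANetEnhancement(nn.Module):
    # Top-down pass produces the convolution face feature and the first enhanced map;
    # the bottom-up pass re-injects the initial maps via bilinear downsampling.
    def __init__(self, channels=64):
        super().__init__()
        self.mid_td = nn.Conv2d(2 * channels, channels, 3, padding=1)   # middle layer feature network (top-down)
        self.bottom = nn.Conv2d(2 * channels, channels, 3, padding=1)   # bottom layer feature network
        self.mid_bu = nn.Conv2d(3 * channels, channels, 3, padding=1)   # middle layer feature network (bottom-up)
        self.top = nn.Conv2d(3 * channels, channels, 3, padding=1)      # top layer feature network

    def forward(self, p1, p2, p3):
        # Sample the third initial map, fuse with the second -> convolution face feature.
        up3 = F.interpolate(p3, size=p2.shape[-2:], mode='nearest')
        conv_feat = self.mid_td(torch.cat([p2, up3], dim=1))
        # Sample the convolution face feature, fuse with the first initial map -> first enhanced map.
        up2 = F.interpolate(conv_feat, size=p1.shape[-2:], mode='nearest')
        e1 = self.bottom(torch.cat([p1, up2], dim=1))
        # Bottom layer sampling feature: bilinearly downsample p1 and e1, fuse in the middle layer network.
        d1 = F.interpolate(p1, size=p2.shape[-2:], mode='bilinear', align_corners=False)
        d_e1 = F.interpolate(e1, size=p2.shape[-2:], mode='bilinear', align_corners=False)
        e2 = self.mid_bu(torch.cat([conv_feat, d1, d_e1], dim=1))
        # Middle layer sampling feature: bilinearly downsample p2 and e2, fuse with p3 in the top layer network.
        d2 = F.interpolate(p2, size=p3.shape[-2:], mode='bilinear', align_corners=False)
        d_e2 = F.interpolate(e2, size=p3.shape[-2:], mode='bilinear', align_corners=False)
        e3 = self.top(torch.cat([p3, d2, d_e2], dim=1))
        return e1, e2, e3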
4. The face detection method according to claim 3, wherein the sampling the first initial face feature map and the first enhanced face feature map to obtain a bottom layer sampling feature, and inputting the bottom layer sampling feature and the convolution face feature into the middle layer feature network for convolution operation to obtain a second enhanced face feature map comprises:
performing bilinear sampling and downsampling on the first initial face feature map to obtain a third sampled face feature;
performing bilinear sampling and downsampling on the first enhanced face feature map to obtain a fourth sampled face feature;
taking the third sampled face feature and the fourth sampled face feature as the bottom layer sampling feature;
and inputting the bottom layer sampling feature and the convolution face feature into the middle layer feature network for convolution operation to obtain the second enhanced face feature map.
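The downsample-and-fuse step shared by this claim and claim 5 can be factored into one small helper, sketched below; the concatenation-based fusion and the convolution layer passed in are assumptions for this example.

import torch
import torch.nn.functional as F

def downsample_and_fuse(initial_map, enhanced_map, lateral_feature, conv_layer):
    # Bilinearly downsample the initial and enhanced face feature maps of the finer level
    # to the resolution of the lateral feature, then fuse the three by convolution.
    size = lateral_feature.shape[-2:]
    a = F.interpolate(initial_map, size=size, mode='bilinear', align_corners=False)
    b = F.interpolate(enhanced_map, size=size, mode='bilinear', align_corners=False)
    return conv_layer(torch.cat([lateral_feature, a, b], dim=1))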
5. The face detection method according to claim 4, wherein the sampling the second initial face feature map and the second enhanced face feature map to obtain a middle layer sampling feature, and inputting the middle layer sampling feature and the third initial face feature map into the top layer feature network for convolution operation to obtain a third enhanced face feature map comprises:
performing bilinear sampling and downsampling on the second initial face feature map to obtain a fifth sampled face feature;
performing bilinear sampling and downsampling on the second enhanced face feature map to obtain a sixth sampled face feature;
taking the fifth sampled face feature and the sixth sampled face feature as the middle layer sampling feature;
and inputting the middle layer sampling feature and the third initial face feature map into the top layer feature network for convolution operation to obtain the third enhanced face feature map.
6. The face detection method according to claim 5, wherein the optimized face feature map comprises a first optimized face feature map, a second optimized face feature map and a third optimized face feature map; the Head classification module comprises a first Head classification module, a second Head classification module and a third Head classification module, each of which comprises a face frame regression module, a classification regression module and a face key point regression module; and the inputting the optimized face feature map into the Head classification module for classification to obtain a classification result comprises:
inputting the first optimized face feature map into the face frame regression module, the classification regression module and the face key point regression module of the first Head classification module for regression classification to obtain a first face frame regression classification result, a first regression classification result and a first face key point regression classification result;
inputting the second optimized face feature map into the face frame regression module, the classification regression module and the face key point regression module of the second Head classification module for regression classification to obtain a second face frame regression classification result, a second regression classification result and a second face key point regression classification result;
inputting the third optimized face feature map into the face frame regression module, the classification regression module and the face key point regression module of the third Head classification module for regression classification to obtain a third face frame regression classification result, a third regression classification result and a third face key point regression classification result;
and taking the first face frame regression classification result, the first regression classification result, the first face key point regression classification result, the second face frame regression classification result, the second regression classification result, the second face key point regression classification result, the third face frame regression classification result, the third regression classification result and the third face key point regression classification result as the classification result output by the Head classification module.
7. The face detection method of claim 6, wherein the performing non-maximum suppression processing on the classification result to obtain a face detection result includes:
splicing the first face frame regression classification result, the second face frame regression classification result and the third face frame regression classification result to obtain a first prediction result;
splicing the first regression classification result, the second regression classification result and the third regression classification result to obtain a second prediction result;
splicing the first face key point regression classification result, the second face key point regression classification result and the third face key point regression classification result to obtain a third prediction result;
and performing non-maximum suppression processing on the first prediction result, the second prediction result and the third prediction result to obtain a face detection result.
8. A face detection apparatus, comprising:
the extraction unit is used for acquiring a face image to be detected, and inputting the face image into a feature extraction network in a face detection model to extract basic features of a face to obtain an initial face feature map, wherein the face detection model comprises an improved PANet network, a CPM detection module, a Head classification module and the feature extraction network;
the enhancement unit is used for inputting the initial face feature map into the improved PANet network and performing feature enhancement processing on the initial face feature map to obtain an enhanced face feature map;
the optimization unit is used for inputting the enhanced face feature map into the CPM detection module for operation to obtain an optimized face feature map;
the classifying unit is used for inputting the optimized face feature map into the Head classifying module to classify to obtain a classifying result;
and the processing unit is used for carrying out non-maximum value inhibition processing on the classification result to obtain a face detection result.
9. An electronic device comprising a memory and a processor, the memory having a computer program stored thereon, the processor implementing the method of any of claims 1-7 when executing the computer program.
10. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202310560226.6A 2023-05-17 2023-05-17 Face detection method and device, electronic equipment and storage medium Pending CN116665266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310560226.6A CN116665266A (en) 2023-05-17 2023-05-17 Face detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116665266A true CN116665266A (en) 2023-08-29

Family

ID=87719894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310560226.6A Pending CN116665266A (en) 2023-05-17 2023-05-17 Face detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116665266A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination