CN111178183B - Face detection method and related device - Google Patents

Face detection method and related device

Info

Publication number
CN111178183B
Authority
CN
China
Prior art keywords
feature
attention module
feature map
image
face detection
Prior art date
Legal status
Active
Application number
CN201911296824.7A
Other languages
Chinese (zh)
Other versions
CN111178183A (en)
Inventor
吴伟华
康春生
曾儿孟
郭云
Current Assignee
SHENZHEN HARZONE TECHNOLOGY CO LTD
Original Assignee
SHENZHEN HARZONE TECHNOLOGY CO LTD
Priority date
Filing date
Publication date
Application filed by SHENZHEN HARZONE TECHNOLOGY CO LTD filed Critical SHENZHEN HARZONE TECHNOLOGY CO LTD
Priority to CN201911296824.7A priority Critical patent/CN111178183B/en
Publication of CN111178183A publication Critical patent/CN111178183A/en
Application granted granted Critical
Publication of CN111178183B publication Critical patent/CN111178183B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The embodiment of the application discloses a face detection method and a related device, applied to an electronic device in which a face detection model is preconfigured, the face detection model comprising a pyramid model, a dual-attention module and a multi-task loss function. The method comprises the following steps: acquiring a target face image; inputting the target face image into the pyramid model to obtain a plurality of first feature maps with different scales; inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module; performing feature fusion on the plurality of second feature maps to obtain an intermediate feature map; and inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, each task corresponding to a task label. By adopting this face detection method and device, face detection accuracy can be improved.

Description

Face detection method and related device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a face detection method and a related device.
Background
Face detection is one of the most extensively studied sub-directions of target detection. It has strong application value in security monitoring, face-to-ID comparison, human-computer interaction, social interaction, entertainment and the like, and is also the first step of any face recognition algorithm. Face detection specifically refers to the process of scanning a given picture or video with a certain strategy, judging whether faces exist and, if so, locating the position, size and pose of each face. However, faces exhibit complicated variations in detail, occlusion, and external conditions such as illumination and contrast, all of which reduce face detection precision.
Disclosure of Invention
The embodiment of the application provides a face detection method and a related device, which can improve face detection precision.
In a first aspect, an embodiment of the present application provides a face detection method, applied to an electronic device, where a face detection model is preconfigured in the electronic device, the face detection model includes a pyramid model, a dual-attention module, and a multi-task loss function, and the method includes:
acquiring a target face image;
inputting the target face image into the pyramid model to obtain a plurality of first feature images with different scales;
Inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map added with attention;
performing feature fusion on the plurality of second feature images to obtain an intermediate feature image;
and inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task label.
In a second aspect, an embodiment of the present application provides a face detection apparatus, which is applied to an electronic device, where a face detection model is preconfigured in the electronic device, where the face detection model includes a pyramid model, a dual-attention module, and a multi-task loss function, and the apparatus includes: the device comprises an acquisition unit, a multi-scale decomposition unit, an input unit, a fusion unit and a detection unit, wherein,
the acquisition unit is used for acquiring a target face image;
the multi-scale decomposition unit is used for inputting the target face image into the pyramid model to obtain a plurality of first feature images with different scales;
The input unit is used for inputting each of the plurality of first feature images into the dual-attention module for operation to obtain a plurality of second feature images, the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature image is a feature image added with attention;
the fusion unit is used for carrying out feature fusion on the plurality of second feature images to obtain an intermediate feature image;
the detection unit is used for inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task label.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the first aspect of the embodiment of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program causes a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
By implementing the embodiment of the application, the following beneficial effects are achieved:
It can be seen that the face detection method and related device described in the embodiments of the present application are applied to an electronic device in which a face detection model is preconfigured, the face detection model comprising a pyramid model, a dual-attention module and a multi-task loss function. A target face image is obtained and input into the pyramid model to obtain a plurality of first feature maps with different scales; each of the plurality of first feature maps is input into the dual-attention module for operation to obtain a plurality of second feature maps, the dual-attention module comprising a spatial attention module and a channel attention module, each second feature map being a feature map with added attention; feature fusion is performed on the plurality of second feature maps to obtain an intermediate feature map; and the intermediate feature map is input into the multi-task loss function to obtain a target face detection result, the multi-task loss function comprising a plurality of tasks, each task corresponding to a task tag. In this way, a pyramid network model fusing attention modules is constructed, each branch in the structure adopts the dual-attention module to enhance feature representation, and all features of different scales are then fused; finally, a multi-task loss function is adopted to classify and regress the face, so that the accuracy of face detection can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1A is a schematic flow chart of a face detection method according to an embodiment of the present application;
fig. 1B is a schematic flow chart of another face detection method according to an embodiment of the present application;
FIG. 1C is a schematic diagram of an attention module according to an embodiment of the present application;
fig. 2 is a flowchart of another face detection method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of another electronic device according to an embodiment of the present application;
fig. 4A is a functional unit composition block diagram of a face detection apparatus provided in an embodiment of the present application;
fig. 4B is a functional unit block diagram of another face detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The electronic device described in the embodiments of the present application may include a smart phone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a palmtop computer, a vehicle event recorder, a traffic guidance platform, a server, a notebook computer, a mobile internet device (MID, Mobile Internet Device), or a wearable device (such as a smart watch or a Bluetooth headset). These are merely examples; the electronic device may also be, for example, a video matrix, and is not limited to the devices listed above.
The embodiments of the present application are described in detail below.
In the related art, conventional face detection methods mainly rely on hand-designed features. Such features are low-level features, which greatly limits what they can achieve on the face detection problem. Deep learning has inherent advantages in extracting high-level semantic features and can learn effective feature representations from big data for specific tasks. However, most deep-learning-based face detection considers only global features of the face and neglects the importance of local features. Because different faces are consistent in overall form and structure, while faces of the same kind can still differ greatly, the probability of false recognition is high. Therefore, in the face detection problem, detail texture features play a leading role compared with characteristics such as morphology and outline.
In addition, many face detection methods have achieved significant results, for example by using a more powerful backbone model for detection, combining features from multiple detection feature maps with a feature pyramid architecture, or designing denser anchors and exploiting larger context information. These methods and techniques have been shown to build powerful face detectors that approach human-level performance on most images. However, an obvious performance gap remains: for small-scale, blurred or partially occluded face images in particular, missed detections and false detections easily occur, which affects face detection accuracy and thereby further affects the accuracy of face recognition and retrieval.
Aiming at the problem that blurred and occluded faces are prone to missed detection and false detection, the embodiment of the application provides a face detection method applied to an electronic device, wherein a face detection model is preconfigured in the electronic device, the face detection model comprises a pyramid model, a dual-attention module and a multi-task loss function, and the method comprises the following steps:
acquiring a target face image;
inputting the target face image into the pyramid model to obtain a plurality of first feature images with different scales;
inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map added with attention;
performing feature fusion on the plurality of second feature images to obtain an intermediate feature image;
and inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task label.
In the embodiment of the application, a pyramid network model fusing attention modules is constructed, each branch in the structure adopts a dual-attention module to enhance feature representation, and all features of different scales are then fused; finally, a multi-task loss function is adopted to classify and regress the face, so that the accuracy of face detection can be improved.
Referring to fig. 1A, fig. 1A is a flow chart of a face detection method provided in an embodiment of the present application, which is applied to an electronic device, where a face detection model is preconfigured in the electronic device, and the face detection model includes a pyramid model, a dual-attention module, and a multi-task loss function, as shown in the figure, the face detection method includes:
101. Acquiring a target face image.
The target face image may be an image that includes a target face, or an image that includes only a target face. A face detection model is preconfigured in the electronic device; the face detection model may include a pyramid model, a dual-attention module and a multi-task loss function, where the pyramid model may be used to implement multi-scale decomposition, the dual-attention module may be used to enhance feature saliency, and the multi-task loss function implements face detection. The multi-task loss function includes a plurality of loss functions, each loss function corresponds to a task, and each task corresponds to a task label. A task may be at least one of the following: whether a face is detected, the face type, the face frame, the face attributes, and the like, which are not limited herein. The face type may be at least one of the following: square faces, round faces, oval faces, etc., which are not limited herein. The face attribute may be at least one of the following: a male face, a female face, a child's face, a large face, a Chinese face, a foreign face, etc., without limitation.
In one possible example, the step 101 of acquiring the target face image includes the following steps:
11. acquiring a target environment parameter;
12. determining a target shooting parameter corresponding to the target environmental parameter according to a mapping relation between a preset environmental parameter and the shooting parameter;
13. shooting the target face according to the target shooting parameters to obtain a first image;
14. and carrying out image segmentation on the first image to obtain the target face image.
In this embodiment of the present application, the environmental parameter may be at least one of the following: ambient light, weather, temperature, humidity, geographical location, magnetic field disturbance intensity, etc., without limitation. The shooting parameter may be at least one of the following: ISO sensitivity, exposure time, white balance parameter, shooting mode, color temperature, and the like, which are not limited herein. The environmental parameter may be collected by an environmental sensor, which may be at least one of the following: an ambient light sensor, a weather sensor, a temperature sensor, a humidity sensor, a positioning sensor, a magnetic field detection sensor, and the like, without limitation. The mapping relation between preset environmental parameters and shooting parameters can be stored in the electronic device in advance.
In a specific implementation, the electronic device can acquire the target environmental parameter and determine the corresponding target shooting parameter according to the mapping relation between preset environmental parameters and shooting parameters. The target face can then be shot with the target shooting parameters to obtain a first image, and the first image is segmented to obtain the target face image. In this way, a shot image suited to the environment can be obtained, and an image containing only the target face can be extracted from it, which helps improve subsequent face detection precision.
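As a minimal sketch of how such a preset mapping could be held in memory, the table below keys shooting parameters by an ambient-light level; every parameter name and value here is a hypothetical assumption for illustration, not taken from the patent:

```python
# Hypothetical sketch of a preset environment-to-shooting-parameter mapping.
# All keys and values are illustrative assumptions.
PRESET_MAPPING = {
    "low":    {"iso": 800, "exposure_ms": 33, "white_balance": "incandescent"},
    "medium": {"iso": 200, "exposure_ms": 16, "white_balance": "auto"},
    "high":   {"iso": 100, "exposure_ms": 8,  "white_balance": "daylight"},
}

def shooting_parameters_for(ambient_light: str) -> dict:
    """Look up the target shooting parameters for the measured ambient light."""
    return PRESET_MAPPING.get(ambient_light, PRESET_MAPPING["medium"])
```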
Between the above steps 13 to 14, the method may further include the following steps:
a1, determining an image quality evaluation value of the first image;
a2, performing image enhancement processing on the first image when the image quality evaluation value is lower than a preset threshold value;
in the step 14, the image segmentation is performed on the first image to obtain the target face image, specifically:
and carrying out image segmentation on the first image after the image enhancement processing to obtain the target face image.
In a specific implementation, at least one image quality evaluation index may be used to perform image quality evaluation on the image, where the image quality evaluation index may be at least one of the following: average luminance, sharpness, entropy, etc., are not limited herein. The image enhancement algorithm may be at least one of: wavelet transformation, image sharpening, gray stretching, histogram equalization, etc., are not limited herein.
In a specific implementation, the electronic device may determine an image quality evaluation value of the first image. When the image quality evaluation value is lower than a preset threshold, image enhancement processing is performed on the first image, and the enhanced first image is segmented to obtain the target face image; otherwise, when the image quality evaluation value is greater than or equal to the preset threshold, the first image is segmented directly to obtain the target face image. In this way, image segmentation accuracy can be improved, which facilitates subsequent face detection.
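A minimal sketch of this quality gate is shown below, assuming (as the text allows but does not mandate) that average luminance and sharpness are the evaluation indices and histogram equalization is the enhancement algorithm; the score weights and threshold are illustrative:

```python
import cv2
import numpy as np

def quality_score(img_bgr: np.ndarray) -> float:
    """Combine average luminance and Laplacian sharpness into one evaluation value.
    The choice of indices and the 50/50 weighting are assumptions."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    brightness = gray.mean() / 255.0                   # average luminance in [0, 1]
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of the Laplacian
    return 0.5 * brightness + 0.5 * min(sharpness / 1000.0, 1.0)

def maybe_enhance(img_bgr: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """Enhance only when the evaluation value falls below the preset threshold."""
    if quality_score(img_bgr) >= threshold:
        return img_bgr
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])    # equalize the luma channel
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```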
Further, in one possible example, the step A2 of performing image enhancement processing on the first image may include the following steps:
a21, dividing the first image into a plurality of areas;
a22, determining a definition value of each region in the plurality of regions to obtain a plurality of definition values;
a23, selecting a definition value lower than a preset definition value from the definition values, and acquiring a corresponding region to obtain at least one target region;
a24, determining the distribution density of the feature points corresponding to each region in the at least one target region to obtain at least one distribution density of the feature points;
A25, determining a feature point distribution density level corresponding to the at least one feature point distribution density to obtain at least one feature point distribution density level;
A26, determining a target image enhancement algorithm corresponding to the at least one feature point distribution density level according to a mapping relation between preset feature point distribution density levels and image enhancement algorithms;
and A27, performing image enhancement processing on the corresponding target region according to the target image enhancement algorithm corresponding to the at least one feature point distribution density level, to obtain the first image after image enhancement processing.
The preset definition value can be set by a user or default by the system. The mapping relation between the preset characteristic point distribution density level and the image enhancement algorithm can be stored in the electronic equipment in advance, and the image enhancement algorithm can be at least one of the following: wavelet transformation, image sharpening, gray stretching, histogram equalization, etc., are not limited herein.
In a specific implementation, the electronic device may divide the first image into a plurality of regions, where the regions may have equal or different areas. It may then determine a sharpness value for each of the regions to obtain a plurality of sharpness values, select the sharpness values lower than a preset sharpness value, and take the corresponding regions to obtain at least one target region. It may further determine the feature point distribution density of each region in the at least one target region to obtain at least one feature point distribution density, where each region corresponds to one feature point distribution density and, for one region, feature point distribution density = total number of feature points / region area. The electronic device may also store a mapping relation between feature point distribution density and feature point distribution density level in advance, and determine, according to this mapping relation, the level corresponding to each of the at least one feature point distribution density, obtaining at least one feature point distribution density level.
Further, the electronic device may determine the target image enhancement algorithm corresponding to each level in the at least one feature point distribution density level according to the mapping relation between preset feature point distribution density levels and image enhancement algorithms, and perform image enhancement on the corresponding target region with that algorithm to obtain the first image after image enhancement processing. This prevents over-enhancement of regions whose image quality is already good; since different regions are likely to have different image quality, enhancement can be applied in a targeted manner, further improving image quality.
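The per-region statistics can be sketched as follows; the grid size, the use of ORB keypoints as "feature points", and the density cut-off are all assumptions made for illustration:

```python
import cv2
import numpy as np

def region_stats(img_gray: np.ndarray, rows: int = 4, cols: int = 4):
    """Split the image into a grid and return, per region, its bounds,
    sharpness (variance of Laplacian) and feature point distribution
    density (keypoint count / region area)."""
    orb = cv2.ORB_create()
    h, w = img_gray.shape
    stats = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            region = img_gray[y0:y1, x0:x1]
            sharpness = cv2.Laplacian(region, cv2.CV_64F).var()
            density = len(orb.detect(region, None)) / float(region.size)
            stats.append(((y0, y1, x0, x1), sharpness, density))
    return stats

def enhance_region(region: np.ndarray, density: float) -> np.ndarray:
    """Pick an enhancement by density level (the mapping is hypothetical)."""
    if density > 1e-3:  # dense texture: sharpen
        kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
        return cv2.filter2D(region, -1, kernel)
    return cv2.equalizeHist(region)  # sparse texture: histogram equalization
```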
102. Inputting the target face image into the pyramid model to obtain a plurality of first feature maps with different scales.
The electronic device may input the target face image into the pyramid model to obtain a plurality of first feature maps with different scales; that is, each first feature map corresponds to one scale, so the resolutions of the plurality of first feature maps differ.
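Since the concrete structure described later uses ResNet-50 as the reference backbone, a minimal sketch of extracting such multi-scale first feature maps could look like this (the choice of stages is an assumption):

```python
import torch
import torchvision

# Sketch only: a ResNet-50 backbone whose intermediate stages supply the
# multi-scale first feature maps of the pyramid model.
backbone = torchvision.models.resnet50(weights=None)

def pyramid_features(x: torch.Tensor):
    """Return feature maps at several scales for an input image batch."""
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)   # stride 4
    c3 = backbone.layer2(c2)  # stride 8
    c4 = backbone.layer3(c3)  # stride 16
    c5 = backbone.layer4(c4)  # stride 32
    return [c2, c3, c4, c5]
```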
103. Inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map with added attention.
In particular implementations, the dual-attention module may include a spatial attention module and a channel attention module. The electronic device inputs each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps; specifically, the electronic device can prioritize the first feature map with the smallest scale. Each second feature map is a feature map to which the attention module has been applied.
In a possible example, the step 103, inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps may include the following steps:
B1, performing a deconvolution operation on the first feature map of the ith layer to obtain a deconvolution operation result, wherein the first feature map of the ith layer is not the first feature map with the largest scale among the plurality of first feature maps;
B2, adjusting, according to the feature map after the deconvolution operation, the model parameters of the dual-attention module applied to the first feature map of the layer above the ith layer;
and B3, operating on that upper-layer first feature map with the adjusted dual-attention module to obtain a second feature map.
In this embodiment of the present application, the attention module (sa+ca) may be incorporated into the pyramid network structure (pyramid model), and then the attention module of the upper layer is supervised by the features of the lower layer. The values of alpha and beta are updated by adjusting the relationship between the underlying features and the features after spatial attention and channel attention are added, and cosine similarity calculation is used here, as shown in fig. 1B, taking the 6-layer scale as an example, and the feature vector obtained by deconvolution processing of the i (i=3, 4,5, 6) th layer is
Figure BDA0002320786690000091
The feature vector after spatial attention processing is +.>
Figure BDA0002320786690000092
The feature vector after the channel attention processing is +.>
Figure BDA0002320786690000093
Then it is possible to obtain:
Figure BDA0002320786690000094
because the local context information is fused in the attention module, the global context information of the bottom layer and the features of the upper layer is also connected in the pyramid network structure, and the combination not only fully utilizes the context information of the local and global features, but also does not add excessive additional parameters in the network layer.
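A sketch of this supervision step, under the assumption that α and β are simply set to the batch-averaged cosine similarities of the flattened feature maps:

```python
import torch
import torch.nn.functional as F

def update_alpha_beta(f_deconv: torch.Tensor,
                      f_sa: torch.Tensor,
                      f_ca: torch.Tensor):
    """Set alpha/beta to the cosine similarity between the deconvolved
    lower-layer features and the spatially / channel-wise attended
    upper-layer features (all flattened to one vector per sample)."""
    v = f_deconv.flatten(1)  # (N, C*H*W)
    alpha = F.cosine_similarity(v, f_sa.flatten(1), dim=1).mean()
    beta = F.cosine_similarity(v, f_ca.flatten(1), dim=1).mean()
    return alpha, beta
```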
In a possible example, the step 103, inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps may include the following steps:
31. determining spatial features and channel features of a jth first feature map, wherein the jth first feature map is any one of the plurality of first feature maps;
32. carrying out softmax operation on the spatial features to obtain spatial feature weights;
33. carrying out softmax operation on the channel characteristics to obtain channel characteristic weights;
34. performing mul operation on the spatial feature weight and the jth first feature map to obtain an intermediate spatial feature;
35. performing mul operation on the channel feature weight and the jth first feature map to obtain an intermediate channel feature;
36. And performing mul operation on the intermediate space feature, the intermediate channel feature and the jth first feature map to obtain a second feature map corresponding to the jth first feature map.
In a specific implementation, taking a jth first feature map as an example, the jth first feature map is any one of the plurality of first feature maps. The electronic device may determine a spatial feature and a channel feature of a jth first feature map, where the jth first feature map is any one of the first feature maps, and further on one hand, may perform a softmax operation on the spatial feature to obtain a spatial feature weight, and on the other hand, may perform a softmax operation on the channel feature to obtain a channel feature weight, further, perform a mul operation on the spatial feature weight and the jth first feature map to obtain an intermediate spatial feature, and perform a mul operation on the channel feature weight and the jth first feature map to obtain an intermediate channel feature, and finally, perform a mul operation on the intermediate spatial feature, the intermediate channel feature, and the jth first feature map to obtain a second feature map corresponding to the jth first feature map.
As shown in fig. 1C, the fused spatial and channel attention module (SA+CA) selectively aggregates the features at each location through a weighted sum of features, since similar features are related to each other regardless of distance. The channel attention module selectively emphasizes interdependent channel maps by integrating the correlated features among all channel maps. Because each channel corresponds to certain semantic information, channels attending to similar positions are fused to form masks, so that each mask focuses on one consistent region rather than scattering attention over different regions. Finally, the two attention modules are fused with the original feature module to further enhance the feature representation.
Specifically, let F, Fsa, Fca and F' respectively denote the input feature vector (the first feature map), the feature vector after spatial attention processing (the intermediate spatial feature), the feature vector after channel attention processing (the intermediate channel feature), and the final output vector (the second feature map), so that:
F'=F+αFsa+βFca
Fsa=F×S
Fca=C×F
where S represents the spatial feature weight and C represents the channel feature weight.
The spatial feature weight is composed of a number of small-block spatial attention coefficients, where s_i is a spatial attention coefficient; likewise, the channel feature weight is composed of a number of small-block channel attention coefficients, where c_i is a channel attention coefficient. S_i and C_i denote the values at the ith block of S and C respectively and are obtained by the softmax operation:

S_i = e^(s_i) / Σ_j e^(s_j)

C_i = e^(c_i) / Σ_j e^(c_j)

where e is a constant (the base of the natural exponential).
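The following is a minimal sketch of this dual-attention computation in the form F' = F + αFsa + βFca; the 1×1 convolution producing the raw spatial coefficients and the global pooling producing the raw channel coefficients are assumptions about details the description leaves open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Sketch of the fused spatial + channel attention: F' = F + a*Fsa + b*Fca."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)  # raw s_i per location
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # spatial weights S: softmax over all H*W positions
        s = self.spatial(x).view(n, -1)
        S = F.softmax(s, dim=1).view(n, 1, h, w)
        f_sa = x * S                               # mul: Fsa = F x S
        # channel weights C: softmax over the globally pooled channels
        c_raw = x.mean(dim=(2, 3))                 # (N, C)
        C = F.softmax(c_raw, dim=1).view(n, c, 1, 1)
        f_ca = C * x                               # mul: Fca = C x F
        return x + self.alpha * f_sa + self.beta * f_ca
```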
104. Performing feature fusion on the plurality of second feature maps to obtain an intermediate feature map.
In a specific implementation, the electronic device may perform feature fusion on the plurality of second feature maps to obtain an intermediate feature map, for example by applying to the plurality of second feature maps the inverse transformation corresponding to the multi-scale decomposition of the pyramid model.
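One plausible fusion, sketched under the assumption that the second feature maps have already been projected to a common channel width, is to upsample everything to the finest scale and sum:

```python
import torch
import torch.nn.functional as F

def fuse_features(feature_maps):
    """Sketch: upsample every attended map to the finest resolution and sum.
    Assumes all maps share one channel count (e.g. after 1x1 convolutions)."""
    target = feature_maps[0].shape[-2:]  # finest spatial size first in the list
    resized = [F.interpolate(f, size=target, mode="nearest") for f in feature_maps]
    return torch.stack(resized).sum(dim=0)
```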
105. Inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task label.
In the specific implementation, the electronic device can input an image to be detected, and perform face detection through a test network by using the trained model to obtain information such as face frame coordinates and confidence level.
Further, the obtained face frame coordinates and confidence information are screened with non-maximum suppression and similar methods, and the regions containing faces are finally output on the original image.
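This screening step can be sketched with torchvision's standard NMS; the IoU and score thresholds are illustrative assumptions:

```python
import torch
import torchvision

def filter_faces(boxes: torch.Tensor, scores: torch.Tensor,
                 iou_threshold: float = 0.5, score_threshold: float = 0.8):
    """Suppress overlapping face boxes, keep high-confidence detections.
    boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) confidences."""
    keep = torchvision.ops.nms(boxes, scores, iou_threshold)
    keep = keep[scores[keep] >= score_threshold]
    return boxes[keep], scores[keep]
```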
In this embodiment of the present application, as shown in fig. 1B, the specific structure of the face recognition model may be that the pyramid model uses a pyramid structure with a Resnet50 as a reference, and each branch in the structure adopts a module for fusing a space and a channel to enhance feature representation; then, the features of the attention module are adjusted through the bottom layer features; then fusing all the features with different scales; finally, a multitask loss function is adopted to realize the classification of the human face and the regression of the human face frame.
In a possible example, before the step 101, the following steps may be further included:
c1, acquiring a training sample set, wherein the training sample set comprises a plurality of training samples;
c2, dividing the training sample set into a plurality of training sets, wherein each training set corresponds to a complexity level;
And C3, training a preset face detection model according to the training sets to obtain the face detection model.
The preset face detection model can be stored in the electronic device in advance and can be set by the user or by system default; the preset face detection model includes a pyramid model, a dual-attention module and a multi-task loss function. In a specific implementation, the training sample set may include a plurality of training samples. In this embodiment, the electronic device may classify the plurality of training samples by complexity, divide the training sample set into a plurality of training sets, each corresponding to a complexity level, and train the preset face detection model with the plurality of training sets to obtain the face detection model, which can improve the robustness of the face detection model.
Further, in one possible example, the step C2 of dividing the training sample set into a plurality of training sets may include the following steps:
c21, determining multiple types of features in a sample a, wherein each type of feature corresponds to one quantity and weight value, and the sample a is any sample in the training sample set;
c22, calculating according to the quantity and the weight value corresponding to each type of characteristics in the multiple types of characteristics to obtain target complexity;
And C23, determining a target training set corresponding to the sample a according to a mapping relation between the preset complexity and class labels of the training set.
The mapping relation between preset complexity and class labels of the training sets may be stored in the electronic device in advance. In a specific implementation, the features may include the following classes: occluded faces, blurred faces, small-scale faces, large-scale faces, and the like, which are not limited herein. A scale threshold can be stored in the electronic device; a face smaller than the scale threshold is regarded as a small-scale face, and otherwise as a large-scale face.
Specifically, the electronic device may determine multiple types of features in the sample a, where each type of feature corresponds to a number and a weight value, where the sample a is any sample in the training sample set, and calculate according to one number and a weight value corresponding to each type of feature in the multiple types of features to obtain a target complexity, where the weight value corresponding to each type of feature may be preset, and determine a target training set corresponding to the sample a according to a mapping relationship between the preset complexity and class labels of the training set.
In addition, not all training images need to be treated equally in the training process: easily recognized simple images contribute little to training a more powerful face detector. The present application therefore sets a sample complexity evaluation model in the sample preprocessing stage, classifies images by complexity, and augments difficult samples, so that training focuses on face images of higher complexity.
In a specific implementation, in the data preprocessing stage, the electronic device can augment the image data to improve the recognition and generalization capability of the network model, enhancing images by mirroring, random cropping, scaling and other methods, and design a difficult-sample complexity evaluation model. In the data processing stage, a complexity threshold Ω is set mainly according to the proportional relation among the numbers of small-scale, blurred and occluded faces in the image; an image with complexity higher than the threshold is labeled high-complexity, otherwise low-complexity, and each training image label is then assigned a complexity. Complexity is dynamically assigned to training images in mid-training iterations, which makes it possible to judge whether an image has already been well detected or can still help further training; images that are not yet perfectly detected can thus be fully used to facilitate later learning. Let the numbers of small-scale, blurred, occluded and large-scale faces in an image be N_sc, N_blur, N_c and N_bc, and let ω_sc, ω_blur, ω_c and ω_bc be the weights of the small-scale, blur, occlusion and large-scale factors respectively; the complexity of each image can then be expressed as:

Complexity = ω_sc · N_sc + ω_blur · N_blur + ω_c · N_c + ω_bc · N_bc
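A sketch of this evaluation, with illustrative weight values and threshold (the description fixes the formula's shape but not its constants):

```python
def image_complexity(n_sc: int, n_blur: int, n_c: int, n_bc: int,
                     w_sc: float = 0.4, w_blur: float = 0.3,
                     w_c: float = 0.2, w_bc: float = 0.1,
                     omega: float = 1.0):
    """Weighted count of small-scale, blurred, occluded and large-scale faces,
    compared against the complexity threshold omega (all constants assumed)."""
    score = w_sc * n_sc + w_blur * n_blur + w_c * n_c + w_bc * n_bc
    return score, ("high" if score > omega else "low")
```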
based on the embodiment of the application, the following advantages are provided:
(1) Difficult-sample complexity mechanism: a complexity evaluation model is constructed from the relations among occluded, blurred, small-scale and large-scale faces in the image to evaluate image complexity; the model is then used in training and learning, dynamically assigning complexity to training images in the middle stage of training, thereby making full use of images that are not yet perfectly detected.
(2) Attention module fusing space and channel: not only are the features of each location selectively aggregated through a weighted sum of the spatial features, but the correlated features among all channel maps are integrated to selectively emphasize interdependent channel maps; fusing the spatial and channel attention methods further enhances the feature expression.
(3) Pyramid network model of fused attention module: the attention module is fused into the pyramid structure, the bottom layer features are used for supervising the features of the upper layer processed by the attention module in a semi-supervision mode, and the weights of the space and the channel attention module are adjusted.
It can be seen that the face detection method described in the embodiment of the present application is applied to an electronic device in which a face detection model is preconfigured, the face detection model comprising a pyramid model, a dual-attention module and a multi-task loss function. A target face image is obtained and input into the pyramid model to obtain a plurality of first feature maps with different scales; each of the plurality of first feature maps is input into the dual-attention module for operation to obtain a plurality of second feature maps, the dual-attention module comprising a spatial attention module and a channel attention module, each second feature map being a feature map with added attention; feature fusion is performed on the plurality of second feature maps to obtain an intermediate feature map; and the intermediate feature map is input into the multi-task loss function to obtain a target face detection result, the multi-task loss function comprising a plurality of tasks, each task corresponding to a task tag. In this way, a pyramid network model fusing attention modules is constructed, each branch in the structure adopts the dual-attention module to enhance feature representation, and all features of different scales are then fused; finally, a multi-task loss function is adopted to classify and regress the face, so that the accuracy of face detection can be improved.
In accordance with the embodiment shown in fig. 1A, please refer to fig. 2, fig. 2 is a schematic flow chart of a face detection method provided in the embodiment of the present application, which is applied to an electronic device, wherein a face detection model is preconfigured in the electronic device, the face detection model includes a pyramid model, a dual-attention module and a multi-task loss function, and as shown in the figure, the face detection method includes:
201. A training sample set is obtained, the training sample set comprising a plurality of training samples.
202. The training sample set is divided into a plurality of training sets, each training set corresponding to a complexity level.
203. A preset face detection model is trained with the plurality of training sets to obtain a face detection model, the face detection model comprising a pyramid model, a dual-attention module and a multi-task loss function.
204. A target face image is acquired.
205. The target face image is input into the pyramid model to obtain a plurality of first feature maps with different scales.
206. Each of the plurality of first feature maps is input into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map with added attention.
207. Feature fusion is performed on the plurality of second feature maps to obtain an intermediate feature map.
208. The intermediate feature map is input into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task label.
The specific description of the steps 201 to 208 may refer to the corresponding steps of the face detection method described in fig. 1A, and will not be repeated herein.
It can be seen that, in the face detection method described in the embodiment of the present application, a pyramid network model fusing attention modules is constructed, each branch in the structure adopts a dual-attention module to enhance feature representation, and all features of different scales are then fused; finally, a multi-task loss function is adopted to classify and regress the face, so that the accuracy of face detection can be improved.
In accordance with the foregoing embodiment, referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device includes a processor, a memory, a communication interface, and one or more programs; a face detection model is preconfigured in the electronic device, and the face detection model includes a pyramid model, a dual-attention module and a multi-task loss function. The one or more programs are stored in the memory and configured to be executed by the processor, and in the embodiment of the present application, the programs include instructions for performing the following steps:
Acquiring a target face image;
inputting the target face image into the pyramid model to obtain a plurality of first feature images with different scales;
inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map added with attention;
performing feature fusion on the plurality of second feature images to obtain an intermediate feature image;
and inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task label.
It can be seen that, in the electronic device described in the embodiment of the present application, a face detection model is preconfigured, the face detection model comprising a pyramid model, a dual-attention module and a multi-task loss function. A target face image is obtained and input into the pyramid model to obtain a plurality of first feature maps with different scales; each of the plurality of first feature maps is input into the dual-attention module for operation to obtain a plurality of second feature maps, the dual-attention module comprising a spatial attention module and a channel attention module, each second feature map being an attention-added feature map; feature fusion is performed on the plurality of second feature maps to obtain an intermediate feature map; and the intermediate feature map is input into the multi-task loss function to obtain a target face detection result, the multi-task loss function comprising a plurality of tasks, each task corresponding to a task tag. In this way, a pyramid network model fusing attention modules is constructed, each branch in the structure adopts the dual-attention module to enhance feature representation, and all features of different scales are then fused; finally, a multi-task loss function is adopted to classify and regress the face, so that the accuracy of face detection can be improved.
In one possible example, in the aspect of inputting each of the plurality of first feature maps into the dual-attention module to perform an operation to obtain a plurality of second feature maps, the program further includes instructions for performing the following steps:
performing deconvolution operation on the first feature map of the ith layer to obtain a deconvolution operation result, wherein the first feature map of the ith layer is not the first feature map with the largest scale in the plurality of first feature maps;
adjusting model parameters of a dual-attention module of a first feature map of a layer above the first feature map of the ith layer according to the deconvolution operation result;
and calculating the first characteristic diagram of the layer through the adjusted dual-attention module to obtain a second characteristic diagram.
In one possible example, in the aspect of inputting each of the plurality of first feature maps into the dual-attention module to perform an operation to obtain a plurality of second feature maps, the program includes instructions for performing the following steps:
determining spatial features and channel features of a jth first feature map, wherein the jth first feature map is any one of the plurality of first feature maps;
Carrying out softmax operation on the spatial features to obtain spatial feature weights;
carrying out softmax operation on the channel characteristics to obtain channel characteristic weights;
performing mul operation on the spatial feature weight and the jth first feature map to obtain an intermediate spatial feature;
performing mul operation on the channel characteristic weight and the jth first characteristic diagram to obtain an intermediate channel characteristic;
and performing mul operation on the intermediate space feature, the intermediate channel feature and the jth first feature map to obtain a second feature map corresponding to the jth first feature map.
In one possible example, the above-described program further includes instructions for performing the steps of:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples;
dividing the training sample set into a plurality of training sets, wherein each training set corresponds to a complexity level;
training a preset face detection model according to the training sets to obtain the face detection model.
In one possible example, in said dividing the training sample set into a plurality of training sets, the above-mentioned program comprises instructions for performing the steps of:
Determining multiple types of features in a sample a, wherein each type of feature corresponds to one quantity and weight value, and the sample a is any sample in the training sample set;
calculating according to the quantity and the weight value corresponding to each type of characteristics in the multiple types of characteristics to obtain target complexity;
and determining a target training set corresponding to the sample a according to a mapping relation between the preset complexity and class labels of the training set.
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional units of the electronic device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
Fig. 4A is a functional unit block diagram of the face detection apparatus 400 according to the embodiment of the present application. The face detection apparatus 400 is applied to an electronic device, in which a face detection model is preconfigured, the face detection model includes a pyramid model, a dual-attention module, and a multi-task loss function, and the apparatus includes: an acquisition unit 401, a multi-scale decomposition unit 402, an input unit 403, a fusion unit 404, and a detection unit 405, wherein,
the acquiring unit 401 is configured to acquire a target face image;
the multi-scale decomposition unit 402 is configured to input the target face image into the pyramid model, so as to obtain a plurality of first feature graphs with different scales;
The input unit 403 is configured to input each of the plurality of first feature maps into the dual-attention module for operation, so as to obtain a plurality of second feature maps, where the dual-attention module includes a spatial attention module and a channel attention module, and each second feature map is a feature map with added attention;
the fusing unit 404 is configured to perform feature fusion on the plurality of second feature graphs to obtain an intermediate feature graph;
the detecting unit 405 is configured to input the intermediate feature map to the multi-task loss function to obtain a target face detection result, where the multi-task loss function includes a plurality of tasks, and each task corresponds to a task tag.
It can be seen that the face detection apparatus described in the embodiment of the present application is applied to an electronic device in which a face detection model is preconfigured, the face detection model comprising a pyramid model, a dual-attention module and a multi-task loss function. A target face image is obtained and input into the pyramid model to obtain a plurality of first feature maps with different scales; each of the plurality of first feature maps is input into the dual-attention module for operation to obtain a plurality of second feature maps, the dual-attention module comprising a spatial attention module and a channel attention module, each second feature map being a feature map with added attention; feature fusion is performed on the plurality of second feature maps to obtain an intermediate feature map; and the intermediate feature map is input into the multi-task loss function to obtain a target face detection result, the multi-task loss function comprising a plurality of tasks, each task corresponding to a task tag. In this way, a pyramid network model fusing attention modules is constructed, each branch in the structure adopts the dual-attention module to enhance feature representation, and all features of different scales are then fused; finally, a multi-task loss function is adopted to classify and regress the face, so that the accuracy of face detection can be improved.
In one possible example, in terms of inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, the input unit 403 is specifically configured to (one reading of these steps is sketched below them):
performing a deconvolution operation on the first feature map of the ith layer to obtain a deconvolution operation result, wherein the first feature map of the ith layer is not the first feature map with the largest scale among the plurality of first feature maps;
adjusting, according to the deconvolution operation result, the model parameters of the dual-attention module of the first feature map of the layer above the ith layer; and
processing the first feature map of that layer through the adjusted dual-attention module to obtain a second feature map.
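The passage above does not fix how the deconvolution result "adjusts" the upper layer's dual-attention module. One plausible reading, sketched below in Python purely as an assumption, is that the ith-layer map is deconvolved and merged into the input on which the upper layer's attention is computed; TopDownAdjustment and its layer sizes are invented for illustration.

import torch
import torch.nn as nn

class TopDownAdjustment(nn.Module):
    # One interpretation (an assumption): the ith-layer map is deconvolved
    # and the result conditions the attention of the layer above it.
    def __init__(self, channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(channels, channels,
                                         kernel_size=2, stride=2)
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, upper_map, ith_map, dual_attention):
        up = self.deconv(ith_map)             # deconvolution operation result
        up = nn.functional.interpolate(up, size=upper_map.shape[-2:])
        merged = self.merge(torch.cat([upper_map, up], dim=1))
        return dual_attention(merged)         # second feature map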
In one possible example, in terms of inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, the input unit 403 is specifically configured to (the computation is sketched after these steps):
determining the spatial features and channel features of a jth first feature map, wherein the jth first feature map is any one of the plurality of first feature maps;
performing a softmax operation on the spatial features to obtain spatial feature weights;
performing a softmax operation on the channel features to obtain channel feature weights;
performing a mul (element-wise multiplication) operation on the spatial feature weights and the jth first feature map to obtain an intermediate spatial feature;
performing a mul operation on the channel feature weights and the jth first feature map to obtain an intermediate channel feature; and
performing a mul operation on the intermediate spatial feature, the intermediate channel feature, and the jth first feature map to obtain a second feature map corresponding to the jth first feature map.
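These six steps translate almost directly into code. The sketch below follows them literally; only the 1x1 convolution and global average pooling used to produce the raw spatial and channel features are assumptions, since the text does not say how those features are extracted.

import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # assumed extractor of the raw spatial feature
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        s = self.spatial(x).view(b, 1, h * w)
        s_w = F.softmax(s, dim=-1).view(b, 1, h, w)   # spatial feature weights
        ch = x.mean(dim=(2, 3))                  # assumed channel feature
        c_w = F.softmax(ch, dim=-1).view(b, c, 1, 1)  # channel feature weights
        inter_s = s_w * x                        # mul: intermediate spatial
        inter_c = c_w * x                        # mul: intermediate channel
        return inter_s * inter_c * x             # mul of all three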
In one possible example, as shown in Fig. 4B, which is a modified structure of the face detection apparatus shown in Fig. 4A, the apparatus may further include, compared with Fig. 4A, a training unit 406, wherein
the training unit 406 is specifically configured to (one possible training schedule is sketched after these steps):
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples;
dividing the training sample set into a plurality of training sets, wherein each training set corresponds to a complexity level;
training a preset face detection model according to the training sets to obtain the face detection model.
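The text does not state in what order the complexity-graded sets are consumed. A curriculum-style schedule, easy to hard, is one natural assumption; the Python sketch below shows such a loop with hypothetical names throughout.

def train_by_complexity(model, training_sets, optimiser, loss_fn, epochs=5):
    # training_sets: dict mapping a complexity level (e.g. 0 = easiest)
    # to a data loader; visiting levels from low to high is an assumption.
    for level in sorted(training_sets):
        for _ in range(epochs):
            for images, labels in training_sets[level]:
                optimiser.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                optimiser.step()
    return model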
Further, in one possible example, in terms of dividing the training sample set into a plurality of training sets, the training unit 406 is specifically configured to (a scoring sketch follows these steps):
determining multiple types of features in a sample a, wherein each type of feature corresponds to a quantity and a weight value, and the sample a is any sample in the training sample set;
calculating a target complexity according to the quantity and the weight value corresponding to each of the multiple types of features; and
determining the target training set corresponding to the sample a according to a preset mapping relation between complexity and training set class labels.
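In other words, each sample's complexity is the weighted count of its feature types, and a preset table of complexity intervals then names its training set. A small Python sketch, with wholly hypothetical weights and thresholds:

def sample_complexity(feature_counts, feature_weights):
    # target complexity = sum over feature types of (quantity x weight)
    return sum(count * feature_weights.get(kind, 0.0)
               for kind, count in feature_counts.items())

def assign_training_set(complexity, mapping=((10.0, "easy"),
                                             (25.0, "medium"))):
    # hypothetical preset mapping from complexity intervals to set labels
    for upper_bound, label in mapping:
        if complexity < upper_bound:
            return label
    return "hard"

For example, a sample with two occluded faces (weight 3.0) and one small face (weight 4.0) would score 10.0 and fall into the "medium" set under these invented thresholds.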
It may be understood that the functions of each program module of the face detection apparatus of this embodiment may be implemented according to the method in the foregoing method embodiment; for the specific implementation process, reference may be made to the relevant description of the foregoing method embodiment, which is not repeated here.
An embodiment of the present application also provides a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute part or all of the steps of any one of the methods described in the foregoing method embodiments; the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the foregoing method embodiments. The computer program product may be a software installation package, and the computer includes an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions; however, those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in another order or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of units is merely a division by logic function, and there may be other ways of dividing in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit described above is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing associated hardware; the program may be stored in a computer-readable memory, which may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present application have been described in detail above; specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is intended only to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims (8)

1. A face detection method, characterized in that it is applied to an electronic device in which a face detection model is preconfigured, the face detection model including a pyramid model, a dual-attention module, and a multi-task loss function, the method comprising:
acquiring a target face image, which comprises: acquiring a target environment parameter, determining a target shooting parameter corresponding to the target environment parameter according to a preset mapping relation between environment parameters and shooting parameters, shooting a target face according to the target shooting parameter to obtain a first image, and performing image segmentation on the first image to obtain the target face image;
inputting the target face image into the pyramid model to obtain a plurality of first feature images with different scales;
inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map added with attention;
performing feature fusion on the plurality of second feature images to obtain an intermediate feature image;
inputting the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task tag;
the step of inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps includes:
performing a deconvolution operation on the first feature map of the ith layer to obtain a deconvolution operation result, wherein the first feature map of the ith layer is not the first feature map with the largest scale among the plurality of first feature maps;
adjusting, according to the deconvolution operation result, model parameters of a dual-attention module of a first feature map of a layer above the first feature map of the ith layer; and
processing the first feature map of that layer through the adjusted dual-attention module to obtain a second feature map.
2. The method of claim 1, wherein inputting each of the plurality of first feature maps into a dual-attention module for operation to obtain a plurality of second feature maps comprises:
determining spatial features and channel features of a jth first feature map, wherein the jth first feature map is any one of the plurality of first feature maps;
performing a softmax operation on the spatial features to obtain spatial feature weights;
performing a softmax operation on the channel features to obtain channel feature weights;
performing a mul operation on the spatial feature weights and the jth first feature map to obtain an intermediate spatial feature;
performing a mul operation on the channel feature weights and the jth first feature map to obtain an intermediate channel feature; and
performing a mul operation on the intermediate spatial feature, the intermediate channel feature, and the jth first feature map to obtain a second feature map corresponding to the jth first feature map.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples;
dividing the training sample set into a plurality of training sets, wherein each training set corresponds to a complexity level;
training a preset face detection model according to the training sets to obtain the face detection model.
4. The method of claim 3, wherein the dividing the training sample set into a plurality of training sets comprises:
determining multiple types of features in a sample a, wherein each type of feature corresponds to a quantity and a weight value, and the sample a is any sample in the training sample set;
calculating a target complexity according to the quantity and the weight value corresponding to each of the multiple types of features; and
determining a target training set corresponding to the sample a according to a preset mapping relation between complexity and training set class labels.
5. A face detection apparatus, characterized in that it is applied to an electronic device in which a face detection model is preconfigured, the face detection model including a pyramid model, a dual attention module, and a multi-task loss function, the apparatus comprising: the device comprises an acquisition unit, a multi-scale decomposition unit, an input unit, a fusion unit and a detection unit, wherein,
the acquisition unit is configured to acquire a target face image by: acquiring a target environment parameter, determining a target shooting parameter corresponding to the target environment parameter according to a preset mapping relation between environment parameters and shooting parameters, shooting a target face according to the target shooting parameter to obtain a first image, and performing image segmentation on the first image to obtain the target face image;
the multi-scale decomposition unit is configured to input the target face image into the pyramid model to obtain a plurality of first feature maps with different scales;
the input unit is configured to input each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, wherein the dual-attention module comprises a spatial attention module and a channel attention module, and each second feature map is a feature map with attention added;
the fusion unit is configured to perform feature fusion on the plurality of second feature maps to obtain an intermediate feature map; and
the detection unit is configured to input the intermediate feature map into the multi-task loss function to obtain a target face detection result, wherein the multi-task loss function comprises a plurality of tasks, and each task corresponds to a task tag;
wherein, in terms of inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, the input unit is specifically configured to:
performing a deconvolution operation on the first feature map of the ith layer to obtain a deconvolution operation result, wherein the first feature map of the ith layer is not the first feature map with the largest scale among the plurality of first feature maps;
adjusting, according to the deconvolution operation result, model parameters of a dual-attention module of a first feature map of a layer above the first feature map of the ith layer; and
processing the first feature map of that layer through the adjusted dual-attention module to obtain a second feature map.
6. The apparatus of claim 5, wherein, in terms of inputting each of the plurality of first feature maps into the dual-attention module for operation to obtain a plurality of second feature maps, the input unit is specifically configured to:
determining spatial features and channel features of a jth first feature map, wherein the jth first feature map is any one of the plurality of first feature maps;
performing a softmax operation on the spatial features to obtain spatial feature weights;
performing a softmax operation on the channel features to obtain channel feature weights;
performing a mul operation on the spatial feature weights and the jth first feature map to obtain an intermediate spatial feature;
performing a mul operation on the channel feature weights and the jth first feature map to obtain an intermediate channel feature; and
performing a mul operation on the intermediate spatial feature, the intermediate channel feature, and the jth first feature map to obtain a second feature map corresponding to the jth first feature map.
7. An electronic device, comprising a processor and a memory, the memory storing one or more programs configured to be executed by the processor, the one or more programs comprising instructions for performing the steps in the method of any one of claims 1-4.
8. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-4.
CN201911296824.7A 2019-12-16 2019-12-16 Face detection method and related device Active CN111178183B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911296824.7A CN111178183B (en) 2019-12-16 2019-12-16 Face detection method and related device

Publications (2)

Publication Number Publication Date
CN111178183A CN111178183A (en) 2020-05-19
CN111178183B (en) 2023-05-23


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875521A (en) * 2017-12-20 2018-11-23 北京旷视科技有限公司 Method for detecting human face, device, system and storage medium
US10699388B2 (en) * 2018-01-24 2020-06-30 Adobe Inc. Digital image fill
CN109543606B (en) * 2018-11-22 2022-09-27 中山大学 Human face recognition method with attention mechanism
CN109615016B (en) * 2018-12-20 2021-06-22 北京理工大学 Target detection method of convolutional neural network based on pyramid input gain

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508654A (en) * 2018-10-26 2019-03-22 中国地质大学(武汉) Merge the human face analysis method and system of multitask and multiple dimensioned convolutional neural networks
CN110163878A (en) * 2019-05-28 2019-08-23 四川智盈科技有限公司 A kind of image, semantic dividing method based on dual multiple dimensioned attention mechanism
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110188765A (en) * 2019-06-05 2019-08-30 京东方科技集团股份有限公司 Image, semantic parted pattern generation method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An improved sample projection approach for image watermarking; Yuan-Gen Wang et al.; Digital Signal Processing; pp. 135-143 *
Multi-Scale Dual-Branch Fully Convolutional Network for Hand Parsing; Yang Lu et al.; arXiv; pp. 4321-4330 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant