WO2022105608A1 - Rapid face density prediction and face detection method and apparatus, electronic device, and storage medium - Google Patents

Rapid face density prediction and face detection method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2022105608A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
feature
features
group
image
Prior art date
Application number
PCT/CN2021/128477
Other languages
French (fr)
Chinese (zh)
Inventor
张敏文
周治尹
Original Assignee
上海点泽智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海点泽智能科技有限公司
Publication of WO2022105608A1 publication Critical patent/WO2022105608A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Definitions

  • the invention relates to image information processing technology, and in particular to a fast face density prediction and face detection method, apparatus, electronic device, and storage medium.
  • Face detection has important application value in security monitoring, witness comparison, human-computer interaction, social networking and other fields.
  • Devices such as digital cameras and smartphones make extensive use of face detection technology to focus on faces during imaging and to sort and classify photo albums, and various virtual beauty cameras also require face detection technology to locate faces.
  • the common face detection methods need to set face candidate boxes first and learn offsets on the candidate boxes through a neural network to obtain the position of the face in the image, and the setting of the candidate boxes directly affects the accuracy of face detection;
  • the FaceBoxes model has high accuracy, but contains a large number of parameters;
  • the MTCNN (Multi-task Cascaded Convolutional Networks) model has fewer parameters, but its feature expression ability is limited, and it contains three neural networks that must be trained separately, which makes it difficult to train;
  • the U-shaped feature extraction network only expands high-level features during feature fusion, and does not fully utilize the texture information of high-level features and the detailed information of low-level features.
  • the present invention proposes a face detection method, comprising the following steps:
  • Step S1: acquiring an image to be detected;
  • Step S2: using feature pyramid residual blocks to extract multi-scale features from the image to be detected;
  • Step S3: using the mutual embedding upsampling module to perform feature fusion;
  • Step S4: using the face detection module to predict face confidence and the width and height of the face.
  • the step S2 includes:
  • Step S2.1: use a 3×3 convolution kernel to convolve the image to be detected, and send the convolved image into the feature pyramid residual block to extract features;
  • Step S2.2: combine a plurality of the feature pyramid residual blocks into a feature extraction network, and extract the features of the feature map output by step S2.1;
  • Step S2.3: combine a plurality of the feature pyramid residual blocks into a feature extraction network, and extract the features of the feature map output by step S2.2.
  • the feature pyramid residual block provided by this application includes:
  • a 1×1 convolution operation is used to expand the number of channels of the feature map; the feature map is divided evenly into 4 groups along the channel direction, the first group being convolved with a 3×3 kernel of dilation 1, the second group with a 3×3 kernel of dilation 2, the third group with a 3×3 kernel of dilation 4, and the fourth group with a 3×3 kernel of dilation 8; the 4 groups of convolved features are concatenated in order to form a first feature map, a 1×1 convolution fuses the first feature map into a second feature map, and the input feature map and the second feature map are added together.
  • the receptive fields of the atrous convolutions of the first group, the second group, the third group, and the fourth group are 3, 5, 9, and 17, respectively.
  • the present application implements feature fusion through feature pyramid residual blocks to increase the receptive field of neurons without increasing parameters.
  • the four groups of dilated convolutions are all depthwise convolutions: along the channel direction, the original feature map is split into single-channel feature maps, and each single-channel feature map is convolved with a single-channel kernel, which further reduces the parameters of the network model.
  • the 4 groups of convolutions in the feature pyramid residual block are arranged in parallel, which enlarges the receptive field of neurons without increasing the depth or parameters of the network, so that the network can extract more face information.
  • the step S3 includes:
  • Step S3.1 using the inter-embedded upsampling module to perform feature fusion on the features extracted in the step S2.2 and the features extracted in the step S2.3;
  • Step S3.2 Use the inter-embedded upsampling module to perform feature fusion on the features fused in the step S3.1 and the features extracted in the step S2.1.
  • the present application applies the inter-embedded upsampling module to the high-stage feature map: a channel attention model yields a first attention coefficient for each channel, and the first attention coefficients are multiplied by the low-stage features to obtain a first fused feature fused by the channel attention model;
  • on the low-stage feature map, a spatial attention model yields a second attention coefficient for each point of the feature map, and the second attention coefficients are multiplied by the upsampled high-stage feature map to obtain a second fused feature fused by the spatial attention model; the first fused feature and the second fused feature are added to obtain the final fused feature.
  • the step S4 includes:
  • Step S4.1: use a 3×3 convolution kernel to convolve the features fused in step S3.2;
  • Step S4.2: use two 1×1 convolution kernels to predict face confidence and face width and height, respectively.
  • the image to be detected can be regarded as a two-dimensional coordinate system with the upper left corner of the image as the origin; the face in the image can then be regarded as a two-dimensional Gaussian distribution.
  • the center position of the face is the center point of the Gaussian distribution, its coordinate value corresponds to the mean of the two-dimensional Gaussian distribution, and the width and height of the face correspond to the variance of the two-dimensional Gaussian distribution.
  • another embodiment of the present application discloses a network training process with a label and a loss function, specifically:
  • a face with center point (x, y) is expressed as f = N(x, y, σ1, σ2), where x and y are the means of the two-dimensional Gaussian distribution N;
  • σ1 and σ2 are the variances of the two-dimensional Gaussian distribution, corresponding to the width and height of the face, respectively; therefore, the face distribution corresponding to an image containing n faces can be expressed as I(x, y) = max_i N(x_i, y_i, σ1_i, σ2_i), i = 1, 2, …, n;
  • Ω is the label for predicting the face center point, and Ψ is the label for predicting the face width and height;
  • the loss function can be expressed as:
  • P and K are the outputs of the network, namely the face confidence (normalized Gaussian distribution amplitude) and the face width and height (variances of the Gaussian distribution), and λ is the loss scale coefficient.
  • the embodiment of the present application also provides a fast face density prediction and face detection device, including:
  • an image acquisition module for acquiring the image to be detected
  • a feature extraction module used for extracting multi-scale features in the image to be detected by using a feature pyramid residual block
  • the feature fusion module is used for feature fusion using the inter-embedded upsampling module
  • the detection result module is used to use the face detection module to predict the confidence level of the face and the width and height of the face.
  • Embodiments of the present application further provide an electronic device, including a memory, a processor, and machine-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the machine-readable instructions, performs the method described above.
  • Embodiments of the present application further provide a storage medium on which a computer program is stored, characterized in that, when the program is run by a processor, the method as described above is executed.
  • This application predicts the face density in an image and detects the faces in the image by predicting Gaussian distributions, avoiding the instability introduced by candidate boxes; a feature pyramid residual block uses small convolution kernels to enlarge the receptive field of neurons without increasing the depth of the network; since neither the depth nor the parameters of the network are increased while the receptive field is enlarged, the network can extract more face information; the inter-embedded upsampling module performs feature fusion so that, when high-level and low-level features are fused, the texture information of the high-level features and the detail information of the low-level features are fully exploited.
  • FIG. 1 is a schematic flowchart of a method for fast face density prediction and face detection provided by an embodiment of the present application
  • FIG. 2 is a structural block diagram of a face density prediction and face detection model provided by an embodiment of the present application
  • FIG. 3 is a structural block diagram of a feature pyramid residual block provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an apparatus for fast face density prediction and face detection provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • a fast face density prediction and face detection method includes the following steps:
  • Step S1 acquiring an image to be detected
  • the image to be detected refers to an image to be checked for the presence of a human face, for example a color image, a black-and-white image, or a binary image of a face.
  • the image to be detected in step S1 may be obtained by photographing the target object with a terminal device such as a video camera, a video recorder, or a color camera; by retrieving a pre-stored image, for example capturing it from a real-time video stream or a video file in the file system, or reading it from a database or a removable storage device; or by obtaining it from the Internet using a browser or another application.
  • Step S2 using feature pyramid residual blocks to extract multi-scale features in the image to be detected
  • using the feature pyramid residual block to extract the multi-scale features in the image to be detected further includes the following steps:
  • Step S2.1 In the first stage, a 3×3 convolution kernel is used to convolve the image to be detected, and the convolved image is sent to the feature pyramid residual block to extract features;
  • Step S2.2 In the second stage, multiple feature pyramid residual blocks are combined into a feature extraction network, and the features of the feature map output in step S2.1 are extracted;
  • Step S2.3 In the third stage, multiple feature pyramid residual blocks are combined into a feature extraction network to extract the features of the feature map output in step S2.2.
  • see FIG. 3 for the structural block diagram of the feature pyramid residual block provided by the embodiment of the present application.
  • a 1×1 convolution operation is used to expand the number of channels of the feature map; the feature map is divided evenly into 4 groups along the channel direction, the first group being convolved with a 3×3 kernel of dilation 1, the second group with a 3×3 kernel of dilation 2, the third group with a 3×3 kernel of dilation 4, and the fourth group with a 3×3 kernel of dilation 8; the 4 groups of convolved features are concatenated in order to form a first feature map, a 1×1 convolution fuses the first feature map into a second feature map, and the input feature map and the second feature map are added together.
  • the receptive fields of the atrous convolutions of the first group, the second group, the third group, and the fourth group are 3, 5, 9, and 17, respectively.
  • Step S3 adopting the mutual embedding upsampling module to perform feature fusion
  • the embodiment of the present application applies the inter-embedded upsampling module to the high-stage feature map: a channel attention model yields a first attention coefficient for each channel, and the first attention coefficients are multiplied by the low-stage features to obtain a first fused feature fused by the channel attention model;
  • on the low-stage feature map, a spatial attention model yields a second attention coefficient for each point of the feature map, and the second attention coefficients are multiplied by the upsampled high-stage feature map to obtain a second fused feature fused by the spatial attention model;
  • the first fusion feature and the second fusion feature are added to obtain the final fusion feature.
  • the channel attention model and the spatial attention model are common technologies in the field, and mainly focus on the mechanism of local information, such as a certain image area in the image. With the change of tasks, attention areas tend to change, which will not be repeated in this application.
  • the inter-embedded upsampling module is used for feature fusion, and the texture information of the high-level features and the detailed information of the low-level features are fully utilized when the high-level and low-level feature fusion is realized.
  • Step S4: the face detection model network is used to predict face confidence and the width and height of the face; specifically, this includes the following steps:
  • Step S4.1 use a 3×3 convolution kernel to convolve the fused features in step S3.2;
  • Step S4.2 Use two 1×1 convolution kernels to predict face confidence and face width and height respectively.
  • the annotations are obtained by marking the face regions in the face images with bounding boxes and marking the classification and key points corresponding to the face regions; the key points represent key feature points in the face region; optionally, a further output can be attached at the end of this method, and the key points of the face are detected by the method of predicting the position of the face center point.
  • the image to be detected can be regarded as a two-dimensional coordinate system with the upper left corner of the image as the origin; the face in the image can then be regarded as a two-dimensional Gaussian distribution.
  • the center position of the face is the center point of the Gaussian distribution, its coordinate value corresponds to the mean of the two-dimensional Gaussian distribution, and the width and height of the face correspond to the variance of the two-dimensional Gaussian distribution.
  • Another embodiment of the present application also provides a label and a loss function to perform a network training process, specifically:
  • a face with center point (x, y) is expressed as f = N(x, y, σ1, σ2), where x and y are the means of the two-dimensional Gaussian distribution N;
  • σ1 and σ2 are the variances of the two-dimensional Gaussian distribution, corresponding to the width and height of the face, respectively; therefore, the face distribution corresponding to an image containing n faces can be expressed as I(x, y) = max_i N(x_i, y_i, σ1_i, σ2_i), i = 1, 2, …, n;
  • Ω is the label for predicting the face center point, and Ψ is the label for predicting the face width and height;
  • the loss function can be expressed as:
  • P and K are the outputs of the network, namely the face confidence (normalized Gaussian distribution amplitude) and the face width and height (variances of the Gaussian distribution), and λ is the loss scale coefficient.
  • this method adopts the method of predicting the Gaussian distribution to predict the face density in the image and detect the face in the image, so as to avoid the unstable factors caused by the use of candidate frames.
  • the embodiment of the present application provides a face density prediction and face detection apparatus 300, including:
  • an image acquisition module 310 configured to acquire an image to be detected
  • a feature extraction module 320 configured to extract multi-scale features in the image to be detected by using a feature pyramid residual block
  • the feature fusion module 330 is used to perform feature fusion by adopting the mutual embedded upsampling module;
  • the detection result module 340 is configured to use the face detection module to predict the confidence level of the face and the width and height of the face to obtain the face detection result.
  • the device corresponds to the above-mentioned embodiments of the fast face density prediction and face detection methods, and can perform various steps involved in the above-mentioned method embodiments.
  • the device includes at least one software function module that can be stored in a memory in the form of software or firmware or fixed in an operating system (OS) of the device.
  • An electronic device 400 provided in an embodiment of the present application includes: a processor 410 and a memory 420, where the memory 420 stores machine-readable instructions executable by the processor 410, and the above method is executed when the machine-readable instructions are executed by the processor 410 .
  • the embodiment of the present application also provides a storage medium 430, where a computer program is stored on the storage medium 430, and the computer program is executed by the processor 410 to execute the above method.
  • the storage medium 430 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implying the number of technical features indicated. Thus, a feature defined as "first" or "second" may expressly or implicitly include one or more of that feature. "Plurality" means two or more, unless expressly and specifically limited otherwise.
  • the terms "mounted", "connected", "coupled", "fixed", and the like should be understood in a broad sense; for example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection or an indirect connection through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements.
  • a first feature "on" or "under" a second feature may mean that the first and second features are in direct contact, or that the first and second features are in indirect contact through an intermediate medium.
  • the first feature being "above", "over", or "on top of" the second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature.
  • the first feature being "below", "under", or "beneath" the second feature may mean that the first feature is directly below or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
  • a "computer-readable medium” can be any device that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or apparatus.
  • computer-readable media include the following: an electrical connection with one or more wires (an electronic device), a portable computer disk cartridge (a magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the paper or other medium may be optically scanned and then edited, interpreted, or otherwise processed as necessary to obtain the program electronically, after which it is stored in computer memory.
  • various parts of the present invention may be implemented in hardware, software, firmware or a combination thereof.
  • various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, as in another embodiment, the steps may be implemented by any one or a combination of the following techniques known in the art: discrete logic circuits, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. If the integrated modules are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a rapid face density prediction and face detection method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: obtaining an image to be detected; extracting multi-scale features from the image to be detected using a feature pyramid residual block; performing feature fusion using a mutual embedding upsampling module; and using a face detection module to predict a face confidence level and the width and height of the face. In the implementation process described above, the present application predicts the face density in the image and detects the faces in the image by predicting Gaussian distributions, avoiding the instability introduced by candidate boxes; a feature pyramid residual block with small convolution kernels enlarges the receptive field of neurons without increasing the depth of the network; since neither the depth nor the parameters of the network are increased while the receptive field of the neurons is enlarged, the network can extract more face information.

Description

A fast face density prediction and face detection method, apparatus, electronic device, and storage medium

Technical Field
The present invention relates to image information processing technology, and in particular to a fast face density prediction and face detection method, apparatus, electronic device, and storage medium.
Background Art
Face detection has important application value in security monitoring, identity verification, human-computer interaction, social networking, and other fields. Devices such as digital cameras and smartphones make extensive use of face detection technology to focus on faces during imaging and to sort and classify photo albums, and various virtual beauty cameras also require face detection technology to locate faces.
Current common face detection methods (FaceBoxes, MTCNN) must first set face candidate boxes and learn offsets on those boxes through a neural network to obtain the position of the face in the image, and the setting of the candidate boxes directly affects the accuracy of face detection. The FaceBoxes model is highly accurate but contains a large number of parameters. The MTCNN (Multi-task Cascaded Convolutional Networks) model has fewer parameters, but its feature expression ability is limited, and it contains three neural networks that must be trained separately, which makes it difficult to train. Meanwhile, the U-shaped feature extraction network merely expands high-level features during feature fusion and does not fully exploit the texture information of high-level features and the detail information of low-level features.
Summary of the Invention
In order to solve the above technical problems, the present invention proposes a face detection method comprising the following steps:
Step S1: acquiring an image to be detected;
Step S2: extracting multi-scale features from the image to be detected using feature pyramid residual blocks;
Step S3: performing feature fusion using a mutual embedding upsampling module;
Step S4: predicting the face confidence and the width and height of the face using a face detection module.
Preferably, step S2 includes:
Step S2.1: convolving the image to be detected with a 3×3 convolution kernel, and feeding the convolved image into the feature pyramid residual block to extract features;
Step S2.2: combining a plurality of the feature pyramid residual blocks into a feature extraction network, and extracting features from the feature map output by step S2.1;
Step S2.3: combining a plurality of the feature pyramid residual blocks into a feature extraction network, and extracting features from the feature map output by step S2.2.
Preferably, the feature pyramid residual block provided by the present application includes:
A 1×1 convolution operation is used to expand the number of channels of the feature map; the feature map is divided evenly into 4 groups along the channel direction, the first group being convolved with a 3×3 kernel of dilation 1, the second group with a 3×3 kernel of dilation 2, the third group with a 3×3 kernel of dilation 4, and the fourth group with a 3×3 kernel of dilation 8; the 4 groups of convolved features are concatenated in order to form a first feature map, a 1×1 convolution fuses the first feature map into a second feature map, and the input feature map and the second feature map are added together.
The receptive fields of the dilated convolutions of the first, second, third, and fourth groups are 3, 5, 9, and 17, respectively.
Through feature fusion in the feature pyramid residual block, the present application enlarges the receptive field of neurons without adding parameters. The four groups of dilated convolutions are all depthwise convolutions: along the channel direction, the original feature map is split into single-channel feature maps, and each is convolved with a single-channel kernel, which further reduces the parameters of the network model. The 4 groups of convolutions in the feature pyramid residual block are arranged in parallel, enlarging the receptive field of neurons without increasing the depth or parameters of the network, so that the network can extract more face information.
Preferably, step S3 includes:
Step S3.1: fusing the features extracted in step S2.2 with the features extracted in step S2.3 using the mutual embedding upsampling module;
Step S3.2: fusing the features fused in step S3.1 with the features extracted in step S2.1 using the mutual embedding upsampling module.
Specifically, the mutual embedding upsampling module applies a channel attention model to the high-stage feature map to obtain a first attention coefficient for each channel, and multiplies the first attention coefficients by the low-stage features to obtain a first fused feature fused by the channel attention model;
On the low-stage feature map, a spatial attention model yields a second attention coefficient for each point of the feature map, and the second attention coefficients are multiplied by the upsampled high-stage feature map to obtain a second fused feature fused by the spatial attention model; the first fused feature and the second fused feature are added to obtain the final fused feature.
Preferably, step S4 includes:
Step S4.1: convolving the features fused in step S3.2 with a 3×3 convolution kernel;
Step S4.2: predicting the face confidence and the width and height of the face with two 1×1 convolution kernels, respectively.
Specifically, the image to be detected can be regarded as a two-dimensional coordinate system with the upper-left corner of the image as the origin; a face in the image can then be regarded as a two-dimensional Gaussian distribution. The center of the face is the center point of the Gaussian distribution, its coordinates correspond to the mean of the two-dimensional Gaussian distribution, and the width and height of the face correspond to the variances of the two-dimensional Gaussian distribution.
Preferably, another embodiment of the present application discloses the labels and the loss function for the network training process, specifically:
A face whose center point is (x, y) is expressed as:
f = N(x, y, σ1, σ2)
where x and y are the means of the two-dimensional Gaussian distribution N, and σ1 and σ2 are the variances of the two-dimensional Gaussian distribution, corresponding respectively to the width and height of the face. Therefore, the face distribution corresponding to an image containing n faces can be expressed as:
I(x, y) = max_i N(x_i, y_i, σ1_i, σ2_i), i = 1, 2, …, n;
The labels for this image can be expressed as:
[The label definitions for Ω and Ψ are given as equation images PCTCN2021128477-appb-000001 to PCTCN2021128477-appb-000004 in the original filing.]
Ω is the label for predicting the face center point, and Ψ is the label for predicting the face width and height;
The loss function can be expressed as:
[The loss function is given as equation image PCTCN2021128477-appb-000005 in the original filing.]
P and K are the outputs of the network, namely the face confidence (the normalized amplitude of the Gaussian distribution) and the width and height of the face (the variances of the Gaussian distribution), and λ is a loss scale coefficient.
An embodiment of the present application further provides a fast face density prediction and face detection apparatus, including:
an image acquisition module for acquiring an image to be detected;
a feature extraction module for extracting multi-scale features from the image to be detected using feature pyramid residual blocks;
a feature fusion module for performing feature fusion using a mutual embedding upsampling module;
a detection result module for predicting the face confidence and the width and height of the face using a face detection module.
An embodiment of the present application further provides an electronic device including a memory, a processor, and machine-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the machine-readable instructions, performs the method described above.
An embodiment of the present application further provides a storage medium on which a computer program is stored, wherein the program, when run by a processor, performs the method described above.
Through the above technical solutions, the beneficial effects of the present invention are as follows:
The present application predicts the face density in an image and detects the faces in the image by predicting Gaussian distributions, avoiding the instability introduced by candidate boxes; a feature pyramid residual block uses small convolution kernels to enlarge the receptive field of neurons without increasing the depth of the network; since neither the depth nor the parameters of the network are increased while the receptive field is enlarged, the network can extract more face information; the mutual embedding upsampling module performs feature fusion so that, when high-level and low-level features are fused, the texture information of the high-level features and the detail information of the low-level features are fully exploited.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the fast face density prediction and face detection method provided by an embodiment of the present application;
FIG. 2 is a structural block diagram of the face density prediction and face detection model provided by an embodiment of the present application;
FIG. 3 is a structural block diagram of the feature pyramid residual block provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the fast face density prediction and face detection apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present invention belongs. The terms used herein in the specification of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Referring to FIG. 1, a schematic flowchart of the fast face density prediction and face detection method provided by an embodiment of the present application, a fast face density prediction and face detection method includes the following steps:
Step S1: acquiring an image to be detected;
The image to be detected is an image that needs to be checked for the presence of a human face, for example a color image, a black-and-white image, or a binary image of a face.
The image to be detected in step S1 may be obtained by photographing the target object with a terminal device such as a video camera, a video recorder, or a color camera; by retrieving a pre-stored image, for example capturing it from a real-time video stream or a video file in the file system, or reading it from a database or a removable storage device; or by obtaining it from the Internet using a browser or another application.
Step S2: extracting multi-scale features from the image to be detected using feature pyramid residual blocks;
In the embodiment of the present application, referring to FIG. 2, the structural block diagram of the face density prediction and face detection model, extracting multi-scale features from the image to be detected using feature pyramid residual blocks includes the following steps:
Step S2.1: in the first stage, a 3×3 convolution kernel is used to convolve the image to be detected, and the convolved image is fed into the feature pyramid residual block to extract features;
Step S2.2: in the second stage, a plurality of feature pyramid residual blocks are combined into a feature extraction network to extract features from the feature map output by step S2.1;
Step S2.3: in the third stage, a plurality of feature pyramid residual blocks are combined into a feature extraction network to extract features from the feature map output by step S2.2.
Specifically, for the feature pyramid residual block, see FIG. 3, the structural block diagram of the feature pyramid residual block provided by an embodiment of the present application;
A 1×1 convolution operation is used to expand the number of channels of the feature map; the feature map is divided evenly into 4 groups along the channel direction, the first group being convolved with a 3×3 kernel of dilation 1, the second group with a 3×3 kernel of dilation 2, the third group with a 3×3 kernel of dilation 4, and the fourth group with a 3×3 kernel of dilation 8; the 4 groups of convolved features are concatenated in order to form a first feature map, a 1×1 convolution fuses the first feature map into a second feature map, and the input feature map and the second feature map are added together.
The receptive fields of the dilated convolutions of the first, second, third, and fourth groups are 3, 5, 9, and 17, respectively (a 3×3 kernel with dilation d spans 2d + 1 pixels, giving 3, 5, 9, and 17 for d = 1, 2, 4, and 8).
In a feature extraction network, a neuron obtains a larger receptive field either through a larger convolution kernel or through a deeper network, and both approaches increase the parameter count of the feature extraction network. The present application adopts a new feature pyramid residual block that uses small convolution kernels to enlarge the receptive field of neurons without increasing the depth of the network, while widening the network horizontally so that it can extract more face information.
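The following is a minimal PyTorch sketch of the feature pyramid residual block as described above. The class name, the channel expansion factor, and the omission of normalization and activation layers are illustrative assumptions; the text specifies only the 1×1 channel expansion, the four-way channel split with depthwise 3×3 dilated convolutions of dilation 1, 2, 4, and 8, the ordered concatenation, the 1×1 fusion, and the residual addition.

```python
import torch
import torch.nn as nn

class FeaturePyramidResidualBlock(nn.Module):
    # hypothetical class name; "expansion" controls the 1x1 channel expansion
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        mid = channels * expansion
        assert mid % 4 == 0, "expanded channels must split evenly into 4 groups"
        g = mid // 4
        # 1x1 convolution that expands the number of channels
        self.expand = nn.Conv2d(channels, mid, kernel_size=1)
        # one depthwise 3x3 dilated convolution per channel group;
        # padding = dilation keeps the spatial size unchanged
        self.branches = nn.ModuleList(
            nn.Conv2d(g, g, kernel_size=3, padding=d, dilation=d, groups=g)
            for d in (1, 2, 4, 8)
        )
        # 1x1 convolution that fuses the concatenated groups back to the
        # input channel count so the residual addition is well-defined
        self.fuse = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.expand(x)
        groups = torch.chunk(y, 4, dim=1)  # split along the channel direction
        y = torch.cat([b(t) for b, t in zip(self.branches, groups)], dim=1)
        return x + self.fuse(y)  # residual connection
```

A backbone in the spirit of steps S2.1 to S2.3 could stack several such blocks per stage with downsampling between stages; the block counts and strides are not fixed by the text.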
Step S3: performing feature fusion using the mutual embedding upsampling module;
Specifically, the embodiment of the present application applies the mutual embedding upsampling module to the high-stage feature map: a channel attention model yields a first attention coefficient for each channel, and the first attention coefficients are multiplied by the low-stage features to obtain a first fused feature fused by the channel attention model;
On the low-stage feature map, a spatial attention model yields a second attention coefficient for each point of the feature map, and the second attention coefficients are multiplied by the upsampled high-stage feature map to obtain a second fused feature fused by the spatial attention model;
The first fused feature and the second fused feature are added to obtain the final fused feature.
The channel attention model and the spatial attention model are common techniques in the field; they focus on mechanisms for attending to local information, such as a particular region of an image. The attended region tends to change as the task changes, and the details are not repeated here.
The present application performs feature fusion with the mutual embedding upsampling module, so that when high-level and low-level features are fused, the texture information of the high-level features and the detail information of the low-level features are fully exploited.
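Below is a minimal PyTorch sketch of the mutual embedding upsampling module. The text does not fix the internals of the attention models, so the squeeze-and-excitation-style channel attention, the 1×1-convolution spatial attention, and the 1×1 projection of the high-stage map are assumptions; only the overall wiring follows the description: channel attention derived from the high-stage map gates the low-stage features, spatial attention derived from the low-stage map gates the upsampled high-stage features, and the two products are added.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualEmbeddingUpsample(nn.Module):
    # hypothetical module name; high_ch/low_ch are the stage channel counts
    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        # assumed channel attention: global pooling + 1x1 conv + sigmoid
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_ch, low_ch, kernel_size=1),
            nn.Sigmoid(),
        )
        # assumed spatial attention: a single 1x1 conv + sigmoid
        self.spatial_att = nn.Sequential(
            nn.Conv2d(low_ch, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # project the high-stage map to the low-stage channel count
        self.proj = nn.Conv2d(high_ch, low_ch, kernel_size=1)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # first fused feature: per-channel coefficients from the high-stage
        # map, multiplied onto the low-stage features
        f1 = self.channel_att(high) * low
        # second fused feature: per-point coefficients from the low-stage
        # map, multiplied onto the upsampled high-stage features
        up = F.interpolate(self.proj(high), size=low.shape[2:],
                           mode="bilinear", align_corners=False)
        f2 = self.spatial_att(low) * up
        return f1 + f2  # final fused feature
```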
Step S4: predicting the face confidence and the width and height of the face using the face detection model network. Specifically, this includes the following steps:
Step S4.1: convolving the features fused in step S3.2 with a 3×3 convolution kernel;
Step S4.2: predicting the face confidence and the width and height of the face with two 1×1 convolution kernels, respectively.
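A minimal PyTorch sketch of this detection head is given below. The intermediate channel count and the sigmoid on the confidence map are assumptions; the text specifies only one 3×3 convolution followed by two 1×1 convolutions, one predicting the face confidence and one predicting the width and height.

```python
import torch
import torch.nn as nn

class FaceDetectionHead(nn.Module):
    # hypothetical class name; mid_ch is an assumed intermediate width
    def __init__(self, in_ch: int, mid_ch: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.conf = nn.Conv2d(mid_ch, 1, kernel_size=1)  # face confidence P
        self.size = nn.Conv2d(mid_ch, 2, kernel_size=1)  # width/height K

    def forward(self, x: torch.Tensor):
        x = torch.relu(self.conv(x))
        p = torch.sigmoid(self.conf(x))  # normalized Gaussian amplitude
        k = self.size(x)                 # per-pixel face width and height
        return p, k
```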
The annotations are obtained by marking the face regions in the face images with bounding boxes and marking the classification and key points corresponding to the face regions; the key points represent key feature points within the face region. Optionally, a further output can be attached at the end of this method, and the key points of the face are detected by the same method of predicting the position of the face center point.
The image to be detected can be regarded as a two-dimensional coordinate system with the upper-left corner of the image as the origin; a face in the image can then be regarded as a two-dimensional Gaussian distribution. The center of the face is the center point of the Gaussian distribution, its coordinates correspond to the mean of the two-dimensional Gaussian distribution, and the width and height of the face correspond to the variances of the two-dimensional Gaussian distribution.
Another embodiment of the present application further provides the labels and the loss function for the network training process, specifically:
A face whose center point is (x, y) is expressed as:
f = N(x, y, σ1, σ2)
where x and y are the means of the two-dimensional Gaussian distribution N, and σ1 and σ2 are the variances of the two-dimensional Gaussian distribution, corresponding respectively to the width and height of the face. Therefore, the face distribution corresponding to an image containing n faces can be expressed as:
I(x, y) = max_i N(x_i, y_i, σ1_i, σ2_i), i = 1, 2, …, n;
The labels for this image can be expressed as:
[The label definitions for Ω and Ψ are given as equation images PCTCN2021128477-appb-000006 to PCTCN2021128477-appb-000009 in the original filing.]
Ω is the label for predicting the face center point, and Ψ is the label for predicting the face width and height;
The loss function can be expressed as:
[The loss function is given as equation image PCTCN2021128477-appb-000010 in the original filing.]
P and K are the outputs of the network, namely the face confidence (the normalized amplitude of the Gaussian distribution) and the width and height of the face (the variances of the Gaussian distribution), and λ is a loss scale coefficient.
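As a concrete illustration of the ground-truth construction, the following NumPy sketch builds the map I(x, y) = max_i N(x_i, y_i, σ1_i, σ2_i) defined above. Normalizing each Gaussian to a peak value of 1 (matching the "normalized Gaussian amplitude" confidence) is an assumption, and the exact label tensors Ω and Ψ are available only as equation images in the original filing.

```python
import numpy as np

def gaussian_label_map(h, w, faces):
    """faces: iterable of (cx, cy, sigma1, sigma2), one tuple per face."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    label = np.zeros((h, w), dtype=np.float32)
    for cx, cy, s1, s2 in faces:
        # unnormalized 2D Gaussian with peak 1 at the face center
        g = np.exp(-((xs - cx) ** 2 / (2 * s1 ** 2) +
                     (ys - cy) ** 2 / (2 * s2 ** 2)))
        label = np.maximum(label, g)  # max over the n face Gaussians
    return label

# usage: a 160x120 label map with one face centered at (50, 40)
# target = gaussian_label_map(120, 160, [(50, 40, 12.0, 16.0)])
```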
Therefore, this method predicts the face density in the image and detects the faces in the image by predicting Gaussian distributions, avoiding the instability introduced by the use of candidate boxes.
Referring to FIG. 4, the schematic structural diagram of the fast face density prediction and face detection apparatus provided by an embodiment of the present application, an embodiment of the present application provides a face density prediction and face detection apparatus 300, including:
an image acquisition module 310, configured to acquire an image to be detected;
a feature extraction module 320, configured to extract multi-scale features from the image to be detected using feature pyramid residual blocks;
a feature fusion module 330, configured to perform feature fusion using the mutual embedding upsampling module;
a detection result module 340, configured to predict the face confidence and the width and height of the face using the face detection module, to obtain the face detection result.
It should be understood that the apparatus corresponds to the above embodiments of the fast face density prediction and face detection method and can perform the steps involved in those method embodiments. For the specific functions of the apparatus, refer to the description above; the detailed description is omitted here to avoid repetition. The apparatus includes at least one software function module that can be stored in a memory in the form of software or firmware or fixed in the operating system (OS) of the apparatus.
Referring to FIG. 5, the schematic structural diagram of the electronic device provided by an embodiment of the present application, an electronic device 400 provided by an embodiment of the present application includes a processor 410 and a memory 420; the memory 420 stores machine-readable instructions executable by the processor 410, and the above method is performed when the machine-readable instructions are executed by the processor 410.
An embodiment of the present application further provides a storage medium 430 on which a computer program is stored, and the computer program, when run by the processor 410, performs the above method.
The storage medium 430 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implying the number of the indicated technical features. Accordingly, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. "Plurality" means two or more, unless expressly and specifically limited otherwise.
In the present invention, unless otherwise expressly specified and limited, the terms "mounted", "connected", "coupled", and "fixed" shall be understood broadly; for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or merely that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or merely that the first feature is at a lower level than the second feature.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.
Any description of a process or method in the flowcharts, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit the program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection with one or more wires (an electronic device), a portable computer diskette (a magnetic device), a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following technologies known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or some of the steps carried by the methods of the above embodiments can be completed by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium; when executed, the program includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (12)

  1. A fast face density prediction and face detection method, characterized in that it comprises the following steps:
    Step S1: acquiring an image to be detected;
    Step S2: extracting multi-scale features from the image to be detected using feature pyramid residual blocks;
    Step S3: performing feature fusion using a mutual-embedding upsampling module;
    Step S4: predicting the face confidence and the face width and height using a face detection module.
  2. The fast face density prediction and face detection method according to claim 1, characterized in that the step S2 comprises:
    Step S2.1: convolving the image to be detected with a 3×3 convolution kernel, and feeding the convolved image into the feature pyramid residual block to extract features;
    Step S2.2: combining a plurality of the feature pyramid residual blocks into a feature extraction network, and extracting features from the feature map output by step S2.1;
    Step S2.3: combining a plurality of the feature pyramid residual blocks into a feature extraction network, and extracting features from the feature map output by step S2.2.
  3. The fast face density prediction and face detection method according to claim 2, characterized in that the step S3 comprises:
    Step S3.1: using the mutual-embedding upsampling module to fuse the features extracted in step S2.2 with the features extracted in step S2.3;
    Step S3.2: using the mutual-embedding upsampling module to fuse the features fused in step S3.1 with the features extracted in step S2.1.
  4. The fast face density prediction and face detection method according to claim 3, characterized in that the step S4 comprises:
    Step S4.1: convolving the features fused in step S3.2 with one 3×3 convolution kernel;
    Step S4.2: using two 1×1 convolution kernels to predict the face confidence and the face width and height, respectively.
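A minimal sketch of this detection head follows, mapping steps S4.1 and S4.2 onto layers; the input channel count and the sigmoid/ReLU output activations are assumptions, since the claim does not specify them.

```python
import torch
import torch.nn as nn

class DetectionHeadSketch(nn.Module):
    def __init__(self, in_ch=64):  # in_ch is an assumed channel count
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)  # step S4.1: one 3x3 conv
        self.conf = nn.Conv2d(in_ch, 1, 1)                 # step S4.2: 1x1 conv, confidence
        self.size = nn.Conv2d(in_ch, 2, 1)                 # step S4.2: 1x1 conv, width/height

    def forward(self, fused):
        h = torch.relu(self.conv(fused))
        return torch.sigmoid(self.conf(h)), torch.relu(self.size(h))
```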
  5. The fast face density prediction and face detection method according to claim 1, characterized in that the feature pyramid residual block comprises:
    expanding the number of channels of the feature map using a 1×1 convolution operation;
    dividing the feature map evenly into 4 groups along the channel dimension, where group 1 convolves its features with a 3×3 convolution kernel with a dilation of 1, group 2 convolves its features with a 3×3 convolution kernel with a dilation of 2, group 3 convolves its features with a 3×3 convolution kernel with a dilation of 4, and group 4 convolves its features with a 3×3 convolution kernel with a dilation of 8;
    combining the 4 groups of convolved features in order to form a first feature map, and fusing the first feature map with a 1×1 convolution to form a second feature map;
    adding the feature map and the second feature map together.
  6. The fast face density prediction and face detection method according to claim 5, characterized in that it further comprises:
    before the dilated convolution of group 2, adding the group-2 features to the features output by the group-1 convolution;
    before the dilated convolution of group 3, adding the group-3 features to the features output by the group-2 convolution;
    before the dilated convolution of group 4, adding the group-4 features to the features output by the group-3 convolution.
  7. The fast face density prediction and face detection method according to claim 6, characterized in that it further comprises:
    the receptive fields of the dilated convolutions of group 1, group 2, group 3, and group 4 being 3, 5, 9, and 17, respectively.
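By way of illustration, the following PyTorch sketch combines claims 5-7: the 1×1 channel expansion, the four-way channel split with dilated 3×3 convolutions (dilations 1, 2, 4, 8, whose receptive fields are 3, 5, 9, and 17, as claim 7 states), the inter-group additions of claim 6, the 1×1 fusion, and the residual addition. The expansion factor of 2 is an assumption, as is having the final 1×1 fusion restore the input channel count so the residual addition type-checks.

```python
import torch
import torch.nn as nn

class FeaturePyramidResidualBlockSketch(nn.Module):
    def __init__(self, channels=64, expand=2):  # expansion factor is an assumption
        super().__init__()
        mid = channels * expand
        self.expand = nn.Conv2d(channels, mid, 1)   # 1x1 conv expands the channel count
        g = mid // 4                                # four equal groups along channels
        # dilated 3x3 convs with dilations 1, 2, 4, 8 -> receptive fields 3, 5, 9, 17
        self.branches = nn.ModuleList(
            nn.Conv2d(g, g, 3, padding=d, dilation=d) for d in (1, 2, 4, 8)
        )
        self.fuse = nn.Conv2d(mid, channels, 1)     # 1x1 conv fuses the recombined groups

    def forward(self, x):
        f = self.expand(x)
        groups = torch.chunk(f, 4, dim=1)
        outs = [self.branches[0](groups[0])]
        for i in (1, 2, 3):
            # claim 6: add the previous group's output before the dilated conv
            outs.append(self.branches[i](groups[i] + outs[i - 1]))
        second = self.fuse(torch.cat(outs, dim=1))  # the "second feature map"
        return x + second                           # residual addition
```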
  8. The fast face density prediction and face detection method according to claim 1, characterized in that the mutual-embedding upsampling module comprises:
    on the high-stage feature map, using a channel attention model to obtain a first attention coefficient for each channel, and multiplying the first attention coefficient by the low-stage features to obtain a first fused feature fused by the channel attention model;
    on the low-stage feature map, using a spatial attention model to obtain a second attention coefficient for each point in the feature map, and multiplying the second attention coefficient by the upsampled high-stage feature map to obtain a second fused feature fused by the spatial attention model;
    adding the first fused feature and the second fused feature to obtain the final fused feature.
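A hedged sketch of this mutual-embedding upsampling module follows. The claim does not fix the form of the two attention models, so an SE-style channel attention and a 1×1-convolution spatial attention are assumed here, as is an equal channel count for the high-stage and low-stage feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualEmbeddingUpsampleSketch(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # channel attention over the high-stage feature map (SE-style, assumed)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )
        # spatial attention over the low-stage feature map (1x1 conv, assumed)
        self.spatial_att = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, low, high):
        up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                           align_corners=False)
        first = self.channel_att(high) * low  # per-channel coefficients x low-stage features
        second = self.spatial_att(low) * up   # per-point coefficients x upsampled high-stage
        return first + second                 # final fused feature
```

Per claim 3, the same module would be applied twice: first to fuse the stage-2 and stage-3 features, then to fuse that result with the stage-1 features.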
  9. The fast face density prediction and face detection method according to any one of claims 1-8, characterized in that it further comprises performing network training with the following labels and loss function:
    a face with center point (x, y) is represented as:
    f = N(x, y, σ1, σ2)
    where x and y are the means of the two-dimensional Gaussian distribution N, and σ1 and σ2 are the variances of the two-dimensional Gaussian distribution, corresponding to the width and height of the face, respectively; therefore, the face distribution corresponding to an image containing n faces can be expressed as:
    I(x, y) = max(N(x_i, y_i, σ1_i, σ2_i)), i = 1, 2, …, n;
    and the label of the image can be expressed as:
    [Label formulas given in the original only as images PCTCN2021128477-appb-100001 to PCTCN2021128477-appb-100004.]
    where Ω is the label for predicting the face center point and Ψ is the label for predicting the face width and height;
    the loss function can be expressed as:
    [Loss function formula given in the original only as image PCTCN2021128477-appb-100005.]
    where P and K are the outputs of the network, namely the face confidence and the face width and height, respectively, and λ is the loss scale coefficient.
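As a worked illustration of the claim-9 label construction, the sketch below builds the face distribution I(x, y) as the per-pixel maximum of one unnormalized two-dimensional Gaussian per face, using σ1 and σ2 as variances as stated above. The unit peak value at each face center is an assumption, and the Ω and Ψ label formulas themselves, given only as images in the original, are not reproduced here.

```python
import numpy as np

def face_distribution(h, w, faces):
    # faces: list of (x, y, sigma1, sigma2) per the claim, with sigma1 and
    # sigma2 used as the Gaussian variances along the width and height axes
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.zeros((h, w), dtype=np.float32)
    for (x, y, s1, s2) in faces:
        g = np.exp(-((xs - x) ** 2 / (2.0 * s1) + (ys - y) ** 2 / (2.0 * s2)))
        dist = np.maximum(dist, g)  # I(x, y) = max over the n faces
    return dist

# e.g. a 64x64 label map containing two faces of different sizes
label = face_distribution(64, 64, [(20, 20, 16.0, 25.0), (45, 30, 36.0, 49.0)])
```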
  10. A fast face density prediction and face detection apparatus, characterized in that it comprises:
    an image acquisition module, configured to acquire an image to be detected;
    a feature extraction module, configured to extract multi-scale features from the image to be detected using feature pyramid residual blocks;
    a feature fusion module, configured to perform feature fusion using a mutual-embedding upsampling module;
    a detection result module, configured to predict the face confidence and the face width and height using a face detection module.
  11. An electronic device, comprising a memory, a processor, and machine-readable instructions stored in the memory and executable on the processor, characterized in that, when the processor executes the machine-readable instructions, the fast face density prediction and face detection method according to any one of claims 1-9 is implemented.
  12. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the fast face density prediction and face detection method according to any one of claims 1-9 is implemented.