CN109598212B - Face detection method and device - Google Patents

Face detection method and device

Info

Publication number
CN109598212B
CN109598212B (application number CN201811382039.9A)
Authority
CN
China
Prior art keywords
layer
convolution
features
inputting
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811382039.9A
Other languages
Chinese (zh)
Other versions
CN109598212A (en)
Inventor
唐华阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd filed Critical Beijing Knownsec Information Technology Co Ltd
Priority to CN201811382039.9A priority Critical patent/CN109598212B/en
Publication of CN109598212A publication Critical patent/CN109598212A/en
Application granted granted Critical
Publication of CN109598212B publication Critical patent/CN109598212B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a face detection method and a face detection device, which address the prior-art problem that detection performs poorly when faces in an image vary in lighting, are occluded or blurred, or are very small. The face detection method comprises the following steps: inputting an image containing a human face into a first preset network model for compression to obtain compression features; inputting the compression features into a second preset network model to extract fine features, obtaining a plurality of fine features; inputting the plurality of fine features into a feature fusion network model for fusion to obtain a plurality of fusion features; inputting the fusion features into a regional candidate network model to extract candidate frames, obtaining a plurality of candidate frames and the probability of each candidate frame; and, if the probability of a candidate frame exceeds a preset threshold, taking that candidate frame as a face detection result.

Description

Face detection method and device
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a face detection method and apparatus.
Background
Face detection means that, for any given image, a certain strategy is adopted to search the image and determine whether it contains a human face; if so, the position, size and pose of each face are returned.
At present, many deep-learning-based face detection models perform well on face detection data sets. In actual use, however, image backgrounds vary greatly, and depending on the scene a face may appear under different lighting, be occluded or blurred, or be very small. In these special scenes the current face detection technology does not perform well and the detection effect is not ideal. The prior art therefore suffers from unsatisfactory detection when faces in an image vary in lighting, are occluded or blurred, or are very small.
Disclosure of Invention
In view of this, the present application provides a face detection method and apparatus for solving the prior-art problem that the detection effect is not ideal when faces in an image vary in lighting, are occluded or blurred, or are very small.
The embodiment of the application provides a face detection method, which comprises the following steps: inputting an image containing a human face into a first preset network model for compression to obtain compression features; inputting the compression features into a second preset network model to extract fine features, obtaining a plurality of fine features; inputting the plurality of fine features into a feature fusion network model for fusion to obtain a plurality of fusion features; inputting the fusion features into a regional candidate network model to extract candidate frames, obtaining a plurality of candidate frames and the probability of each candidate frame; and if the probability of a candidate frame exceeds a preset threshold, taking that candidate frame as a face detection result.
Optionally, in this embodiment of the application, the inputting an image including a human face into a first preset network model for compression to obtain a compression feature includes: and inputting the image containing the human face into at least one convolution layer and at least one pooling layer for preset calculation to obtain compression characteristics.
Optionally, in this embodiment of the application, the inputting the image including the human face into at least one convolution layer and at least one pooling layer for performing a preset calculation to obtain a compression feature includes: inputting the image containing the human face into a first convolution layer to perform first convolution calculation to obtain a feature obtained by the first convolution calculation; inputting the features obtained by the first convolution calculation into a first pooling layer to perform first pooling calculation, and obtaining features obtained by the first pooling calculation; inputting the features obtained by the first pooling calculation into a second convolution layer for second convolution calculation to obtain features obtained by the second convolution calculation, wherein the convolution kernel size of the first convolution layer is different from that of the second convolution layer; and inputting the features obtained by the second convolution calculation into a second pooling layer to perform second pooling calculation, obtaining features obtained by the second pooling calculation, and taking the features obtained by the second pooling calculation as compression features.
Optionally, in this embodiment of the application, the inputting the compressed features into a second preset network model to extract fine features and obtain a plurality of fine features includes: inputting the compressed features into at least one perceptron to carry out a preset operation to obtain a first fine feature; inputting the first fine feature into a first convolution block for operation to obtain a second fine feature, wherein the first convolution block comprises at least one convolution layer; and inputting the second fine feature into a second convolution block for operation to obtain a third fine feature, wherein the second convolution block comprises at least one convolution layer.
Optionally, in this embodiment of the present application, the inputting the compressed features into at least one perceptron for a preset operation to obtain a first fine feature includes: inputting the compressed features into a first perceptron for operation to obtain a first perceptual feature; inputting the first perceptual feature into a second perceptron for operation to obtain a second perceptual feature; and inputting the second perceptual feature into a third perceptron for operation to obtain a third perceptual feature, and taking the third perceptual feature as the first fine feature.
Optionally, in this embodiment of the present application, the inputting the first fine feature into the first convolution block for operation to obtain a second fine feature includes: inputting the first fine feature into a third convolution layer for convolution operation to obtain a feature obtained by third convolution calculation; inputting the features obtained by the third convolution calculation into a fourth convolution layer for convolution operation to obtain features obtained by the fourth convolution calculation, and taking the features obtained by the fourth convolution calculation as second fine features, wherein the convolution kernel size of the third convolution layer is different from the convolution kernel size of the fourth convolution layer.
Optionally, in this embodiment of the present application, the inputting the second fine feature into a second convolution block for operation to obtain a third fine feature includes: inputting the second fine feature into a fifth convolution layer for convolution operation to obtain a feature obtained by fifth convolution calculation; inputting the features obtained by the fifth convolution calculation into a sixth convolution layer for convolution operation to obtain features obtained by the sixth convolution calculation, and taking the features obtained by the sixth convolution calculation as third fine features, wherein the step size of the fifth convolution layer is different from that of the sixth convolution layer.
Optionally, in this embodiment of the application, before the inputting of the image containing the face into the first preset network model for compression to obtain the compression features, the method further includes: sequentially connecting a first preset network, a second preset network, a feature fusion network and a regional candidate network to obtain a face detection network; training the face detection network by using training samples in a training data set to obtain a face detection network model, wherein the face detection network model comprises: the first preset network model, the second preset network model, the feature fusion network model and the regional candidate network model.
The embodiment of the present application further provides a face detection device, which includes: a compression feature obtaining module, used for inputting an image containing a human face into a first preset network model for compression to obtain compression features; a fine feature obtaining module, used for inputting the compressed features into a second preset network model to extract fine features, obtaining a plurality of fine features; a fusion feature obtaining module, used for inputting the plurality of fine features into a feature fusion network model for fusion to obtain a plurality of fusion features; a frame and probability obtaining module, configured to input the fusion features into a regional candidate network model to extract candidate frames, obtaining a plurality of candidate frames and the probability of each candidate frame; and a face detection result module, used for taking the candidate frame corresponding to that probability as the face detection result.
Optionally, in an embodiment of the present application, the method further includes: the detection network obtaining module is used for sequentially connecting the first preset network, the second preset network, the feature fusion network and the regional candidate network to obtain a face detection network; a network model obtaining module, configured to train the face detection network by using training samples in a training data set to obtain a face detection network model, where the face detection network model includes: the first preset network model, the second preset network model, the feature fusion network model and the regional candidate network model.
The application provides a face detection method and device. An image containing a face is input into a first preset network model for compression to obtain compression features; the compression features are input into a second preset network model to extract a plurality of fine features; the fine features are then fused, and the fusion features are input into a regional candidate network model to extract candidate frames and their probabilities. If the probability of a candidate frame exceeds a preset threshold, that candidate frame is taken as a face detection result. Through feature compression, fine feature extraction and feature fusion, poor detection when the face in the image is unclear is avoided. The method thereby effectively solves the prior-art problem that detection is unsatisfactory when faces in an image vary in lighting, are occluded or blurred, or are very small.
In order to make the aforementioned and other objects and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application;
fig. 2 is a schematic flow chart illustrating a face detection method provided in an embodiment of the present application;
fig. 3 is a schematic flowchart illustrating a step S100 of a face detection method according to an embodiment of the present application;
fig. 4 is a schematic flowchart illustrating a step S200 of a face detection method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a perceptron provided in an embodiment of the present application;
fig. 6 is a schematic flowchart illustrating step S210 of a face detection method according to an embodiment of the present application;
fig. 7 is a schematic flowchart illustrating step S220 of a face detection method according to an embodiment of the present application;
fig. 8 is a schematic flowchart illustrating step S230 of a face detection method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram illustrating a feature fusion network provided in an embodiment of the present application;
fig. 10 is a schematic flowchart illustrating a process before step S100 of a face detection method provided in an embodiment of the present application;
fig. 11 shows a schematic structural diagram of the whole face detection apparatus provided in the embodiment of the present application.
Reference numerals: 100 - face detection device; 101 - processor; 102 - memory; 103 - storage medium; 109 - electronic device; 110 - compression feature obtaining module; 120 - fine feature obtaining module; 130 - fusion feature obtaining module; 140 - frame and probability obtaining module; 150 - face detection result module; 160 - detection network obtaining module; 170 - network model obtaining module.
Detailed Description
The embodiment of the application provides a face detection method and a face detection device, which are used for solving the prior-art problem that the detection effect is not ideal when faces in an image vary in lighting, are occluded or blurred, or are very small. The method and the device applied to the electronic equipment are based on the same inventive concept; since the principles by which the method, the corresponding device and the equipment solve the problem are similar, their implementations may refer to one another, and repeated descriptions are omitted.
Some terms in the embodiments of the present application will be explained below to facilitate understanding by those skilled in the art.
Object detection, also called object extraction, is image segmentation based on object geometry and statistical features; it combines object segmentation and recognition, and its accuracy and real-time performance are important capabilities of the whole system. Especially in complex scenes where multiple targets need to be processed in real time, automatic target extraction and identification are particularly important.
Convolutional Neural Networks (CNNs) are a class of Feedforward Neural Networks that contain convolution calculations and have deep structures, and are one of the representative algorithms of deep learning.
Missing data processing: in data mining, the original mass of data contains a large amount of incomplete, inconsistent, abnormal and deviating data. At best such problem data reduces the efficiency of data mining; at worst it distorts the results. Data preprocessing is therefore indispensable, and a common task is handling missing values in the data set. Missing value processing falls into two categories: deleting the records that contain missing data, or performing data interpolation, also called missing value imputation. The former is simple and crude, but its biggest limitation is that it discards historical data in exchange for complete data, wasting resources; especially when the data set is small, deleting records can directly affect the objectivity and accuracy of the analysis result. Common data interpolation methods include the sliding average window method, Lagrange interpolation, and the like.
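As an illustration of the sliding-average-window idea mentioned above, the following Python sketch fills missing entries with the mean of nearby observed values. The window size of 3 and the sample data are arbitrary assumptions for illustration only and are not taken from this application.

```python
import numpy as np

def moving_average_fill(values, window=3):
    """Fill NaN entries with the mean of nearby observed values (simple sliding window)."""
    vals = np.asarray(values, dtype=float)
    filled = vals.copy()
    half = window // 2
    for i in np.where(np.isnan(vals))[0]:
        lo, hi = max(0, i - half), min(len(vals), i + half + 1)
        neighbours = vals[lo:hi]
        neighbours = neighbours[~np.isnan(neighbours)]
        if neighbours.size:              # leave the gap if nothing nearby was observed
            filled[i] = neighbours.mean()
    return filled

print(moving_average_fill([2.0, np.nan, 3.0, np.nan, 5.0, 6.0]))
# the two gaps are filled with 2.5 and 4.0 respectively
```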
Normalization has two forms: one maps a number to a decimal between 0 and 1, and the other turns a dimensional expression into a dimensionless one. It is mainly used to make subsequent data processing convenient: the data are mapped into the range 0-1, which is faster and more convenient to handle and belongs to digital signal processing. For example, normalizing [2.5, 3.5, 0.5, 1.5]: 2.5 + 3.5 + 0.5 + 1.5 = 8, 2.5/8 = 0.3125, 3.5/8 = 0.4375, 0.5/8 = 0.0625, 1.5/8 = 0.1875, giving [0.3125, 0.4375, 0.0625, 0.1875]. In general, normalization limits the data to be processed (by some algorithm) to the range you need. It serves, first, to simplify later data processing and, second, to speed up convergence when the program runs; its specific role is to unify the statistical distribution of the samples. Normalization yields a statistical probability distribution between 0 and 1, whereas standardization yields a statistical coordinate distribution over a certain interval.
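The sum-normalization worked through above can be written in a few lines of Python; this snippet is purely illustrative of the arithmetic.

```python
values = [2.5, 3.5, 0.5, 1.5]
total = sum(values)                       # 2.5 + 3.5 + 0.5 + 1.5 = 8
normalized = [v / total for v in values]  # divide each element by the total
print(normalized)                         # [0.3125, 0.4375, 0.0625, 0.1875]
```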
Regularization: in linear algebra theory, an ill-posed problem is usually described by a set of linear algebraic equations, and such a set of equations typically has a large condition number. A large condition number means that rounding errors or other small errors can severely affect the outcome of the problem.
In addition, it should be understood that, in the description of the embodiments of the present application, the terms "first", "second", "third" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or order.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 109 provided in an embodiment of the present application includes: a processor 101 and a memory 102, the memory 102 storing machine readable instructions executable by the processor 101; when executed by the processor 101, the machine readable instructions perform the face detection method described below.
In a specific implementation, the computations of a Convolutional Neural Network (CNN) may be accelerated by a Graphics Processing Unit (GPU), so the electronic device may further include a graphics processor. In addition, when a distributed computing framework is used, a communication interface is required, and the electronic device may further include components such as a communication and network expansion card, an optical fiber card or a multi-serial-port communication card, which are not described here again.
Referring to fig. 1, an embodiment of the present application provides a storage medium 103, where the storage medium 103 stores a computer program, and the computer program is executed by a processor 101 to perform the following face detection method.
Those skilled in the art will appreciate that the configuration of the electronic device shown in fig. 1 does not constitute a limitation of the device; a device provided by an embodiment of the present application may include more or fewer components than those shown, or a different arrangement of components.
First embodiment
Referring to fig. 2, fig. 2 is a schematic flow chart illustrating a face detection method according to an embodiment of the present application. The embodiment of the application provides a face detection method, which comprises the following steps:
step S100: and inputting the image containing the human face into a first preset network model for compression to obtain compression characteristics.
Optionally, in this embodiment of the application, another implementation of inputting the image containing the face into the first preset network model for compression to obtain the compression features may be:
step S100: and inputting the image containing the human face into at least one convolution layer and at least one pooling layer for preset calculation to obtain compression characteristics.
The image data may be training data with a size of (1024, 1024, 3) (denoting the length, the width and the number of channels, respectively; the same applies hereinafter), or may be an image with another length, width and number of channels. Therefore, the image length, width and number of channels given in the embodiment of the present application should not be construed as limiting the embodiment of the present application.
Referring to fig. 3, fig. 3 is a schematic flow chart illustrating step S100 of the face detection method according to the embodiment of the present application. Optionally, in this embodiment of the present application, inputting an image including a human face into at least one convolution layer and at least one pooling layer for performing a preset calculation, to obtain a compression feature, includes:
step S110: and inputting the image containing the human face into the first convolution layer to perform first convolution calculation, and obtaining the features obtained by the first convolution calculation.
It should be noted that, before this step, the method further includes: training a first preset network, a second preset network, a feature fusion network and a regional candidate network by using training samples in a training data set to obtain the first preset network model, the second preset network model, the feature fusion network model and the regional candidate network model.
Step S120: and inputting the features obtained by the first convolution calculation into the first pooling layer for first pooling calculation to obtain the features obtained by the first pooling calculation.
Wherein, the first preset network model comprises: a first convolution layer, a second convolution layer, a first pooling layer and a second pooling layer; the first convolution layer, the first pooling layer, the second convolution layer and the second pooling layer are connected in sequence; the first convolution layer is used for receiving input of an image containing a human face.
Step S130: and inputting the features obtained by the first pooling calculation into a second convolution layer for second convolution calculation to obtain the features obtained by the second convolution calculation, wherein the convolution kernel size of the first convolution layer is different from that of the second convolution layer.
The convolution kernel sizes of the first convolution layer and the second convolution layer are shown in the table below.
Step S140: and inputting the features obtained by the second convolution calculation into a second pooling layer for second pooling calculation to obtain features obtained by the second pooling calculation, and taking the features obtained by the second pooling calculation as compression features.
It should be noted that the first preset network model includes: a first convolution layer, a second convolution layer, a first pooling layer and a second pooling layer; the convolution kernel size and stride of each layer are as follows:

Layer name | Convolution kernel size | Stride
First convolution layer | 7*7*32 | 4
First pooling layer | 2*2 | 2
Second convolution layer | 5*5*64 | 2
Second pooling layer | 2*2 | 2
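The table above can be read as a small convolutional stack. The following is a minimal PyTorch sketch of such a first preset (compression) network; the padding values, the absence of activation functions and the 3-channel RGB input are assumptions for illustration, since the patent only specifies kernel sizes, output channels and strides.

```python
import torch
import torch.nn as nn

class CompressionNet(nn.Module):
    """Sketch of the first preset network: two convolution layers and two pooling layers."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3)   # 7*7*32, stride 4
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)                  # 2*2, stride 2
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2)  # 5*5*64, stride 2
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)                  # 2*2, stride 2

    def forward(self, x):
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        return x  # the "compression features"

features = CompressionNet()(torch.randn(1, 3, 1024, 1024))
print(features.shape)  # a 1024x1024 input is compressed to a 64-channel 32x32 feature map
```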
Step S200: and inputting the compressed features into a second preset network model to extract fine features, and obtaining a plurality of fine features.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a step S200 of a face detection method according to an embodiment of the present application. Optionally, in this embodiment of the present application, inputting the compressed features into a second preset network model to extract fine features, and obtaining a plurality of fine features includes:
step S210: and inputting the compression characteristics into at least one sensor to carry out preset operation to obtain first precise characteristics.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a perceptron provided in the embodiment of the present application. The Chinese characters at the top of each box indicate the name of the layer; the number below, for example "3*3*32", means that "3*3" is the size of the convolution kernel and 32 is the number of outputs. If it is written as "3*3*32, 2", then, unless otherwise specified in the embodiments of the present application, the trailing 2 indicates the stride in addition to the 32 outputs. The structure of the perceptron into which the compression features are input for the preset operation may be, for example, as shown in the figure; of course, in a specific implementation the number of convolution layers may be greater or smaller than in the figure. Therefore, the structure of the perceptron and the number of convolution layers provided in the embodiments of the present application should not be construed as limiting the embodiments of the present application.
Referring to fig. 6, fig. 6 is a schematic flowchart illustrating step S210 of the face detection method according to the embodiment of the present application. Optionally, in this embodiment of the present application, inputting the compressed features into at least one perceptron to perform a preset operation and obtain a first fine feature includes:
step S211: and inputting the compressed features into the first sensor for operation to obtain first sensing features.
Step S212: and inputting the first perception characteristic into a second perceptron for operation to obtain a second perception characteristic.
Step S213: and inputting the second perception characteristic into a third perceptron for operation to obtain a third perception characteristic, and taking the third perception characteristic as a first precise characteristic.
Step S220: and inputting the first fine feature into a first convolution block for operation to obtain a second fine feature, wherein the first convolution block comprises at least one convolution layer.
In a specific implementation, the structures of the first perceptron, the second perceptron and the third perceptron may be the same or different, and may be adjusted according to actual requirements. Therefore, whether the first, second and third perceptrons provided in the embodiment of the present application are identical should not be construed as a limitation of the embodiment of the present application.
Referring to fig. 7, fig. 7 is a flowchart illustrating a step S220 of a face detection method according to an embodiment of the present application. Optionally, in this embodiment of the present application, inputting the first fine feature into the first convolution block for operation to obtain a second fine feature, where the method includes:
step S221: and inputting the first fine feature into a third convolution layer for convolution operation to obtain a feature obtained by the third convolution calculation.
Step S222: and inputting the features obtained by the third convolution calculation into a fourth convolution layer for convolution operation to obtain features obtained by the fourth convolution calculation, wherein the features obtained by the fourth convolution calculation are used as second fine features, and the convolution kernel size of the third convolution layer is different from that of the fourth convolution layer.
Step S230: and inputting the second fine feature into a second convolution block for operation to obtain a sum third fine feature, wherein the second convolution block comprises at least one convolution layer.
Referring to fig. 8, fig. 8 is a schematic flowchart illustrating step S230 of the face detection method according to the embodiment of the present application. Optionally, in this embodiment of the present application, inputting the second fine feature into the second convolution block for operation to obtain a third fine feature includes:
step S231: and inputting the second fine feature into a fifth convolution layer for convolution operation to obtain a feature obtained by the fifth convolution calculation.
Step S232: and inputting the features obtained by the fifth convolution calculation into a sixth convolution layer for convolution operation to obtain the features obtained by the sixth convolution calculation, wherein the features obtained by the sixth convolution calculation are used as third fine features, and the step length of the fifth convolution layer is different from that of the sixth convolution layer.
Step S300: and inputting the plurality of fine features into the feature fusion network model for fusion to obtain a plurality of fusion features.
Referring to fig. 9, fig. 9 is a schematic structural diagram illustrating a feature fusion network according to an embodiment of the present application. The feature fusion network model comprises: a seventh, eighth, ninth, tenth, eleventh, and twelfth convolutional layers; the input end of the seventh convolutional layer is connected with the third perceptron, the output end of the seventh convolutional layer is connected with the eighth convolutional layer, the input end of the ninth convolutional layer is connected with the fourth convolutional layer, the output end of the ninth convolutional layer is connected with the tenth convolutional layer, the input end of the eleventh convolutional layer is connected with the sixth convolutional layer, the output end of the eleventh convolutional layer is connected with the twelfth convolutional layer, and the output end of the eighth convolutional layer, the output end of the tenth convolutional layer and the output end of the twelfth convolutional layer are connected with the regional candidate network model;
the seventh convolutional layer is for receiving the first fine feature, the ninth convolutional layer is for receiving the second fine feature, the eleventh convolutional layer is for receiving the third fine feature, and the eighth, tenth and twelfth convolutional layers are for outputting the plurality of fused features.
Referring to fig. 9, it should be noted that the feature fusion network model further includes a first upsampling layer; the input end of the first upsampling layer is connected with the eleventh convolution layer, and the output end of the first upsampling layer is connected with the tenth convolution layer; the tenth convolutional layer is used for fusing the second fine feature output by the ninth convolutional layer with the third fine feature output by the first upsampling layer. The feature fusion network model further comprises a second upsampling layer; the input end of the second upsampling layer is connected with the ninth convolution layer, the output end of the second upsampling layer is connected with the eighth convolution layer, and the eighth convolution layer is used for fusing the first fine feature output by the seventh convolution layer with the second fine feature output by the second upsampling layer.
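The wiring of Fig. 9 described above can be sketched as follows in PyTorch. The channel counts, the 1x1 kernels, element-wise addition as the fusion operation and a scale factor of 2 for up-sampling are all assumptions; the patent specifies the connections between the seventh to twelfth convolution layers and the two up-sampling layers, but not these hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionNet(nn.Module):
    """Sketch of the feature fusion network: three branches plus two up-sampling fusions."""
    def __init__(self, ch=(64, 128, 256), out_ch=128):
        super().__init__()
        self.conv7 = nn.Conv2d(ch[0], out_ch, 1)   # receives the first fine feature
        self.conv9 = nn.Conv2d(ch[1], out_ch, 1)   # receives the second fine feature
        self.conv11 = nn.Conv2d(ch[2], out_ch, 1)  # receives the third fine feature
        self.conv8 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.conv10 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.conv12 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, fine1, fine2, fine3):
        a, b, c = self.conv7(fine1), self.conv9(fine2), self.conv11(fine3)
        fused3 = self.conv12(c)                                  # twelfth convolution layer
        up1 = F.interpolate(c, scale_factor=2, mode="nearest")   # first up-sampling layer
        fused2 = self.conv10(b + up1)                            # tenth layer fuses 2nd + 3rd
        up2 = F.interpolate(b, scale_factor=2, mode="nearest")   # second up-sampling layer
        fused1 = self.conv8(a + up2)                             # eighth layer fuses 1st + 2nd
        return fused1, fused2, fused3

f1 = torch.randn(1, 64, 32, 32)
f2 = torch.randn(1, 128, 16, 16)
f3 = torch.randn(1, 256, 8, 8)
for fused in FeatureFusionNet()(f1, f2, f3):
    print(fused.shape)  # three fused feature maps handed to the regional candidate network
```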
Step S400: and inputting the fusion characteristics into the regional candidate network model to extract candidate frames, and obtaining the candidate frames and the probability of the candidate frames.
It should be noted that the regional candidate network (RPN) model may be constructed by directly following the Region Proposal Network used in Faster R-CNN (Regions with CNN features), or may be constructed in other ways. Therefore, the specific manner of constructing and obtaining the regional candidate network provided by the embodiment of the present application should not be construed as limiting the embodiment of the present application.
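For illustration, the following is a minimal sketch of a region-proposal head in the style of the Faster R-CNN RPN mentioned above. The anchor count of 3 and the channel width are assumptions; a complete RPN would additionally decode the box offsets against anchor boxes and apply non-maximum suppression, which is omitted here.

```python
import torch
import torch.nn as nn

class SimpleRPNHead(nn.Module):
    """Sketch of a regional candidate network head: per-location scores and box offsets."""
    def __init__(self, in_ch=128, num_anchors=3):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.objectness = nn.Conv2d(in_ch, num_anchors, 1)       # face / not-face score per anchor
        self.bbox_deltas = nn.Conv2d(in_ch, num_anchors * 4, 1)  # box offsets per anchor

    def forward(self, feature_map):
        t = torch.relu(self.shared(feature_map))
        return torch.sigmoid(self.objectness(t)), self.bbox_deltas(t)

scores, deltas = SimpleRPNHead()(torch.randn(1, 128, 32, 32))
print(scores.shape, deltas.shape)  # candidate-frame probabilities and offsets per location
```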
Step S500: and if the probabilities of the candidate frames exceed a preset threshold, taking the candidate frame corresponding to the probability of the candidate frame as a face detection result.
Referring to fig. 10, fig. 10 is a schematic flow chart illustrating the processing before step S100 of the face detection method according to the embodiment of the present application. Optionally, in this embodiment of the present application, before inputting the image containing a human face into the first preset network model for compression to obtain the compression features, the method further includes:
step S80: and sequentially connecting the first preset network, the second preset network, the feature fusion network and the regional candidate network to obtain the face detection network.
Step S90: training a face detection network by using training samples in a training data set to obtain a face detection network model, wherein the face detection network model comprises: the system comprises a first preset network model, a second preset network model, a feature fusion network model and a regional candidate network model.
Second embodiment
Referring to fig. 11, fig. 11 is a schematic structural diagram of a face detection device according to an embodiment of the present application. The face detection apparatus 100 provided in the embodiment of the present application includes:
the compression feature obtaining module 110 is configured to input an image including a human face to a first preset network model for compression, so as to obtain a compression feature.
And a fine feature obtaining module 120, configured to input the compressed features into a second preset network model to extract fine features, so as to obtain a plurality of fine features.
A fusion feature obtaining module 130, configured to input the multiple fine features into the feature fusion network model for fusion, so as to obtain multiple fusion features.
The frame and probability obtaining module 140 is configured to input the fusion features into the regional candidate network model to extract candidate frames, obtaining a plurality of candidate frames and the probability of each candidate frame.
And a face detection result module 150, configured to use the candidate frame corresponding to the probability of the candidate frame as a face detection result.
Optionally, in an embodiment of the present application, the method further includes:
the detection network obtaining module 160 is configured to sequentially connect the first preset network, the second preset network, the feature fusion network, and the area candidate network to obtain a face detection network.
A network model obtaining module 170, configured to train a face detection network by using training samples in a training data set, so as to obtain a face detection network model, where the face detection network model includes: the system comprises a first preset network model, a second preset network model, a feature fusion network model and a regional candidate network model.
The embodiment of the application provides a face detection method and a face detection device: an image containing a face is input into a first preset network model for compression to obtain compression features; the compression features are input into a second preset network model to extract a plurality of fine features; the fine features are fused and the fusion features are input into a regional candidate network model to extract candidate frames and their probabilities; and if the probability of a candidate frame exceeds a preset threshold, that candidate frame is taken as a face detection result. Through feature compression, fine feature extraction and feature fusion, poor detection when the face image is unclear is avoided, effectively solving the prior-art problem that detection is unsatisfactory when faces in an image vary in lighting, are occluded or blurred, or are very small.
The above embodiments are merely preferred examples of the present application and are not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.

Claims (10)

1. A face detection method, comprising:
inputting an image containing a human face into a first preset network model for compression to obtain compression characteristics;
inputting the compressed features into a second preset network model to extract fine features, and obtaining a plurality of fine features;
inputting the plurality of fine features into a feature fusion network model for fusion to obtain a plurality of fusion features;
inputting the fusion features into a regional candidate network model to extract candidate frames, obtaining a plurality of candidate frames and the probability of the candidate frames;
if the probability of the candidate frames exceeds a preset threshold value, taking the candidate frame corresponding to the probability of the candidate frame as a face detection result;
wherein the feature fusion network model comprises: a seventh, eighth, ninth, tenth, eleventh, and twelfth convolutional layers; the output end of the seventh convolutional layer is connected with the eighth convolutional layer, the output end of the ninth convolutional layer is connected with the tenth convolutional layer, the output end of the eleventh convolutional layer is connected with the twelfth convolutional layer, and the output end of the eighth convolutional layer, the output end of the tenth convolutional layer and the output end of the twelfth convolutional layer are connected with the area candidate network model; the seventh convolutional layer for receiving the first fine feature, the ninth convolutional layer for receiving the second fine feature, the eleventh convolutional layer for receiving the third fine feature, the eighth convolutional layer and the tenth convolutional layer for outputting the plurality of fused features;
the feature fusion network model further comprises a first upsampling layer; the input end of the first upsampling layer is connected with the eleventh convolution layer, and the output end of the first upsampling layer is connected with the tenth convolution layer; the tenth convolutional layer is used for fusing the second fine feature output by the ninth convolutional layer with the third fine feature output by the first upsampling layer; the feature fusion network model further comprises a second upsampling layer; the input end of the second upsampling layer is connected with the ninth convolution layer, the output end of the second upsampling layer is connected with the eighth convolution layer, and the eighth convolution layer is used for fusing the first fine feature output by the seventh convolution layer and the second fine feature output by the second upsampling layer.
2. The method of claim 1, wherein the inputting the image containing the human face into a first preset network model for compression to obtain the compression features comprises:
and inputting the image containing the human face into at least one convolution layer and at least one pooling layer for preset calculation to obtain compression characteristics.
3. The method of claim 2, wherein the inputting the image including the human face into at least one convolution layer and at least one pooling layer for performing a predetermined calculation to obtain a compressed feature comprises:
inputting the image containing the human face into a first convolution layer to perform first convolution calculation to obtain a feature obtained by the first convolution calculation;
inputting the features obtained by the first convolution calculation into a first pooling layer to perform first pooling calculation, and obtaining features obtained by the first pooling calculation;
inputting the features obtained by the first pooling calculation into a second convolution layer for second convolution calculation to obtain features obtained by the second convolution calculation, wherein the convolution kernel size of the first convolution layer is different from that of the second convolution layer;
and inputting the features obtained by the second convolution calculation into a second pooling layer to perform second pooling calculation, obtaining features obtained by the second pooling calculation, and taking the features obtained by the second pooling calculation as compression features.
4. The method of claim 1, wherein inputting the compressed features into a second predetermined network model to extract fine features and obtain a plurality of fine features comprises:
inputting the compressed features into at least one perceptron to carry out a preset operation to obtain a first fine feature;
inputting the first fine feature into a first convolution block for operation to obtain a second fine feature, wherein the first convolution block comprises at least one convolution layer;
and inputting the second fine feature into a second convolution block for operation to obtain a third fine feature, wherein the second convolution block comprises at least one convolution layer.
5. The method of claim 4, wherein inputting the compressed features into at least one perceptron for a predetermined operation to obtain a first fine feature comprises:
inputting the compressed features into a first perceptron for operation to obtain a first perceptual feature;
inputting the first perceptual feature into a second perceptron for operation to obtain a second perceptual feature;
and inputting the second perceptual feature into a third perceptron for operation to obtain a third perceptual feature, and taking the third perceptual feature as the first fine feature.
6. The method of claim 4, wherein inputting the first fine feature into a first convolution block for operation to obtain a second fine feature comprises:
inputting the first fine feature into a third convolution layer for convolution operation to obtain a feature obtained by third convolution calculation;
inputting the features obtained by the third convolution calculation into a fourth convolution layer for convolution operation to obtain features obtained by the fourth convolution calculation, and taking the features obtained by the fourth convolution calculation as second fine features, wherein the convolution kernel size of the third convolution layer is different from the convolution kernel size of the fourth convolution layer.
7. The method of claim 4, wherein inputting the second fine feature into a second convolution block for operation to obtain a third fine feature comprises:
inputting the second fine feature into a fifth convolution layer for convolution operation to obtain a feature obtained by fifth convolution calculation;
inputting the features obtained by the fifth convolution calculation into a sixth convolution layer for convolution operation to obtain features obtained by the sixth convolution calculation, and taking the features obtained by the sixth convolution calculation as third fine features, wherein the step size of the fifth convolution layer is different from that of the sixth convolution layer.
8. The method as claimed in any one of claims 1 to 7, wherein before inputting the image containing the human face into the first predetermined network model to extract the compressed features and obtain the compressed features, the method further comprises:
sequentially connecting a first preset network, a second preset network, a feature fusion network and a regional candidate network to obtain a face detection network;
training the face detection network by using training samples in a training data set to obtain a face detection network model, wherein the face detection network model comprises: the first preset network model, the second preset network model, the feature fusion network model and the regional candidate network model.
9. A face detection apparatus, characterized in that the face detection apparatus comprises:
the compression feature obtaining module is used for inputting an image containing a human face into a first preset network model for compression to obtain compression features;
the fine feature obtaining module is used for inputting the compressed features into a second preset network model to extract fine features, and obtaining a plurality of fine features;
the fusion feature obtaining module is used for inputting the plurality of fine features into a feature fusion network model for fusion to obtain a plurality of fusion features;
a frame and probability obtaining module, configured to input the fusion features into a regional candidate network model to extract candidate frames, and obtain multiple candidate frames and probabilities of the candidate frames;
a face detection result module, configured to use a candidate frame corresponding to the probability of the candidate frame as a face detection result;
wherein the feature fusion network model comprises: a seventh, eighth, ninth, tenth, eleventh, and twelfth convolutional layers; the output end of the seventh convolutional layer is connected with the eighth convolutional layer, the output end of the ninth convolutional layer is connected with the tenth convolutional layer, the output end of the eleventh convolutional layer is connected with the twelfth convolutional layer, and the output end of the eighth convolutional layer, the output end of the tenth convolutional layer and the output end of the twelfth convolutional layer are connected with the area candidate network model; the seventh convolutional layer for receiving the first fine feature, the ninth convolutional layer for receiving the second fine feature, the eleventh convolutional layer for receiving the third fine feature, the eighth convolutional layer and the tenth convolutional layer for outputting the plurality of fused features;
the feature fusion network model further comprises a first upsampling layer; the input end of the first upsampling layer is connected with the eleventh convolution layer, and the output end of the first upsampling layer is connected with the tenth convolution layer; the tenth convolutional layer is used for fusing the second fine feature output by the ninth convolutional layer with the third fine feature output by the first upsampling layer; the feature fusion network model further comprises a second upsampling layer; the input end of the second upsampling layer is connected with the ninth convolution layer, the output end of the second upsampling layer is connected with the eighth convolution layer, and the eighth convolution layer is used for fusing the first fine feature output by the seventh convolution layer and the second fine feature output by the second upsampling layer.
10. The face detection apparatus as claimed in claim 9, further comprising:
the detection network obtaining module is used for sequentially connecting the first preset network, the second preset network, the feature fusion network and the regional candidate network to obtain a face detection network;
a network model obtaining module, configured to train the face detection network by using training samples in a training data set to obtain a face detection network model, where the face detection network model includes: the first preset network model, the second preset network model, the feature fusion network model and the regional candidate network model.
CN201811382039.9A 2018-11-20 2018-11-20 Face detection method and device Active CN109598212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811382039.9A CN109598212B (en) 2018-11-20 2018-11-20 Face detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811382039.9A CN109598212B (en) 2018-11-20 2018-11-20 Face detection method and device

Publications (2)

Publication Number Publication Date
CN109598212A CN109598212A (en) 2019-04-09
CN109598212B true CN109598212B (en) 2020-11-24

Family

ID=65960112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811382039.9A Active CN109598212B (en) 2018-11-20 2018-11-20 Face detection method and device

Country Status (1)

Country Link
CN (1) CN109598212B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182384A (en) * 2017-12-07 2018-06-19 浙江大华技术股份有限公司 A kind of man face characteristic point positioning method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295502B (en) * 2016-07-25 2019-07-12 厦门中控智慧信息技术有限公司 A kind of method for detecting human face and device
CN107871106B (en) * 2016-09-26 2021-07-06 北京眼神科技有限公司 Face detection method and device
CN106339702A (en) * 2016-11-03 2017-01-18 北京星宇联合投资管理有限公司 Multi-feature fusion based face identification method
CN106599837A (en) * 2016-12-13 2017-04-26 北京智慧眼科技股份有限公司 Face identification method and device based on multi-image input
CN107423707A (en) * 2017-07-25 2017-12-01 深圳帕罗人工智能科技有限公司 A kind of face Emotion identification method based under complex environment
CN107247949B (en) * 2017-08-02 2020-06-19 智慧眼科技股份有限公司 Face recognition method and device based on deep learning and electronic equipment
CN107527029A (en) * 2017-08-18 2017-12-29 卫晨 A kind of improved Faster R CNN method for detecting human face

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182384A (en) * 2017-12-07 2018-06-19 浙江大华技术股份有限公司 A kind of man face characteristic point positioning method and device

Also Published As

Publication number Publication date
CN109598212A (en) 2019-04-09

Similar Documents

Publication Publication Date Title
US10402680B2 (en) Methods and apparatus for image salient object detection
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
WO2022237139A1 (en) Lanesegnet-based lane line detection method and system
CN112597941A (en) Face recognition method and device and electronic equipment
CN112308866B (en) Image processing method, device, electronic equipment and storage medium
CN110674704A (en) Crowd density estimation method and device based on multi-scale expansion convolutional network
CN112419202B (en) Automatic wild animal image recognition system based on big data and deep learning
CN114581388A (en) Contact net part defect detection method and device
CN113077419A (en) Information processing method and device for hip joint CT image recognition
CN112200115B (en) Face recognition training method, recognition method, device, equipment and storage medium
CN111414913B (en) Character recognition method, recognition device and electronic equipment
CN114266894A (en) Image segmentation method and device, electronic equipment and storage medium
CN113592807A (en) Training method, image quality determination method and device, and electronic equipment
CN113449734A (en) Thinking guide image recognition and analysis reconstruction method and device
CN115545166A (en) Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof
CN110008949B (en) Image target detection method, system, device and storage medium
WO2024174726A1 (en) Handwritten and printed text detection method and device based on deep learning
CN113657370B (en) Character recognition method and related equipment thereof
CN115131803A (en) Document word size identification method and device, computer equipment and storage medium
CN109598212B (en) Face detection method and device
CN113657369A (en) Character recognition method and related equipment thereof
CN112836510A (en) Product picture character recognition method and system
CN116127386B (en) Sample classification method, device, equipment and computer readable storage medium
CN111179337B (en) Method, device, computer equipment and storage medium for measuring spatial linear orientation
CN111626298A (en) Real-time image semantic segmentation device and segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant