CN112560825B - Face detection method and device, electronic equipment and readable storage medium - Google Patents

Face detection method and device, electronic equipment and readable storage medium

Info

Publication number
CN112560825B
Authority
CN
China
Prior art keywords
prediction
feature
threshold
feature maps
processing
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110202066.9A
Other languages
Chinese (zh)
Other versions
CN112560825A (en)
Inventor
罗伯特·罗恩思
赵磊
马原
Current Assignee
Beijing Pengsi Technology Co ltd
Original Assignee
Beijing Pengsi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Pengsi Technology Co ltd filed Critical Beijing Pengsi Technology Co ltd
Priority to CN202110202066.9A
Priority to CN202110762221.2A (CN113688663A)
Publication of CN112560825A
Application granted
Publication of CN112560825B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure discloses a face detection method, a face detection device, electronic equipment and a readable storage medium. The face detection method comprises the following steps: processing the face image data through a trunk convolutional neural network, wherein the trunk convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map; processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps; determining a plurality of prediction boxes based on the plurality of second feature maps; obtaining a first threshold value indicating confidence under flexible maximum operation; converting the first threshold value into a second threshold value which indicates confidence level under addition and subtraction operation; determining a prediction result based on the prediction box and the second threshold. By converting the threshold comparison of the flexible maximum value into the threshold comparison of addition and subtraction operation, the calculation amount is greatly saved, the processing efficiency is improved, and the difficulty in deploying the model at the terminal is also reduced.

Description

Face detection method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of face recognition technologies, and in particular, to a face detection method, an apparatus, an electronic device, and a readable storage medium.
Background
With the continuous development of identity verification technology and intelligent image detection and recognition technology, face recognition technology has matured, face recognition applications are moving to terminals and the web, and the reduction of input requirements makes human-computer interaction more convenient. In general, face recognition technology includes face detection, face key point localization, face feature extraction and face attribute analysis. The inventor finds that existing face detection algorithms are computationally expensive and time-consuming, which makes them difficult to deploy on a terminal (such as an access control device).
Disclosure of Invention
In order to solve the problems in the related art, embodiments of the present disclosure provide a face detection method, an apparatus, an electronic device, and a readable storage medium.
In a first aspect, a face detection method is provided in an embodiment of the present disclosure.
Specifically, the face detection method includes:
processing the face image data through a trunk convolutional neural network, wherein the trunk convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
determining a plurality of prediction boxes based on the plurality of second feature maps;
obtaining a first threshold value indicating confidence under flexible maximum operation;
converting the first threshold value into a second threshold value which indicates confidence level under addition and subtraction operation;
determining a prediction result based on the prediction box and the second threshold.
With reference to the first aspect, in a first implementation manner of the first aspect, the disclosure converts the first threshold value into the second threshold value by the following formula:
t2 = ln(t1 / (1 - t1))
where t1 is the first threshold and t2 is the second threshold.
With reference to the first aspect, the present disclosure provides in a second implementation manner of the first aspect, the trunk convolutional neural network includes a plurality of normal convolutional layers and a plurality of depth separable convolutional layers, which are alternately arranged.
With reference to the first aspect, or any one of the first or second implementation manners of the first aspect, in a third implementation manner of the first aspect, the processing, by the feature fusion network, the plurality of first feature maps to obtain a plurality of second feature maps includes:
processing the plurality of first feature maps through a first fusion sub-network, so as to fuse features among the first feature maps to obtain a plurality of third feature maps;
and respectively processing the plurality of third feature maps through a second fusion sub-network to obtain a plurality of second feature maps.
With reference to the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the plurality of first feature maps includes at least a feature map C1 and a feature map C2, a size of the feature map C1 is larger than a size of the feature map C2, and the processing the plurality of first feature maps through the first fusion sub-network to obtain a plurality of third feature maps includes:
processing the feature maps C1 and C2 through 1 × 1 convolutional layers respectively to obtain feature maps M1 and P2 with the same number of channels;
upsampling P2 to obtain a feature map M2_up, and adding M2_up to M1 to obtain a feature map M1_add;
processing M1_add through a 3 × 3 convolutional layer to obtain a feature map P1,
wherein P1 and P2 are third feature maps, and the upsampling is implemented using a nearest neighbor interpolation algorithm.
With reference to the third implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the processing the plurality of third feature maps through the second merging sub-network respectively to obtain a plurality of second feature maps includes performing the following operations on each third feature map P:
processing P through a convolutional layer with a first number of output channels to obtain a feature map S1;
processing P through a convolutional layer with a second number of output channels to obtain a feature map T;
processing T through two convolution branches respectively to obtain feature maps S2 and S3, the numbers of channels of S2 and S3 both being the second number of output channels;
superimposing S1, S2 and S3 by channel to obtain a second feature map F having a predetermined number of channels.
With reference to the first aspect and any one of the first to fifth implementation manners of the first aspect, in a sixth implementation manner of the first aspect, the determining a plurality of prediction blocks based on the plurality of second feature maps includes:
determining anchor point positions based on pixel points in the second feature map;
determining the size of an anchor point frame based on the size of the second feature map, wherein the size of the second feature map and the size of the anchor point frame have a negative correlation relationship;
determining the density of prediction frames generated by each anchor point position based on the down-sampling multiplying power of the second feature map and the size of the anchor point frame;
determining a plurality of prediction frames based on the anchor locations, the size of the anchor frames, and the density of prediction frames generated by the respective anchor locations, each prediction frame including the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample and a second confidence that the prediction box is a negative sample.
With reference to the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the determining a prediction result based on the prediction block and the second threshold includes:
determining a prediction box with a prediction score greater than a second threshold based on a difference between the first confidence level and the second confidence level;
sorting the prediction frames with prediction scores larger than the second threshold according to the prediction scores through binary tree insertion and an in-order traversal sorting algorithm to obtain a sorting result;
and according to the sorting result, processing the prediction frame with the prediction score larger than a second threshold value through non-maximum value inhibition so as to filter repeated prediction frames to obtain a prediction result.
With reference to the first aspect and any one of the first to seventh implementation manners of the first aspect, in an eighth implementation manner of the first aspect, the method further includes:
obtaining a sample image;
mapping the brightness of the sample image to a specific interval to construct an augmented image;
and training a face detection model comprising the trunk convolutional neural network and the feature fusion network based on the sample image and the augmented image.
In a second aspect, a face detection apparatus is provided in the embodiments of the present disclosure.
Specifically, the face detection device includes:
the feature extraction module is configured to process the face image data through a trunk convolutional neural network, wherein the trunk convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
the feature fusion module is configured to process the first feature maps through a feature fusion network to obtain second feature maps;
a prediction box determination module configured to determine a plurality of prediction boxes based on the plurality of second feature maps;
a threshold acquisition module configured to acquire a first threshold indicating a confidence level under a flexible maximum operation;
a threshold conversion module configured to convert the first threshold into a second threshold indicating a confidence level under addition and subtraction operations;
a result determination module configured to determine a predicted result based on the prediction box and the second threshold.
In a third aspect, the present disclosure provides an electronic device, including a memory and a processor, where the memory is configured to store one or more computer instructions, where the one or more computer instructions are executed by the processor to implement the method according to the first aspect, and any one of the first to eighth implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, implement the method according to any one of the first aspect, the first to the eighth implementation manners of the first aspect.
According to the technical scheme provided by the embodiment of the disclosure, the threshold comparison of the flexible maximum value is converted into the threshold comparison of addition and subtraction operation, so that the calculation amount is greatly saved, the processing efficiency is improved, and the difficulty in deploying the model at the terminal is also reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. The following is a description of the drawings.
Fig. 1 shows a flow chart of a face detection method according to an embodiment of the present disclosure.
Fig. 2 shows a flowchart for processing a plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps according to an embodiment of the present disclosure.
Fig. 3 shows a flow chart for processing a plurality of said first feature maps by a first convergence subnetwork to obtain a plurality of third feature maps according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of processing the plurality of third feature maps by the second convergence sub-network to obtain a plurality of second feature maps, respectively, according to an embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a second converged sub-network according to an embodiment of the present disclosure.
Fig. 6 illustrates a flow chart for determining a plurality of prediction blocks based on the plurality of second feature maps according to an embodiment of the present disclosure.
FIGS. 7A-7D are schematic diagrams illustrating anchor point frame expansion according to an embodiment of the disclosure.
Fig. 8 illustrates a flow chart for determining a prediction result based on the prediction box and the second threshold according to an embodiment of the present disclosure.
FIG. 9 shows a flow diagram of a model training method according to an embodiment of the present disclosure.
Fig. 10 shows a block diagram of a face detection apparatus according to an embodiment of the present disclosure.
FIG. 11 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 12 is a schematic structural diagram of a computer system suitable for implementing the face detection method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a flow chart of a face detection method according to an embodiment of the present disclosure.
As shown in FIG. 1, the method includes operations S110-S160.
In operation S110, processing the face image data through a trunk convolutional neural network, where the trunk convolutional neural network includes a plurality of processing stages, and each processing stage outputs a first feature map;
in operation S120, processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
determining a plurality of prediction boxes based on the plurality of second feature maps in operation S130;
in operation S140, obtaining a first threshold value indicating a confidence level in the flexible maximum operation;
in operation S150, converting the first threshold into a second threshold indicating a confidence level in the addition and subtraction operation;
in operation S160, a prediction result is determined based on the prediction box and the second threshold.
According to the technical scheme of the embodiment of the disclosure, the threshold comparison of the flexible maximum value is converted into the threshold comparison of addition and subtraction operation, so that the calculation amount is greatly saved, the processing efficiency is improved, and the difficulty in deploying the model at the terminal is also reduced.
According to the embodiment of the disclosure, the face image data may be image data of a single image or image data of a video frame. The face image data can be captured by a camera; for example, an access control system can acquire the image data captured by a camera and use a face recognition algorithm to recognize whether the image contains the face of a person with access rights, so as to control the state of the access control.
According to the embodiments of the present disclosure, the backbone convolutional neural network may employ various existing neural network models, for example, for extracting features of an image. The main convolutional neural network is divided into a plurality of processing stages, and a first characteristic diagram is output after the execution of each processing stage is finished. Optionally, the size of the data in the direction from the input layer to the output layer of the trunk convolutional neural network is gradually reduced, so that the size of the first feature map output in the previous processing stage is larger than that of the first feature map output in the later processing stage, thereby forming the feature pyramid.
According to the embodiment of the disclosure, the feature fusion network is used for further mining the features in the plurality of first feature maps to form a plurality of second feature maps. The obtained second feature map is used for generating a plurality of prediction boxes, and each prediction box can comprise the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample and a second confidence that the prediction box is a negative sample.
Generally, the number of prediction frames generated in this step is large, and in order to obtain a final prediction result, a small number of prediction frames satisfying a certain condition need to be screened out from a large number of prediction frames.
According to the general understanding, the sum of the first confidence and the second confidence should be 1, but since the first confidence and the second confidence are generated independently and have no constraint relationship between them, their sum may be greater than 1 or less than 1. For example, if the first confidence is 0.2 and the second confidence is 0.1, it would be incorrect to use the first confidence directly to conclude that the probability of the prediction box being a positive sample is small.
To handle similar problems, some related techniques perform the flexible maximum (softmax) operation on the two confidences. For example, when the first confidence is 0.2 and the second confidence is 0.1, the corrected first confidence is e^0.2 / (e^0.2 + e^0.1) ≈ 0.52 and the corrected second confidence is e^0.1 / (e^0.2 + e^0.1) ≈ 0.48. After all the objects to be distinguished are processed in this way, they can all be measured uniformly against a threshold. The inventor finds that this approach is computationally expensive, and its performance is unsatisfactory when screening a large number of prediction boxes.
The embodiment of the disclosure provides a method for measuring whether a prediction box satisfies a preset condition through a threshold under addition and subtraction operations. For example, assume the first threshold set in the softmax method is 0.6, i.e. data whose corrected first confidence after softmax is greater than 0.6 should be selected. The method of the embodiment of the present disclosure converts this into checking whether the difference between the first confidence and the second confidence is greater than approximately 0.4 (ln(0.6/0.4) ≈ 0.405). For example, when the first confidence is 0.2 and the second confidence is 0.1, the difference is 0.1, which is not greater than 0.4, so the condition is not satisfied; when the first confidence is 0.6 and the second confidence is 0.1, the difference is 0.5, which is greater than 0.4, so the condition is satisfied. In this way, the exponential operation is converted into addition and subtraction, which greatly reduces the amount of computation and makes the algorithm suitable for fast face detection on a terminal.
According to an embodiment of the present disclosure, the first threshold value may be converted into the second threshold value by:
t2 = ln(t1 / (1 - t1))
where t1 is the first threshold and t2 is the second threshold.
The rationality of this method is explained below.
According to the softmax formula:
S1 = e^(a1) / (e^(a0) + e^(a1)),   S0 = e^(a0) / (e^(a0) + e^(a1))
where a1 is the first confidence, a0 is the second confidence, S1 is the corrected first confidence, and S0 is the corrected second confidence.
If S1 is greater than the first threshold t1, then since
S1 = e^(a1) / (e^(a0) + e^(a1)),
it can be obtained that:
e^(a1) / (e^(a0) + e^(a1)) > t1.
The above formula is equivalent to
e^(a1 - a0) > t1 / (1 - t1).
Since the ln function is monotonically increasing, taking the logarithm of both sides preserves the inequality, namely:
a1 - a0 > ln(t1 / (1 - t1)).
When the first threshold t1 is determined, ln(t1 / (1 - t1)) is a constant, and this constant is set to t2, namely:
a1 - a0 > t2.
When the first threshold t1 is fixed, t2 only needs to be calculated once; the first confidence a1 and the second confidence a0 can then be used directly to compute whether the score satisfies the condition.
Compared with computing softmax for every object, the method provided by the embodiment of the disclosure replaces the softmax threshold comparison by computing a single constant, and the two are equivalent in effect. When the input size of the network is fixed, for example, 8208 prediction boxes may be generated; if softmax were computed for each prediction box, 8208 softmax operations would be required, whereas with this scheme the second threshold only needs to be computed once and the large number of softmax operations simplifies into addition and subtraction operations, which greatly reduces the amount of computation in the data post-processing stage and improves detection efficiency.
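As a minimal illustration (not part of the original disclosure; the function names are hypothetical), the threshold conversion and the equivalence of the two comparisons can be sketched in Python as follows:

```python
import math

def softmax_threshold_to_diff_threshold(t1: float) -> float:
    """Convert a softmax confidence threshold t1 into the equivalent
    threshold t2 on the raw score difference a1 - a0."""
    return math.log(t1 / (1.0 - t1))

def is_positive_softmax(a1: float, a0: float, t1: float) -> bool:
    # Original check: the softmax-corrected first confidence exceeds t1.
    s1 = math.exp(a1) / (math.exp(a1) + math.exp(a0))
    return s1 > t1

def is_positive_diff(a1: float, a0: float, t2: float) -> bool:
    # Equivalent check: one subtraction and one comparison per prediction box.
    return (a1 - a0) > t2

t1 = 0.6
t2 = softmax_threshold_to_diff_threshold(t1)  # ln(0.6/0.4) ≈ 0.405
for a1, a0 in [(0.2, 0.1), (0.6, 0.1), (1.5, 0.9)]:
    assert is_positive_softmax(a1, a0, t1) == is_positive_diff(a1, a0, t2)
```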
According to an embodiment of the present disclosure, the trunk convolutional neural network includes a plurality of normal convolutional layers and a plurality of depth separable convolutional layers alternately arranged.
By using depth separable convolutions instead of ordinary convolutions to extract features, the number of parameters can be further reduced while ensuring sufficient network depth, and the amount of computation is reduced accordingly.
Assume a convolution kernel size of Kh × Kw, Cin input channels, Cout output channels, and an output feature map of width W and height H; bias terms are omitted here.
For the standard convolutional layer:
the number of parameters is: Params = Kh × Kw × Cin × Cout;
the number of floating point operations (FLOPs) is: Params × H × W.
For the depth separable convolution:
in the depthwise convolution, each input channel is filtered by its own spatial kernel, so features are extracted without being combined across channels; in the pointwise convolution, output_channels convolution kernels of size 1 × 1 × in_channels are used to combine the features across channels with different (learnable) weights.
The number of parameters therefore changes from the original Kh × Kw × Cin × Cout to Kh × Kw × Cin × 1 + 1 × 1 × Cin × Cout. If Kh = 3, Kw = 3 and Cout = 64, the number of parameters is reduced to roughly 1/8 to 1/9 of the original.
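The two parameter counts can be checked with a short calculation; this snippet is illustrative and not part of the original text:

```python
def conv_params(kh: int, kw: int, c_in: int, c_out: int) -> int:
    # Standard convolution parameters (bias omitted).
    return kh * kw * c_in * c_out

def dw_separable_params(kh: int, kw: int, c_in: int, c_out: int) -> int:
    # Depthwise (kh*kw*c_in) plus pointwise (1*1*c_in*c_out) parameters.
    return kh * kw * c_in + c_in * c_out

kh, kw, c_in, c_out = 3, 3, 32, 64
standard = conv_params(kh, kw, c_in, c_out)           # 18432
separable = dw_separable_params(kh, kw, c_in, c_out)  # 2336
print(separable / standard)  # ≈ 0.127, i.e. roughly 1/8 of the original
```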
According to an embodiment of the present disclosure, the backbone neural network structure may be implemented, for example, in the form shown in Table 1, where Conv denotes an ordinary convolutional layer followed by Batch Normalization (BN) and an activation layer (e.g., ReLU), Conv dw denotes a depth separable convolutional layer followed by Batch Normalization and an activation layer, S2 denotes a stride of 2, S1 denotes a stride of 1, and the padding mode is SAME. C1, C2 and C3 are the outputs of the three branches, namely the three first feature maps.
TABLE 1
Convolution type/step size Convolution kernel size Input size
Conv / S2 3×3×3×8 192×128×3
Conv dw / S1 3×3×8 dw 96×64×8
Conv / S1 1×1×8×16 96×64×8
Conv dw / S2 3×3×16 dw 96×64×16
Conv / S1 1×1×16×32 48×32×16
Conv dw / S1 3×3×32 dw 48×32×32
Conv / S1 1×1×32×32 48×32×32
Conv dw / S2 3×3×32 dw 48×32×32
Conv / S1 1×1×32×64 24×16×32
Conv dw / S1 3×3×64 dw 24×16×64
Conv / S1 (C1) 1×1×64×64 24×16×64
Conv dw / S2 3×3×64 dw 24×16×64
Conv / S1 1×1×64×128 12×8×64
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 1×1×128×128 12×8×128
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 1×1×128×128 12×8×128
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 1×1×128×128 12×8×128
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 (C2) 1×1×128×128 12×8×128
Conv dw / S2 3×3×128 dw 12×8×128
Conv / S1 1×1×128×256 6×4×128
Conv dw / S1 3×3×256 dw 6×4×256
Conv / S1(C3) 1×1×256×256 6×4×256
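The repeated "Conv dw" + "Conv 1 × 1" unit of Table 1 could, for example, be written as follows in a PyTorch-style framework; this is a sketch under that assumption, not the patented implementation:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """One 'Conv dw' + 'Conv 1x1' unit from Table 1: a depthwise 3x3 convolution
    followed by a pointwise 1x1 convolution, each with BN and ReLU."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, stride=stride, padding=1,
                      groups=in_channels, bias=False),           # Conv dw
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),  # Conv 1x1
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Example: the two rows following the stem in Table 1 (8 -> 16 channels at 96x64).
x = torch.randn(1, 8, 96, 64)
y = DepthwiseSeparableBlock(8, 16, stride=1)(x)  # -> shape (1, 16, 96, 64)
```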
According to an embodiment of the present disclosure, the feature fusion network may include, for example, a first fusion sub-network and a second fusion sub-network.
Fig. 2 shows a flowchart for processing a plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps according to an embodiment of the present disclosure.
As shown in fig. 2, operation S120 may include operations S210 and S220.
Processing a plurality of the first feature maps through a first merging sub-network for merging features between the first feature maps to obtain a plurality of third feature maps in operation S210;
in operation S220, the plurality of third feature maps are respectively processed by the second convergence sub-network to obtain a plurality of second feature maps.
Through these two fusions, different receptive fields are incorporated so that faces of different sizes are attended to, which improves the face detection effect.
According to the embodiment of the disclosure, the plurality of first feature maps at least comprise a feature map C1 and a feature map C2, and the size of the feature map C1 is larger than that of the feature map C2. As described above, the first feature maps may include, for example, C1, C2 and C3.
Fig. 3 shows a flow chart for processing a plurality of said first feature maps by a first convergence subnetwork to obtain a plurality of third feature maps according to an embodiment of the present disclosure.
As shown in FIG. 3, operation S210 may include operations S310-S330.
In operation S310, the feature maps C1 and C2 are processed through 1 × 1 convolutional layers respectively, to obtain feature maps M1 and P2 with the same number of channels;
in operation S320, P2 is upsampled to obtain a feature map M2_up, which is added to M1 to obtain a feature map M1_add;
in operation S330, M1_add is processed through a 3 × 3 convolutional layer to obtain a feature map P1, where P1 and P2 are the third feature maps.
According to an embodiment of the present disclosure, the upsampling may be implemented by, for example, a nearest neighbor interpolation algorithm: when upsampling by a factor of two, each value of the small feature map is simply duplicated into a 2 × 2 block of the enlarged feature map. For example (with illustrative values), the 2 × 2 feature map [[1, 2], [3, 4]] becomes [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]] after 2× nearest neighbor upsampling.
through the nearest neighbor value interpolation algorithm, the semantic information (beneficial to classification) of the feature map can be reserved to the maximum extent in the up-sampling process, so that the feature map with rich spatial information (high resolution and beneficial to positioning) corresponding to the up-sampling process is fused, and the feature map with good spatial information and strong semantic information is obtained.
In one embodiment of the present disclosure, the first fusion sub-network may, for example, take the form illustrated in Table 2.
TABLE 2
(Table 2 appears as an image in the original publication; it lists the layers of the first fusion sub-network, whose structure is described below.)
According to the embodiment of the disclosure, after the outputs of the three branches C1, C2 and C3 are obtained from the backbone network, they are sent to the feature fusion network to fuse the top-level features with the bottom-level features; the fusion upsamples the small top-level feature map to the same size as the feature map of the preceding stage.
As shown in Table 2, the C1, C2 and C3 layers are each passed through a 1 × 1 convolution that changes the number of channels of the feature map (all M layers have the same number of channels, for example d = 64 in the embodiment of the present disclosure), giving M1, M2 and M3. M3 is upsampled to obtain M3_up, which is added element-wise to M2 to obtain M2_add. M2_add is upsampled to obtain M2_up, which is added element-wise to M1 to obtain M1_add. The M1_add and M2_add feature maps are then each passed through a 3 × 3 convolution (to reduce the aliasing caused by nearest neighbor interpolation, in which neighbouring values are identical), giving the final P1, P2 and P3 layer features, namely the third feature maps.
It can be understood that C2, C3, M2, P3 (M3), M3_up, M2_add and P2 in the embodiment illustrated in Table 2 correspond respectively to C1, C2, M1, P2, M2_up, M1_add and P1 described in operations S310 to S330. The reason is that the embodiment illustrated in Table 2 has three first feature maps, which shifts the numbering.
According to the embodiment of the disclosure, after the feature fusion layer, the P1, P2 and P3 are further fused with the information of different receptive fields through a second fusion sub-network.
Fig. 4 shows a flowchart of processing the plurality of third feature maps by the second convergence sub-network to obtain a plurality of second feature maps, respectively, according to an embodiment of the present disclosure.
As shown in FIG. 4, operation S220 may include operations S410-S440.
In operation S410, P is processed through a convolutional layer with a first number of output channels to obtain a feature map S1;
in operation S420, P is processed through a convolutional layer with a second number of output channels to obtain a feature map T;
in operation S430, T is processed through two convolution branches respectively to obtain feature maps S2 and S3, the numbers of channels of S2 and S3 both being the second number of output channels;
in operation S440, S1, S2 and S3 are stacked by channel to obtain a second feature map F having a predetermined number of channels.
The following description is made with reference to the second converged sub-network illustrated in fig. 5.
Fig. 5 shows a schematic structural diagram of a second converged sub-network according to an embodiment of the present disclosure.
As shown in fig. 5, the second fusion sub-network may be a three-branch network. For any third feature map, for example P1: first, an ordinary convolution with a 3 × 3 kernel and 32 output channels followed by batch normalization is applied to obtain S1. P1 is also passed through a convolution with a 3 × 3 kernel and 16 output channels, a batch normalization layer and a ReLU activation layer to obtain T2; T2 is passed through a convolutional layer with a 3 × 3 kernel and 16 output channels and a batch normalization layer to obtain S2. T2 is passed through another convolutional layer with a 3 × 3 kernel and 16 output channels, a batch normalization layer and a ReLU activation layer to obtain T3, and T3 is passed through a convolutional layer with a 3 × 3 kernel and 16 output channels and a batch normalization layer to obtain S3. Finally, S1, S2 and S3 are concatenated along the channel dimension and passed through a ReLU activation layer to obtain the second feature map F1, which is still a 64-channel feature map. Similarly, P2 and P3 may be processed in the same way to give F2 and F3, respectively.
Similarly, P1 and T2 in the above illustrated embodiment correspond to P and T, respectively, described in operations S410-S440.
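The three-branch module of fig. 5 could be sketched as below, again assuming a PyTorch-style implementation (illustrative only):

```python
import torch
import torch.nn as nn

class SecondFusionSubnetwork(nn.Module):
    """Three-branch module: a 64-channel P becomes a 32 + 16 + 16 = 64-channel F."""
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.branch1 = nn.Sequential(  # P -> S1 (32 channels, no ReLU before concat)
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32))
        self.stem1 = nn.Sequential(    # P -> T2 (16 channels)
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
        self.branch2 = nn.Sequential(  # T2 -> S2
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16))
        self.stem2 = nn.Sequential(    # T2 -> T3
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
        self.branch3 = nn.Sequential(  # T3 -> S3
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16))

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        s1 = self.branch1(p)
        t2 = self.stem1(p)
        s2 = self.branch2(t2)
        s3 = self.branch3(self.stem2(t2))
        return torch.relu(torch.cat([s1, s2, s3], dim=1))  # second feature map F

f1 = SecondFusionSubnetwork()(torch.randn(1, 64, 24, 16))  # shape (1, 64, 24, 16)
```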
The inventor finds that existing detection algorithms perform poorly on smaller faces. Therefore, the embodiment of the disclosure provides a scheme for increasing the number of anchor frames at smaller scales when generating prediction frames.
Fig. 6 illustrates a flow chart for determining a plurality of prediction blocks based on the plurality of second feature maps according to an embodiment of the present disclosure.
As shown in FIG. 6, operation S130 may include operations S610-S640.
In operation S610, determining an anchor point position based on a pixel point in the second feature map;
determining a size of an anchor frame based on the size of the second feature map, wherein the size of the second feature map and the size of the anchor frame have a negative correlation relationship in operation S620;
determining a density of prediction frames generated at respective anchor positions based on the down-sampling magnification of the second feature map and the sizes of the anchor frames in operation S630;
in operation S640, a plurality of prediction frames are determined based on the anchor positions, the sizes of the anchor frames, and the densities of the prediction frames generated by the respective anchor positions.
According to the embodiment of the present disclosure, the downsampling magnifications between the sizes of the second feature maps F1, F2, F3 and the original input image are, for example, 8, 16 and 32, respectively. The more convolutional layers the features pass through, the higher the downsampling magnification and the larger the corresponding receptive field. Embodiments of the present disclosure may define anchor blocks of a variety of sizes, for example 16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256 and 512 × 512. Anchor blocks of smaller sizes are used for prediction on the second feature map with the smaller downsampling magnification (the second feature map of larger size), and anchor blocks of larger sizes are used for prediction on the second feature map with the larger downsampling magnification (the second feature map of smaller size). For example, anchor blocks of the two sizes 16 × 16 and 32 × 32 are used at F1, anchor blocks of the two sizes 64 × 64 and 128 × 128 are used at F2, and anchor blocks of the two sizes 256 × 256 and 512 × 512 are used at F3.
Typically, each anchor location predicts a set of anchor blocks of different aspect ratios and different sizes, e.g., 3 × 2=6 anchor blocks per anchor location with aspect ratios of 1:1, 1:1.5 and 1.5:1, and sizes of 16 × 16 and 32 × 32. In the disclosed embodiment, the aspect ratio is fixed to 1:1, with 16 × 16 and 32 × 32 anchor blocks predicted per anchor position for feature map F1, and 256 × 256 and 512 × 512 anchor blocks predicted per anchor position for feature map F3.
Obviously, in this case, anchor blocks of smaller size are sparse relative to anchor blocks of larger size. This is an important reason why small targets are detected poorly. To solve this problem, the embodiment of the present disclosure proposes to increase the anchor block density of part of the feature maps. Increasing the anchor block density means that, at each anchor position, multiple anchor blocks are predicted for each combination of aspect ratio and anchor block size. For example, an anchor block with insufficient density may be densified by offsetting copies of it around its center, as shown in figs. 7A-7D.
It can be defined that the anchor frame density of 16 × 16 size is increased by 4 times, the anchor frame density of 32 × 32 size is increased by 2 times, and the anchor frame density of 64 × 64 size is increased by 2 times, so that the F1 feature layer yields 4 × 4+2 × 2=20 prediction offsets per pixel position (16 anchor frames of 16 × 16 and 4 anchor frames of 32 × 32), the F2 feature layer predicts 2 × 2+1 × 1=5 prediction offsets per pixel position (4 anchor frames of 64 × 64 and 1 anchor frame of 128 × 128), and the F3 feature layer predicts 1 × 1+1 × 1=2 prediction offsets per pixel position (1 anchor frame of 256 × 256 and 1 anchor frame of 512 × 512).
Table 3 illustrates the selection of anchor boxes of different sizes, the corresponding density of each anchor box, and the ratio of the step size of the anchor box divided by the size of the anchor box in accordance with an embodiment of the present disclosure.
TABLE 3
Anchor frame size Down sampling multiplying power Anchor frame density Anchor frame step size Ratio of
16×16 8 N=4 2=8/4 2/16=1/8
32×32 8 N=2 4=8/2 4/32=1/8
64×64 16 N=2 8=16/2 8/64=1/8
128×128 16 N=1 16=16/1 16/128=1/8
256×256 32 N=1 32=32/1 32/256=1/8
512×512 32 N=1 32=32/1 32/512=1/16
As shown in table 3, the step size of the anchor box is defined as the magnification of the down-sampling divided by the density of the anchor box, and the ratio of the step size of the anchor box divided by the size of the anchor box is defined as a fixed value (e.g., 1/8, where the ratio is determined to be 1/16 in the case of an anchor box size of 512 × 512 to ensure that the density of the anchor box is not lower than 1).
That is, the F1 feature map has a size of 24 × 16, and each pixel is responsible for predicting 4 × 4 anchor blocks of size 16 and 2 × 2 anchor blocks of size 32, resulting in 20 prediction blocks per pixel; the size of the F2 feature map is 12 x 8, each pixel is responsible for predicting 2 x 2 anchor blocks with the size of 64 and 1 x 1 anchor blocks with the size of 128, and each pixel obtains 5 prediction blocks; the F3 feature map is 6 × 4 in size, and each pixel is responsible for predicting 1 × 1 anchor block of size 256 and 1 × 1 anchor block of size 512, resulting in 2 prediction blocks per pixel. A total of 24 × 16 × 20 + 12 × 8 × 5 + 6 × 4 × 2= 8208 prediction boxes are generated.
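The per-pixel counts and the 8208 total can be verified with a short (illustrative) calculation:

```python
# (width, height, prediction boxes per pixel) for the F1, F2 and F3 feature maps
feature_layers = [
    (24, 16, 4 * 4 + 2 * 2),  # F1: sixteen 16x16 anchors + four 32x32 anchors
    (12, 8, 2 * 2 + 1 * 1),   # F2: four 64x64 anchors + one 128x128 anchor
    (6, 4, 1 * 1 + 1 * 1),    # F3: one 256x256 anchor + one 512x512 anchor
]
total = sum(w * h * n for w, h, n in feature_layers)
print(total)  # 8208
```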
According to an embodiment of the present disclosure, each of the prediction boxes includes the following information: the position and size of the prediction box, the first confidence that the prediction box is a positive sample and the second confidence that the prediction box is a negative sample, which may be expressed, for example, as (x, y, w, h, a1, a0), where x and y are the offsets of the prediction box relative to the anchor point, w and h are the size variations of the prediction box relative to the anchor box, and a1 and a0 are the first confidence that the prediction box is a positive sample and the second confidence that it is a negative sample; these data are generated based on the second feature maps.
The method of the embodiment of the disclosure increases the density of the anchor point frame for predicting the smaller face, and improves the detection rate of the smaller face.
Fig. 8 illustrates a flow chart for determining a prediction result based on the prediction box and the second threshold according to an embodiment of the present disclosure.
As shown in FIG. 8, operation S160 may include operations S810-S830.
Determining a prediction box with a prediction score greater than a second threshold based on a difference between the first confidence and the second confidence in operation S810;
in operation S820, sorting the prediction boxes with prediction scores greater than the second threshold according to the prediction scores through binary tree insertion and an in-order traversal sorting algorithm to obtain a sorting result;
in operation S830, the prediction boxes with the prediction scores greater than the second threshold are processed by non-maximum suppression according to the sorting result to filter the repeated prediction boxes, so as to obtain a prediction result.
According to the embodiment of the present disclosure, after the prediction results of the F1, F2, and F3 feature layers are obtained, the post-processing stage of the prediction boxes is entered, for example, the prediction boxes with prediction scores lower than 0.3 may be considered as negative samples by using the above-mentioned threshold conversion method, the prediction boxes with prediction scores higher than 0.3 are sorted in descending order according to the prediction scores, and then NMS (non-maximum suppression) is performed to further filter the prediction boxes with IOU (area of overlapping region divided by area of union) larger than 0.5 as duplicates, so as to obtain the prediction results. Optionally, the prediction box which is retained and has a score greater than another threshold (such as 0.8 or 0.5) defined by the system may be further output as the prediction result.
When filtering duplicate prediction boxes using NMS (non-maximum suppression), the boxes need to be sorted by score so that the prediction box with the highest confidence is kept and duplicate prediction boxes with lower scores are suppressed. The time complexity of a conventional sorting algorithm is O(n^2). In the embodiment of the present disclosure, binary tree insertion and binary tree in-order traversal may be adopted, so that each insertion costs O(log2 n) and the traversal costs O(n), further improving the processing efficiency.
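A compact sketch of this post-processing stage (score filtering with the converted threshold, sorting, then NMS) is shown below; a standard library sort is used here in place of the binary-tree insertion described above, and the helper names are hypothetical:

```python
def iou(b1, b2):
    # Boxes are (x1, y1, x2, y2); IoU = overlap area divided by union area.
    xa, ya = max(b1[0], b2[0]), max(b1[1], b2[1])
    xb, yb = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (area1 + area2 - inter + 1e-9)

def post_process(boxes, a1, a0, t2, iou_thresh=0.5):
    # Keep boxes whose score difference a1 - a0 exceeds the converted threshold t2.
    scored = [(a1[i] - a0[i], boxes[i])
              for i in range(len(boxes)) if a1[i] - a0[i] > t2]
    scored.sort(key=lambda item: item[0], reverse=True)  # descending by score
    kept = []
    for score, box in scored:  # greedy non-maximum suppression
        if all(iou(box, k) <= iou_thresh for _, k in kept):
            kept.append((score, box))
    return kept
```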
In addition, the inventor finds that the missing rate and the false alarm rate of the human face are high in the existing human face detection algorithm under complex scenes such as backlight, dim light and wearing a mask. Therefore, the embodiment of the present disclosure provides a method for augmenting a training sample to alleviate the above problem, and improve the face detection rate in a complex scene.
FIG. 9 shows a flow diagram of a model training method according to an embodiment of the present disclosure.
As shown in fig. 9, the method may further include operations S910 to S930 based on any of the embodiments illustrated in fig. 1 to 8.
In operation S910, a sample image is obtained;
mapping the brightness of the sample image to a specific section to construct an augmented image in operation S920;
in operation S930, a face detection model including the trunk convolutional neural network and the feature fusion network is trained based on the sample image and the augmented image.
According to the embodiment of the present disclosure, the RGB data of the input video frame or image may be preprocessed, for example, the luminance values of each pixel in the image may be processed by linear stretching:
y = (x - min(x)) × (dmax - dmin) / (max(x) - min(x) + 1.0) + dmin
where x represents the input image, min(x) represents the minimum pixel intensity in the image, max(x) represents the maximum pixel intensity in the image, and dmax and dmin represent respectively the maximum and minimum values of the target interval to which the image is mapped.
In consideration of data enhancement of backlight conditions, the brightness of the training image is adjusted by using a maximum and minimum linear stretching mode, and the brightness of the training data is randomly mapped to a specific interval, such as 155-255, so that a certain amount of data under an overexposure or backlight scene appears in the training sample, and the trained model can have better generalization capability on data with strong backlight and illumination.
Similarly, the luminance can also be mapped to a smaller interval of values to enhance the generalization ability of the model to data of dark scenes.
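A small NumPy sketch of this brightness augmentation follows; the random sampling of the interval bounds is an assumption made for illustration:

```python
import numpy as np

def stretch_brightness(image: np.ndarray, d_min: float, d_max: float) -> np.ndarray:
    """Linearly map the pixel intensities of `image` into [d_min, d_max]."""
    x = image.astype(np.float32)
    return (x - x.min()) * (d_max - d_min) / (x.max() - x.min() + 1.0) + d_min

def augment_brightness(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Randomly map brightness either into a bright interval (overexposure /
    # backlight) or into a dark interval (dim scenes); bounds are illustrative.
    if rng.random() < 0.5:
        return stretch_brightness(image, rng.uniform(155.0, 200.0), 255.0)
    return stretch_brightness(image, 0.0, rng.uniform(60.0, 100.0))
```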
The embodiment of the disclosure provides a face detection method based on a depth separable convolutional neural network. RGB data of a video frame or an image is input through a human-computer interface, normalized in preprocessing, and fed into the trunk convolutional neural network to obtain preliminary feature maps of three different stages. The feature maps of the different stages are then input into the feature fusion layers for feature fusion, and three fused feature maps are output; a three-branch convolutional neural network further extracts features from each feature layer, and each feature layer is used for predicting prediction boxes of different sizes. Finally, negative samples and prediction boxes with high overlap are filtered out through post-processing, and prediction boxes with confidence scores higher than a threshold are output. The method reduces the amount of computation and improves the detection capability for smaller faces.
Fig. 10 shows a block diagram of a face detection apparatus according to an embodiment of the present disclosure. The apparatus 1000 may be implemented as part or all of an electronic device through software, hardware, or a combination of both.
As shown in fig. 10, the face detection apparatus 1000 includes a feature extraction module 1010, a feature fusion module 1020, a prediction box determination module 1030, a threshold acquisition module 1040, a threshold conversion module 1050, and a result determination module 1060.
A feature extraction module 1010 configured to process face image data by a trunk convolutional neural network, wherein the trunk convolutional neural network includes a plurality of processing stages, and each processing stage outputs a first feature map;
a feature fusion module 1020 configured to process the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
a prediction box determination module 1030 configured to determine a plurality of prediction boxes based on the plurality of second feature maps;
a threshold acquisition module 1040 configured to acquire a first threshold indicating a confidence level under the flexible maximum operation;
a threshold conversion module 1050 configured to convert the first threshold into a second threshold indicating a confidence level under addition and subtraction operations;
a result determination module 1060 configured to determine a predicted result based on the prediction box and the second threshold.
According to an embodiment of the present disclosure, the first threshold value is converted into a second threshold value by:
t2 = ln(t1 / (1 - t1))
where t1 is the first threshold and t2 is the second threshold.
According to an embodiment of the present disclosure, the trunk convolutional neural network includes a plurality of normal convolutional layers and a plurality of depth separable convolutional layers alternately arranged.
According to an embodiment of the present disclosure, the processing the plurality of first feature maps through the feature fusion network to obtain a plurality of second feature maps includes:
processing the plurality of first feature maps through a first fusion sub-network, so as to fuse features among the first feature maps to obtain a plurality of third feature maps;
and respectively processing the plurality of third feature maps through a second fusion sub-network to obtain a plurality of second feature maps.
According to an embodiment of the present disclosure, the plurality of first feature maps includes at least feature map C1 and feature map C2, the size of feature map C1 is larger than the size of feature map C2, and the processing the plurality of first feature maps through the first merging subnetwork to obtain a plurality of third feature maps includes:
processing the feature maps C1 and C2 through 1 × 1 convolutional layers respectively to obtain feature maps M1 and P2 with the same number of channels;
upsampling P2 to obtain a feature map M2_up, and adding M2_up to M1 to obtain a feature map M1_add;
processing M1_add through a 3 × 3 convolutional layer to obtain a feature map P1,
wherein P1 and P2 are third feature maps, and the upsampling is implemented using a nearest neighbor interpolation algorithm.
According to an embodiment of the present disclosure, the processing the plurality of third feature maps by the second merging sub-network, respectively, to obtain a plurality of second feature maps includes performing the following operations for each third feature map P:
processing P through a convolutional layer with a first number of output channels to obtain a feature map S1;
processing P through a convolutional layer with a second number of output channels to obtain a feature map T;
processing T through two convolution branches respectively to obtain feature maps S2 and S3, the numbers of channels of S2 and S3 both being the second number of output channels;
superimposing S1, S2 and S3 by channel to obtain a second feature map F having a predetermined number of channels.
According to an embodiment of the present disclosure, the determining a plurality of prediction boxes based on the plurality of second feature maps includes:
determining anchor point positions based on pixel points in the second feature map;
determining the size of an anchor point frame based on the size of the second feature map, wherein the size of the second feature map and the size of the anchor point frame have a negative correlation relationship;
determining the density of prediction frames generated by each anchor point position based on the down-sampling multiplying power of the second feature map and the size of the anchor point frame;
determining a plurality of prediction frames based on the anchor locations, the size of the anchor frames, and the density of prediction frames generated by the respective anchor locations, each prediction frame including the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample and a second confidence that the prediction box is a negative sample.
According to an embodiment of the present disclosure, the determining a prediction result based on the prediction box and the second threshold includes:
determining a prediction box with a prediction score greater than a second threshold based on a difference between the first confidence level and the second confidence level;
sorting the prediction frames with prediction scores larger than the second threshold according to the prediction scores through binary tree insertion and an in-order traversal sorting algorithm to obtain a sorting result;
and according to the sorting result, processing the prediction frame with the prediction score larger than a second threshold value through non-maximum value inhibition so as to filter repeated prediction frames to obtain a prediction result.
According to the embodiment of the disclosure, the device may further include a training module configured to obtain a sample image, map the brightness of the sample image to a specific interval to construct an augmented image, and train a face detection model including the backbone convolutional neural network and the feature fusion network based on the sample image and the augmented image.
The present disclosure also discloses an electronic device, and fig. 11 shows a block diagram of the electronic device according to an embodiment of the present disclosure.
As shown in fig. 11, the electronic device 1100 includes a memory 1101 and a processor 1102, where the memory 1101 is used for storing a program that supports the electronic device to execute the face detection method in any of the above embodiments, and the processor 1102 is configured to execute the program stored in the memory 1101.
The memory 1101 is configured to store one or more computer instructions, which are executed by the processor 1102 to implement a face detection method as described in any of the above embodiments, according to the embodiments of the present disclosure.
Fig. 12 is a schematic structural diagram of a computer system suitable for implementing the face detection method according to an embodiment of the present disclosure.
As shown in fig. 12, the computer system 1200 includes a processing unit 1201 which can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the system 1200 are also stored. The processing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary. The processing unit 1201 can be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or other processing units.
In particular, the above-described methods may be implemented as computer software programs according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program comprising program code for performing the above-described method. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or by programmable hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the electronic device or the computer system in the above embodiments; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which the above-mentioned features or their equivalents are combined in any manner without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (11)

1. A face detection method, comprising:
processing the face image data through a backbone convolutional neural network, wherein the backbone convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
determining a plurality of prediction boxes based on the plurality of second feature maps;
obtaining a first threshold indicating a confidence under a flexible maximum (softmax) operation;
converting the first threshold into a second threshold which indicates the confidence under an addition and subtraction operation;
determining a prediction result based on the prediction box and the second threshold,
wherein the first threshold is converted to the second threshold by:
t2 = ln( t1 / (1 − t1) )
wherein t1 is the first threshold and t2 is the second threshold.
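For a two-class case with a positive logit a and a negative logit b, the softmax condition e^a / (e^a + e^b) > t1 is equivalent to a - b > ln(t1 / (1 - t1)), which is why the comparison can be done with a simple subtraction at inference time. A small Python sketch of this conversion follows; it is an illustration consistent with the formula above, not the patent's reference implementation.

```python
import math

def softmax_threshold_to_logit_threshold(t1):
    """Convert a softmax confidence threshold t1 into a logit-difference threshold t2.

    e^a / (e^a + e^b) > t1  is equivalent to  a - b > ln(t1 / (1 - t1)),
    so comparing (a - b) with t2 avoids computing exponentials per box.
    """
    return math.log(t1 / (1.0 - t1))

# Example: a softmax threshold of 0.8 corresponds to a logit difference of about 1.386.
t2 = softmax_threshold_to_logit_threshold(0.8)
```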
2. The method of claim 1, wherein the backbone convolutional neural network comprises a plurality of standard convolutional layers and a plurality of depthwise separable convolutional layers that are alternately arranged.
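By way of illustration, a backbone that alternates standard convolutions with depthwise separable convolutions can be sketched in PyTorch as follows; the channel widths, strides, and number of stages are placeholder values, not values disclosed in the patent.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    # Depthwise 3x3 followed by pointwise 1x1, the usual factorized convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

def standard_conv(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

# Alternating arrangement: standard conv, then depthwise separable conv, repeated.
backbone_stage = nn.Sequential(
    standard_conv(3, 16, stride=2),
    depthwise_separable(16, 32),
    standard_conv(32, 32, stride=2),
    depthwise_separable(32, 64),
)
```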
3. The method according to claim 1 or 2, wherein the processing the plurality of first feature maps through the feature fusion network to obtain a plurality of second feature maps comprises:
processing the plurality of first feature maps through a first fusion sub-network, so as to fuse features among the first feature maps to obtain a plurality of third feature maps;
and respectively processing the plurality of third feature maps through a second fusion sub-network to obtain a plurality of second feature maps.
4. The method of claim 3, wherein the plurality of first feature maps comprises at least a feature map C1 and a feature map C2, the feature map C1 being larger in size than the feature map C2, and the processing the plurality of first feature maps through the first fusion sub-network to fuse features among the first feature maps to obtain a plurality of third feature maps comprises:
processing the feature maps C1 and C2 respectively by a 1 × 1 convolutional layer to obtain feature maps M1 and P2 having the same number of channels;
up-sampling P2 to obtain a feature map M2_up, and superimposing the feature map M2_up with M1 to obtain a feature map M1_add;
processing M1_add with a 3 × 3 convolutional layer to obtain a feature map P1,
wherein P1 and P2 are third feature maps, and the up-sampling is implemented using a nearest neighbor interpolation algorithm.
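The fusion in claim 4 follows a lateral-connection pattern similar to a feature pyramid. A PyTorch sketch under that assumption is given below; the output channel count is illustrative, and nearest-neighbour up-sampling is used as stated in the claim.

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstFusion(nn.Module):
    def __init__(self, c1_ch, c2_ch, out_ch=64):
        super().__init__()
        self.lat1 = nn.Conv2d(c1_ch, out_ch, 1)                 # 1x1 conv: C1 -> M1
        self.lat2 = nn.Conv2d(c2_ch, out_ch, 1)                 # 1x1 conv: C2 -> P2
        self.smooth = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # 3x3 conv: M1_add -> P1

    def forward(self, c1, c2):
        m1 = self.lat1(c1)
        p2 = self.lat2(c2)
        # Nearest-neighbour up-sampling of P2 to M1's spatial size, then element-wise addition.
        m2_up = F.interpolate(p2, size=m1.shape[-2:], mode="nearest")
        p1 = self.smooth(m1 + m2_up)
        return p1, p2
```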
5. The method according to claim 3, wherein the processing the plurality of third feature maps separately through the second fusion sub-network to obtain a plurality of second feature maps comprises, for each third feature map P:
processing P with a convolutional layer having a first number of output channels to obtain a feature map S1;
processing P with a convolutional layer having a second number of output channels to obtain a feature map T;
processing T through two convolution paths respectively to obtain feature maps S2 and S3, wherein the number of channels of S2 and S3 is the second number of output channels;
superimposing S1, S2 and S3 along the channel dimension to obtain a second feature map F having a predetermined number of channels.
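One possible realization of the per-map processing in claim 5 is sketched below in PyTorch. The branch depths, the channel numbers, and the use of channel-wise concatenation for the final combination are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SecondFusion(nn.Module):
    def __init__(self, in_ch, ch1=128, ch2=64):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, ch1, 3, padding=1)   # -> S1 (first number of output channels)
        self.reduce = nn.Conv2d(in_ch, ch2, 3, padding=1)     # -> T  (second number of output channels)
        self.branch2 = nn.Conv2d(ch2, ch2, 3, padding=1)      # -> S2
        self.branch3 = nn.Sequential(                         # -> S3, a deeper path over T
            nn.Conv2d(ch2, ch2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch2, ch2, 3, padding=1),
        )

    def forward(self, p):
        s1 = self.branch1(p)
        t = self.reduce(p)
        s2 = self.branch2(t)
        s3 = self.branch3(t)
        # Channel-wise combination into the second feature map F (here: concatenation).
        return torch.cat([s1, s2, s3], dim=1)
```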
6. The method of claim 1 or 2, wherein the determining a plurality of prediction boxes based on the plurality of second feature maps comprises:
determining anchor positions based on pixel points in the second feature map;
determining the size of an anchor box based on the size of the second feature map, wherein the size of the second feature map is negatively correlated with the size of the anchor box;
determining the density of prediction boxes generated at each anchor position based on the down-sampling factor of the second feature map and the size of the anchor box;
determining a plurality of prediction boxes based on the anchor positions, the size of the anchor boxes, and the density of prediction boxes generated at the respective anchor positions, each prediction box including the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample, and a second confidence that the prediction box is a negative sample.
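A NumPy sketch of prediction-box (anchor) generation in the spirit of claim 6 is given below; the concrete anchor size, stride, and density values, and the realization of density as evenly spaced offsets within a cell, are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride, anchor_size, density=1):
    """Centered square anchor boxes for one second feature map.

    stride: down-sampling factor of the feature map with respect to the input image.
    anchor_size: chosen so that larger feature maps receive smaller anchors.
    density: number of anchors generated per axis at each anchor position.
    """
    boxes = []
    offsets = (np.arange(density) + 0.5) / density * stride
    for y in range(fmap_h):
        for x in range(fmap_w):
            for oy in offsets:
                for ox in offsets:
                    cx, cy = x * stride + ox, y * stride + oy
                    half = anchor_size / 2.0
                    boxes.append([cx - half, cy - half, cx + half, cy + half])
    return np.asarray(boxes, dtype=np.float32)

# Example: a 40x40 map with stride 8 and 16-pixel anchors, two anchors per axis per cell.
anchors = generate_anchors(40, 40, stride=8, anchor_size=16, density=2)
```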
7. The method of claim 6, wherein the determining a prediction result based on the prediction box and the second threshold comprises:
determining prediction boxes with a prediction score greater than the second threshold based on a difference between the first confidence and the second confidence;
sorting the prediction boxes with prediction scores greater than the second threshold according to the prediction scores through a binary-tree insertion and sequential sorting algorithm to obtain a sorting result;
and according to the sorting result, processing the prediction boxes with prediction scores greater than the second threshold through non-maximum suppression to filter out duplicate prediction boxes, thereby obtaining a prediction result.
8. The method of claim 1 or 2, further comprising:
obtaining a sample image;
mapping the brightness of the sample image to a specific interval to construct an augmented image;
and training a face detection model comprising the backbone convolutional neural network and the feature fusion network based on the sample image and the augmented image.
9. A face detection apparatus comprising:
a feature extraction module configured to process the face image data through a backbone convolutional neural network, wherein the backbone convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
a feature fusion module configured to process the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
a prediction box determination module configured to determine a plurality of prediction boxes based on the plurality of second feature maps;
a threshold acquisition module configured to acquire a first threshold indicating a confidence level under a flexible maximum operation;
a threshold conversion module configured to convert the first threshold into a second threshold that indicates the confidence level under an addition and subtraction operation, wherein the first threshold is converted into the second threshold by:
t2 = ln( t1 / (1 − t1) )
wherein t1 is the first threshold and t2 is the second threshold;
a result determination module configured to determine a predicted result based on the prediction box and the second threshold.
10. An electronic device comprising a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the steps of the method of any one of claims 1-8.
11. A readable storage medium having stored thereon computer instructions, characterized in that the computer instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.
CN202110202066.9A 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium Expired - Fee Related CN112560825B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110202066.9A CN112560825B (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium
CN202110762221.2A CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110202066.9A CN112560825B (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110762221.2A Division CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112560825A CN112560825A (en) 2021-03-26
CN112560825B (en) 2021-05-18

Family

ID=75034532

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110202066.9A Expired - Fee Related CN112560825B (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium
CN202110762221.2A Pending CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110762221.2A Pending CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (2) CN112560825B (en)

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220618B (en) * 2017-05-25 2019-12-24 中国科学院自动化研究所 Face detection method and device, computer readable storage medium and equipment
CN109472193A (en) * 2018-09-21 2019-03-15 北京飞搜科技有限公司 Method for detecting human face and device
CN109670430A (en) * 2018-12-11 2019-04-23 浙江大学 A kind of face vivo identification method of the multiple Classifiers Combination based on deep learning
CN109801270B (en) * 2018-12-29 2021-07-16 北京市商汤科技开发有限公司 Anchor point determining method and device, electronic equipment and storage medium
CN109753927A (en) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 A kind of method for detecting human face and device
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
CN110427821A (en) * 2019-06-27 2019-11-08 高新兴科技集团股份有限公司 A kind of method for detecting human face and system based on lightweight convolutional neural networks
CN110647817B (en) * 2019-08-27 2022-04-05 江南大学 Real-time face detection method based on MobileNet V3
CN111275166B (en) * 2020-01-15 2023-05-02 华南理工大学 Convolutional neural network-based image processing device, equipment and readable storage medium
CN111274994B (en) * 2020-02-13 2022-08-23 腾讯科技(深圳)有限公司 Cartoon face detection method and device, electronic equipment and computer readable medium
CN111368903B (en) * 2020-02-28 2021-08-27 深圳前海微众银行股份有限公司 Model performance optimization method, device, equipment and storage medium
CN111553227A (en) * 2020-04-21 2020-08-18 东南大学 Lightweight face detection method based on task guidance
CN111753682B (en) * 2020-06-11 2023-05-23 中建地下空间有限公司 Hoisting area dynamic monitoring method based on target detection algorithm
CN111753960B (en) * 2020-06-25 2023-08-08 北京百度网讯科技有限公司 Model training and image processing method and device, electronic equipment and storage medium
CN112287860B (en) * 2020-11-03 2022-01-07 北京京东乾石科技有限公司 Training method and device of object recognition model, and object recognition method and system

Also Published As

Publication number Publication date
CN112560825A (en) 2021-03-26
CN113688663A (en) 2021-11-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20210518