CN112560825B - Face detection method and device, electronic equipment and readable storage medium - Google Patents

Face detection method and device, electronic equipment and readable storage medium

Info

Publication number
CN112560825B
Authority
CN
China
Prior art keywords
prediction
feature
threshold
feature maps
processing
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110202066.9A
Other languages
Chinese (zh)
Other versions
CN112560825A (en)
Inventor
罗伯特·罗恩思
赵磊
马原
Current Assignee
Beijing Pengsi Technology Co ltd
Original Assignee
Beijing Pengsi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Pengsi Technology Co ltd filed Critical Beijing Pengsi Technology Co ltd
Priority to CN202110202066.9A
Priority to CN202110762221.2A (CN113688663A)
Publication of CN112560825A
Application granted
Publication of CN112560825B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure discloses a face detection method, a face detection device, electronic equipment and a readable storage medium. The face detection method comprises the following steps: processing the face image data through a trunk convolutional neural network, wherein the trunk convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map; processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps; determining a plurality of prediction boxes based on the plurality of second feature maps; obtaining a first threshold value indicating confidence under flexible maximum operation; converting the first threshold value into a second threshold value which indicates confidence level under addition and subtraction operation; determining a prediction result based on the prediction box and the second threshold. By converting the threshold comparison of the flexible maximum value into the threshold comparison of addition and subtraction operation, the calculation amount is greatly saved, the processing efficiency is improved, and the difficulty in deploying the model at the terminal is also reduced.

Description

Face detection method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of face recognition technologies, and in particular, to a face detection method, an apparatus, an electronic device, and a readable storage medium.
Background
With the continuous development of identity verification technology and intelligent image detection and recognition technology, face recognition technology has matured, face recognition applications are moving to terminals and the web, and the reduction of input requirements makes human-computer interaction more convenient. In general, face recognition technology includes face detection, face key point localization, face feature extraction and face attribute analysis. The inventor finds that existing face detection algorithms are computationally expensive and time-consuming, which makes them difficult to deploy on a terminal (such as an access control device).
Disclosure of Invention
In order to solve the problems in the related art, embodiments of the present disclosure provide a face detection method, an apparatus, an electronic device, and a readable storage medium.
In a first aspect, a face detection method is provided in an embodiment of the present disclosure.
Specifically, the face detection method includes:
processing the face image data through a trunk convolutional neural network, wherein the trunk convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
determining a plurality of prediction boxes based on the plurality of second feature maps;
obtaining a first threshold value indicating confidence under flexible maximum operation;
converting the first threshold value into a second threshold value which indicates confidence level under addition and subtraction operation;
determining a prediction result based on the prediction box and the second threshold.
With reference to the first aspect, in a first implementation manner of the first aspect, the disclosure converts the first threshold value into the second threshold value by the following formula:
t2 = ln(t1 / (1 - t1))
where t1 is the first threshold and t2 is the second threshold.
With reference to the first aspect, the present disclosure provides in a second implementation manner of the first aspect, the trunk convolutional neural network includes a plurality of normal convolutional layers and a plurality of depth separable convolutional layers, which are alternately arranged.
With reference to the first aspect, or any one of the first or second implementation manners of the first aspect, in a third implementation manner of the first aspect, the processing, by the feature fusion network, the plurality of first feature maps to obtain a plurality of second feature maps includes:
processing the plurality of first feature maps through a first fusion sub-network, so as to fuse features among the first feature maps to obtain a plurality of third feature maps;
and respectively processing the plurality of third feature maps through a second fusion sub-network to obtain a plurality of second feature maps.
With reference to the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the plurality of first feature maps includes at least a feature map C1 and a feature map C2, a size of the feature map C1 is larger than a size of the feature map C2, and the processing the plurality of first feature maps through the first fusion sub-network to obtain a plurality of third feature maps includes:
processing the feature maps C1 and C2 through 1 × 1 convolutional layers respectively to obtain feature maps M1 and P2 with the same number of channels;
upsampling P2 to obtain a feature map M2_up, and adding M2_up to M1 to obtain a feature map M1_add;
processing M1_add through a 3 × 3 convolutional layer to obtain a feature map P1,
wherein P1 and P2 are third feature maps, and the upsampling is implemented using a nearest neighbor interpolation algorithm.
With reference to the third implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the processing the plurality of third feature maps through the second merging sub-network respectively to obtain a plurality of second feature maps includes performing the following operations on each third feature map P:
processing P through a convolutional layer with a first number of output channels to obtain a feature map S1;
processing P through a convolutional layer with a second number of output channels to obtain a feature map T;
processing T through two convolution branches respectively to obtain feature maps S2 and S3, the numbers of channels of S2 and S3 both being the second number of output channels;
superimposing S1, S2 and S3 by channel to obtain a second feature map F having a predetermined number of channels.
With reference to the first aspect and any one of the first to fifth implementation manners of the first aspect, in a sixth implementation manner of the first aspect, the determining a plurality of prediction blocks based on the plurality of second feature maps includes:
determining anchor point positions based on pixel points in the second feature map;
determining the size of an anchor point frame based on the size of the second feature map, wherein the size of the second feature map and the size of the anchor point frame have a negative correlation relationship;
determining the density of prediction frames generated by each anchor point position based on the down-sampling multiplying power of the second feature map and the size of the anchor point frame;
determining a plurality of prediction frames based on the anchor locations, the size of the anchor frames, and the density of prediction frames generated by the respective anchor locations, each prediction frame including the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample and a second confidence that the prediction box is a negative sample.
With reference to the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the determining a prediction result based on the prediction block and the second threshold includes:
determining a prediction box with a prediction score greater than a second threshold based on a difference between the first confidence level and the second confidence level;
sorting the prediction frames with prediction scores larger than the second threshold according to the prediction scores through binary tree insertion and an in-order traversal sorting algorithm to obtain a sorting result;
and according to the sorting result, processing the prediction frame with the prediction score larger than a second threshold value through non-maximum value inhibition so as to filter repeated prediction frames to obtain a prediction result.
With reference to the first aspect and any one of the first to seventh implementation manners of the first aspect, in an eighth implementation manner of the first aspect, the method further includes:
obtaining a sample image;
mapping the brightness of the sample image to a specific interval to construct an augmented image;
and training a face detection model comprising the trunk convolutional neural network and the feature fusion network based on the sample image and the augmented image.
In a second aspect, a face detection apparatus is provided in the embodiments of the present disclosure.
Specifically, the face detection device includes:
the feature extraction module is configured to process the face image data through a trunk convolutional neural network, wherein the trunk convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
the feature fusion module is configured to process the first feature maps through a feature fusion network to obtain second feature maps;
a prediction box determination module configured to determine a plurality of prediction boxes based on the plurality of second feature maps;
a threshold acquisition module configured to acquire a first threshold indicating a confidence level under a flexible maximum operation;
a threshold conversion module configured to convert the first threshold into a second threshold indicating a confidence level under addition and subtraction operations;
a result determination module configured to determine a predicted result based on the prediction box and the second threshold.
In a third aspect, the present disclosure provides an electronic device, including a memory and a processor, where the memory is configured to store one or more computer instructions, where the one or more computer instructions are executed by the processor to implement the method according to the first aspect, and any one of the first to eighth implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, implement the method according to any one of the first aspect, the first to the eighth implementation manners of the first aspect.
According to the technical scheme provided by the embodiment of the disclosure, the threshold comparison of the flexible maximum value is converted into the threshold comparison of addition and subtraction operation, so that the calculation amount is greatly saved, the processing efficiency is improved, and the difficulty in deploying the model at the terminal is also reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. The following is a description of the drawings.
Fig. 1 shows a flow chart of a face detection method according to an embodiment of the present disclosure.
Fig. 2 shows a flowchart for processing a plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps according to an embodiment of the present disclosure.
Fig. 3 shows a flow chart for processing a plurality of said first feature maps by a first convergence subnetwork to obtain a plurality of third feature maps according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of processing the plurality of third feature maps by the second convergence sub-network to obtain a plurality of second feature maps, respectively, according to an embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a second converged sub-network according to an embodiment of the present disclosure.
Fig. 6 illustrates a flow chart for determining a plurality of prediction blocks based on the plurality of second feature maps according to an embodiment of the present disclosure.
FIGS. 7A-7D are schematic diagrams illustrating anchor point frame expansion according to an embodiment of the disclosure.
Fig. 8 illustrates a flow chart for determining a prediction result based on the prediction box and the second threshold according to an embodiment of the present disclosure.
FIG. 9 shows a flow diagram of a model training method according to an embodiment of the present disclosure.
Fig. 10 shows a block diagram of a face detection apparatus according to an embodiment of the present disclosure.
FIG. 11 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 12 is a schematic structural diagram of a computer system suitable for implementing the face detection method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a flow chart of a face detection method according to an embodiment of the present disclosure.
As shown in FIG. 1, the method includes operations S110-S160.
In operation S110, processing the face image data through a trunk convolutional neural network, where the trunk convolutional neural network includes a plurality of processing stages, and each processing stage outputs a first feature map;
in operation S120, processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
determining a plurality of prediction boxes based on the plurality of second feature maps in operation S130;
in operation S140, obtaining a first threshold value indicating a confidence level in the flexible maximum operation;
in operation S150, converting the first threshold into a second threshold indicating a confidence level in the addition and subtraction operation;
in operation S160, a prediction result is determined based on the prediction box and the second threshold.
According to the technical scheme of the embodiment of the disclosure, the threshold comparison of the flexible maximum value is converted into the threshold comparison of addition and subtraction operation, so that the calculation amount is greatly saved, the processing efficiency is improved, and the difficulty in deploying the model at the terminal is also reduced.
According to the embodiment of the disclosure, the face image data may be image data of a single image or image data of a video frame. The face image data can be captured by a camera; for example, an access control system can acquire the image data captured by a camera and use a face recognition algorithm to recognize whether the image contains the face of a person with access rights, so as to control the state of the access control.
According to the embodiments of the present disclosure, the backbone convolutional neural network may employ various existing neural network models, for example, for extracting features of an image. The main convolutional neural network is divided into a plurality of processing stages, and a first characteristic diagram is output after the execution of each processing stage is finished. Optionally, the size of the data in the direction from the input layer to the output layer of the trunk convolutional neural network is gradually reduced, so that the size of the first feature map output in the previous processing stage is larger than that of the first feature map output in the later processing stage, thereby forming the feature pyramid.
According to the embodiment of the disclosure, the feature fusion network is used for further mining the features in the plurality of first feature maps to form a plurality of second feature maps. The obtained second feature map is used for generating a plurality of prediction boxes, and each prediction box can comprise the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample and a second confidence that the prediction box is a negative sample.
Generally, the number of prediction frames generated in this step is large, and in order to obtain a final prediction result, a small number of prediction frames satisfying a certain condition need to be screened out from a large number of prediction frames.
According to the general understanding, the sum of the first confidence and the second confidence should be 1, but since the first confidence and the second confidence are generated independently and have no constraint relationship between them, their sum may be greater than 1 or less than 1. For example, if the first confidence is 0.2 and the second confidence is 0.1, it would be incorrect to use the first confidence directly to conclude that the probability of the prediction box being a positive sample is small.
To handle similar problems, some related techniques perform the flexible maximum (softmax) operation on the two confidences. For example, when the first confidence is 0.2 and the second confidence is 0.1, the corrected first confidence is e^0.2 / (e^0.2 + e^0.1) ≈ 0.52 and the corrected second confidence is e^0.1 / (e^0.2 + e^0.1) ≈ 0.48. After all the objects to be distinguished are processed in this way, they can all be measured uniformly against a threshold. The inventor finds that this approach is computationally expensive, and its performance is unsatisfactory when screening a large number of prediction boxes.
The embodiment of the disclosure provides a method for measuring whether a prediction box satisfies a preset condition through a threshold under addition and subtraction operations. For example, assume the first threshold set in the softmax method is 0.6, i.e. data whose corrected first confidence after softmax is greater than 0.6 should be selected. The method of the embodiment of the present disclosure converts this into checking whether the difference between the first confidence and the second confidence is greater than approximately 0.4 (ln(0.6/0.4) ≈ 0.405). For example, when the first confidence is 0.2 and the second confidence is 0.1, the difference is 0.1, which is not greater than 0.4, so the condition is not satisfied; when the first confidence is 0.6 and the second confidence is 0.1, the difference is 0.5, which is greater than 0.4, so the condition is satisfied. In this way, the exponential operation is converted into addition and subtraction, which greatly reduces the amount of computation and makes the algorithm suitable for fast face detection on a terminal.
According to an embodiment of the present disclosure, the first threshold value may be converted into the second threshold value by:
t2 = ln(t1 / (1 - t1))
where t1 is the first threshold and t2 is the second threshold.
The rationality of this method is explained below.
According to the softmax formula:
S1 = e^(a1) / (e^(a0) + e^(a1)),   S0 = e^(a0) / (e^(a0) + e^(a1))
where a1 is the first confidence, a0 is the second confidence, S1 is the corrected first confidence, and S0 is the corrected second confidence.
If S1 is greater than the first threshold t1, then since
S1 = e^(a1) / (e^(a0) + e^(a1)),
it can be obtained that:
e^(a1) / (e^(a0) + e^(a1)) > t1.
The above formula is equivalent to
e^(a1 - a0) > t1 / (1 - t1).
Since the ln function is monotonically increasing, taking the logarithm of both sides preserves the inequality, namely:
a1 - a0 > ln(t1 / (1 - t1)).
When the first threshold t1 is determined, ln(t1 / (1 - t1)) is a constant, and this constant is set to t2, namely:
a1 - a0 > t2.
When the first threshold t1 is fixed, t2 only needs to be calculated once; the first confidence a1 and the second confidence a0 can then be used directly to compute whether the score satisfies the condition.
Compared with computing softmax for every object, the method provided by the embodiment of the disclosure replaces the softmax threshold comparison by computing a single constant, and the two are equivalent in effect. When the input size of the network is fixed, for example, 8208 prediction boxes may be generated; if softmax were computed for each prediction box, 8208 softmax operations would be required, whereas with this scheme the second threshold only needs to be computed once and the large number of softmax operations simplifies into addition and subtraction operations, which greatly reduces the amount of computation in the data post-processing stage and improves detection efficiency.
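As a minimal illustration (not part of the original disclosure; the function names are hypothetical), the threshold conversion and the equivalence of the two comparisons can be sketched in Python as follows:

```python
import math

def softmax_threshold_to_diff_threshold(t1: float) -> float:
    """Convert a softmax confidence threshold t1 into the equivalent
    threshold t2 on the raw score difference a1 - a0."""
    return math.log(t1 / (1.0 - t1))

def is_positive_softmax(a1: float, a0: float, t1: float) -> bool:
    # Original check: the softmax-corrected first confidence exceeds t1.
    s1 = math.exp(a1) / (math.exp(a1) + math.exp(a0))
    return s1 > t1

def is_positive_diff(a1: float, a0: float, t2: float) -> bool:
    # Equivalent check: one subtraction and one comparison per prediction box.
    return (a1 - a0) > t2

t1 = 0.6
t2 = softmax_threshold_to_diff_threshold(t1)  # ln(0.6/0.4) ≈ 0.405
for a1, a0 in [(0.2, 0.1), (0.6, 0.1), (1.5, 0.9)]:
    assert is_positive_softmax(a1, a0, t1) == is_positive_diff(a1, a0, t2)
```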
According to an embodiment of the present disclosure, the trunk convolutional neural network includes a plurality of normal convolutional layers and a plurality of depth separable convolutional layers alternately arranged.
By using depth separable convolutions instead of ordinary convolutions to extract features, the number of parameters can be further reduced while ensuring sufficient network depth, and the amount of computation is reduced accordingly.
Assume a convolution kernel size of Kh × Kw, Cin input channels, Cout output channels, and an output feature map of width W and height H; bias terms are omitted here.
For the standard convolutional layer:
the number of parameters is: Params = Kh × Kw × Cin × Cout;
the number of floating point operations (FLOPs) is: Params × H × W.
For the depth separable convolution:
in the depthwise convolution, each input channel is filtered by its own spatial kernel, so features are extracted without being combined across channels; in the pointwise convolution, output_channels convolution kernels of size 1 × 1 × in_channels are used to combine the features across channels with different (learnable) weights.
The number of parameters therefore changes from the original Kh × Kw × Cin × Cout to Kh × Kw × Cin × 1 + 1 × 1 × Cin × Cout. If Kh = 3, Kw = 3 and Cout = 64, the number of parameters is reduced to roughly 1/8 to 1/9 of the original.
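The two parameter counts can be checked with a short calculation; this snippet is illustrative and not part of the original text:

```python
def conv_params(kh: int, kw: int, c_in: int, c_out: int) -> int:
    # Standard convolution parameters (bias omitted).
    return kh * kw * c_in * c_out

def dw_separable_params(kh: int, kw: int, c_in: int, c_out: int) -> int:
    # Depthwise (kh*kw*c_in) plus pointwise (1*1*c_in*c_out) parameters.
    return kh * kw * c_in + c_in * c_out

kh, kw, c_in, c_out = 3, 3, 32, 64
standard = conv_params(kh, kw, c_in, c_out)           # 18432
separable = dw_separable_params(kh, kw, c_in, c_out)  # 2336
print(separable / standard)  # ≈ 0.127, i.e. roughly 1/8 of the original
```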
According to an embodiment of the present disclosure, the backbone neural network structure may be implemented, for example, in the form shown in Table 1, where Conv denotes an ordinary convolutional layer followed by Batch Normalization (BN) and an activation layer (e.g., ReLU), Conv dw denotes a depth separable convolutional layer followed by Batch Normalization and an activation layer, S2 denotes a stride of 2, S1 denotes a stride of 1, and the padding mode is SAME. C1, C2 and C3 are the outputs of the three branches, namely the three first feature maps.
TABLE 1
Convolution type/step size Convolution kernel size Input size
Conv / S2 3×3×3×8 192×128×3
Conv dw / S1 3×3×8 dw 96×64×8
Conv / S1 1×1×8×16 96×64×8
Conv dw / S2 3×3×16 dw 96×64×16
Conv / S1 1×1×16×32 48×32×16
Conv dw / S1 3×3×32 dw 48×32×32
Conv / S1 1×1×32×32 48×32×32
Conv dw / S2 3×3×32 dw 48×32×32
Conv / S1 1×1×32×64 24×16×32
Conv dw / S1 3×3×64 dw 24×16×64
Conv / S1 (C1) 1×1×64×64 24×16×64
Conv dw / S2 3×3×64 dw 24×16×64
Conv / S1 1×1×64×128 12×8×64
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 1×1×128×128 12×8×128
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 1×1×128×128 12×8×128
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 1×1×128×128 12×8×128
Conv dw / S1 3×3×128 dw 12×8×128
Conv / S1 (C2) 1×1×128×128 12×8×128
Conv dw / S2 3×3×128 dw 12×8×128
Conv / S1 1×1×128×256 6×4×128
Conv dw / S1 3×3×256 dw 6×4×256
Conv / S1(C3) 1×1×256×256 6×4×256
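The repeated "Conv dw" + "Conv 1 × 1" unit of Table 1 could, for example, be written as follows in a PyTorch-style framework; this is a sketch under that assumption, not the patented implementation:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """One 'Conv dw' + 'Conv 1x1' unit from Table 1: a depthwise 3x3 convolution
    followed by a pointwise 1x1 convolution, each with BN and ReLU."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, stride=stride, padding=1,
                      groups=in_channels, bias=False),           # Conv dw
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),  # Conv 1x1
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Example: the two rows following the stem in Table 1 (8 -> 16 channels at 96x64).
x = torch.randn(1, 8, 96, 64)
y = DepthwiseSeparableBlock(8, 16, stride=1)(x)  # -> shape (1, 16, 96, 64)
```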
According to an embodiment of the present disclosure, the feature fusion network may include, for example, a first fusion sub-network and a second fusion sub-network.
Fig. 2 shows a flowchart for processing a plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps according to an embodiment of the present disclosure.
As shown in fig. 2, operation S120 may include operations S210 and S220.
Processing a plurality of the first feature maps through a first merging sub-network for merging features between the first feature maps to obtain a plurality of third feature maps in operation S210;
in operation S220, the plurality of third feature maps are respectively processed by the second convergence sub-network to obtain a plurality of second feature maps.
Through these two fusions, different receptive fields are incorporated so that faces of different sizes are attended to, which improves the face detection effect.
According to the embodiment of the disclosure, the plurality of first feature maps at least comprise a feature map C1 and a feature map C2, and the size of the feature map C1 is larger than that of the feature map C2. As described above, the first feature maps may include, for example, C1, C2 and C3.
Fig. 3 shows a flow chart for processing a plurality of said first feature maps by a first convergence subnetwork to obtain a plurality of third feature maps according to an embodiment of the present disclosure.
As shown in FIG. 3, operation S210 may include operations S310-S330.
In operation S310, the feature maps C1 and C2 are processed through 1 × 1 convolutional layers respectively, to obtain feature maps M1 and P2 with the same number of channels;
in operation S320, P2 is upsampled to obtain a feature map M2_up, which is added to M1 to obtain a feature map M1_add;
in operation S330, M1_add is processed through a 3 × 3 convolutional layer to obtain a feature map P1, where P1 and P2 are the third feature maps.
According to an embodiment of the present disclosure, the upsampling may be implemented by, for example, a nearest neighbor interpolation algorithm: when upsampling by a factor of two, each value of the small feature map is simply duplicated into a 2 × 2 block of the enlarged feature map. For example (with illustrative values), the 2 × 2 feature map [[1, 2], [3, 4]] becomes [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]] after 2× nearest neighbor upsampling.
through the nearest neighbor value interpolation algorithm, the semantic information (beneficial to classification) of the feature map can be reserved to the maximum extent in the up-sampling process, so that the feature map with rich spatial information (high resolution and beneficial to positioning) corresponding to the up-sampling process is fused, and the feature map with good spatial information and strong semantic information is obtained.
In one embodiment of the present disclosure, the first fusion sub-network may, for example, take the form illustrated in Table 2.
TABLE 2
(Table 2 appears as an image in the original publication; it lists the layers of the first fusion sub-network, whose structure is described below.)
According to the embodiment of the disclosure, after the outputs of the three branches C1, C2 and C3 are obtained from the backbone network, they are sent to the feature fusion network to fuse the top-level features with the bottom-level features; the fusion upsamples the small top-level feature map to the same size as the feature map of the preceding stage.
As shown in Table 2, the C1, C2 and C3 layers are each passed through a 1 × 1 convolution that changes the number of channels of the feature map (all M layers have the same number of channels, for example d = 64 in the embodiment of the present disclosure), giving M1, M2 and M3. M3 is upsampled to obtain M3_up, which is added element-wise to M2 to obtain M2_add. M2_add is upsampled to obtain M2_up, which is added element-wise to M1 to obtain M1_add. The M1_add and M2_add feature maps are then each passed through a 3 × 3 convolution (to reduce the aliasing caused by nearest neighbor interpolation, in which neighbouring values are identical), giving the final P1, P2 and P3 layer features, namely the third feature maps.
It can be understood that C2, C3, M2, P3 (M3), M3_up, M2_add and P2 in the embodiment illustrated in Table 2 correspond respectively to C1, C2, M1, P2, M2_up, M1_add and P1 described in operations S310 to S330. The reason is that the embodiment illustrated in Table 2 has three first feature maps, which shifts the numbering.
According to the embodiment of the disclosure, after the feature fusion layer, the P1, P2 and P3 are further fused with the information of different receptive fields through a second fusion sub-network.
Fig. 4 shows a flowchart of processing the plurality of third feature maps by the second convergence sub-network to obtain a plurality of second feature maps, respectively, according to an embodiment of the present disclosure.
As shown in FIG. 4, operation S220 may include operations S410-S440.
In operation S410, P is processed through a convolutional layer with a first number of output channels to obtain a feature map S1;
in operation S420, P is processed through a convolutional layer with a second number of output channels to obtain a feature map T;
in operation S430, T is processed through two convolution branches respectively to obtain feature maps S2 and S3, the numbers of channels of S2 and S3 both being the second number of output channels;
in operation S440, S1, S2 and S3 are stacked by channel to obtain a second feature map F having a predetermined number of channels.
The following description is made with reference to the second converged sub-network illustrated in fig. 5.
Fig. 5 shows a schematic structural diagram of a second converged sub-network according to an embodiment of the present disclosure.
As shown in fig. 5, the second fusion sub-network may be a three-branch network. For any third feature map, for example P1: first, an ordinary convolution with a 3 × 3 kernel and 32 output channels followed by batch normalization is applied to obtain S1. P1 is also passed through a convolution with a 3 × 3 kernel and 16 output channels, a batch normalization layer and a ReLU activation layer to obtain T2; T2 is passed through a convolutional layer with a 3 × 3 kernel and 16 output channels and a batch normalization layer to obtain S2. T2 is passed through another convolutional layer with a 3 × 3 kernel and 16 output channels, a batch normalization layer and a ReLU activation layer to obtain T3, and T3 is passed through a convolutional layer with a 3 × 3 kernel and 16 output channels and a batch normalization layer to obtain S3. Finally, S1, S2 and S3 are concatenated along the channel dimension and passed through a ReLU activation layer to obtain the second feature map F1, which is still a 64-channel feature map. Similarly, P2 and P3 may be processed in the same way to give F2 and F3, respectively.
Similarly, P1 and T2 in the above illustrated embodiment correspond to P and T, respectively, described in operations S410-S440.
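The three-branch module of fig. 5 could be sketched as below, again assuming a PyTorch-style implementation (illustrative only):

```python
import torch
import torch.nn as nn

class SecondFusionSubnetwork(nn.Module):
    """Three-branch module: a 64-channel P becomes a 32 + 16 + 16 = 64-channel F."""
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.branch1 = nn.Sequential(  # P -> S1 (32 channels, no ReLU before concat)
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32))
        self.stem1 = nn.Sequential(    # P -> T2 (16 channels)
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
        self.branch2 = nn.Sequential(  # T2 -> S2
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16))
        self.stem2 = nn.Sequential(    # T2 -> T3
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
        self.branch3 = nn.Sequential(  # T3 -> S3
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16))

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        s1 = self.branch1(p)
        t2 = self.stem1(p)
        s2 = self.branch2(t2)
        s3 = self.branch3(self.stem2(t2))
        return torch.relu(torch.cat([s1, s2, s3], dim=1))  # second feature map F

f1 = SecondFusionSubnetwork()(torch.randn(1, 64, 24, 16))  # shape (1, 64, 24, 16)
```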
The inventor finds that existing detection algorithms perform poorly on smaller faces. Therefore, the embodiment of the disclosure provides a scheme for increasing the number of anchor frames at smaller scales when generating prediction frames.
Fig. 6 illustrates a flow chart for determining a plurality of prediction blocks based on the plurality of second feature maps according to an embodiment of the present disclosure.
As shown in FIG. 6, operation S130 may include operations S610-S640.
In operation S610, determining an anchor point position based on a pixel point in the second feature map;
determining a size of an anchor frame based on the size of the second feature map, wherein the size of the second feature map and the size of the anchor frame have a negative correlation relationship in operation S620;
determining a density of prediction frames generated at respective anchor positions based on the down-sampling magnification of the second feature map and the sizes of the anchor frames in operation S630;
in operation S640, a plurality of prediction frames are determined based on the anchor positions, the sizes of the anchor frames, and the densities of the prediction frames generated by the respective anchor positions.
According to the embodiment of the present disclosure, the downsampling magnifications between the sizes of the second feature maps F1, F2, F3 and the original input image are, for example, 8, 16 and 32, respectively. The more convolutional layers the features pass through, the higher the downsampling magnification and the larger the corresponding receptive field. Embodiments of the present disclosure may define anchor blocks of a variety of sizes, for example 16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256 and 512 × 512. Anchor blocks of smaller sizes are used for prediction on the second feature map with the smaller downsampling magnification (the second feature map of larger size), and anchor blocks of larger sizes are used for prediction on the second feature map with the larger downsampling magnification (the second feature map of smaller size). For example, anchor blocks of the two sizes 16 × 16 and 32 × 32 are used at F1, anchor blocks of the two sizes 64 × 64 and 128 × 128 are used at F2, and anchor blocks of the two sizes 256 × 256 and 512 × 512 are used at F3.
Typically, each anchor location predicts a set of anchor blocks of different aspect ratios and different sizes, e.g., 3 × 2=6 anchor blocks per anchor location with aspect ratios of 1:1, 1:1.5 and 1.5:1, and sizes of 16 × 16 and 32 × 32. In the disclosed embodiment, the aspect ratio is fixed to 1:1, with 16 × 16 and 32 × 32 anchor blocks predicted per anchor position for feature map F1, and 256 × 256 and 512 × 512 anchor blocks predicted per anchor position for feature map F3.
Obviously, in this case, anchor blocks of smaller size are sparse relative to anchor blocks of larger size. This is an important reason why small targets are detected poorly. To solve this problem, the embodiment of the present disclosure proposes to increase the anchor block density of part of the feature maps. Increasing the anchor block density means that, at each anchor position, multiple anchor blocks are predicted for each combination of aspect ratio and anchor block size. For example, an anchor block with insufficient density may be densified by offsetting copies of it around its center, as shown in figs. 7A-7D.
It can be defined that the anchor frame density of 16 × 16 size is increased by 4 times, the anchor frame density of 32 × 32 size is increased by 2 times, and the anchor frame density of 64 × 64 size is increased by 2 times, so that the F1 feature layer yields 4 × 4+2 × 2=20 prediction offsets per pixel position (16 anchor frames of 16 × 16 and 4 anchor frames of 32 × 32), the F2 feature layer predicts 2 × 2+1 × 1=5 prediction offsets per pixel position (4 anchor frames of 64 × 64 and 1 anchor frame of 128 × 128), and the F3 feature layer predicts 1 × 1+1 × 1=2 prediction offsets per pixel position (1 anchor frame of 256 × 256 and 1 anchor frame of 512 × 512).
Table 3 illustrates the selection of anchor boxes of different sizes, the corresponding density of each anchor box, and the ratio of the step size of the anchor box divided by the size of the anchor box in accordance with an embodiment of the present disclosure.
TABLE 3
Anchor frame size Down sampling multiplying power Anchor frame density Anchor frame step size Ratio of
16×16 8 N=4 2=8/4 2/16=1/8
32×32 8 N=2 4=8/2 4/32=1/8
64×64 16 N=2 8=16/2 8/64=1/8
128×128 16 N=1 16=16/1 16/128=1/8
256×256 32 N=1 32=32/1 32/256=1/8
512×512 32 N=1 32=32/1 32/512=1/16
As shown in table 3, the step size of the anchor box is defined as the magnification of the down-sampling divided by the density of the anchor box, and the ratio of the step size of the anchor box divided by the size of the anchor box is defined as a fixed value (e.g., 1/8, where the ratio is determined to be 1/16 in the case of an anchor box size of 512 × 512 to ensure that the density of the anchor box is not lower than 1).
That is, the F1 feature map has a size of 24 × 16, and each pixel is responsible for predicting 4 × 4 anchor blocks of size 16 and 2 × 2 anchor blocks of size 32, resulting in 20 prediction blocks per pixel; the size of the F2 feature map is 12 x 8, each pixel is responsible for predicting 2 x 2 anchor blocks with the size of 64 and 1 x 1 anchor blocks with the size of 128, and each pixel obtains 5 prediction blocks; the F3 feature map is 6 × 4 in size, and each pixel is responsible for predicting 1 × 1 anchor block of size 256 and 1 × 1 anchor block of size 512, resulting in 2 prediction blocks per pixel. A total of 24 × 16 × 20 + 12 × 8 × 5 + 6 × 4 × 2= 8208 prediction boxes are generated.
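The per-pixel counts and the 8208 total can be verified with a short (illustrative) calculation:

```python
# (width, height, prediction boxes per pixel) for the F1, F2 and F3 feature maps
feature_layers = [
    (24, 16, 4 * 4 + 2 * 2),  # F1: sixteen 16x16 anchors + four 32x32 anchors
    (12, 8, 2 * 2 + 1 * 1),   # F2: four 64x64 anchors + one 128x128 anchor
    (6, 4, 1 * 1 + 1 * 1),    # F3: one 256x256 anchor + one 512x512 anchor
]
total = sum(w * h * n for w, h, n in feature_layers)
print(total)  # 8208
```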
According to an embodiment of the present disclosure, each of the prediction boxes includes the following information: the position and size of the prediction box, the first confidence that the prediction box is a positive sample and the second confidence that the prediction box is a negative sample, which may be expressed, for example, as (x, y, w, h, a1, a0), where x and y are the offsets of the prediction box relative to the anchor point, w and h are the size variations of the prediction box relative to the anchor box, and a1 and a0 are the first confidence that the prediction box is a positive sample and the second confidence that it is a negative sample; these data are generated based on the second feature maps.
The method of the embodiment of the disclosure increases the density of the anchor point frame for predicting the smaller face, and improves the detection rate of the smaller face.
Fig. 8 illustrates a flow chart for determining a prediction result based on the prediction box and the second threshold according to an embodiment of the present disclosure.
As shown in FIG. 8, operation S160 may include operations S810-S830.
Determining a prediction box with a prediction score greater than a second threshold based on a difference between the first confidence and the second confidence in operation S810;
in operation S820, sorting the prediction boxes with prediction scores greater than the second threshold according to the prediction scores through binary tree insertion and an in-order traversal sorting algorithm to obtain a sorting result;
in operation S830, the prediction boxes with the prediction scores greater than the second threshold are processed by non-maximum suppression according to the sorting result to filter the repeated prediction boxes, so as to obtain a prediction result.
According to the embodiment of the present disclosure, after the prediction results of the F1, F2, and F3 feature layers are obtained, the post-processing stage of the prediction boxes is entered, for example, the prediction boxes with prediction scores lower than 0.3 may be considered as negative samples by using the above-mentioned threshold conversion method, the prediction boxes with prediction scores higher than 0.3 are sorted in descending order according to the prediction scores, and then NMS (non-maximum suppression) is performed to further filter the prediction boxes with IOU (area of overlapping region divided by area of union) larger than 0.5 as duplicates, so as to obtain the prediction results. Optionally, the prediction box which is retained and has a score greater than another threshold (such as 0.8 or 0.5) defined by the system may be further output as the prediction result.
When filtering duplicate prediction boxes using NMS (non-maximum suppression), the boxes need to be sorted by score so that the prediction box with the highest confidence is kept and duplicate prediction boxes with lower scores are suppressed. The time complexity of a conventional sorting algorithm is O(n^2). In the embodiment of the present disclosure, binary tree insertion and binary tree in-order traversal may be adopted, so that each insertion costs O(log2 n) and the traversal costs O(n), further improving the processing efficiency.
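A compact sketch of this post-processing stage (score filtering with the converted threshold, sorting, then NMS) is shown below; a standard library sort is used here in place of the binary-tree insertion described above, and the helper names are hypothetical:

```python
def iou(b1, b2):
    # Boxes are (x1, y1, x2, y2); IoU = overlap area divided by union area.
    xa, ya = max(b1[0], b2[0]), max(b1[1], b2[1])
    xb, yb = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (area1 + area2 - inter + 1e-9)

def post_process(boxes, a1, a0, t2, iou_thresh=0.5):
    # Keep boxes whose score difference a1 - a0 exceeds the converted threshold t2.
    scored = [(a1[i] - a0[i], boxes[i])
              for i in range(len(boxes)) if a1[i] - a0[i] > t2]
    scored.sort(key=lambda item: item[0], reverse=True)  # descending by score
    kept = []
    for score, box in scored:  # greedy non-maximum suppression
        if all(iou(box, k) <= iou_thresh for _, k in kept):
            kept.append((score, box))
    return kept
```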
In addition, the inventor finds that the missing rate and the false alarm rate of the human face are high in the existing human face detection algorithm under complex scenes such as backlight, dim light and wearing a mask. Therefore, the embodiment of the present disclosure provides a method for augmenting a training sample to alleviate the above problem, and improve the face detection rate in a complex scene.
FIG. 9 shows a flow diagram of a model training method according to an embodiment of the present disclosure.
As shown in fig. 9, the method may further include operations S910 to S930 based on any of the embodiments illustrated in fig. 1 to 8.
In operation S910, a sample image is obtained;
mapping the brightness of the sample image to a specific section to construct an augmented image in operation S920;
in operation S930, a face detection model including the trunk convolutional neural network and the feature fusion network is trained based on the sample image and the augmented image.
According to the embodiment of the present disclosure, the RGB data of the input video frame or image may be preprocessed, for example, the luminance values of each pixel in the image may be processed by linear stretching:
y = (x - min(x)) × (dmax - dmin) / (max(x) - min(x) + 1.0) + dmin
where x represents the input image, min(x) represents the minimum pixel intensity in the image, max(x) represents the maximum pixel intensity in the image, and dmax and dmin represent respectively the maximum and minimum values of the target interval to which the image is mapped.
In consideration of data enhancement of backlight conditions, the brightness of the training image is adjusted by using a maximum and minimum linear stretching mode, and the brightness of the training data is randomly mapped to a specific interval, such as 155-255, so that a certain amount of data under an overexposure or backlight scene appears in the training sample, and the trained model can have better generalization capability on data with strong backlight and illumination.
Similarly, the luminance can also be mapped to a smaller interval of values to enhance the generalization ability of the model to data of dark scenes.
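A small NumPy sketch of this brightness augmentation follows; the random sampling of the interval bounds is an assumption made for illustration:

```python
import numpy as np

def stretch_brightness(image: np.ndarray, d_min: float, d_max: float) -> np.ndarray:
    """Linearly map the pixel intensities of `image` into [d_min, d_max]."""
    x = image.astype(np.float32)
    return (x - x.min()) * (d_max - d_min) / (x.max() - x.min() + 1.0) + d_min

def augment_brightness(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Randomly map brightness either into a bright interval (overexposure /
    # backlight) or into a dark interval (dim scenes); bounds are illustrative.
    if rng.random() < 0.5:
        return stretch_brightness(image, rng.uniform(155.0, 200.0), 255.0)
    return stretch_brightness(image, 0.0, rng.uniform(60.0, 100.0))
```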
The embodiment of the disclosure provides a face detection method based on a depth separable convolutional neural network. RGB data of a video frame or an image is input through a human-computer interface, normalized in preprocessing, and fed into the trunk convolutional neural network to obtain preliminary feature maps of three different stages. The feature maps of the different stages are then input into the feature fusion layers for feature fusion, and three fused feature maps are output; a three-branch convolutional neural network further extracts features from each feature layer, and each feature layer is used for predicting prediction boxes of different sizes. Finally, negative samples and prediction boxes with high overlap are filtered out through post-processing, and prediction boxes with confidence scores higher than a threshold are output. The method reduces the amount of computation and improves the detection capability for smaller faces.
Fig. 10 shows a block diagram of a face detection apparatus according to an embodiment of the present disclosure. The apparatus 1000 may be implemented as part or all of an electronic device through software, hardware, or a combination of both.
As shown in fig. 10, the face detection apparatus 1000 includes a feature extraction module 1010, a feature fusion module 1020, a prediction box determination module 1030, a threshold acquisition module 1040, a threshold conversion module 1050, and a result determination module 1060.
A feature extraction module 1010 configured to process face image data by a trunk convolutional neural network, wherein the trunk convolutional neural network includes a plurality of processing stages, and each processing stage outputs a first feature map;
a feature fusion module 1020 configured to process the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
a prediction box determination module 1030 configured to determine a plurality of prediction boxes based on the plurality of second feature maps;
a threshold acquisition module 1040 configured to acquire a first threshold indicating a confidence level under the flexible maximum operation;
a threshold conversion module 1050 configured to convert the first threshold into a second threshold indicating a confidence level under addition and subtraction operations;
a result determination module 1060 configured to determine a predicted result based on the prediction box and the second threshold.
According to an embodiment of the present disclosure, the first threshold value is converted into a second threshold value by:
t2 = ln(t1 / (1 - t1))
where t1 is the first threshold and t2 is the second threshold.
According to an embodiment of the present disclosure, the trunk convolutional neural network includes a plurality of normal convolutional layers and a plurality of depth separable convolutional layers alternately arranged.
According to an embodiment of the present disclosure, the processing the plurality of first feature maps through the feature fusion network to obtain a plurality of second feature maps includes:
processing the plurality of first feature maps through a first fusion sub-network, so as to fuse features among the first feature maps to obtain a plurality of third feature maps;
and respectively processing the plurality of third feature maps through a second fusion sub-network to obtain a plurality of second feature maps.
According to an embodiment of the present disclosure, the plurality of first feature maps includes at least feature map C1 and feature map C2, the size of feature map C1 is larger than the size of feature map C2, and the processing the plurality of first feature maps through the first merging subnetwork to obtain a plurality of third feature maps includes:
processing the feature maps C1 and C2 through 1 × 1 convolutional layers respectively to obtain feature maps M1 and P2 with the same number of channels;
upsampling P2 to obtain a feature map M2_up, and adding M2_up to M1 to obtain a feature map M1_add;
processing M1_add through a 3 × 3 convolutional layer to obtain a feature map P1,
wherein P1 and P2 are third feature maps, and the upsampling is implemented using a nearest neighbor interpolation algorithm.
According to an embodiment of the present disclosure, the processing the plurality of third feature maps by the second merging sub-network, respectively, to obtain a plurality of second feature maps includes performing the following operations for each third feature map P:
processing P through a convolutional layer with a first number of output channels to obtain a feature map S1;
processing P through a convolutional layer with a second number of output channels to obtain a feature map T;
processing T through two convolution branches respectively to obtain feature maps S2 and S3, the numbers of channels of S2 and S3 both being the second number of output channels;
superimposing S1, S2 and S3 by channel to obtain a second feature map F having a predetermined number of channels.
According to an embodiment of the present disclosure, the determining a plurality of prediction boxes based on the plurality of second feature maps includes:
determining anchor point positions based on pixel points in the second feature map;
determining the size of an anchor point frame based on the size of the second feature map, wherein the size of the second feature map and the size of the anchor point frame have a negative correlation relationship;
determining the density of prediction frames generated by each anchor point position based on the down-sampling multiplying power of the second feature map and the size of the anchor point frame;
determining a plurality of prediction frames based on the anchor locations, the size of the anchor frames, and the density of prediction frames generated by the respective anchor locations, each prediction frame including the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample and a second confidence that the prediction box is a negative sample.
According to an embodiment of the present disclosure, the determining a prediction result based on the prediction box and the second threshold includes:
determining a prediction box with a prediction score greater than a second threshold based on a difference between the first confidence level and the second confidence level;
sorting the prediction frames with prediction scores larger than the second threshold according to the prediction scores through binary tree insertion and an in-order traversal sorting algorithm to obtain a sorting result;
and according to the sorting result, processing the prediction frame with the prediction score larger than a second threshold value through non-maximum value inhibition so as to filter repeated prediction frames to obtain a prediction result.
According to the embodiment of the disclosure, the device may further include a training module configured to obtain a sample image, map the brightness of the sample image to a specific interval to construct an augmented image, and train a face detection model including the backbone convolutional neural network and the feature fusion network based on the sample image and the augmented image.
The present disclosure also discloses an electronic device, and fig. 11 shows a block diagram of the electronic device according to an embodiment of the present disclosure.
As shown in fig. 11, the electronic device 1100 includes a memory 1101 and a processor 1102, where the memory 1101 is used for storing a program that supports the electronic device to execute the face detection method in any of the above embodiments, and the processor 1102 is configured to execute the program stored in the memory 1101.
The memory 1101 is configured to store one or more computer instructions, which are executed by the processor 1102 to implement a face detection method as described in any of the above embodiments, according to the embodiments of the present disclosure.
Fig. 12 is a schematic structural diagram of a computer system suitable for implementing the face detection method according to an embodiment of the present disclosure.
As shown in fig. 12, the computer system 1200 includes a processing unit 1201 which can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the system 1200 are also stored. The processing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary. The processing unit 1201 can be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or other processing units.
In particular, the above-described methods may be implemented as computer software programs according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program comprising program code for performing the above-described method. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or by programmable hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the electronic device or the computer system in the above embodiments; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which the above-mentioned features or their equivalents are combined in any manner without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (11)

1. A face detection method, comprising:
processing the face image data through a backbone convolutional neural network, wherein the backbone convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
processing the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
determining a plurality of prediction boxes based on the plurality of second feature maps;
obtaining a first threshold indicating a confidence under a flexible maximum (softmax) operation;
converting the first threshold into a second threshold which indicates the confidence under an addition and subtraction operation;
determining a prediction result based on the prediction box and the second threshold,
wherein the first threshold is converted to the second threshold by:
t2 = ln( t1 / (1 − t1) )
wherein t1 is the first threshold and t2 is the second threshold.
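For a two-class case with a positive logit a and a negative logit b, the softmax condition e^a / (e^a + e^b) > t1 is equivalent to a - b > ln(t1 / (1 - t1)), which is why the comparison can be done with a simple subtraction at inference time. A small Python sketch of this conversion follows; it is an illustration consistent with the formula above, not the patent's reference implementation.

```python
import math

def softmax_threshold_to_logit_threshold(t1):
    """Convert a softmax confidence threshold t1 into a logit-difference threshold t2.

    e^a / (e^a + e^b) > t1  is equivalent to  a - b > ln(t1 / (1 - t1)),
    so comparing (a - b) with t2 avoids computing exponentials per box.
    """
    return math.log(t1 / (1.0 - t1))

# Example: a softmax threshold of 0.8 corresponds to a logit difference of about 1.386.
t2 = softmax_threshold_to_logit_threshold(0.8)
```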
2. The method of claim 1, wherein the backbone convolutional neural network comprises a plurality of standard convolutional layers and a plurality of depthwise separable convolutional layers that are alternately arranged.
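By way of illustration, a backbone that alternates standard convolutions with depthwise separable convolutions can be sketched in PyTorch as follows; the channel widths, strides, and number of stages are placeholder values, not values disclosed in the patent.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    # Depthwise 3x3 followed by pointwise 1x1, the usual factorized convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

def standard_conv(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

# Alternating arrangement: standard conv, then depthwise separable conv, repeated.
backbone_stage = nn.Sequential(
    standard_conv(3, 16, stride=2),
    depthwise_separable(16, 32),
    standard_conv(32, 32, stride=2),
    depthwise_separable(32, 64),
)
```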
3. The method according to claim 1 or 2, wherein the processing the plurality of first feature maps through the feature fusion network to obtain a plurality of second feature maps comprises:
processing the plurality of first feature maps through a first fusion sub-network, so as to fuse features among the first feature maps to obtain a plurality of third feature maps;
and respectively processing the plurality of third feature maps through a second fusion sub-network to obtain a plurality of second feature maps.
4. The method of claim 3, wherein the plurality of first feature maps comprises at least a feature map C1 and a feature map C2, the feature map C1 being larger in size than the feature map C2, and the processing the plurality of first feature maps through the first fusion sub-network to fuse features among the first feature maps to obtain a plurality of third feature maps comprises:
processing the feature maps C1 and C2 respectively by a 1 × 1 convolutional layer to obtain feature maps M1 and P2 having the same number of channels;
up-sampling P2 to obtain a feature map M2_up, and superimposing the feature map M2_up with M1 to obtain a feature map M1_add;
processing M1_add with a 3 × 3 convolutional layer to obtain a feature map P1,
wherein P1 and P2 are third feature maps, and the up-sampling is implemented using a nearest neighbor interpolation algorithm.
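The fusion in claim 4 follows a lateral-connection pattern similar to a feature pyramid. A PyTorch sketch under that assumption is given below; the output channel count is illustrative, and nearest-neighbour up-sampling is used as stated in the claim.

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstFusion(nn.Module):
    def __init__(self, c1_ch, c2_ch, out_ch=64):
        super().__init__()
        self.lat1 = nn.Conv2d(c1_ch, out_ch, 1)                 # 1x1 conv: C1 -> M1
        self.lat2 = nn.Conv2d(c2_ch, out_ch, 1)                 # 1x1 conv: C2 -> P2
        self.smooth = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # 3x3 conv: M1_add -> P1

    def forward(self, c1, c2):
        m1 = self.lat1(c1)
        p2 = self.lat2(c2)
        # Nearest-neighbour up-sampling of P2 to M1's spatial size, then element-wise addition.
        m2_up = F.interpolate(p2, size=m1.shape[-2:], mode="nearest")
        p1 = self.smooth(m1 + m2_up)
        return p1, p2
```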
5. The method according to claim 3, wherein the processing the plurality of third feature maps separately through the second fusion sub-network to obtain a plurality of second feature maps comprises, for each third feature map P:
processing P with a convolutional layer having a first number of output channels to obtain a feature map S1;
processing P with a convolutional layer having a second number of output channels to obtain a feature map T;
processing T through two convolution paths respectively to obtain feature maps S2 and S3, wherein the number of channels of S2 and S3 is the second number of output channels;
superimposing S1, S2 and S3 along the channel dimension to obtain a second feature map F having a predetermined number of channels.
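One possible realization of the per-map processing in claim 5 is sketched below in PyTorch. The branch depths, the channel numbers, and the use of channel-wise concatenation for the final combination are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SecondFusion(nn.Module):
    def __init__(self, in_ch, ch1=128, ch2=64):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, ch1, 3, padding=1)   # -> S1 (first number of output channels)
        self.reduce = nn.Conv2d(in_ch, ch2, 3, padding=1)     # -> T  (second number of output channels)
        self.branch2 = nn.Conv2d(ch2, ch2, 3, padding=1)      # -> S2
        self.branch3 = nn.Sequential(                         # -> S3, a deeper path over T
            nn.Conv2d(ch2, ch2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch2, ch2, 3, padding=1),
        )

    def forward(self, p):
        s1 = self.branch1(p)
        t = self.reduce(p)
        s2 = self.branch2(t)
        s3 = self.branch3(t)
        # Channel-wise combination into the second feature map F (here: concatenation).
        return torch.cat([s1, s2, s3], dim=1)
```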
6. The method of claim 1 or 2, wherein the determining a plurality of prediction boxes based on the plurality of second feature maps comprises:
determining anchor positions based on pixel points in the second feature map;
determining the size of an anchor box based on the size of the second feature map, wherein the size of the second feature map is negatively correlated with the size of the anchor box;
determining the density of prediction boxes generated at each anchor position based on the down-sampling factor of the second feature map and the size of the anchor box;
determining a plurality of prediction boxes based on the anchor positions, the size of the anchor boxes, and the density of prediction boxes generated at the respective anchor positions, each prediction box including the following information: a position and a size of the prediction box, a first confidence that the prediction box is a positive sample, and a second confidence that the prediction box is a negative sample.
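A NumPy sketch of prediction-box (anchor) generation in the spirit of claim 6 is given below; the concrete anchor size, stride, and density values, and the realization of density as evenly spaced offsets within a cell, are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride, anchor_size, density=1):
    """Centered square anchor boxes for one second feature map.

    stride: down-sampling factor of the feature map with respect to the input image.
    anchor_size: chosen so that larger feature maps receive smaller anchors.
    density: number of anchors generated per axis at each anchor position.
    """
    boxes = []
    offsets = (np.arange(density) + 0.5) / density * stride
    for y in range(fmap_h):
        for x in range(fmap_w):
            for oy in offsets:
                for ox in offsets:
                    cx, cy = x * stride + ox, y * stride + oy
                    half = anchor_size / 2.0
                    boxes.append([cx - half, cy - half, cx + half, cy + half])
    return np.asarray(boxes, dtype=np.float32)

# Example: a 40x40 map with stride 8 and 16-pixel anchors, two anchors per axis per cell.
anchors = generate_anchors(40, 40, stride=8, anchor_size=16, density=2)
```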
7. The method of claim 6, wherein the determining a prediction result based on the prediction box and the second threshold comprises:
determining prediction boxes with a prediction score greater than the second threshold based on a difference between the first confidence and the second confidence;
sorting the prediction boxes with prediction scores greater than the second threshold according to the prediction scores through a binary-tree insertion and sequential sorting algorithm to obtain a sorting result;
and according to the sorting result, processing the prediction boxes with prediction scores greater than the second threshold through non-maximum suppression to filter out duplicate prediction boxes, thereby obtaining a prediction result.
8. The method of claim 1 or 2, further comprising:
obtaining a sample image;
mapping the brightness of the sample image to a specific interval to construct an augmented image;
and training a face detection model comprising the backbone convolutional neural network and the feature fusion network based on the sample image and the augmented image.
9. A face detection apparatus comprising:
a feature extraction module configured to process the face image data through a backbone convolutional neural network, wherein the backbone convolutional neural network comprises a plurality of processing stages, and each processing stage outputs a first feature map;
a feature fusion module configured to process the plurality of first feature maps through a feature fusion network to obtain a plurality of second feature maps;
a prediction box determination module configured to determine a plurality of prediction boxes based on the plurality of second feature maps;
a threshold acquisition module configured to acquire a first threshold indicating a confidence level under a flexible maximum operation;
a threshold conversion module configured to convert the first threshold into a second threshold that indicates the confidence level under an addition and subtraction operation, wherein the first threshold is converted into the second threshold by:
t2 = ln( t1 / (1 − t1) )
wherein t1 is the first threshold and t2 is the second threshold;
a result determination module configured to determine a predicted result based on the prediction box and the second threshold.
10. An electronic device comprising a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the steps of the method of any one of claims 1-8.
11. A readable storage medium having stored thereon computer instructions, characterized in that the computer instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.
CN202110202066.9A 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium Expired - Fee Related CN112560825B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110202066.9A CN112560825B (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium
CN202110762221.2A CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110202066.9A CN112560825B (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110762221.2A Division CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112560825A CN112560825A (en) 2021-03-26
CN112560825B (en) 2021-05-18

Family

ID=75034532

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110202066.9A Expired - Fee Related CN112560825B (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium
CN202110762221.2A Pending CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110762221.2A Pending CN113688663A (en) 2021-02-23 2021-02-23 Face detection method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (2) CN112560825B (en)

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220618B (en) * 2017-05-25 2019-12-24 中国科学院自动化研究所 Face detection method and device, computer readable storage medium and equipment
CN109472193A (en) * 2018-09-21 2019-03-15 北京飞搜科技有限公司 Method for detecting human face and device
CN109670430A (en) * 2018-12-11 2019-04-23 浙江大学 A kind of face vivo identification method of the multiple Classifiers Combination based on deep learning
CN109801270B (en) * 2018-12-29 2021-07-16 北京市商汤科技开发有限公司 Anchor point determining method and device, electronic equipment and storage medium
CN109753927A (en) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 A kind of method for detecting human face and device
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
CN110427821A (en) * 2019-06-27 2019-11-08 高新兴科技集团股份有限公司 A kind of method for detecting human face and system based on lightweight convolutional neural networks
CN110647817B (en) * 2019-08-27 2022-04-05 江南大学 Real-time face detection method based on MobileNet V3
CN111275166B (en) * 2020-01-15 2023-05-02 华南理工大学 Convolutional neural network-based image processing device, equipment and readable storage medium
CN111274994B (en) * 2020-02-13 2022-08-23 腾讯科技(深圳)有限公司 Cartoon face detection method and device, electronic equipment and computer readable medium
CN111368903B (en) * 2020-02-28 2021-08-27 深圳前海微众银行股份有限公司 Model performance optimization method, device, equipment and storage medium
CN111553227A (en) * 2020-04-21 2020-08-18 东南大学 Lightweight face detection method based on task guidance
CN111753682B (en) * 2020-06-11 2023-05-23 中建地下空间有限公司 Hoisting area dynamic monitoring method based on target detection algorithm
CN111753960B (en) * 2020-06-25 2023-08-08 北京百度网讯科技有限公司 Model training and image processing method and device, electronic equipment and storage medium
CN112287860B (en) * 2020-11-03 2022-01-07 北京京东乾石科技有限公司 Training method and device of object recognition model, and object recognition method and system

Also Published As

Publication number Publication date
CN112560825A (en) 2021-03-26
CN113688663A (en) 2021-11-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20210518