CN112633085A - Face detection method, system, storage medium and terminal based on an attention guidance mechanism - Google Patents
- Publication number: CN112633085A (application CN202011425736.5A)
- Authority: CN (China)
- Prior art keywords: feature map, feature, generate, module, face detection
- Legal status: Granted
Classifications
- G06V40/161 — Human faces: detection; localisation; normalisation
- G06V40/168 — Human faces: feature extraction; face representation
- G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Neural networks: combinations of networks
- G06N3/08 — Neural networks: learning methods
Abstract
The invention discloses a face detection method, system, storage medium and terminal based on an attention guidance mechanism. The method comprises: acquiring a target image to be detected and inputting it into a pre-trained face detection model; first performing feature extraction on the target image through the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the sequence as the first-branch original feature maps; a context extraction module of the face detection model performing channel splicing on each feature map of the first-branch original feature maps to generate spliced feature maps; an attention guidance module of the face detection model collecting the semantic relations and spatial information corresponding to the spliced feature maps to generate collected feature maps; generating second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps; and obtaining the detected face image from the enhanced feature maps. By adopting the embodiments of the application, face detection accuracy can be improved.
Description
Technical Field
The invention relates to the technical field of computer deep learning, and in particular to a face detection method, system, storage medium and terminal based on an attention guidance mechanism.
Background
In face detection tasks based on deep learning, small targets and small faces are difficult to detect and pose many technical challenges, because such regions have low resolution, are blurred, and contain heavy background noise.

Existing small-face detection methods mainly include: detecting small faces with a traditional image pyramid and multi-scale sliding windows; data augmentation methods that increase the number and variety of small-face samples to improve detection performance; feature-fusion methods that fuse multi-scale features from high and low layers; methods based on anchor sampling and matching strategies; and methods that exploit context information.

Since context information is crucial to performance in visual tasks, many detection algorithms design inter-layer fusion structures to extract it: DenseNet uses dense cross-layer connections for feature reuse, FPN fuses feature information from top and bottom layers, and DeepLabV3 uses an ASPP structure to enlarge the receptive field.

DSFD, a dual-branch face detection algorithm, combines the ideas of FPN and RFB and proposes a Feature Enhancement Module (FEM), which not only uses feature information across different layers but also obtains features with a larger receptive field through dilated convolution, thereby yielding more discriminative and robust features. However, the FEM merely groups and processes the FPN-fused feature maps and splices them to enlarge the receptive field; fine-grained and coarse-grained context features are not effectively fused, which reduces detection accuracy.
Disclosure of Invention
The embodiments of the application provide a face detection method, system, storage medium and terminal based on an attention guidance mechanism. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended neither to identify key or critical elements nor to delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present application provides a face detection method based on an attention guidance mechanism, the method comprising:

acquiring a target image to be detected, and inputting the target image into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module; the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;

performing a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the feature map sequence as the first-branch original feature maps;

performing channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps;

collecting the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps;

generating second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps;

and inputting the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image.
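For orientation only, the following minimal PyTorch-style sketch shows how the steps of the first aspect chain together at inference time; the attribute names `backbone`, `agfem` and `ssd_head` are illustrative assumptions, not identifiers used by the invention:

```python
import torch

def detect_faces(model, image: torch.Tensor):
    """Illustrative inference flow; `image` is a (1, 3, H, W) tensor."""
    # feature extraction via the extended VGG16 in the convolution block
    original_maps = model.backbone(image)                    # of1..of6, first branch
    # context extraction + attention guidance on each detection layer
    enhanced_maps = [model.agfem(f) for f in original_maps]  # ef1..ef6, second branch
    # SSD-style detection head applied to the enhanced feature maps
    return model.ssd_head(enhanced_maps)                     # detected face boxes
```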
Optionally, performing channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps includes:

the context extraction module groups the channels of the first-branch original feature maps to generate three groups of feature map sequences;

the context extraction module performs feature processing on the three groups of feature map sequences to generate three groups of feature-processed feature map sequences;

the context extraction module performs feature fusion again on each feature map in the three groups of dilated-convolution feature map sequences via 1 × 1 convolutions to generate three groups of re-convolved feature map sequences;
and the context extraction module splices the three groups of re-convolved feature map sequences to generate a spliced feature map.
Optionally, the context extraction module performing feature processing on the three groups of feature map sequences to generate three groups of feature-processed feature map sequences includes:
the context extraction module uses different dilated convolution layers on the first of the three groups of feature channels to extract multi-scale face feature information and generate a first refined feature map sequence; the dilated convolution kernel is 3 × 3 and the dilation rate is 3;

the context extraction module applies a 1 × 1 convolution to the second of the three groups of feature channels to increase the number of effective feature weights and generate a second refined feature map sequence;

the context extraction module performs global feature extraction on the third of the three groups of feature channels to generate a global feature map sequence;

the context extraction module performs channel splicing on the first refined feature map sequence, the second refined feature map sequence and the global feature map sequence to generate a spliced feature map sequence;

and the context extraction module performs feature fusion on the three groups of dilated-convolution feature map sequences using 1 × 1 convolutions to generate three groups of feature-processed feature map sequences.
Optionally, the context extraction module performs global feature extraction on a third group of the three groups of feature channels to generate a global feature map sequence, including:
the context extraction module applies Global Average Pooling (GAP) to the third of the three groups of feature channels to generate a pooled feature map sequence;

the context extraction module changes the channel dimension of the pooled feature map sequence with a 1 × 1 convolution to generate a changed feature map sequence;

and the context extraction module upsamples the changed feature map sequence to the spatial dimension of a preset threshold to generate the global feature map sequence.
Optionally, collecting the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module and generating the collected feature maps includes:
the attention guidance module extracts the semantic relation between any two positions in the spliced feature map;

the attention guidance module collects the spatial information between any two positions in the spliced feature map;

and the attention guidance module combines the semantic relation and the spatial information to generate the collected feature map.
Optionally, the pre-trained face detection model is generated according to the following steps:
adopting the expanded convolutional neural network VGG16 to create a backbone network;
adding the convolution block and the attention-guided feature enhancement module to the created backbone network to generate a face detection model; wherein the attention-guided feature enhancement module is composed of an attention guidance module (AM) and a context extraction module (CEM);
loading the detection layer sequence of the first branch, taking 6 layers in a backbone network of the face detection model as the detection layer sequence of the first branch, and generating a replaced face detection model;
collecting a training sample with a face image, inputting the training sample with the face image into the replaced face detection model for training, and outputting a progressive anchor loss value of the face detection model;
and when the progressive anchor loss value of the face detection model reaches a preset minimum, generating the trained face detection model.

Optionally, generating the trained face detection model when the progressive anchor loss value reaches the preset minimum includes:

when the progressive anchor loss value of the face detection model does not reach the preset minimum, continuing to execute the step of collecting training samples with face images; or

when the number of training passes over the training samples with face images has not reached a preset count, continuing to execute the step of collecting training samples with face images.
In a second aspect, an embodiment of the present application provides a face detection system based on an attention guidance mechanism, where the system includes:
the image acquisition module is used for acquiring a target image to be detected and inputting it into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module; the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;

the first-branch original feature map generation module is used for performing a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the feature map sequence as the first-branch original feature maps;
the first feature map generation module is used for performing channel splicing on each feature map in the first branch original feature map based on the context extraction module to generate a spliced feature map;
the second feature map generation module is used for acquiring semantic relations and spatial information corresponding to the spliced feature maps according to the attention guidance module and generating the acquired feature maps;
the enhanced feature map generation module is used for generating a second branch enhanced feature map based on the first branch original feature map and the collected feature map;
and the face image output module is used for inputting the second branch enhanced feature map into the SSD target detection algorithm head of the face detection model to obtain a detected face image.
In a third aspect, embodiments of the present application provide a computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides a terminal, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
In the embodiments of the application, a face detection system based on an attention guidance mechanism first acquires a target image to be detected and inputs it into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module, which in turn comprises an attention guidance module and a context extraction module. The system then performs a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence and selects 6 layers from the sequence as the first-branch original feature maps; performs channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps; collects the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps; generates the second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps; and finally inputs the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image. By enhancing features through the attention guidance module and the context extraction module, the face detection model focuses more on face features, so that face detection performance is greatly improved and detection accuracy is further increased.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flowchart of a face detection method based on an attention guidance mechanism according to an embodiment of the present application;

Fig. 2 is a schematic network structure diagram of the context extraction module in the face detection network according to an embodiment of the present application;

Fig. 3 is a structural diagram of the attention guidance module in the attention-guided feature enhancement module according to an embodiment of the present application;

Fig. 4 is a structural diagram of the face detection network according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of another face detection method based on an attention guidance mechanism according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a face detection system based on an attention guidance mechanism according to an embodiment of the present application;

Fig. 7 is a schematic diagram of a terminal according to an embodiment of the present application.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific circumstances. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
To date, existing small-face detection methods mainly include: detecting small faces with a traditional image pyramid and multi-scale sliding windows; data augmentation methods that increase the number and variety of small-face samples to improve detection performance; feature-fusion methods that fuse multi-scale features from high and low layers; methods based on anchor sampling and matching strategies; and methods that exploit context information. Since context information is crucial to performance in visual tasks, many detection algorithms design inter-layer fusion structures to extract it: DenseNet uses dense cross-layer connections for feature reuse, FPN fuses feature information from top and bottom layers, and DeepLabV3 uses an ASPP structure to enlarge the receptive field.

DSFD, a dual-branch face detection algorithm, combines the ideas of FPN and RFB and proposes a Feature Enhancement Module (FEM), which not only uses feature information across different layers but also obtains features with a larger receptive field through dilated convolution, thereby yielding more discriminative and robust features. However, the FEM merely groups and processes the FPN-fused feature maps and splices them to enlarge the receptive field; fine-grained and coarse-grained context features are not effectively fused, which reduces detection accuracy. The present application therefore provides a face detection method, system, storage medium and terminal based on an attention guidance mechanism to solve the above problems in the related art. In the technical solution provided by the application, after enhancement by the attention guidance module and the context extraction module, the face detection model focuses more on face features, so that face detection performance is greatly improved and detection accuracy is further increased; this is described in detail below through exemplary embodiments.
The following describes in detail a face detection method based on an attention guidance mechanism according to an embodiment of the present application with reference to fig. 1 to 5. The method may be implemented by a computer program and can run on a face detection system based on an attention guidance mechanism built on the von Neumann architecture. The computer program may be integrated into an application or may run as an independent tool application.
Referring to fig. 1, a schematic flow chart of a face detection method based on an attention-guiding mechanism is provided in an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may include the following steps:
S101, acquiring a target image to be detected, and inputting the target image to be detected into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module; the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;
The target image is an image to be detected that contains one or more faces; it may be a face image acquired in real time or one stored on a computer, i.e., obtained either online or offline.
Generally, the pre-trained face detection model is a mathematical model with a small-face detection capability. When training the face detection model, the extended convolutional neural network VGG16 is first used to create a backbone network, and the convolution block and the attention-guided feature enhancement module are added to the created backbone network to generate the face detection model; the attention-guided feature enhancement module is composed of an attention guidance module (AM) and a context extraction module (CEM). The detection layer sequence of the first branch is then loaded, taking 6 layers of the backbone network of the face detection model as the detection layer sequence of the first branch to generate the replaced face detection model. Training samples with face images are collected and input into the replaced face detection model for training, and the progressive anchor loss value of the face detection model is output. Finally, when the progressive anchor loss value reaches a preset minimum, the trained face detection model is generated.
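As a hedged illustration of this training procedure, the sketch below implements the described stopping criteria (loss below a preset minimum, or a preset number of passes); the optimizer choice, learning rate and the `progressive_anchor_loss` callable are assumptions, since the patent does not specify them:

```python
import torch

def train_face_detector(model, loader, progressive_anchor_loss,
                        min_loss=0.05, max_epochs=100, lr=1e-3):
    """Hypothetical training loop: stop once the progressive anchor loss
    value falls below a preset minimum or the preset pass count is reached."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, targets in loader:             # training samples with faces
            optimizer.zero_grad()
            first_out, second_out = model(images)  # both detection branches
            loss = progressive_anchor_loss(first_out, second_out, targets)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < min_loss:    # preset minimum reached
            break
    return model
```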
Further, when the progressive anchor loss value of the face detection model does not reach a preset minimum value, continuing to execute the step of collecting the training sample with the face image; or when the training times of the training samples with the face images do not reach the preset times, continuing to execute the step of collecting the training samples with the face images.
In particular, the context extraction module (CEM) can exploit rich context information from receptive fields of various sizes, and the attention guidance module (AM) can enhance salient context dependencies.
In a possible implementation manner, when a face in a target image is detected, a target image with the face is collected by a camera, and then the target image with the face is input into a pre-trained face detection model for processing.
S102, performing a feature extraction operation on the target image to be detected using the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the feature map sequence as the first-branch original feature maps;
In general, the face detection model is provided with a convolution block. A convolution operation is first performed on the target image through the convolution block to generate a series of feature maps of the target image, and 6 layers of feature maps are then selected from the generated series as the original feature maps of the first branch.
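A sketch of such an extended backbone is given below, using the six detection layers named later in this description (conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2, conv7_2); the channel counts of the auxiliary layers are assumptions modelled on SSD-style extensions, not values taken from the patent:

```python
import torch.nn as nn
from torchvision.models import vgg16

class ExtendedVGG16(nn.Module):
    """Extended VGG16 sketch: fully connected layers replaced by auxiliary
    convolutions; six intermediate maps are returned as the first branch."""
    def __init__(self):
        super().__init__()
        self.features = vgg16(weights=None).features  # convolutional part only
        self.taps = [15, 22, 29]   # ReLU outputs of conv3_3, conv4_3, conv5_3
        self.conv_fc7 = nn.Sequential(                # replaces fc6/fc7
            nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True))
        self.conv6_2 = nn.Sequential(
            nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.conv7_2 = nn.Sequential(
            nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        maps = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.taps:
                maps.append(x)                 # of1, of2, of3
        x = self.conv_fc7(x); maps.append(x)   # of4
        x = self.conv6_2(x);  maps.append(x)   # of5
        x = self.conv7_2(x);  maps.append(x)   # of6
        return maps
```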
S103, channel splicing is carried out on each feature map in the first branch original feature map based on the context extraction module, and a spliced feature map is generated;
In the embodiment of the application, when performing channel splicing on each feature map of the first-branch original feature maps, the context extraction module first groups the channels of the first-branch original feature maps to generate three groups of feature map sequences, then performs feature processing on the three groups to generate three groups of feature-processed feature map sequences, then performs feature fusion again on each feature map in the three groups of dilated-convolution feature map sequences via 1 × 1 convolutions to generate three groups of re-convolved feature map sequences, and finally splices the three re-convolved groups to generate the spliced feature map.
Further, when the context extraction module performs feature processing on the three groups of feature map sequences to generate the three groups of feature-processed feature map sequences, it first uses different dilated convolution layers on the first of the three groups of feature channels to extract multi-scale face feature information and generate a first refined feature map sequence; the dilated convolution kernel is 3 × 3 and the dilation rate is 3. It then applies a 1 × 1 convolution to the second group to increase the number of effective feature weights and generate a second refined feature map sequence, and performs global feature extraction on the third group to generate a global feature map sequence. The first refined, second refined and global feature map sequences are channel-spliced to generate a spliced feature map sequence, and finally feature fusion is performed on the three groups of dilated-convolution feature map sequences using 1 × 1 convolutions to generate the three groups of feature-processed feature map sequences.
When different dilated convolution layers are applied to the first of the three groups of feature channels to extract multi-scale face feature information, the first group is divided again into 3 sub-groups: the first sub-group is processed with one dilated convolution, the second with two stacked dilated convolutions, and the third with three stacked dilated convolutions; the three processed sub-groups are finally spliced to generate the first refined feature map sequence.
Further, when the context extraction module performs global feature extraction on the third of the three groups of feature channels to generate the global feature map sequence, it first applies Global Average Pooling (GAP) to the third group to generate a pooled feature map sequence, then changes the channel dimension of the pooled sequence with a 1 × 1 convolution to generate a changed feature map sequence, and finally upsamples the changed sequence to the spatial dimension of a preset threshold to generate the global feature map sequence.
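Pulling the above steps together, the following sketch gives one plausible reading of the context extraction module; the exact sub-group arithmetic and activation choices are assumptions, since the patent specifies only the grouping, the 3 × 3 dilated convolutions with dilation rate 3, the 1 × 1 convolutions, and the GAP branch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextExtractionModule(nn.Module):
    """CEM sketch: channels split into three groups, processed by
    (a) cascades of 1/2/3 dilated 3x3 convolutions (rate 3),
    (b) a 1x1 convolution, (c) global average pooling, then spliced
    along channels and fused by a final 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        g = channels // 3                  # per-group channel count
        s = g // 3                         # sub-group size inside group 1
        self.g, self.s = g, s
        def dconv(c):                      # 3x3 dilated conv, dilation rate 3
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=3, dilation=3), nn.ReLU(inplace=True))
        self.sub1 = dconv(s)                               # one dilated conv
        self.sub2 = nn.Sequential(dconv(s), dconv(s))      # two stacked
        c3 = g - 2 * s
        self.sub3 = nn.Sequential(dconv(c3), dconv(c3), dconv(c3))  # three stacked
        self.branch2 = nn.Conv2d(g, g, 1)                  # 1x1 conv branch
        cg = channels - 2 * g
        self.branch3 = nn.Conv2d(cg, cg, 1)                # GAP branch 1x1 conv
        self.fuse = nn.Conv2d(channels, channels, 1)       # final 1x1 fusion

    def forward(self, fd):
        g, s = self.g, self.s
        x1, x2, x3 = fd[:, :g], fd[:, g:2 * g], fd[:, 2 * g:]
        # group 1 re-divided into 3 sub-groups with 1/2/3 dilated convolutions
        y1 = torch.cat([self.sub1(x1[:, :s]),
                        self.sub2(x1[:, s:2 * s]),
                        self.sub3(x1[:, 2 * s:])], dim=1)
        y2 = self.branch2(x2)              # increases effective feature weights
        # group 3: global average pooling -> 1x1 conv -> upsample back
        y3 = F.adaptive_avg_pool2d(x3, 1)
        y3 = F.interpolate(self.branch3(y3), size=fd.shape[-2:], mode='nearest')
        return self.fuse(torch.cat([y1, y2, y3], dim=1))   # spliced map Fc
```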
S104, collecting the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps;
in a possible implementation manner, the attention guiding module firstly extracts the semantic relationship between any two positions in the spliced feature map, then collects the spatial information between any two positions in the spliced feature map, and finally generates the collected feature map after combining the semantic relationship and the spatial information.
For example, as shown in fig. 2, which is a schematic network structure diagram of the context extraction module in the face detection network provided in an embodiment of the present application: the target image is first convolved by the convolution block to generate multi-layer feature maps Fd. The context extraction module then groups the channels, dividing Fd into three groups for dilated convolution processing; after processing, channel splicing is performed and a first feature map is output. The channel dimension is then changed with a 1 × 1 convolution to generate a changed feature map sequence, which is finally upsampled to the spatial dimension of a preset threshold to generate the global feature map sequence. Finally, channel splicing followed by 1 × 1 convolution processing yields the feature map Fc.
S105, generating a second branch enhanced feature map based on the first branch original feature map and the collected feature map;
the collected feature map is obtained by performing Global Average Pooling (GAP) on Fc.
In a possible implementation, after the collected feature map is obtained, the attention guidance module multiplies it element by element with the first-branch original feature map and then adds the results, finally generating the enhanced feature map of the second branch.
For example, as shown in fig. 3, which is a structural diagram of the attention guidance module in the attention-guided feature enhancement module provided in this embodiment of the application: Fc is the feature map generated by the context extraction module, and Fd is the first-branch original feature map. After Fc and Fd are obtained, each is processed by Global Average Pooling (GAP); the results are multiplied element by element with the feature maps, and the products are finally added element by element to generate the final second-branch enhanced feature map Fa.
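A minimal sketch of this step, under one plausible reading of fig. 3 (the sigmoid gating is an assumption; the patent only states GAP, element-wise multiplication and element-wise addition):

```python
import torch
import torch.nn.functional as F

def attention_guidance(fc: torch.Tensor, fd: torch.Tensor) -> torch.Tensor:
    """AM sketch: channel descriptors from global average pooling of each
    map reweight the other map element-wise; the two reweighted maps are
    then added element-wise to form the second-branch enhanced map Fa."""
    wc = torch.sigmoid(F.adaptive_avg_pool2d(fc, 1))  # (N, C, 1, 1) weights
    wd = torch.sigmoid(F.adaptive_avg_pool2d(fd, 1))
    return fc * wd + fd * wc                          # enhanced map Fa
```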
S106, inputting the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image.
For example, as shown in fig. 4, which is a structural diagram of the face detection network provided in the present application: first, VGG16 is extended to serve as the base backbone network of DSFD, i.e., the fully connected layers of VGG16 are replaced with auxiliary convolutional layers. The convolutional layers selected in the present application are:

conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2, which serve as the detection layers of the first branch and generate 6 original feature maps, named of1, of2, of3, of4, of5 and of6. The attention-guided feature enhancement module proposed in this application then converts the 6 original feature maps into 6 attention-guided feature maps, named ef1, ef2, ef3, ef4, ef5 and ef6, which have the same sizes as the corresponding original feature maps; by inputting these into the SSD-style head of the face detection model, the detection layers of the second branch are constructed. After using the attention guidance module to enlarge the receptive field, together with the new anchor design strategy, it is in principle unnecessary for the three sizes (stride, anchor, receptive field) to satisfy the equal-proportion-interval principle; DSFD is thus more flexible and more robust. Meanwhile, the original detection layers of the first branch and the detection layers constituting the second branch have 2 different loss values, named the first-shot progressive anchor loss (FSL) and the second-shot progressive anchor loss (SSL), respectively.
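The dual-branch wiring can be sketched as follows, reusing the illustrative `ExtendedVGG16`, `ContextExtractionModule` and `attention_guidance` defined above; the head layout (a single 3 × 3 convolution predicting 4 box offsets and 1 face score per anchor) and the channel list are assumptions, not the patent's exact configuration:

```python
import torch.nn as nn

class DualBranchDetector(nn.Module):
    """Sketch of the fig. 4 structure: first branch on of1..of6 (FSL),
    second branch on the attention-guided maps ef1..ef6 (SSL)."""
    def __init__(self, channel_list=(256, 512, 512, 1024, 512, 256),
                 num_anchors=1):
        super().__init__()
        self.backbone = ExtendedVGG16()
        self.agfems = nn.ModuleList(
            ContextExtractionModule(c) for c in channel_list)
        self.first_heads = nn.ModuleList(      # first-shot detection layers
            nn.Conv2d(c, num_anchors * 5, 3, padding=1) for c in channel_list)
        self.second_heads = nn.ModuleList(     # second-shot detection layers
            nn.Conv2d(c, num_anchors * 5, 3, padding=1) for c in channel_list)

    def forward(self, x):
        of = self.backbone(x)                  # of1..of6
        ef = [attention_guidance(m(f), f)      # CEM then AM -> ef1..ef6
              for m, f in zip(self.agfems, of)]
        first = [h(f) for h, f in zip(self.first_heads, of)]
        second = [h(f) for h, f in zip(self.second_heads, ef)]
        return first, second
```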
In the embodiments of the application, a face detection system based on an attention guidance mechanism first acquires a target image to be detected and inputs it into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module, which in turn comprises an attention guidance module and a context extraction module. The system then performs a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence and selects 6 layers from the sequence as the first-branch original feature maps; performs channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps; collects the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps; generates the second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps; and finally inputs the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image. By enhancing features through the attention guidance module and the context extraction module, the face detection model focuses more on face features, so that face detection performance is greatly improved and detection accuracy is further increased.
Please refer to fig. 5, which is a schematic flowchart of another face detection method based on an attention guidance mechanism according to an embodiment of the present disclosure. The method may include the following steps:
S201, acquiring a target image to be detected, and inputting the target image to be detected into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module; the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;

S202, performing a feature extraction operation on the target image to be detected using the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the feature map sequence as the first-branch original feature maps;
S203, the context extraction module groups the channels of the first-branch original feature maps to generate three groups of feature map sequences;

S204, the context extraction module uses different dilated convolution layers on the first of the three groups of feature channels to extract multi-scale face feature information and generate a first refined feature map sequence; the dilated convolution kernel is 3 × 3 and the dilation rate is 3;

S205, the context extraction module applies a 1 × 1 convolution to the second of the three groups of feature channels to increase the number of effective feature weights and generate a second refined feature map sequence;

S206, the context extraction module performs global feature extraction on the third of the three groups of feature channels to generate a global feature map sequence;

S207, the context extraction module performs channel splicing on the first refined feature map sequence, the second refined feature map sequence and the global feature map sequence to generate a spliced feature map sequence;

S208, the context extraction module performs feature fusion on the three groups of dilated-convolution feature map sequences using 1 × 1 convolutions to generate three groups of feature-processed feature map sequences;

S209, the context extraction module performs feature fusion again on each feature map in the three groups of dilated-convolution feature map sequences via 1 × 1 convolutions to generate three groups of re-convolved feature map sequences;

S210, the context extraction module splices the three groups of re-convolved feature map sequences to generate a spliced feature map;

S211, the attention guidance module extracts the semantic relation between any two positions in the spliced feature map;

S212, the attention guidance module collects the spatial information between any two positions in the spliced feature map;

S213, the attention guidance module combines the semantic relation and the spatial information to generate the collected feature map.
Further, a second-branch enhanced feature map is generated based on the first-branch original feature maps and the collected feature maps, and the second-branch enhanced feature map is input into the SSD detection head of the face detection model to obtain the detected face image.
In the embodiments of the application, a face detection system based on an attention guidance mechanism first acquires a target image to be detected and inputs it into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module, which in turn comprises an attention guidance module and a context extraction module. The system then performs a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence and selects 6 layers from the sequence as the first-branch original feature maps; performs channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps; collects the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps; generates the second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps; and finally inputs the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image. By enhancing features through the attention guidance module and the context extraction module, the face detection model focuses more on face features, so that face detection performance is greatly improved and detection accuracy is further increased.
The following are embodiments of systems of the present invention that may be used to perform embodiments of methods of the present invention. For details which are not disclosed in the embodiments of the system of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 6, a schematic structural diagram of a face detection system based on an attention guidance mechanism according to an exemplary embodiment of the present invention is shown. The face detection system based on the attention guidance mechanism can be implemented as all or part of an intelligent robot through software, hardware, or a combination of both. The system 1 comprises an image acquisition module 10, a first-branch original feature map generation module 20, a first feature map generation module 30, a second feature map generation module 40, an enhanced feature map generation module 50 and a face image output module 60.
The image acquisition module 10 is configured to acquire a target image to be detected and input it into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module; the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;

the first-branch original feature map generation module 20 is configured to perform a feature extraction operation on the target image to be detected using the extended VGG16 in the convolution block to generate a feature map sequence, and to select 6 layers from the feature map sequence as the first-branch original feature maps;
a first feature map generation module 30, configured to perform channel splicing on each feature map in the first branch original feature map based on the context extraction module, and generate a spliced feature map;
the second feature map generation module 40 is configured to collect, via the attention guidance module, the semantic relations and spatial information corresponding to the spliced feature maps and to generate the collected feature maps;
an enhanced feature map generation module 50, configured to generate a second branch enhanced feature map based on the first branch original feature map and the acquired feature map;
and the face image output module 60 is configured to input the second branch enhanced feature map into the SSD target detection algorithm head of the face detection model, and obtain a detected face image.
It should be noted that, when the face detection system based on the attention guidance mechanism provided in the foregoing embodiment executes the face detection method based on the attention guidance mechanism, the division into the above functional modules is only an example; in practical applications, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the face detection system provided by the above embodiment and the embodiments of the face detection method based on the attention guidance mechanism belong to the same concept; for the detailed implementation process, refer to the method embodiments, which are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiments of the application, a face detection system based on an attention guidance mechanism first acquires a target image to be detected and inputs it into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module, which in turn comprises an attention guidance module and a context extraction module. The system then performs a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence and selects 6 layers from the sequence as the first-branch original feature maps; performs channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps; collects the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps; generates the second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps; and finally inputs the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image. By enhancing features through the attention guidance module and the context extraction module, the face detection model focuses more on face features, so that face detection performance is greatly improved and detection accuracy is further increased.
The present invention also provides a computer-readable medium on which program instructions are stored; when executed by a processor, the program instructions implement the face detection method based on an attention guidance mechanism provided by the above method embodiments.

The present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the face detection method based on an attention guidance mechanism of the above method embodiments.
Please refer to fig. 7, which provides a schematic structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 7, terminal 1000 can include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
Wherein a communication bus 1002 is used to enable connective communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The Memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory 1005 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 1005 may optionally be at least one memory system located remotely from the processor 1001. As shown in fig. 7, a memory 1005, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a face detection application based on an attention-directed mechanism.
In the terminal 1000 shown in fig. 7, the user interface 1003 is mainly used as an interface for providing input for a user and acquiring data input by the user, and the processor 1001 may be configured to invoke the face detection application based on an attention guidance mechanism stored in the memory 1005 and specifically perform the following operations:

acquiring a target image to be detected, and inputting the target image to be detected into a pre-trained face detection model; the face detection model comprises a convolution block and an attention-guided feature enhancement module; the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;

performing a feature extraction operation on the target image to be detected using the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the feature map sequence as the first-branch original feature maps;

performing channel splicing on each feature map of the first-branch original feature maps via the context extraction module to generate spliced feature maps;

collecting the semantic relations and spatial information corresponding to the spliced feature maps via the attention guidance module to generate collected feature maps;

generating second-branch enhanced feature maps based on the first-branch original feature maps and the collected feature maps;

and inputting the second-branch enhanced feature maps into the SSD detection head of the face detection model to obtain the detected face image.
In an embodiment, when the processor 1001 performs channel splicing on each feature map in the first branch original feature map based on the context extraction module to generate a spliced feature map, the following operations are specifically performed:
the context extraction module carries out channel grouping on the first branch original feature map to generate three groups of feature map sequences;
the context extraction module performs feature processing on the three groups of feature map sequences to generate three groups of feature-processed feature map sequences;

the context extraction module performs feature fusion again on each feature map in the three groups of dilated-convolution feature map sequences via 1 × 1 convolutions to generate three groups of re-convolved feature map sequences;
and the context extraction module splices the three groups of re-convolved feature map sequences to generate a spliced feature map.
In an embodiment, when the processor 1001 performs feature processing on the three groups of feature map sequences via the context extraction module to generate three groups of feature-processed feature map sequences, the following operations are specifically performed:
the context extraction module uses different dilated convolution layers on the first of the three groups of feature channels to extract multi-scale face feature information and generate a first refined feature map sequence; the dilated convolution kernel is 3 × 3 and the dilation rate is 3;

the context extraction module applies a 1 × 1 convolution to the second of the three groups of feature channels to increase the number of effective feature weights and generate a second refined feature map sequence;

the context extraction module performs global feature extraction on the third of the three groups of feature channels to generate a global feature map sequence;

the context extraction module performs channel splicing on the first refined feature map sequence, the second refined feature map sequence and the global feature map sequence to generate a spliced feature map sequence;

and the context extraction module performs feature fusion on the three groups of dilated-convolution feature map sequences using 1 × 1 convolutions to generate three groups of feature-processed feature map sequences.
In one embodiment, when the processor 1001 executes the context extraction module to perform global feature extraction on the third of the three feature-channel groups and generate a global feature map sequence, the following operations are specifically performed:
the context extraction module applies global average pooling (GAP) to the third of the three feature-channel groups to generate a pooled feature map sequence;
the context extraction module changes the channel dimension of the pooled feature map sequence using a 1×1 convolution to generate a changed feature map sequence;
and the context extraction module upsamples the changed feature map sequence to a preset spatial dimension to generate the global feature map sequence, as sketched below.
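A minimal sketch of this global branch follows; restoring the input's own spatial size and the nearest-neighbour upsampling mode are assumptions where the text only specifies "a preset spatial dimension".

```python
import torch.nn as nn
import torch.nn.functional as F

class GlobalBranch(nn.Module):
    """Sketch: GAP -> 1x1 convolution -> upsample to a preset spatial size."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling to 1x1
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]                 # assumed preset spatial dimension
        g = self.proj(self.gap(x))          # 1x1 global descriptor, new channels
        return F.interpolate(g, size=(h, w), mode="nearest")
```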
In one embodiment, when the processor 1001 collects the semantic relations and spatial information corresponding to the spliced feature map according to the attention guidance module to generate a collected feature map, the following operations are specifically performed:
the attention guidance module extracts the semantic relation between any two positions in the spliced feature map;
the attention guidance module collects the spatial information between any two positions in the spliced feature map;
and the attention guidance module combines the semantic relation and the spatial information to generate the collected feature map (an assumed realization is sketched below).
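The text does not pin down the attention mechanism; one common way to relate "any two positions" of a feature map is non-local (self-) attention, so the sketch below should be read as an assumed stand-in rather than the patented module.

```python
import torch
import torch.nn as nn

class AttentionGuidanceModule(nn.Module):
    """Assumed non-local attention over all position pairs of a feature map."""

    def __init__(self, channels: int):
        super().__init__()
        reduced = max(channels // 8, 1)      # assumed query/key reduction
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable blend weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.key(x).flatten(2)                    # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)           # relation of every position pair
        v = self.value(x).flatten(2)                  # B x C x HW
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return self.gamma * out + x  # blend attended map with the spatial input
```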
In the embodiment of the application, the face detection system based on the attention guidance mechanism first acquires a target image to be detected and inputs it into a pre-trained face detection model, where the face detection model comprises a convolution block and an attention-guided feature enhancement module, and the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module. The system then performs a feature extraction operation on the target image using the extended VGG16 in the convolution block to generate a feature map sequence and selects 6 layers from the sequence as the first branch original feature map; performs channel splicing on each feature map in the first branch original feature map based on the context extraction module to generate a spliced feature map; collects the semantic relations and spatial information corresponding to the spliced feature map according to the attention guidance module to generate a collected feature map; generates a second branch enhanced feature map based on the first branch original feature map and the collected feature map; and finally inputs the second branch enhanced feature map into the SSD detection head of the face detection model to obtain the detected face image. With the enhancement provided by the attention guidance module and the context extraction module, the face detection model focuses more on facial features, which substantially improves face detection performance and, in turn, face detection accuracy.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory, or a random access memory.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit its scope; all equivalent variations and modifications made in accordance with the claims of the present application still fall within the scope of the application.
Claims (10)
1. A face detection method based on an attention guidance mechanism, characterized by comprising the following steps:
acquiring a target image to be detected, and inputting the target image to be detected into a pre-trained face detection model; wherein the face detection model comprises a convolution block and an attention-guided feature enhancement module, and the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;
performing a feature extraction operation on the target image to be detected using the extended VGG16 in the convolution block to generate a feature map sequence, and selecting 6 layers from the feature map sequence as a first branch original feature map;
performing channel splicing on each feature map in the first branch original feature map based on the context extraction module to generate a spliced feature map;
collecting semantic relations and spatial information corresponding to the spliced feature map according to the attention guidance module to generate a collected feature map;
generating a second branch enhanced feature map based on the first branch original feature map and the collected feature map;
and inputting the second branch enhanced feature map into an SSD detection head of the face detection model to obtain a detected face image.
2. The method according to claim 1, wherein the performing channel splicing on each feature map in the first branch original feature map based on the context extraction module to generate a spliced feature map comprises:
the context extraction module performs channel grouping on the first branch original feature map to generate three groups of feature map sequences;
the context extraction module performs feature processing on the three groups of feature map sequences to generate three groups of feature-processed feature map sequences;
the context extraction module re-fuses each feature map in the three groups of feature-processed feature map sequences through a 1×1 convolution to generate three groups of re-convolved feature map sequences;
and the context extraction module splices the three groups of re-convolved feature map sequences to generate the spliced feature map.
3. The method according to claim 2, wherein the context extraction module performing feature processing on the three groups of feature map sequences to generate three groups of feature-processed feature map sequences comprises:
the context extraction module applies dilated convolution layers to the first of the three feature-channel groups to extract multi-scale feature information of the face, generating a first refined feature map sequence; wherein the dilated convolution kernel is 3×3 and its dilation rate is 3;
the context extraction module applies a 1×1 convolution to the second of the three feature-channel groups to increase the number of effective feature weights, generating a second refined feature map sequence;
the context extraction module performs global feature extraction on the third of the three feature-channel groups to generate a global feature map sequence;
the context extraction module performs channel splicing on the first refined feature map sequence, the second refined feature map sequence, and the global feature map sequence to generate a spliced feature map sequence;
and the context extraction module performs feature fusion on the spliced feature map sequence using a 1×1 convolution to generate the three groups of feature-processed feature map sequences.
4. The method according to claim 3, wherein the context extraction module performing global feature extraction on the third of the three feature-channel groups to generate a global feature map sequence comprises:
the context extraction module applies global average pooling (GAP) to the third of the three feature-channel groups to generate a pooled feature map sequence;
the context extraction module changes the channel dimension of the pooled feature map sequence using a 1×1 convolution to generate a changed feature map sequence;
and the context extraction module upsamples the changed feature map sequence to a preset spatial dimension to generate the global feature map sequence.
5. The method according to claim 1, wherein the collecting semantic relations and spatial information corresponding to the spliced feature map according to the attention guidance module to generate a collected feature map comprises:
the attention guidance module extracts the semantic relation between any two positions in the spliced feature map;
the attention guidance module collects the spatial information between any two positions in the spliced feature map;
and the attention guidance module combines the semantic relation and the spatial information to generate the collected feature map.
6. The method according to claim 1, wherein generating the pre-trained face detection model comprises:
creating a backbone network using the extended convolutional neural network VGG16;
adding a convolution block and an attention-guided feature enhancement module to the created backbone network to generate a face detection model; wherein the attention-guided feature enhancement module consists of an attention guidance module (AM) and a context extraction module (CEM);
loading a detection layer sequence of a first branch, and taking 6 layers in the backbone network of the face detection model as the detection layer sequence of the first branch to generate a replaced face detection model;
collecting training samples with face images, inputting the training samples into the replaced face detection model for training, and outputting a progressive anchor loss value of the face detection model;
and when the progressive anchor loss value of the face detection model reaches a preset minimum value, generating the trained face detection model.
7. The method according to claim 6, wherein the generating the trained face detection model when the progressive anchor loss value of the face detection model reaches a preset minimum value comprises:
when the progressive anchor loss value of the face detection model does not reach the preset minimum value, continuing to execute the step of collecting training samples with face images; or
when the number of training iterations on the training samples with face images does not reach a preset number, continuing to execute the step of collecting training samples with face images.
8. A face detection system based on an attention guidance mechanism, characterized in that the system comprises:
an image acquisition module, configured to acquire a target image to be detected and input the target image to be detected into a pre-trained face detection model; wherein the face detection model comprises a convolution block and an attention-guided feature enhancement module, and the attention-guided feature enhancement module comprises an attention guidance module and a context extraction module;
a first branch original feature map generation module, configured to perform a feature extraction operation on the target image to be detected using the extended VGG16 in the convolution block to generate a feature map sequence, and select 6 layers from the feature map sequence as a first branch original feature map;
a first feature map generation module, configured to perform channel splicing on each feature map in the first branch original feature map based on the context extraction module to generate a spliced feature map;
a second feature map generation module, configured to collect semantic relations and spatial information corresponding to the spliced feature map according to the attention guidance module to generate a collected feature map;
an enhanced feature map generation module, configured to generate a second branch enhanced feature map based on the first branch original feature map and the collected feature map;
and a face image output module, configured to input the second branch enhanced feature map into the SSD detection head of the face detection model to obtain a detected face image.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
10. A terminal, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 7.
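Outside the claims proper, the training procedure of claims 6 and 7 reduces to a loop with two stopping conditions: the loss reaches a preset minimum, or a preset iteration budget is exhausted. A hedged sketch follows; the loss is only named "progressive anchor loss" in the text, so `progressive_anchor_loss`, `min_loss`, and `max_steps` are placeholders, not confirmed details of the patented training scheme.

```python
def train_face_detector(model, loader, optimizer, progressive_anchor_loss,
                        min_loss=0.05, max_steps=100_000):
    """Sketch of the claim-6/7 training loop; thresholds are assumptions."""
    step = 0
    while step < max_steps:                  # preset number of iterations
        for images, targets in loader:       # training samples with face images
            outputs = model(images)          # replaced face detection model
            loss = progressive_anchor_loss(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() <= min_loss:      # loss reached the preset minimum
                return model
            step += 1
            if step >= max_steps:
                break
    return model
```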
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011425736.5A CN112633085B (en) | 2020-12-08 | 2020-12-08 | Attention-oriented mechanism-based face detection method, system, storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633085A (en) | 2021-04-09
CN112633085B (en) | 2024-08-02
Family
ID=75308652
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110197182A (en) * | 2019-06-11 | 2019-09-03 | 中国电子科技集团公司第五十四研究所 | Remote sensing image semantic segmentation method based on contextual information and attention mechanism |
CN111274994A (en) * | 2020-02-13 | 2020-06-12 | 腾讯科技(深圳)有限公司 | Cartoon face detection method and device, electronic equipment and computer readable medium |
CN111461089A (en) * | 2020-06-17 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Face detection method, and training method and device of face detection model |
CN111898617A (en) * | 2020-06-29 | 2020-11-06 | 南京邮电大学 | Target detection method and system based on attention mechanism and parallel void convolution network |
Non-Patent Citations (1)
Title |
---|
JIAN LI et al., "DSFD: Dual Shot Face Detector", arXiv, pages 1-3
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113538313A (en) * | 2021-07-22 | 2021-10-22 | 深圳大学 | Polyp segmentation method and device, computer equipment and storage medium |
CN113538313B (en) * | 2021-07-22 | 2022-03-25 | 深圳大学 | Polyp segmentation method and device, computer equipment and storage medium |
CN117115928A (en) * | 2023-08-29 | 2023-11-24 | 北京国旺盛源智能终端科技有限公司 | Rural area network co-construction convenience service terminal based on multiple identity authentications |
CN117115928B (en) * | 2023-08-29 | 2024-03-22 | 北京国旺盛源智能终端科技有限公司 | Rural area network co-construction convenience service terminal based on multiple identity authentications |
Also Published As
Publication number | Publication date |
---|---|
CN112633085B (en) | 2024-08-02 |
Similar Documents
Publication | Title
---|---
CN111582316B (en) | RGB-D significance target detection method
JP7089106B2 (en) | Image processing methods and equipment, electronic devices, computer-readable storage media and computer programs
CN112396115B (en) | Attention mechanism-based target detection method and device and computer equipment
CN109255352B (en) | Target detection method, device and system
CN108876792B (en) | Semantic segmentation method, device and system and storage medium
CN111402143B (en) | Image processing method, device, equipment and computer readable storage medium
TWI717865B (en) | Image processing method and device, electronic equipment, computer readable recording medium and computer program product
CN111739035B (en) | Image processing method, device and equipment based on artificial intelligence and storage medium
CN112434721A (en) | Image classification method, system, storage medium and terminal based on small sample learning
CN112633077B (en) | Face detection method, system, storage medium and terminal based on in-layer multi-scale feature enhancement
CN107368550B (en) | Information acquisition method, device, medium, electronic device, server and system
CN110929735B (en) | Rapid significance detection method based on multi-scale feature attention mechanism
CN112633085B (en) | Attention-oriented mechanism-based face detection method, system, storage medium and terminal
JP6902811B2 (en) | Parallax estimation systems and methods, electronic devices and computer readable storage media
CN114416260B (en) | Image processing method, device, electronic equipment and storage medium
CN111967515A (en) | Image information extraction method, training method and device, medium and electronic equipment
CN112529897A (en) | Image detection method and device, computer equipment and storage medium
CN113869282A (en) | Face recognition method, hyper-resolution model training method and related equipment
CN112419342A (en) | Image processing method, image processing device, electronic equipment and computer readable medium
CN115131281A (en) | Method, device and equipment for training change detection model and detecting image change
CN114926734A (en) | Solid waste detection device and method based on feature aggregation and attention fusion
CN114511702A (en) | Remote sensing image segmentation method and system based on multi-scale weighted attention
CN117078602A (en) | Image stretching recognition and model training method, device, equipment, medium and product
CN111967478A (en) | Feature map reconstruction method and system based on weight inversion, storage medium and terminal
CN115223018B (en) | Camouflage object collaborative detection method and device, electronic equipment and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |