CN112036339A - Face detection method and device and electronic equipment


Info

Publication number: CN112036339A
Authority: CN (China)
Prior art keywords: face, feature map, feature, map, module
Legal status: Granted; currently Active
Application number: CN202010917905.0A
Other languages: Chinese (zh)
Other versions: CN112036339B (en)
Inventors: 张为义, 涂弘德, 刘以勒, 罗士杰
Current Assignee: Fujian Cook Intelligent Technology Co., Ltd.
Original Assignee: Fujian Cook Intelligent Technology Co., Ltd.
Application filed by Fujian Cook Intelligent Technology Co., Ltd.
Priority to CN202010917905.0A
Publication of CN112036339A
Application granted
Publication of CN112036339B

Classifications

    • G06V 40/45 — Detection of the body part being alive (spoof detection, e.g. liveness detection)
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 — Classification techniques
    • G06N 3/045 — Combinations of networks
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 40/165 — Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G06V 40/171 — Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V 40/172 — Classification, e.g. identification

Abstract

A face detection method, apparatus and electronic device that can perform face detection and living body (liveness) detection in dark or low-light environments and improve face detection efficiency and accuracy, thereby comprehensively improving the performance of face detection technology. The face detection method includes the following steps: acquiring a depth map of a target to be detected and performing feature extraction on the depth map to obtain a first feature map; performing face detection on the first feature map to obtain a face region feature map; performing feature extraction on the face region feature map to obtain a second feature map; performing living body detection on the second feature map to obtain a living body detection result of the face; and outputting, in the depth map, a face region frame including the living face according to the living body detection result of the face.

Description

Face detection method and device and electronic equipment
Technical Field
The present application relates to the field of biometric detection technologies, and in particular, to a method and an apparatus for detecting a human face, and an electronic device.
Background
Face detection is a biometric detection and identification technology that identifies a person's identity based on facial feature information. A video camera or still camera is used to collect images or video streams containing human faces, the faces in the images are automatically detected and tracked, and a series of related techniques such as image preprocessing, image feature extraction, and matching and recognition are then applied to the detected faces; these techniques are commonly referred to as face recognition, portrait recognition or facial recognition. With the rapid development of computer and network technologies, face detection technology has been widely applied in many industries and fields such as intelligent access control, mobile terminals, public security, entertainment and the military.
Most existing face detection technologies first find the approximate position of a candidate frame in an image and determine that the image content in the candidate frame is not background, then precisely locate the candidate frame and identify whether it is a face. This makes the face detection process complex, lowers detection efficiency, and provides no living body (liveness) detection information.
Therefore, how to perform face detection and living body detection in dark or low-light environments while improving the accuracy and efficiency of face detection, thereby comprehensively improving the performance of face detection devices, is a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a method and a device for face detection and electronic equipment, which can perform face detection and living body detection in dark or low-light environment, improve face detection efficiency and accuracy, and comprehensively improve the performance of face detection technology.
In a first aspect, a method for face detection is provided, including: acquiring a depth map of a target to be detected, and performing feature extraction on the depth map to obtain a first feature map; carrying out face detection on the first feature map to obtain a face region feature map; extracting the features of the face region feature map to obtain a second feature map; performing living body detection on the second feature map to obtain a living body detection result of the human face; and outputting a face region frame including the living face in the depth map according to the living body detection result of the face.
According to the scheme of the embodiment of the application, the face position in the target to be detected and the living body detection result of the face can be output synchronously, which improves the accuracy of face detection. At the same time, because the input image is a depth map of the target to be detected, the influence of ambient light on face detection can be avoided, so face detection can be performed effectively under low-light, no-light or backlit conditions, the efficiency of face detection is improved, and the performance of the face detection technology is comprehensively improved.
In some possible embodiments, the performing feature extraction on the depth map to obtain a first feature map includes: and performing feature extraction on the depth map by using a first face feature extraction module to obtain the first feature map, wherein the first feature map comprises edge line features in the depth map.
In some possible embodiments, the number of layers of the convolutional layers of the first facial feature extraction module is not greater than 4.
In some possible embodiments, the performing face detection on the first feature map to obtain a face region feature map includes: and carrying out face detection on the first feature map by adopting a face detection module to obtain a face region feature map.
In some possible embodiments, the face detection module comprises a convolutional layer network, a face range convolution layer and a face center convolution layer, and the performing face detection on the first feature map with the face detection module to obtain the face region feature map includes: performing convolution calculation on the first feature map with the convolutional layer network to obtain a first intermediate feature map; performing convolution calculation on the first intermediate feature map with the face range convolution layer and the face center convolution layer respectively to obtain a face region prediction map and a face center prediction map; and obtaining the face region feature map in the first feature map according to the face region prediction map and the face center prediction map.
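Purely for illustration, a minimal PyTorch-style sketch of such a face detection module follows; the layer depth, channel counts, kernel sizes and class names are assumptions made for readability and are not taken from the application.

```python
import torch
import torch.nn as nn

class FaceDetectionHead(nn.Module):
    """Illustrative sketch: a small convolutional layer network followed by two
    1x1 prediction branches (face range / face center), as described above.
    Channel counts and layer depth are assumed."""
    def __init__(self, in_channels=16, mid_channels=32):
        super().__init__()
        # "convolutional layer network": a few ordinary convolution layers
        self.conv_net = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # face range branch: predicts the size of the face region at each position
        self.face_range = nn.Conv2d(mid_channels, 2, kernel_size=1)
        # face center branch: predicts a face center heatmap
        self.face_center = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, first_feature_map):
        mid = self.conv_net(first_feature_map)              # first intermediate feature map
        scale_map = self.face_range(mid)                    # face region prediction map
        center_map = torch.sigmoid(self.face_center(mid))   # face center prediction map
        return scale_map, center_map
```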
In some possible embodiments, the face detection module further comprises a face feature concentration layer; the face feature concentration layer is used for assigning weights to the pixel values of the intermediate feature map so as to highlight the facial-feature regions (eyes, nose, mouth, etc.) in the intermediate feature map.
According to the scheme of the embodiment of the application, adding the face feature concentration layer to the face detection module allows the convolutional network to produce feature maps that better highlight the facial features, which increases the accuracy of subsequent face position detection and living body detection.
In some possible implementations, the face feature concentration layer is a space-based attention module.
In some possible embodiments, the method further comprises: and carrying out neural network training on the face detection module to obtain parameters of the face detection module.
In some possible embodiments, the face detection module further includes a center adjustment convolution layer, and the performing neural network training on the face detection module to obtain parameters of the face detection module includes: acquiring a sample image, wherein the sample image is annotated with a face region true value and a face center true value; performing convolution calculation on the sample image with the convolutional layer network to obtain a first sample feature map; performing convolution calculation on the first sample feature map with the face range convolution layer, the face center convolution layer and the center adjustment convolution layer respectively to obtain a face region predicted value, a face center predicted value and a face center offset predicted value; and calculating a loss function from the face region predicted value, the face center predicted value, the face center offset predicted value, the face region true value and the face center true value to obtain the parameters of the face detection module.
According to the scheme of the embodiment of the application, in the training process the center adjustment convolution layer is used to adjust the coordinates of the predicted face center position, which increases the robustness and accuracy of the center position prediction; in the actual face detection process, only the face range convolution layer and the face center convolution layer are used to obtain the face region, which improves the efficiency of the detection process and increases the speed of face detection.
In some possible embodiments, the face range convolution layer and the face center convolution layer are two 1 × 1 convolution layers, wherein the face center prediction map is a face center heatmap.
In some possible embodiments, the performing feature extraction on the face region feature map to obtain a second feature map includes: and performing feature extraction on the face region feature map by adopting a second face feature extraction module to obtain a second feature map, wherein the second feature map comprises the detail features of the face.
In some possible embodiments, the number of layers of the convolutional layers of the second face feature extraction module is not greater than 4.
In some possible embodiments, the second feature map includes facial features of five sense organs.
In some possible embodiments, the performing living body detection on the second feature map to obtain a living body detection result of the face includes: performing living body detection on the second feature map with a concentration module to obtain the living body detection result of the face, wherein the concentration module is an attention mechanism module that combines spatial and channel attention.
With the scheme of the embodiment of the application, this lightweight concentration module obtains the target feature map more simply and effectively than a concentration module that attends only to channels or only to space.
In some possible embodiments, the concentration module includes multiple convolutional layers, a channel attention module and a spatial attention module, and the performing living body detection on the second feature map with the concentration module to obtain the living body detection result of the face includes: performing convolution calculation on the second feature map with a first convolution layer to obtain a first intermediate feature map; processing the first intermediate feature map with the channel attention module to obtain a channel attention feature map; performing convolution calculation on the channel attention feature map and the first intermediate feature map with a second convolution layer to obtain a second intermediate feature map; processing the second intermediate feature map with the spatial attention module to obtain a spatial attention feature map; performing convolution calculation on the spatial attention feature map and the second intermediate feature map with a third convolution layer to obtain a target feature map; and obtaining the living body detection result of the face based on the target feature map, wherein the target feature map includes the living body features of the face.
With the scheme of the embodiment of the application, the modules implementing the above steps form a lightweight neural network architecture that runs readily on edge computing devices, so the face detection method can be applied in more scenarios.
In some possible embodiments, the method is run on an edge computing device.
In a second aspect, an apparatus for face detection is provided, including: the acquisition unit is used for acquiring a depth map of a target to be detected; the first face feature extraction module is used for extracting features of the depth map to obtain a first feature map; the face detection module is used for carrying out face detection on the first feature map to obtain a face region feature map; the second face feature extraction module is used for extracting the features of the face region feature map to obtain a second feature map; the concentration module is used for carrying out living body detection on the second feature map to obtain a living body detection result of the human face; and the output module is used for outputting a face region frame including the living face in the depth map according to the living body detection result of the face.
In some possible implementations, the first feature map includes edge line features in the depth map.
In some possible embodiments, the number of layers of the convolutional layers of the first facial feature extraction module is not greater than 4.
In some possible embodiments, the face detection module comprises: a convolutional layer network, a face range convolutional layer and a face center convolutional layer;
the convolutional layer network is used for carrying out convolution calculation on the first characteristic diagram to obtain an intermediate characteristic diagram; the face range convolution layer and the face center convolution layer are respectively used for carrying out convolution calculation on the intermediate characteristic graph to obtain a face region prediction graph and a face center prediction graph; the face region prediction image and the face center prediction image are used for mapping the detection result to the first feature image to obtain the face region feature image.
In some possible embodiments, the face detection module further comprises a face feature concentration layer; the convolutional layer network and the face feature concentration layer are used to perform convolution calculation on the first feature map to obtain an intermediate feature map; the face feature concentration layer is used to assign weights to the pixel values of the intermediate feature map so as to highlight the facial-feature regions (eyes, nose, mouth, etc.) in the intermediate feature map.
In some possible implementations, the face feature concentration layer is a space-based attention module.
In some possible embodiments, the parameters of the face detection module are obtained by neural network training.
In some possible embodiments, the face detection module further comprises a center adjustment convolution layer, and the convolutional layer network is further configured to perform convolution calculation on a sample image to obtain a first sample feature map, wherein the sample image is annotated with a face region true value and a face center true value; the face range convolution layer, the face center convolution layer and the center adjustment convolution layer are used to perform convolution calculation on the first sample feature map respectively to obtain a face region predicted value, a face center predicted value and a face center offset predicted value; and the face region predicted value, the face center predicted value, the face center offset predicted value, the face region true value and the face center true value are used to calculate a loss function to obtain the parameters of the face detection module.
In some possible embodiments, the face range convolution layer and the face center convolution layer are two 1 × 1 convolution layers, wherein the face center prediction map is a face center heatmap.
In some possible embodiments, the second feature map includes detail features of a human face.
In some possible embodiments, the number of layers of the convolutional layers of the second face feature extraction module is not greater than 4.
In some possible embodiments, the second feature map includes facial features of five sense organs.
In some possible embodiments, the concentration module is a concentration mechanism module that combines space and channels.
In some possible embodiments, the concentration module includes: a plurality of layers of convolutional layers, a channel attention module, and a spatial attention module; performing convolution calculation on the second feature map by a first convolution layer in the multilayer convolution layers to obtain a first intermediate feature map; processing the first intermediate characteristic diagram by adopting the channel attention module to obtain a channel attention characteristic diagram; performing convolution calculation on the channel attention feature map and the first intermediate feature map by using a second convolution layer to obtain a second intermediate feature map; processing the second intermediate feature map by adopting the spatial attention module to obtain a spatial attention feature map; performing convolution calculation on the space attention feature map and the second intermediate feature map by using a third convolution layer to obtain a target feature map; and obtaining a living body detection result of the human face based on the target feature map, wherein the target feature map comprises the living body features of the human face.
In some possible embodiments, the device is an edge computing device.
In a third aspect, an electronic device is provided, including: the apparatus for face detection in the second aspect or any possible implementation manner thereof.
In some possible embodiments, the electronic device further comprises: depth map acquisition device.
In a fourth aspect, a computer-readable storage medium is provided, which is used for storing program instructions, and when the program instructions are executed by a computer, the computer executes the method for detecting a human face in the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, a computer program product is provided, which contains instructions that, when executed by a computer, cause the computer to perform the method for face detection in the first aspect or any of the possible implementations of the first aspect.
In particular, the computer program product may be run on the electronic device of the above third aspect.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture provided in the present application.
FIG. 2 is a basic framework diagram of Faster RCNN according to an embodiment of the present application.
FIG. 3 is a schematic diagram of the target detection process of Faster RCNN according to an embodiment of the present application.
Fig. 4 is a schematic flow chart diagram of a face detection method according to an embodiment of the present application.
Fig. 5 is a schematic block diagram of a neural network architecture according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a face detection module according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a face region box in a first feature map according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of another face detection module according to an embodiment of the present application.
FIG. 9 is a block diagram of a concentration module according to an embodiment of the present application.
Fig. 10 is a schematic structural block diagram of a face detection device according to an embodiment of the present application.
Fig. 11 is a schematic structural block diagram of a processing unit according to an embodiment of the present application.
Fig. 12 is a schematic hardware structure diagram of a face detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application can be applied to a face detection system, including but not limited to products based on optical face imaging. The face detection system can be applied to various electronic devices equipped with an image acquisition device (such as a camera); the electronic devices may be personal computers, computer workstations, smart phones, tablet computers, smart cameras, media consumption devices, wearable devices, set top boxes, game consoles, augmented reality (AR)/virtual reality (VR) devices, vehicle-mounted terminals and the like, and the embodiments disclosed in the application are not limited thereto.
It should be understood that the specific examples are provided herein only to assist those skilled in the art in better understanding the embodiments of the present application and are not intended to limit the scope of the embodiments of the present application.
It should also be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the inherent logic of the processes, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that the various embodiments described in this specification can be implemented individually or in combination, and the examples in this application are not limited thereto.
Unless otherwise defined, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
For better understanding of the solution of the embodiment of the present application, a brief description is given below to a possible application scenario of the embodiment of the present application with reference to fig. 1.
As shown in fig. 1, the present embodiment provides a system architecture 100. In fig. 1, a data acquisition device 160 is used to acquire training data. For the method for detecting a face according to the embodiment of the present application, the training data may include a training image or a training video.
After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130.
The above target model/rule 101 can be used to implement the method for face detection of the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the acquisition of the data acquisition device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, or the like, and may also be a server or a cloud. In fig. 1, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include: a pending video or a pending image input by the client device 140.
In some embodiments, the client device 140 may be the same device as the execution device 110, for example, the client device 140 may be a terminal device as the execution device 110.
In other embodiments, the client device 140 and the execution device 110 may be different devices; for example, the client device 140 is a terminal device and the execution device 110 is a cloud, a server, or the like. The client device 140 may interact with the execution device 110 through a communication network of any communication mechanism/communication standard; the communication network may be a wide area network, a local area network, a peer-to-peer connection, or the like, or any combination thereof.
The computing module 111 of the execution device 110 is configured to process according to input data (e.g., an image to be processed) received by the I/O interface 112. In the process of executing the relevant processing such as calculation by the calculation module 111 of the execution device 110, the execution device 110 may call data, codes, and the like in the data storage system 150 for corresponding processing, and may store data, instructions, and the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the face detection result obtained as described above, to the client device 140, thereby providing it to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 1, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 1, a target model/rule 101 is obtained according to training of a training device 120, where the target model/rule 101 may be a neural network in the embodiment of the present application, specifically, the neural network in the embodiment of the present application may be a Convolutional Neural Network (CNN), a Regional Convolutional Neural Network (RCNN), a fast regional convolutional neural network (faster RCNN), or another type of neural network, and the present application is not limited specifically.
Currently, in a face detection system, a two-stage neural network architecture, such as the aforementioned faster RCNN neural network, is generally used.
For ease of understanding, the Faster RCNN neural network is first briefly described with reference to fig. 2 and 3.
Fig. 2 shows a basic framework diagram of Faster RCNN, and fig. 3 shows a diagram of its target detection process.
As shown in fig. 2 and 3, Faster RCNN can be divided into several parts: a region proposal network (RPN), a convolutional neural network (CNN), a region of interest (ROI) pooling layer, and a classifier.
The convolutional neural network CNN is used to perform feature convolution on an input image and extract a feature map (feature map) of the input image.
The region proposal network RPN is used to extract candidate boxes (region proposals) from the convolved features in the feature map. In some embodiments, accurate candidate frames are obtained by setting a number of anchor boxes (anchors) in the feature map, using a softmax function to judge whether each anchor box contains a detection target (a positive anchor) or does not (a negative anchor), and then correcting the anchor boxes with bounding box regression.
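For background only, a standard RPN head of this kind can be sketched roughly as follows; the channel count and number of anchors are assumptions, and this is prior-art context rather than the method of this application.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Background illustration: per-anchor objectness scores (positive/negative
    anchor) and bounding box regressions, as in standard Faster RCNN."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors * 2, 1)   # positive / negative anchor scores
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)   # bounding box regression offsets

    def forward(self, feature_map):
        x = self.conv(feature_map).relu()
        return self.cls(x), self.reg(x)
```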
The ROI pooling layer receives the feature maps and the candidate frames, integrates this information and extracts proposal feature maps, so that the subsequent classifier can conveniently perform classification and identification.
The classifier computes the category of each candidate frame from its proposal feature map and performs bounding box regression again to obtain the final accurate position of the detection frame.
As can be seen from the above description, if the Faster RCNN network is used for face detection, after the feature map of the input image is obtained through the CNN, the first stage uses the anchors in the RPN to roughly find the positions of face candidate frames and determine that they are not background, and the second stage identifies the candidate frames as faces through subsequent processing and locates their positions more accurately.
Such a conventional face detection method has a complex detection process and requires a dedicated or large server to run, so it is not easy to apply and popularize on edge computing devices, and the complex detection process also seriously affects detection efficiency.
In addition, in a dark or low-light environment the input image cannot be captured, or the captured input image is of poor quality, so face detection is difficult to perform; moreover, this face detection method provides no living body detection information, so the face detection result is inaccurate and the comprehensive performance of face detection is affected.
In view of the above, the application provides an edge-computing-friendly neural network architecture and performs face detection based on this architecture, so that the face detection process can be executed efficiently, can run on an edge computing device, and provides a living body detection result during face detection.
Next, a neural network architecture for face detection and a flow of a face detection method provided in the embodiment of the present application are described with reference to fig. 4 to 9.
Fig. 4 shows a schematic flow diagram of a face detection method 200. Alternatively, the execution subject of the face detection method 200 may be the execution device 110 in fig. 1 above.
As shown in fig. 4, the face detection method 200 may include the following steps.
S210: and acquiring a depth map of the target to be detected, and performing feature extraction on the depth map to obtain a first feature map.
S220: and carrying out face detection on the first feature map to obtain a face region feature map.
S230: and further extracting the features of the face region feature map to obtain a second feature map.
S240: and performing living body detection on the second feature map to obtain a living body detection result of the human face.
S250: and outputting at least one face region frame including the living face in the depth map according to the living body detection result of the face.
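Purely as an illustrative sketch of how steps S210 to S250 fit together, the following function outlines the flow; the callable names, their signatures and the score-thresholding step are assumptions, not part of the original disclosure.

```python
def detect_faces(depth_map, extract_first, face_detect, extract_second, liveness,
                 live_threshold=0.5):
    """Illustrative pipeline for steps S210-S250 (names and signatures assumed).

    extract_first  : depth map -> first feature map                  (S210)
    face_detect    : first feature map -> (region feature maps, face boxes)  (S220)
    extract_second : face region feature map -> second feature map   (S230)
    liveness       : second feature map -> living-body score         (S240)
    """
    first_feature_map = extract_first(depth_map)                     # S210
    region_maps, face_boxes = face_detect(first_feature_map)         # S220
    live_face_boxes = []
    for region_map, box in zip(region_maps, face_boxes):
        second_feature_map = extract_second(region_map)              # S230
        live_score = liveness(second_feature_map)                    # S240
        if live_score >= live_threshold:                             # S250: keep living faces only
            live_face_boxes.append(box)
    return live_face_boxes
```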
In the embodiment of the present application, the target to be detected includes, but is not limited to, any object such as a human face, a photograph, a video, a three-dimensional model, and the like. For example, the target to be detected may be a face of a target user, faces of other users, a user photo, a curved surface model with a photo attached, and the like.
As an example, in some embodiments, after the depth map of the object to be detected is acquired by the depth map acquisition device, the depth map is sent to the processing unit in the execution device for subsequent image processing. Optionally, the depth map collecting device may be integrated into the executing device, or may be provided separately from the executing device.
The depth map (depth image) in the embodiment of the present application is a depth image of the target to be detected, also referred to as a range image, in which each pixel value represents the distance from a point on the surface of the target to be detected to a common reference point or reference plane.
For example, when the depth map acquisition device is used to acquire a depth map of a human face, the pixel values in the depth map represent the distances from the image acquisition module to points on the face surface. When the depth map is stored as a grayscale image, changes in pixel value appear as changes in gray level, so the gray-level changes of the depth map correspond to the depth changes of the face and directly reflect the geometric shape and depth information of the face surface.
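As a simple illustration of this correspondence, the following sketch normalizes raw depth values to a grayscale image; the millimetre unit and the use of 0 as an invalid-depth marker are assumptions made only for the example.

```python
import numpy as np

def depth_to_gray(depth_mm):
    """Illustrative only: map raw depth values (assumed to be in millimetres)
    to an 8-bit grayscale image, so gray-level changes mirror depth changes."""
    d = depth_mm.astype(np.float32)
    valid = d > 0                      # 0 is assumed to mean "no measurement"
    if not valid.any():
        return np.zeros_like(d, dtype=np.uint8)
    lo, hi = d[valid].min(), d[valid].max()
    gray = np.zeros_like(d)
    gray[valid] = (d[valid] - lo) / max(hi - lo, 1e-6) * 255.0
    return gray.astype(np.uint8)
```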
In some possible embodiments, the structured light projection module may project structured light to the target to be detected, and the depth map acquisition device receives a reflected structured light signal of the structured light reflected by the target to be detected, and converts the reflected structured light signal to obtain the depth map.
Optionally, the structured light includes, but is not limited to, speckle images, lattice light, and other optical signals with a structured pattern, and the structured light projection module may be any device structure for projecting structured light, including but not limited to: a dot matrix light projector using a Vertical Cavity Surface Emitting Laser (VCSEL) light source, a speckle structured light projector, and other light emitting devices.
It should be understood that, in the embodiment of the present application, other image acquisition modules capable of acquiring depth information of a target to be detected may also be used to acquire a depth map, for example, time of flight (TOF) optical modules and other image acquisition modules acquire a depth map, and then transmit the depth map to the processing unit.
It should be further understood that, in the above step, point cloud (point cloud) data of the target to be detected may also be obtained, and the point cloud data is converted into a depth map, and a specific technical scheme for obtaining the point cloud data of the target to be detected and a specific technical scheme for converting the point cloud data into the depth map may refer to methods in related technologies, which is not specifically limited in this embodiment of the present application.
After the depth map of the target to be detected is obtained, a neural network architecture is provided, and the depth map is processed by it to obtain the face region frame of the living face in the depth map.
Fig. 5 shows a schematic block diagram of the neural network architecture 20 of an embodiment of the present application.
As shown in fig. 5, the neural network architecture 20 includes: a first face feature extraction module 21, a face detection module 22, a second face feature extraction module 23, and a concentration module 24.
Specifically, the first facial feature extraction module 21 is configured to execute the step S210, and perform feature extraction on the depth map of the target to be detected to obtain at least one first feature map.
In some embodiments, the first face feature extraction module 21 may include at least one convolution layer and performs convolution calculation on the depth map of the target to be detected, so as to extract the edge and line features, i.e. the high-frequency features, in the depth map; if the target to be detected is a face, face line features such as the facial organs and the facial contour are extracted by the first face feature extraction module 21.
In some embodiments, each of the at least one convolutional layer includes one or more convolution kernels. A convolution kernel is also called a filter or a feature detector. The matrix obtained by sliding the convolution kernel over the image and computing dot products is called a convolved feature, activation map or feature map. For the same input image, convolution kernels with different values generate different feature maps, so one or more first feature maps that include line features can be obtained with one or more convolution kernels; by modifying the values of the convolution kernel, different first feature maps can be extracted from the depth map.
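A minimal example of this sliding-kernel computation, assuming a hand-set 3 × 3 edge kernel and a random tensor standing in for the depth map:

```python
import torch
import torch.nn.functional as F

# Illustrative only: sliding a 3x3 kernel over a depth map and computing dot
# products produces a feature map; different kernel values yield different maps.
depth = torch.rand(1, 1, 64, 64)                  # stand-in for a depth map
edge_kernel = torch.tensor([[[[-1., -1., -1.],
                              [-1.,  8., -1.],
                              [-1., -1., -1.]]]])  # responds to edges / lines
feature_map = F.conv2d(depth, edge_kernel, padding=1)
print(feature_map.shape)                           # torch.Size([1, 1, 64, 64])
```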
It should be understood that the convolution kernel may be a 3 × 3 matrix, a 5 × 5 matrix, or other size matrix, which is not limited in the embodiments of the present application.
It should also be understood that, in this embodiment of the present application, the number of convolutional layers in the first face feature extraction module 21 may be between one layer and four layers, or may also be four or more convolutional layers, the size of the plurality of convolutional kernels in each convolutional layer may be the same or different, and the convolution step size of the plurality of convolutional kernels may be the same or different, which is not limited in this embodiment of the present application.
It should also be understood that, in the embodiment of the present application, the types of the at least one convolutional layer in the first face feature extraction module 21 may be the same or different, including but not limited to two-dimensional convolution, three-dimensional convolution, pointwise convolution, depthwise convolution, separable convolution, deconvolution (transposed convolution), and/or dilated convolution (atrous convolution), and so on.
Optionally, in the first face feature extraction module 21, an excitation layer (activation layer) may follow the at least one convolution layer; the excitation layer contains an excitation function that applies a nonlinear transformation to each pixel value in the feature map obtained by convolution. Optionally, excitation functions include, but are not limited to, the rectified linear unit (ReLU) function, the exponential linear unit (ELU) function, and several variants of the ReLU function, such as the leaky ReLU (LReLU), the parametric ReLU (PReLU) and the randomized ReLU (RReLU). In a feature map processed by such an excitation function the pixel values are sparse, and the excitation layer helps the sparse neural network structure to better mine relevant features and fit the training data.
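A small illustration of applying such excitation functions to a feature map follows; the listed functions are examples only, and none is mandated by the description above.

```python
import torch
import torch.nn as nn

# Illustrative only: applying an excitation function after convolution zeroes or
# damps negative responses, which gives the sparse activations mentioned above.
feature_map = torch.randn(1, 8, 32, 32)
for act in (nn.ReLU(), nn.ELU(), nn.LeakyReLU(0.01), nn.PReLU(), nn.RReLU()):
    out = act(feature_map)
    frac = (out <= 0).float().mean().item()
    print(type(act).__name__, f"non-positive outputs: {frac:.2f}")
```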
Since the first facial feature extraction module 21 is located at the front end of the neural network, the multilayer structure in the first facial feature extraction module 21 may also be referred to as a shallow layer of the neural network, and performs preliminary processing on the depth map.
Optionally, after the first face feature extraction module 21 performs step S210, the face detection module 22 in the neural network model continues with step S220 to perform face detection on the first feature map and obtain a face region feature map.
As an example, fig. 6 shows an architecture diagram of a face detection module 22.
As shown in fig. 6, the face detection module 22 may include: convolutional layer network 221, face range convolutional layer 222, and face center convolutional layer 223.
Specifically, the convolutional layer network 221 includes at least two convolutional layers, so as to further perform a convolution operation on the first feature map output by the first face feature extraction module 21, so as to extract more face features in the first feature map.
It is understood that, in the convolutional layer network 221, the size, type and values of the convolution kernels in each convolutional layer may differ, so as to extract feature information of different dimensions from the first feature map and combine it into feature maps that include the face features. Meanwhile, the convolutional layer network 221 can control process parameters such as the number of channels and the size of the feature maps, so as to keep them consistent with the later layers of the neural network.
Optionally, as shown in fig. 6, the face detection module 22 may further include a face feature concentration layer 224. The face feature concentration layer 224 is an attention mechanism module for the convolutional module, and may be a spatial attention mechanism module, a channel attention mechanism module, or a module combining spatial and channel attention.
As an example, the facial feature concentration layer 224 is a space-based attention mechanism module, which concentrates on the spatial features of each feature map, and particularly, since the depth values of the five sense organs such as eyes and nose on the face are not consistent with the depth values of other planes, the facial feature concentration layer 224 is used for concentrating on the positions of the five sense organs of the face, and the pixel values of each feature map are subjected to weight distribution to highlight the five sense organs of the face in the feature map. By adding the face feature concentration layer 224 to the face detection module 22, the convolution network 221 can convolve to obtain a feature map which more highlights facial features, and the accuracy of subsequent face position detection and living body detection is increased.
Alternatively, if the face feature concentration layer 224 is a spatial attention mechanism module, it includes but is not limited to a Spatial Transformer Network (STN) model, and may be any spatial attention mechanism module in the related art intended to extract planar features of the feature maps output by each convolutional layer in the convolutional layer network 221.
Further, after the concentration convolution processing of the convolution layer network 221 and the face feature concentration layer 224, an intermediate feature map is obtained, and after the intermediate feature map is subjected to convolution processing of the face range convolution layer 222 and the face center convolution layer 223, a face region prediction map (scale map) and a face center prediction map (center map) can be obtained, wherein the face region prediction map is used for representing the size of the region of the face in the first feature map, and the face center prediction map is used for representing the center position of the face in the first feature map.
Optionally, the face range convolution layer 222 and the face center convolution layer 223 are two 1 × 1 convolution layers, where a face center prediction map obtained by convolution of the face center convolution layer 223 is a face center heat map (center heat map).
As shown in fig. 7, the face region frame in the first feature map, that is, the face region feature map, may be obtained by mapping the detection result to the first feature map through the face region prediction map and the face center point heat map.
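As an illustration only, decoding a face region box from the face center heatmap and the face region prediction map might look like the following sketch; the peak-picking, threshold and log-size encoding are assumptions consistent with the training description below, not the patented procedure.

```python
import torch

def decode_face_box(center_map, scale_map, threshold=0.5):
    """Illustrative decoding only (assumed): take the peak of the face-center
    heatmap as the face center and read the predicted region size at that
    location to form a face region box (x1, y1, x2, y2) in feature-map coords."""
    heat = torch.sigmoid(center_map[0, 0])     # H x W heatmap (sigmoid assumed if raw logits)
    score, idx = heat.flatten().max(dim=0)
    if score < threshold:
        return None                            # no face center above threshold
    width = heat.shape[1]
    cy, cx = divmod(idx.item(), width)
    # scale_map is assumed to hold the log of the region (height, width) at each position
    bh, bw = scale_map[0, :, cy, cx].exp().tolist()
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
```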
The face detection module 22 shown in fig. 6 is a neural network structure in an actual face detection process, wherein the convolution kernel parameters and other related parameters of each convolution layer are obtained through neural network training. In order to improve the accuracy and robustness in the actual face detection process, fig. 8 shows a schematic structural diagram of a face detection module 22 in the training stage.
As shown in fig. 8, compared with the face detection module 22 in fig. 6, in the training stage the face detection module 22 further includes a center adjustment convolution layer 225 for adjusting the position coordinates of the face center obtained by the face center convolution layer 223. Optionally, in this embodiment of the present application, the face center convolution layer 223 has two channels, which are responsible for the offset in the horizontal direction and the offset in the vertical direction, respectively.
In the training stage, face image samples from different scenes and different angles are collected, and the face frames in the samples (i.e., the ground truth) are converted into true values for the center point and the range labels. For example, for the face center point, the position where the center of the target falls is assigned 1 (a positive sample) and all other positions are assigned 0 (negative samples); for the face region, the position where the center of the target falls is assigned the log value of the region size, and all other positions are assigned 0.
The loss function has three parts: the loss of the predicted center point position, the loss of the predicted face frame size, and the loss of the predicted center point offset. After the three parts are weighted, their sum is the final loss function of the face detection module 22, which is used to train the face detection module 22 and obtain the parameters of each layer in the module.
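A hedged sketch of such a three-part weighted loss is given below; the individual loss forms (binary cross-entropy for the center heatmap, L1 for size and offset) and the weights are assumptions, not taken from the application.

```python
import torch.nn.functional as F

def detection_loss(center_pred, size_pred, offset_pred,
                   center_gt, size_gt, offset_gt, pos_mask,
                   w_center=1.0, w_size=0.1, w_offset=1.0):
    """Illustrative three-part weighted loss (weights and loss forms assumed).
    pos_mask is assumed to be a boolean tensor of the same shape as
    size_pred/offset_pred marking the positions where a face center falls."""
    # loss of the predicted center point position (heatmap: 1 at centers, 0 elsewhere)
    loss_center = F.binary_cross_entropy_with_logits(center_pred, center_gt)
    # loss of the predicted face frame size, computed only at positive positions
    loss_size = F.l1_loss(size_pred[pos_mask], size_gt[pos_mask])
    # loss of the predicted center point offset, computed only at positive positions
    loss_offset = F.l1_loss(offset_pred[pos_mask], offset_gt[pos_mask])
    return w_center * loss_center + w_size * loss_size + w_offset * loss_offset
```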
According to the scheme of the embodiment of the application, in the training process the center adjustment convolutional layer 225 is used to adjust the coordinates of the predicted face center position, which increases the robustness and accuracy of the center position prediction; in the actual face detection process, only the face range convolutional layer 222 and the face center convolutional layer 223 are used to obtain the face region, which improves the efficiency of the detection process and increases the speed of face detection.
Further, after the face region feature map is obtained through the face detection in step S220, step S230 is executed to perform further feature extraction on the face region feature map to obtain at least one second feature map.
Specifically, the step S230 may perform feature extraction through the second facial feature extraction module 23 in the neural network 20.
Alternatively, the second face feature extraction module 23 may include at least one convolution layer used to perform convolution calculation on the face region feature map, so as to extract living body features in the face region feature map, such as facial texture features and details of the facial organs, that distinguish a living face region from a non-living one. In other words, after processing by the second face feature extraction module 23, the second feature map of a living face region differs markedly from the second feature map of a non-living face.
Similar to the first face feature extraction module described above, the convolution kernels in the at least one convolution layer of the second face feature extraction module may be a 3 × 3 matrix, a 5 × 5 matrix, or a matrix of another size, which is not limited in this embodiment of the present application.
Optionally, the number of convolutional layers in the second face feature extraction module may be between one layer and four layers, or may also be more than four convolutional layers, the size of a plurality of convolutional cores in each convolutional layer may be the same or different, and the convolution step size of the plurality of convolutional cores may be the same or different, which is not limited in this embodiment of the present application.
Optionally, the type of at least one convolutional layer may be the same or may be different, including but not limited to two-dimensional convolution, three-dimensional convolution, point-by-point convolution, deep convolution, separable convolution, deconvolution, and/or hole convolution, among others.
Optionally, after at least one convolution layer, an excitation layer may be further included, where the excitation layer includes an excitation function for performing a non-linear processing on each pixel value in the feature map obtained by the convolution.
It can be understood that, in the training stage, the relevant parameters of the second face feature extraction module may be trained on a number of living face region image samples and non-living face region image samples to obtain an optimized model of the second face feature extraction module, so that in the actual face detection process, the processing of the second face feature extraction module yields a feature map that can distinguish a living body from a non-living one.
Further, after the step S230, step S240 is executed to perform living body detection on the second feature map, determine whether the face in the second feature map is a living body face, and obtain a living body detection result of the face.
Specifically, in the embodiment of the present application, step S240 may be performed by concentration module 24 in the neural network 20.
It is to be understood that the concentration module 24 is an attention module that focuses on the living body features in the second feature map.
Alternatively, in the embodiment of the present application, the concentration module 24 may be a space-based concentration mechanism module, a channel-based concentration mechanism module, or a combination of space and channel.
As an example, the concentration module 24 is a spatial and channel combined attention module, which includes but is not limited to a volume block attention module (CBAM) model, which may also be any spatial and channel combined attention module in the related art, and is intended to extract planar features of multiple feature maps and emphasis features in different channels.
Fig. 9 shows a schematic diagram of the concentration module 24.
Specifically, the concentration module 24 includes multiple convolution layers, and an attention mechanism module is added after a convolution layer to generate an optimized feature map.
As shown in fig. 9, the channel attention module 244 is added after the first convolutional layer 241, the spatial attention module 245 is added after the second convolutional layer 242, and the target feature map is then output through the third convolutional layer 243; classification is performed based on the target feature map, so that whether the face is a living face can be determined.
By way of example, the processing from the second feature map to the target feature map is described below in connection with the structure of the concentration module 24 in fig. 9.
The second feature map passes through the first convolution layer 241 to obtain N (N-channel) first intermediate feature maps, which are input as input feature maps into the channel attention module 244.
In the channel attention module 244, the N first intermediate feature maps are compressed along the spatial dimension by max pooling and average pooling, respectively, to obtain two 1 × 1 × N first intermediate vectors. The two 1 × 1 × N first intermediate vectors are each processed by a shared multi-layer perceptron (MLP) to obtain two 1 × 1 × N second intermediate vectors. The second intermediate vectors output by the MLP are added element-wise and then activated by an activation function, such as a sigmoid function, to generate a channel attention feature map. The channel attention feature map and the first intermediate feature maps are multiplied element-wise to obtain N (N-channel) second intermediate feature maps, which are input to the second convolutional layer 242.
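For illustration only, a CBAM-style channel attention step of the kind just described could be sketched as follows (PyTorch assumed; the reduction ratio and the use of 1 × 1 convolutions for the shared MLP are common conventions, not details taken from the embodiment):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Spatially pools the input with max and average pooling, passes both pooled
    vectors through a shared MLP, sums them element-wise, and applies a sigmoid
    to obtain a per-channel attention weight."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # -> B x N x 1 x 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.shared_mlp = nn.Sequential(          # shared multi-layer perceptron
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attention = torch.sigmoid(
            self.shared_mlp(self.max_pool(x)) + self.shared_mlp(self.avg_pool(x)))
        return x * attention  # element-wise multiplication with the input feature maps
```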
Alternatively, in some embodiments, the second convolutional layer 242 performs a convolution operation on the N second intermediate feature maps and inputs the convolved feature maps into the spatial attention module 245 as input feature maps; in other embodiments, the second convolutional layer 242 is omitted, and the N second intermediate feature maps are input directly into the spatial attention module 245 as input feature maps.
In the spatial attention module 245, the input feature maps, for example the N (N-channel) second intermediate feature maps, are compressed along the channel dimension by max pooling and average pooling to obtain two W × H × 1 third intermediate feature maps, where W and H are the width and height of the second intermediate feature maps. The two third intermediate feature maps are then concatenated along the channel dimension and reduced by a convolution operation to a W × H × 1 fourth intermediate feature map. The fourth intermediate feature map is processed by an activation function, for example a sigmoid function, to generate a spatial attention feature map, and the spatial attention feature map and the second intermediate feature maps are multiplied element-wise to obtain N (N-channel) fifth intermediate feature maps, which are input to the third convolution layer 243.
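Likewise, building on the imports of the previous sketch, the spatial attention step might be written as follows (the 7 × 7 kernel is a common CBAM choice and an assumption here, not part of the disclosed embodiment):

```python
class SpatialAttention(nn.Module):
    """Pools the input along the channel dimension with max and average pooling,
    concatenates the two W x H x 1 maps, convolves them down to one map, and
    applies a sigmoid to obtain a per-location attention weight."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # channel-wise max pooling
        avg_map = torch.mean(x, dim=1, keepdim=True)     # channel-wise average pooling
        attention = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attention  # element-wise multiplication with the input feature maps
```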
Optionally, the fifth intermediate feature maps are further convolved by the third convolution layer 243 to finally obtain the target feature map. The target feature map integrates spatial features and channel features, so living body detection performed on the target feature map has high reliability and robustness.
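Putting the two sketches above together in the arrangement of fig. 9 — first convolution layer, channel attention, second convolution layer, spatial attention, third convolution layer — might look as follows; the classification head and all layer widths are assumptions added for illustration and reuse the ChannelAttention and SpatialAttention classes sketched earlier:

```python
class ConcentrationModule(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.channel_attention = ChannelAttention(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.spatial_attention = SpatialAttention()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Hypothetical live / non-live classification head (not specified in the text).
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 2))

    def forward(self, second_feature_map: torch.Tensor) -> torch.Tensor:
        x = self.channel_attention(self.conv1(second_feature_map))  # channel-gated maps
        x = self.spatial_attention(self.conv2(x))                   # spatially gated maps
        target_feature_map = self.conv3(x)                          # integrates space and channel
        return self.classifier(target_feature_map)                  # liveness logits
```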
In the embodiment of the application, the lightweight CBAM module is adopted; compared with an attention module that focuses only on the channel or only on the space, the target feature map can be obtained more simply and effectively.
In the embodiment of the application, squeeze-and-excitation (SE) blocks are removed from the above neural network architecture including the convolution layers and the excitation layers, so that a lightweight neural network architecture can be realized, and the lightweight neural network is also convenient to run on an edge computing device.
Meanwhile, adding the concentration module in the embodiment of the application helps the living body detection focus on the face features, thereby improving the accuracy of the living body detection result.
According to the scheme of the embodiment of the application, the face position in the target to be detected and the living body detection result of the face can be output synchronously by the neural network. Meanwhile, the neural network of the embodiment of the application has high operation efficiency and can run on an edge computing device. Since the input image is a depth map of the target to be detected, the influence of ambient light on face detection can be avoided, and face detection can still be carried out effectively under low-light, no-light, or backlit conditions.
The embodiments of the method for detecting a face in the present application are described in detail above with reference to fig. 4 to 9, and the embodiments of the apparatus for detecting a face in the present application are described in detail below with reference to fig. 10 to 12.
Fig. 10 is a schematic block diagram of a face detection apparatus 20 according to an embodiment of the present application, where the face detection apparatus 20 corresponds to the face detection method 200.
As shown in fig. 10, the face detection apparatus 20 includes:
an obtaining unit 210, configured to obtain a depth map of a target to be detected;
a processing unit 220, configured to perform image processing on the depth map to obtain a face region frame including a living face in the depth map;
and an output unit 230, configured to output the face region frame including the living face in the depth map.
In the embodiment of the application, the image used for face detection is a depth map of the target to be detected, so the influence of ambient light on face detection can be avoided, and face detection can still be performed effectively under low-light, no-light, or backlit conditions.
Specifically, as shown in fig. 11, the processing unit 220 may include:
a first face feature extraction module 21, configured to perform feature extraction on the depth map to obtain a first feature map;
a face detection module 22, configured to perform face detection on the first feature map to obtain a face region feature map;
the second face feature extraction module 23 is configured to perform feature extraction on the face region feature map to obtain a second feature map;
and a concentration module 24, configured to perform living body detection on the second feature map to obtain a living body detection result of the human face.
Optionally, the processing unit 220 may include the neural network 20 in the above method embodiment to process the depth image of the target to be detected.
In the embodiment of the present application, the neural network 20 in the processing unit 220 is a lightweight neural network architecture, has high operation efficiency, can perform living body detection while performing face detection, and can output a face region frame including a living face in a single pass, thereby improving the accuracy of face detection.
Specifically, the first face feature extraction module 21, the face detection module 22, the second face feature extraction module 23, and the concentration module 24 in the processing unit 220 correspond, respectively, to the first face feature extraction module 21, the face detection module 22, the second face feature extraction module 23, and the concentration module 24 in the neural network 20 described above.
It can be understood that, in the embodiment of the present application, the related technical solutions of the first face feature extraction module 21, the face detection module 22, the second face feature extraction module 23, and the concentration module 24 may refer to the related descriptions above, and are not described herein again.
In some possible embodiments, after the first face feature extraction module 21 performs feature extraction on the depth map of the target to be detected, the obtained first feature map includes edge line features in the depth map.
In some possible embodiments, the first face feature extraction module 21 may include no more than 4 convolution layers, which improves the running speed of the module while ensuring the extraction performance.
Referring to fig. 6 and the related description above, in some possible embodiments, the face detection module 22 may include: convolutional layer network 221, face range convolutional layer 222, and face center convolutional layer 223.
Optionally, the convolutional layer network 221 is configured to perform convolution calculation on the first feature map to obtain an intermediate feature map;
the face range convolution layer 222 and the face center convolution layer 223 are respectively used for performing convolution calculation on the intermediate feature map to obtain a face region prediction map and a face center prediction map;
the face region prediction map and the face center prediction map are used for mapping the detection result to the first feature map to obtain the face region feature map.
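As a non-limiting sketch of the structure just described (PyTorch assumed; the backbone depth, channel count, and output conventions are assumptions, and the optional face feature concentration layer and center adjustment convolution layer are omitted for brevity):

```python
import torch
import torch.nn as nn

class FaceDetectionModule(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv_network = nn.Sequential(              # convolutional layer network 221
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.face_range_conv = nn.Conv2d(channels, 2, kernel_size=1)   # predicts box width / height
        self.face_center_conv = nn.Conv2d(channels, 1, kernel_size=1)  # predicts center heatmap

    def forward(self, first_feature_map: torch.Tensor):
        intermediate = self.conv_network(first_feature_map)
        face_range_map = self.face_range_conv(intermediate)                    # face region prediction map
        face_center_map = torch.sigmoid(self.face_center_conv(intermediate))   # face center prediction map
        return face_range_map, face_center_map
```

In such a sketch, the peak of the face center prediction map, together with the range predicted at that location, would then be mapped back onto the first feature map to crop out the face region feature map.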
Optionally, referring to fig. 6, the face detection module 22 further includes: a face feature concentration layer 224;
the convolutional layer network 221 and the face feature concentration layer 224 are used for performing convolution calculation on the first feature map to obtain an intermediate feature map;
the face feature concentration layer 224 is configured to perform weight distribution on the pixel values of the intermediate feature map to highlight the facial features of the five sense organs in the intermediate feature map.
As an example, the face feature concentration layer 224 is a space-based attention module.
In addition, referring to fig. 8 and the related description above, in some possible embodiments, the face detection module 22 may further include a center adjustment convolution layer 225, and the parameters of the face detection module 22 are obtained through neural network training.
In the neural network training phase, in the face detection module 22, the convolutional layer network 221 is further configured to: perform convolution calculation on the sample image to obtain a first sample feature map, wherein a face area true value and a face center true value are marked in the sample image;
the face range convolution layer 222, the face center convolution layer 223, and the center adjustment convolution layer 225 are used to: perform convolution calculation on the first sample feature map respectively to obtain a face area predicted value, a face center predicted value, and a face center offset predicted value;
the face area predicted value, the face center predicted value, the face center offset predicted value, the face area true value, and the face center true value are used to calculate a loss function to obtain the parameters of the face detection module 22.
As an example, the face range convolution layer 222 and the face center convolution layer 223 are two 1 × 1 convolution layers, wherein the face center prediction map is a face center heatmap.
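The text does not spell out the form of the loss function, so the following is only a plausible sketch, assuming a binary cross-entropy term on the face center heatmap and L1 regression terms for the face range and center offset, masked to annotated center locations; all function names and the equal weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def face_detection_loss(range_pred, center_logits, offset_pred,
                        range_gt, center_gt, offset_gt, center_mask):
    """Illustrative combined loss; the terms, masking, and weighting are assumptions.
    center_mask is 1 at locations where a ground-truth face center is annotated."""
    center_loss = F.binary_cross_entropy_with_logits(center_logits, center_gt)
    range_loss = F.l1_loss(range_pred * center_mask, range_gt * center_mask)
    offset_loss = F.l1_loss(offset_pred * center_mask, offset_gt * center_mask)
    return center_loss + range_loss + offset_loss
```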
Further, the second face feature extraction module 23 performs feature extraction on the face region feature map, and the obtained second feature map includes detail features of the face.
As an example, the second feature map includes facial features of five sense organs.
In some possible embodiments, the second face feature extraction module 23 includes convolution layers not greater than 4 layers, so as to improve the running speed of the module while ensuring the extraction performance.
Referring to fig. 9 and the associated description above, in some possible embodiments, in the processing unit 220, the concentration module 24 is an attention mechanism module combining space and channel.
By way of example, the concentration module 24 includes: a plurality of convolutional layers (first convolutional layer 241, second convolutional layer 242, and third convolutional layer 243), a channel attention module 244, and a spatial attention module 245;
the first convolution layer 241 in the multilayer convolution layer carries out convolution calculation on the second characteristic diagram to obtain a first intermediate characteristic diagram;
processing the first intermediate feature map with a channel attention module 244 to obtain a channel attention feature map;
performing convolution calculation on the channel attention feature map and the first intermediate feature map by using a second convolution layer 242 to obtain a second intermediate feature map;
processing the second intermediate feature map by using the spatial attention module 245 to obtain a spatial attention feature map;
performing convolution calculation on the spatial attention feature map and the second intermediate feature map by using a third convolution layer 243 to obtain a target feature map;
and obtaining a living body detection result of the human face based on the target feature map, wherein the target feature map comprises the living body features of the human face.
Since the processing unit 220 in the embodiment of the present application uses a lightweight neural network to perform the face detection operation, in some embodiments, the face detection apparatus 20 may be an edge computing device.
Fig. 12 is a schematic hardware structure diagram of a face detection apparatus according to an embodiment of the present application. The face detection apparatus 30 shown in fig. 12 (the face detection apparatus 30 may specifically be a computer device) includes a memory 310, a processor 320, a communication interface 330, and a bus 340. The memory 310, the processor 320 and the communication interface 330 are connected to each other through a bus 340.
The memory 310 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 310 may store a program; when the program stored in the memory 310 is executed by the processor 320, the processor 320 and the communication interface 330 are used to perform the steps of the face detection method of the embodiments of the present application.
The processor 320 may be a general Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement functions required to be executed by modules in the face detection apparatus according to the embodiment of the present application, or to execute the face detection method according to the embodiment of the present application.
Processor 320 may also be an integrated circuit chip having signal processing capabilities. In the implementation process, the steps of the face detection method of the present application may be implemented by integrated logic circuits of hardware in the processor 320 or by instructions in the form of software. The processor 320 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or EPROM, or registers. The storage medium is located in the memory 310, and the processor 320 reads the information in the memory 310 and, in combination with its hardware, completes the functions required to be executed by the modules included in the face detection apparatus of the embodiments of the present application, or executes the face detection method of the method embodiments of the present application.
Communication interface 330 enables communication between apparatus 30 and other devices or communication networks using transceiver devices such as, but not limited to, transceivers. For example, input data may be obtained through the communication interface 330.
Bus 340 may include a path that transfers information between various components of device 30 (e.g., memory 310, processor 320, communication interface 330).
It should be noted that although the apparatus 30 shown in fig. 12 shows only the memory 310, the processor 320, the communication interface 330, and the bus 340, in a specific implementation, those skilled in the art will appreciate that the apparatus 30 also includes other devices necessary for normal operation. Also, those skilled in the art will appreciate that the apparatus 30 may also include hardware components for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that the apparatus 30 may also include only those components necessary to implement the embodiments of the present application, and need not include all of the components shown in fig. 12.
It is to be understood that the face detection apparatus 30 may correspond to the face detection apparatus 20 in fig. 10 described above, the functions of the processing unit 220 in the face detection apparatus 20 may be implemented by the processor 320, and the functions of the acquisition unit 210 and the output unit 230 may be implemented by the communication interface 330. To avoid repetition, detailed description is appropriately omitted here.
The embodiment of the application also provides a processing device, which comprises a processor and an interface; the processor is used for executing the method for detecting the human face in any method embodiment.
It should be understood that the processing means may be a chip. For example, the processing device may be a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a Central Processing Unit (CPU), a Network Processor (NP), a digital signal processing circuit (DSP), a Microcontroller (MCU), a Programmable Logic Device (PLD), or other integrated chips.
The embodiment of the application also provides a platform system which comprises the human face detection device.
The embodiments of the present application also provide a computer-readable medium, on which a computer program is stored, which, when executed by a computer, implements the method of any of the above-mentioned method embodiments.
The embodiment of the present application further provides a computer program product, and the computer program product implements the method of any one of the above method embodiments when executed by a computer.
The embodiments of the present application also provide an electronic device, which may include the face recognition apparatus of the embodiments of the present application.
For example, the electronic device is a smart door lock, a mobile phone, a computer, an access control system, or other equipment that requires face recognition; the face recognition apparatus comprises the software and hardware used for face recognition in the electronic device.
Optionally, the electronic device may further include a depth map acquisition device.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
As used in this specification, the terms "unit," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (35)

1. A method for face detection, comprising:
acquiring a depth map of a target to be detected, and performing feature extraction on the depth map to obtain a first feature map;
carrying out face detection on the first feature map to obtain a face region feature map;
carrying out feature extraction on the face region feature map to obtain a second feature map;
performing living body detection on the second feature map to obtain a living body detection result of the human face;
and outputting a face region frame including the living face in the depth map according to the living body detection result of the face.
2. The method of claim 1, wherein the extracting the features of the depth map to obtain a first feature map comprises:
and performing feature extraction on the depth map by adopting a first face feature extraction module to obtain the first feature map, wherein the first feature map comprises edge line features in the depth map.
3. The method according to claim 1 or 2, wherein the number of layers of the convolutional layers of the first facial feature extraction module is not more than 4.
4. The method according to any one of claims 1 to 3, wherein the performing face detection on the first feature map to obtain a face region feature map comprises:
and carrying out face detection on the first feature map by adopting a face detection module to obtain a face region feature map.
5. The method of claim 4, wherein the face detection module comprises: a convolutional layer network, a face range convolution layer, and a face center convolution layer, and wherein the performing face detection on the first feature map by adopting the face detection module to obtain the face region feature map comprises:
performing convolution calculation on the first feature map by adopting the convolutional layer network to obtain a first intermediate feature map;
performing convolution calculation on the first intermediate feature map by using the face range convolution layer and the face center convolution layer respectively to obtain a face region prediction map and a face center prediction map;
and obtaining the face region characteristic diagram in the first characteristic diagram according to the face region prediction diagram and the face center prediction diagram.
6. The method of claim 5, wherein the face detection module further comprises: a face feature concentration layer;
the face feature concentration layer is used for carrying out weight distribution on pixel values of the intermediate feature map so as to highlight the facial features of the five sense organs in the intermediate feature map.
7. The method of claim 6, wherein the face feature concentration layer is a space-based attention module.
8. The method of any of claims 5 to 7, further comprising:
and carrying out neural network training on the face detection module to obtain parameters of the face detection module.
9. The method of claim 8, wherein the face detection module further comprises a center adjustment convolution layer, wherein the performing neural network training on the face detection module to obtain parameters of the face detection module comprises:
acquiring a sample image, wherein the sample image is marked with a real value of a face region and a real value of a face center;
carrying out convolution calculation on the sample image by adopting the convolutional layer network to obtain a first sample feature map;
performing convolution calculation on the first sample feature map by adopting the face range convolution layer, the face center convolution layer and the center adjustment convolution layer to obtain a face area prediction value, a face center prediction value and a face center offset prediction value;
and calculating a loss function according to the face area predicted value, the face center predicted value, the face center offset predicted value, the face area real value, and the face center real value, to obtain parameters of the face detection module.
10. The method of any of claims 5 to 9, wherein the face range convolution layer and the face center convolution layer are two 1 x 1 convolution layers, and wherein the face center prediction map is a face center heat map.
11. The method according to any one of claims 1 to 10, wherein the performing feature extraction on the face region feature map to obtain a second feature map comprises:
and performing feature extraction on the face region feature map by adopting a second face feature extraction module to obtain a second feature map, wherein the second feature map comprises the detail features of the face.
12. The method of claim 11, wherein the number of convolutional layers of the second face feature extraction module is no greater than 4.
13. The method according to claim 11 or 12, wherein the second feature map comprises facial features of five sense organs.
14. The method according to any one of claims 1 to 13, wherein the performing the live body detection on the second feature map to obtain a live body detection result of the human face comprises:
and performing living body detection on the second characteristic diagram by adopting a concentration module to obtain a living body detection result of the human face, wherein the concentration module is an attention mechanism module combining a space and a channel.
15. The method of claim 14, wherein the concentration module comprises: a plurality of convolution layers, a channel attention module, and a spatial attention module;
the adoption is absorbed in the module and is carried out live body detection to the second characteristic map, obtains the live body detection result of people's face, includes:
performing convolution calculation on the second feature map by adopting a first convolution layer to obtain a first intermediate feature map;
processing the first intermediate feature map by using the channel attention module to obtain a channel attention feature map;
performing convolution calculation on the channel attention feature map and the first intermediate feature map by using a second convolution layer to obtain a second intermediate feature map;
processing the second intermediate feature map by using the spatial attention module to obtain a spatial attention feature map;
performing convolution calculation on the spatial attention feature map and the second intermediate feature map by using a third convolution layer to obtain a target feature map;
and obtaining a living body detection result of the human face based on the target feature map, wherein the target feature map comprises the living body feature of the human face.
16. The method of any one of claims 1 to 15, wherein the method is run on an edge computing device.
17. An apparatus for face detection, comprising:
the acquisition unit is used for acquiring a depth map of a target to be detected;
the first face feature extraction module is used for extracting features of the depth map to obtain a first feature map;
the face detection module is used for carrying out face detection on the first feature map to obtain a face region feature map;
the second face feature extraction module is used for extracting features of the face region feature map to obtain a second feature map;
the concentration module is used for carrying out living body detection on the second feature map to obtain a living body detection result of the human face;
and the output module is used for outputting a face region frame including the living face in the depth map according to the living body detection result of the face.
18. The apparatus of claim 17, wherein the first feature map comprises edge line features in the depth map.
19. The apparatus according to claim 17 or 18, wherein the number of layers of the convolutional layers of the first facial feature extraction module is not more than 4.
20. The apparatus of any of claims 17 to 19, wherein the face detection module comprises: a convolutional layer network, a face range convolutional layer and a face center convolutional layer;
the convolutional layer network is used for carrying out convolution calculation on the first feature map to obtain an intermediate feature map;
the face range convolution layer and the face center convolution layer are respectively used for carrying out convolution calculation on the intermediate feature map to obtain a face region prediction map and a face center prediction map;
the face region prediction map and the face center prediction map are used for mapping the detection result to the first feature map to obtain the face region feature map.
21. The apparatus of claim 20, wherein the face detection module further comprises: a face feature concentration layer;
the face feature concentration layer is used for carrying out weight distribution on pixel values of the intermediate feature map so as to highlight the facial features of the five sense organs in the intermediate feature map.
22. The apparatus of claim 21, wherein the facial feature concentration layer is a space-based attention module.
23. The apparatus according to any one of claims 20 to 22, wherein the parameters of the face detection module are obtained by neural network training.
24. The apparatus of claim 23, wherein the face detection module further comprises a center adjustment convolution layer;
the convolutional layer network is further configured to: perform convolution calculation on a sample image to obtain a first sample feature map, wherein a face area true value and a face center true value are marked in the sample image;
the face range convolution layer, the face center convolution layer and the center adjustment convolution layer are used for: performing convolution calculation on the first sample feature map respectively to obtain a face area predicted value, a face center predicted value and a face center offset predicted value;
the face area predicted value, the face center predicted value, the face center offset predicted value, the face area real value, and the face center real value are used for calculating a loss function to obtain parameters of the face detection module.
25. The apparatus of any one of claims 20 to 24, wherein the face range convolution layer and the face center convolution layer are two 1 x 1 convolution layers, and wherein the face center prediction map is a face center heat map.
26. The apparatus according to any one of claims 17 to 25, wherein the second feature map comprises detail features of a human face.
27. The apparatus of claim 26, wherein the second face feature extraction module comprises no more than 4 convolutional layers.
28. The apparatus according to claim 26 or 27, wherein the second feature map comprises facial features of five sense organs.
29. The apparatus of any one of claims 17 to 28, wherein the concentration module is a concentration mechanism module that combines space and channels.
30. The apparatus of claim 29, wherein the concentration module comprises: a plurality of convolution layers, a channel attention module, and a spatial attention module;
performing convolution calculation on the second feature map by a first convolution layer in the multilayer convolution layers to obtain a first intermediate feature map;
processing the first intermediate feature map by using the channel attention module to obtain a channel attention feature map;
performing convolution calculation on the channel attention feature map and the first intermediate feature map by using a second convolution layer to obtain a second intermediate feature map;
processing the second intermediate feature map by using the spatial attention module to obtain a spatial attention feature map;
performing convolution calculation on the spatial attention feature map and the second intermediate feature map by using a third convolution layer to obtain a target feature map;
and obtaining a living body detection result of the human face based on the target feature map, wherein the target feature map comprises the living body feature of the human face.
31. The apparatus of any one of claims 17 to 30, wherein the apparatus is an edge computing apparatus.
32. An electronic device, comprising:
an apparatus for face detection as claimed in any one of claims 17 to 31.
33. The electronic device of claim 32, further comprising:
depth map acquisition device.
34. A computer-readable storage medium storing program instructions which, when executed by a computer, cause the computer to perform a method of face detection as claimed in any one of claims 1 to 16.
35. A computer program product containing instructions which, when executed by a computer, cause the computer to carry out the method of face detection according to any one of claims 1 to 16.
CN202010917905.0A 2020-09-03 2020-09-03 Face detection method and device and electronic equipment Active CN112036339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010917905.0A CN112036339B (en) 2020-09-03 2020-09-03 Face detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112036339A true CN112036339A (en) 2020-12-04
CN112036339B CN112036339B (en) 2024-04-09

Family

ID=73592018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010917905.0A Active CN112036339B (en) 2020-09-03 2020-09-03 Face detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112036339B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396035A (en) * 2020-12-07 2021-02-23 国网电子商务有限公司 Object detection method and device based on attention detection model
CN112613401A (en) * 2020-12-22 2021-04-06 贝壳技术有限公司 Face detection method and device, electronic equipment and storage medium
CN112686191A (en) * 2021-01-06 2021-04-20 中科海微(北京)科技有限公司 Living body anti-counterfeiting method, system, terminal and medium based on face three-dimensional information
CN113011304A (en) * 2021-03-12 2021-06-22 山东大学 Human body posture estimation method and system based on attention multi-resolution network
CN113283388A (en) * 2021-06-24 2021-08-20 中国平安人寿保险股份有限公司 Training method, device and equipment of living human face detection model and storage medium
CN112686191B (en) * 2021-01-06 2024-05-03 中科海微(北京)科技有限公司 Living body anti-counterfeiting method, system, terminal and medium based on three-dimensional information of human face

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508654A (en) * 2018-10-26 2019-03-22 中国地质大学(武汉) Merge the human face analysis method and system of multitask and multiple dimensioned convolutional neural networks
WO2019114580A1 (en) * 2017-12-13 2019-06-20 深圳励飞科技有限公司 Living body detection method, computer apparatus and computer-readable storage medium
CN111126358A (en) * 2020-02-25 2020-05-08 京东方科技集团股份有限公司 Face detection method, face detection device, storage medium and equipment
CN111563466A (en) * 2020-05-12 2020-08-21 Oppo广东移动通信有限公司 Face detection method and related product
CN111611934A (en) * 2020-05-22 2020-09-01 北京华捷艾米科技有限公司 Face detection model generation and face detection method, device and equipment


Also Published As

Publication number Publication date
CN112036339B (en) 2024-04-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant