CN112036339B - Face detection method and device and electronic equipment - Google Patents

Face detection method and device and electronic equipment

Info

Publication number
CN112036339B
Authority
CN
China
Prior art keywords
face
feature map
feature
map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010917905.0A
Other languages
Chinese (zh)
Other versions
CN112036339A (en)
Inventor
张为义
涂弘德
刘以勒
罗士杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Cook Intelligent Technology Co ltd
Original Assignee
Fujian Cook Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Cook Intelligent Technology Co ltd filed Critical Fujian Cook Intelligent Technology Co ltd
Priority to CN202010917905.0A priority Critical patent/CN112036339B/en
Publication of CN112036339A publication Critical patent/CN112036339A/en
Application granted granted Critical
Publication of CN112036339B publication Critical patent/CN112036339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40 Spoof detection, e.g. liveness detection
    • G06V 40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A face detection method and apparatus, and an electronic device, which can perform face detection and living body detection even in a dark or low-light environment and improve face detection efficiency and accuracy, thereby comprehensively improving the performance of face detection technology. The face detection method comprises the following steps: obtaining a depth map of a target to be detected, and performing feature extraction on the depth map to obtain a first feature map; performing face detection on the first feature map to obtain a face region feature map; performing feature extraction on the face region feature map to obtain a second feature map; performing living body detection on the second feature map to obtain a living body detection result of the face; and outputting, in the depth map, a face region frame containing the living face according to the living body detection result of the face.

Description

Face detection method and device and electronic equipment
Technical Field
The present disclosure relates to the field of biometric detection technologies, and in particular, to a method, an apparatus, and an electronic device for face detection.
Background
Face detection is a biometric detection and recognition technology that performs identity recognition based on the facial feature information of a person. A camera or video camera is used to collect images or video streams containing human faces, the faces are automatically detected and tracked in the images, and a series of related operations such as image preprocessing, image feature extraction, matching and recognition are then performed on the detected faces; this is commonly called face recognition, portrait recognition or facial recognition. With the rapid development of computer and network technologies, face detection technology has been widely applied in industries and fields such as intelligent access control, mobile terminals, public security, entertainment and the military.
Most existing face detection technologies first find the approximate position of a candidate frame in an image and determine that the image content in the candidate frame is not background, and then accurately locate the candidate frame and identify whether it contains a face. The face detection process is therefore complex, detection efficiency is low, and no living body detection information is provided. In addition, face detection technologies in the prior art cannot perform face detection in dark or low-light environments.
Therefore, how to perform face detection and living body detection in dark or low-light environments and improve the accuracy and efficiency of face detection, so as to comprehensively improve the performance of the face detection device, is a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiments of the present application provide a face detection method, a face detection apparatus and an electronic device, which can perform face detection and living body detection even in a dark or low-light environment and can improve face detection efficiency and accuracy, thereby comprehensively improving the performance of face detection technology.
In a first aspect, a method for face detection is provided, including: obtaining a depth map of a target to be detected, and extracting features of the depth map to obtain a first feature map; performing face detection on the first feature map to obtain a face region feature map; extracting features of the facial region feature map to obtain a second feature map; performing living body detection on the second feature map to obtain a living body detection result of the human face; and outputting a human face region frame comprising the living human face in the depth map according to the living body detection result of the human face.
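As a purely illustrative aid (not part of the claimed method), the five steps of the first aspect could be sketched in PyTorch roughly as follows; the module names, channel counts and the single-channel depth-map input are assumptions made for the example.

```python
# Illustrative sketch only (not the claimed implementation): the five steps of
# the first aspect expressed as a PyTorch module. Module names, channel counts
# and the single-channel depth-map input are assumptions for the example.
import torch
import torch.nn as nn

class FaceDetectionPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        # feature extraction on the depth map -> first feature map
        self.feat1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        # face detection heads: face region prediction and face center prediction
        self.scale_head = nn.Conv2d(16, 1, 1)
        self.center_head = nn.Conv2d(16, 1, 1)
        # further feature extraction -> second feature map
        self.feat2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # living body detection on the second feature map
        self.liveness = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, depth_map):
        f1 = self.feat1(depth_map)                         # first feature map
        scale_map = self.scale_head(f1)                    # face region prediction map
        center_map = torch.sigmoid(self.center_head(f1))   # face center prediction map
        # in the claimed method the face region of f1 would be cropped here;
        # the full map is kept below only to keep the sketch short
        f2 = self.feat2(f1)                                # second feature map
        live_score = self.liveness(f2)                     # living body detection result
        return scale_map, center_map, live_score

outputs = FaceDetectionPipeline()(torch.randn(1, 1, 128, 128))
print([o.shape for o in outputs])
```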
Based on the scheme of the embodiment of the application, the face position in the target to be detected can be obtained while the living body detection result of the face is synchronously output, which improves the accuracy of face detection. Meanwhile, the input image in the embodiment of the application is a depth map of the target to be detected, which avoids the influence of ambient light on face detection, so that face detection can be performed effectively under low-light, no-light or back-lit conditions. This improves face detection efficiency and comprehensively improves the performance of the face detection technology.
In some possible embodiments, the performing feature extraction on the depth map to obtain a first feature map includes: performing feature extraction on the depth map by using a first face feature extraction module to obtain the first feature map, where the first feature map includes edge line features in the depth map.
In some possible embodiments, the number of layers of the convolution layer of the first face feature extraction module is not greater than 4.
In some possible embodiments, the performing face detection on the first feature map to obtain a face region feature map includes: performing face detection on the first feature map by using a face detection module to obtain the face region feature map.
In some possible embodiments, the face detection module includes a convolutional layer network, a face range convolution layer and a face center convolution layer, and the performing face detection on the first feature map by using the face detection module to obtain the face region feature map includes: performing convolution calculation on the first feature map by using the convolutional layer network to obtain a first intermediate feature map; performing convolution calculation on the first intermediate feature map by using the face range convolution layer and the face center convolution layer respectively, to obtain a face region prediction map and a face center prediction map; and obtaining the face region feature map in the first feature map according to the face region prediction map and the face center prediction map.
In some possible embodiments, the face detection module further includes a face feature concentration layer; the face feature concentration layer is used to perform weight distribution on the pixel values of the intermediate feature map so as to highlight the features of the five sense organs of the face in the intermediate feature map.
According to the scheme of the embodiment of the application, the face feature concentration layer is added to the face detection module so that the convolutional network can obtain a feature map that better highlights the facial features, which improves the accuracy of subsequent face position detection and living body detection.
In some possible implementations, the face feature concentration layer is a space-based attention module.
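For illustration, a minimal sketch of such a face detection module with a space-based face feature concentration layer is given below; the specific attention form (average/max pooling followed by a 7×7 convolution), the channel counts and the single-channel heads are assumptions rather than the claimed structure.

```python
# Illustrative sketch of a face detection module with a space-based face
# feature concentration layer. The pooling-plus-7x7-convolution attention form,
# channel counts and single-channel heads are assumptions, not the claimed layer.
import torch
import torch.nn as nn

class SpatialConcentration(nn.Module):
    """Re-weights pixel values of the intermediate feature map, e.g. to
    highlight the positions of the facial organs."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        weights = torch.sigmoid(self.conv(pooled))   # one weight per pixel
        return x * weights

class FaceDetectionModule(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv_net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.concentrate = SpatialConcentration()
        self.face_range = nn.Conv2d(channels, 1, kernel_size=1)    # scale map head
        self.face_center = nn.Conv2d(channels, 1, kernel_size=1)   # center heat map head

    def forward(self, first_feature_map):
        mid = self.concentrate(self.conv_net(first_feature_map))
        return self.face_range(mid), torch.sigmoid(self.face_center(mid))

scale_map, center_map = FaceDetectionModule()(torch.randn(1, 16, 64, 64))
```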
In some possible embodiments, the method further comprises: performing neural network training on the face detection module to obtain parameters of the face detection module.
In some possible embodiments, the face detection module further includes a center adjustment convolution layer, and the performing neural network training on the face detection module to obtain parameters of the face detection module includes: acquiring a sample image, where the sample image is marked with a face region true value and a face center true value; performing convolution calculation on the sample image by using the convolutional layer network to obtain a first sample feature map; performing convolution calculation on the first sample feature map by using the face range convolution layer, the face center convolution layer and the center adjustment convolution layer to obtain a face region predicted value, a face center predicted value and a face center offset predicted value; and calculating a loss function according to the face region predicted value, the face center offset predicted value, the face region true value and the face center true value to obtain the parameters of the face detection module.
According to the scheme of the embodiment of the application, in the training process, the coordinates of the predicted face center position are adjusted by providing the center adjustment convolution layer, so as to increase the robustness and accuracy of the center position prediction; in the actual face detection process, only the face range convolution layer and the face center convolution layer are used to obtain the face region, which improves the efficiency of the detection process and speeds up face detection.
In some possible embodiments, the face range convolution layer and the face center convolution layer are two 1×1 convolution layers, and the face center prediction map is a face center heat map.
In some possible embodiments, the performing feature extraction on the face region feature map to obtain a second feature map includes: performing feature extraction on the face region feature map by using a second face feature extraction module to obtain the second feature map, where the second feature map includes detail features of the face.
In some possible embodiments, the number of layers of the convolution layer of the second face feature extraction module is not greater than 4.
In some possible embodiments, the second feature map includes facial features of a face.
In some possible embodiments, the performing living body detection on the second feature map to obtain a living body detection result of the face includes: performing living body detection on the second feature map by using a concentration module to obtain the living body detection result of the face, where the concentration module is an attention mechanism module combining spatial and channel attention.
According to the scheme of the embodiment of the application, by using a lightweight concentration module, the target feature map can be obtained more simply and effectively than with a concentration module that focuses only on channels or only on space.
In some possible embodiments, the concentration module comprises: multiple convolution layers, a channel attention module and a spatial attention module; and the performing living body detection on the second feature map by using the concentration module to obtain the living body detection result of the face includes: performing convolution calculation on the second feature map by using a first convolution layer to obtain a first intermediate feature map; processing the first intermediate feature map by using the channel attention module to obtain a channel attention feature map; performing convolution calculation on the channel attention feature map and the first intermediate feature map by using a second convolution layer to obtain a second intermediate feature map; processing the second intermediate feature map by using the spatial attention module to obtain a spatial attention feature map; performing convolution calculation on the spatial attention feature map and the second intermediate feature map by using a third convolution layer to obtain a target feature map; and obtaining the living body detection result of the face based on the target feature map, where the target feature map includes living body features of the face.
According to the scheme of the embodiment of the application, the modules implementing the above steps form a lightweight neural network architecture that can conveniently run on edge computing devices, so that the face detection method can be applied in more scenarios.
In some possible embodiments, the method runs on an edge computing device.
In a second aspect, an apparatus for face detection is provided, including: an acquisition unit, configured to acquire a depth map of a target to be detected; a first face feature extraction module, configured to perform feature extraction on the depth map to obtain a first feature map; a face detection module, configured to perform face detection on the first feature map to obtain a face region feature map; a second face feature extraction module, configured to perform feature extraction on the face region feature map to obtain a second feature map; a concentration module, configured to perform living body detection on the second feature map to obtain a living body detection result of the face; and an output module, configured to output, in the depth map, a face region frame containing the living face according to the living body detection result of the face.
In some possible implementations, the first feature map includes edge line features in the depth map.
In some possible embodiments, the number of layers of the convolution layer of the first face feature extraction module is not greater than 4.
In some possible embodiments, the face detection module includes: a convolutional layer network, a face range convolutional layer and a face center convolutional layer;
the convolutional layer network is used to perform convolution calculation on the first feature map to obtain an intermediate feature map; the face range convolution layer and the face center convolution layer are respectively used to perform convolution calculation on the intermediate feature map to obtain a face region prediction map and a face center prediction map; and the face region prediction map and the face center prediction map are used to map the detection result to the first feature map to obtain the face region feature map.
In some possible embodiments, the face detection module further includes a face feature concentration layer; the convolutional layer network and the face feature concentration layer are used to perform convolution calculation on the first feature map to obtain the intermediate feature map; and the face feature concentration layer is used to perform weight distribution on the pixel values of the intermediate feature map so as to highlight the features of the five sense organs of the face in the intermediate feature map.
In some possible implementations, the face feature concentration layer is a space-based attention module.
In some possible embodiments, the parameters of the face detection module are obtained through neural network training.
In some possible implementations, the face detection module further includes a center adjustment convolution layer, and the convolutional layer network is further configured to: perform convolution calculation on a sample image to obtain a first sample feature map, where the sample image is marked with a face region true value and a face center true value; the face range convolution layer, the face center convolution layer and the center adjustment convolution layer are configured to: respectively perform convolution calculation on the first sample feature map to obtain a face region predicted value, a face center predicted value and a face center offset predicted value; and the face region predicted value, the face center offset predicted value, the face region true value and the face center true value are used to calculate a loss function to obtain parameters of the face detection module.
In some possible embodiments, the face range convolution layer and the face center convolution layer are two 1×1 convolution layers, and the face center prediction map is a face center heat map.
In some possible embodiments, the second feature map includes detailed features of the face.
In some possible embodiments, the number of layers of the convolution layer of the second face feature extraction module is not greater than 4.
In some possible embodiments, the second feature map includes facial features of a face.
In some possible embodiments, the concentration module is an attention mechanism module combining spatial and channel attention.
In some possible embodiments, the concentration module comprises: multiple convolution layers, a channel attention module and a spatial attention module; a first convolution layer of the multiple convolution layers performs convolution calculation on the second feature map to obtain a first intermediate feature map; the channel attention module processes the first intermediate feature map to obtain a channel attention feature map; a second convolution layer performs convolution calculation on the channel attention feature map and the first intermediate feature map to obtain a second intermediate feature map; the spatial attention module processes the second intermediate feature map to obtain a spatial attention feature map; a third convolution layer performs convolution calculation on the spatial attention feature map and the second intermediate feature map to obtain a target feature map; and a living body detection result of the face is obtained based on the target feature map, where the target feature map includes living body features of the face.
In some possible embodiments, the device is an edge computing device.
In a third aspect, an electronic device is provided, including the apparatus for face detection according to the second aspect or any possible implementation manner thereof.
In some possible implementations, the electronic device further includes a depth map acquisition device.
In a fourth aspect, a computer readable storage medium is provided for storing program instructions which, when executed by a computer, perform the method of face detection in any one of the possible implementations of the first aspect or the first aspect.
In a fifth aspect, there is provided a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of face detection of the first aspect or any of the possible implementations of the first aspect.
In particular, the computer program product may be run on the electronic device of the third aspect described above.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture provided in the present application.
Fig. 2 is a schematic diagram of a basic framework of a Faster RCNN according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a target detection process of a Faster RCNN according to an embodiment of the present application.
Fig. 4 is a schematic flow chart diagram of a face detection method according to an embodiment of the present application.
Fig. 5 is a schematic block diagram of a neural network architecture according to an embodiment of the present application.
Fig. 6 is a schematic architecture diagram of a face detection module according to an embodiment of the present application.
Fig. 7 is a schematic view of a face region box in the first feature diagram according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of another face detection module according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a concentration module according to an embodiment of the present application.
Fig. 10 is a schematic block diagram of a face detection apparatus according to an embodiment of the present application.
Fig. 11 is a schematic block diagram of a processing unit according to an embodiment of the present application.
Fig. 12 is a schematic hardware configuration of a face detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiments of the present application are applicable to face detection systems, including but not limited to products based on optical face imaging. The face detection system may be applied to various electronic devices with an image acquisition device (such as a camera); the electronic devices may be personal computers, computer workstations, smart phones, tablet computers, smart cameras, media consumption devices, wearable devices, set-top boxes, game consoles, augmented reality (AR)/virtual reality (VR) devices, vehicle-mounted terminals, and the like, and the embodiments disclosed herein are not limited thereto.
It should be understood that the specific examples herein are intended only to facilitate a better understanding of the embodiments of the present application by those skilled in the art and are not intended to limit the scope of the embodiments of the present application.
It should also be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It should also be understood that the various embodiments described in this specification may be implemented alone or in combination, and that the examples herein are not limited in this regard.
Unless defined otherwise, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In order to better understand the solution of the embodiment of the present application, a possible application scenario of the embodiment of the present application will be briefly described with reference to fig. 1.
As shown in fig. 1, an embodiment of the present application provides a system architecture 100. In fig. 1, a data acquisition device 160 is used to acquire training data. For the face detection method of the embodiment of the application, the training data may include a training image or a training video.
After the training data is collected, the data collection device 160 stores the training data in the database 130 and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.
The above-mentioned target model/rule 101 can be used to implement the face detection method of the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. In practical applications, the training data maintained in the database 130 is not necessarily collected by the data collection device 160, but may be received from other devices. It should be noted that the training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training, and the above description should not be taken as a limitation on the embodiments of the present application.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, etc., or may be a server or cloud end, etc. In fig. 1, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140, where the input data may include in embodiments of the present application: the video to be processed or the image to be processed input by the client device 140.
In some embodiments, the client device 140 may be the same device as the executing device 110, for example, the client device 140 may be a terminal device as the executing device 110.
In other embodiments, the client device 140 may be different from the execution device 110; for example, the client device 140 is a terminal device, and the execution device 110 is a cloud, a server, or the like. The client device 140 may interact with the execution device 110 through a communication network of any communication mechanism/communication standard, and the communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
The computing module 111 of the execution device 110 is configured to process input data (e.g., an image to be processed) received by the I/O interface 112. In the process related to the execution of the computation by the computation module 111 of the execution device 110, the execution device 110 may call the data, the code, etc. in the data storage system 150 for the corresponding process, or may store the data, the instruction, etc. obtained by the corresponding process in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the face detection result obtained as described above, to the client device 140, thereby providing the processing result to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in fig. 1, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112; if automatic sending of the input data by the client device 140 requires the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the result output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect, as shown in the figure, the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data, and store them in the database 130. Of course, instead of being collected by the client device 140, the input data input to the I/O interface 112 and the output result output from the I/O interface 112 may be directly stored as new sample data in the database 130 by the I/O interface 112.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided in the embodiments of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings is not limited in any way, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, the training device 120 is used to train the target model/rule 101, where the target model/rule 101 may be the neural network in the embodiment of the present application. Specifically, the neural network in the embodiment of the present application may be a convolutional neural network (CNN), a region-based convolutional neural network (region CNN, RCNN), a faster region-based convolutional neural network (Faster RCNN), or another type of neural network, which is not specifically limited in this application.
Currently, a face detection system generally uses a two-stage neural network architecture, such as the above-mentioned Faster RCNN neural network.
For ease of understanding, the Faster RCNN neural network will be briefly described first with reference to fig. 2 and fig. 3.
Fig. 2 shows a basic framework diagram of the Faster RCNN, and fig. 3 shows a target detection process diagram of the Faster RCNN.
As shown in fig. 2 and fig. 3, the Faster RCNN can be divided into a region proposal network (RPN), a convolutional neural network (CNN), a region of interest pooling (ROI pooling) layer, and a classifier.
The convolutional neural network CNN is used for performing feature convolution on an input image, and extracting a feature map (feature map) of the input image.
The region proposal network RPN is used to extract candidate boxes (region proposals) from the convolved features in the feature map. In some embodiments, a plurality of anchor frames (anchors) are set in the feature map, a softmax function is then used to determine whether each anchor frame is an anchor frame including a detection target (positive anchor) or an anchor frame not including a detection target (negative anchor), and the anchor frames are corrected by bounding box regression, so as to obtain accurate candidate frames.
The ROI pooling is used for receiving the feature images and the candidate frames, and extracting the candidate feature images (proposal feature maps) after integrating the information, so that the subsequent classifier can conveniently perform classification and identification.
And the classifier calculates the category of the candidate frame by using the candidate feature map, and simultaneously, the final accurate position of the detection frame is obtained by frame regression again.
As can be seen from the above description, if face detection is performed using the Faster RCNN network, after the feature map of the input image is obtained by the CNN network, the first stage uses the anchors in the RPN to find the positions of face candidate frames and determine whether each candidate frame is background, and the second stage uses subsequent processing to identify whether each candidate frame is a face and to locate the position of the candidate frame more precisely.
With this traditional face detection method, the face detection process is complex and a dedicated or large-scale server is needed to execute it, which is not conducive to application and popularization on edge computing devices, and the complex face detection process seriously affects detection efficiency.
In addition, in a dark or low-light environment, an input image cannot be captured or the captured input image is of poor quality, so face detection is difficult to carry out in such an environment; moreover, the above face detection method does not provide living body detection information, so the face detection result is inaccurate and the comprehensive performance of face detection is affected.
Based on this, the present application provides an edge-computing-friendly neural network architecture and performs face detection based on this architecture. The face detection process can be carried out efficiently, so that it can run on edge computing devices, and a living body detection result is provided in the face detection process. In addition, the face detection method of the present application is robust to lighting: it is not affected by changes in the light signal and can still perform face detection under low-light or no-light conditions.
Next, the neural network architecture for face detection and the flow of the face detection method provided in the embodiments of the present application are described with reference to fig. 4 to fig. 9.
Fig. 4 shows a schematic flow diagram of a face detection method 200. Alternatively, the execution subject of the face detection method 200 may be the execution device 110 in fig. 1 above.
As shown in fig. 4, the face detection method 200 may include the following steps.
S210: and obtaining a depth map of the target to be detected, and carrying out feature extraction on the depth map to obtain a first feature map.
S220: and carrying out face detection on the first feature map to obtain a face region feature map.
S230: and further extracting the features of the facial region feature map to obtain a second feature map.
S240: and performing living body detection on the second feature map to obtain a living body detection result of the human face.
S250: and outputting at least one face region frame comprising the living face in the depth map according to the living body detection result of the face.
In the embodiment of the application, the target to be detected includes, but is not limited to, any object such as a face, a photo, a video, a three-dimensional model and the like. For example, the object to be detected may be a face of the target user, faces of other users, photographs of users, a surface model with photographs attached, and so on.
As an example, in some embodiments, after the depth map acquisition device acquires the depth map of the object to be detected, the depth map is sent to the processing unit in the execution device to perform subsequent image processing work. Optionally, the depth map acquisition device may be integrated in the execution device, or may be separately provided from the execution device.
In this embodiment of the present application, the depth map (depth image), also referred to as a range image, is a depth image of the target to be detected, where the pixel values in the depth map of the target to be detected represent the distance information between each point on the surface of the target to be detected and the same point or the same plane.
For example, the depth map acquisition device is used for acquiring a depth map of a human face, and pixel values in the depth map represent distances between points on the surface of the human face and the image acquisition module. When the depth map is a gray image, the change of the pixel value of the image can also be expressed as the gray change of the image, so that the gray change of the depth map also corresponds to the depth change of the face, and the geometric shape and the depth information of the face surface are directly reflected.
In some possible embodiments, structured light may be projected onto the target to be detected by a structured light projection module, and the depth map acquisition device receives the structured light signal reflected by the target to be detected and converts the reflected signal to obtain the depth map.
Optionally, the above structured light includes, but is not limited to, a speckle pattern, lattice light or other light signals with a structured pattern, and the structured light projection module may be any device structure that projects structured light, including but not limited to light emitting devices such as a lattice light projector using a vertical cavity surface emitting laser (VCSEL) light source and a speckle structured light projector.
It should be understood that, in the embodiment of the present application, other image acquisition modules capable of acquiring depth information of an object to be detected may be further used to acquire a depth map, for example, an image acquisition module such as a time of flight (TOF) optical module may be used to acquire the depth map, and then the depth map is transmitted to the processing unit.
It should be further understood that in the above steps, point cloud (point cloud) data of the object to be detected may also be acquired and converted into a depth map, and the technical scheme for specifically acquiring the point cloud data of the object to be detected and the specific scheme for converting the point cloud data into the depth map may refer to a method in the related art, which is not specifically limited in the embodiments of the present application.
After the depth map of the target to be detected is obtained, the present application provides a neural network architecture that performs subsequent processing on the depth map to obtain the face region frame of the living face in the depth map.
Fig. 5 shows a schematic block diagram of a neural network architecture 20 of an embodiment of the present application.
As shown in fig. 5, the neural network architecture 20 includes: a first face feature extraction module 21, a face detection module 22, a second face feature extraction module 23, and a concentration module 24.
Specifically, the first face feature extraction module 21 is configured to perform the step S210, and perform feature extraction on a depth map of a target to be detected, so as to obtain at least one first feature map.
In some embodiments, the first face feature extraction module 21 may include at least one convolution layer, and performs convolution calculation on the depth map of the target to be detected so as to extract edge line features, or high-frequency features, in the depth map; if the target to be detected is a face, the first face feature extraction module 21 extracts features such as the lines of the five sense organs and the edge lines of the face.
In some embodiments, each of the at least one convolution layers includes one or more convolution kernels (kernel). Wherein the convolution kernel is also called a filter or feature detector. The matrix obtained by sliding the convolution kernel over the image and computing the dot product is called a convolution feature (convolved feature) or activation map or feature map. For the same input image, convolution kernels of different values will generate different feature maps, so that one or more first feature maps comprising line features can be obtained by one or more convolution kernels. By modifying the values of the convolution kernel, a different first feature map may be detected from the depth map.
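As a small illustrative example (the kernel values and the input size are assumptions), a fixed 3×3 edge kernel applied to a depth map shows how a convolution kernel slides over the image to produce a feature map:

```python
# Illustration only: a fixed 3x3 edge kernel applied to a depth map, showing how
# sliding a convolution kernel over the image produces a feature map. The kernel
# values and the 8x8 input size are assumptions for the example.
import torch
import torch.nn.functional as F

depth_map = torch.rand(1, 1, 8, 8)                      # stand-in depth map
edge_kernel = torch.tensor([[[[-1., -1., -1.],
                              [-1.,  8., -1.],
                              [-1., -1., -1.]]]])        # responds to edge lines
feature_map = F.conv2d(depth_map, edge_kernel, padding=1)
print(feature_map.shape)                                 # torch.Size([1, 1, 8, 8])
```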
It should be appreciated that the convolution kernel may be a 3×3 matrix, a 5×5 matrix, or other size matrix, which is not limited in this embodiment of the present application.
It should also be understood that, in the embodiment of the present application, the number of the convolution layers in the first face feature extraction module 21 may be between one layer and four layers, or may also be more than four layers, where the sizes of the convolution kernels in each layer may be the same or different, and the convolution steps of the convolution kernels may be the same or different.
It should also be appreciated that in embodiments of the present application, the type of at least one convolution layer in the first face feature extraction module 21 may be the same or may be different, including, but not limited to, two-dimensional convolution, three-dimensional convolution, point-by-point convolution (pointwise convolution), depth convolution (depthwise convolution), separable convolution (separable convolution), deconvolution, and/or hole convolution (dilated convolution), among others.
Optionally, after the at least one convolution layer, the first face feature extraction module 21 may further include an activation layer, where the activation layer contains an activation function and is used to perform nonlinear processing on each pixel value in the feature map obtained by convolution. Optionally, the activation functions include, but are not limited to, the rectified linear unit (ReLU) function, the exponential linear unit (ELU) function, and several variants of the ReLU function, such as the leaky rectified linear unit (LReLU), the parametric rectified linear unit (PReLU) and the randomized rectified linear unit (RReLU). In the feature map processed by the activation function, the pixel values are sparse; by making the neural network structure sparse, the activation layer allows relevant features to be better mined and the training data to be better fitted.
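For reference, the activation variants named above are available as standard PyTorch layers; pairing a convolution layer with an activation layer is shown below purely for illustration.

```python
# The activation layer applies a nonlinearity to every pixel value of the
# convolved feature map; a few of the variants named above, in PyTorch.
import torch.nn as nn

activations = {
    "ReLU": nn.ReLU(),
    "ELU": nn.ELU(),
    "LReLU": nn.LeakyReLU(negative_slope=0.01),
    "PReLU": nn.PReLU(),
    "RReLU": nn.RReLU(),
}
conv_plus_activation = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                     activations["ReLU"])
```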
Since the first face feature extraction module 21 is located at the front end of the neural network, the multi-layer structure in the first face feature extraction module 21 may also be called a shallow layer of the neural network, and performs the preliminary processing on the depth map.
Optionally, after the step S210 is executed by the first face feature extraction module 21, the face detection module 22 in the neural network model continues to execute the step S220, and face detection (face detection) is performed on the first feature map, so as to obtain a face region feature map.
By way of example, fig. 6 shows a schematic diagram of the architecture of a face detection module 22.
As shown in fig. 6, the face detection module 22 may include: a convolutional layer network 221, a face-range convolutional layer 222, and a face-center convolutional layer 223.
Specifically, the convolutional layer network 221 includes at least two convolution layers, which further perform convolution operations on the first feature map output by the first face feature extraction module 21 so as to extract more face features from the first feature map.
It will be appreciated that in the convolutional layer network 221, the size, type and values of the convolution kernel in each convolution layer may be different, so as to extract feature information of different dimensions from the first feature map and combine it into a feature map including face features. Meanwhile, the convolutional layer network 221 can control process parameters such as the number of channels and the size of the feature map, which facilitates unified processing by the subsequent layers of the neural network.
Optionally, as shown in fig. 6, the face detection module 22 may further include a face feature concentration layer 224. The face feature concentration layer 224 is an attention mechanism module, which may be a spatial attention mechanism module, a channel attention mechanism module, or an attention mechanism module combining spatial and channel attention.
As an example, the face feature concentration layer 224 is a spatial attention mechanism module focusing on the spatial features of each feature map. In particular, because the depth values of the five-sense-organ parts such as the eyes and nose of the face are inconsistent with the depth values of other planes, the face feature concentration layer 224 is configured to focus on the positions of the five sense organs of the face, and weight distribution is performed on the pixel values of each feature map to highlight the five sense organs of the face in the feature map. By adding the face feature concentration layer 224 to the face detection module 22, the convolutional layer network 221 can be made to obtain a feature map that better highlights the facial features, so that the accuracy of subsequent face position detection and living body detection is increased.
Optionally, if the face feature concentration layer 224 is a spatial attention mechanism module, it includes but is not limited to a spatial transformer network (STN) model, and it may also be any spatial attention mechanism module in the related art, the aim being to extract the planar features of the feature map output by each layer of the convolutional layer network 221.
Further, after the convolution and concentration processing of the convolutional layer network 221 and the face feature concentration layer 224, an intermediate feature map is obtained; after the convolution processing of the face range convolution layer 222 and the face center convolution layer 223, a face region prediction map (scale map) and a face center prediction map (center map) are obtained from the intermediate feature map, where the face region prediction map is used to represent the region size of the face in the first feature map, and the face center prediction map is used to represent the center position of the face in the first feature map.
Optionally, the face range convolution layer 222 and the face center convolution layer 223 are two 1×1 convolution layers, where the face center prediction map obtained by the face center convolution layer 223 is a face center point heat map (center heat map).
As shown in fig. 7, the face region frame in the first feature map, that is, the face region feature map, may be obtained by mapping the detection result of the face region prediction map and the face center point heat map to the first feature map.
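A minimal decoding sketch of this mapping step is given below; the single-face case, the peak-picking threshold and the log-size encoding of the scale map are illustrative assumptions, not the claimed procedure.

```python
# Sketch of mapping the face center heat map and the face region prediction map
# (scale map) back to a face region box. The single-face case, the peak-picking
# threshold and the log-size encoding are assumptions for illustration.
import torch

def decode_face_box(center_map, scale_map, threshold=0.5):
    # center_map, scale_map: tensors of shape (H, W) on the first feature map grid
    score, idx = torch.max(center_map.flatten(), dim=0)
    if score < threshold:
        return None                        # no face center above the threshold
    _, w = center_map.shape
    cy, cx = divmod(idx.item(), w)
    size = torch.exp(scale_map[cy, cx])    # region size stored as a log value
    half = size.item() / 2
    # (x1, y1, x2, y2) face region box in feature map coordinates
    return (cx - half, cy - half, cx + half, cy + half)

box = decode_face_box(torch.rand(32, 32), torch.zeros(32, 32))
```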
The face detection module 22 shown in fig. 6 is a neural network structure in the actual face detection process, where the convolution kernel parameters of each convolution layer and other related parameters are all required to be obtained through neural network training. In order to improve accuracy and robustness in the actual face detection process, fig. 8 shows a schematic diagram of the structure of a face detection module 22 in the training phase.
As shown in fig. 8, compared with the face detection module 22 in fig. 6, in the training stage the face detection module 22 further includes a center adjustment convolution layer 225, configured to adjust the position coordinates of the face center obtained by the face center convolution layer 223. Optionally, in the embodiment of the present application, the center adjustment convolution layer 225 has two channels, which are respectively responsible for the offset in the horizontal direction and the offset in the vertical direction.
In the training stage, face image samples in different scenes and from different angles are collected, and the face frames in the samples (namely the ground truth) are converted into center point and range label true values. For example, for the face center point, the position where the target center falls is assigned 1 (i.e., a positive sample), and the other positions are assigned 0 (i.e., negative samples); for the face region, the position where the center of the target falls is assigned the log value of the region size, and the other positions are assigned 0.
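For illustration, constructing such center and range label maps for a single ground-truth face frame might look as follows; the feature map stride and the single-face case are assumptions.

```python
# Illustration of the label maps described above: the position where the face
# center falls is set to 1 on the center map (positive sample) and the log of
# the region size is stored at that position on the range (scale) map. The
# feature map stride and the single-face case are assumptions.
import math
import torch

def build_targets(face_box, map_h, map_w, stride=4):
    x1, y1, x2, y2 = face_box                            # ground-truth face frame (pixels)
    center_gt = torch.zeros(map_h, map_w)
    scale_gt = torch.zeros(map_h, map_w)
    cx = int((x1 + x2) / 2 / stride)
    cy = int((y1 + y2) / 2 / stride)
    center_gt[cy, cx] = 1.0                              # positive sample at the center
    scale_gt[cy, cx] = math.log(max(x2 - x1, y2 - y1))   # log value of the region size
    return center_gt, scale_gt

center_gt, scale_gt = build_targets((40, 60, 140, 180), map_h=64, map_w=64)
```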
The loss function includes three parts: the loss of the predicted center point position, the loss of the predicted face frame size, and the loss of the predicted center point offset. After the three parts are weighted respectively, their sum is the final loss function of the face detection module 22, and the face detection module 22 is trained with this loss function to obtain the parameters of each layer in the module.
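Symbolically, the weighted combination described above can be written as below; the individual loss terms and the weights are not specified in this embodiment and are therefore only denoted generically.

```latex
L_{\text{total}} = \lambda_{1}\,L_{\text{center}} + \lambda_{2}\,L_{\text{size}} + \lambda_{3}\,L_{\text{offset}}
```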
According to the scheme of the embodiment of the application, in the training process, the coordinates of the predicted face center position are adjusted by setting the center adjustment convolution layer 225 and the like so as to increase the robustness and accuracy of the center position prediction, and in the actual face detection process, the face region is obtained only by using the face range convolution layer 222 and the face center convolution layer 223, so that the efficiency of the detection process can be improved, and the face detection speed can be accelerated.
Further, after the face region feature map is obtained through the face detection in the step S220, step S230 is executed to perform further feature extraction on the face region feature map, so as to obtain at least one second feature map.
Specifically, this step S230 may perform feature extraction by the second face feature extraction module 23 in the neural network 20.
Optionally, the second face feature extraction module 23 may include at least one convolution layer, where the at least one convolution layer is configured to perform convolution calculation on the face region feature map, so as to extract living features in the face region feature map, such as facial texture features, five-sense organ detail features, and the like, for distinguishing living face regions from non-living face regions, in other words, after processing by the second face feature extraction module 23, the second feature map of the living face region has a larger feature difference from the second feature map of the non-living face.
Similar to the first face feature extraction module described above, the convolution kernel in the at least one convolution layer of the second face feature extraction module may be a 3×3 matrix, a 5×5 matrix, or a matrix of another size, which is not limited in this embodiment of the present application.
Optionally, the number of the convolution layers in the second face feature extraction module may be between one layer and four layers, or may also be more than four convolution layers, where the sizes of the multiple convolution kernels in each convolution layer may be the same or different, and the convolution step sizes of the multiple convolution kernels may be the same or different.
Alternatively, the type of at least one convolution layer may be the same or may be different, including but not limited to two-dimensional convolution, three-dimensional convolution, point-by-point convolution, depth convolution, separable convolution, deconvolution, and/or hole convolution, among others.
Optionally, after at least one convolution layer, an excitation layer may further be included, where the excitation layer includes an excitation function for performing a nonlinear processing on each pixel value in the feature map obtained by convolution.
It can be understood that, in the training stage, the relevant parameters in the second face feature extraction module can be trained with a number of living face region image samples and non-living face region image samples, so as to obtain an optimized model of the second face feature extraction module of the embodiment of the present application. In this way, in the actual face detection process, a feature map capable of distinguishing a living body from a non-living body can be obtained through the processing of the second face feature extraction module.
Further, after the step S230, a step S240 is performed to perform living detection on the second feature map, and determine whether the face in the second feature map is a living face, so as to obtain a living detection result of the face.
Specifically, in the embodiment of the present application, step S240 may be performed by the concentration module 24 in the neural network 20.
It can be understood that the concentration module 24 is an attention module that focuses on the living body features in the second feature map.
Alternatively, in the embodiment of the present application, the concentration module 24 may be a space-based attention mechanism module, a channel-based attention mechanism module, or an attention mechanism module combining space and channel.
As an example, the concentration module 24 is an attention mechanism module combining space and channel; it includes, but is not limited to, a convolutional block attention module (CBAM) model, and may also be any other space-and-channel-combined attention mechanism module in the related art, the intention being to extract planar features of multiple feature maps and key features in different channels.
Fig. 9 shows a schematic structural diagram of the concentration module 24.
Specifically, the concentration module 24 includes multiple convolution layers, each of which is followed by an attention mechanism module to generate an optimized feature map.
As shown in fig. 9, a channel attention module 244 is added after the first convolution layer 241, a spatial attention module 245 is added after the second convolution layer 242, and the target feature map is then output through the third convolution layer 243; classification judgment is performed on the basis of the target feature map to determine whether it corresponds to a living face.
As an example, the processing procedure from the second feature map to the target feature map is described below in conjunction with the structure of the concentration module 24 in fig. 9.
The second feature map is passed through the first convolution layer 241 to obtain N (N-channel) first intermediate feature maps, which are input as input feature maps to the channel attention module 244.
In the channel attention module 244, the N first intermediate feature maps are compressed in the spatial dimension by maximum pooling (max pooling) and average pooling (average pooling), respectively, to obtain two 1×1×N first intermediate vectors. The two 1×1×N first intermediate vectors are each processed by a shared multi-layer perceptron (MLP) to obtain two 1×1×N second intermediate vectors. The second intermediate vectors output by the MLP are subjected to element-wise addition and then passed through an activation function, such as a sigmoid function, to generate the channel attention feature map. The channel attention feature map and the first intermediate feature maps are subjected to an element-wise multiplication operation to obtain N (N-channel) second intermediate feature maps, which are input to the second convolution layer 242.
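A channel attention step of the kind just described can be sketched as follows (a PyTorch-style illustration on our part, with an assumed reduction ratio; the patent does not give concrete dimensions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: spatial pooling, shared MLP, addition, sigmoid, rescale."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # N x C x 1 x 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # N x C x 1 x 1
        self.mlp = nn.Sequential(                 # shared multi-layer perceptron (1x1 convs)
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise addition of the two MLP outputs, then sigmoid activation.
        attention = torch.sigmoid(self.mlp(self.max_pool(x)) + self.mlp(self.avg_pool(x)))
        # Element-wise multiplication with the input feature maps.
        return x * attention
```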
Optionally, in some embodiments, the second convolution layer 242 performs a convolution operation on the N second intermediate feature maps and inputs the convolved feature maps into the spatial attention module 245 as input feature maps; in other embodiments, the second convolution layer 242 is omitted and the N second intermediate feature maps are input directly into the spatial attention module 245 as input feature maps.
In the spatial attention module 245, the input feature maps, for example the N (N-channel) second intermediate feature maps, are compressed in the channel dimension by maximum pooling (max pooling) and average pooling (average pooling) to obtain two W×H×1 third intermediate feature maps, where W and H are the width and height of the second intermediate feature maps. The two third intermediate feature maps are then concatenated (concat) along the channel dimension, and a convolution operation reduces the result to a W×H×1 fourth intermediate feature map. The fourth intermediate feature map is then processed by an activation function, such as a sigmoid activation function, to generate the spatial attention feature map, and the spatial attention feature map and the second intermediate feature maps are subjected to an element-wise multiplication operation to obtain the N (N-channel) fifth intermediate feature maps that are input to the third convolution layer 243.
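Correspondingly, the spatial attention step can be sketched as follows (again a PyTorch-style assumption; the 7×7 convolution kernel is a common CBAM choice, not a value specified by the patent):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel pooling, concat, conv to one channel, sigmoid, rescale."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compress the channel dimension with maximum pooling and average pooling.
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # N x 1 x H x W
        avg_map = torch.mean(x, dim=1, keepdim=True)     # N x 1 x H x W
        # Channel-wise concatenation, convolution down to one channel, sigmoid activation.
        attention = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        # Element-wise multiplication with the input feature maps.
        return x * attention
```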
Optionally, the fifth intermediate feature maps are further convolved by the third convolution layer 243 and finally output as the target feature map. The target feature map integrates spatial features and channel features, so performing living body detection on the basis of the target feature map offers higher reliability and robustness.
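Combining the pieces, the concentration module as a whole could look like the sketch below; it reuses the hypothetical ChannelAttention and SpatialAttention classes from the two sketches above, and the pooling-plus-linear classifier head is purely illustrative:

```python
import torch
import torch.nn as nn

class ConcentrationModule(nn.Module):
    """Hypothetical conv1 -> channel attention -> conv2 -> spatial attention -> conv3 stack."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # layer 241
        self.channel_attention = ChannelAttention(channels)                   # module 244
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # layer 242
        self.spatial_attention = SpatialAttention()                           # module 245
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # layer 243
        # Illustrative classifier: global pooling + linear layer -> live / non-live score.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, second_feature_map: torch.Tensor) -> torch.Tensor:
        x = self.conv1(second_feature_map)      # first intermediate feature maps
        x = self.channel_attention(x)           # second intermediate feature maps
        x = self.conv2(x)
        x = self.spatial_attention(x)           # fifth intermediate feature maps
        target_feature_map = self.conv3(x)
        return torch.sigmoid(self.classifier(target_feature_map))  # liveness probability
```

In this sketch the final sigmoid yields a liveness probability for the face region, which corresponds to the classification judgment described above.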
In the embodiment of the application, the lightweight CBAM module is adopted, so the target feature map can be obtained more simply and effectively than with an attention module focusing only on channels or only on space.
In the embodiment of the application, squeeze-and-excitation blocks (SE blocks) are omitted from the above neural network architecture of convolution layers and excitation layers, so a lightweight neural network architecture can be realized, which is also convenient to run on edge computing devices.
Meanwhile, the concentration module added in the embodiment of the application helps the living body detection focus on the face features, improving the accuracy of the living body detection result.
Based on the scheme of the embodiment of the application, the face position in the target to be detected can be output through the neural network while the living body detection result of the face is output synchronously.
The method embodiments of face detection in the present application are described in detail above with reference to fig. 4 to 9, and the apparatus embodiments of face detection in the present application are described in detail below with reference to fig. 10 to 12, and it should be understood that the apparatus embodiments and the method embodiments correspond to each other, and similar descriptions may refer to the method embodiments.
Fig. 10 is a schematic block diagram of a face detection apparatus 20 according to an embodiment of the present application, the face detection apparatus 20 corresponding to the face detection method 200 described above.
As shown in fig. 10, the face detection apparatus 20 includes:
an obtaining unit 210, configured to obtain a depth map of a target to be detected;
the processing unit 220 is configured to perform image processing on the depth map to obtain a face region frame including a living face in the depth map.
The output unit 230 is configured to output a face region frame including a living face in the depth map.
In the embodiment of the application, the image used for face detection is the depth map of the target to be detected, so the influence of ambient light on face detection can be avoided, and face detection can be performed effectively under conditions such as low illumination, no illumination, or backlighting.
Specifically, as shown in fig. 11, the processing unit 220 may include:
A first face feature extraction module 21, configured to perform feature extraction on the depth map to obtain a first feature map;
the face detection module 22 is configured to perform face detection on the first feature map to obtain a face region feature map;
the second face feature extraction module 23 performs feature extraction on the face region feature map to obtain a second feature map;
the concentration module 24 performs living body detection on the second feature map to obtain a living body detection result of the human face.
Optionally, the processing unit 220 may include the neural network 20 in the above method embodiment to process the depth image of the target to be detected.
In the embodiment of the present application, the neural network 20 in the processing unit 220 is a lightweight neural network architecture with high operation efficiency; it can perform living body detection while performing face detection and output a face region frame including a living face in a single pass, thereby improving the accuracy of face detection.
Specifically, the first face feature extraction module 21, the face detection module 22, the second face feature extraction module 23, and the concentration module 24 in the processing unit 220 correspond to the first face feature extraction module 21, the face detection module 22, the second face feature extraction module 23, and the concentration module 24 in the neural network 20, respectively.
It can be appreciated that, in the embodiment of the present application, the relevant technical solutions of the first face feature extraction module 21, the face detection module 22, the second face feature extraction module 23, and the concentration module 24 may be referred to the relevant description above, and will not be repeated here.
In some possible embodiments, after the first facial feature extraction module 21 performs feature extraction on the depth map of the target to be detected, the obtained first feature map includes edge line features in the depth map.
In some possible embodiments, the first face feature extraction module 21 may include no more than four convolution layers, which ensures extraction performance while improving the running speed of the module.
Referring to fig. 6 and related description above, in some possible implementations, the face detection module 22 may include: a convolutional layer network 221, a face range convolutional layer 222, and a face center convolutional layer 223.
Optionally, the convolutional layer network 221 is configured to perform convolutional calculation on the first feature map to obtain an intermediate feature map;
the face range convolution layer 222 and the face center convolution layer 223 are respectively used for performing convolution calculation on the intermediate feature map to obtain a face region prediction map and a face center prediction map;
the face region prediction map and the face center prediction map are used to map the detection result onto the first feature map to obtain the face region feature map.
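The mapping step itself is not spelled out in the text; one plausible reading (entirely our assumption — the helper name crop_face_region and the decoding rule are hypothetical) takes the peak of the face center prediction map, reads the predicted extent from the face range prediction map at that location, and crops the corresponding window of the first feature map:

```python
import torch

def crop_face_region(first_feature_map: torch.Tensor,
                     range_map: torch.Tensor,
                     center_map: torch.Tensor) -> torch.Tensor:
    """Hypothetical decoding: peak of the face center heat map plus the predicted
    width/height from the face range map select a window of the first feature map.

    Assumed shapes: first_feature_map C x H x W, range_map 2 x H x W, center_map 1 x H x W.
    """
    H, W = center_map.shape[-2], center_map.shape[-1]
    # Location of the strongest face-center response.
    cy, cx = divmod(int(torch.argmax(center_map)), W)
    # Predicted face extent at that location (at least one cell).
    w = int(range_map[0, cy, cx].clamp(min=1))
    h = int(range_map[1, cy, cx].clamp(min=1))
    y0, y1 = max(cy - h // 2, 0), min(cy + h // 2 + 1, H)
    x0, x1 = max(cx - w // 2, 0), min(cx + w // 2 + 1, W)
    return first_feature_map[:, y0:y1, x0:x1]
```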
Optionally, referring to fig. 6, the face detection module 22 further includes: a face feature concentration layer 224;
the convolution layer network 221 and the face feature concentration layer 224 are configured to perform convolution calculation on the first feature map to obtain an intermediate feature map;
the face feature concentration layer 224 is configured to perform weight distribution on the pixel values of the intermediate feature map, so as to highlight the facial features in the intermediate feature map.
As an example, the face feature concentration layer 224 is a space-based attention module.
In addition, referring to fig. 8 and related description above, in some possible embodiments, the face detection module 22 may further include a centering convolution layer 225, where parameters of the face detection module 22 are trained via a neural network.
In the neural network training phase, in the face detection module 22, the convolutional layer network 221 is further configured to: carrying out convolution calculation on a sample image to obtain a first sample feature image, wherein the sample image is marked with a face region true value and a face center true value;
The face range convolution layer 222, the face center convolution layer 223, and the center adjustment convolution layer 225 are configured to: respectively carrying out convolution calculation on the first sample feature map to obtain a face region predicted value, a face center predicted value and a face center deviation predicted value;
the face region predicted value, the face center offset predicted value, the face region true value, and the face center true value are then used to calculate a loss function to obtain the parameters of the face detection module 22.
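The form of the loss function is not specified; one common choice for center-point detectors (our own assumption, not taken from the patent) combines a cross-entropy term on the center heat map with L1 terms on the range and center offset, with the offset ground truth typically derived from the quantization of the face center true value:

```python
import torch
import torch.nn.functional as F

def detection_loss(range_pred, center_pred, offset_pred,
                   range_true, center_true, offset_true,
                   w_range: float = 1.0, w_offset: float = 1.0) -> torch.Tensor:
    """Hypothetical loss: BCE on the (sigmoid-activated) center heat map + L1 on range and offset."""
    center_loss = F.binary_cross_entropy(center_pred, center_true)  # center_pred assumed in [0, 1]
    range_loss = F.l1_loss(range_pred, range_true)
    offset_loss = F.l1_loss(offset_pred, offset_true)
    return center_loss + w_range * range_loss + w_offset * offset_loss
```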
As an example, the face range convolution layer 222 and the face center convolution layer 223 are two 1×1 convolution layers, and the face center prediction map is a face center heat map.
Further, the feature extraction is performed on the face region feature map by the second face feature extraction module 23, and the obtained second feature map includes the detail features of the face.
As an example, the second feature map includes facial features of a face.
In some possible embodiments, the second face feature extraction module 23 includes no more than 4 convolution layers, which increases the operation speed of the module while ensuring the extraction performance.
Referring to fig. 9 and the related description above, in some possible embodiments, in the processing unit 220, the concentration module 24 is an attention mechanism module combining space and channels.
As an example, the concentration module 24 includes: a multi-layer convolution layer (the first, second, and third convolution layers 241, 242, 243), a channel attention module 244, and a spatial attention module 245;
a first convolution layer 241 in the multi-layer convolution layers carries out convolution calculation on the second feature map to obtain a first intermediate feature map;
processing the first intermediate feature map with a channel attention module 244 to obtain a channel attention feature map;
performing convolution calculation on the channel attention feature map and the first intermediate feature map by adopting a second convolution layer 242 to obtain a second intermediate feature map;
processing the second intermediate feature map by using a spatial attention module 245 to obtain a spatial attention feature map;
performing convolution calculation on the space attention feature map and the second intermediate feature map by adopting a third convolution layer 243 to obtain a target feature map;
and obtaining a living body detection result of the human face based on the target feature map, wherein the target feature map comprises living body features of the human face.
Since the processing unit 220 performs the face detection operation using a lightweight neural network in the embodiment of the present application, in some embodiments, the face detection device 20 may be an edge operation device.
Fig. 12 is a schematic hardware structure of the face detection apparatus according to the embodiment of the present application. The face detection apparatus 30 shown in fig. 12 (the face detection apparatus 30 may be a computer device in particular) includes a memory 310, a processor 320, a communication interface 330, and a bus 340. Wherein the memory 310, the processor 320, and the communication interface 330 are communicatively coupled to each other via a bus 340.
The memory 310 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 310 may store a program; when the program stored in the memory 310 is executed by the processor 320, the processor 320 and the communication interface 330 are configured to perform the steps of the face detection method of the embodiments of the present application.
The processor 320 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits for executing related programs, so as to perform the functions required by the modules in the face detection apparatus of the embodiments of the present application or to perform the face detection method of the embodiments of the present application.
The processor 320 may also be an integrated circuit chip with signal processing capability. In implementation, the steps of the face detection method of the present application may be completed by integrated logic circuits of hardware in the processor 320 or by instructions in the form of software. The processor 320 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be embodied as being executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 310, and the processor 320 reads the information in the memory 310 and, in combination with its hardware, performs the functions required by the modules included in the face detection apparatus of the embodiment of the present application, or performs the face detection method of the embodiment of the present application.
The communication interface 330 enables communication between the apparatus 30 and other devices or communication networks using a transceiver apparatus such as, but not limited to, a transceiver. For example, the input data may be acquired through the communication interface 330.
Bus 340 may include a path to transfer information between various components of device 30 (e.g., memory 310, processor 320, communication interface 330).
It should be noted that although the apparatus 30 shown in fig. 12 only shows the memory 310, the processor 320, the communication interface 330, and the bus 340, those skilled in the art will appreciate that, in a particular implementation, the apparatus 30 also includes other devices necessary for proper operation. Also, as will be appreciated by those skilled in the art, the apparatus 30 may further include hardware devices that perform other additional functions, as required. Furthermore, it will be appreciated by those skilled in the art that the apparatus 30 may also include only the components necessary to implement the embodiments of the present application, and not necessarily all of the components shown in fig. 12.
It should be understood that the face detection apparatus 30 may correspond to the face detection apparatus 20 in fig. 10 described above, the functions of the processing unit 220 in the face detection apparatus 20 may be implemented by the processor 320, and the functions of the acquisition unit 210 and the output unit 230 may be implemented by the communication interface 330. To avoid repetition, detailed descriptions are omitted here as appropriate.
The embodiment of the application also provides a processing device, which comprises a processor and an interface; the processor is configured to perform the method of face detection in any of the method embodiments described above.
It should be understood that the processing means may be a chip. For example, the processing device may be a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller unit (MCU), a programmable logic device (PLD), or another integrated chip.
The embodiment of the application also provides a platform system which comprises the face detection device.
The present application also provides a computer readable medium having stored thereon a computer program which, when executed by a computer, implements the method of any of the method embodiments described above.
The present application also provides a computer program product which, when executed by a computer, implements the method of any of the method embodiments described above.
The embodiment of the application also provides electronic equipment, which can comprise the face recognition device of the embodiment of the application.
For example, the electronic device may be a smart door lock, a mobile phone, a computer, an access control system, or other equipment that needs to apply face recognition. The face recognition device comprises the software and hardware for face recognition in the electronic device.
Optionally, the electronic device may further include a depth map acquisition device.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The terms "unit," "module," "system," and the like as used in this specification are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between 2 or more computers. Furthermore, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with one another in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application may be embodied in essence or a part contributing to the prior art or a part of the technical solutions, or in the form of a software product, which is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (28)

1. A method of face detection, comprising:
obtaining a depth map of a target to be detected, and carrying out feature extraction on the depth map to obtain a first feature map;
performing face detection on the first feature map to obtain a face region feature map, including:
and carrying out face detection on the first feature map by adopting a face detection module to obtain a face region feature map, wherein the face detection module comprises: a convolution layer network, a face range convolution layer and a face center convolution layer; the carrying out face detection on the first feature map by adopting the face detection module to obtain the face region feature map comprises:
performing convolution calculation on the first feature map by adopting the convolution layer network to obtain a first intermediate feature map;
the face range convolution layer and the face center convolution layer are adopted to respectively carry out convolution calculation on the first intermediate feature map, so as to obtain a face region prediction map and a face center prediction map;
obtaining the face region feature map in the first feature map according to the face region prediction map and the face center prediction map;
extracting features of the facial region feature map to obtain a second feature map;
Performing living body detection on the second feature map to obtain a living body detection result of the human face, wherein the living body detection result comprises the following steps:
performing living body detection on the second feature map by adopting an concentration module to obtain a living body detection result of the human face, wherein the concentration module comprises: a multi-layer convolution layer, a channel attention module, and a spatial attention module;
the adopting the concentration module to carry out living body detection on the second feature map to obtain a living body detection result of the human face, comprising the following steps:
performing convolution calculation on the second feature map by adopting a first convolution layer to obtain a first intermediate feature map;
processing the first intermediate feature map by adopting the channel attention module to obtain a channel attention feature map;
performing convolution calculation on the channel attention feature map and the first intermediate feature map by adopting a second convolution layer to obtain a second intermediate feature map;
processing the second intermediate feature map by adopting the spatial attention module to obtain a spatial attention feature map;
performing convolution calculation on the space attention feature map and the second intermediate feature map by adopting a third convolution layer to obtain a target feature map;
classifying and judging based on the target feature map to obtain a living body detection result of the human face, wherein the target feature map comprises living body features of the human face;
And outputting a human face region frame comprising the living human face in the depth map according to the living body detection result of the human face.
2. The method of claim 1, wherein the feature extracting the depth map to obtain a first feature map includes:
and carrying out feature extraction on the depth map by adopting a first face feature extraction module to obtain the first feature map, wherein the first feature map comprises edge line features in the depth map.
3. The method of claim 2, wherein the number of layers of the convolution layer of the first face feature extraction module is no greater than 4.
4. The method of claim 1, wherein the face detection module further comprises: a face feature concentration layer;
the face feature concentration layer is used for carrying out weight distribution on the pixel values of the intermediate feature image so as to highlight the face five-sense organ features in the intermediate feature image.
5. The method of claim 4, wherein the facial feature concentration layer is a space-based attention module.
6. The method as recited in claim 1, further comprising:
and training the neural network of the face detection module to obtain parameters of the face detection module.
7. The method of claim 6, wherein the face detection module further comprises a central adjustment convolution layer, wherein the training the face detection module with the neural network to obtain the parameters of the face detection module comprises:
acquiring a sample image, wherein the sample image is marked with a face region true value and a face center true value;
carrying out convolution calculation on the sample image by adopting the convolution layer network to obtain a first sample feature map;
the face range convolution layer, the face center convolution layer and the center adjustment convolution layer are adopted to respectively carry out convolution calculation on the first sample feature map, so as to obtain a face region predicted value, a face center predicted value and a face center deviation predicted value;
and calculating a loss function according to the face region predicted value, the face center deviation predicted value, the face region true value and the face center true value to obtain parameters of the face detection module.
8. The method of any of claims 1 to 7, wherein the face range convolution layer and the face center convolution layer are two 1 x 1 convolution layers, wherein the face center prediction graph is a face center heat graph.
9. The method according to any one of claims 1 to 7, wherein the feature extraction of the face region feature map to obtain a second feature map includes:
and carrying out feature extraction on the face region feature map by adopting a second face feature extraction module to obtain a second feature map, wherein the second feature map comprises detail features of the face.
10. The method of claim 9, wherein the number of layers of the convolution layers of the second face feature extraction module is no greater than 4.
11. The method of claim 9, wherein the second feature map includes facial features of a human face.
12. The method of any one of claims 1 to 7, wherein the method is run on an edge-operated device.
13. A device for face detection, comprising:
the acquisition unit is used for acquiring a depth map of the target to be detected;
the first face feature extraction module is used for carrying out feature extraction on the depth map to obtain a first feature map;
the face detection module is used for carrying out face detection on the first feature map to obtain a face region feature map, and the face detection module comprises: a convolutional layer network, a face range convolutional layer and a face center convolutional layer;
The convolution layer network is used for carrying out convolution calculation on the first feature map to obtain an intermediate feature map;
the face range convolution layer and the face center convolution layer are respectively used for carrying out convolution calculation on the intermediate feature map to obtain a face region prediction map and a face center prediction map;
the face region prediction graph and the face center prediction graph are used for mapping detection results of the face region prediction graph and the face center prediction graph into the first feature graph to obtain the face region feature graph;
the second face feature extraction module is used for carrying out feature extraction on the face region feature map to obtain a second feature map;
the concentration module is used for performing living body detection on the second feature map to obtain a living body detection result of the human face, wherein the concentration module comprises: a multi-layer convolution layer, a channel attention module, and a spatial attention module;
a first convolution layer in the multi-layer convolution layers carries out convolution calculation on the second feature map to obtain a first intermediate feature map;
processing the first intermediate feature map by adopting the channel attention module to obtain a channel attention feature map;
performing convolution calculation on the channel attention feature map and the first intermediate feature map by adopting a second convolution layer to obtain a second intermediate feature map;
Processing the second intermediate feature map by adopting the spatial attention module to obtain a spatial attention feature map;
performing convolution calculation on the space attention feature map and the second intermediate feature map by adopting a third convolution layer to obtain a target feature map;
obtaining a living body detection result of the human face based on the target feature map, wherein the target feature map comprises living body features of the human face;
and the output module is used for outputting a face region frame comprising the living face in the depth map according to the living body detection result of the face.
14. The apparatus of claim 13, wherein the first feature map comprises edge line features in the depth map.
15. The apparatus according to claim 13 or 14, wherein the number of layers of the convolution layers of the first face feature extraction module is no greater than 4.
16. The apparatus of claim 13, wherein the face detection module further comprises: a face feature concentration layer;
the face feature concentration layer is used for carrying out weight distribution on the pixel values of the intermediate feature image so as to highlight the face five-sense organ features in the intermediate feature image.
17. The apparatus of claim 16, wherein the facial feature concentration layer is a space-based attention module.
18. The apparatus of claim 13, wherein the parameters of the face detection module are trained via a neural network.
19. The apparatus of claim 18, wherein the face detection module further comprises a center-adjustment convolution layer;
the convolutional layer network is further configured to: carrying out convolution calculation on a sample image to obtain a first sample feature image, wherein the sample image is marked with a face region true value and a face center true value;
the face range convolution layer, the face center convolution layer and the center adjustment convolution layer are used for: respectively carrying out convolution calculation on the first sample feature map to obtain a face region predicted value, a face center predicted value and a face center deviation predicted value;
the face region predicted value, the face center offset predicted value, the face region true value and the face center true value are used for calculating a loss function to obtain parameters of the face detection module.
20. The apparatus of claim 13, wherein the face range convolution layer and the face center convolution layer are two 1 x 1 convolution layers, wherein the face center prediction map is a face center heat map.
21. The apparatus according to claim 13 or 14, wherein the second feature map comprises detail features of a face.
22. The apparatus of claim 21, wherein the second face feature extraction module includes a number of convolutional layers no greater than 4.
23. The apparatus of claim 21, wherein the second feature map includes facial features of a human face.
24. The device according to claim 13 or 14, characterized in that the device is an edge computing device.
25. An electronic device, comprising:
a face detection apparatus as claimed in any one of claims 13 to 24.
26. The electronic device of claim 25, wherein the electronic device further comprises:
and the depth map acquisition device.
27. A computer readable storage medium storing program instructions which, when executed by a computer, perform the method of face detection as claimed in any one of claims 1 to 12.
28. A computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of face detection as claimed in any one of claims 1 to 12.
CN202010917905.0A 2020-09-03 2020-09-03 Face detection method and device and electronic equipment Active CN112036339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010917905.0A CN112036339B (en) 2020-09-03 2020-09-03 Face detection method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN112036339A CN112036339A (en) 2020-12-04
CN112036339B true CN112036339B (en) 2024-04-09

Family

ID=73592018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010917905.0A Active CN112036339B (en) 2020-09-03 2020-09-03 Face detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112036339B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396035A (en) * 2020-12-07 2021-02-23 国网电子商务有限公司 Object detection method and device based on attention detection model
CN112613401A (en) * 2020-12-22 2021-04-06 贝壳技术有限公司 Face detection method and device, electronic equipment and storage medium
CN112686191B (en) * 2021-01-06 2024-05-03 中科海微(北京)科技有限公司 Living body anti-counterfeiting method, system, terminal and medium based on three-dimensional information of human face
CN113011304A (en) * 2021-03-12 2021-06-22 山东大学 Human body posture estimation method and system based on attention multi-resolution network
CN113283388B (en) * 2021-06-24 2024-05-24 中国平安人寿保险股份有限公司 Training method, device, equipment and storage medium of living body face detection model
CN117612231B (en) * 2023-11-22 2024-06-25 中化现代农业有限公司 Face detection method, device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019114580A1 (en) * 2017-12-13 2019-06-20 深圳励飞科技有限公司 Living body detection method, computer apparatus and computer-readable storage medium
CN109508654A (en) * 2018-10-26 2019-03-22 中国地质大学(武汉) Merge the human face analysis method and system of multitask and multiple dimensioned convolutional neural networks
CN111126358A (en) * 2020-02-25 2020-05-08 京东方科技集团股份有限公司 Face detection method, face detection device, storage medium and equipment
CN111563466A (en) * 2020-05-12 2020-08-21 Oppo广东移动通信有限公司 Face detection method and related product
CN111611934A (en) * 2020-05-22 2020-09-01 北京华捷艾米科技有限公司 Face detection model generation and face detection method, device and equipment

Also Published As

Publication number Publication date
CN112036339A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112036339B (en) Face detection method and device and electronic equipment
CN111328396B (en) Pose estimation and model retrieval for objects in images
US11232286B2 (en) Method and apparatus for generating face rotation image
Liu et al. Infrared point target detection with improved template matching
US10554957B2 (en) Learning-based matching for active stereo systems
US20110299774A1 (en) Method and system for detecting and tracking hands in an image
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
JP2020525958A (en) Image processing system and image processing method
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN107767358B (en) Method and device for determining ambiguity of object in image
Hambarde et al. Single image depth estimation using deep adversarial training
Zhang et al. EventMD: High-speed moving object detection based on event-based video frames
CN114581318A (en) Low-illumination image enhancement method and system
CN117830611A (en) Target detection method and device and electronic equipment
CN116051736A (en) Three-dimensional reconstruction method, device, edge equipment and storage medium
Wu Data augmentation
CN112926498B (en) Living body detection method and device based on multichannel fusion and depth information local dynamic generation
Li et al. Spatio-context-based target tracking with adaptive multi-feature fusion for real-world hazy scenes
Tominaga Dichromatic reflection model
CN112016495A (en) Face recognition method and device and electronic equipment
Viswanathan Data fusion
Zhou et al. Improved YOLOv7 models based on modulated deformable convolution and swin transformer for object detection in fisheye images
Koppal Diffuse reflectance
Koch et al. Depth estimation
CN113837053B (en) Biological face alignment model training method, biological face alignment method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant