WO2023178906A1 - Liveness detection method and apparatus, and electronic device, storage medium, computer program and computer program product - Google Patents


Info

Publication number
WO2023178906A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
facial feature
face
living body
attention
Prior art date
Application number
PCT/CN2022/110261
Other languages
French (fr)
Chinese (zh)
Inventor
王柏润
刘建博
张帅
伊帅
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023178906A1 publication Critical patent/WO2023178906A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present disclosure relates to, but is not limited to, the field of computer vision technology, and in particular to a liveness detection method and device, electronic equipment, storage media, computer programs, and computer program products.
  • Computer vision is a hot topic of current research. It is a synthesis of image processing, artificial intelligence, pattern recognition and other technologies, and has been widely applied across many fields of society.
  • Applications of computer vision are inseparable from face recognition, and a key step in face recognition is liveness detection.
  • Common liveness algorithms can be divided into interactive and silent liveness algorithms according to the form of liveness detection, and into monocular, binocular and three-dimensional (3D) liveness algorithms according to the type of camera module.
  • Current liveness detection algorithms usually take the form of a single model, but in some scenarios the capacity of a single model is insufficient to reach the required liveness detection accuracy.
  • Embodiments of the present disclosure provide a liveness detection method and device, electronic equipment, storage media, computer programs, and computer program products, which help improve the accuracy of binocular liveness detection.
  • Embodiments of the present disclosure provide a liveness detection method, which includes: acquiring an infrared face image and a color face image of a detection object collected by a binocular camera; performing feature extraction on the infrared face image to obtain a first face feature map, and performing feature extraction on the color face image to obtain a second face feature map; obtaining target category attribute information of the detection object according to the second face feature map; obtaining a third face feature map according to the target category attribute information and the second face feature map; and obtaining a liveness detection result of the detection object according to the first face feature map and the third face feature map.
  • Embodiments of the present disclosure acquire the infrared face image and the color face image of the detection object collected by the binocular camera, perform feature extraction on the infrared face image to obtain the first face feature map, and perform feature extraction on the color face image to obtain the second face feature map.
  • The second face feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), and is converted into the third face feature map based on that information, so that feature extraction carries category attribute information; performing liveness detection with the category-aware features (the third face feature map) and the infrared face features (the first face feature map) helps improve the accuracy of binocular liveness detection.
  • Embodiments of the present disclosure provide a living body detection device, which includes an acquisition unit and a processing unit;
  • An acquisition unit configured to acquire infrared face images and color face images of the detection object collected by the binocular camera
  • the processing unit is configured to perform feature extraction on the infrared face image to obtain a first face feature map, and perform feature extraction on the color face image to obtain a second face feature map;
  • the processing unit is also configured to obtain target category attribute information of the detection object based on the second facial feature map;
  • the processing unit is also configured to obtain a third facial feature map based on the target category attribute information and the second facial feature map;
  • the processing unit is also configured to obtain a liveness detection result of the detection object based on the first facial feature map and the third facial feature map.
  • An embodiment of the present disclosure provides an electronic device.
  • The electronic device includes a processor connected to a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the electronic device performs the liveness detection method.
  • Embodiments of the present disclosure provide a computer-readable storage medium that stores a computer program, and the computer program causes a computer to perform the liveness detection method.
  • Embodiments of the present disclosure provide a computer program that includes computer-readable code.
  • When the computer-readable code is read and executed by a computer, some or all steps of the method in any embodiment of the present disclosure are implemented.
  • Embodiments of the present disclosure provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; the computer program is operable to cause a computer to perform the liveness detection method.
  • Figure 1 is a schematic diagram of an application environment provided by an embodiment of the present disclosure
  • Figure 2A is a schematic flow chart of a living body detection method provided by an embodiment of the present disclosure
  • Figure 2B is a schematic flowchart of a method for determining an attention matrix provided by an embodiment of the present disclosure
  • Figure 2C is a schematic flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • Figure 3 is a schematic network structure diagram of a living body detection model provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of selecting a third branch provided by an embodiment of the present disclosure.
  • Figure 5 is a schematic diagram of multiple pixels corresponding to a certain feature provided by an embodiment of the present disclosure.
  • Figure 6 is a schematic flow chart of another living body detection method provided by an embodiment of the present disclosure.
  • Figure 7 is a schematic structural diagram of a living body detection device provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 1 is a schematic diagram of an application environment provided by an embodiment of the present disclosure.
  • the application environment at least includes a binocular camera 101 and an electronic device 102.
  • The binocular camera 101 and the electronic device 102 are connected via a wired or wireless network.
  • the binocular camera 101 includes a visible light camera module 1011 and an infrared camera module 1012.
  • The visible light camera module 1011 and the infrared camera module 1012 are used to synchronously capture images of the detection object when it enters the image collection range, obtaining a color image and an infrared image respectively, which are stored in the face recognition system or sent directly to the electronic device 102.
  • The electronic device 102 receives the color image and the infrared image, or matches them from the system, then performs face detection on them and crops a color face image from the color image and an infrared face image from the infrared image based on the position information of the face detection box.
  • The electronic device 102 then calls a liveness detection model that supports multi-category attribute information to perform liveness detection on the color face image and the infrared face image. Because the liveness detection model extracts features with the branch corresponding to each category of attribute information, the extracted liveness features carry information unique to that category, which improves the accuracy of liveness classification and thereby the accuracy of liveness detection.
  • The electronic device 102 may be an independent physical server or a server cluster, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, big data and artificial intelligence platforms.
  • the living body detection method can be implemented by the processor calling computer-readable instructions stored in the memory.
  • Figure 2A is a schematic flow chart of a living body detection method provided by an embodiment of the present disclosure. The method can be implemented based on the application environment shown in Figure 1 and applied to electronic devices. As shown in Figure 2A, the method includes Steps 201 to 205:
  • The electronic device can obtain, in real time, the infrared image and color image of the detection object synchronously collected by the binocular camera, or obtain them from the face recognition system; no limitation is imposed here. For example, when the electronic device acquires an infrared image and a color image, it crops the infrared face image and the color face image from the two images based on the detection box generated by a face detection algorithm.
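  • For illustration, a minimal sketch of this cropping step follows, assuming OpenCV-style image arrays and a hypothetical `detect_face` helper that returns one (x, y, w, h) detection box per image:

```python
import numpy as np

def crop_face(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the face region from an image given a detection box (x, y, w, h)."""
    x, y, w, h = box
    return image[y:y + h, x:x + w]

# Hypothetical usage with any face detector returning a bounding box:
# ir_face = crop_face(infrared_image, detect_face(infrared_image))
# rgb_face = crop_face(color_image, detect_face(color_image))
```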
  • In some embodiments, obtaining the infrared face image and the color face image of the detection object collected by the binocular camera includes steps A1 to A5:
  • A1: From at least two color images of the detection object stored in the face recognition system, select the one with the highest face quality as the target color image, where the at least two color images are obtained by the visible light camera module of the binocular camera continuously capturing the detection object.
  • The electronic device extracts features from the faces in the at least two color images through a pre-trained face quality detection model, obtaining features containing face size, angle and sharpness information, then performs classification prediction on those features to obtain a face quality detection score for each color image, and selects the one with the highest score as the target color image.
  • A2: Perform face quality detection on the faces in at least two infrared images stored in the face recognition system, obtain the face quality detection score of each of the infrared images, and compute the difference between each score and the face quality detection score of the target color image.
  • Likewise, the electronic device extracts features containing face size, angle and sharpness information through the face quality detection model, and classifies these features to obtain the face quality detection score of each infrared image.
  • A3: Among the at least two infrared images, take the one whose face quality detection score differs least from that of the target color image as the candidate infrared image.
  • A4: Detect facial key points on the target color image and the candidate infrared image respectively, obtaining 106 first key points covering the eye, cheekbone, nose, ear, chin and cheek areas in the target color image, and 106 second key points covering the same areas in the candidate infrared image.
  • A5: Compute the similarity between the 106 first key points and the 106 second key points. If the similarity measure is below a preset threshold, the target color image and the candidate infrared image are determined to be an image pair collected from the detection object by the binocular camera at the same moment; based on the detection boxes used during key point detection, the face regions are then cropped from the target color image and the candidate infrared image respectively, yielding the infrared face image and the color face image of the detection object.
  • When the electronic device needs to obtain images from the face recognition system, it may not know which color image and which infrared image were collected from the detection object at the same moment. For any detection object, it therefore first selects the color image with the highest face quality, then selects, among all infrared images in the face recognition system, the one whose face quality detection score is closest to that of the target color image as the candidate infrared image, and finally matches the 106 facial key points of the two images. If the key point similarity measure is below the preset threshold, the candidate infrared image and the target color image are considered to have been captured at the same moment. This improves the accuracy of image matching in scenarios where the electronic device must obtain images from the face recognition system.
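  • The following sketch illustrates steps A1 to A5 under stated assumptions: `quality_score` and `keypoints` are hypothetical callables standing in for the face quality detection model and the 106-point key point detector, and the key point "similarity" is treated as a mean point-to-point distance (smaller means more similar):

```python
import numpy as np

def pair_images(color_images, infrared_images, quality_score, keypoints,
                similarity_threshold: float):
    """Pair the best color image with the infrared image most likely
    captured at the same moment (steps A1 to A5)."""
    # A1: pick the color image with the highest face quality score.
    color_scores = np.array([quality_score(img) for img in color_images])
    target_color = color_images[int(color_scores.argmax())]
    target_score = color_scores.max()

    # A2/A3: pick the infrared image whose quality score is closest.
    ir_scores = np.array([quality_score(img) for img in infrared_images])
    candidate_ir = infrared_images[int(np.abs(ir_scores - target_score).argmin())]

    # A4/A5: compare the two sets of 106 facial key points; a small mean
    # distance suggests both images show the same capture instant.
    kp_color = keypoints(target_color)  # shape (106, 2)
    kp_ir = keypoints(candidate_ir)     # shape (106, 2)
    if np.linalg.norm(kp_color - kp_ir, axis=1).mean() < similarity_threshold:
        return target_color, candidate_ir
    return None  # no synchronized pair found
```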
  • As shown in Figure 3, the liveness detection model includes a first branch (303), a second branch (304), a category attribute classifier (305), at least two third branches (306) and a liveness detection classifier (307). The first branch (303) is used to extract features from the input infrared face image (301) to obtain the first face feature map, and the second branch (304) is used to extract features from the input color face image (302) to obtain the second face feature map; both feature maps cover the important areas of the face.
  • the semantic information can be one or at least two of material, texture and gloss.
  • Both the first branch and the second branch can use at least two Inception structures in series for feature extraction.
  • The Inception structure applies convolution kernels of different sizes, which means receptive fields of different sizes, achieving a fusion of features at different scales; the first face feature map and the second face feature map therefore carry richer semantic information.
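  • As a hedged illustration of such a structure, a minimal Inception-style block with parallel convolutions of different kernel sizes is sketched below (PyTorch assumed; channel counts are illustrative, not taken from the disclosure):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel convolutions with different kernel sizes, i.e. different
    receptive fields; their outputs are concatenated along the channel
    axis to fuse features at different scales."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
```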
  • The liveness detection model further includes a category attribute classifier, at least two third branches and a liveness detection classifier. The second branch, the attribute classifier and the at least two third branches are connected in sequence; the third branches are independent of each other, and each of them corresponds to different category attribute information.
  • The output of each third branch is spliced with the output of the first branch, and the spliced output serves as the input of the liveness detection classifier.
  • In this way, for detection objects with different category attribute information, inference can be performed with the corresponding third branch within the same liveness detection model. Compared with schemes that adopt a different liveness detection model for each category of attribute information, this liveness detection solution saves the memory overhead of storing at least two models and is more robust.
  • Each third branch in the model can be initialized by migrating the parameters of an existing model, which enables efficient iteration during the training phase, while adding only a category attribute classifier after the second branch has a negligible impact on the inference speed of the entire model.
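  • A minimal sketch of the branch layout described above follows (PyTorch assumed; all names, channel sizes and the batch-size-1 routing are illustrative assumptions, not the disclosed implementation):

```python
import torch
import torch.nn as nn

class LivenessModel(nn.Module):
    """Two feature branches, a category attribute classifier, one third
    branch per category, and a liveness classifier over the spliced
    features. `backbone_fn` builds a stack of Inception-style blocks."""
    def __init__(self, backbone_fn, num_categories: int, feat_ch: int):
        super().__init__()
        self.first_branch = backbone_fn()    # infrared face image
        self.second_branch = backbone_fn()   # color face image
        self.attr_classifier = nn.Linear(feat_ch, num_categories)
        self.third_branches = nn.ModuleList(
            [backbone_fn() for _ in range(num_categories)])
        self.liveness_classifier = nn.Linear(2 * feat_ch, 2)

    def forward(self, ir_face: torch.Tensor, color_face: torch.Tensor):
        f1 = self.first_branch(ir_face)        # first face feature map
        f2 = self.second_branch(color_face)    # second face feature map
        attr = self.attr_classifier(f2.mean(dim=(2, 3))).argmax(dim=1)
        # Route through the third branch matching the predicted category
        # attribute (batch size 1 assumed for brevity).
        f3 = self.third_branches[int(attr[0])](f2)  # third face feature map
        fused = torch.cat([f1, f3], dim=1)          # spliced feature map
        return self.liveness_classifier(fused.mean(dim=(2, 3)))
```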
  • In some embodiments, performing feature extraction on the infrared face image to obtain the first face feature map includes: inputting the infrared face image into the first branch of the liveness detection model for feature extraction to obtain the first face feature map.
  • Performing feature extraction on the color face image to obtain the second face feature map includes: inputting the color face image into the second branch of the liveness detection model for feature extraction to obtain the second face feature map.
  • In this way, different neural network branches extract features from the infrared face image and the color face image respectively. Since the first branch is trained with supervision from infrared face sample images and the second branch is trained with supervision from color face sample images, inputting each image into its corresponding branch helps extract features with richer semantic information.
  • the second facial feature map is input into an attribute classifier, so that one or at least two types of semantic information are classified and predicted by the attribute classifier to obtain target category attribute information.
  • the target category attribute information can be gender, age group, location identification (such as belonging to the first region), etc.
  • The attribute classifier uses category attribute information as supervision during training; it can therefore predict the target category attribute information of the detection object from the second face feature map, which contains rich semantic information, such as which region the detection object is from or which age group it belongs to.
  • In some embodiments, the features in the second face feature map include one or at least two kinds of semantic information among material, texture and gloss.
  • Obtaining the target category attribute information of the detection object includes: inputting the second face feature map into the attribute classifier, which classifies and predicts the one or more kinds of semantic information to obtain the target category attribute information, where the target category attribute information includes a location identifier. Obtaining the third face feature map according to the target category attribute information and the second face feature map includes: determining, from the at least two third branches, the third branch corresponding to the location identifier, and inputting the second face feature map into that branch for feature extraction to obtain the third face feature map.
  • In this way, the attribute classifier in the liveness detection model classifies the second face feature map to obtain the target category attribute information of the detection object (such as a location identifier); the third branch corresponding to that attribute information is then determined from the at least two third branches and used to extract features from the second face feature map, so that the third face feature map carries features unique to that category of attribute information, which relatively improves the accuracy of liveness detection.
  • After the electronic device obtains the target category attribute information of the detection object, it can determine the corresponding third branch from the at least two third branches. As shown in Figure 4, if the location identifier of the detection object is the first identifier (401), the first identifier branch (402) can be determined from branches such as the first identifier branch (402), the second identifier branch (403) and the third identifier branch (404), and the second face feature map is input into the first identifier branch (402) for feature extraction to obtain the third face feature map (405). Optionally, the at least two third branches can also use at least two Inception structures in series for feature extraction. Each third branch uses its unique category attribute information as supervision during training, so the third face feature map carries features unique to that category of attribute information, which relatively improves the accuracy of liveness detection.
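  • A hedged sketch of this routing step (names are illustrative; `branches` maps each location identifier to its trained third branch):

```python
import numpy as np

def select_third_branch(branches: dict, attr_logits: np.ndarray):
    """Pick the third branch whose category attribute (e.g. the location
    identifier predicted by the attribute classifier) matches."""
    location_id = int(np.argmax(attr_logits))  # e.g. 0 = first identifier
    return branches[location_id]

# third_feature_map = select_third_branch(branches, attr_logits)(second_feature_map)
```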
  • In some embodiments, obtaining the liveness detection result of the detection object includes:
  • B1: Splice the first face feature map and the third face feature map to obtain the fourth face feature map;
  • B2: Obtain the degree of attention of each facial feature in the fourth face feature map to obtain an attention matrix;
  • B3: Multiply the features in the fourth face feature map by the elements at corresponding positions of the attention matrix to obtain the first weighted feature map;
  • B4: Classify the first weighted feature map to obtain the liveness detection result of the detection object.
  • In some embodiments, an attention model may first be used to generate an attention coefficient for each facial feature in the fourth face feature map, and the matrix composed of these attention coefficients is determined as the first attention matrix.
  • The attention model can be any existing attention model. It predicts which targets the human eye would pay more attention to when viewing an image, that is, it computes attention coefficients from the features of the image.
  • The electronic device performs key point detection on the color face image, obtaining M key points of the preset areas of interest together with the coordinate information and category information of the M key points.
  • The preset areas of interest refer to the eye, cheekbone, nose, ear, chin and cheek areas.
  • Here, the M key points may be the 106 key points of step A4; M is an integer greater than 1.
  • The electronic device then computes the second attention matrix based on the fourth face feature map and the M key points, and adds the elements at corresponding positions of the first attention matrix and the second attention matrix to obtain the attention matrix.
  • In some embodiments, determining the second attention matrix based on the fourth face feature map and the M key points may include the following steps 211 to 215:
  • Step 211: For the location of each facial feature in the fourth face feature map, determine the at least two pixels in the color face image corresponding to that location, and obtain the coordinate information of those pixels.
  • Since the fourth face feature map is obtained by splicing the first and third face feature maps, which are produced by convolving and pooling the original images through Inception-based deep learning structures, each feature position corresponds to multiple pixels of the original image.
  • As shown in Figure 5, the feature at a given position is computed from the features of at least two pixels (502), such as the 9 pixels (502) in the black rectangular box.
  • The coordinate information of the at least two pixels (502) in the color face image (501) can thus be obtained; for example, the coordinate information of a certain pixel a is (x4, y1).
  • Step 212: For a key point b with coordinates (xb, yb), the distance between pixel a and key point b is the Euclidean distance d(a, b) = √((x4 − xb)² + (y1 − yb)²). In this way, the distance between each of the at least two pixels and each of the M key points can be computed.
  • Embodiments of the present disclosure preset weights for the M key points.
  • Each key point can correspond to a different weight. For example, the weights of the key points in the eye area decrease with distance from the center of the eyeball, that is, key points in the eye area that are closer to the center of the eyeball receive more attention.
  • Other areas can adopt the same or a similar weighting scheme as the eye area, so the weights of the M key points may be ω1, ω2, ω3, …, ωM.
  • Step 213: For a pixel a among the at least two pixels, if its distance to one of the M key points (for example, key point b) is less than the preset distance threshold, the weight of key point b is assigned to pixel a; if the distance is greater than or equal to the preset distance threshold, a weight of 0 is assigned to pixel a.
  • The same weight can also be set for key points of the same category; for example, the weights of the n key points in the eye area can all be set to ω1, and the weights of the o key points in the nose area can all be set to ω2.
  • In this way, each pixel is assigned M weights, and these M weights serve as the reference weights of that pixel.
  • Step 214: The average of the M reference weights of each pixel is determined as the weight of that pixel.
  • Step 215: Since each facial feature in the fourth face feature map corresponds to at least two pixels in the color face image, the weight of each facial feature can be computed from the pixel weights obtained in step 214: the average of the weights of the at least two pixels can be used as the weight of the corresponding facial feature in the fourth face feature map, or alternatively their mode can be used. The matrix composed of the weights of all facial features is determined as the second attention matrix.
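  • A sketch of steps 211 to 215 follows, with illustrative names and shapes: `pixel_coords[i]` lists the color-image pixels contributing to feature position i (e.g. the 9 pixels of Figure 5), `keypoints` is the (M, 2) array of key point coordinates, and `keypoint_weights` holds the preset weights ω1..ωM:

```python
import numpy as np

def second_attention_matrix(feature_hw, pixel_coords, keypoints,
                            keypoint_weights, dist_threshold):
    """Build the key-point-based attention matrix for a feature map of
    spatial size `feature_hw` (steps 211 to 215)."""
    h, w = feature_hw
    attn = np.zeros(h * w)
    for i, pixels in enumerate(pixel_coords):
        pixel_weights = []
        for p in pixels:
            # Distance from this pixel to every key point.
            d = np.linalg.norm(keypoints - np.asarray(p, dtype=float), axis=1)
            # M reference weights: the key point's weight when the pixel
            # is close enough, 0 otherwise; average them per pixel.
            ref = np.where(d < dist_threshold, keypoint_weights, 0.0)
            pixel_weights.append(ref.mean())
        # Feature weight: average over contributing pixels (the mode
        # could be used instead, as noted above).
        attn[i] = float(np.mean(pixel_weights))
    return attn.reshape(h, w)
```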
  • The features in the fourth face feature map are multiplied by the elements at the corresponding positions of the attention matrix to obtain the first weighted feature map.
  • The features in the first weighted feature map can better express the semantic information of the focus areas of the detection object's face.
  • In this way, the attention model generates the first attention matrix, and the second attention matrix is then constructed for the fourth face feature map through key point detection and weight assignment. Since the attention model may miss a small amount of information about the key areas of interest when generating attention coefficients, assigning preset weights to the features in the fourth face feature map compensates for these possible omissions, so that all focus areas of the face receive attention. The resulting first weighted feature map can therefore fully express the semantic information of the key areas of interest, which in turn helps improve the accuracy of liveness classification.
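  • Putting the above together, a minimal end-to-end sketch of steps B1 to B3 under illustrative assumptions (`attn_model` is a hypothetical callable returning the first attention matrix; arrays are (C, H, W) feature maps and (H, W) attention matrices):

```python
import numpy as np

def first_weighted_feature_map(f1, f3, attn_model, second_attn):
    """B1: splice the first and third face feature maps; B2: add the
    model-generated and key-point-based attention matrices; B3: weight
    the spliced map element-wise (B4, classification, follows)."""
    f4 = np.concatenate([f1, f3], axis=0)   # fourth face feature map
    attn = attn_model(f4) + second_attn     # combined attention matrix
    return f4 * attn                        # broadcasts (H, W) over channels
```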
  • In some embodiments, obtaining the liveness detection result of the detection object based on the first face feature map and the third face feature map includes the following steps 221 to 226:
  • First, an attention model is used to generate the attention coefficient of each facial feature in the first face feature map, and the matrix composed of these attention coefficients is determined as the third attention matrix. Key point detection is performed on the infrared face image, likewise obtaining N key points (such as 106 key points) of the preset areas of interest together with the coordinate information and category information of the N key points.
  • For the location of each facial feature in the first face feature map, the at least two pixels corresponding to that location in the infrared face image are determined and their coordinate information obtained. For each of these pixels, its coordinate information and the coordinate information of the N key points are used to compute the distance between the pixel and each of the N key points; based on these distances and the category information of each key point, a weight is assigned to each pixel, yielding N reference weights per pixel.
  • The average of the reference weights is determined as the weight of each pixel, the average or mode of the weights of the at least two pixels is determined as the weight of each facial feature in the first face feature map, and the matrix composed of the weights of the facial features is determined as the fourth attention matrix.
  • The third attention matrix and the fourth attention matrix are added to obtain the attention matrix E.
  • Similarly, an attention model is first used to generate the attention coefficient of each facial feature in the third face feature map, and the matrix composed of these attention coefficients is determined as the fifth attention matrix. Key point detection is performed on the color face image, likewise obtaining S key points (such as 106 key points) of the preset areas of interest together with the coordinate information and category information of the S key points.
  • For the location of each facial feature in the third face feature map, the at least two pixels corresponding to that location in the color face image are determined and their coordinate information obtained. For each of these pixels, its coordinate information and the coordinate information of the S key points are used to compute the distance between the pixel and each of the S key points; based on these distances and the category information of each key point, a weight is assigned to each pixel, yielding S reference weights per pixel, and the average of the S reference weights is determined as the weight of each pixel.
  • The average or mode of the weights of the at least two pixels is determined as the weight of each facial feature in the third face feature map, the matrix composed of the weights of the facial features is determined as the sixth attention matrix, and the fifth attention matrix and the sixth attention matrix are added to obtain the attention matrix F.
  • In these embodiments, another feature splicing order is used: the attention model, key point detection and weight assignment generate the attention matrix E for the first face feature map, and multiplying the first face feature map by the attention matrix E yields the second weighted feature map, which can fully express the semantic information of the key areas of interest; likewise, the attention model, key point detection and weight assignment generate the attention matrix F for the third face feature map, and multiplying the third face feature map by the attention matrix F yields the third weighted feature map, which can also fully express the semantic information of the key areas of interest.
  • Splicing the second weighted feature map and the third weighted feature map fully fuses the semantic information of the key areas of interest in the color face image with the semantic information of the key areas of interest in the infrared face image, which also helps improve the accuracy of liveness classification.
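  • A hedged sketch of this alternative splicing order (illustrative shapes as before; `attn_e` and `attn_f` are the attention matrices E and F):

```python
import numpy as np

def weighted_splice(f1, f3, attn_e, attn_f):
    """Weight the first and third face feature maps by their own
    attention matrices E and F, then splice the two weighted maps."""
    second_weighted = f1 * attn_e   # second weighted feature map
    third_weighted = f3 * attn_f    # third weighted feature map
    return np.concatenate([second_weighted, third_weighted], axis=0)
```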
  • In summary, the embodiments of the present disclosure acquire the infrared face image and the color face image of the detection object collected by the binocular camera, perform feature extraction on the infrared face image to obtain the first face feature map, and perform feature extraction on the color face image to obtain the second face feature map.
  • The second face feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), and is converted into the third face feature map based on that information, so that feature extraction carries category attribute information; performing liveness detection with the category-aware features and the infrared face features helps improve the accuracy of binocular liveness detection.
  • In addition, the embodiments of the present disclosure can perform inference with the corresponding third branch in the same liveness detection model.
  • This liveness detection solution saves the memory overhead of storing at least two models and is more robust.
  • Each third branch in the model can be initialized by migrating the parameters of an existing model, which enables efficient iteration in the training phase, while adding only a category attribute classifier after the second branch has a negligible impact on the inference speed of the entire model.
  • Figure 6 is a schematic flow chart of another living body detection method provided by an embodiment of the present disclosure. As shown in Figure 6, the method includes steps 601 to 608:
  • 601: Obtain the infrared face image and color face image of the detection object collected by the binocular camera;
  • 602: Perform feature extraction on the infrared face image to obtain the first face feature map, and perform feature extraction on the color face image to obtain the second face feature map;
  • 603: Obtain the target category attribute information of the detection object according to the second face feature map;
  • 604: Obtain the third face feature map according to the target category attribute information and the second face feature map;
  • 605: Splice the first face feature map and the third face feature map to obtain the fourth face feature map;
  • 606: Obtain the degree of attention of each facial feature in the fourth face feature map to obtain the attention matrix;
  • 607: Multiply the features in the fourth face feature map by the elements at corresponding positions of the attention matrix to obtain the first weighted feature map;
  • 608: Classify the first weighted feature map to obtain the liveness detection result of the detection object.
  • The implementation of steps 601 to 608 has been described in the embodiments shown in Figures 2A to 5, and can achieve the same or similar beneficial effects.
  • Figure 7 is a schematic structural diagram of a liveness detection device provided by an embodiment of the present disclosure. As shown in Figure 7, the device includes an acquisition unit 701 and a processing unit 702, wherein:
  • the acquisition unit 701 is configured to acquire the infrared face image and color face image of the detection object collected by the binocular camera;
  • the processing unit 702 is configured to perform feature extraction on the infrared face image to obtain a first face feature map, and perform feature extraction on the color face image to obtain a second face feature map;
  • the processing unit 702 is also configured to obtain the target category attribute information of the detection object based on the second facial feature map;
  • the processing unit 702 is also configured to obtain a third facial feature map based on the target category attribute information and the second facial feature map;
  • the processing unit 702 is also configured to obtain the living body detection result of the detection object based on the first facial feature map and the third facial feature map.
  • The liveness detection device shown in Figure 7 acquires the infrared face image and the color face image of the detection object collected by the binocular camera; performs feature extraction on the infrared face image to obtain the first face feature map, and performs feature extraction on the color face image to obtain the second face feature map; obtains the target category attribute information of the detection object according to the second face feature map; obtains the third face feature map according to the target category attribute information and the second face feature map; and obtains the liveness detection result of the detection object according to the first face feature map and the third face feature map.
  • In this way, the second face feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), and is converted into the third face feature map based on that information, so that feature extraction carries category attribute information; performing liveness detection with the category-aware features and the infrared face features helps improve the accuracy of binocular liveness detection.
  • In some embodiments, when obtaining the liveness detection result of the detection object, the processing unit 702 is configured to: splice the first face feature map and the third face feature map to obtain the fourth face feature map; obtain the degree of attention of each facial feature in the fourth face feature map to obtain the attention matrix; multiply the features in the fourth face feature map by the elements at corresponding positions of the attention matrix to obtain the first weighted feature map; and classify the first weighted feature map to obtain the liveness detection result of the detection object.
  • When obtaining the degree of attention of each facial feature in the fourth face feature map, the processing unit 702 is configured as follows:
  • an attention model is used to generate the attention coefficient of each facial feature in the fourth face feature map, and the matrix composed of the attention coefficients is determined as the first attention matrix;
  • the M key points include the coordinate information and category information of each key point, and the second attention matrix is then determined based on the fourth face feature map and the M key points; for this, the processing unit 702 is configured as follows:
  • for the location of each facial feature in the fourth face feature map, determine the at least two pixels corresponding to that location in the color face image, and obtain the coordinate information of the at least two pixels;
  • for each pixel among the at least two pixels, use the coordinate information of that pixel and the coordinate information of the M key points to compute the distance between the pixel and each of the M key points;
  • based on those distances and the category information of each key point, assign a weight to each pixel to obtain M reference weights for each pixel;
  • determine the weight of each facial feature, and determine the matrix composed of the weights of the facial features as the second attention matrix.
  • When performing feature extraction on the infrared face image to obtain the first face feature map, the processing unit 702 is configured to input the infrared face image into the first branch of the liveness detection model for feature extraction to obtain the first face feature map;
  • when performing feature extraction on the color face image to obtain the second face feature map, the processing unit 702 is configured to:
  • input the color face image into the second branch of the liveness detection model for feature extraction to obtain the second face feature map.
  • The liveness detection model further includes a category attribute classifier, at least two third branches and a liveness detection classifier. The second branch, the attribute classifier and the at least two third branches are connected in sequence; the third branches are independent of each other, and each of them corresponds to different category attribute information.
  • The output of each third branch is spliced with the output of the first branch, and the spliced output serves as the input of the liveness detection classifier.
  • The features in the second face feature map include one or at least two kinds of semantic information among material, texture and gloss, and the target category attribute information of the detection object is obtained based on the second face feature map.
  • When obtaining the target category attribute information, the processing unit 702 is configured to input the second face feature map into the attribute classifier, which classifies and predicts the one or more kinds of semantic information to obtain the target category attribute information, where the target category attribute information includes a location identifier;
  • when obtaining the third face feature map, the processing unit 702 is configured as follows:
  • a third branch corresponding to the location identifier is determined from the at least two third branches, and the second face feature map is input into the third branch corresponding to the location identifier for feature extraction to obtain the third face feature map.
  • Each unit in the liveness detection device shown in Figure 7 can be separately or entirely combined into one or several other units, or one or more of the units can be further split into at least two functionally smaller units; this can achieve the same operation without affecting the technical effects of the embodiments of the present disclosure.
  • the above-mentioned units are divided based on logical functions.
  • the function of one unit can also be realized by at least two units, or the functions of at least two units can be realized by one unit.
  • the living body detection device may also include other units.
  • these functions may also be implemented with the assistance of other units, and may be implemented by at least two units in cooperation.
  • In other embodiments, a computer program (including program code) capable of executing each step of the corresponding method shown in Figure 2A or Figure 6 can be run on a general-purpose computing device, such as a computer including a central processing unit (CPU), a random access memory (RAM), a read-only memory (ROM) and other processing and storage elements, to construct the liveness detection device shown in Figure 7 and implement the liveness detection method of the embodiments of the present disclosure.
  • The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device through the computer-readable recording medium, and run therein.
  • As shown in Figure 8, the electronic device includes a transceiver 801, a processor 802 and a memory 803, which are connected via a bus 804.
  • the memory 803 is used to store computer programs and data, and can transmit the data stored in the memory 803 to the processor 802.
  • the processor 802 is used to read the computer program in the memory 803 to perform the following operations:
  • acquire the infrared face image and color face image of the detection object collected by the binocular camera; perform feature extraction on the infrared face image to obtain the first face feature map, and perform feature extraction on the color face image to obtain the second face feature map; obtain the target category attribute information of the detection object according to the second face feature map; obtain the third face feature map according to the target category attribute information and the second face feature map; and obtain the liveness detection result of the detection object according to the first face feature map and the third face feature map.
  • In this way, the second face feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), and is converted into the third face feature map based on that information, so that feature extraction carries category attribute information; performing liveness detection with the category-aware features and the infrared face features helps improve the accuracy of binocular liveness detection.
  • The processor 802 obtains the liveness detection result of the detection object based on the first face feature map and the third face feature map by performing the following steps: splicing the first face feature map and the third face feature map to obtain the fourth face feature map; obtaining the degree of attention of each facial feature in the fourth face feature map to obtain the attention matrix; multiplying the features in the fourth face feature map by the elements at corresponding positions of the attention matrix to obtain the first weighted feature map; and classifying the first weighted feature map to obtain the liveness detection result of the detection object.
  • In obtaining the degree of attention of each facial feature in the fourth face feature map to obtain the attention matrix, the processor 802 performs the following:
  • an attention model is used to generate the attention coefficient of each facial feature in the fourth face feature map, and the matrix composed of the attention coefficients is determined as the first attention matrix;
  • the M key points include the coordinate information and category information of each key point, and the processor 802 determines the second attention matrix based on the fourth face feature map and the M key points as follows:
  • for the location of each facial feature in the fourth face feature map, determine the at least two pixels corresponding to that location in the color face image, and obtain the coordinate information of the at least two pixels;
  • for each pixel among the at least two pixels, use the coordinate information of that pixel and the coordinate information of the M key points to compute the distance between the pixel and each of the M key points;
  • based on those distances and the category information of each key point, assign a weight to each pixel to obtain M reference weights for each pixel;
  • determine the weight of each facial feature, and determine the matrix composed of the weights of the facial features as the second attention matrix.
  • The processor 802 performs feature extraction on the infrared face image to obtain the first face feature map, including: inputting the infrared face image into the first branch of the liveness detection model for feature extraction to obtain the first face feature map;
  • the processor 802 performs feature extraction on the color face image to obtain a second face feature map, including:
  • the color face image is input into the second branch of the live body detection model for feature extraction to obtain the second face feature map.
  • The liveness detection model further includes a category attribute classifier, at least two third branches and a liveness detection classifier. The second branch, the attribute classifier and the at least two third branches are connected in sequence; the third branches are independent of each other, and each of them corresponds to different category attribute information.
  • The output of each third branch is spliced with the output of the first branch, and the spliced output serves as the input of the liveness detection classifier.
  • the features in the second face feature map include one or at least two semantic information of material, texture and gloss.
  • The processor 802 obtains the target category attribute information of the detection object according to the second face feature map, including:
  • inputting the second face feature map into the attribute classifier, which classifies and predicts the one or more kinds of semantic information to obtain the target category attribute information, where the target category attribute information includes a location identifier;
  • and obtains the third face feature map, including:
  • determining a third branch corresponding to the location identifier from the at least two third branches, and inputting the second face feature map into the third branch corresponding to the location identifier for feature extraction to obtain the third face feature map.
  • the electronic device may include but is not limited to a transceiver 801, a processor 802, and a memory 803.
  • The schematic diagram is only an example of an electronic device and does not constitute a limitation on the electronic device; the device may include more or fewer parts than shown, combine certain parts, or use different parts.
  • Since the processor 802 of the electronic device implements the steps of the liveness detection method of the embodiments of the present disclosure when executing the computer program, the embodiments of the liveness detection method are all applicable to the electronic device and can achieve the same or similar beneficial effects.
  • Embodiments of the present disclosure also provide a computer-readable storage medium that stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any liveness detection method described in the above method embodiments.
  • the computer-readable storage medium may only store the computer program corresponding to the living body detection method.
  • a computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device, and may be a volatile storage medium or a non-volatile storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above.
  • A non-exhaustive list of computer-readable storage media includes: portable computer disks, hard drives, magnetic disks, optical disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, static random-access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, and mechanical encoding devices such as punched cards or raised structures in grooves with instructions stored thereon, as well as any suitable combination of the above.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through electrical wires.
  • Embodiments of the present disclosure also provide a computer program.
  • The computer program includes computer-readable code.
  • When the computer-readable code is read and executed by a computer, some or all steps of the method in any embodiment of the present disclosure are implemented.
  • Embodiments of the present disclosure also provide a computer program product.
  • The computer program product includes a non-transitory computer-readable storage medium storing a computer program.
  • The computer program is operable to cause the computer to perform some or all of the steps of any liveness detection method described in the above method embodiments.
  • the disclosed device can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • at least two units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated.
  • The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across at least two network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as an independent product, may be stored in a computer-readable memory.
  • the technical solution of the present disclosure is essentially or contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, It includes several instructions to cause a computer device (which can be a personal computer, a server or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of the present disclosure.

Abstract

A liveness detection method and apparatus, and an electronic device, a storage medium, a computer program and a computer program product. The method comprises: acquiring an infrared facial image and a color facial image, that are collected by a binocular camera, of an object subjected to detection (201); performing feature extraction on the infrared facial image so as to obtain a first facial feature map, and performing feature extraction on the color facial image so as to obtain a second facial feature map (202); obtaining target category attribute information of said object on the basis of the second facial feature map (203); obtaining a third facial feature map on the basis of the target category attribute information and the second facial feature map (204); and obtaining a liveness detection result of said object on the basis of the first facial feature map and the third facial feature map (205).

Description

活体检测方法及装置、电子设备、存储介质、计算机程序、计算机程序产品Living body detection methods and devices, electronic equipment, storage media, computer programs, computer program products
相关申请的交叉应用Cross-application of related applications
本公开实施例基于申请号为202210283792.2、申请日为2022年03月22日、申请名称为“活体检测方法、装置、电子设备及存储介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。This disclosed embodiment is based on a Chinese patent application with application number 202210283792.2, application date is March 22, 2022, and the application name is "living body detection method, device, electronic equipment and storage medium", and requires priority of this Chinese patent application The entire content of this Chinese patent application is hereby incorporated by reference into this disclosure.
技术领域Technical field
本公开涉及但不限于计算机视觉技术领域,尤其涉及一种活体检测方法及装置、电子设备、存储介质、计算机程序、计算机程序产品。The present disclosure relates to but is not limited to the field of computer vision technology, and in particular, to a living body detection method and device, electronic equipment, storage media, computer programs, and computer program products.
背景技术Background technique
计算机视觉是当前研究的热点方向,是图像处理、人工智能和模式识别等技术的综合,在社会各领域也取得了广泛的应用。谈到计算机视觉的应用,总是离不开人脸识别,而人脸识别中一个关键的步骤便是活体检测,常见的活体算法按活体检验形式可以分为交互式活体算法和静默式活体算法,按相机模组类型可分为单目活体算法、双目活体算法和三维(3-Dimensional,3D)活体算法。目前的活体检测算法往往以单个模型的形式出现,但在部分场景中,单个模型的容量通常难以达到活体检测的精度。Computer vision is a hot topic in current research. It is a synthesis of image processing, artificial intelligence, pattern recognition and other technologies. It has also been widely used in various fields of society. When it comes to the application of computer vision, face recognition is always inseparable, and a key step in face recognition is liveness detection. Common liveness algorithms can be divided into interactive liveness algorithms and silent liveness algorithms according to the form of liveness detection. , according to the type of camera module, it can be divided into monocular in vivo algorithm, binocular in vivo algorithm and three-dimensional (3-Dimensional, 3D) in vivo algorithm. Current living body detection algorithms often appear in the form of a single model, but in some scenarios, the capacity of a single model is often difficult to achieve the accuracy of live body detection.
发明内容Contents of the invention
本公开实施例提供了一种活体检测方法及装置、电子设备、存储介质、计算机程序、计算机程序产品,有利于提升双目活体检测的精度。Embodiments of the present disclosure provide a living body detection method and device, electronic equipment, storage media, computer programs, and computer program products, which are beneficial to improving the accuracy of binocular living body detection.
本公开实施例提供了一种活体检测方法,该方法包括:Embodiments of the present disclosure provide a living body detection method, which method includes:
获取双目相机采集的检测对象的红外人脸图像和彩色人脸图像;Obtain the infrared face image and color face image of the detection object collected by the binocular camera;
对红外人脸图像进行特征提取,得到第一人脸特征图,以及对彩色人脸图像进行特征提取,得到第二人脸特征图;Perform feature extraction on the infrared face image to obtain the first face feature map, and perform feature extraction on the color face image to obtain the second face feature map;
根据第二人脸特征图,得到检测对象的目标类别属性信息;According to the second facial feature map, obtain the target category attribute information of the detection object;
根据目标类别属性信息和第二人脸特征图,得到第三人脸特征图;According to the target category attribute information and the second face feature map, the third face feature map is obtained;
根据第一人脸特征图和第三人脸特征图,得到检测对象的活体检测结果。According to the first face feature map and the third face feature map, the liveness detection result of the detection object is obtained.
本公开实施例通过获取双目相机采集的检测对象的红外人脸图像和彩色人脸图像;对所述红外人脸图像进行特征提取,得到第一人脸特征图,以及对所述彩色人脸图像进行特征提取,得到第二人脸特征图;根据所述第二人脸特征图,得到所述检测对象的目标类别属性信息;根据所述目标类别属性信息和所述第二人脸特征图,得到第三人脸特 征图;根据所述第一人脸特征图和所述第三人脸特征图,得到所述检测对象的活体检测结果。这样对彩色人脸图像提取出的第二人脸特征图进行分类,得到检测对象的类别属性信息(即目标类别属性信息),基于检测对象的类别属性信息将第二人脸特征图转化为第三人脸特征图,以实现带类别属性信息的特征提取,利用带类别属性信息的特征(即第三人脸特征图)和红外人脸特征(即第一人脸特征图)进行活体检测,有利于提升双目活体检测的精度。The embodiment of the present disclosure obtains the infrared face image and the color face image of the detection object collected by the binocular camera; performs feature extraction on the infrared face image to obtain the first face feature map, and performs feature extraction on the color face image. Perform feature extraction on the image to obtain a second facial feature map; obtain the target category attribute information of the detection object based on the second facial feature map; obtain the target category attribute information of the detection object based on the target category attribute information and the second facial feature map , obtain a third facial feature map; obtain a living body detection result of the detection object based on the first facial feature map and the third facial feature map. In this way, the second face feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (ie, the target category attribute information), and the second face feature map is converted into the third face feature map based on the category attribute information of the detection object. Three face feature maps to achieve feature extraction with category attribute information, and use features with category attribute information (i.e., the third face feature map) and infrared face features (i.e., the first face feature map) for live detection, It is helpful to improve the accuracy of binocular live body detection.
Embodiments of the present disclosure provide a liveness detection apparatus, which includes an acquisition unit and a processing unit, where:
the acquisition unit is configured to acquire an infrared face image and a color face image of a detection object captured by a binocular camera;
the processing unit is configured to perform feature extraction on the infrared face image to obtain a first facial feature map, and to perform feature extraction on the color face image to obtain a second facial feature map;
the processing unit is further configured to obtain target category attribute information of the detection object according to the second facial feature map;
the processing unit is further configured to obtain a third facial feature map according to the target category attribute information and the second facial feature map; and
the processing unit is further configured to obtain a liveness detection result of the detection object according to the first facial feature map and the third facial feature map.
Embodiments of the present disclosure provide an electronic device, including a processor connected to a memory, where the memory is configured to store a computer program and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the liveness detection method described above.
Embodiments of the present disclosure provide a computer-readable storage medium storing a computer program, where the computer program causes a computer to perform the liveness detection method described above.
Embodiments of the present disclosure provide a computer program including computer-readable code which, when read and executed by a computer, implements some or all of the steps of the method in any embodiment of the present disclosure.
Embodiments of the present disclosure provide a computer program product including a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform the liveness detection method described above.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or the prior art more clearly, the following briefly introduces the drawings required for describing the embodiments or the prior art. Apparently, the drawings in the following description show merely some embodiments of the present disclosure, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Figure 1 is a schematic diagram of an application environment provided by an embodiment of the present disclosure;
Figure 2A is a schematic flowchart of a liveness detection method provided by an embodiment of the present disclosure;
Figure 2B is a schematic flowchart of a method for determining an attention matrix provided by an embodiment of the present disclosure;
Figure 2C is a schematic flowchart of a liveness detection method provided by an embodiment of the present disclosure;
Figure 3 is a schematic diagram of the network structure of a liveness detection model provided by an embodiment of the present disclosure;
Figure 4 is a schematic diagram of selecting a third branch provided by an embodiment of the present disclosure;
Figure 5 is a schematic diagram of determining multiple pixels corresponding to a feature provided by an embodiment of the present disclosure;
Figure 6 is a schematic flowchart of another liveness detection method provided by an embodiment of the present disclosure;
Figure 7 is a schematic structural diagram of a liveness detection apparatus provided by an embodiment of the present disclosure;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To enable a person skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
The terms "include" and "have" and any variations thereof appearing in the specification, claims, and drawings of the present disclosure are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device. In addition, the terms "first", "second", and "third" are used to distinguish different objects rather than to describe a specific order.
Reference to an "embodiment" in the present disclosure means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present disclosure. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive of other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described in the present disclosure may be combined with other embodiments.
Please refer to Figure 1, which is a schematic diagram of an application environment provided by an embodiment of the present disclosure. As shown in Figure 1, the application environment includes at least a binocular camera 101 and an electronic device 102, connected through a wired or wireless network. The binocular camera 101 includes a visible-light camera module 1011 and an infrared camera module 1012, which synchronously capture images of a detection object when the object enters the image acquisition range, obtaining a color image and an infrared image respectively, and either store the color image and the infrared image in a face recognition system or send them directly to the electronic device 102. When the electronic device 102 receives, or matches from the system, the color image and the infrared image, it performs face detection on them and, based on the position information of the face detection boxes, crops a color face image from the color image and an infrared face image from the infrared image. The electronic device 102 then calls a liveness detection model supporting multiple categories of attribute information to perform liveness detection on the color face image and the infrared face image. Because the liveness detection model uses a dedicated branch for each category of attribute information during feature extraction, the extracted liveness features carry category-specific attribute information, which improves the accuracy of liveness classification and hence the accuracy of liveness detection.
For example, the electronic device 102 may be an independent physical server, a server cluster, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, and big data and artificial intelligence platforms. In some possible implementations, the liveness detection method may be implemented by a processor calling computer-readable instructions stored in a memory.
Please refer to Figure 2A, which is a schematic flowchart of a liveness detection method provided by an embodiment of the present disclosure. The method may be implemented based on the application environment shown in Figure 1 and applied to an electronic device. As shown in Figure 2A, the method includes steps 201 to 205:
201: Acquire an infrared face image and a color face image of a detection object captured by a binocular camera.
In the embodiments of the present disclosure, the electronic device may acquire, in real time, the infrared image and the color image of the detection object synchronously captured by the binocular camera, or may acquire them from a face recognition system; this is not limited here. For example, when the electronic device has acquired the infrared image and the color image, it crops the infrared face image and the color face image from the two images respectively based on detection boxes generated by a face detection algorithm.
For example, acquiring the infrared face image and the color face image of the detection object captured by the binocular camera includes:
A1: Select, from at least two color images of the detection object stored in the face recognition system, the image with the highest face quality as a target color image, where the at least two color images are obtained by the visible-light camera module of the binocular camera continuously capturing the detection object. The electronic device performs feature extraction on the faces in the at least two color images through a pre-trained face quality detection model to obtain features containing face size, angle, and sharpness information, then performs classification prediction on these features to obtain a face quality detection score for each of the at least two color images, and selects the image with the highest score as the target color image.
A2: Perform face quality detection on the faces in at least two infrared images stored in the face recognition system to obtain a face quality detection score for each of the at least two infrared images, and compute the difference between each such score and the face quality detection score of the target color image. The electronic device likewise extracts features containing face size, angle, and sharpness information through the face quality detection model and classifies them to obtain the face quality detection score of each infrared image.
A3: Take, from the at least two infrared images, the image whose face quality detection score differs least from that of the target color image as a candidate infrared image.
A4: Perform facial key point detection on the target color image and the candidate infrared image respectively, obtaining 106 first key points covering the eye, cheekbone, nose, ear, chin, and cheek regions in the target color image, and 106 second key points covering the same regions in the candidate infrared image.
A5: Compute the similarity between the 106 first key points and the 106 second key points. If the similarity is less than a preset threshold, determine that the target color image and the candidate infrared image are an image pair captured by the binocular camera at the same moment of the detection object, and, based on the detection boxes in the target color image and the candidate infrared image obtained during facial key point detection, crop the face region images from the target color image and the candidate infrared image respectively to obtain the infrared face image and the color face image of the detection object.
In this implementation, when the electronic device needs to obtain images from the face recognition system, it may not know which color image and which infrared image were captured of the detection object at the same moment. For any detection object, the image with the highest face quality is first selected from its at least two color images; the infrared image whose face quality detection score is closest to that of the selected color image is then chosen from all infrared images in the face recognition system as the candidate infrared image; and 106 facial key points are selected to perform key point matching between the two images. If the similarity between the key points is less than the preset threshold, the candidate infrared image and the target color image are considered to have been captured of the detection object at the same moment. This helps improve the accuracy of image matching in scenarios where the electronic device needs to obtain images from the face recognition system.
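A minimal sketch of the image-pair matching procedure in steps A1 to A5 might read as follows. The quality model, the key point detector, the use of mean key point distance as the "similarity" measure, and the threshold value are all assumptions for illustration, not details fixed by the disclosure.

```python
import numpy as np

def match_image_pair(color_images, infrared_images, quality_model, keypoint_detector,
                     similarity_threshold=0.5):
    """Pick the color/infrared pair most likely captured at the same moment (steps A1-A5)."""
    # A1: choose the color image with the highest face quality score.
    color_scores = [quality_model.score(img) for img in color_images]
    target_color = color_images[int(np.argmax(color_scores))]
    target_score = max(color_scores)

    # A2/A3: choose the infrared image whose quality score is closest to the target's.
    ir_scores = [quality_model.score(img) for img in infrared_images]
    candidate_ir = infrared_images[int(np.argmin([abs(s - target_score) for s in ir_scores]))]

    # A4: detect the 106 facial key points in both images, each as a (106, 2) array.
    kp_color = keypoint_detector.detect(target_color)
    kp_ir = keypoint_detector.detect(candidate_ir)

    # A5: accept the pair when the key-point "similarity" measure (here, a mean
    # point-to-point distance, an assumption) falls below the preset threshold.
    similarity = float(np.mean(np.linalg.norm(kp_color - kp_ir, axis=1)))
    if similarity < similarity_threshold:
        return target_color, candidate_ir
    return None
```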
202: Perform feature extraction on the infrared face image to obtain a first facial feature map, and perform feature extraction on the color face image to obtain a second facial feature map.
In the embodiments of the present disclosure, a liveness detection model structure is proposed. As shown in Figure 3, the liveness detection model includes a first branch (303), a second branch (304), a category attribute classifier (305), at least two third branches (306), and a liveness detection classifier (307). The first branch (303) is used to perform feature extraction on the input infrared face image (301) to obtain the first facial feature map, and the second branch (304) is used to perform feature extraction on the input color face image (302) to obtain the second facial feature map. The first and second facial feature maps cover semantic information on whether important facial regions (e.g., eyes, cheekbones, nose, ears, chin, cheeks) belong to a living body; for example, the semantic information may be one or at least two of material, texture, and gloss. Optionally, both the first branch and the second branch may perform feature extraction using at least two Inception structures connected in series. The Inception structure uses convolution kernels of different sizes, which correspond to receptive fields of different sizes and hence fuse features at different scales; as a result, the first and second facial feature maps carry richer semantic information.
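As an illustration only, a branch built from serially connected Inception-style blocks could be sketched in PyTorch as below. The channel widths, the number of blocks, and the single-channel infrared input are assumptions; the disclosure only requires at least two Inception structures in series per branch.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel convolutions with different kernel sizes, i.e. different receptive fields."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 4
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, branch_ch, kernel_size=1))

    def forward(self, x):
        # Concatenating the parallel branches fuses features at different scales.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

# First/second branch: at least two Inception blocks in series (widths are assumptions).
first_branch = nn.Sequential(InceptionBlock(1, 64), InceptionBlock(64, 128))   # infrared input
second_branch = nn.Sequential(InceptionBlock(3, 64), InceptionBlock(64, 128))  # color input
```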
In a possible implementation, the liveness detection model further includes the category attribute classifier, the at least two third branches, and the liveness detection classifier, where the second branch, the attribute classifier, and the at least two third branches are connected in sequence; the at least two third branches are independent of one another, and each of them corresponds to different category attribute information; the output of each third branch is concatenated with the output of the first branch, and the concatenated output serves as the input of the liveness detection classifier. In this way, for detection objects with different categories of attribute information, the corresponding third branch can be used for inference within the same liveness detection model. Compared with a scheme that requires a different liveness detection model for each category of attribute information, this helps save the memory overhead of storing at least two models and is more robust. Each third branch in the model can be migrated using the parameters of an existing model, enabling efficient iteration during training; meanwhile, only one category attribute classifier is added after the second branch, and its impact on the inference speed of the whole model is negligible.
In a possible implementation, performing feature extraction on the infrared face image to obtain the first facial feature map includes: inputting the infrared face image into the first branch of the liveness detection model for feature extraction to obtain the first facial feature map; and performing feature extraction on the color face image to obtain the second facial feature map includes: inputting the color face image into the second branch of the liveness detection model for feature extraction to obtain the second facial feature map. In this way, different neural network branches are used to extract features from the infrared face image and the color face image respectively. Since the first branch is trained with the supervision information of infrared face sample images and the second branch is trained with the supervision information of color face sample images, feeding the infrared and color face images into their respective branches for feature extraction helps extract features with richer semantic information.
203: Obtain target category attribute information of the detection object according to the second facial feature map.
In the embodiments of the present disclosure, the second facial feature map is input into the attribute classifier, which performs classification prediction on one or at least two types of semantic information to obtain the target category attribute information. In some embodiments, the target category attribute information may be gender, age group, a region identifier (e.g., belonging to a first region), and so on. The attribute classifier is supervised with category attribute information during training; therefore, based on the second feature map containing rich semantic information, it can predict the target category attribute information of the detection object, such as which country the detection object is from or which age group the detection object belongs to.
In a possible implementation, the features in the second facial feature map include one or at least two types of semantic information among material, texture, and gloss. Obtaining the target category attribute information of the detection object according to the second facial feature map includes: inputting the second facial feature map into the attribute classifier, and performing classification prediction on the one or at least two types of semantic information through the attribute classifier to obtain the target category attribute information, where the target category attribute information includes a region identifier. Obtaining the third facial feature map according to the target category attribute information and the second facial feature map includes: determining, from the at least two third branches, the third branch corresponding to the region identifier, and inputting the second facial feature map into that third branch for feature extraction to obtain the third facial feature map. In this way, the second facial feature map is classified by the attribute classifier in the liveness detection model to obtain the target category attribute information of the detection object (such as the region identifier), the third branch corresponding to the target category attribute information is determined from the at least two third branches, and feature extraction is performed on the second facial feature map using that third branch, so that the third facial feature map carries features specific to that category attribute information, thereby relatively improving the accuracy of liveness detection.
204: Obtain a third facial feature map according to the target category attribute information and the second facial feature map.
In the embodiments of the present disclosure, after obtaining the target category attribute information of the detection object, the electronic device can determine the third branch corresponding to that target category attribute information from the at least two third branches. As shown in Figure 4, if the region identifier of the detection object is a first identifier (401), the first-identifier branch (402) can be determined from at least two third branches such as the first-identifier branch (402), the second-identifier branch (403), and the third-identifier branch (404), and the second facial feature map is input into the first-identifier branch (402) for feature extraction to obtain the third facial feature map (405). Optionally, the at least two third branches may likewise perform feature extraction using at least two Inception structures connected in series. Each third branch is supervised with its specific category attribute information during training; therefore, the third facial feature map carries features specific to that category attribute information, which relatively improves the accuracy of liveness detection.
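For illustration, routing the second facial feature map to the third branch selected by the attribute classifier might be sketched as follows; the classifier interface, the use of the branch index as the region identifier, and the single-image batch are assumptions.

```python
import torch

def extract_third_feature_map(second_feature_map, attribute_classifier, third_branches):
    """Route the second facial feature map through the branch matching the predicted attribute."""
    logits = attribute_classifier(second_feature_map)      # one score per category attribute
    region_id = int(torch.argmax(logits, dim=1).item())    # assumes a single image per batch
    branch = third_branches[region_id]                     # pick the matching third branch
    return branch(second_feature_map)                      # third facial feature map
```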
205: Obtain a liveness detection result of the detection object according to the first facial feature map and the third facial feature map.
In the embodiments of the present disclosure, for example, obtaining the liveness detection result of the detection object according to the first facial feature map and the third facial feature map includes:
B1: Concatenate the first facial feature map and the third facial feature map to obtain a fourth facial feature map;
B2: Obtain the attention degree of each facial feature in the fourth facial feature map to obtain an attention matrix;
B3: Multiply the fourth facial feature map by the attention matrix to obtain a first weighted feature map;
B4: Classify the first weighted feature map to obtain the liveness detection result of the detection object.
In some embodiments, in step B2, an attention model may first be used to generate an attention coefficient for each facial feature in the fourth facial feature map, and the matrix formed by the attention coefficients is determined as a first attention matrix. The attention model may be any existing attention model; it should be understood that an attention model can predict which target the human eye pays more attention to when viewing an image, i.e., attention coefficients can be computed from the features of the image. The electronic device performs key point detection on the color face image to obtain M key points of preset regions of interest together with the coordinate information and category information of the M key points. The preset regions of interest are the eye, cheekbone, nose, ear, chin, and cheek regions, and the M key points are the 106 key points in step A4, where M is an integer greater than 1. Based on the fourth facial feature map and the M key points, the electronic device computes a second attention matrix, and adds the elements at corresponding positions of the first attention matrix and the second attention matrix to obtain the attention matrix.
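Steps B1 to B4, with the attention matrix formed as the sum of the model-predicted first attention matrix and the keypoint-derived second attention matrix, could be sketched as follows. Here `attention_model`, `keypoint_attention`, and `liveness_classifier` are placeholder callables, and the attention matrices are assumed to broadcast against the feature map shape.

```python
import torch

def liveness_from_features(first_map, third_map, attention_model, keypoint_attention,
                           liveness_classifier):
    # B1: concatenate along the channel dimension to form the fourth facial feature map.
    fourth_map = torch.cat([first_map, third_map], dim=1)

    # B2: attention = model-predicted coefficients + keypoint-derived weights,
    # added element-wise at corresponding positions.
    first_attention = attention_model(fourth_map)        # first attention matrix
    second_attention = keypoint_attention(fourth_map)    # second attention matrix (steps 211-215)
    attention = first_attention + second_attention

    # B3: element-wise weighting of the fourth facial feature map.
    weighted = fourth_map * attention

    # B4: classify the first weighted feature map into live / spoof.
    return liveness_classifier(weighted)
```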
For example, as shown in Figure 2B, determining the second attention matrix based on the fourth facial feature map and the M key points may include the following steps 211 to 215:
211: For the position of each facial feature in the fourth facial feature map, determine at least two pixels corresponding to that position in the color face image, and obtain the coordinate information of the at least two pixels.
It should be understood that, since the fourth facial feature map is obtained by concatenating the first facial feature map and the third facial feature map, based on the principle of convolving and pooling the original image with the Inception structure, as shown in Figure 5, for the position of any facial feature in the fourth facial feature map (503), at least two corresponding pixels (502) in the color face image (501) can be determined; that is, the feature at that position is computed from the features of these at least two pixels (502), such as the 9 pixels (502) in the black rectangular box in Figure 5. Meanwhile, the coordinate information of the at least two pixels (502) in the color face image (501) can be obtained; for example, the coordinate information of a certain pixel a is (x4, y1).
212: For each of the at least two pixels, compute the distance between the pixel and each of the M key points using the coordinate information of the pixel and the coordinate information of the M key points.
In some embodiments, assuming that the coordinate information of a key point b among the M key points is (x5, y7), the distance between pixel a and key point b is

d(a, b) = √((x4 − x5)² + (y1 − y7)²)

From this, the distance between each of the at least two pixels and each of the M key points can be computed.
213: Based on the distance between each pixel and each of the M key points and the category information of each key point, assign weights to each pixel to obtain M reference weights for the pixel.
In some embodiments, weights are preset for the M key points. For example, each key point may correspond to a different weight; for key points in the eye region, the weights decrease with distance from the center of the eyeball, i.e., more attention is paid to eye-region key points closer to the eyeball center, and other regions may adopt the same or a similar weight-setting scheme with reference to the eye region. The weights of the M key points may then be α1, α2, α3, ..., αM. For a pixel a among the at least two pixels, if its distance to a key point among the M key points (e.g., key point b) is less than a preset distance threshold, the weight of key point b is assigned to pixel a; if its distance to key point b is greater than or equal to the preset distance threshold, pixel a is assigned a weight of 0. Alternatively, the same weight may be set for key points with the same category information; for example, the weights of the n key points in the eye region may all be set to α1, the weights of the o key points in the nose region may all be set to α2, the weights of the q key points in the chin region may all be set to α3, and so on. The M key points then still have M weights, the difference being that key points with the same category information share the same weight. For pixel a and key point b, when the distance between them is less than the preset distance threshold, the weight of key point b is likewise assigned to pixel a; when the distance between them is greater than or equal to the preset distance threshold, pixel a is likewise assigned a weight of 0. Under either weight-assignment scheme, each pixel is assigned M weights, which serve as the reference weights of that pixel.
214: Determine the average of the M reference weights as the weight of the pixel.
215: Based on the weight of each pixel, determine the weight of each facial feature, and determine the matrix formed by the weights of the facial features as the second attention matrix.
In some embodiments, since each facial feature in the fourth facial feature map corresponds to at least two pixels in the color face image, the weight of each facial feature can be computed based on the weight of each pixel obtained in step 214. For example, the average of the weights of the at least two pixels may be taken as the weight of the corresponding facial feature in the fourth facial feature map, or the mode of the weights of the at least two pixels may be taken as that weight. The matrix formed by the weights of the facial features is thereby determined as the second attention matrix.
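A NumPy sketch of steps 211 to 215, under the simplifying assumptions that the feature-to-pixel correspondence is already known and that each key point carries a preset weight, might read:

```python
import numpy as np

def second_attention_matrix(feature_pixels, pixel_coords, keypoints, keypoint_weights,
                            dist_threshold):
    """
    feature_pixels:   dict mapping a feature position (i, j) to the indices of its
                      source pixels in the color face image (step 211)
    pixel_coords:     (P, 2) array of pixel coordinates
    keypoints:        (M, 2) array of key-point coordinates
    keypoint_weights: (M,) preset weights, e.g. decreasing with distance from the eyeball center
    """
    # Step 212: Euclidean distance from every pixel to every key point, shape (P, M).
    dists = np.linalg.norm(pixel_coords[:, None, :] - keypoints[None, :, :], axis=2)

    # Step 213: reference weight = key-point weight if close enough, else 0.
    reference = np.where(dists < dist_threshold, keypoint_weights[None, :], 0.0)

    # Step 214: per-pixel weight = mean of its M reference weights.
    pixel_weights = reference.mean(axis=1)

    # Step 215: feature weight = mean of the weights of its source pixels.
    h = max(i for i, _ in feature_pixels) + 1
    w = max(j for _, j in feature_pixels) + 1
    attention = np.zeros((h, w))
    for (i, j), idxs in feature_pixels.items():
        attention[i, j] = pixel_weights[list(idxs)].mean()
    return attention
```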
In the embodiments of the present disclosure, the features in the fourth facial feature map are multiplied by the elements at corresponding positions in the attention matrix to obtain the first weighted feature map; the features in the first weighted feature map better express the semantic information of the key regions of interest on the face of the detection object.
In this implementation, an attention model is used to generate the first attention matrix, and the second attention matrix is then constructed for the fourth facial feature map based on key point detection and weight assignment. Since the attention model may miss a small amount of information from the key regions of interest when generating attention coefficients, assigning weights to the features in the fourth facial feature map based on the preset weights can compensate for such possible omissions, so that all key regions of interest on the face receive attention. The resulting first weighted feature map can thus fully express the semantic information of the key regions of interest, which in turn helps improve the accuracy of liveness classification.
For example, as shown in Figure 2C, obtaining the liveness detection result of the detection object according to the first facial feature map and the third facial feature map includes the following steps 221 to 226:
221: Obtain the attention degree of each facial feature in the first facial feature map to obtain an attention matrix E.
In some embodiments, with reference to the method for obtaining the attention matrix in step B2, an attention model is first used to generate an attention coefficient for each facial feature in the first facial feature map, and the matrix formed by the attention coefficients is determined as a third attention matrix. Key point detection is performed on the infrared face image, likewise obtaining N key points (e.g., 106 key points) of the preset regions of interest together with their coordinate information and category information. For the position of each facial feature in the first facial feature map, at least two corresponding pixels in the infrared face image are determined and their coordinate information is obtained. For each of the at least two pixels, the distance between the pixel and each of the N key points is computed using the coordinate information of the pixel and of the N key points; based on these distances and the category information of each key point, weights are assigned to the pixel to obtain N reference weights, whose average is determined as the weight of the pixel. The average or mode of the weights of the at least two pixels is determined as the weight of the corresponding facial feature in the first facial feature map, the matrix formed by these weights is determined as a fourth attention matrix, and the third attention matrix and the fourth attention matrix are added to obtain the attention matrix E.
222: Obtain the attention degree of each facial feature in the third facial feature map to obtain an attention matrix F.
In some embodiments, an attention model is first used to generate an attention coefficient for each facial feature in the third facial feature map, and the matrix formed by the attention coefficients is determined as a fifth attention matrix. Key point detection is performed on the color face image, likewise obtaining S key points (e.g., 106 key points) of the preset regions of interest together with their coordinate information and category information. For the position of each facial feature in the third facial feature map, at least two corresponding pixels in the color face image are determined and their coordinate information is obtained. For each of the at least two pixels, the distance between the pixel and each of the S key points is computed using the coordinate information of the pixel and of the S key points; based on these distances and the category information of each key point, weights are assigned to the pixel to obtain S reference weights, whose average is determined as the weight of the pixel. The average or mode of the weights of the at least two pixels is determined as the weight of the corresponding facial feature in the third facial feature map, the matrix formed by these weights is determined as a sixth attention matrix, and the fifth attention matrix and the sixth attention matrix are added to obtain the attention matrix F.
223: Multiply the first facial feature map by the attention matrix E to obtain a second weighted feature map;
224: Multiply the third facial feature map by the attention matrix F to obtain a third weighted feature map;
225: Concatenate the second weighted feature map and the third weighted feature map to obtain a concatenated weighted feature map;
226: Classify the concatenated weighted feature map to obtain the liveness detection result of the detection object.
In this implementation, another feature concatenation scheme is adopted. The attention model, key point detection, and weight assignment are used to generate the attention matrix E for the first facial feature map, and multiplying the first facial feature map by the attention matrix E likewise yields a second weighted feature map that fully expresses the semantic information of the key regions of interest. The attention model, key point detection, and weight assignment are used to generate the attention matrix F for the third facial feature map, and multiplying the third facial feature map by the attention matrix F likewise yields a third weighted feature map that fully expresses the semantic information of the key regions of interest. Concatenating the second weighted feature map and the third weighted feature map fully fuses the semantic information of the key regions of interest in the color face image with that in the infrared face image, which likewise helps improve the accuracy of liveness classification.
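The alternative fusion of steps 221 to 226, which weights each feature map before concatenation, might be sketched as below; `attention_for` stands for the combined attention-model-plus-keypoint procedure described above and is an assumed interface.

```python
import torch

def liveness_weight_then_concat(first_map, third_map, attention_for, liveness_classifier):
    # 221/223: weight the first facial feature map by its attention matrix E.
    E = attention_for(first_map)        # attention model + infrared keypoint weights
    second_weighted = first_map * E

    # 222/224: weight the third facial feature map by its attention matrix F.
    F = attention_for(third_map)        # attention model + color keypoint weights
    third_weighted = third_map * F

    # 225/226: concatenate the weighted maps and classify.
    fused = torch.cat([second_weighted, third_weighted], dim=1)
    return liveness_classifier(fused)
```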
In the embodiments of the present disclosure, an infrared face image and a color face image of a detection object captured by a binocular camera are acquired; feature extraction is performed on the infrared face image to obtain a first facial feature map and on the color face image to obtain a second facial feature map; target category attribute information of the detection object is obtained according to the second facial feature map; a third facial feature map is obtained according to the target category attribute information and the second facial feature map; and a liveness detection result of the detection object is obtained according to the first facial feature map and the third facial feature map. In this way, the second facial feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), the second facial feature map is converted into the third facial feature map based on that category attribute information so that feature extraction carries category attribute information, and liveness detection is performed using the attribute-aware features (the third facial feature map) together with the infrared facial features (the first facial feature map), which helps improve the accuracy of binocular liveness detection. In addition, for detection objects with different categories of attribute information, the embodiments of the present disclosure can use the corresponding third branch for inference within the same liveness detection model. Compared with a scheme that requires a different liveness detection model for each category of attribute information, this helps save the memory overhead of storing at least two models and is more robust; each third branch in the model can be migrated using the parameters of an existing model, enabling efficient iteration during training, while only one category attribute classifier is added after the second branch, whose impact on the inference speed of the whole model is negligible.
Please refer to Figure 6, which is a schematic flowchart of another liveness detection method provided by an embodiment of the present disclosure. As shown in Figure 6, the method includes steps 601 to 608:
601: Acquire an infrared face image and a color face image of a detection object captured by a binocular camera;
602: Perform feature extraction on the infrared face image to obtain a first facial feature map, and perform feature extraction on the color face image to obtain a second facial feature map;
603: Obtain target category attribute information of the detection object according to the second facial feature map;
604: Obtain a third facial feature map according to the target category attribute information and the second facial feature map;
605: Concatenate the first facial feature map and the third facial feature map to obtain a fourth facial feature map;
606: Obtain the attention degree of each facial feature in the fourth facial feature map to obtain an attention matrix;
607: Multiply the fourth facial feature map by the attention matrix to obtain a first weighted feature map;
608: Classify the first weighted feature map to obtain the liveness detection result of the detection object.
The implementations of steps 601 to 608 have been described in the embodiments shown in Figures 2A to 5, and can achieve the same or similar beneficial effects.
Based on the description of the method embodiments shown in Figure 2A or Figure 6, please refer to Figure 7, which is a schematic structural diagram of a liveness detection apparatus provided by an embodiment of the present disclosure. As shown in Figure 7, the apparatus includes an acquisition unit 701 and a processing unit 702, where:
the acquisition unit 701 is configured to acquire an infrared face image and a color face image of a detection object captured by a binocular camera;
the processing unit 702 is configured to perform feature extraction on the infrared face image to obtain a first facial feature map, and to perform feature extraction on the color face image to obtain a second facial feature map;
the processing unit 702 is further configured to obtain target category attribute information of the detection object according to the second facial feature map;
the processing unit 702 is further configured to obtain a third facial feature map according to the target category attribute information and the second facial feature map;
the processing unit 702 is further configured to obtain a liveness detection result of the detection object according to the first facial feature map and the third facial feature map.
It can be seen that, in the liveness detection apparatus shown in Figure 7, an infrared face image and a color face image of a detection object captured by a binocular camera are acquired; feature extraction is performed on the infrared face image to obtain a first facial feature map and on the color face image to obtain a second facial feature map; target category attribute information of the detection object is obtained according to the second facial feature map; a third facial feature map is obtained according to the target category attribute information and the second facial feature map; and a liveness detection result of the detection object is obtained according to the first facial feature map and the third facial feature map. In this way, the second facial feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), the second facial feature map is converted into the third facial feature map based on that category attribute information so that feature extraction carries category attribute information, and liveness detection is performed using the attribute-aware features (the third facial feature map) together with the infrared facial features (the first facial feature map), which helps improve the accuracy of binocular liveness detection.
In a possible implementation, in terms of obtaining the liveness detection result of the detection object according to the first facial feature map and the third facial feature map, the processing unit 702 is configured to:
concatenate the first facial feature map and the third facial feature map to obtain a fourth facial feature map;
obtain the attention degree of each facial feature in the fourth facial feature map to obtain an attention matrix;
multiply the fourth facial feature map by the attention matrix to obtain a first weighted feature map;
classify the first weighted feature map to obtain the liveness detection result of the detection object.
In a possible implementation, in terms of obtaining the attention degree of each facial feature in the fourth facial feature map to obtain the attention matrix, the processing unit 702 is configured to:
use an attention model to generate an attention coefficient for each facial feature in the fourth facial feature map, and determine the matrix formed by the attention coefficients as a first attention matrix;
perform key point detection on the color face image to obtain M key points of preset regions of interest, where M is an integer greater than 1;
determine a second attention matrix based on the fourth facial feature map and the M key points;
add the first attention matrix and the second attention matrix to obtain the attention matrix.
In a possible implementation, the M key points further include coordinate information and category information of each key point, and in terms of determining the second attention matrix based on the fourth facial feature map and the M key points, the processing unit 702 is configured to:
for the position of each facial feature in the fourth facial feature map, determine at least two pixels corresponding to that position in the color face image, and obtain the coordinate information of the at least two pixels;
for each of the at least two pixels, compute the distance between the pixel and each of the M key points using the coordinate information of the pixel and the coordinate information of the M key points;
assign weights to each pixel based on the distance between the pixel and each of the M key points and the category information of each key point, to obtain M reference weights for the pixel;
determine the average of the M reference weights as the weight of the pixel;
determine the weight of each facial feature based on the weight of each pixel, and determine the matrix formed by the weights of the facial features as the second attention matrix.
In a possible implementation, in terms of performing feature extraction on the infrared face image to obtain the first facial feature map, the processing unit 702 is configured to:
input the infrared face image into the first branch of the liveness detection model for feature extraction to obtain the first facial feature map;
in terms of performing feature extraction on the color face image to obtain the second facial feature map, the processing unit 702 is configured to:
input the color face image into the second branch of the liveness detection model for feature extraction to obtain the second facial feature map.
In a possible implementation, the liveness detection model further includes a category attribute classifier, at least two third branches, and a liveness detection classifier, where the second branch, the attribute classifier, and the at least two third branches are connected in sequence; the at least two third branches are independent of one another, and each of them corresponds to different category attribute information; the output of each third branch is concatenated with the output of the first branch, and the concatenated output serves as the input of the liveness detection classifier.
在一种可能的实施方式中,第二人脸特征图中的特征包括材质、纹理和光泽中的一种或至少两种语义信息,在根据第二人脸特征图,得到检测对象的目标类别属性信息方面,处理单元702配置为:In a possible implementation, the features in the second facial feature map include one or at least two semantic information of material, texture and gloss, and the target category of the detection object is obtained based on the second facial feature map. In terms of attribute information, the processing unit 702 is configured as:
将第二人脸特征图输入属性分类器,以通过属性分类器对一种或至少两种语义信息进行分类预测,得到目标类别属性信息,目标类别属性信息包括所属地标识;Input the second face feature map into the attribute classifier to classify and predict one or at least two types of semantic information through the attribute classifier to obtain target category attribute information, where the target category attribute information includes a location identifier;
在根据目标类别属性信息和第二人脸特征图,得到第三人脸特征图方面,处理单元702配置为:In terms of obtaining the third facial feature map based on the target category attribute information and the second facial feature map, the processing unit 702 is configured as:
从至少两个第三分支中确定出与所属地标识对应的第三分支,将第二人脸特征图输入与所属地标识对应的第三分支进行特征提取,得到第三人脸特征图。A third branch corresponding to the location identification is determined from at least two third branches, and the second face feature map is input into the third branch corresponding to the location identification for feature extraction to obtain a third face feature map.
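As one concrete reading of this branch topology, the PyTorch sketch below wires a second (color) branch, an attribute classifier, and per-category third branches together, concatenating each routed output with the infrared branch before a liveness classifier. All layer shapes, the number of branches, and the hard argmax routing rule are illustrative assumptions, not the claimed model.

```python
import torch
import torch.nn as nn

class MultiBranchLiveness(nn.Module):
    """Sketch of the described topology; all dimensions are assumptions."""

    def __init__(self, num_regions=3, feat_dim=128):
        super().__init__()
        # First branch (infrared) and second branch (color), as described.
        self.branch_ir = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.branch_rgb = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
        self.attr_classifier = nn.Linear(feat_dim, num_regions)
        # One independent third branch per category attribute value.
        self.third_branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
                          nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            for _ in range(num_regions)])
        self.liveness_classifier = nn.Linear(2 * feat_dim, 2)  # live vs. spoof

    def forward(self, ir_img, rgb_img):
        f1 = self.branch_ir(ir_img).flatten(1)      # first facial feature map
        f2 = self.branch_rgb(rgb_img)               # second facial feature map
        # Predicted location identifier (inference-style hard routing).
        region = self.attr_classifier(f2.mean(dim=(2, 3))).argmax(dim=1)
        # Route each sample through the third branch matching its identifier.
        f3 = torch.stack([self.third_branches[r](f2[i:i + 1]).flatten(1)[0]
                          for i, r in enumerate(region.tolist())])
        fused = torch.cat([f1, f3], dim=1)          # concatenate with first branch
        return self.liveness_classifier(fused)
```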
According to an embodiment of the present disclosure, the units of the liveness detection apparatus shown in FIG. 7 may be combined, separately or in their entirety, into one or several other units, or one or more of these units may be further split into at least two functionally smaller units. Either way, the same operations can be achieved without affecting the realization of the technical effects of the embodiments of the present disclosure. The above units are divided on the basis of logical functions; in practical applications, the function of one unit may be realized by at least two units, or the functions of at least two units may be realized by a single unit. In other embodiments of the present disclosure, the liveness detection apparatus may also include other units; in practical applications, these functions may be implemented with the assistance of other units, and may be implemented by at least two units in cooperation.
According to another embodiment of the present disclosure, the liveness detection apparatus shown in FIG. 7 may be constructed, and the liveness detection method of the embodiments of the present disclosure may be implemented, by running a computer program (including program code) capable of executing each step of the corresponding method shown in FIG. 2A or FIG. 6 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device through that medium, and run therein.
Based on the descriptions of the above method embodiments and apparatus embodiments, embodiments of the present disclosure provide an electronic device. Referring to FIG. 8, the electronic device includes a transceiver 801, a processor 802, and a memory 803, which are connected via a bus 804. The memory 803 is configured to store computer programs and data, and can transmit the stored data to the processor 802. The processor 802 is configured to read the computer program in the memory 803 to perform the following operations:
acquiring an infrared face image and a color face image of a detection object captured by a binocular camera;
performing feature extraction on the infrared face image to obtain a first facial feature map, and performing feature extraction on the color face image to obtain a second facial feature map;
obtaining target category attribute information of the detection object according to the second facial feature map;
obtaining a third facial feature map according to the target category attribute information and the second facial feature map;
obtaining a liveness detection result of the detection object according to the first facial feature map and the third facial feature map.
As can be seen, in the electronic device shown in FIG. 8, the infrared face image and the color face image of the detection object captured by the binocular camera are acquired; feature extraction is performed on the infrared face image to obtain the first facial feature map, and feature extraction is performed on the color face image to obtain the second facial feature map; the target category attribute information of the detection object is obtained according to the second facial feature map; the third facial feature map is obtained according to the target category attribute information and the second facial feature map; and the liveness detection result of the detection object is obtained according to the first facial feature map and the third facial feature map. In this way, the second facial feature map extracted from the color face image is classified to obtain the category attribute information of the detection object (i.e., the target category attribute information), and the second facial feature map is converted into the third facial feature map based on that category attribute information, realizing feature extraction that carries category attribute information. Performing liveness detection with the features carrying category attribute information (i.e., the third facial feature map) together with the infrared facial features (i.e., the first facial feature map) is beneficial to improving the accuracy of binocular liveness detection.
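Read as pseudocode, the five operations above reduce to a short inference routine. The sketch below assumes a model object exposing the named components; every attribute name here is hypothetical.

```python
def detect_liveness(model, ir_img, rgb_img):
    """Sketch of the processor's five operations; `model` attributes are assumed names."""
    f1 = model.branch_ir(ir_img)              # first facial feature map (infrared)
    f2 = model.branch_rgb(rgb_img)            # second facial feature map (color)
    attr = model.attr_classifier(f2)          # target category attribute information
    f3 = model.third_branch_for(attr)(f2)     # third facial feature map
    return model.liveness_head(f1, f3)        # liveness detection result
```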
In a possible implementation, the processor 802 obtaining the liveness detection result of the detection object according to the first facial feature map and the third facial feature map includes:
concatenating the first facial feature map and the third facial feature map to obtain a fourth facial feature map;
obtaining the degree of attention of each facial feature in the fourth facial feature map to obtain an attention matrix;
multiplying the fourth facial feature map by the attention matrix to obtain a first weighted feature map;
classifying the first weighted feature map to obtain the liveness detection result of the detection object.
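A minimal PyTorch sketch of these four steps, assuming the attention matrix has already been computed and broadcasts over the channel dimension:

```python
import torch

def classify_with_attention(f1, f3, attention, classifier):
    """Sketch: concatenate, weight by attention, classify.

    f1, f3: (B, C, H, W) feature maps; attention: (B, 1, H, W), assumed
    broadcastable; classifier: any module mapping flat features to logits.
    """
    f4 = torch.cat([f1, f3], dim=1)         # fourth facial feature map
    weighted = f4 * attention               # element-wise product -> first weighted map
    return classifier(weighted.flatten(1))  # liveness detection result
```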
In a possible implementation, the processor 802 obtaining the degree of attention of each facial feature in the fourth facial feature map to obtain the attention matrix includes:
using an attention model to generate an attention coefficient for each facial feature in the fourth facial feature map, and determining the matrix composed of the attention coefficients as a first attention matrix;
performing key point detection on the color face image to obtain M key points of a preset region of interest, M being an integer greater than 1;
determining a second attention matrix based on the fourth facial feature map and the M key points;
adding the first attention matrix and the second attention matrix to obtain the attention matrix.
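The combination of the two attention sources can be sketched as follows; `attention_model`, `keypoint_detector`, and `second_attention_matrix` stand in for the components described above (the last as in the earlier sketch), so their interfaces are assumptions.

```python
def build_attention(f4, rgb_img, attention_model, keypoint_detector,
                    second_attention_matrix):
    """Sketch: learned attention plus key-point attention (assumed interfaces)."""
    a1 = attention_model(f4)                     # first attention matrix
    keypoints = keypoint_detector(rgb_img)       # M key points with coords and category
    a2 = second_attention_matrix(f4, keypoints)  # second attention matrix
    return a1 + a2                               # element-wise sum -> attention matrix
```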
In a possible implementation, the M key points further include coordinate information and category information of each key point, and the processor 802 determining the second attention matrix based on the fourth facial feature map and the M key points includes:
for the location of each facial feature in the fourth facial feature map, determining at least two pixel points corresponding to that location in the color face image, and obtaining the coordinate information of the at least two pixel points;
for each of the at least two pixel points, calculating the distance between that pixel point and each of the M key points, using the coordinate information of the pixel point and the coordinate information of the M key points;
assigning a weight to each pixel point based on the distance between the pixel point and each of the M key points and the category information of each key point, obtaining M reference weights for that pixel point;
determining the average of the M reference weights as the weight of the pixel point;
determining the weight of each facial feature based on the weight of each pixel point, and determining the matrix composed of the weights of the facial features as the second attention matrix.
In a possible implementation, the processor 802 performing feature extraction on the infrared face image to obtain the first facial feature map includes: inputting the infrared face image into the first branch of the liveness detection model for feature extraction to obtain the first facial feature map;
The processor 802 performing feature extraction on the color face image to obtain the second facial feature map includes:
inputting the color face image into the second branch of the liveness detection model for feature extraction to obtain the second facial feature map.
In a possible implementation, the liveness detection model further includes a category attribute classifier, at least two third branches, and a liveness detection classifier, where the second branch, the attribute classifier, and the at least two third branches are connected in sequence; the at least two third branches are independent of one another, and each of them corresponds to different category attribute information; the output of each third branch is concatenated with the output of the first branch, and the concatenated output serves as the input of the liveness detection classifier.
In a possible implementation, the features in the second facial feature map include one or at least two kinds of semantic information among material, texture, and gloss, and the processor 802 obtaining the target category attribute information of the detection object according to the second facial feature map includes:
inputting the second facial feature map into the attribute classifier to classify and predict the one or at least two kinds of semantic information through the attribute classifier, obtaining the target category attribute information, the target category attribute information including a location identifier;
Obtaining the third facial feature map according to the target category attribute information and the second facial feature map includes:
determining, from the at least two third branches, the third branch corresponding to the location identifier, and inputting the second facial feature map into the third branch corresponding to the location identifier for feature extraction to obtain the third facial feature map.
Illustratively, the electronic device may include, but is not limited to, the transceiver 801, the processor 802, and the memory 803. Those skilled in the art can understand that this schematic is merely an example of an electronic device and does not constitute a limitation on the electronic device, which may include more or fewer components than shown, a combination of certain components, or different components.
It should be noted that, since the processor 802 of the electronic device implements the steps of the liveness detection method of the embodiments of the present disclosure when executing the computer program, the embodiments of the liveness detection method all apply to this electronic device and can achieve the same or similar beneficial effects.
Embodiments of the present disclosure further provide a computer-readable storage medium storing a computer program, the computer program being executed by a processor to implement some or all of the steps of any liveness detection method described in the above method embodiments. The computer-readable storage medium may store only the computer program corresponding to the liveness detection method.
A computer-readable storage medium may be a tangible device capable of retaining and storing instructions for use by an instruction execution device, and may be a volatile or non-volatile storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, a magnetic disk, an optical disc, a random access memory, a read-only memory, an erasable programmable read-only memory (EPROM) or flash memory, a static random-access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
Embodiments of the present disclosure further provide a computer program including computer-readable code. When the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
Embodiments of the present disclosure further provide a computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps of any liveness detection method described in the above method embodiments.
It should be noted that, for brevity, the foregoing method embodiments are all described as series of action combinations. Those skilled in the art should know, however, that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division of the units, for instance, is only a division by logical function, and other divisions are possible in actual implementation; for example, at least two units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across at least two network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solutions of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure.
The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the descriptions of the above embodiments are intended only to help in understanding the methods and core ideas of the present disclosure. Meanwhile, those of ordinary skill in the art may, following the ideas of the present disclosure, make changes to the specific implementations and application scope. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (18)

  1. A liveness detection method, the method comprising:
    acquiring an infrared face image and a color face image of a detection object captured by a binocular camera;
    performing feature extraction on the infrared face image to obtain a first facial feature map, and performing feature extraction on the color face image to obtain a second facial feature map;
    obtaining target category attribute information of the detection object according to the second facial feature map;
    obtaining a third facial feature map according to the target category attribute information and the second facial feature map;
    obtaining a liveness detection result of the detection object according to the first facial feature map and the third facial feature map.
  2. The method according to claim 1, wherein obtaining the liveness detection result of the detection object according to the first facial feature map and the third facial feature map comprises:
    concatenating the first facial feature map and the third facial feature map to obtain a fourth facial feature map;
    obtaining a degree of attention of each facial feature in the fourth facial feature map to obtain an attention matrix;
    multiplying the fourth facial feature map by the attention matrix to obtain a first weighted feature map;
    classifying the first weighted feature map to obtain the liveness detection result of the detection object.
  3. The method according to claim 2, wherein obtaining the degree of attention of each facial feature in the fourth facial feature map to obtain the attention matrix comprises:
    using an attention model to generate an attention coefficient for each facial feature in the fourth facial feature map, and determining a matrix composed of the attention coefficients as a first attention matrix;
    performing key point detection on the color face image to obtain M key points of a preset region of interest, M being an integer greater than 1;
    determining a second attention matrix based on the fourth facial feature map and the M key points;
    adding the first attention matrix and the second attention matrix to obtain the attention matrix.
  4. The method according to claim 3, wherein the M key points further include coordinate information and category information of each key point, and determining the second attention matrix based on the fourth facial feature map and the M key points comprises:
    for the location of each facial feature in the fourth facial feature map, determining at least two pixel points corresponding to that location in the color face image, and obtaining the coordinate information of the at least two pixel points;
    for each of the at least two pixel points, calculating the distance between that pixel point and each of the M key points, using the coordinate information of the pixel point and the coordinate information of the M key points;
    assigning a weight to each pixel point based on the distance between the pixel point and each of the M key points and the category information of each key point, obtaining M reference weights for that pixel point;
    determining the average of the M reference weights as the weight of the pixel point;
    determining the weight of each facial feature based on the weight of each pixel point, and determining a matrix composed of the weights of the facial features as the second attention matrix.
  5. The method according to any one of claims 1 to 4, wherein performing feature extraction on the infrared face image to obtain the first facial feature map comprises:
    inputting the infrared face image into a first branch of a liveness detection model for feature extraction to obtain the first facial feature map;
    and performing feature extraction on the color face image to obtain the second facial feature map comprises:
    inputting the color face image into a second branch of the liveness detection model for feature extraction to obtain the second facial feature map.
  6. The method according to claim 5, wherein the liveness detection model further comprises a category attribute classifier, at least two third branches, and a liveness detection classifier, wherein the second branch, the attribute classifier, and the at least two third branches are connected in sequence; the at least two third branches are independent of one another, and each of the at least two third branches corresponds to different category attribute information; an output of each third branch is concatenated with an output of the first branch, and the concatenated output serves as an input of the liveness detection classifier.
  7. The method according to claim 6, wherein the features in the second facial feature map include one or at least two kinds of semantic information among material, texture, and gloss, and obtaining the target category attribute information of the detection object according to the second facial feature map comprises:
    inputting the second facial feature map into the attribute classifier, and classifying and predicting the one or at least two kinds of semantic information through the attribute classifier to obtain the target category attribute information, the target category attribute information including a location identifier;
    and obtaining the third facial feature map according to the target category attribute information and the second facial feature map comprises:
    determining, from the at least two third branches, a third branch corresponding to the location identifier, and inputting the second facial feature map into the third branch corresponding to the location identifier for feature extraction to obtain the third facial feature map.
  8. A liveness detection apparatus, the apparatus comprising an acquisition unit and a processing unit;
    the acquisition unit being configured to acquire an infrared face image and a color face image of a detection object captured by a binocular camera;
    the processing unit being configured to perform feature extraction on the infrared face image to obtain a first facial feature map, and to perform feature extraction on the color face image to obtain a second facial feature map;
    the processing unit being further configured to obtain target category attribute information of the detection object according to the second facial feature map;
    the processing unit being further configured to obtain a third facial feature map according to the target category attribute information and the second facial feature map;
    the processing unit being further configured to obtain a liveness detection result of the detection object according to the first facial feature map and the third facial feature map.
  9. The apparatus according to claim 8, wherein the processing unit is further configured to:
    concatenate the first facial feature map and the third facial feature map to obtain a fourth facial feature map;
    obtain a degree of attention of each facial feature in the fourth facial feature map to obtain an attention matrix;
    multiply the fourth facial feature map by the attention matrix to obtain a first weighted feature map;
    classify the first weighted feature map to obtain the liveness detection result of the detection object.
  10. The apparatus according to claim 9, wherein the processing unit is further configured to:
    use an attention model to generate an attention coefficient for each facial feature in the fourth facial feature map, and determine a matrix composed of the attention coefficients as a first attention matrix;
    perform key point detection on the color face image to obtain M key points of a preset region of interest, M being an integer greater than 1;
    determine a second attention matrix based on the fourth facial feature map and the M key points;
    add the first attention matrix and the second attention matrix to obtain the attention matrix.
  11. The apparatus according to claim 10, wherein the M key points further include coordinate information and category information of each key point, and the processing unit is further configured to:
    for the location of each facial feature in the fourth facial feature map, determine at least two pixel points corresponding to that location in the color face image, and obtain the coordinate information of the at least two pixel points;
    for each of the at least two pixel points, calculate the distance between that pixel point and each of the M key points, using the coordinate information of the pixel point and the coordinate information of the M key points;
    assign a weight to each pixel point based on the distance between the pixel point and each of the M key points and the category information of each key point, obtaining M reference weights for that pixel point;
    determine the average of the M reference weights as the weight of the pixel point;
    determine the weight of each facial feature based on the weight of each pixel point, and determine a matrix composed of the weights of the facial features as the second attention matrix.
  12. The apparatus according to any one of claims 8 to 11, wherein the processing unit is further configured to:
    input the infrared face image into a first branch of a liveness detection model for feature extraction to obtain the first facial feature map;
    wherein performing feature extraction on the color face image to obtain the second facial feature map includes:
    inputting the color face image into a second branch of the liveness detection model for feature extraction to obtain the second facial feature map.
  13. The apparatus according to claim 12, wherein the liveness detection model further comprises a category attribute classifier, at least two third branches, and a liveness detection classifier, wherein the second branch, the attribute classifier, and the at least two third branches are connected in sequence; the at least two third branches are independent of one another, and each of the at least two third branches corresponds to different category attribute information; an output of each third branch is concatenated with an output of the first branch, and the concatenated output serves as an input of the liveness detection classifier.
  14. The apparatus according to claim 13, wherein the features in the second facial feature map include one or at least two kinds of semantic information among material, texture, and gloss, and the processing unit is further configured to:
    input the second facial feature map into the attribute classifier, and classify and predict the one or at least two kinds of semantic information through the attribute classifier to obtain the target category attribute information, the target category attribute information including a location identifier;
    determine, from the at least two third branches, the third branch corresponding to the location identifier, and input the second facial feature map into the third branch corresponding to the location identifier for feature extraction to obtain the third facial feature map.
  15. An electronic device, comprising a processor connected to a memory, the memory being configured to store a computer program, and the processor being configured to execute the computer program stored in the memory to cause the electronic device to perform the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium storing a computer program, the computer program being executed by a processor to implement the method according to any one of claims 1 to 7.
  17. A computer program comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes steps for implementing the method according to any one of claims 1 to 7.
  18. A computer program product configured to store computer-readable instructions which, when executed, cause a computer to perform the method according to any one of claims 1 to 7.
PCT/CN2022/110261 2022-03-22 2022-08-04 Liveness detection method and apparatus, and electronic device, storage medium, computer program and computer program product WO2023178906A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210283792.2 2022-03-22
CN202210283792.2A CN114677730A (en) 2022-03-22 2022-03-22 Living body detection method, living body detection device, electronic apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2023178906A1 (en)

Family

ID=82074801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110261 WO2023178906A1 (en) 2022-03-22 2022-08-04 Liveness detection method and apparatus, and electronic device, storage medium, computer program and computer program product

Country Status (2)

Country Link
CN (1) CN114677730A (en)
WO (1) WO2023178906A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677730A (en) * 2022-03-22 2022-06-28 北京市商汤科技开发有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium
CN116363762A (en) * 2022-12-23 2023-06-30 北京百度网讯科技有限公司 Living body detection method, training method and device of deep learning model
CN116453194B (en) * 2023-04-21 2024-04-12 无锡车联天下信息技术有限公司 Face attribute discriminating method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975908A (en) * 2016-04-26 2016-09-28 汉柏科技有限公司 Face recognition method and device thereof
WO2020134858A1 (en) * 2018-12-29 2020-07-02 北京市商汤科技开发有限公司 Facial attribute recognition method and apparatus, electronic device, and storage medium
CN111401134A (en) * 2020-02-19 2020-07-10 北京三快在线科技有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium
CN112818722A (en) * 2019-11-15 2021-05-18 上海大学 Modular dynamically configurable living body face recognition system
CN113449623A (en) * 2021-06-21 2021-09-28 浙江康旭科技有限公司 Light living body detection method based on deep learning
CN114677730A (en) * 2022-03-22 2022-06-28 北京市商汤科技开发有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium

Also Published As

Publication number Publication date
CN114677730A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
WO2023178906A1 (en) Liveness detection method and apparatus, and electronic device, storage medium, computer program and computer program product
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN111062871A (en) Image processing method and device, computer equipment and readable storage medium
EP4137991A1 (en) Pedestrian re-identification method and device
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
WO2020024484A1 (en) Method and device for outputting data
WO2023098128A1 (en) Living body detection method and apparatus, and training method and apparatus for living body detection system
WO2018196718A1 (en) Image disambiguation method and device, storage medium, and electronic device
US20200285859A1 (en) Video summary generation method and apparatus, electronic device, and computer storage medium
CN111062328B (en) Image processing method and device and intelligent robot
CN112188306B (en) Label generation method, device, equipment and storage medium
CN112446322B (en) Eyeball characteristic detection method, device, equipment and computer readable storage medium
CN112036284B (en) Image processing method, device, equipment and storage medium
WO2023173646A1 (en) Expression recognition method and apparatus
WO2021127916A1 (en) Facial emotion recognition method, smart device and computer-readabel storage medium
WO2021134485A1 (en) Method and device for scoring video, storage medium and electronic device
CN113723164A (en) Method, device and equipment for acquiring edge difference information and storage medium
CN111259698A (en) Method and device for acquiring image
CN113486260B (en) Method and device for generating interactive information, computer equipment and storage medium
CN111274946B (en) Face recognition method, system and equipment
CN114067394A (en) Face living body detection method and device, electronic equipment and storage medium
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN115223214A (en) Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment
CN112070744A (en) Face recognition method, system, device and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22932958

Country of ref document: EP

Kind code of ref document: A1