CN116110111B - Face recognition method, electronic equipment and storage medium - Google Patents

Face recognition method, electronic equipment and storage medium

Info

Publication number
CN116110111B
Authority
CN
China
Prior art keywords
mouth
image
living body
face
mouth region
Prior art date
Legal status
Active
Application number
CN202310291936.3A
Other languages
Chinese (zh)
Other versions
CN116110111A (en)
Inventor
梁俊杰
Current Assignee
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202310291936.3A
Publication of CN116110111A
Application granted
Publication of CN116110111B


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 — Arrangements for image or video recognition or understanding
    • G06V 10/70 — Arrangements using pattern recognition or machine learning
    • G06V 10/82 — Arrangements using neural networks
    • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16 — Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 — Detection; localisation; normalisation
    • G06V 40/172 — Classification, e.g. identification
    • G06V 40/40 — Spoof detection, e.g. liveness detection
    • G06V 40/45 — Detection of the body part being alive

Abstract

The application discloses a face recognition method, electronic equipment and a storage medium, wherein the face recognition method comprises the following steps: acquiring a plurality of face images and the depth image corresponding to each face image; acquiring a first mouth region image in each face image, and determining a second mouth region image in the depth image corresponding to the face image by using the first mouth region image; inputting the first mouth region image into a mouth shape classification model to obtain the target mouth shape corresponding to the first mouth region; inputting the second mouth region image into a living body recognition model to obtain a living body detection result; and determining that the target object corresponding to the face images is a living body when the target mouth shape and the living body detection result meet preset requirements. In this way, the face recognition method provided by the application can improve the efficiency of living body detection and strengthen the resistance of face recognition to being defeated by spoofing.

Description

Face recognition method, electronic equipment and storage medium
Technical Field
The application relates to the technical field of face recognition, in particular to a face recognition method, electronic equipment and a storage medium.
Background
Face living body recognition is very common in identity verification scenarios, especially in financial business scenarios such as bank account opening, transactions and loans, where face recognition is needed to assist in verifying identity, so as to ensure the security and reliability of remote business and avoid property losses caused by users being impersonated.
Existing face recognition technologies include action-based living body detection, silent living body detection, colorful-light living body detection and the like. However, because their operation flows are relatively fixed, online face recognition can easily be completed with a fake face image, so their security is poor.
Disclosure of Invention
The application provides a face recognition method, electronic equipment and a storage medium, which can improve the security of face recognition.
In order to solve the technical problems, the application adopts a technical scheme that: provided is a face recognition method, including: acquiring a plurality of face images and depth images corresponding to each face image; acquiring a first mouth region image in each face image, and determining a second mouth region image in a depth image corresponding to the face image by using the first mouth region image; inputting the first mouth region image into a mouth shape classification model to obtain a target mouth shape corresponding to the first mouth region; inputting the second mouth region image into a living body recognition model to obtain a living body detection result; and determining that the target object corresponding to the face image is a living body according to the target mouth shape and the living body detection result meeting preset requirements.
The method for acquiring the plurality of face images and the depth image corresponding to each face image comprises the following steps: prompting at least two preset characters; and acquiring a plurality of face images and depth images corresponding to the face images in the process that the target object reads at least two preset characters.
Wherein the at least two preset characters correspond to at least two mouth shapes.
The method for acquiring the first mouth region image in each face image comprises the following steps: carrying out key point identification on each face image to obtain a plurality of key points; determining key points belonging to the mouth from a plurality of key points; and determining a first mouth region image according to the key points of the mouth.
Inputting the first mouth region image into a mouth shape classification model to obtain a target mouth shape corresponding to the first mouth region, wherein the method comprises the following steps: inputting the first mouth region image into a mouth shape classification model to obtain probability values of all mouth shapes corresponding to the first mouth region; and taking the mouth shape corresponding to the maximum probability value as a target mouth shape.
Inputting the first mouth region image into a mouth shape classification model to obtain probability values of all mouth shapes corresponding to the first mouth region, wherein the method comprises the following steps: and inputting coordinate information of key points in the first mouth region image into a mouth shape classification model to obtain probability values of all mouth shapes corresponding to the first mouth region.
The determining a second mouth region image in the depth image corresponding to the face image by using the first mouth region image comprises the following steps: acquiring position information of a first mouth area; and determining a target area corresponding to the position information in the depth image, and determining an image corresponding to the target area as a second mouth area image.
The second mouth area image is input to a living body recognition model to obtain a living body detection result, and the method comprises the following steps: randomly acquiring a preset number of second mouth area images; inputting a preset number of second mouth region images into a living body recognition model to obtain a living body probability value corresponding to each second mouth region image; the plurality of living body probability values are averaged, and a living body detection result is determined based on the averaged living body probability values.
The living body identification model comprises an attention mechanism layer and a plurality of network layers which are connected in sequence, wherein the attention mechanism layer is connected between part of adjacent network layers and is used for carrying out channel weighting on the characteristics output by the previous network layer in the adjacent network layers.
Wherein the network layer comprises a plurality of convolution kernels with different sizes; the convolution kernel of the first size is used to extract global features and the convolution kernel of the second size is used to extract texture features, the convolution kernel of the first size being larger than the convolution kernel of the second size.
Wherein, in response to the target mouth shape and the living body detection result meeting the preset requirements, determining that the target object corresponding to the face image is a living body comprises: and determining that the target object corresponding to the face image is a living body in response to the target mouth shape belonging to the preset mouth shape and in response to the living body detection result being displayed as the living body.
In order to solve the technical problems, the application adopts another technical scheme that: an electronic device is provided, the electronic device includes an image acquisition unit, and a memory and a processor coupled to the image acquisition unit, wherein the image acquisition unit is used for acquiring a plurality of face images and depth images corresponding to each face image, the memory is used for storing a computer program, and the processor is used for executing the computer program to implement the face recognition method.
In order to solve the technical problems, the application adopts another technical scheme that: there is provided a computer readable storage medium for storing a computer program for implementing the face recognition method described above when executed by a processor.
The beneficial effects of the application are as follows: in contrast to the prior art, the face recognition method provided by the application obtains the first mouth region image in each face image, uses the first mouth region image to determine the second mouth region image in the depth image corresponding to the face image, and then performs mouth shape prediction on the first mouth region image and living body prediction on the second mouth region image. When both predictions meet the preset requirements, the target object corresponding to the face images can be determined to be a living body, which improves the security of face recognition and further improves its accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
fig. 1 is a schematic flow chart of a first embodiment of a face recognition method provided by the application;
FIG. 2 is a flow chart of an embodiment of step 13 provided in the present application;
FIG. 3 is a flow chart of an embodiment of step 14 provided in the present application;
FIG. 4 is a schematic diagram of a network structure of a living body identification model provided by the present application;
FIG. 5 is a schematic diagram of an embodiment of an electronic device according to the present application;
fig. 6 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
With the continuous development of technology, face recognition is applied in more and more scenarios, especially in the financial industry, e.g., bank account opening, transactions and loans, where face recognition is needed to assist in verifying identity, so as to ensure the security and reliability of remote services and avoid property losses caused by users being impersonated.
For example, when a large transaction is made through a bank APP (application), in order to protect the user's property, the user is required not only to provide the corresponding identity information, keys and other credentials, but also to perform face recognition in the APP, e.g., to perform actions such as opening the mouth and blinking, so as to ensure that face recognition is performed in real time rather than with a photo. However, conventional face living body recognition technologies follow relatively fixed operation flows (e.g., open the mouth after blinking, shake the head after nodding) and lack important three-dimensional features, so they are prone to being passed with fake face images and to high false detection rates, and their security and accuracy are poor.
In order to solve the above-mentioned problems, the present application proposes a new face recognition technology, referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of a face recognition method provided by the present application, where the method includes:
step 11: and acquiring a plurality of face images and depth images corresponding to each face image.
In some embodiments, at least two preset characters are prompted, and the plurality of face images and the depth image corresponding to each face image are collected while the target object reads the at least two preset characters aloud. The at least two preset characters correspond to at least two mouth shapes.
The mouth shapes may include three types: open mouth, pouting and closed mouth. Characters corresponding to an open mouth may be, for example, ones pronounced like 'o' or 'ya' (such as the Chinese character for 'duck'); characters corresponding to pouting may be ones pronounced 'shu' (such as the characters for 'book' and 'tree'); and characters corresponding to a closed mouth may be ones whose pronunciation starts with a bilabial consonant (such as the characters for 'closed', 'leather', 'pen', 'nose' and 'rice').
In order to enhance the security of face recognition, the target object may be asked to read characters covering at least two types of mouth shapes. For example, three characters are selected which together correspond to at least two mouth shapes.
In some embodiments, a character library can be built in advance. When face recognition is required, preset characters corresponding to different mouth shapes are randomly selected from the library, and the plurality of face images and the depth image corresponding to each face image are collected while the target object reads the at least two preset characters.
In addition, the face images and depth images may be acquired by a device equipped with LiDAR (Light Detection and Ranging), and the face images may be RGB images or gray-scale images. For example, a smartphone with a LiDAR sensor can acquire the RGB face images and the corresponding depth images.
Step 12: and acquiring a first mouth region image in each face image, and determining a second mouth region image in the depth image corresponding to the face image by using the first mouth region image.
In some embodiments, key point recognition may be performed on each face image to obtain a plurality of key points, the key points belonging to the mouth are determined from them, and the first mouth region image is then determined from the mouth key points. Since each key point carries coordinate information, the mouth key points with the largest X coordinate, the largest Y coordinate, the smallest X coordinate and the smallest Y coordinate can be found, and the mouth region image is determined from these four key points, i.e., as their bounding box.
For example, the landmarks (feature points/key points) of the face image are extracted with MediaPipe, and a mouth region image containing the landmarks, i.e., the first mouth region image, can be determined from the face image. MediaPipe is a framework for building machine learning pipelines and can be used to process sequence data such as video and audio.
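As an illustration only (not the application's reference implementation), the following Python sketch extracts the mouth key points with MediaPipe FaceMesh and crops their bounding box; the function name first_mouth_region and the OpenCV color conversion are assumptions.

    # A minimal sketch, assuming a BGR frame as produced by OpenCV; the
    # FACEMESH_LIPS index pairs identify the key points belonging to the mouth.
    import cv2
    import mediapipe as mp

    mp_face_mesh = mp.solutions.face_mesh
    LIP_IDS = {i for pair in mp_face_mesh.FACEMESH_LIPS for i in pair}

    def first_mouth_region(bgr_image):
        h, w = bgr_image.shape[:2]
        with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
            result = fm.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
        if not result.multi_face_landmarks:
            return None, None, None
        lm = result.multi_face_landmarks[0].landmark
        # Pixel coordinates of all mouth key points.
        pts = [(int(lm[i].x * w), int(lm[i].y * h)) for i in LIP_IDS]
        xs, ys = zip(*pts)
        # Bounding box spanned by the four extreme key points described above.
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        return bgr_image[y0:y1, x0:x1], (x0, y0, x1, y1), pts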
In addition, changes in the mouth shape may be determined from changes in the positions of the key points in the first mouth region image.
In some embodiments, the position information of the first mouth region may be acquired, so as to determine a target region corresponding to the position information in the depth image, and determine an image corresponding to the target region as the second mouth region image.
For example, MediaPipe is used to extract the landmarks of the face image, the position information of the mouth region is obtained from the landmarks, and this position information is then mapped into the depth image to obtain the mouth region image in the depth image (i.e., the second mouth region image). It will be appreciated that the pixels of the face image and of the depth image correspond one-to-one, so the mouth region image corresponding to the face image can be determined in the depth image by means of this position mapping.
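Given that pixel-level alignment, the mapping reduces to reusing the same box coordinates; a minimal sketch (second_mouth_region is an illustrative name):

    # Sketch: the RGB and depth pixels correspond one-to-one, so the box
    # computed on the face image delimits the mouth region in the depth image.
    def second_mouth_region(depth_image, box):
        x0, y0, x1, y1 = box          # box returned by first_mouth_region()
        return depth_image[y0:y1, x0:x1]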
In addition, when the target object opens the mouth or pouts, the second mouth region image contains an obvious concave region. This concave region exists only when the target object is a living body; it is absent from a re-photographed face image or photo, and the depth information of a depth image with the concave region is inconsistent with that of one without it. Whether the concave region exists is therefore an important basis for judging whether the target object is a living body, and inspecting the depth image during face recognition can improve its security.
Step 13: and inputting the first mouth region image into a mouth shape classification model to obtain a target mouth shape corresponding to the first mouth region.
The mouth shape classification model may be an SVM classifier based on the SVM (Support Vector Machine) algorithm.
In some embodiments, referring to fig. 2, step 13 may include the following procedure:
step 21: and inputting the first mouth region image into a mouth shape classification model to obtain probability values of all mouth shapes corresponding to the first mouth region.
Specifically, coordinate information of key points in the first mouth region image is input into a mouth shape classification model to obtain probability values of all mouth shapes corresponding to the first mouth region.
For example, the x and y coordinates of all key points in the first mouth region image are used as features and input into the SVM classifier. Since one first mouth region image contains a plurality of key points, the SVM classifier can predict the mouth shape of each first mouth region image from these coordinates, and one first mouth region image yields as many probability values as there are mouth shapes. With 3 mouth shapes, for instance, mouth shape prediction on one first mouth region image with the SVM classifier yields 3 probability values.
Step 22: and taking the mouth shape corresponding to the maximum probability value as a target mouth shape.
One first mouth region image yields a probability value for every mouth shape, and the mouth shape corresponding to the maximum probability value is selected from them as the target mouth shape. For example, suppose there are 3 mouth shapes (open mouth, pouting and closed mouth), and mouth shape prediction on one first mouth region image gives a probability of 20% for open mouth, 70% for closed mouth and 10% for pouting; the mouth shape corresponding to the maximum probability value, i.e., closed mouth, is then taken as the target mouth shape.
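A minimal sketch of steps 21-22, assuming clf is an SVM classifier fitted as in the training sketch further below and pts are the mouth key points in pixel coordinates; the label order in MOUTH_SHAPES is an assumption:

    import numpy as np

    MOUTH_SHAPES = ["closed", "pout", "open"]      # illustrative label order

    def predict_mouth_shape(clf, pts):
        # The x and y coordinates of all mouth key points form the feature vector.
        feats = np.asarray(pts, dtype=np.float32).ravel().reshape(1, -1)
        probs = clf.predict_proba(feats)[0]        # one value per mouth shape
        # The mouth shape with the maximum probability is the target mouth shape.
        return MOUTH_SHAPES[int(np.argmax(probs))], probs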
In some embodiments, a corresponding mouth shape determination rule may be designed, and the mouth shapes recognized from the first mouth region images are checked against it. The rule requires that the selected preset characters cover at least two different mouth shapes and that the mouth shapes corresponding to two adjacent characters differ.
For example, 3 characters are selected from N candidate characters, say 'pen', 'book' and 'must', whose mouth shapes are [1, 2, 1], where a closed mouth is represented by 1, pouting by 2 and an open mouth by 3. If the mouth shape sequence recognized from the collected mouth region images is [1,1,1,1,2,2,1,1,1,1,1,1], which collapses to [1, 2, 1], the selected mouth region images are considered to conform to the mouth shape determination rule; if the recognized sequence is [2,2,2,1,1,1,1,1], the selected mouth region images are considered not to conform to the rule.
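One hedged way to encode this check is to collapse consecutive duplicate per-frame predictions and compare them against the prompted sequence (a sketch using the numbering above):

    from itertools import groupby

    def matches_rule(frame_shapes, prompted_shapes):
        # e.g. [1,1,1,1,2,2,1,1,1,1,1,1] collapses to [1, 2, 1]
        collapsed = [k for k, _ in groupby(frame_shapes)]
        return collapsed == list(prompted_shapes)

    assert matches_rule([1,1,1,1,2,2,1,1,1,1,1,1], [1, 2, 1])   # conforms
    assert not matches_rule([2,2,2,1,1,1,1,1], [1, 2, 1])       # does not conform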
In addition, the training method of the mouth shape classification model may include the following procedures (not shown):
s1: a set of mouth region images is acquired.
S2: and inputting the mouth region images to be trained in the mouth region image set into a mouth shape classification model to perform mouth shape prediction.
The mouth region images to be trained contained in the mouth region image set can be RGB mouth region images marked by using key points, and mouth shape prediction can be performed according to coordinate information of the key points in the RGB mouth region images.
S3: and training the mouth shape classification model by utilizing the predicted mouth shape of the mouth shape classification model and the real mouth shape in the mouth region image to be trained, so as to obtain the final mouth shape classification model.
Step 14: and inputting the second mouth region image into the living body recognition model to obtain a living body detection result.
The living body recognition model may be a depth map model, used for performing living body detection on the depth map.
In some embodiments, referring to fig. 3, step 14 may comprise the following procedure:
step 31: and randomly inputting a preset number of second mouth region images into the living body recognition model to obtain a living body probability value corresponding to each second mouth region image.
In other embodiments, a preset number of second mouth region images may be input to the living body recognition model at preset intervals, so as to obtain a living body probability value corresponding to each second mouth region image.
The living body recognition model recognizes the second mouth region image, and the recognition result is one of two classes, living body or non-living body; that is, one second mouth region image yields two probability values.
A certain number of second mouth region images are input into the living body recognition model, which performs living body recognition on each of them, so that each second mouth region image yields a living body probability value and a non-living body probability value.
In some embodiments, referring to fig. 4, fig. 4 is a schematic diagram of the network structure of the living body recognition model provided by the application. The living body recognition model comprises attention mechanism layers and a plurality of network layers connected in sequence, where an attention mechanism layer is connected between some adjacent network layers and is used to channel-weight the features output by the earlier of the two adjacent network layers. For example, attention blocks (i.e., the attention mechanism layers) may be embedded after the 3rd, 7th and 13th network layers following the stem of a lightweight network.
Wherein the network layer comprises a plurality of convolution kernels with different sizes; the convolution kernel of the first size is used to extract global features and the convolution kernel of the second size is used to extract texture features, the convolution kernel of the first size being larger than the convolution kernel of the second size.
The convolution kernels with different sizes are mixed in a single network layer/convolution layer, so that different types of features can be conveniently captured from the input second mouth region image, a large convolution kernel can capture high-resolution features (such as global features), a small convolution kernel can capture low-resolution features (such as texture features of the image), and the accuracy and efficiency of the model can be improved through the combination of the convolution kernels with different sizes.
Specifically, the attention mechanism layer performs a weighting operation on the channels: each channel is multiplied by a corresponding weight in [0, 1] that reflects its importance, so that effective feature maps receive large weights and ineffective or weakly effective feature maps receive small weights, which lets the model achieve better performance. A series of transformations produce a 1 x 1 x C weight matrix (C is the number of channels, values ranging from 0 to 1), which is then multiplied channel-wise with the original feature maps output by the network layer/convolution layer. In this way the attention mechanism layer makes the model focus more on the important channels.
The living body recognition model combines the attention mechanism with a lightweight network (i.e., the network layers), which improves operating efficiency; and because attention mechanism layers are connected between some adjacent network layers, the lightweight network automatically attends to important features, which improves detection accuracy.
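The application does not disclose exact layer dimensions, so the following PyTorch sketch only illustrates the two ideas above: a network layer mixing two kernel sizes, followed by a squeeze-and-excitation style attention layer that channel-weights its output; all sizes and names are assumptions.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):        # the "attention mechanism layer"
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.fc = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):
            # 1 x 1 x C weights in [0, 1], multiplied channel-wise with x.
            w = self.fc(x).view(x.size(0), -1, 1, 1)
            return x * w

    class MixedKernelBlock(nn.Module):        # one "network layer"
        def __init__(self, channels):
            super().__init__()
            half = channels // 2
            # Large kernel for global features, small kernel for texture.
            self.large = nn.Conv2d(half, half, kernel_size=5, padding=2)
            self.small = nn.Conv2d(half, half, kernel_size=3, padding=1)
            self.attn = ChannelAttention(channels)

        def forward(self, x):
            a, b = x.chunk(2, dim=1)          # split channels between kernels
            y = torch.relu(torch.cat([self.large(a), self.small(b)], dim=1))
            return self.attn(y)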
Step 32: the plurality of living body probability values are averaged, and a living body detection result is determined based on the averaged living body probability values.
One second mouth region image yields two probability values, so N second mouth region images yield N x 2 probability values: N living body probability values and N non-living body probability values. The N living body probability values are averaged to obtain an average living body probability value, and the N non-living body probability values are averaged to obtain an average non-living body probability value. The living body detection result can then be determined based on the average living body probability value or the average non-living body probability value; for example, with an average living body probability value of 80% and an average non-living body probability value of 20%, the target can be judged to be a living body.
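A sketch of step 32, assuming liveness_model returns the pair [living probability, non-living probability] for one depth crop (e.g., softmax outputs of a model like the block above); the 0.5 threshold is an assumption:

    import numpy as np

    def liveness_decision(liveness_model, depth_crops, threshold=0.5):
        probs = np.array([liveness_model(c) for c in depth_crops])   # N x 2
        avg_live, avg_not_live = probs.mean(axis=0)
        # e.g. avg_live = 0.8 and avg_not_live = 0.2 -> judged a living body
        return avg_live > threshold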
In addition, the training method of the living body recognition model may include the following flow (not shown):
s1: a set of depth images is acquired.
S2: and inputting the depth image to be trained in the depth image set into a living body recognition model to carry out living body recognition.
The depth images to be trained are second mouth region images; they carry depth information, and living body recognition is performed according to the depth information in them.
S3: and training the living body recognition model by utilizing the predicted living body recognition result of the living body recognition model and the real living body recognition result of the depth image to be trained, so as to obtain a final living body recognition model.
Steps 13 and 14 may be performed in either order.
Step 15: and determining that the target object corresponding to the face image is a living body according to the target mouth shape and the living body detection result meeting preset requirements.
Specifically, in response to the target mouth shape belonging to the preset mouth shape, and in response to the living body detection result being displayed as a living body, it is determined that the target object corresponding to the face image is a living body.
When the target object is determined to be a living body, face recognition is performed on the face images to obtain the corresponding face recognition result. For example, when the target object performs face recognition in the bank APP, the APP can randomly select characters corresponding to at least two types of mouth shapes, such as 'o' and 'kava', which correspond to opening and closing the mouth respectively. The target object acts on the prompt information shown on the APP's display interface, so that a plurality of face images and the depth image corresponding to each face image are obtained. Based on the face images and depth images, the first mouth region images and second mouth region images are acquired; it is then judged whether the mouth shape corresponding to each first mouth region image belongs to one of the open and closed mouth shapes, and whether the target object in the second mouth region images is a living body. If the mouth shape corresponding to the first mouth region image belongs to one of the open and closed mouth shapes and the target object is a living body, the target object corresponding to the face images is determined to be a living body.
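Tying the earlier sketches together, a hypothetical end-to-end check for this flow might look as follows (every helper is one of the illustrative functions defined above, not the application's verbatim implementation):

    def recognize_live_face(frames, depths, clf, liveness_model, prompted_shapes):
        # prompted_shapes uses the same labels as MOUTH_SHAPES, e.g. ["open", "closed"].
        shapes, depth_crops = [], []
        for frame, depth in zip(frames, depths):
            crop, box, pts = first_mouth_region(frame)
            if crop is None:
                continue
            shape, _ = predict_mouth_shape(clf, pts)
            shapes.append(shape)
            depth_crops.append(second_mouth_region(depth, box))
        # Living body only if the mouth shapes follow the prompted sequence
        # AND the averaged depth-map prediction says "living".
        return (matches_rule(shapes, prompted_shapes)
                and liveness_decision(liveness_model, depth_crops))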
In summary, the randomness of the characters and mouth shapes improves the security of face recognition and reduces the possibility of someone else successfully impersonating the user, while the recognition approach that fuses mouth shape synchronization with the depth map exploits three-dimensional features and so improves the accuracy of face recognition.
In contrast to the prior art, the face recognition method provided by the application obtains the first mouth region image in each face image, uses the first mouth region image to determine the second mouth region image in the depth image corresponding to the face image, then performs mouth shape prediction on the first mouth region image and living body prediction on the second mouth region image, and determines the target object corresponding to the face images to be a living body when both predictions meet the preset requirements, thereby improving the security and accuracy of face recognition.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of an electronic device according to the application. The electronic device 50 includes an image acquisition module 501, and a memory 502 and a processor 503 coupled to the image acquisition module 501, where the image acquisition module 501 is configured to acquire a plurality of face images and the depth image corresponding to each face image, the memory 502 is configured to store a computer program, and the processor 503 is configured to execute the computer program to implement the face recognition method of any one of the foregoing embodiments, which is not repeated here.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an embodiment of a computer readable storage medium provided by the application. The computer readable storage medium 60 is used to store a computer program 601 which, when executed by a processor, implements the face recognition method of any one of the above embodiments, which is not repeated here.
In summary, the technical scheme of the application improves security during recognition through the recognition mode that fuses mouth shape synchronization with the depth map. The model involved extracts detail features through the attention mechanism and better captures the fine features of the mouth region, which greatly improves model accuracy. Further, the lightweight network greatly reduces algorithm time consumption, which facilitates deployment on mobile devices.
The processor involved in the application may be called a CPU (Central Processing Unit). It may be an integrated circuit chip, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The storage medium used in the application includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM) or an optical disk.
The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and the drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the present application.

Claims (3)

1. A method of face recognition, the method comprising:
prompting at least two preset characters; wherein the at least two preset characters at least correspond to two mouth shapes, one mouth shape is associated with a plurality of preset characters;
acquiring a plurality of face images in the process that a target object reads at least two preset characters and depth images corresponding to the face images;
carrying out key point identification on each face image to obtain a plurality of key points;
determining key points belonging to the mouth from the key points;
determining a first mouth region image according to the key points of the mouth, and acquiring position information of the first mouth region;
determining a target area corresponding to the position information in the depth image, and determining an image corresponding to the target area as a second mouth area image;
inputting coordinate information of key points in the first mouth region image into a mouth shape classification model to obtain probability values of all mouth shapes corresponding to the first mouth region;
taking the mouth shape corresponding to the maximum probability value as a target mouth shape; wherein, the mouth shape classification model is an SVM classifier based on an SVM algorithm; and
randomly acquiring a preset number of second mouth area images;
inputting a preset number of second mouth region images into a living body recognition model to obtain a living body probability value corresponding to each second mouth region image;
averaging a plurality of the living body probability values, and determining a living body detection result based on the average living body probability value; the living body identification model comprises an attention mechanism layer and a plurality of network layers which are connected in sequence, wherein the attention mechanism layer is connected between part of adjacent network layers and is used for carrying out channel weighting on the characteristics output by the previous network layer in the adjacent network layers; the network layer comprises a plurality of convolution kernels with different sizes; a convolution kernel of a first size is used for extracting global features, a convolution kernel of a second size is used for extracting texture features, and the convolution kernel of the first size is larger than the convolution kernel of the second size;
determining that a target object corresponding to the face image is a living body in response to the target mouth shape belonging to a preset mouth shape and in response to the living body detection result being displayed as the living body; the preset requirement of the mouth shapes is that the mouth shapes corresponding to the adjacent two preset characters are inconsistent.
2. An electronic device comprising an image acquisition unit for acquiring a plurality of face images and depth images corresponding to each face image, and a memory coupled to the image acquisition unit for storing a computer program, and a processor for executing the computer program to implement the method of claim 1.
3. A computer readable storage medium for storing a computer program for implementing the method of claim 1 when executed by a processor.
CN202310291936.3A 2023-03-23 2023-03-23 Face recognition method, electronic equipment and storage medium Active CN116110111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310291936.3A CN116110111B (en) 2023-03-23 2023-03-23 Face recognition method, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN116110111A CN116110111A (en) 2023-05-12
CN116110111B 2023-09-08

Family

ID=86261824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310291936.3A Active CN116110111B (en) 2023-03-23 2023-03-23 Face recognition method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116110111B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000218A1 (en) * 2015-06-30 2017-01-05 北京旷视科技有限公司 Living-body detection method and device and computer program product
CN109670413A (en) * 2018-11-30 2019-04-23 腾讯科技(深圳)有限公司 Face living body verification method and device
CN110032970A (en) * 2019-04-11 2019-07-19 深圳市华付信息技术有限公司 Biopsy method, device, computer equipment and the storage medium of high-accuracy
CN110580454A (en) * 2019-08-21 2019-12-17 北京的卢深视科技有限公司 Living body detection method and device
CN111680588A (en) * 2020-05-26 2020-09-18 广州多益网络股份有限公司 Human face gate living body detection method based on visible light and infrared light
CN112487922A (en) * 2020-11-25 2021-03-12 奥比中光科技集团股份有限公司 Multi-mode face in-vivo detection method and system
WO2021179719A1 (en) * 2020-10-12 2021-09-16 平安科技(深圳)有限公司 Face detection method, apparatus, medium, and electronic device
WO2021219095A1 (en) * 2020-04-30 2021-11-04 华为技术有限公司 Living body detection method, and related device
CN115082993A (en) * 2022-06-27 2022-09-20 平安银行股份有限公司 Face biopsy method and device based on mouth opening action
CN115082992A (en) * 2022-06-27 2022-09-20 平安银行股份有限公司 Face living body detection method and device, electronic equipment and readable storage medium
CN115082994A (en) * 2022-06-27 2022-09-20 平安银行股份有限公司 Face living body detection method, and training method and device of living body detection network model
CN115546910A (en) * 2022-10-13 2022-12-30 平安银行股份有限公司 Living body detection method and device, electronic equipment and storage medium
WO2023024734A1 (en) * 2021-08-23 2023-03-02 支付宝(杭州)信息技术有限公司 Face-based living body detection method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7421098B2 (en) * 2002-10-07 2008-09-02 Technion Research & Development Foundation Ltd. Facial recognition and the open mouth problem


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ming Xu et al., "Mouth Shape Sequence Recognition Based on Speech Phoneme Recognition", IEEE Xplore, pp. 1-5 *

Also Published As

Publication number Publication date
CN116110111A (en) 2023-05-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant