CN115082991B - Face living body detection method and device and electronic equipment

Face living body detection method and device and electronic equipment

Info

Publication number
CN115082991B
CN115082991B (application CN202210743718.4A)
Authority
CN
China
Prior art keywords
face
facial
image
frame images
motion
Prior art date
Legal status
Active
Application number
CN202210743718.4A
Other languages
Chinese (zh)
Other versions
CN115082991A (en)
Inventor
周军
Current Assignee
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202210743718.4A
Publication of CN115082991A
Application granted
Publication of CN115082991B

Classifications

    • G06V - Image or video recognition or understanding (G Physics; G06 Computing; calculating or counting)
    • G06V40/168 - Human faces: Feature extraction; Face representation
    • G06V10/82 - Recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V40/172 - Human faces: Classification, e.g. identification
    • G06V40/45 - Spoof detection, e.g. liveness detection: Detection of the body part being alive

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a face living body detection method, a face living body detection device and electronic equipment. The method comprises: acquiring a facial action video corresponding to a face to be detected, the facial action video being a video containing a plurality of sequentially arranged facial actions; extracting a plurality of frame images from the facial action video; extracting, for each frame image, face features of the frame image in UV space according to the facial action corresponding to that frame image; and inputting the face features corresponding to the frame images into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, the preset face detection model being a model obtained by training a preset neural network on a facial action video sample set. By extracting UV-space face features from a plurality of frame images of the facial action video of the face to be detected and recognizing them, the application amplifies the differences of cut-and-pasted paper faces, improves detection accuracy, and effectively defends against paper-sticker face attacks.

Description

Face living body detection method and device and electronic equipment
Technical Field
The present application relates to the field of image technologies, and in particular, to a method and an apparatus for detecting a human face living body, and an electronic device.
Background
A cut-and-paste face attack means that an attacker prints and cuts out local face regions (forehead, nose, etc.) of the attacked person, pastes them onto his or her own face, and then performs a series of operations such as living body authentication and similarity comparison.
At present, defenses against this face attack are usually realized by constructing negative sample data of cut portrait stickers, comparing them with positive sample data, and training a CNN classifier on the positive and negative samples to discriminate. This approach discriminates from a single image and therefore uses relatively little information; if the attacker's cutting is good enough, a single image is easily misjudged and the recognition accuracy is not high.
Disclosure of Invention
The application aims to provide a face living body detection method and device and electronic equipment, which recognize UV-space face features extracted from a plurality of frame images of the facial action video of the face to be detected, thereby amplifying the differences of cut-and-pasted paper faces, improving detection accuracy and effectively defending against paper-sticker face attacks.
In a first aspect, an embodiment of the present application provides a face living body detection method, where the method includes: acquiring a facial action video corresponding to a face to be detected, the facial action video being a video containing a plurality of sequentially arranged facial actions; extracting a plurality of frame images from the facial action video; extracting face features of the frame images in UV space according to the facial action corresponding to each frame image; and inputting the face features corresponding to the frame images into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, where the preset face detection model is a model obtained by training a preset neural network on a facial action video sample set.
In a preferred embodiment of the present application, the step of extracting face features of the frame images in UV space according to the facial action corresponding to each frame image includes, for each frame image: identifying a face region image in the frame image; converting the face region image into UV space to obtain a UV face image corresponding to the frame image; converting the UV face image into a UV mask image according to the facial action corresponding to the frame image, where the UV mask image is a UV image in which the regions other than the region corresponding to the facial action are blocked; and extracting features from the UV mask image to obtain the face features of the frame image in UV space.
In a preferred embodiment of the present application, the step of converting the UV face image into the UV mask image according to the facial motion corresponding to the frame image includes: determining a target area corresponding to the facial action from the UV face image; and setting the pixel values of the areas except the target area in the UV face image as specified pixel values to obtain a UV mask image.
In a preferred embodiment of the present application, the step of determining the target region corresponding to the facial action from the UV face image includes: if the facial action corresponding to the frame image is a blink action, determining the eye region in the UV face image as the target region corresponding to the facial action; if the facial action corresponding to the frame image is a mouth-opening action, determining the mouth region in the UV face image as the target region corresponding to the facial action; if the facial action corresponding to the frame image is a nodding action or a head-shaking action, determining the nose region and a specified face region in the UV face image as the target regions corresponding to the facial action, where the specified face region is the face region other than the eye region, the mouth region and the nose region.
In a preferred embodiment of the present application, the step of extracting features from the UV mask image to obtain the face features of the frame image in UV space includes: performing feature extraction on the UV mask image by using a preset convolutional neural network to obtain the face features of the frame image in UV space; where, if the facial action corresponding to the frame image is a nodding action, the feature extraction weight ratio of the nose region to the specified face region is 3:2, and if the facial action corresponding to the frame image is a head-shaking action, the feature extraction weight ratio of the nose region to the specified face region is 2:5.
In a preferred embodiment of the present application, the step of extracting a plurality of frame images from the facial motion video includes: randomly extracting a plurality of frame images from the facial action video; or extracting a plurality of frame images from the facial motion video at preset frame intervals.
In a preferred embodiment of the present application, the training process of the preset face detection model is as follows: acquiring a facial action video sample set; the samples in the sample set include: facial features extracted based on facial action videos and labels corresponding to the facial action videos; the tag includes at least one of: nodding true, nodding false, opening mouth true, opening mouth false, blinking true, blinking false; training a preset neural network by using the facial action video sample set to obtain a preset face detection model.
In a preferred embodiment of the present application, the preset neural network includes an LSTM recurrent neural network.
In a second aspect, an embodiment of the present application further provides a device for detecting a living body of a face, where the device includes: the video acquisition module is used for acquiring facial action videos corresponding to the faces to be detected; the facial action video is a video containing a plurality of facial actions which are sequentially arranged; an image extraction module for extracting a plurality of frame images from the facial motion video; the feature extraction module is used for extracting the face features of the frame images in the UV space according to the facial actions corresponding to each frame image; the model detection module is used for inputting the face features corresponding to the frame images respectively into a preset face detection model according to the sequence of the frame images to obtain a living body detection result corresponding to the face to be detected; the preset face detection model is a model obtained by training a preset neural network based on a face action video sample set.
In a third aspect, an embodiment of the present application further provides an electronic device, including a processor and a memory, where the memory stores computer executable instructions executable by the processor, where the processor executes the computer executable instructions to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the method of the first aspect.
In the face living body detection method, device and electronic equipment provided by the embodiments of the application, the facial action video corresponding to the face to be detected is first obtained, the facial action video being a video containing a plurality of sequentially arranged facial actions; a plurality of frame images are then extracted from the facial action video, and face features of the frame images in UV space are extracted according to the facial action corresponding to each frame image; finally, the face features corresponding to the frame images are input into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, where the preset face detection model is a model obtained by training a preset neural network on a facial action video sample set. The embodiment of the application performs living body detection using UV-space face features extracted from the facial action video of the face to be detected, which captures and amplifies the differences of cut-and-pasted faces and improves the accuracy of face living body detection, thereby effectively defending against paper-sticker face attacks.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a face living body detection method according to an embodiment of the present application;
Fig. 2 is a flow chart of feature extraction in a face living body detection method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of UV conversion according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a UV mask image according to an embodiment of the present application;
fig. 5 is a block diagram of a face living body detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the present application will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the prior art, when defending against this face attack, negative sample data of cut portrait stickers are often constructed and compared with positive sample data, and a CNN classifier is trained on the positive and negative samples to discriminate. Such single-image discrimination uses limited information and is easily misjudged when the attacker's cutting is good enough, so the recognition accuracy is not high.
Based on the above, the embodiments of the application provide a face living body detection method and device and electronic equipment, which recognize UV-space face features extracted from a plurality of frame images of the facial action video of the face to be detected, thereby amplifying the differences of cut-and-pasted paper faces, improving detection accuracy and effectively defending against paper-sticker face attacks.
For the sake of understanding the present embodiment, a detailed description will be given of a face living body detection method disclosed in the embodiment of the present application.
In face living body detection scenarios, attacks through paper-sticker faces often occur, but existing recognition methods identify only from a single collected face image. When an attacker prints and cuts out local face paper of the attacked person and places it on his or her own face, face recognition is likely to pass and fraudulent operations may then be performed, causing property loss to the attacked person. In order to effectively defend against paper-sticker face attacks, the embodiment of the application provides a face living body detection method, which, as shown in fig. 1, specifically comprises the following steps:
step S102, obtaining a facial action video corresponding to a face to be detected; the face motion video is a video including a plurality of face motions arranged in sequence.
In specific implementation, a face action video of a face to be detected can be recorded through an image pickup device, wherein the face action video is a video containing a plurality of face actions which are sequentially arranged, that is, each frame image in the video corresponds to an action, and the actions can be common actions which need to be carried out by a detected person, such as mouth opening, blinking, nodding or head shaking.
Step S104, extracting a plurality of frame images from the facial action video.
In this step, frame images may be extracted randomly from all frame images of the facial action video, or extracted regularly at a preset frame interval. In a preferred embodiment, the preset frame interval is 4 frames: the FPS (frames per second) of video recorded by a mobile phone is about 30, so taking every frame would incur too much computational overhead, while sampling at an interval of 4 frames loses little image information and is roughly equivalent to taking an image every 100 ms.
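As an illustration of this sampling step, the following is a minimal sketch using OpenCV; the helper name and the video file name are placeholders for this sketch, not details given in the application.

```python
# Minimal sketch of interval-based frame sampling (helper and file name are illustrative).
import cv2

def sample_frames(video_path: str, frame_interval: int = 4):
    """Return every `frame_interval`-th frame of the video as a BGR array."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_interval == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

frames = sample_frames("face_action.mp4", frame_interval=4)  # placeholder file name
```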
Step S106, according to the facial action corresponding to each frame image, the facial features of the frame images in the UV space are extracted.
In order to amplify the differences of paper-sticker faces, the face region image in each frame image is converted into a UV face image, and face features are then extracted from the UV face image according to the facial action corresponding to that frame image for subsequent detection, which improves detection accuracy.
Step S108, respectively inputting the face features corresponding to the plurality of frame images into a preset face detection model according to the sequence of the frame images to obtain a living body detection result corresponding to the face to be detected; the preset face detection model is a model obtained by training a preset neural network based on a face action video sample set.
In a preferred manner, the preset neural network is an LSTM recurrent neural network; an ordinary RNN may also be used. Because the face features are extracted from the video stream of the facial action video and have temporal continuity, the face features corresponding to the plurality of frame images need to be input, in frame order, into a face detection model trained on an LSTM recurrent neural network for detection. The face features corresponding to the frame images are input into the face detection model one by one in temporal order to obtain the living body detection result.
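A minimal sketch of this sequential inference is given below, assuming per-frame 2048-dimensional UV-space features (described later) and an already-trained sequence model; the names `frame_features` and `liveness_lstm` are illustrative, not part of the application.

```python
# Sketch: stack per-frame UV-space features in temporal order and classify the sequence.
import torch

def detect_liveness(frame_features, liveness_lstm):
    """frame_features: list of (2048,) tensors in frame order; liveness_lstm: trained model."""
    sequence = torch.stack(frame_features).unsqueeze(0)  # (1, T, 2048)
    with torch.no_grad():
        logits = liveness_lstm(sequence)                 # (1, num_classes)
    return logits.softmax(dim=-1)                        # per-class probabilities
```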
In the face living body detection method, device and electronic equipment provided by the embodiments of the application, the facial action video corresponding to the face to be detected is first obtained, the facial action video being a video containing a plurality of sequentially arranged facial actions; a plurality of frame images are then extracted from the facial action video, and face features of the frame images in UV space are extracted according to the facial action corresponding to each frame image; finally, the face features corresponding to the frame images are input into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, where the preset face detection model is a model obtained by training a preset neural network on a facial action video sample set. The embodiment of the application performs living body detection using UV-space face features extracted from the facial action video of the face to be detected, which captures and amplifies the differences of cut-and-pasted faces and improves the accuracy of face living body detection, thereby effectively defending against paper-sticker face attacks.
The embodiment of the application also provides a human face living body detection method, which is realized on the basis of the previous embodiment, and the characteristic extraction process and the model training process are mainly described in the embodiment.
Referring to fig. 2, the process of extracting the face features of the frame images in the UV space according to the facial motion corresponding to each frame image specifically includes the following steps:
for each frame image, the following steps are performed:
Step S202, a face region image in the frame image is identified. In practice, the identification can be performed by a MediaPipe face detector.
Step S204, converting the face area image into a UV space to obtain a UV face image corresponding to the frame image.
Based on an existing UV conversion algorithm, the extracted face region image is converted into UV space coordinates to obtain a UV texture map of size 256×256. Referring to fig. 3, (a) shows the face region image cut out from the frame image by the MediaPipe face detector; (b) shows the UV map; and (c) shows the UV face image obtained by UV conversion of the face region image, i.e. the UV texture map.
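The following sketch illustrates the crop step with the MediaPipe face-detection API; the `to_uv_texture` callable stands in for whichever UV conversion algorithm (for example a PRNet-style UV position map) is actually used, which this description does not specify.

```python
# Sketch: MediaPipe face crop followed by an assumed UV-conversion callable.
import cv2
import mediapipe as mp

face_detector = mp.solutions.face_detection.FaceDetection(
    model_selection=0, min_detection_confidence=0.5)

def crop_face(frame_bgr):
    """Detect the face with MediaPipe and return the cropped face region (or None)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = face_detector.process(rgb)
    if not result.detections:
        return None
    box = result.detections[0].location_data.relative_bounding_box
    h, w = frame_bgr.shape[:2]
    x0, y0 = max(int(box.xmin * w), 0), max(int(box.ymin * h), 0)
    x1, y1 = x0 + int(box.width * w), y0 + int(box.height * h)
    return frame_bgr[y0:y1, x0:x1]

def frame_to_uv(frame_bgr, to_uv_texture):
    """to_uv_texture: assumed callable mapping a face crop to a 256x256 UV texture map."""
    face = crop_face(frame_bgr)
    return None if face is None else to_uv_texture(face, size=256)
```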
Step S206, converting the UV face image into a UV mask image according to the facial motion corresponding to the frame image.
The UV mask image is a UV image in which the regions other than the region corresponding to the facial action are blocked; that is, in the UV mask image only the region corresponding to the facial action is shown, and the other regions are set to a uniform color by masking (for example, set to a specified pixel value), so that the face features of the region corresponding to the facial action are well represented in the UV mask image.
The specific implementation is realized by the following steps:
(1) And determining a target area corresponding to the facial action from the UV face image.
The following different target areas can be determined according to different facial actions, see fig. 4:
If the facial motion corresponding to the frame image is a blinking motion, determining an eye region (such as a dark black region where eyes are positioned in the figure) in the UV face image as a target region corresponding to the facial motion;
if the facial motion corresponding to the frame image is a mouth opening motion, determining a mouth region (such as a light gray region where a mouth is positioned in the figure) in the UV face image as a target region corresponding to the facial motion;
If the facial action corresponding to the frame image is a nodding action or a head-shaking action, the nose region (e.g., the region where the nose is located in the figure) and a specified face region (the face region other than the eye region, mouth region and nose region) in the UV face image are determined as the target regions corresponding to the facial action.
(2) And setting the pixel values of the areas except the target area in the UV face image as specified pixel values to obtain a UV mask image.
For example, if the facial action corresponding to the frame image is a blink action and the target region is the dark black region where the eyes are located in the figure, then the pixel values of all regions except that dark black region are set to 0, i.e. displayed as black; similarly, if the facial action corresponding to the frame image is a mouth-opening action and the target region is the light gray region where the mouth is located in the figure, then the pixel values of all regions except that light gray region are set to 0, i.e. displayed as black; nodding and head shaking follow the same principle and are not repeated here.
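A minimal sketch of this masking step is given below. The region boxes on the 256×256 UV texture are illustrative coordinates, not values given in this description, and the action names are placeholders.

```python
# Sketch: keep only the target region(s) for the prompted action; zero out the rest.
import numpy as np

# Illustrative (y0, y1, x0, x1) boxes on the 256x256 UV texture (assumed layout).
UV_REGIONS = {
    "eyes":  (70, 110, 40, 216),
    "mouth": (170, 220, 80, 176),
    "nose":  (100, 170, 100, 156),
}

def _boxes_mask(shape, names):
    m = np.zeros(shape[:2], dtype=bool)
    for name in names:
        y0, y1, x0, x1 = UV_REGIONS[name]
        m[y0:y1, x0:x1] = True
    return m

def uv_mask(uv_face: np.ndarray, action: str, fill_value: int = 0) -> np.ndarray:
    """Return a UV mask image: target region(s) kept, everything else set to fill_value."""
    if action == "blink":
        keep = _boxes_mask(uv_face.shape, ["eyes"])
    elif action == "open_mouth":
        keep = _boxes_mask(uv_face.shape, ["mouth"])
    else:  # "nod" or "shake": nose region plus the face outside eyes/mouth/nose
        keep = ~_boxes_mask(uv_face.shape, ["eyes", "mouth"])
    out = np.full_like(uv_face, fill_value)
    out[keep] = uv_face[keep]
    return out
```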
And step S208, extracting features of the UV mask image to obtain face features of the frame image in the UV space.
In a specific implementation, feature extraction may be performed on the UV mask image by using a preset convolutional neural network to obtain the face features of the frame image in UV space; where, if the facial action corresponding to the frame image is a nodding action, the feature extraction weight ratio of the nose region to the specified face region is 3:2, and if the facial action corresponding to the frame image is a head-shaking action, the feature extraction weight ratio of the nose region to the specified face region is 2:5.
The preset convolutional neural network may be DenseNet-201; the network is weight-trained using common positive- and negative-example UV-space face images, so that feature extraction is more accurate.
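As a hedged illustration, the sketch below uses torchvision's DenseNet-201 as the per-frame extractor. Note that the last feature map of torchvision's densenet201 has 1920 channels, so the 2048-dimensional feature mentioned later in this description presumably comes from a customized backbone or head; applying the 3:2 / 2:5 weighting by scaling the input regions is one possible reading of the weight ratio, not a confirmed detail, and the nose box is illustrative.

```python
# Sketch: DenseNet feature extraction from a UV mask image with region weighting
# applied by scaling pixel regions (assumed interpretation of the weight ratio).
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T

backbone = models.densenet201(weights=None)  # load fine-tuned weights in practice
backbone.eval()
_to_tensor = T.ToTensor()

def extract_uv_feature(uv_mask_img, nose_box, nose_weight, face_weight):
    """Return a pooled feature vector for one 256x256 UV mask image (HWC uint8)."""
    x = _to_tensor(uv_mask_img) * face_weight           # (3, 256, 256)
    y0, y1, x0, x1 = nose_box
    x[:, y0:y1, x0:x1] *= nose_weight / face_weight     # nose region ends up scaled by nose_weight
    with torch.no_grad():
        fmap = backbone.features(x.unsqueeze(0))        # (1, 1920, 8, 8)
        feat = torch.flatten(F.adaptive_avg_pool2d(fmap, 1), 1)
    return feat.squeeze(0)                              # 1920-dim here; 2048-dim in the described setup

# Usage: nodding uses nose:face = 3:2, head shaking uses 2:5 (nose box is illustrative).
dummy_uv = np.zeros((256, 256, 3), dtype=np.uint8)
feat = extract_uv_feature(dummy_uv, nose_box=(100, 170, 100, 156), nose_weight=3, face_weight=2)
```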
The specific training process is as follows: a single-image classification task is trained with 8 categories, namely nodding true, nodding false, head-shaking true, head-shaking false, mouth-opening true, mouth-opening false, blinking true and blinking false, and the classification task uses a cross-entropy loss. Because the distinction between single images is not obvious and directly judging with a classification model gives a high misrecognition rate, the embodiment of the application uses label smoothing as auxiliary training, which makes the feature extraction more accurate.
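A minimal sketch of such an 8-way training step with label smoothing follows; the smoothing factor 0.1 and the learning rate are assumed values, and the DenseNet backbone and linear head mirror the sketch above rather than a confirmed architecture.

```python
# Sketch: single-image 8-class training with cross-entropy + label smoothing (assumed factor).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

NUM_CLASSES = 8   # {nod, shake, open-mouth, blink} x {true, false}
backbone = models.densenet201(weights=None)
head = nn.Linear(1920, NUM_CLASSES)                     # 1920 for torchvision's densenet201
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)    # label smoothing auxiliary training
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)

def train_step(uv_batch: torch.Tensor, labels: torch.Tensor) -> float:
    """One step on a batch of UV mask images (B, 3, 256, 256) and 8-class labels (B,)."""
    optimizer.zero_grad()
    fmap = backbone.features(uv_batch)                        # (B, 1920, 8, 8)
    feats = torch.flatten(F.adaptive_avg_pool2d(fmap, 1), 1)  # (B, 1920)
    loss = criterion(head(feats), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```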
In this step, the last feature map of the DenseNet is extracted and flattened to obtain a 2048-dimensional feature; i.e. each frame image corresponds to a 2048-dimensional feature.
The training process of the preset face detection model is as follows:
(1) Acquiring a facial action video sample set; the samples in the sample set include: facial features extracted based on facial action videos and labels corresponding to the facial action videos; the tag includes at least one of: nodding true, nodding false, opening mouth true, opening mouth false, blinking true, blinking false;
(2) And training the LSTM time cyclic neural network by using the facial action video sample set to obtain a preset face detection model.
In the embodiment of the present application, the structural parameters of the preset face detection model are as follows:
The input feature length is 2048, the hidden-layer feature size is 1024, the number of recurrent layers is 3, and the output has 8 categories (nodding true, nodding false, head-shaking true, head-shaking false, mouth-opening true, mouth-opening false, blinking true, blinking false).
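A minimal PyTorch sketch matching these stated structural parameters is shown below; classifying from the last time step is an assumption of the sketch, not a detail given in this description.

```python
# Sketch: LSTM sequence classifier with input 2048, hidden 1024, 3 layers, 8 classes.
import torch
import torch.nn as nn

class LivenessLSTM(nn.Module):
    def __init__(self, input_dim=2048, hidden_dim=1024, num_layers=3, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len, 2048)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])     # classify from the last time step (assumed)

model = LivenessLSTM()
logits = model(torch.randn(1, 25, 2048))  # e.g. 25 sampled frames -> (1, 8)
```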
The face living body detection method provided by the embodiment of the application can be embedded seamlessly into existing action-based living body recognition without extra investment. The false acceptance rate (FAR) of this scheme is greatly improved compared with the previous common CNN scheme; in tests, FAR < 0.001 and FRR (false rejection rate) < 0.01. By extracting UV-space face features from a plurality of frame images of the facial action video of the face to be detected and recognizing them, the embodiment of the application amplifies the differences of cut-and-pasted faces, improves detection accuracy and effectively defends against paper-sticker face attacks.
Based on the above method embodiment, the embodiment of the present application further provides a device for detecting a human face living body, as shown in fig. 5, where the device includes:
The video acquisition module 52 is configured to acquire a facial action video corresponding to the face to be detected, the facial action video being a video containing a plurality of sequentially arranged facial actions; the image extraction module 54 is configured to extract a plurality of frame images from the facial action video; the feature extraction module 56 is configured to extract face features of the frame images in UV space according to the facial action corresponding to each frame image; and the model detection module 58 is configured to input the face features corresponding to the frame images into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, where the preset face detection model is an LSTM model obtained by training an LSTM recurrent neural network on a facial action video sample set.
In the face living body detection device provided by the embodiment of the application, the video acquisition module obtains the facial action video corresponding to the face to be detected, the facial action video being a video containing a plurality of sequentially arranged facial actions; the image extraction module then extracts a plurality of frame images from the facial action video, and the feature extraction module extracts face features of the frame images in UV space according to the facial action corresponding to each frame image; the model detection module inputs the face features corresponding to the frame images into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, where the preset face detection model is a model obtained by training a preset neural network on a facial action video sample set. The embodiment of the application performs living body detection using UV-space face features extracted from the facial action video of the face to be detected, which captures and amplifies the differences of cut-and-pasted faces and improves the accuracy of face living body detection, thereby effectively defending against paper-sticker face attacks.
In a preferred embodiment of the present application, the feature extraction module 56 is configured to perform the following steps for each frame image: identifying a face region image in the frame image; converting the face region image into UV space to obtain a UV face image corresponding to the frame image; converting the UV face image into a UV mask image according to the facial action corresponding to the frame image, where the UV mask image is a UV image in which the regions other than the region corresponding to the facial action are blocked; and extracting features from the UV mask image to obtain the face features of the frame image in UV space.
In a preferred embodiment of the present application, the feature extraction module 56 is further configured to determine a target area corresponding to the facial action from the UV face image; and setting the pixel values of the areas except the target area in the UV face image as specified pixel values to obtain a UV mask image.
In a preferred embodiment of the present application, the feature extraction module 56 is further configured to: if the facial action corresponding to the frame image is a blink action, determine the eye region in the UV face image as the target region corresponding to the facial action; if the facial action corresponding to the frame image is a mouth-opening action, determine the mouth region in the UV face image as the target region corresponding to the facial action; if the facial action corresponding to the frame image is a nodding action or a head-shaking action, determine the nose region and a specified face region in the UV face image as the target regions corresponding to the facial action, where the specified face region is the face region other than the eye region, the mouth region and the nose region.
In a preferred embodiment of the present application, the feature extraction module 56 is further configured to perform feature extraction on the UV mask image by using a preset convolutional neural network to obtain the face features of the frame image in UV space; where, if the facial action corresponding to the frame image is a nodding action, the feature extraction weight ratio of the nose region to the specified face region is 3:2, and if the facial action corresponding to the frame image is a head-shaking action, the feature extraction weight ratio of the nose region to the specified face region is 2:5.
In a preferred embodiment of the present application, the image extraction module 54 is further configured to randomly extract a plurality of frame images from the facial motion video; or extracting a plurality of frame images from the facial motion video at preset frame intervals.
In a preferred embodiment of the present application, the apparatus further comprises a model training module configured to perform the following training process: acquiring a facial action video sample set, where the samples in the sample set include face features extracted from facial action videos and labels corresponding to the facial action videos, and the label includes at least one of: nodding true, nodding false, opening mouth true, opening mouth false, blinking true, blinking false; and training the LSTM recurrent neural network using the facial action video sample set to obtain the preset face detection model.
In a preferred embodiment of the present application, the preset neural network includes an LSTM recurrent neural network.
The device provided by the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brief description, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.
The embodiment of the present application further provides an electronic device, as shown in fig. 6, which is a schematic structural diagram of the electronic device, where the electronic device includes a processor 61 and a memory 60, the memory 60 stores computer executable instructions that can be executed by the processor 61, and the processor 61 executes the computer executable instructions to implement the following method:
Acquiring a facial action video corresponding to a face to be detected, the facial action video being a video containing a plurality of sequentially arranged facial actions; extracting a plurality of frame images from the facial action video; extracting face features of the frame images in UV space according to the facial action corresponding to each frame image; and inputting the face features corresponding to the frame images into a preset face detection model in frame order to obtain a living body detection result corresponding to the face to be detected, where the preset face detection model is a model obtained by training a preset neural network on a facial action video sample set.
In the electronic equipment provided by the embodiment of the application, UV-space face features extracted from the facial action video of the face to be detected are used for living body detection, so that the differences of cut-and-pasted faces can be captured and amplified and the accuracy of face living body detection is improved, thereby effectively defending against paper-sticker face attacks.
The electronic device is further configured to perform other method steps in the method embodiments described above, which are not described herein.
In the embodiment shown in fig. 6, the electronic device further comprises a bus 62 and a communication interface 63, wherein the processor 61, the communication interface 63 and the memory 60 are connected by means of the bus 62.
The memory 60 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is achieved via at least one communication interface 63 (wired or wireless), which may use the Internet, a wide area network, a local area network, a metropolitan area network, etc. The bus 62 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 62 may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bi-directional arrow is shown in FIG. 6, but this does not mean that there is only one bus or one type of bus.
The processor 61 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 61 or by instructions in the form of software. The processor 61 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor 61 reads the information in the memory and, in combination with its hardware, performs the steps of the method of the previous embodiments.
The embodiment of the application also provides a computer readable storage medium, which stores computer executable instructions that, when being called and executed by a processor, cause the processor to implement the above method, and the specific implementation can refer to the foregoing method embodiment and will not be described herein.
The computer program product of the method, device and electronic equipment provided in the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and for specific implementation reference may be made to the method embodiments, which will not be repeated here.
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In the description of the present application, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above examples are only specific embodiments of the present application, and are not intended to limit the scope of the present application, but it should be understood by those skilled in the art that the present application is not limited thereto, and that the present application is described in detail with reference to the foregoing examples: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method for detecting a human face in vivo, the method comprising:
Acquiring a facial action video corresponding to a face to be detected; the face action video is a video containing a plurality of face actions which are sequentially arranged;
extracting a plurality of frame images from the facial motion video;
extracting face features of the frame images in a UV space according to the face actions corresponding to each frame image;
The face features corresponding to the frame images are input to a preset face detection model according to the sequence of the frame images, and a living body detection result corresponding to the face to be detected is obtained; the preset face detection model is a model obtained by training a preset neural network based on a face action video sample set.
2. The method according to claim 1, wherein the step of extracting facial features of the frame images in UV space according to facial motion corresponding to each of the frame images comprises:
for each of the frame images, the following steps are performed:
identifying a face region image in the frame image;
converting the face region image into a UV space to obtain a UV face image corresponding to the frame image;
Converting the UV face image into a UV mask image according to the facial action corresponding to the frame image; the UV mask image is a UV image in which the regions other than the region corresponding to the facial action are blocked;
And extracting the features of the UV mask image to obtain the face features of the frame image in the UV space.
3. The method of claim 2, wherein the step of converting the UV face image into a UV mask image according to the facial motion corresponding to the frame image comprises:
Determining a target area corresponding to the facial action from the UV face image;
And setting the pixel values of the areas except the target area in the UV face image as specified pixel values to obtain a UV mask image.
4. A method according to claim 3, wherein the step of determining a target area corresponding to the facial action from the UV face image comprises:
if the face action corresponding to the frame image is a blink action, determining an eye area in the UV face image as a target area corresponding to the face action;
if the facial motion corresponding to the frame image is a mouth opening motion, determining a mouth region in the UV face image as a target region corresponding to the facial motion;
if the facial action corresponding to the frame image is a nodding action or a head-shaking action, determining a nose region and a specified face region in the UV face image as target regions corresponding to the facial action; the specified face region is a face region other than the eye region, the mouth region, and the nose region.
5. The method of claim 4, wherein the step of extracting features from the UV mask image to obtain facial features of the frame image in UV space comprises:
Performing feature extraction on the UV mask image by using a preset convolutional neural network to obtain face features of the frame image in a UV space; wherein if the face motion corresponding to the frame image is a nodding motion, the feature extraction weight ratio of the nose region and the specified face region is 3:2; if the facial motion corresponding to the frame image is a head shaking motion, the feature extraction weight ratio of the nose area and the appointed facial area is 2:5.
6. The method of claim 1, wherein the step of extracting a plurality of frame images from the facial motion video comprises:
Randomly extracting a plurality of frame images from the facial action video;
or extracting a plurality of frame images from the facial action video according to a preset frame interval.
7. The method according to claim 1, wherein the training process of the preset face detection model is as follows:
acquiring a facial action video sample set; the samples in the sample set include: facial features extracted based on facial action videos and labels corresponding to the facial action videos; the tag includes at least one of: nodding true, nodding false, opening mouth true, opening mouth false, blinking true, blinking false;
And training the preset neural network by using the facial action video sample set to obtain a preset face detection model.
8. The method of claim 1, wherein the preset neural network comprises: an LSTM recurrent neural network.
9. A human face living body detection apparatus, characterized by comprising:
The video acquisition module is used for acquiring facial action videos corresponding to the faces to be detected; the face action video is a video containing a plurality of face actions which are sequentially arranged;
an image extraction module for extracting a plurality of frame images from the facial motion video;
the feature extraction module is used for extracting face features of the frame images in the UV space according to the face actions corresponding to each frame image;
The model detection module is used for inputting the face features corresponding to the frame images into a preset face detection model according to the sequence of the frame images to obtain a living body detection result corresponding to the face to be detected; the preset face detection model is a model obtained by training a preset neural network based on a face action video sample set.
10. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 8.
11. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
CN202210743718.4A 2022-06-27 2022-06-27 Face living body detection method and device and electronic equipment Active CN115082991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743718.4A CN115082991B (en) 2022-06-27 2022-06-27 Face living body detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210743718.4A CN115082991B (en) 2022-06-27 2022-06-27 Face living body detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115082991A CN115082991A (en) 2022-09-20
CN115082991B (en) 2024-07-02

Family

ID=83255781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743718.4A Active CN115082991B (en) 2022-06-27 2022-06-27 Face living body detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115082991B (en)

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9554744B2 (en) * 2013-12-19 2017-01-31 International Business Machines Corporation Mining social media for ultraviolet light exposure analysis
CN109508678B (en) * 2018-11-16 2021-03-30 广州市百果园信息技术有限公司 Training method of face detection model, and detection method and device of face key points
CN110222573B (en) * 2019-05-07 2024-05-28 平安科技(深圳)有限公司 Face recognition method, device, computer equipment and storage medium
CN114041175A (en) * 2019-05-13 2022-02-11 华为技术有限公司 Neural network for estimating head pose and gaze using photorealistic synthetic data
CN111953922B (en) * 2019-05-16 2022-05-27 南宁富联富桂精密工业有限公司 Face identification method for video conference, server and computer readable storage medium
CN113051982B (en) * 2019-12-28 2024-04-02 浙江宇视科技有限公司 Face living body detection method, device, equipment and storage medium
KR102422822B1 (en) * 2020-05-29 2022-07-18 연세대학교 산학협력단 Apparatus and method for synthesizing 3d face image using competitive learning
CN111950637B (en) * 2020-08-14 2024-05-03 厦门美图宜肤科技有限公司 Ultraviolet detection method, device, skin detector and readable storage medium
CN112002014B (en) * 2020-08-31 2023-12-15 中国科学院自动化研究所 Fine structure-oriented three-dimensional face reconstruction method, system and device
CN112528902B (en) * 2020-12-17 2022-05-24 四川大学 Video monitoring dynamic face recognition method and device based on 3D face model
CN112734910B (en) * 2021-01-05 2024-07-26 厦门美图之家科技有限公司 Real-time human face three-dimensional image reconstruction method and device based on RGB single image and electronic equipment
CN113076918B (en) * 2021-04-15 2022-09-06 河北工业大学 Video-based facial expression cloning method
CN113723280B (en) * 2021-08-30 2024-09-27 平安银行股份有限公司 Method, device, equipment and medium for detecting countermeasure sample based on static face
CN113887408B (en) * 2021-09-30 2024-04-23 平安银行股份有限公司 Method, device, equipment and storage medium for detecting activated face video
CN114005169B (en) * 2021-12-31 2022-03-22 中科视语(北京)科技有限公司 Face key point detection method and device, electronic equipment and storage medium
CN114387548A (en) * 2021-12-31 2022-04-22 北京旷视科技有限公司 Video and liveness detection method, system, device, storage medium and program product

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022089360A1 (en) * 2020-10-28 2022-05-05 广州虎牙科技有限公司 Face detection neural network and training method, face detection method, and storage medium

Also Published As

Publication number Publication date
CN115082991A (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant