CN113657245A - Method, device, medium and program product for human face living body detection

Method, device, medium and program product for human face living body detection

Info

Publication number
CN113657245A
Authority
CN
China
Prior art keywords
face
target
video frame
living body
image area
Prior art date
Legal status
Granted
Application number
CN202110929761.5A
Other languages
Chinese (zh)
Other versions
CN113657245B (en)
Inventor
于珂珂
李生金
侯晓辉
Current Assignee
Hiscene Information Technology Co Ltd
Original Assignee
Hiscene Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hiscene Information Technology Co Ltd filed Critical Hiscene Information Technology Co Ltd
Priority to CN202110929761.5A priority Critical patent/CN113657245B/en
Publication of CN113657245A publication Critical patent/CN113657245A/en
Application granted granted Critical
Publication of CN113657245B publication Critical patent/CN113657245B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

An object of the present application is to provide a method, an apparatus, a medium, and a program product for face living body detection, the method including: acquiring a target face frame image area in a current video frame; performing central difference convolution processing on the target face frame image area to obtain a first feature map; sequentially performing feature extraction processing on the first feature map through a plurality of convolution modules, and fusing a second feature map output by the last feature extraction processing with third feature maps output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer; and determining a face living body detection result corresponding to the target face frame image area in the current video frame according to the target feature map.

Description

Method, device, medium and program product for human face living body detection
Technical Field
The application relates to the field of communication, in particular to a technology for detecting a human face living body.
Background
With the development of the times, face recognition has been widely applied in many interactive artificial intelligence systems because of its convenience. However, merely presenting a printed image or a video to the biometric sensor may fool a face recognition system; typical examples of such presentation attacks are printed pictures, video replay and 3D masks. To ensure reliable use of a face recognition system, face living body detection is an important means of detecting such presentation attacks. In the face recognition process in the prior art, in order to prevent face forgery attacks, face living body detection is first performed before face recognition to judge whether the input face is a real face, and then whether to proceed with further face recognition is determined.
Disclosure of Invention
An object of the present application is to provide a method, an apparatus, a medium, and a program product for face liveness detection.
According to an aspect of the present application, there is provided a method for live human face detection, the method including:
acquiring an image area of a target face frame in a current video frame;
carrying out central difference convolution processing on the target face frame image area to obtain a first feature map;
sequentially performing feature extraction processing on the first feature map through a plurality of convolution modules, and fusing a second feature map output by the last feature extraction processing and a third feature map output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer;
and determining a human face living body detection result corresponding to the image area of the target human face frame in the current video frame according to the target feature map.
According to another aspect of the present application, there is provided a method for live human face detection, the method comprising:
acquiring an image area of a target face frame in a current video frame;
inputting the image area of the target face frame into a face living body detection network, and outputting a face living body detection result corresponding to the image area of the target face frame in the current video frame, wherein the face living body detection network performs central difference convolution processing on the image area of the target face frame to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing and a third feature map output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and the face living body detection result is determined according to the target feature map.
According to another aspect of the present application, there is provided a method for live human face detection, the method comprising:
inputting an obtained video frame into a face living body detection network, and outputting a face living body detection result corresponding to a target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing with third feature maps output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and the face living body detection result corresponding to the target face frame image area in the video frame is determined according to the target feature map.
According to an aspect of the present application, there is provided a first apparatus for living human face detection, the apparatus comprising:
a one-one module, configured to acquire a target face frame image area in a current video frame;
a one-two module, configured to perform central difference convolution processing on the target face frame image area to obtain a first feature map;
a one-three module, configured to sequentially perform feature extraction processing on the first feature map through a plurality of convolution modules, and fuse a second feature map output by the last feature extraction processing with third feature maps output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer;
and a one-four module, configured to determine a face living body detection result corresponding to the target face frame image area in the current video frame according to the target feature map.
According to another aspect of the present application, there is provided a first apparatus for living human face detection, the apparatus comprising:
a two-one module, configured to acquire a target face frame image area in a current video frame;
and a two-two module, configured to input the target face frame image area into a face living body detection network and output a face living body detection result corresponding to the target face frame image area in the current video frame, wherein the face living body detection network performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing with third feature maps output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result according to the target feature map.
According to another aspect of the present application, there is provided a first apparatus for living human face detection, the apparatus comprising:
a three-one module, configured to input an obtained video frame into a face living body detection network and output a face living body detection result corresponding to a target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing with third feature maps output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and the face living body detection result corresponding to the target face frame image area in the video frame is determined according to the target feature map.
According to an aspect of the present application, there is provided a computer device for face liveness detection, comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the operations of any of the methods described above.
According to an aspect of the application, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the operations of any of the methods described above.
According to an aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps of any of the methods as described above.
Compared with the prior art, the present application constructs a lightweight face living body detection network suitable for a mobile terminal to realize face living body detection, improves the accuracy of face living body detection by introducing techniques such as central difference convolution and fusion of high-layer and low-layer network features into the network, and reduces the amount of network computation by, among other measures, reducing the image input to face living body detection.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for live human face detection, according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram of a method for live human face detection, according to an embodiment of the present application;
FIG. 3 illustrates a flow diagram of a method for live human face detection, according to an embodiment of the present application;
FIG. 4 shows a block diagram of a first apparatus for live face detection according to an embodiment of the present application;
FIG. 5 shows a block diagram of a first apparatus for live face detection according to an embodiment of the present application;
FIG. 6 shows a block diagram of a first apparatus for live face detection according to an embodiment of the present application;
FIG. 7 shows a convolution module structure in MobileNet V2;
FIG. 8 shows another convolution module structure in MobileNet V2;
FIG. 9 illustrates a schematic diagram of a human face liveness detection network, according to an embodiment of the present application;
FIG. 10 shows a convolution block structure in ShuffleNet V2;
FIG. 11 shows another convolution block structure in ShuffleNet V2;
FIG. 12 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory, random access memory (RAM) and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PCM), programmable random access memory (PRAM), static random-access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The device referred to in the present application includes, but is not limited to, a terminal, a network device, or a device formed by integrating a terminal and a network device through a network. The terminal includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user, such as a smartphone, a tablet computer or smart glasses, and the mobile electronic product may employ any operating system, such as the Android operating system or the iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers or a cloud of multiple servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, a kind of distributed computing in which a virtual supercomputer consists of a collection of loosely coupled computers. The network includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network, and the like. Preferably, the device may also be a program running on the terminal, the network device, or a device formed by integrating the terminal and the network device, the touch terminal, or the network device and the touch terminal through a network.
Of course, those skilled in the art will appreciate that the foregoing is by way of example only, and that other existing or future devices, which may be suitable for use in the present application, are also encompassed within the scope of the present application and are hereby incorporated by reference.
In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Fig. 1 shows a flowchart of a method for face living body detection according to an embodiment of the present application, the method including step S11, step S12, step S13 and step S14. In step S11, the first device acquires a target face frame image area in the current video frame; in step S12, the first device performs central difference convolution processing on the target face frame image area to obtain a first feature map; in step S13, the first device sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing with third feature maps output by other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer; in step S14, the first device determines, according to the target feature map, a face living body detection result corresponding to the target face frame image area in the current video frame.
In step S11, the first device acquires a target face frame image area in the current video frame, wherein one or more target face frame image areas may be present in the current video frame. In some embodiments, the first device may be a user equipment or, alternatively, a network device. For example, the user equipment captures a video stream with its own camera or an external camera device and obtains the target face frame image area from the current video frame of the video stream; alternatively, the user equipment sends the captured video stream to the network device, and the network device obtains the target face frame image area from the current video frame of the video stream. In some embodiments, face coordinate information in the current video frame, such as the coordinates of the circumscribed rectangle of a face in the current video frame, is obtained based on a face detection algorithm, and the corresponding target face frame image area is then obtained from the current video frame according to the face coordinate information, where the face detection algorithm may be any face detection algorithm in the prior art.
In step S12, the first device performs central difference convolution processing on the target face frame image area to obtain a first feature map. In some embodiments, central difference convolution (CDC) can capture intrinsic detail information of a face image by aggregating both intensity and gradient information. Briefly, central difference convolution adds a central difference operation to standard convolution: standard convolution mainly comprises the two steps of sampling and aggregation, and central difference convolution inserts a central difference step between them. Taking a 3x3 convolution kernel as an example, the kernel is slid over the feature map; before aggregation (i.e., the dot product with the convolution kernel weights), the 9 pixel points in the region covered by the 3x3 kernel are sampled, the center point among the 9 pixel points is taken, the pixel value of the center point is then subtracted from each of the 9 pixel points (including the center point itself) to obtain 9 updated pixel values after the central difference, and finally the 9 updated values are aggregated with the convolution kernel weights by dot product to obtain the final output value. In some embodiments, the central difference convolution processing formula is as follows:
y(p_0) = \sum_{p_n \in R} w(p_n) \cdot ( x(p_0 + p_n) - x(p_0) )

wherein x and y are the input feature map and the output feature map respectively, p_0 is the current position (i.e., the center point of the image region covered by the convolution kernel in the input feature map), p_n is a position in the neighborhood of p_0 (i.e., one of the remaining pixel points adjacent to the center point within the region covered by the convolution kernel), w denotes the convolution kernel weights and R denotes the local receptive field of the kernel. In some embodiments, the central difference convolution processing is performed on the target face frame image area to obtain the first feature map corresponding to the target face frame image area; the central difference convolution is more robust than an ordinary convolution and can improve the detection accuracy of a living face. For example, the convolution kernel corresponding to the central difference convolution has a size of 3x3 and a stride of 2, with 32 convolution kernels in total; as another example, the convolution kernel has a size of 5x5 and a stride of 1, with 32 convolution kernels in total; the convolution parameters of the central difference convolution are merely examples and are not limited here. In some embodiments, performing the central difference convolution processing only on the target face frame image area, rather than on the entire current video frame, reduces the computation time and power consumption on a mobile terminal, so as to meet the lightweight requirement of the mobile terminal.
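As one possible illustration, a minimal PyTorch sketch of a central difference convolution layer consistent with the formula above could look as follows: the standard aggregation is computed with an ordinary convolution, and the central difference term is obtained by convolving the input with the per-channel sum of the kernel weights. The class name and the default parameters (3x3 kernel, stride 2, matching the 32-kernel example above) are assumptions for illustration, not taken from the patent text.

```python
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Standard convolution with the center pixel subtracted from every sampled pixel."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=2, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)

    def forward(self, x):
        # Standard aggregation: sum_n w(p_n) * x(p_0 + p_n)
        out_normal = self.conv(x)
        # Subtracting x(p_0) from every sampled pixel is equivalent to
        # subtracting (sum of the kernel weights) * x(p_0).
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # [out_ch, in_ch, 1, 1]
        out_center = F.conv2d(x, kernel_sum, stride=self.conv.stride, padding=0)
        return out_normal - out_center
```

As a quick shape check under these assumed parameters, a 3 x 64 x 64 face ROI yields a 32 x 32 x 32 first feature map, matching the example sizes quoted later in this description.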
In step S13, the first device sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer. In some embodiments, the feature extraction processing is performed on the first feature map sequentially by a plurality of convolution modules, each of which includes at least one depth separable convolution layer; for example, the first feature map is input into the first convolution module for feature extraction processing, the feature map output by the first convolution module is then input into the second convolution module for feature extraction processing, and so on, so that the first feature map passes through the plurality of convolution modules in sequence, the output of each convolution module serving as the input of the next. In some embodiments, a depth separable convolution is an improvement over ordinary convolution that splits the original ordinary convolution into two parts, a channel-by-channel convolution and a point-by-point convolution. In some embodiments, the Block module in the MobileNetV2 model is used as the convolution module, and the first feature map output by the central difference convolution is passed through a plurality of Blocks in sequence; the Block adopts an inverted residual structure, namely expansion, feature extraction and compression, which reduces feature loss compared with the ordinary residual structure. For example, the first feature map is passed through 4 Blocks in sequence: the first feature map is input into Block1, the feature map output by Block1 is input into Block2, the feature map output by Block2 is input into Block3, and the feature map output by Block3 is input into Block4, wherein Block1 adopts the structure shown in Fig. 7, whose output size is the same as its input size, and Block2, Block3 and Block4 adopt the structure shown in Fig. 8, whose output size is half of its input size. The number and specific structure of the convolution modules (Blocks) here are only examples: the convolution modules may be of other numbers, may adopt only the structure shown in Fig. 7, only the structure shown in Fig. 8, a combination of the structures shown in Fig. 7 and Fig. 8, or other Block structures, without limitation. In some embodiments, fusing the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain the target feature map may be implemented as follows: the second feature map output by the last feature extraction processing is subjected to ordinary convolution to obtain a first convolution result, the third feature maps output by certain feature extraction processings other than the last one are subjected to ordinary convolution to obtain a second convolution result, and the first convolution result and the second convolution result are then fused to obtain the target feature map.
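As one hedged illustration, a MobileNetV2-style Block built around a depth separable convolution might be sketched as follows: a 1x1 expansion, a 3x3 channel-by-channel (depthwise) convolution, and a 1x1 point-by-point compression, with a skip connection when input and output sizes match (as in Fig. 7) and a stride of 2 when the spatial size is halved (as in Fig. 8). The expansion ratio and exact layer ordering are assumptions, not prescribed by the text.

```python
import torch.nn as nn

class InvertedResidualBlock(nn.Module):
    """Inverted residual block: expansion -> depthwise conv -> pointwise compression."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),             # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),                 # channel-by-channel (depthwise)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),             # point-by-point compression
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```

The grouped 3x3 convolution (groups equal to the number of channels) is what makes the middle stage channel-by-channel, which is where most of the computational saving of a depth separable convolution comes from.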
In step S14, the first device determines, according to the target feature map, a face living body detection result corresponding to the target face frame image area in the current video frame. In some embodiments, the fused target feature map is pooled and classified: for example, the fused target feature map is input into a pooling layer, the feature map output by the pooling layer is input into a classification loss function, and a face living body classification score corresponding to the target face frame image area in the current video frame (for example, a probability value that the face corresponding to the target face frame image area in the current video frame is a living body) is output; the face living body detection result is then determined based on the face living body classification score, that is, whether the face corresponding to the target face frame image area in the current video frame is a living body is determined. For example, if the face living body classification score is greater than or equal to a preset threshold, it is determined that the face corresponding to the target face frame image area in the current video frame is not a living body, and if the face living body classification score is less than the preset threshold, it is determined that the face corresponding to the target face frame image area in the current video frame is a living body. In some embodiments, the present solution constructs a lightweight face living body detection network suitable for a mobile terminal, where the face living body detection network performs the aforementioned steps S12, S13 and S14. The input of the face living body detection network is the target face frame image area obtained from the current video frame; the output may directly be the face living body detection result, that is, whether the face corresponding to the target face frame image area in the current video frame is a living body can be determined directly by the network, or the output may be the face living body classification score corresponding to the target face frame image area in the current video frame, from which the face living body detection result is then determined. The face living body detection network can meet the lightweight requirement of the mobile terminal; the face living body detection accuracy is improved by introducing techniques such as central difference convolution and fusion of high-layer and low-layer network features into the network, and the amount of network computation is reduced by, among other measures, reducing the image input to face living body detection. As an example, as shown in the schematic diagram of the face living body detection network in Fig. 9,
the size of the input target face frame image area is 3 × 64 × 64; central difference convolution processing is performed on it, and the size of the resulting first feature map is 32 × 32 × 32; the first feature map is then passed through 4 Blocks in sequence: the first feature map is input into Block1, the feature map output by Block1 (size 16 × 32 × 32) is input into Block2, the feature map output by Block2 (size 24 × 16 × 16) is input into Block3, and the feature map output by Block3 (size 32 × 8 × 8) is input into Block4; the feature map output by Block4 (size 64 × 4 × 4) is subjected to ordinary convolution to obtain a first convolution result (size 320 × 4 × 4), and the feature map output by Block3 (size 32 × 8 × 8) is subjected to ordinary convolution to obtain a second convolution result (size 160 × 4 × 4); the first convolution result and the second convolution result are then fused to obtain a target feature map (size 480 × 4 × 4); the target feature map is input into a pooling layer to obtain a pooled feature map (size 480 × 1 × 1), the pooled feature map is input into a softmax classification loss function, and the corresponding face living body classification score is output. Taking 64 × 4 × 4 as an example, 4 × 4 refers to the size of a feature map and 64 refers to the number of feature maps.
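Putting the pieces together, a hedged sketch of the example network of Fig. 9 could be assembled from the CentralDifferenceConv2d and InvertedResidualBlock sketches above. The channel counts follow the sizes quoted in the text, while the exact convolutions used for the two fusion branches, the pooling type and the classification head are assumptions where the description is silent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# assumes CentralDifferenceConv2d and InvertedResidualBlock from the sketches above

class FaceLivenessNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = CentralDifferenceConv2d(3, 32, stride=2)        # 3x64x64  -> 32x32x32
        self.block1 = InvertedResidualBlock(32, 16, stride=1)       # -> 16x32x32
        self.block2 = InvertedResidualBlock(16, 24, stride=2)       # -> 24x16x16
        self.block3 = InvertedResidualBlock(24, 32, stride=2)       # -> 32x8x8
        self.block4 = InvertedResidualBlock(32, 64, stride=2)       # -> 64x4x4
        self.conv_high = nn.Conv2d(64, 320, 1)                      # -> 320x4x4
        self.conv_low = nn.Conv2d(32, 160, 3, stride=2, padding=1)  # -> 160x4x4
        self.classifier = nn.Linear(480, 2)

    def forward(self, x):
        f1 = self.block1(self.stem(x))
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        f4 = self.block4(f3)
        fused = torch.cat([self.conv_high(f4), self.conv_low(f3)], dim=1)  # 480x4x4
        pooled = F.adaptive_avg_pool2d(fused, 1).flatten(1)                # 480
        return F.softmax(self.classifier(pooled), dim=1)[:, 1]             # classification score
```

For a 3 × 64 × 64 input, the intermediate shapes in this sketch match the sizes listed above (32 × 32 × 32 after the stem, 64 × 4 × 4 after Block4, 480 × 4 × 4 after fusion and 480 after pooling).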
In some embodiments, the step S11 includes: the first device acquires an initial face frame image area in the current video frame, and performs edge expansion processing on the initial face frame image area to obtain the target face frame image area. In some embodiments, since the initial face frame image area obtained from the current video frame by the face detection algorithm is usually tight around the face, while the face boundary or contour information is usually the most discriminative between living bodies and non-living bodies, edge expansion processing (for example, a 30% boundary expansion) is performed on the initial face frame image area to obtain the target face frame image area, so that the target face frame image area contains more living body information. In some embodiments, the edge-expanded target face frame image area is used as the region of interest (ROI) for face living body detection.
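A minimal sketch of such an edge expansion, assuming the detected face box is given as pixel coordinates (x1, y1, x2, y2) and is enlarged symmetrically around its centre by 30% before being clipped to the frame; the function name and the ratio are illustrative.

```python
def expand_face_box(x1, y1, x2, y2, frame_w, frame_h, ratio=0.3):
    """Expand a face box by `ratio` around its centre and clip it to the frame."""
    w, h = x2 - x1, y2 - y1
    dx, dy = w * ratio / 2, h * ratio / 2
    return (max(0, int(x1 - dx)), max(0, int(y1 - dy)),
            min(frame_w, int(x2 + dx)), min(frame_h, int(y2 + dy)))
```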
In some embodiments, the step S12 includes: the first device performs quality evaluation on the target face frame image area to obtain the image quality corresponding to the target face frame image area; and if the image quality is higher than or equal to a predetermined image quality threshold, performs central difference convolution processing on the target face frame image area to obtain the first feature map. In some embodiments, quality evaluation processing is performed on the target face frame image area; if the image quality is lower than the predetermined image quality threshold (for example, the image quality is too poor due to image blurring), the process returns directly and the target face frame image area is not subjected to subsequent processing, i.e., the target face frame image area is not subjected to the central difference convolution processing; the other target face frame image areas in the current video frame may then continue to be processed, or the next video frame after the current video frame is acquired, a new target face frame image area is acquired from that next video frame, and quality evaluation is performed on the new target face frame image area.
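The text does not name a specific quality metric; as one hedged example, the variance of the Laplacian is a common sharpness measure, so a blurred face ROI could be rejected before it reaches the liveness network. The OpenCV-based helper and the threshold below are assumptions for illustration only.

```python
import cv2

def roi_passes_quality_check(face_roi_bgr, blur_threshold=60.0):
    """Return True if the face ROI is sharp enough to be worth running liveness detection on."""
    gray = cv2.cvtColor(face_roi_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # low variance -> blurred image
    return sharpness >= blur_threshold
```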
In some embodiments, the fusing the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain the target feature map includes: fusing the second feature map output by the last feature extraction processing with the third feature map output by the penultimate feature extraction processing to obtain the target feature map. In some embodiments, the second feature map output by the last feature extraction processing is subjected to ordinary convolution to obtain a first convolution result, the third feature map output by the penultimate feature extraction processing is subjected to ordinary convolution to obtain a second convolution result, and the first convolution result and the second convolution result are then fused to obtain the target feature map. For example, the size of the feature map corresponding to the first convolution result is 320 × 4 × 4, the size of the feature map corresponding to the second convolution result is 160 × 4 × 4, and the size of the target feature map obtained by fusing the first convolution result and the second convolution result is 480 × 4 × 4, where 4 × 4 refers to the size of a feature map, and 160, 320 and 480 refer to the numbers of feature maps.
In some embodiments, the step S14 includes a step S141 (not shown) and a step S142 (not shown). In step S141, the first device determines, according to the target feature map, a face living body classification score corresponding to the target face frame image area in the current video frame; in step S142, the first device determines, according to the face living body classification score, a face living body detection result corresponding to the target face frame image area in the current video frame. In some embodiments, the face living body classification score corresponding to the target face frame image area in the current video frame is determined according to the fused target feature map, and the face living body classification score scores the probability of whether the face corresponding to the target face frame image area in the current video frame is a living body. As an example, the larger the face living body classification score is, the higher the probability that the face corresponding to the target face frame image area in the current video frame is not a living body; as another example, the smaller the face living body classification score is, the higher the probability that the face corresponding to the target face frame image area in the current video frame is a living body. In some embodiments, the face living body detection result is determined according to the face living body classification score; for example, given a score threshold, if the face living body classification score is greater than or equal to the score threshold, the face in the current video frame is determined not to be a living body, and if the face living body classification score is less than the score threshold, the face in the current video frame is determined to be a living body. In some embodiments, a mapping relationship between score intervals and face living body detection results may be established in advance, and according to the score interval into which the face living body classification score falls, the face living body detection result mapped to by that score interval is used as the face living body detection result corresponding to the target face frame image area in the current video frame. In some embodiments, a functional relation between the classification score and the face living body detection result may be established in advance, the face living body classification score is input into the functional relation, and the output of the functional relation is used as the face living body detection result.
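A small sketch of the score-interval mapping mentioned above, following the convention used in this description that a larger score indicates a higher probability of a non-living face; the interval boundaries and result labels are illustrative assumptions.

```python
def map_score_to_result(score, intervals=((0.8, 1.01, "not a living body"),
                                          (0.0, 0.8, "living body"))):
    """Return the detection result mapped to the interval the score falls into."""
    for low, high, result in intervals:
        if low <= score < high:
            return result
    return "undetermined"
```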
In some embodiments, the step S141 includes: the first device performs pooling processing on the target feature map to obtain a second target feature map, inputs the second target feature map into a classification loss function, and outputs a face living body classification score corresponding to the target face frame image area in the current video frame. In some embodiments, the fused target feature map is input into the pooling layer, the feature map output by the pooling layer is input into a classification loss function (e.g., a softmax loss function, a 0-1 loss function, a cross entropy loss function, etc.), a face living body classification score corresponding to the target face frame image area in the current video frame is output (e.g., a probability value that the face corresponding to the target face frame image area in the current video frame is a living body), and the face living body detection result is then determined based on the face living body classification score, that is, whether the face corresponding to the target face frame image area in the current video frame is a living body is determined.
In some embodiments, the step S11 includes: the first device acquires a target face frame image area in the current video frame and target face identification information corresponding to the target face frame image area, and the step S141 includes a step S1411 (not shown). In step S1411, the first device determines, according to the target feature map, a face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the current video frame. In some embodiments, a plurality of target face frame image areas corresponding to different faces may exist in the same video frame, and target face frame image areas corresponding to the same face may exist in different video frames. For example, a target face frame image area A1 and a target face frame image area A2 exist in the current video frame, the target face frame image area A1 corresponding to the face of User1 and the target face frame image area A2 corresponding to the face of User2; as another example, a target face frame image area A1 exists in video Frame1 and corresponds to the face of User1, and a target face frame image area A2 exists in video Frame2 and also corresponds to the face of User1. In order to reduce the amount of face living body detection computation and avoid performing face living body detection on every face image area in every video frame, it is necessary to assign face identification information (e.g., a face ID) to the faces detected in the video frames, i.e., to assign unique face identification information to the same face in the video frame sequence. In some embodiments, when one or more target face frame image areas in the current video frame are obtained, the target face identification information corresponding to each target face frame image area needs to be determined, and if a plurality of face frame image areas in different video frames correspond to the same target face identification information, the plurality of face frame image areas correspond to the same face. In some embodiments, because each target face frame image area has a corresponding relationship with its target face identification information, the face living body classification score corresponding to each target face frame image area in the current video frame can be determined according to the foregoing scheme, and based on the target face identification information corresponding to each target face frame image area, the face living body classification score corresponding to the target face frame image area corresponding to each piece of target face identification information in the video frame sequence can be determined.
In some embodiments, the target face identification information may be determined by intersection over union (IoU). For example, the target face frame image areas detected in the first video frame are each given target face identification information; face detection is also performed on the next video frame to obtain target face frame image areas, which are then matched pairwise against the target face frame image areas detected in the previous video frame, i.e., the IoU of each pair of target face frame image areas is calculated. When the IoU between a target face frame image area detected in the next video frame and a target face frame image area detected in the previous video frame is greater than a threshold (e.g., 0.5), the target face frame image area detected in the next video frame has found a matched face and is given the same target face identification information as the matched face in the previous video frame; when a target face frame image area detected in the next video frame cannot find a matched face among the target face frame image areas detected in the previous video frame, it is given a new piece of target face identification information. The method for determining the target face identification information is only an example and is not limited here.
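A hedged sketch of this IoU-based ID assignment, assuming boxes are (x1, y1, x2, y2) tuples; the data structures, the greedy matching strategy and the 0.5 threshold are illustrative choices beyond the example value given in the text.

```python
from itertools import count

_next_id = count(1)  # source of new face IDs

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_face_ids(new_boxes, prev_boxes_with_ids, threshold=0.5):
    """Match each new box to the best previous box by IoU, or give it a new ID."""
    assigned = []
    for box in new_boxes:
        best_id, best_iou = None, threshold
        for prev_box, face_id in prev_boxes_with_ids:
            overlap = iou(box, prev_box)
            if overlap >= best_iou:
                best_id, best_iou = face_id, overlap
        assigned.append((box, best_id if best_id is not None else next(_next_id)))
    return assigned
```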
In some embodiments, the step S142 includes: if the face living body classification score is greater than or equal to a first predetermined score threshold, the first device determines that the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is not a living body; wherein the method further comprises step S15 (not shown). In step S15, the first device repeatedly executes the step S11 for one or more second video frames within a subsequent predetermined interval corresponding to the current video frame, and directly determines that the face corresponding to the target face frame image area corresponding to the target face identification information in the one or more second video frames is not a living body. In some embodiments, the face living body classification score scores the probability of whether the face corresponding to the target face frame image area in the current video frame is a living body; since the target face frame image area has a corresponding relationship with the target face identification information, if the face living body classification score corresponding to the target face frame image area in the current video frame is greater than or equal to the first predetermined score threshold, it is determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is not a living body. Since adjacent video frames change very little, in order to reduce the amount of face living body detection computation, for one or more second video frames within a subsequent predetermined interval corresponding to the current video frame (a predetermined time interval, e.g., 500 ms, or a predetermined frame-number interval, e.g., 1-15 frames), the target face frame image areas in the one or more second video frames and the target face identification information corresponding to them are obtained. For the target face frame image area corresponding to the same target face identification information in the one or more second video frames, face living body detection (steps S12, S13, S14) need not be performed on the target face frame image area; it may be directly determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the one or more second video frames is not a living body. Face living body detection is still required for the target face frame image areas corresponding to other target face identification information in the one or more second video frames, and for the target face frame image areas corresponding to the target face identification information in subsequent video frames after the predetermined interval.
For example, a target face frame image area A1 and a target face frame image area A2 exist in the current video frame, with corresponding target face identification information User1 and User2 respectively. If the face living body classification score corresponding to the target face frame image area A1 is greater than or equal to the first predetermined score threshold, it is determined that the face corresponding to the target face frame image area A1 corresponding to the target face identification information User1 in the current video frame is not a living body. For the video frames within 500 ms after the current video frame, the target face frame image areas and the corresponding target face identification information are determined respectively; for example, a target face frame image area A3 and a target face frame image area A4 exist in the video frame next to the current video frame, with corresponding target face identification information User1 and User3 respectively. Then the target face frame image area A3 corresponding to the target face identification information User1 in that video frame does not need to undergo face living body detection (for example, the steps of central difference convolution processing, feature extraction processing by the plurality of convolution modules, and determining the living body detection result according to the target feature map are not performed); it is directly determined that the face corresponding to the target face frame image area A3 corresponding to the target face identification information User1 in that video frame is not a living body, while face living body detection still needs to be performed on the target face frame image area A4 corresponding to the target face identification information User3 in that video frame.
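A minimal sketch of this skip logic, assuming each face ID that has been scored as a non-living face is cached with a timestamp and the cached result is reused for that ID during the following interval (500 ms here) instead of re-running the liveness network; the names and the time-based bookkeeping are illustrative assumptions.

```python
import time

_attack_cache = {}  # face_id -> time at which the face was determined not to be a living body

def mark_as_attack(face_id):
    """Remember that this face ID was just scored as a non-living face."""
    _attack_cache[face_id] = time.monotonic()

def should_skip_detection(face_id, interval_s=0.5):
    """True if the cached non-living result for this face ID is still within the interval."""
    ts = _attack_cache.get(face_id)
    return ts is not None and (time.monotonic() - ts) <= interval_s
```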
In some embodiments, the step S142 includes a step S1421 (not shown) and a step S1422 (not shown). In step S1421, if the face living body classification score is less than or equal to a second predetermined score threshold, it cannot yet be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body; wherein the method further comprises: the first device repeatedly executes the steps S11, S12, S13 and S1411 for a third video frame after a subsequent predetermined interval corresponding to the current video frame, and obtains a third face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the third video frame; in step S1422, the first device determines, according to the third face living body classification score, a face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the third video frame. In some embodiments, in order to improve the face living body detection accuracy, the face living body classification scores of a plurality of video frame images may be used to comprehensively determine whether the face in the video frames is a living body. In some embodiments, the face living body classification score scores the probability of whether the face corresponding to the target face frame image area in the current video frame is a living body; since the target face frame image area has a corresponding relationship with the target face identification information, if the face living body classification score corresponding to the target face frame image area in the current video frame is less than or equal to the second predetermined score threshold, it cannot be immediately determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body. In order to increase accuracy, and because adjacent frames change very little, the above steps S11, S12, S13 and S1411 may be repeatedly executed for a third video frame after a subsequent predetermined interval corresponding to the current video frame, so as to obtain a third face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the third video frame, and whether the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body is comprehensively determined according to the third face living body classification score, where the predetermined interval may be a predetermined time interval, for example, 100 ms, or a predetermined number of frames, for example, 1-3 frames; the predetermined interval is by way of example only and is not limited here.
For example, a target face frame image area A1 and a target face frame image area A2 exist in the current video frame, with corresponding target face identification information User1 and User2 respectively. If the face living body classification score corresponding to the target face frame image area A1 in the current video frame is less than or equal to the second predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area A1 corresponding to the target face identification information User1 in the current video frame is a living body. The above steps S11, S12, S13 and S1411 are then performed on a third video frame 100 ms after the current video frame; if a target face frame image area A3 corresponding to the target face identification information User1 exists in the third video frame, a third face living body classification score corresponding to the target face frame image area A3 in the third video frame is obtained, and whether the face corresponding to the target face frame image area A3 corresponding to the target face identification information User1 in the third video frame is a living body is then determined according to the third face living body classification score.
In some embodiments, wherein the step S1422 includes at least one of: if the third face living body classification score is smaller than or equal to a second preset score threshold value, the first device determines that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body; if the third face living body classification score is larger than or equal to a first preset score threshold value, the first device determines that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is not a living body; repeating the step S15 with the third video frame as the current video frame; if the third face living body classification score is larger than a second preset score threshold and smaller than a first preset score threshold, whether a face corresponding to a target face frame image area corresponding to the target face identification information in the third video frame is a living body cannot be determined; repeating the steps S11, S12, S13 and S1411 for a fourth video frame after a subsequent predetermined interval corresponding to the third video frame, and obtaining a fourth face living body classification score corresponding to a target face frame image area corresponding to the target face identification information in the fourth video frame; and determining a fourth face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame according to the fourth face living body classification score. In some embodiments, if the third face living body classification score is less than or equal to the second predetermined score threshold, since the face living body classification score of the target face frame image area corresponding to the target face identification information in the current video frame is less than or equal to the second predetermined score threshold, it may be determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body. For example, a target face frame image area A3 corresponding to the target face identification information User1 exists in the third video frame, a third living body classification score of the face corresponding to the target face frame image area A3 in the third video frame is obtained, and if the third living body classification score of the face is still smaller than or equal to a second predetermined score threshold, it is determined that the face corresponding to the target face frame image area A3 corresponding to the target face identification information User1 in the third video frame is a living body. 
In some embodiments, if the third face living body classification score is greater than or equal to the first predetermined score threshold, it may be determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is not a living body. Since adjacent video frames change very little, for one or more video frames within a subsequent predetermined interval corresponding to the third video frame (a predetermined time interval, e.g., 500 ms, or a predetermined frame-number interval, e.g., 1-15 frames), the target face frame image areas in the one or more video frames and the target face identification information corresponding to them are obtained; for the target face frame image area corresponding to the same target face identification information in the one or more video frames, face living body detection (steps S12, S13, S14) need not be performed on the target face frame image area, and it may be directly determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the one or more video frames is not a living body. Face living body detection is still required for the target face frame image areas corresponding to other target face identification information in the one or more video frames, and for the target face frame image areas corresponding to the target face identification information in subsequent video frames after the predetermined interval. For example, a target face frame image area A3 corresponding to the target face identification information User1 exists in the third video frame, and a third face living body classification score corresponding to the target face frame image area A3 in the third video frame is obtained. If the third face living body classification score is greater than or equal to the first predetermined score threshold, it is determined that the face corresponding to the target face frame image area A3 corresponding to the target face identification information User1 in the third video frame is not a living body; for one or more video frames within 500 ms after the third video frame, the target face frame image areas and the corresponding target face identification information are determined respectively. For example, a target face frame image area A4 and a target face frame image area A5 exist in the video frame next to the third video frame, with corresponding target face identification information User1 and User3 respectively; then the target face frame image area A4 corresponding to the target face identification information User1 in that video frame does not need to undergo face living body detection (for example, the steps of central difference convolution processing, feature extraction processing by the plurality of convolution modules, and determining the living body detection result according to the target feature map are not performed), and it is directly determined that the face corresponding to the target face frame image area A4 corresponding to the target face identification information User1 in that video frame is not a living body, while face living body detection still needs to be performed on the target face frame image area A5 corresponding to the target face identification information User3 in that video frame.
In some embodiments, if the third face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body. In order to increase accuracy, and since the change between adjacent frames is small, the steps S11, S12, S13 and S1411 may be repeatedly executed for a fourth video frame after a subsequent predetermined interval corresponding to the third video frame, a fourth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame is obtained, and whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame is a living body is comprehensively determined according to the fourth face living body classification score; the predetermined interval may be a predetermined time interval, e.g., 100ms, or a predetermined frame number interval, e.g., 1-3 frames. For example, a target face frame image area A3 corresponding to the target face identification information User1 exists in the third video frame, and a third face living body classification score corresponding to the target face frame image area A3 in the third video frame is obtained; if the third face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area A3 corresponding to the target face identification information User1 in the third video frame is a living body, so the above steps S11, S12, S13 and S1411 are performed on a fourth video frame 100ms after the third video frame; if a target face frame image area A6 corresponding to the target face identification information User1 exists in the fourth video frame, a fourth face living body classification score corresponding to the target face frame image area A6 in the fourth video frame is obtained, and whether the face corresponding to the target face frame image area A6 in the fourth video frame is a living body is then determined according to the fourth face living body classification score.
In some embodiments, the step S142 includes: if the face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, the first device cannot determine whether the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body; wherein the method further comprises step S16 (not shown) and step S17 (not shown). In step S16, the first device repeatedly executes the steps S11, S12, S13 and S1411 for a fifth video frame after a subsequent predetermined interval corresponding to the current video frame, and obtains a fifth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame; in step S17, the first device determines, according to the fifth face living body classification score, a fifth face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame. In some embodiments, if the face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined at this time whether the face corresponding to that target face frame image area in the current video frame is a living body; in order to increase accuracy, the above steps S11, S12, S13 and S1411 are repeatedly executed for a fifth video frame after a predetermined interval corresponding to the current video frame, a fifth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is obtained, and whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is a living body is comprehensively determined based on the fifth face living body classification score. The predetermined interval may be a predetermined time interval, such as 100ms, or a predetermined frame number interval, such as 1-3 frames; the predetermined interval is only exemplary and is not limited herein.
For example, a target face frame image area A1 and a target face frame image area A2 exist in the current video frame, and the corresponding target face identification information is User1 and User2 respectively. If the face living body classification score corresponding to the target face frame image area A1 in the current video frame is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area A1 corresponding to the target face identification information User1 in the current video frame is a living body, so the above steps S11, S12, S13 and S1411 are performed on a fifth video frame 100ms after the current video frame; if a target face frame image area A7 corresponding to the target face identification information User1 exists in the fifth video frame, a fifth face living body classification score corresponding to the target face frame image area A7 in the fifth video frame is obtained, and whether the face corresponding to the target face frame image area A7 corresponding to the target face identification information User1 in the fifth video frame is a living body is then determined according to the fifth face living body classification score. In some embodiments, the step S17 includes at least one of the following: if the fifth face living body classification score is greater than or equal to the first predetermined score threshold, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is not a living body, and repeatedly executing the step S15 with the fifth video frame as the current video frame; if the fifth face living body classification score is less than or equal to the second predetermined score threshold, repeating the steps S11, S12, S13 and S1411 for a sixth video frame after a subsequent predetermined interval corresponding to the fifth video frame, obtaining a sixth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame, and determining a face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame according to the sixth face living body classification score; if the fifth face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is a living body, so the steps S11, S12, S13 and S1411 are repeated for a seventh video frame after a subsequent predetermined interval corresponding to the fifth video frame, a seventh face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame is obtained, and a seventh face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame is determined according to the seventh face living body classification score.
In some embodiments, if the fifth face living body classification score is greater than or equal to the first predetermined score threshold, it may be determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is not a living body. Since the change between neighboring video frames is small, for one or more video frames in the subsequent predetermined interval corresponding to the fifth video frame, the target face frame image area in the one or more video frames and the target face identification information corresponding to that target face frame image area are obtained; for the target face frame image area corresponding to the target face identification information in the one or more video frames, face living body detection (steps S12, S13, S14) is not performed on the target face frame image area, and it may be directly determined that the face corresponding to the target face frame image area corresponding to the target face identification information in the one or more video frames is not a living body; for target face frame image areas corresponding to other target face identification information in the one or more video frames, and for the target face frame image area corresponding to the target face identification information in the video frames after the subsequent predetermined interval, face living body detection still needs to be performed on the target face frame image area. For example, a target face frame image area A7 corresponding to the target face identification information User1 exists in the fifth video frame, and a fifth face living body classification score corresponding to the target face frame image area A7 in the fifth video frame is obtained; if the fifth face living body classification score is greater than or equal to the first predetermined score threshold, it is determined that the face corresponding to the target face frame image area A7 corresponding to the target face identification information User1 in the fifth video frame is not a living body, and for one or more video frames within the subsequent 500ms corresponding to the fifth video frame, the target face frame image areas and the corresponding target face identification information are respectively determined. For example, a target face frame image area A8 and a target face frame image area A9 exist in the next video frame of the fifth video frame, and the corresponding target face identification information is User1 and User4 respectively; then face living body detection is not performed on the target face frame image area A8 corresponding to the target face identification information User1 in that video frame (for example, the target face frame image area A8 is not subjected to the steps of central difference convolution, feature extraction processing by a plurality of convolution modules, and determining a living body detection result according to a target feature map), and it is directly determined that the face corresponding to the target face frame image area A8 corresponding to the target face identification information User1 in that video frame is not a living body, while face living body detection still needs to be performed on the target face frame image area A9 corresponding to the target face identification information User4 in that video frame.
In some embodiments, if the fifth face living body classification score is less than or equal to the second predetermined score threshold, it still cannot be conclusively determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is a living body, so the steps S11, S12, S13 and S1411 are repeatedly performed on a sixth video frame after the subsequent predetermined interval corresponding to the fifth video frame to obtain a sixth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame, and the step S1422 is then repeatedly performed on the sixth face living body classification score to determine whether the face corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame is a living body. For example, a target face frame image area A7 corresponding to the target face identification information User1 exists in the fifth video frame, and a fifth face living body classification score corresponding to the target face frame image area A7 in the fifth video frame is obtained; if the fifth face living body classification score is less than or equal to the second predetermined score threshold, it cannot yet be determined whether the face corresponding to the target face frame image area A7 corresponding to the target face identification information User1 in the fifth video frame is a living body, so the above steps S11, S12, S13 and S1411 are performed on a sixth video frame 100ms after the fifth video frame; if a target face frame image area A10 corresponding to the target face identification information User1 exists in the sixth video frame, a sixth face living body classification score corresponding to the target face frame image area A10 in the sixth video frame is obtained, and the above step S1422 is then performed according to the sixth face living body classification score to determine whether the face corresponding to the target face frame image area A10 corresponding to the target face identification information User1 in the sixth video frame is a living body. In some embodiments, if the fifth face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is a living body, so the steps S11, S12, S13 and S1411 are repeated for a seventh video frame after a predetermined interval corresponding to the fifth video frame to obtain a seventh face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame, and the step S17 is then repeated on the seventh face living body classification score to determine, based on the seventh face living body classification score, whether the face corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame is a living body.
For example, a target face frame image area A7 corresponding to the target face identification information User1 exists in the fifth video frame, and a fifth face living body classification score corresponding to the target face frame image area A7 in the fifth video frame is obtained; if the fifth face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area A7 corresponding to the target face identification information User1 in the fifth video frame is a living body, so the above steps S11, S12, S13 and S1411 are performed on a seventh video frame 100ms after the fifth video frame; if a target face frame image area A11 corresponding to the target face identification information User1 exists in the seventh video frame, a seventh face living body classification score corresponding to the target face frame image area A11 in the seventh video frame is obtained, and the above step S17 is then performed according to the seventh face living body classification score to determine whether the face corresponding to the target face frame image area A11 corresponding to the target face identification information User1 in the seventh video frame is a living body.
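By way of non-limiting illustration, the following sketch summarizes the two-threshold, multi-frame decision logic described above for a single tracked face. It is a simplified reading of these embodiments rather than the exact control flow of the disclosure; the threshold values, the skip interval, and the rule that two consecutive low scores confirm a living body are illustrative assumptions.

```python
# Simplified sketch of the two-threshold, multi-frame liveness decision for one tracked face.
# THRESH_HIGH / THRESH_LOW / SKIP_FRAMES are illustrative values, not the disclosed parameters.

THRESH_HIGH = 0.8   # first predetermined score threshold: score >= this -> not a living body
THRESH_LOW = 0.2    # second predetermined score threshold: score <= this -> candidate living body
SKIP_FRAMES = 15    # subsequent predetermined interval in which a "not live" verdict is reused

class TrackState:
    """Per-face state, keyed by target face identification information (e.g. User1)."""
    def __init__(self):
        self.skip_until = -1       # frame index until which detection is skipped (spoof verdict reused)
        self.pending_low = False   # a previous check already scored <= THRESH_LOW

def decide(track: TrackState, frame_idx: int, score: float) -> str:
    """score is the face living body classification score produced by the detection network."""
    if frame_idx < track.skip_until:
        return "not_live"                      # reuse the verdict within the predetermined interval
    if score >= THRESH_HIGH:
        track.skip_until = frame_idx + SKIP_FRAMES
        track.pending_low = False
        return "not_live"
    if score <= THRESH_LOW:
        if track.pending_low:                  # a low score confirmed again on a later frame
            track.pending_low = False
            return "live"
        track.pending_low = True               # wait for the third/fifth video frame to confirm
        return "uncertain"
    track.pending_low = False                  # score between the two thresholds: re-check later
    return "uncertain"

# Example: two consecutive low scores yield a "live" verdict for the same face identification.
state = TrackState()
print(decide(state, 0, 0.10))   # uncertain (first low score, to be confirmed)
print(decide(state, 3, 0.15))   # live
```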
In some embodiments, the central differential convolution is a combination of a pure central differential convolution and a normal convolution based on a predetermined hyper-parameter, wherein the predetermined hyper-parameter is between 0 and 1. In some embodiments, the center differential convolution in the present application may be a pure center differential convolution, or may be a combination of a normal convolution and a pure center differential convolution. In some embodiments, the general convolution formula is:
y(P_0) = \sum_{P_n \in R} w(P_n) \cdot x(P_0 + P_n)
to improve robustness, a pure central difference convolution (central difference convolution) and a normal convolution (vanilla convolution) can be combined, and the central difference convolution formula is as follows:
y(P_0) = \theta \cdot \sum_{P_n \in R} w(P_n) \cdot \left( x(P_0 + P_n) - x(P_0) \right) + (1 - \theta) \cdot \sum_{P_n \in R} w(P_n) \cdot x(P_0 + P_n)
wherein x and y are the input feature map and the output feature map respectively, w denotes the convolution kernel weights, R denotes the local receptive field region corresponding to the convolution kernel size, P_0 refers to the current position (i.e. the center point of the image area corresponding to the convolution kernel size in the input feature map), P_n is a neighborhood position of P_0 (i.e. one of the remaining pixel points adjacent to the center point in the image area corresponding to the convolution kernel size in the input feature map), and the value range of the hyper-parameter theta is [0, 1]. When theta is 0, the central differential convolution degenerates into the normal convolution; when theta is 1, it is the pure central differential convolution. The hyper-parameter theta is preset by default, preferably theta defaults to 0.7. Through the combination of the pure central differential convolution and the normal convolution based on the predetermined hyper-parameter, the calculation amount can be reduced as much as possible while retaining higher robustness and higher face living body detection accuracy.
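By way of non-limiting illustration, the following is a minimal PyTorch-style sketch of a theta-combined central difference convolution consistent with the formula above. It uses the identity that the central difference term equals the per-output-channel sum of the kernel weights applied to x(P_0); the class name, layer sizes and default theta are assumptions for illustration and are not asserted to be the exact layer used in this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Conv2d whose output mixes vanilla convolution and pure central difference convolution via theta."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out_vanilla = self.conv(x)                       # sum_n w(Pn) * x(P0 + Pn)
        if self.theta == 0:
            return out_vanilla                           # theta = 0 degenerates to normal convolution
        # sum_n w(Pn) * x(P0) equals a 1x1 convolution whose weight is the per-channel kernel sum
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_center = F.conv2d(x, kernel_sum, stride=self.conv.stride, padding=0)
        # theta * (vanilla - center term) + (1 - theta) * vanilla == vanilla - theta * center term
        return out_vanilla - self.theta * out_center

# Example: a 32-filter, stride-2 CDC stem applied to a 3x64x64 face crop.
stem = CentralDifferenceConv2d(3, 32, kernel_size=3, stride=2, padding=1, theta=0.7)
feat = stem(torch.randn(1, 3, 64, 64))
print(feat.shape)  # torch.Size([1, 32, 32, 32])
```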
In some embodiments, each of the convolution modules is a convolutional neural network module comprising at least one depth separable convolution layer; wherein the convolutional neural network module comprises any one of: the Block module (i.e. convolution module) in MobileNet V1; the Block module in MobileNet V2; the Block module in ShuffleNet V1; the Block module in ShuffleNet V2. In some embodiments, MobileNet V1 is the first version of a lightweight convolutional neural network dedicated to mobile terminals or embedded devices, proposed by Google in 2017; MobileNet V2 is the second version proposed by Google in 2018, improved on the basis of MobileNet V1; ShuffleNet V1 and ShuffleNet V2 are the first and second versions of a lightweight convolutional neural network usable on mobile devices, proposed by Megvii Technology. In some embodiments, the Block module in the ShuffleNet V2 model is used as the convolution module, and the Block structures with stride (step size) 1 and 2 are shown in FIGS. 10 and 11 respectively.
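As a non-limiting illustration of such a Block, the following is a condensed PyTorch-style sketch of a ShuffleNet V2-style unit with stride 1 (channel split, a depthwise separable branch, concatenation and channel shuffle). It does not reproduce the exact configuration of FIGS. 10 and 11; the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so information flows between the two branches."""
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class ShuffleV2Block(nn.Module):
    """ShuffleNet V2-style unit with stride 1: split channels, transform one half, concat, shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),   # depthwise (channel-by-channel)
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),  # pointwise
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                     # channel split
        out = torch.cat((x1, self.branch(x2)), dim=1)
        return channel_shuffle(out, 2)

# Example: a stride-1 Block keeps the spatial size and channel count unchanged.
block = ShuffleV2Block(32)
print(block(torch.randn(1, 32, 32, 32)).shape)  # torch.Size([1, 32, 32, 32])
```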
Fig. 2 shows a flowchart of a method for face liveness detection according to an embodiment of the present application, the method including step S21 and step S22. In step S21, the first device acquires a target face frame image region in the current video frame; in step S22, the first device inputs the target face frame image area into a face live detection network, and outputs a face live detection result corresponding to the target face frame image area in the current video frame, where the face live detection network performs center difference convolution on the target face frame image area to obtain a first feature map, sequentially performs feature extraction on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction process and a third feature map output by another feature extraction process to obtain a target feature map, where each convolution module includes at least one depth separable convolution layer, and determines the face live detection result according to the target feature map.
In step S21, the first device acquires a target face frame image region in the current video frame. In some embodiments, the related operations are described in detail above and are not described herein.
In step S22, the first device inputs the target face frame image area into a face live detection network, and outputs a face live detection result corresponding to the target face frame image area in the current video frame, where the face live detection network performs center difference convolution on the target face frame image area to obtain a first feature map, sequentially performs feature extraction on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction process and a third feature map output by another feature extraction process to obtain a target feature map, where each convolution module includes at least one depth separable convolution layer, and determines the face live detection result according to the target feature map. In some embodiments, the operations related to the face living body detection network are the same as or similar to the operations related to the first device described above, and are not described herein again.
Fig. 3 shows a flowchart of a method for face liveness detection according to an embodiment of the present application, the method comprising step S31. In step S31, the first device inputs the obtained video frame into a face living body detection network and outputs a face living body detection result corresponding to the target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result corresponding to the target face frame image area in the video frame according to the target feature map.
In step S31, the first device inputs the obtained video frame into a face living body detection network and outputs a face living body detection result corresponding to the target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result corresponding to the target face frame image area in the video frame according to the target feature map. In some embodiments, the operations related to the face living body detection network are the same as or similar to the operations related to the first device described above, and are not described herein again.
Fig. 4 shows a block diagram of a first apparatus for face liveness detection according to an embodiment of the present application, which includes a one-one module 11, a one-two module 12, a one-three module 13, and a one-four module 14. The one-one module 11 is configured to acquire a target face frame image area in a current video frame; the one-two module 12 is configured to perform central difference convolution processing on the target face frame image area to obtain a first feature map; the one-three module 13 is configured to sequentially perform feature extraction processing on the first feature map through a plurality of convolution modules, and fuse the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer; the one-four module 14 is configured to determine, according to the target feature map, a face living body detection result corresponding to the target face frame image area in the current video frame.
The one-one module 11 is configured to acquire the target face frame image area in the current video frame, wherein one or more target face frame image areas may be present in the current video frame. In some embodiments, the first device may be a user device, or may be a network device. For example, the user equipment obtains a video stream by shooting through the user equipment or an external camera device, and the user equipment obtains the target face frame image area from a current video frame of the video stream; as another example, the user equipment sends the video stream obtained by shooting to the network equipment, and the network equipment obtains the target face frame image area from the current video frame of the video stream. In some embodiments, face coordinate information in the current video frame, such as the coordinates of a face circumscribed rectangle in the current video frame, is obtained based on a face detection algorithm, and then the corresponding target face frame image area is obtained from the current video frame according to the face coordinate information, where the face detection algorithm may be any face detection algorithm in the prior art.
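By way of non-limiting illustration, the following sketch shows one way a target face frame image area could be cropped from a video frame given face coordinate information from an arbitrary face detector; the 10% edge expansion ratio and the 64x64 output size are assumptions for illustration only.

```python
import numpy as np
import cv2  # OpenCV is used here only for cropping and resizing; any image library would do

def crop_face_region(frame: np.ndarray, box, expand: float = 0.1, out_size: int = 64) -> np.ndarray:
    """Crop a face bounding box (x1, y1, x2, y2) from the frame with edge expansion, then resize."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * expand, (y2 - y1) * expand
    x1, y1 = max(0, int(x1 - dx)), max(0, int(y1 - dy))   # clamp the expanded box to the frame
    x2, y2 = min(w, int(x2 + dx)), min(h, int(y2 + dy))
    region = frame[y1:y2, x1:x2]
    return cv2.resize(region, (out_size, out_size))

# Example with a synthetic frame and a hypothetical detector output.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
target_face_region = crop_face_region(frame, (200, 150, 320, 300))
print(target_face_region.shape)  # (64, 64, 3)
```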
The one-two module 12 is configured to perform central difference convolution processing on the target face frame image area to obtain a first feature map. In some embodiments, Central Difference Convolution (CDC) can capture intrinsic detail information of a face image by aggregating intensity and gradient information. Briefly, the central difference convolution adds one more central difference operation on the basis of standard convolution: the standard convolution mainly comprises two steps, sampling and aggregation, and the central difference convolution inserts a central difference step between them. Taking a 3x3 convolution kernel as an example, the 3x3 convolution kernel is slid over the feature map; before aggregation (namely, the dot-multiplication operation with the convolution kernel weights), the 9 pixel points in the area covered by the 3x3 convolution kernel are extracted, the center point among the 9 pixel points is located, the pixel value of the center point is then subtracted from each of the 9 pixel points (including the center point itself) to obtain 9 updated pixel values after the central difference, and finally the 9 updated values are aggregated with the convolution kernel weights by dot product to obtain the final output value. In some embodiments, the central difference convolution processing formula is as follows:
y(P_0) = \sum_{P_n \in R} w(P_n) \cdot \left( x(P_0 + P_n) - x(P_0) \right)
wherein x and y are the input feature map and the output feature map respectively, w denotes the convolution kernel weights, R denotes the local receptive field region corresponding to the convolution kernel size, P_0 refers to the current position (i.e. the center point of the image area corresponding to the convolution kernel size in the input feature map), and P_n is a neighborhood position of P_0 (i.e. one of the remaining pixel points adjacent to the center point in the image area corresponding to the convolution kernel size in the input feature map). In some embodiments, the central difference convolution processing is performed on the target face frame image area to obtain a first feature map corresponding to the target face frame image area; the central difference convolution has stronger robustness than a common convolution, and the detection accuracy of a living face can be improved. For example, the size of the convolution kernel corresponding to the central difference convolution is 3 × 3, the step size is 2, and there are 32 convolution kernels in total; as another example, the size of the convolution kernel corresponding to the central difference convolution is 5 × 5, the step size is 1, and there are 32 convolution kernels; the convolution parameters of the central difference convolution are only examples and are not limited. In some embodiments, performing the central difference convolution processing only on the target face frame image area, rather than on the entire current video frame, can reduce the time consumption and power consumption of the operation at the mobile terminal, so as to meet the lightweight requirement of the mobile terminal.
The one-three module 13 is configured to sequentially perform feature extraction processing on the first feature map through a plurality of convolution modules, and fuse the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer. In some embodiments, the feature extraction processing is performed on the first feature map sequentially by a plurality of convolution modules, each convolution module including at least one depth separable convolution layer; for example, the first feature map is input into the first convolution module for feature extraction processing, the feature map output by the first convolution module is then input into the second convolution module for feature extraction processing, and so on, the first feature map being passed sequentially through the plurality of convolution modules with the output of the previous convolution module used as the input of the next convolution module. In some embodiments, a depth separable convolution is an improved algorithm over a normal convolution, which splits the original normal convolution into two parts, a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution. In some embodiments, the Block module in the MobileNetV2 model is used as the convolution module, and the first feature map output by the central difference convolution is sequentially subjected to Block processing multiple times; an inverted residual structure, namely expansion, feature extraction and compression, is adopted in the Block, which can reduce feature loss compared with the ordinary residual structure. For example, the first feature map is sequentially subjected to the Block processing 4 times: the first feature map is input into Block1, the feature map output by Block1 is input into Block2, the feature map output by Block2 is input into Block3, and the feature map output by Block3 is input into Block4, wherein Block1 adopts the structure shown in fig. 7, in which the output size is the same as the input size, and Block2, Block3 and Block4 adopt the structure shown in fig. 8, in which the output size is half of the input size. The number and specific structure of the convolution modules (Blocks) are only examples; the convolution modules may also be in other numbers, may adopt only the structure shown in fig. 7, only the structure shown in fig. 8, a combination of the structures shown in fig. 7 and fig. 8, or other Block structures, which are not limited. In some embodiments, fusing the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain the target feature map may be implemented as follows: the second feature map output by the last feature extraction processing is subjected to normal convolution to obtain a first convolution result, the third feature maps output by one or more feature extraction processings other than the last one are subjected to normal convolution to obtain a second convolution result, and the first convolution result and the second convolution result are then fused to obtain the target feature map.
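As a non-limiting illustration of such a Block, the following is a compact PyTorch-style sketch of a MobileNet V2-style inverted residual unit (1x1 expansion, 3x3 depthwise convolution, 1x1 linear compression); the expansion ratio and channel sizes are assumptions chosen only to mirror the example dimensions discussed in this application.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet V2-style Block: 1x1 expansion, 3x3 depthwise convolution, 1x1 linear compression."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),  # expansion
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),         # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),                          # compression
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# Example roughly mirroring the channel progression assumed in the text:
# Block1 keeps the spatial size, later Blocks halve it (stride 2).
block1 = InvertedResidual(32, 16, stride=1)
block2 = InvertedResidual(16, 24, stride=2)
x = torch.randn(1, 32, 32, 32)
print(block2(block1(x)).shape)  # torch.Size([1, 24, 16, 16])
```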
The one-four module 14 is configured to determine, according to the target feature map, a face living body detection result corresponding to the target face frame image area in the current video frame. In some embodiments, the fused target feature map is pooled and classified: for example, the fused target feature map is input into a pooling layer, the feature map output by the pooling layer is input into a classification loss function, and a face living body classification score corresponding to the target face frame image area in the current video frame (for example, a probability value used to judge whether the face corresponding to the target face frame image area in the current video frame is a living body) is output; the face living body detection result is then determined based on the face living body classification score, that is, whether the face corresponding to the target face frame image area in the current video frame is a living body is determined. For example, if the face living body classification score is greater than or equal to a preset threshold, it is determined that the face corresponding to the target face frame image area in the current video frame is not a living body, and if the face living body classification score is less than the preset threshold, it is determined that the face in the current video frame is a living body. In some embodiments, the present solution constructs a lightweight face living body detection network suitable for a mobile terminal, where the face living body detection network performs the aforementioned steps S12, S13 and S14; the input of the face living body detection network is the target face frame image area obtained from the current video frame, and the output of the face living body detection network may directly be the face living body detection result, that is, whether the face corresponding to the target face frame image area in the current video frame is a living body can be directly determined by the face living body detection network, or the output may be the face living body classification score corresponding to the target face frame image area in the current video frame, with the face living body detection result then determined according to the face living body classification score. The face living body detection network can meet the lightweight requirement of the mobile terminal: the face living body detection accuracy is improved by introducing related techniques such as central difference convolution and fusion of high and low network layers into the network, and the network calculation amount is reduced by, for example, reducing the image input of the face living body detection. As an example, as shown in the schematic diagram of the face living body detection network in fig.
9, the size of the input target face frame image area is 3 × 64 × 64; central difference convolution processing is performed on it, and the size of the obtained first feature map is 32 × 32 × 32. The first feature map is then sequentially subjected to Block processing 4 times: the first feature map is input into Block1, the feature map output by Block1 (size 16 × 32 × 32) is input into Block2, the feature map output by Block2 (size 24 × 16 × 16) is input into Block3, and the feature map output by Block3 (size 32 × 8 × 8) is input into Block4. Then the feature map output by Block4 (size 64 × 4 × 4) is subjected to normal convolution to obtain a first convolution result, and the feature map output by Block3 (size 32 × 8 × 8) is subjected to normal convolution to obtain a second convolution result (size 160 × 4 × 4); the first convolution result and the second convolution result are then fused to obtain a target feature map (size 480 × 4 × 4). The target feature map is then input into a pooling layer to obtain a pooled feature map (size 480 × 1 × 1), the pooled feature map is input into a softmax classification loss function, and the corresponding face living body classification score is output. Taking 64 × 4 × 4 as an example, 4 × 4 refers to the size of the feature map and 64 refers to the number of the feature maps.
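By way of non-limiting illustration, the following sketch shows one way the high/low-layer fusion and classification head just described could be assembled in PyTorch. The 32- and 64-channel inputs and the 160-channel second convolution result follow the sizes quoted above; the 320-channel first convolution result, the stride-2 convolution used to align spatial sizes, and the mapping of the softmax output to a "not a living body" probability are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """High/low-layer fusion and classification: ordinary convolutions on the Block3 and Block4
    feature maps, channel concatenation, global pooling, and a softmax-based liveness score."""
    def __init__(self, ch3=32, ch4=64, mid3=160, mid4=320, num_classes=2):
        super().__init__()
        # mid4 = 320 is an assumption chosen so that mid3 + mid4 = 480, the quoted fused channel count.
        self.conv3 = nn.Conv2d(ch3, mid3, 3, stride=2, padding=1)  # 32x8x8 -> 160x4x4 (stride-2 assumed)
        self.conv4 = nn.Conv2d(ch4, mid4, 3, stride=1, padding=1)  # 64x4x4 -> 320x4x4
        self.pool = nn.AdaptiveAvgPool2d(1)                        # 480x4x4 -> 480x1x1
        self.fc = nn.Linear(mid3 + mid4, num_classes)

    def forward(self, feat3, feat4):
        fused = torch.cat([self.conv3(feat3), self.conv4(feat4)], dim=1)  # target feature map, 480 channels
        logits = self.fc(self.pool(fused).flatten(1))
        probs = torch.softmax(logits, dim=1)
        return probs[:, 1]  # which class index means "not a living body" is an assumption here

# Example with feature maps of the sizes quoted above.
head = FusionHead()
score = head(torch.randn(1, 32, 8, 8), torch.randn(1, 64, 4, 4))
print(score.shape)  # torch.Size([1])
```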
In some embodiments, the one-one module 11 is configured to: acquire an initial face frame image area in the current video frame, and perform edge expansion processing on the initial face frame image area to obtain the target face frame image area. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-two module 12 is configured to: perform quality evaluation on the target face frame image area to obtain the image quality corresponding to the target face frame image area; and if the image quality is higher than or equal to a predetermined image quality threshold, perform central difference convolution processing on the target face frame image area to obtain the first feature map. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the fusing of the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain the target feature map includes: fusing the second feature map output by the last feature extraction processing with the third feature map output by the penultimate feature extraction processing to obtain the target feature map. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-four module 14 includes a one-four-one module 141 (not shown) and a one-four-two module 142 (not shown). The one-four-one module 141 is configured to determine, according to the target feature map, a face living body classification score corresponding to the target face frame image area in the current video frame; the one-four-two module 142 is configured to determine, according to the face living body classification score, a face living body detection result corresponding to the target face frame image area in the current video frame. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-four-one module 141 is configured to: perform pooling processing on the target feature map to obtain a second target feature map; and input the second target feature map into a classification loss function, and output a face living body classification score corresponding to the target face frame image area in the current video frame. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-one module 11 is configured to: obtain a target face frame image area in the current video frame and target face identification information corresponding to the target face frame image area; wherein the one-four-one module 141 includes a one-four-one-one module 1411 (not shown). The one-four-one-one module 1411 is configured to determine, according to the target feature map, a face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the current video frame. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-four-two module 142 is configured to: if the face living body classification score is greater than or equal to a first predetermined score threshold, determine that the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is not a living body; wherein the apparatus further comprises a one-five module 15 (not shown). The one-five module 15 is configured to repeatedly execute the one-one module 11 for one or more second video frames in a subsequent predetermined interval corresponding to the current video frame, and directly determine that the face corresponding to the target face frame image area corresponding to the target face identification information in the one or more second video frames is not a living body. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-four-two module 142 includes a one-four-two-one module 1421 (not shown) and a one-four-two-two module 1422 (not shown). The one-four-two-one module 1421 is configured to: if the face living body classification score is less than or equal to a second predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body; wherein the apparatus is further configured to repeatedly execute the one-one module 11, the one-two module 12, the one-three module 13 and the one-four-one-one module 1411 for a third video frame after a subsequent predetermined interval corresponding to the current video frame, to obtain a third face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the third video frame. The one-four-two-two module 1422 is configured to determine, according to the third face living body classification score, a face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the third video frame. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-four-two-two module 1422 is configured to perform at least one of the following: if the third face living body classification score is less than or equal to the second predetermined score threshold, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body; if the third face living body classification score is greater than or equal to the first predetermined score threshold, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is not a living body, and repeatedly executing the one-five module 15 with the third video frame as the current video frame; if the third face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body, the one-one module 11, the one-two module 12, the one-three module 13 and the one-four-one-one module 1411 are repeatedly executed for a fourth video frame after a subsequent predetermined interval corresponding to the third video frame to obtain a fourth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame, and a fourth face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame is determined according to the fourth face living body classification score. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-four-two module 142 is configured to: if the face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body; wherein the apparatus further comprises a one-six module 16 (not shown) and a one-seven module 17 (not shown). The one-six module 16 is configured to repeatedly execute the one-one module 11, the one-two module 12, the one-three module 13 and the one-four-one-one module 1411 for a fifth video frame after a subsequent predetermined interval corresponding to the current video frame, to obtain a fifth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame; the one-seven module 17 is configured to determine, according to the fifth face living body classification score, a fifth face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the one-seven module 17 is configured to perform at least one of the following: if the fifth face living body classification score is greater than or equal to the first predetermined score threshold, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is not a living body, and repeatedly executing the one-five module 15 with the fifth video frame as the current video frame; if the fifth face living body classification score is less than or equal to the second predetermined score threshold, repeatedly executing the one-one module 11, the one-two module 12, the one-three module 13 and the one-four-one-one module 1411 for a sixth video frame after a subsequent predetermined interval corresponding to the fifth video frame, to obtain a sixth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame, and determining a face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame according to the sixth face living body classification score; if the fifth face living body classification score is greater than the second predetermined score threshold and less than the first predetermined score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is a living body, the one-one module 11, the one-two module 12, the one-three module 13 and the one-four-one-one module 1411 are repeatedly executed for a seventh video frame after a subsequent predetermined interval corresponding to the fifth video frame to obtain a seventh face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame, and a seventh face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame is determined according to the seventh face living body classification score.
In some embodiments, the central differential convolution is a combination of a pure central differential convolution and a normal convolution based on a predetermined hyper-parameter, wherein the predetermined hyper-parameter is between 0 and 1. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, each of the convolution modules is a convolutional neural network module comprising at least one depth separable convolutional layer; wherein the convolutional neural network module comprises any one of: the Block module (i.e., convolution module) in MobileNet V1; a Block module in MobileNet V2; block Module in ShuffleNet V1; block Module in ShuffleNet V2. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
Fig. 5 shows a structure of a first apparatus for face liveness detection according to an embodiment of the present application, which includes a two-one module 21 and a two-two module 22. The two-one module 21 is configured to obtain a target face frame image area in a current video frame; the two-two module 22 is configured to input the target face frame image area into a face living body detection network and output a face living body detection result corresponding to the target face frame image area in the current video frame, wherein the face living body detection network performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result according to the target feature map.
The two-one module 21 is configured to obtain the target face frame image area in the current video frame. In some embodiments, the related operations are described in detail above and are not described herein again.
The two-two module 22 is configured to input the target face frame image area into the face living body detection network and output the face living body detection result corresponding to the target face frame image area in the current video frame, wherein the face living body detection network performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result according to the target feature map. In some embodiments, the operations related to the face living body detection network are the same as or similar to the operations related to the first device described above, and are not described herein again.
Fig. 6 shows a block diagram of a first apparatus for face liveness detection according to an embodiment of the present application, which includes a three-one module 31. The three-one module 31 is configured to input the obtained video frame into a face living body detection network and output a face living body detection result corresponding to the target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result corresponding to the target face frame image area in the video frame according to the target feature map.
The three-one module 31 is configured to input the obtained video frame into the face living body detection network and output the face living body detection result corresponding to the target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, fuses the second feature map output by the last feature extraction processing with the third feature maps output by the other feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer, and determines the face living body detection result corresponding to the target face frame image area in the video frame according to the target feature map. In some embodiments, the operations related to the face living body detection network are the same as or similar to the operations related to the first device described above, and are not described herein again.
In addition to the methods and apparatus described in the embodiments above, the present application also provides a computer readable storage medium storing computer code that, when executed, performs the method as described in any of the preceding claims.
The present application also provides a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.
The present application further provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
FIG. 12 illustrates an exemplary system that can be used to implement the various embodiments described herein;
in some embodiments, as shown in FIG. 12, the system 300 can be implemented as any of the devices in the various embodiments described. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.
For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or any suitable device or component in communication with system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
System memory 315 may be used, for example, to load and store data and/or instructions for system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 320 may be accessible over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on a chip (SoC).
In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Those skilled in the art will appreciate that the form in which the computer program instructions reside on a computer-readable medium includes, but is not limited to, source files, executable files, installation package files, and the like, and that the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Computer-readable media herein can be any available computer-readable storage media or communication media that can be accessed by a computer.
Communication media includes media by which communication signals, including, for example, computer readable instructions, data structures, program modules, or other data, are transmitted from one system to another. Communication media may include conductive transmission media such as cables and wires (e.g., fiber optics, coaxial, etc.) and wireless (non-conductive transmission) media capable of propagating energy waves such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied in a modulated data signal, for example, in a wireless medium such as a carrier wave or similar mechanism such as is embodied as part of spread spectrum techniques. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); and magnetic and optical storage devices (hard disk, tape, CD, DVD); or other now known media or later developed that can store computer-readable information/data for use by a computer system.
An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (19)

1. A method for live human face detection, wherein the method comprises:
A, acquiring an image area of a target face frame in a current video frame;
B, performing central difference convolution processing on the image area of the target face frame to obtain a first feature map;
C, sequentially performing feature extraction processing on the first feature map through a plurality of convolution modules, and fusing a second feature map output by the last feature extraction processing with a third feature map output by another feature extraction processing to obtain a target feature map, wherein each convolution module comprises at least one depth separable convolution layer;
and determining a face living body detection result corresponding to the image area of the target face frame in the current video frame according to the target feature map.
2. The method of claim 1, wherein the acquiring an image area of a target face frame in a current video frame comprises:
acquiring an initial face frame image area in the current video frame;
and performing edge expansion processing on the initial face frame image area to obtain the target face frame image area.
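The edge expansion in claim 2 amounts to growing the detected face box before cropping. A minimal sketch, assuming a NumPy image, an (x1, y1, x2, y2) pixel box and an illustrative 20% margin (the actual expansion ratio is not specified by the claim):

```python
# Sketch of the edge-expansion step: grow the detected face box by an assumed
# margin ratio, clamp it to the frame, and crop the target image area.
import numpy as np

def expand_and_crop(frame, box, margin=0.2):
    """frame: H x W x C array; box: (x1, y1, x2, y2) in pixels; margin is assumed."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * margin, (y2 - y1) * margin
    x1, y1 = max(0, int(x1 - dx)), max(0, int(y1 - dy))
    x2, y2 = min(w, int(x2 + dx)), min(h, int(y2 + dy))
    return frame[y1:y2, x1:x2]

# usage: expand_and_crop(np.zeros((720, 1280, 3), dtype=np.uint8), (500, 200, 620, 340))
```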
3. The method according to claim 1 or 2, wherein the performing central difference convolution processing on the target face frame image area to obtain a first feature map comprises:
performing quality evaluation on the target face frame image area to obtain the image quality corresponding to the target face frame image area;
and if the image quality is higher than or equal to a preset image quality threshold, performing central difference convolution processing on the target face frame image area to obtain the first feature map.
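Claim 3 does not name a particular quality metric, so the following gate is only a stand-in that uses image sharpness (variance of the Laplacian, via OpenCV) as the assumed quality score; any metric and threshold could take its place.

```python
# Stand-in quality gate: sharpness via variance of the Laplacian is an assumed
# metric; the embodiment does not specify how image quality is evaluated.
import cv2

def passes_quality_gate(face_area_bgr, threshold=100.0):
    gray = cv2.cvtColor(face_area_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness >= threshold  # only sufficiently sharp crops reach the convolution stem
```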
4. The method according to claim 1, wherein the fusing the second feature map output by the last feature extraction process and the third feature maps output by the other feature extraction processes to obtain the target feature map comprises:
and fusing the second feature map output by the last feature extraction processing with the third feature map output by the penultimate feature extraction processing to obtain the target feature map.
5. The method according to claim 1, wherein the determining, according to the target feature map, a face living body detection result corresponding to the image area of the target face frame in the current video frame comprises:
determining a face living body classification score corresponding to the image area of the target face frame in the current video frame according to the target feature map;
and determining a face living body detection result corresponding to the image area of the target face frame in the current video frame according to the face living body classification score.
6. The method of claim 5, wherein the determining, according to the target feature map, a face living body classification score corresponding to the image area of the target face frame in the current video frame comprises:
performing pooling processing on the target feature map to obtain a second target feature map;
and inputting the second target feature map into a classification loss function, and outputting a face living body classification score corresponding to the image area of the target face frame in the current video frame.
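A sketch of the pooling-and-scoring step in claim 6, assuming global average pooling, a two-class linear head and the convention that the returned probability is the score later compared against the thresholds; all of these specifics are assumptions rather than the claimed configuration.

```python
# Sketch of claim 6: pool the target feature map, feed it to a classifier, and
# read the softmax output as the face living body classification score.
import torch
import torch.nn as nn
import torch.nn.functional as F

def liveness_classification_score(target_feature_map, classifier):
    pooled = F.adaptive_avg_pool2d(target_feature_map, 1).flatten(1)  # second target feature map
    probs = F.softmax(classifier(pooled), dim=1)
    return probs[:, 1]  # assumed convention: higher score means more likely a spoof

# usage: liveness_classification_score(torch.randn(1, 256, 7, 7), nn.Linear(256, 2))
```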
7. The method of claim 5, wherein the acquiring an image area of a target face frame in a current video frame comprises:
acquiring a target face frame image area in a current video frame and target face identification information corresponding to the target face frame image area;
the determining, according to the target feature map, a face living body classification score corresponding to the image area of the target face frame in the current video frame comprises:
R, determining, according to the target feature map, a face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the current video frame.
8. The method of claim 7, wherein the determining the face living body detection result corresponding to the image area of the target face frame in the current video frame according to the face living body classification score comprises:
if the face living body classification score is larger than or equal to a first preset score threshold value, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is not a living body;
wherein the method further comprises:
and G, repeatedly executing step A for one or more second video frames within a subsequent predetermined interval corresponding to the current video frame, and directly determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the one or more second video frames is not a living body.
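Step G effectively caches a non-live verdict per tracked face so the network is not re-run within the interval. A minimal sketch, with the face-ID keyed cache and the interval length as assumed details:

```python
# Sketch of step G in claim 8: once a tracked face ID is judged non-live, that
# verdict is reused for the same ID within the following interval instead of
# re-running the detection network (the interval length is an assumed value).
import time

_non_live_until = {}  # face_id -> time until which the non-live verdict is reused

def mark_non_live(face_id, interval_s=2.0):
    _non_live_until[face_id] = time.time() + interval_s

def reuse_non_live(face_id):
    return time.time() < _non_live_until.get(face_id, 0.0)
```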
9. The method of claim 7, wherein the determining the face living body detection result corresponding to the image area of the target face frame in the current video frame according to the face living body classification score comprises:
if the face living body classification score is smaller than or equal to a second preset score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body;
wherein the method further comprises:
repeatedly executing steps A, B, C and R for a third video frame after a subsequent predetermined interval corresponding to the current video frame, to obtain a third face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the third video frame;
and determining a face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the third video frame according to the third face living body classification score.
10. The method according to claim 9, wherein the determining, according to the third face living body classification score, a face living body detection result corresponding to a target face frame image area corresponding to the target face identification information in the third video frame includes at least one of:
if the third face living body classification score is smaller than or equal to a second preset score threshold value, determining that a face corresponding to a target face frame image area corresponding to the target face identification information in the third video frame is a living body;
if the third face living body classification score is larger than or equal to a first preset score threshold value, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is not a living body; taking the third video frame as the current video frame, and repeatedly executing the step G;
if the third face living body classification score is larger than a second preset score threshold and smaller than a first preset score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the third video frame is a living body; repeatedly executing steps A, B, C and R for a fourth video frame after a subsequent predetermined interval corresponding to the third video frame, to obtain a fourth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame; and determining a fourth face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the fourth video frame according to the fourth face living body classification score.
11. The method of claim 7, wherein the determining the face living body detection result corresponding to the image area of the target face frame in the current video frame according to the face living body classification score comprises:
if the face living body classification score is larger than a second preset score threshold and smaller than a first preset score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the current video frame is a living body;
wherein the method further comprises:
repeatedly executing steps A, B, C and R for a fifth video frame after a subsequent predetermined interval corresponding to the current video frame, to obtain a fifth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame;
and determining a fifth face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame according to the fifth face living body classification score.
12. The method according to claim 11, wherein the determining, according to the fifth face living body classification score, a fifth face living body detection result corresponding to a target face frame image region corresponding to the target face identification information in the fifth video frame includes at least one of:
if the fifth face living body classification score is larger than or equal to a first preset score threshold, determining that the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is not a living body; taking the fifth video frame as the current video frame, and repeatedly executing the step G;
if the fifth face living body classification score is smaller than or equal to a second preset score threshold, repeatedly executing steps A, B, C and R for a sixth video frame after a subsequent predetermined interval corresponding to the fifth video frame, to obtain a sixth face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame, and determining a face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the sixth video frame according to the sixth face living body classification score;
if the fifth face living body classification score is larger than a second preset score threshold and smaller than a first preset score threshold, it cannot be determined whether the face corresponding to the target face frame image area corresponding to the target face identification information in the fifth video frame is a living body; repeatedly executing steps A, B, C and R for a seventh video frame after a subsequent predetermined interval corresponding to the fifth video frame, to obtain a seventh face living body classification score corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame; and determining a seventh face living body detection result corresponding to the target face frame image area corresponding to the target face identification information in the seventh video frame according to the seventh face living body classification score.
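Read together, claims 8 to 12 describe a two-threshold decision that is re-evaluated on later frames when the score is inconclusive. The sketch below compresses that branching into one function; the threshold values and the "previous check was already low" flag are illustrative simplifications of the per-frame scheduling in the claims.

```python
# Rough sketch of the two-threshold decision in claims 8-12 (thresholds and the
# prev_was_low bookkeeping are illustrative assumptions).
NOT_LIVE, LIVE, UNDECIDED = "not_live", "live", "undecided"

def decide(score, prev_was_low=False, t_high=0.8, t_low=0.2):
    if score >= t_high:
        return NOT_LIVE      # claim 8: judged a spoof at once
    if score <= t_low:
        # claims 9/10: a low score becomes "live" once a later frame confirms it
        return LIVE if prev_was_low else UNDECIDED
    return UNDECIDED         # claims 11/12: mid-range score, check a later frame
```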
13. The method of claim 1, wherein the central difference convolution is a combination of a pure central difference convolution and a normal convolution based on a predetermined hyper-parameter, wherein the predetermined hyper-parameter has a value between 0 and 1.
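The combination in claim 13 is commonly implemented as a vanilla convolution minus the hyper-parameter times a convolution of the input with the spatially summed kernel, which is the closed form of mixing a pure central difference convolution with a normal one. A PyTorch sketch under that assumption (the default theta value and the "same padding" requirement for the shapes to align are assumptions, not part of the claim):

```python
# Sketch of claim 13: vanilla convolution minus theta times a convolution with
# the spatially summed kernel (the usual closed form of central difference
# convolution mixed with a normal convolution).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Conv2d):
    def __init__(self, *args, theta=0.7, **kwargs):
        # theta = 0 gives a plain convolution, theta = 1 a pure central difference
        super().__init__(*args, **kwargs)
        self.theta = theta

    def forward(self, x):
        out = F.conv2d(x, self.weight, self.bias, self.stride,
                       self.padding, self.dilation, self.groups)
        if self.theta == 0:
            return out
        # central term: a 1x1 convolution with the kernel summed over its spatial
        # extent; shapes only line up for "same" padding (e.g. 3x3 kernel, padding=1)
        kernel_sum = self.weight.sum(dim=(2, 3), keepdim=True)
        center = F.conv2d(x, kernel_sum, None, self.stride, 0,
                          self.dilation, self.groups)
        return out - self.theta * center

# usage: y = CentralDifferenceConv2d(3, 32, kernel_size=3, padding=1, theta=0.7)(torch.randn(1, 3, 112, 112))
```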
14. The method of claim 1, wherein each convolution module is a convolutional neural network module comprising at least one depth separable convolutional layer;
wherein the convolutional neural network module comprises any one of:
a Block module in MobileNet V1;
a Block module in MobileNet V2;
a Block module in ShuffleNet V1;
a Block module in ShuffleNet V2.
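As one example of the convolutional neural network modules listed in claim 14, a MobileNet V2 style block expands with a 1x1 convolution, applies a 3x3 depthwise (depth separable) convolution, and projects back with a linear 1x1 convolution. The sketch below follows that pattern; the expansion factor and channel settings are the usual defaults, used here as assumptions.

```python
# Sketch of a MobileNet V2 style inverted-residual block, one of the module
# types named in claim 14 (expansion factor 6 is an assumed default).
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, cin, cout, stride=1, expand=6):
        super().__init__()
        mid = cin * expand
        self.use_residual = stride == 1 and cin == cout
        self.layers = nn.Sequential(
            nn.Conv2d(cin, mid, 1, bias=False),                        # 1x1 expand
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),                         # 3x3 depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout), # linear 1x1 project
        )

    def forward(self, x):
        out = self.layers(x)
        return x + out if self.use_residual else out
```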
15. A method for live human face detection, wherein the method comprises:
acquiring an image area of a target face frame in a current video frame;
inputting the image area of the target face frame into a face living body detection network, and outputting a face living body detection result corresponding to the image area of the target face frame in the current video frame, wherein the face living body detection network performs central difference convolution processing on the image area of the target face frame to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing with a third feature map output by another feature extraction processing to obtain a target feature map, each convolution module comprising at least one depth separable convolution layer, and the face living body detection result being determined according to the target feature map.
16. A method for live human face detection, wherein the method comprises:
inputting the obtained video frame into a face living body detection network, and outputting a face living body detection result corresponding to a target face frame image area in the video frame, wherein the face living body detection network acquires the target face frame image area in the video frame, performs central difference convolution processing on the target face frame image area to obtain a first feature map, sequentially performs feature extraction processing on the first feature map through a plurality of convolution modules, and fuses a second feature map output by the last feature extraction processing with a third feature map output by another feature extraction processing to obtain a target feature map, each convolution module comprising at least one depth separable convolution layer, and the face living body detection result corresponding to the target face frame image area in the video frame being determined according to the target feature map.
17. A computer device for live human face detection, comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 16.
18. A computer-readable storage medium on which a computer program or instructions are stored which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 16.
19. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method according to any one of claims 1 to 16 when executed by a processor.
CN202110929761.5A 2021-08-13 2021-08-13 Method, device, medium and program product for human face living body detection Active CN113657245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110929761.5A CN113657245B (en) 2021-08-13 2021-08-13 Method, device, medium and program product for human face living body detection

Publications (2)

Publication Number Publication Date
CN113657245A true CN113657245A (en) 2021-11-16
CN113657245B CN113657245B (en) 2024-04-26

Family

ID=78479716

Country Status (1)

Country Link
CN (1) CN113657245B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190377930A1 (en) * 2018-06-11 2019-12-12 Zkteco Usa, Llc Method and System for Face Recognition Via Deep Learning
WO2020252740A1 (en) * 2019-06-20 2020-12-24 深圳市汇顶科技股份有限公司 Convolutional neural network, face anti-spoofing method, processor chip, and electronic device
CN111178217A (en) * 2019-12-23 2020-05-19 上海眼控科技股份有限公司 Method and equipment for detecting face image
CN111597938A (en) * 2020-05-07 2020-08-28 马上消费金融股份有限公司 Living body detection and model training method and device
CN111814697A (en) * 2020-07-13 2020-10-23 伊沃人工智能技术(江苏)有限公司 Real-time face recognition method and system and electronic equipment
CN112215180A (en) * 2020-10-20 2021-01-12 腾讯科技(深圳)有限公司 Living body detection method and device
CN112990090A (en) * 2021-04-09 2021-06-18 北京华捷艾米科技有限公司 Face living body detection method and device
CN113159200A (en) * 2021-04-27 2021-07-23 苏州科达科技股份有限公司 Object analysis method, device and storage medium
CN113096159A (en) * 2021-06-04 2021-07-09 城云科技(中国)有限公司 Target detection and track tracking method, model and electronic equipment thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZITONG YU et al.: "Searching Central Difference Convolutional Networks for Face Anti-Spoofing", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5294-5304 *

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 201210 7th Floor, No. 1, Lane 5005, Shenjiang Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai
Applicant after: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.
Address before: Room 501 / 503-505, 570 shengxia Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203
Applicant before: HISCENE INFORMATION TECHNOLOGY Co.,Ltd.
GR01 Patent grant