CN113283376A - Face living body detection method, face living body detection device, medium and equipment - Google Patents

Face living body detection method, face living body detection device, medium and equipment

Info

Publication number
CN113283376A
CN113283376A (application CN202110648875.2A; granted publication CN113283376B)
Authority
CN
China
Prior art keywords
face
living body
loss function
body detection
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110648875.2A
Other languages
Chinese (zh)
Other versions
CN113283376B (en)
Inventor
喻庐军
韩森尧
李驰
刘岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN202110648875.2A priority Critical patent/CN113283376B/en
Publication of CN113283376A publication Critical patent/CN113283376A/en
Application granted granted Critical
Publication of CN113283376B publication Critical patent/CN113283376B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40Spoof detection, e.g. liveness detection
    • G06V40/45Detection of the body part being alive

Abstract

The embodiments of the present disclosure provide a face living body detection method, a face living body detection apparatus, a computer-readable medium and an electronic device, and relate to the technical field of artificial intelligence. The method includes the following steps: performing up-sampling processing on face features through a plurality of up-sampling layers in a face living body detection network; performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer; for the feature vector corresponding to each up-sampling feature, calculating a ternary loss function according to a non-living sample vector and a reference living sample vector; determining a classification loss function of each feature vector through a multilayer perceptron; and training the face living body detection network according to the ternary loss function and the classification loss function corresponding to each feature vector, and performing face living body detection on a received image to be recognized through the trained face living body detection network. Therefore, by implementing the technical scheme of the present application, the recognition accuracy of the face living body detection network can be improved so as to recognize a living body face.

Description

Face living body detection method, face living body detection device, medium and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a face in-vivo detection method, a face in-vivo detection apparatus, a computer-readable medium, and an electronic device.
Background
With the continuous development of computer technology, the application scenarios of the face liveness detection function are increasingly rich; for example, it can be applied to attendance software, payment software, social software, and the like. However, some illegal users may exploit the face liveness detection function through abnormal means to pass identity verification, which can easily harm the data security of the software. For example, an illegal user may perform face liveness detection with a photograph of a legitimate user and thereby pass the face liveness detection verification. Based on the above problems, how to recognize a living human face is a problem that urgently needs to be solved at present.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a face in-vivo detection method, a face in-vivo detection apparatus, a computer readable medium, and an electronic device, which can train a face in-vivo detection network based on a fusion result of a ternary loss function and a classification loss function, thereby improving the recognition accuracy of the face in-vivo detection network to recognize a living face.
A first aspect of the embodiments of the present disclosure provides a face live detection method, including:
carrying out up-sampling treatment on the face features through a plurality of up-sampling layers in the face in-vivo detection network to obtain up-sampling features corresponding to the up-sampling layers;
performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the face living body detection network to obtain feature vectors corresponding to each up-sampling feature;
calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector to obtain a ternary loss function corresponding to each feature vector;
determining a classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network;
and training a face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out face living body detection on the received image to be recognized through the trained face living body detection network.
In an exemplary embodiment of the present disclosure, before performing upsampling processing on a face feature through a plurality of upsampling layers in a face live detection network to obtain an upsampling feature corresponding to each upsampling layer, the method further includes:
identifying a face region in a sample image;
and carrying out convolution processing on the face area through a convolution neural network in the face living body detection network to obtain the face characteristics.
In an exemplary embodiment of the present disclosure, before identifying a face region in a sample image, the method further includes:
receiving a plurality of marking operations aiming at an image to be marked;
determining labeling results corresponding to a plurality of labeling operations respectively, and determining the same labeling result with the largest quantity as a final labeling result corresponding to the image to be labeled;
establishing an incidence relation for correlating the image to be annotated and the final annotation result;
and determining the image to be annotated corresponding to the association relationship as a sample image.
In an exemplary embodiment of the present disclosure, performing convolution processing on a face region through a convolutional neural network in a face living body detection network to obtain a face feature includes:
performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of the layers, the input of each layer is the output of the previous layer;
and determining the reference feature corresponding to the last layer in each layer as the face feature.
In an exemplary embodiment of the present disclosure, training a face living body detection network according to a ternary loss function corresponding to each feature vector and a classification loss function of each feature vector includes:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the face living body detection network through the target loss function until the target loss function converges to a preset loss range, so as to realize training of the face living body detection network.
In an exemplary embodiment of the present disclosure, calculating a target loss function according to a ternary loss function corresponding to each feature vector and a classification loss function corresponding to each feature vector includes:
carrying out mean value calculation on the ternary loss functions corresponding to the feature vectors to obtain a first mean value;
carrying out mean value calculation on the classification loss functions corresponding to the feature vectors to obtain a second mean value;
a weighted sum of the first mean and the second mean is calculated and determined as the target loss function.
In an exemplary embodiment of the present disclosure, performing living human face detection on a received image to be recognized through a trained living human face detection network includes:
inputting an image to be recognized into a trained human face living body detection network;
generating a plurality of classification results corresponding to the images to be recognized through the trained human face living body detection network;
fusing various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating an identification result of the image to be identified according to a label corresponding to the threshold range.
According to a second aspect of the embodiments of the present disclosure, there is provided a face liveness detection apparatus, the apparatus comprising:
the characteristic sampling unit is used for performing up-sampling processing on the face characteristics through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling characteristics corresponding to the up-sampling layers;
the feature transformation unit is used for carrying out feature transformation on the up-sampling features corresponding to the up-sampling layers through the full connection layer in the human face living body detection network to obtain feature vectors corresponding to the up-sampling features;
the ternary loss function determining unit is used for calculating a ternary loss function according to the non-living body sample vector and the reference living body sample vector and the feature vector corresponding to each up-sampling feature to obtain a ternary loss function corresponding to each feature vector;
the classification loss function determining unit is used for determining the classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network;
and the face in-vivo detection unit is used for training a face in-vivo detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and performing face in-vivo detection on the received image to be recognized through the trained face in-vivo detection network.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the face region identification unit is used for identifying the face region in the sample image before the feature sampling unit performs up-sampling processing on the face features through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling features corresponding to the up-sampling layers;
and the feature extraction unit is used for carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face features.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the operation receiving unit is used for receiving a plurality of marking operations aiming at the image to be marked before the face area identification unit identifies the face area in the sample image;
the annotation result determining unit is used for determining annotation results corresponding to the plurality of annotation operations respectively, and determining the same annotation result with the largest quantity as the final annotation result corresponding to the image to be annotated;
the incidence relation establishing unit is used for establishing incidence relation for associating the image to be annotated and the final annotation result;
and the sample image determining unit is used for determining the image to be annotated corresponding to the association relationship as a sample image.
In an exemplary embodiment of the present disclosure, the convolving processing a face region by a convolutional neural network in a face living body detection network by a feature extraction unit to obtain a face feature includes:
performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of the layers, the input of each layer is the output of the previous layer;
and determining the reference feature corresponding to the last layer in each layer as the face feature.
In an exemplary embodiment of the present disclosure, the training of the face in-vivo detection network by the face in-vivo detection unit according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector includes:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the face living body detection network through the target loss function until the target loss function converges to a preset loss range, so as to realize training of the face living body detection network.
In an exemplary embodiment of the present disclosure, the calculating, by the face living body detecting unit, a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector includes:
carrying out mean value calculation on the ternary loss functions corresponding to the feature vectors to obtain a first mean value;
carrying out mean value calculation on the classification loss functions corresponding to the feature vectors to obtain a second mean value;
a weighted sum of the first mean and the second mean is calculated and determined as the target loss function.
In an exemplary embodiment of the present disclosure, a face living body detection unit performs face living body detection on a received image to be recognized through a trained face living body detection network, including:
inputting an image to be recognized into a trained human face living body detection network;
generating a plurality of classification results corresponding to the images to be recognized through the trained human face living body detection network;
fusing various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating an identification result of the image to be identified according to a label corresponding to the threshold range.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable medium, on which a computer program is stored, which when executed by a processor, implements the face liveness detection method of the first aspect as in the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for detecting a living human face as in the first aspect of the embodiments described above.
According to a fifth aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
technical solutions provided in some embodiments of the present disclosure specifically include: carrying out up-sampling treatment on the face features through a plurality of up-sampling layers in the face in-vivo detection network to obtain up-sampling features corresponding to the up-sampling layers; performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the face living body detection network to obtain feature vectors corresponding to each up-sampling feature; calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector to obtain a ternary loss function corresponding to each feature vector; determining a classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network; and training a face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out face living body detection on the received image to be recognized through the trained face living body detection network. By implementing the embodiment of the disclosure, on one hand, the face living body detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition precision of the face living body detection network is improved, and a living body face is recognized. On the other hand, the data safety in the face living body detection scene can be improved based on the correct recognition of the living body face, and the user rights and interests are guaranteed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically shows a schematic diagram of an exemplary system architecture of a face in-vivo detection method and a face in-vivo detection apparatus to which an embodiment of the present disclosure may be applied;
FIG. 2 schematically illustrates a structural schematic diagram of a computer system suitable for use with an electronic device that implements an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow diagram of a face liveness detection method according to one embodiment of the present disclosure;
fig. 4 schematically shows a structural schematic diagram of a face liveness detection network according to one embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a face liveness detection method according to one embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of a configuration of a living human face detection apparatus in an embodiment according to the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a face live detection method and a face live detection apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like. Wherein the server 105 is configured to perform: carrying out up-sampling treatment on the face features through a plurality of up-sampling layers in the face in-vivo detection network to obtain up-sampling features corresponding to the up-sampling layers; performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the face living body detection network to obtain feature vectors corresponding to each up-sampling feature; calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector to obtain a ternary loss function corresponding to each feature vector; determining a classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network; and training a face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out face living body detection on the received image to be recognized through the trained face living body detection network.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. The RAM 203 also stores various programs and data necessary for system operation. The CPU 201, the ROM 202 and the RAM 203 are connected to each other by a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card or a modem. The communication section 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as necessary. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU)201, performs various functions defined in the methods and apparatus of the present application.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments below. For example, the electronic device may implement the various steps shown in fig. 3, and so on.
The present exemplary embodiment provides a face live detection method, as shown in fig. 3, the face live detection method may include the following steps S310 to S350, specifically:
step S310: the face features are subjected to up-sampling processing through a plurality of up-sampling layers in the face in-vivo detection network, and up-sampling features corresponding to the up-sampling layers are obtained.
Step S320: and performing feature transformation on the upsampling features corresponding to each upsampling layer through a full connection layer in the face living body detection network to obtain feature vectors corresponding to each upsampling feature.
Step S330: and calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector to obtain the ternary loss function corresponding to each feature vector.
Step S340: and determining the classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network.
Step S350: and training a face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out face living body detection on the received image to be recognized through the trained face living body detection network.
It should be noted that the technical solution formed in steps S310 to S350 may be applied to a human face living body detection platform, and the platform may provide a functional interface for application scenarios such as online attendance checking, online payment, and the like.
By implementing the face in-vivo detection method shown in fig. 3, the face in-vivo detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition accuracy of the face in-vivo detection network is improved, and the living face is recognized. In addition, the data safety in the face living body detection scene can be improved based on correct recognition of the living body face, and the user rights and interests are guaranteed.
The above steps of the present exemplary embodiment will be described in more detail below.
As an optional embodiment, before performing upsampling processing on the face features through a plurality of upsampling layers in the face live detection network to obtain upsampling features corresponding to the upsampling layers, the method further includes: identifying a face region in a sample image; and carrying out convolution processing on the face area through a convolution neural network in the face living body detection network to obtain the face characteristics.
Specifically, the area of the face region is smaller than or equal to the area of the sample image, and the face region includes a plurality of face features, such as eye features, nose features, mouth features, and the like.
Therefore, by implementing the optional embodiment, the training efficiency can be improved by identifying the face region in the sample image and taking the face region as the training sample, the influence of the non-face region in the sample image on the training result is avoided, and the identification precision of the face living body detection network is favorably improved.
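For illustration, the following is a minimal sketch of the face region identification step, assuming an off-the-shelf OpenCV Haar cascade detector; the patent does not name a specific detector, so the detector, file name and function names below are standard OpenCV APIs used as assumptions rather than part of the disclosure.

```python
import cv2

def crop_face_region(image_path: str):
    """Detect and crop the largest face region from a sample image."""
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; the sample can be skipped
    # Keep the largest detected box; its area is at most the area of the sample image.
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    return image[y:y + h, x:x + w]
```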
As an alternative embodiment, before identifying the face region in the sample image, the method further includes: receiving a plurality of marking operations aiming at an image to be marked; determining labeling results corresponding to a plurality of labeling operations respectively, and determining the same labeling result with the largest quantity as a final labeling result corresponding to the image to be labeled; establishing an incidence relation for correlating the image to be annotated and the final annotation result; and determining the image to be annotated corresponding to the association relationship as a sample image.
Specifically, the image to be annotated may be an image containing a living human face or a non-living human face, where a living human face may be understood as a real human face; the number of images to be annotated may be one or more, which is not limited in the embodiments of the present application. The annotation operation may be a manual operation, and may specifically be a click operation, a slide operation, a long-press operation, a drag operation, a voice control operation, a gesture operation, or the like. Preferably, the number of annotation operations on the same image to be annotated may be 3 (an odd number), so that a tie between different annotation results (for example, 2 of each type) can be avoided.
Specifically, determining the same annotation result with the largest count as the final annotation result corresponding to the image to be annotated includes: performing identical-result statistics on the annotation results respectively corresponding to the plurality of annotation operations to obtain at least one annotation set, where each annotation set contains the same annotation result (for example, living body face) and different annotation sets correspond to different annotation results (for example, the annotation result contained in annotation set 1 is a living body face and the annotation result contained in annotation set 2 is a non-living body face; the face region corresponding to a non-living body face may come from a video picture, a printed picture or a 3D picture); then counting the number of annotation results in each annotation set, and determining the annotation result of the set with the largest count as the final annotation result corresponding to the image to be annotated.
Further, if the number of the labeling results of each labeling set is equal, the method may further include: feeding back prompt information of abnormal annotation to prompt the annotation personnel to re-annotate the image to be annotated; or determining the labeling result corresponding to the nth labeling operation on the image to be labeled as the final labeling result corresponding to the image to be labeled, wherein N is a positive integer.
Specifically, the establishing of the association relationship for associating the image to be annotated and the final annotation result includes: establishing an association relation for associating the image to be annotated and the final annotation result through a key-value form; the image to be annotated can be represented by a key, and the final annotation result can be represented by a value.
Therefore, by implementing the optional embodiment, the same annotation result with the largest quantity can be selected in the multi-person evaluation of the image to be annotated as the final annotation result of the image to be annotated, so that the reasonability of the annotation result of the image to be annotated can be improved, the annotation accuracy of the sample image can be ensured, and the identification precision of the human face living body detection network trained based on the sample image can be improved.
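A minimal sketch of the majority-vote labeling and the key-value association described above is shown next; the function name, the tie fallback and the use of a plain dictionary are illustrative assumptions.

```python
from collections import Counter

def final_annotation(labels, fallback_index=0):
    """Majority vote over the results of several annotation operations.

    `labels` might be ["live", "live", "non-live"]; when every annotation set
    has the same size (a tie), the result of the N-th annotation operation is
    used as a fallback, mirroring the alternative described above.
    """
    counts = Counter(labels)                      # one "annotation set" per distinct result
    best, best_count = counts.most_common(1)[0]
    if list(counts.values()).count(best_count) > 1:
        return labels[fallback_index]             # tie: fall back to the N-th operation
    return best

# Key-value association relation: the image to be annotated is the key,
# the final annotation result is the value.
sample_images = {"img_001.jpg": final_annotation(["live", "live", "non-live"])}
```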
As an alternative embodiment, performing convolution processing on a face region by using a convolutional neural network in a face living body detection network to obtain a face feature includes: performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of the layers, the input of each layer is the output of the previous layer; and determining the reference feature corresponding to the last layer in each layer as the face feature.
Specifically, the face region may be an RGB (red, green, blue) three-channel image, and the face region is bound to the annotation result of the sample image to which it belongs. The convolutional neural network may be a residual network (ResNet), and the residual network may serve as the backbone of the face living body detection network; preferably, the number of residual units in the residual network may be 18. Alternatively, the convolutional neural network may be another type of network, which is not limited in the embodiments of the present application.
Optionally, before performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer, the method may further include: if the size of the detected face area is larger than the target size, the face area is compressed to the target size (e.g., 224 × 224), and the target size can be represented by the number of image pixels, so that the training efficiency can be improved.
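A small sketch of this optional compression step, assuming OpenCV resizing and the 224 x 224 target size mentioned above:

```python
import cv2

TARGET_SIZE = 224  # target size expressed as a number of image pixels

def normalize_face_region(face_region):
    """Compress the face region to the target size when it is larger than it."""
    h, w = face_region.shape[:2]
    if h > TARGET_SIZE or w > TARGET_SIZE:
        face_region = cv2.resize(face_region, (TARGET_SIZE, TARGET_SIZE))
    return face_region
```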
Specifically, the layers in the convolutional neural network can be sequentially represented as: 1 convolutional layer, 2 max pooling layers (maxpool), 2 bottleneck layers a1, 2 bottleneck layers b1, 2 bottleneck layers c1, and 2 bottleneck layers d1.
The face region may be input into the convolutional layer with a size of 224x224x3; the convolution kernel size of the convolutional layer is 7x7, the sampling interval is 2, the number of channels is 64, and its output is the convolution feature conv1. Further, the convolution feature conv1 may be input into a max pooling layer with a sampling interval of 2, whose output is the pooled feature maxpool. Furthermore, the pooled feature maxpool can be input into the 2 bottleneck layers a1, whose convolution kernel size is 3x3, sampling interval is 1 and number of channels is 64; the output of the latter bottleneck layer a1 is the convolution feature conv2. Furthermore, the convolution feature conv2 may be input into the 2 bottleneck layers b1, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 128; the output of the latter bottleneck layer b1 is the convolution feature conv3. Furthermore, the convolution feature conv3 may be input into the 2 bottleneck layers c1, whose convolution kernel size is 3x3, sampling interval is 1 and number of channels is 256; the output of the latter bottleneck layer c1 is the convolution feature conv4. Furthermore, the convolution feature conv4 may be input into the 2 bottleneck layers d1, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 512; the output of the latter bottleneck layer d1 is the convolution feature conv5.
Therefore, by implementing the optional embodiment, the face features of different scales can be extracted through the multilayer convolution layers, and the face features most suitable for network training can be conveniently obtained, so that the identification precision of the face living body detection network is improved.
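The following is a hedged PyTorch sketch of the backbone described above (conv1, max pooling, and the four pairs of bottleneck layers a1 to d1 with the stated kernel sizes, strides and channel counts); batch normalization, padding and the 1x1 skip projections are common ResNet conventions assumed here rather than details given in the text.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """3x3 convolution block with a residual connection (one "bottleneck layer" above)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch else
                     nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

class Backbone(nn.Module):
    """conv1 -> maxpool -> conv2..conv5, following the channel/stride description above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
                                   nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)
        self.stage_a = nn.Sequential(BasicBlock(64, 64, 1),   BasicBlock(64, 64, 1))    # conv2
        self.stage_b = nn.Sequential(BasicBlock(64, 128, 2),  BasicBlock(128, 128, 1))  # conv3
        self.stage_c = nn.Sequential(BasicBlock(128, 256, 1), BasicBlock(256, 256, 1))  # conv4
        self.stage_d = nn.Sequential(BasicBlock(256, 512, 2), BasicBlock(512, 512, 1))  # conv5

    def forward(self, x):                      # x: (N, 3, 224, 224) face region
        x = self.maxpool(self.conv1(x))
        feats = []
        for stage in (self.stage_a, self.stage_b, self.stage_c, self.stage_d):
            x = stage(x)
            feats.append(x)                    # conv2 .. conv5
        return feats                           # feats[-1] (conv5) is the "face feature"

features = Backbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in features])
```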
In step S310, the face features are upsampled by a plurality of upsampling layers in the living human face detection network, so as to obtain upsampling features corresponding to the upsampling layers.
Specifically, the face living body detection network may be composed of a full convolution network (UNet) and a multilayer perceptron (MLP). UNet is a full convolution network comprising 4 down-sampling layers, 4 up-sampling layers and skip connections, and its convolution layers are completely symmetrical between the down-sampling and up-sampling parts. MLP can be understood as a multi-layer fully connected feed-forward network; typically, after a sample is input into the MLP, it is fed forward layer by layer (i.e., the result is calculated layer by layer from the input layer through the hidden layers to the output layer) to obtain the final output value.
Specifically, the plurality of up-sampling layers include: an up-sampling layer 1 comprising 2 bottleneck layers a2, an up-sampling layer 2 comprising 2 bottleneck layers b2, an up-sampling layer 3 comprising 2 bottleneck layers c2, an up-sampling layer 4 comprising 2 bottleneck layers d2, and an up-sampling layer 5 comprising 2 bottleneck layers e. The convolution feature conv5 output by the bottleneck layer d1 can be used as the input of up-sampling layer 1, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 256; its output is the up-sampling feature deconv1. The up-sampling feature deconv1 can be used as the input of up-sampling layer 2, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 128; its output is the up-sampling feature deconv2. The up-sampling feature deconv2 can be used as the input of up-sampling layer 3, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 64; its output is the up-sampling feature deconv3. The up-sampling feature deconv3 can be used as the input of up-sampling layer 4, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 64; its output is the up-sampling feature deconv4. The up-sampling feature deconv4 can be used as the input of up-sampling layer 5, whose convolution kernel size is 3x3, sampling interval is 2 and number of channels is 1; its output is the up-sampling feature deconv5.
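A hedged PyTorch sketch of the five up-sampling layers with the channel plan 256, 128, 64, 64, 1 follows; composing each layer from one transposed convolution plus one 3x3 convolution, and omitting the UNet skip connections, are simplifying assumptions.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    """One up-sampling layer: a stride-2 transposed 3x3 convolution followed by a
    3x3 convolution, standing in for the two bottleneck layers described above."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.ReLU(inplace=True))

class Decoder(nn.Module):
    """deconv1..deconv5 with the channel plan 512 -> 256 -> 128 -> 64 -> 64 -> 1."""
    def __init__(self):
        super().__init__()
        plan = [(512, 256), (256, 128), (128, 64), (64, 64), (64, 1)]
        self.layers = nn.ModuleList([up_block(i, o) for i, o in plan])

    def forward(self, conv5):
        outs, x = [], conv5
        for layer in self.layers:
            x = layer(x)
            outs.append(x)            # deconv1 .. deconv5
        return outs

# Example: a 512-channel conv5 map of spatial size 7x7.
deconvs = Decoder()(torch.randn(1, 512, 7, 7))
print([d.shape for d in deconvs])
```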
Referring to fig. 4, fig. 4 schematically illustrates a structural schematic diagram of a face living body detection network according to an embodiment of the present disclosure. The face living body detection network is used to learn the difference between a living face and a non-living face, where the difference may be represented as a fraud matrix (spoof_cue_map); generally, the fraud matrix corresponding to a face region containing a living face is a zero matrix, and the fraud matrix corresponding to a face region containing a non-living face is a non-zero matrix. As shown in fig. 4, the face living body detection network includes a feature generation module E1 410, a feature generation module E2 420, a feature generation module E3 430, a feature generation module E4 440, a feature generation module E5 450, a network decoding module D1 460, a network decoding module D2 470, a network decoding module D3 480, and a network decoding module D4 490.
The feature generation module E1410, the feature generation module E2420, the feature generation module E3430, the feature generation module E4440, and the feature generation module E5450 are configured to perform feature convolution with different scales, each feature generation module may be understood as one convolutional neural network described above, and optionally, the feature generation module may also be a certain layer in the convolutional neural network. The network decoding module D1460, the network decoding module D2470, the network decoding module D3480 and the network decoding module D4490 are used for performing feature upsampling of different scales. The number of the feature generation modules and the number of the network decoding modules are only schematically shown in fig. 4, and in the actual application process, the number of the feature generation modules and the number of the network decoding modules are not limited in the present application.
Specifically, when the face region is acquired, it may be input to the feature generation modules, so that the feature generation module E1 410, the feature generation module E2 420, the feature generation module E3 430, the feature generation module E4 440 and the feature generation module E5 450 perform feature transformation in sequence, where the input of each feature generation module is the output of the previous one. Furthermore, the last feature generation module may pass its feature extraction result to the network decoding modules, so that the network decoding module D1 460, the network decoding module D2 470, the network decoding module D3 480 and the network decoding module D4 490 sequentially perform feature up-sampling, where the input of each network decoding module is the output of the previous network decoding module together with the output of the feature generation module at the symmetric network level. Furthermore, the up-sampling results of the network decoding modules D1 460, D2 470, D3 480 and D4 490 may be classified by MLPs to obtain the classification results F1, F2, F3 and F4 corresponding to them respectively, so that a ternary loss function and a classification loss function may be calculated from F1, F2, F3 and F4, and the face living body detection network may be trained with the ternary loss function and the classification loss function, enabling it to learn the difference between the living face and the non-living face expressed as the fraud matrix (spoof_cue_map).
In step S320, feature transformation is performed on the upsampling features corresponding to each upsampling layer through a full connection layer in the human face living body detection network, so as to obtain feature vectors corresponding to each upsampling feature.
Specifically, the feature transformation of the upsampling features corresponding to each upsampling layer is performed through a full connection layer in the human face living body detection network, and the feature transformation comprises the following steps: and performing feature transformation on the upsampling features deconv1, deconv2, deconv3, deconv4 and deconv5 corresponding to each upsampling layer to obtain a feature vector V1, a feature vector V2, a feature vector V3, a feature vector V4 and a feature vector V5 with the scale of 512 respectively. In the embodiment of the present application, the preferred scale of each feature vector is 512, and the scale of each feature vector in the actual application process is not limited.
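A small sketch of the full connection step that turns one up-sampled feature map into a 512-dimensional feature vector; the adaptive pooling used to obtain a fixed input size is an assumption, since the text does not describe how varying spatial sizes are handled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_feature_vector(upsampled, fc=None, dim=512):
    """Project one up-sampled feature map to a 512-dimensional feature vector
    through a fully connected layer."""
    pooled = F.adaptive_avg_pool2d(upsampled, 4)   # fixed 4x4 spatial size (assumption)
    flat = pooled.flatten(1)                       # (N, C * 16)
    fc = fc if fc is not None else nn.Linear(flat.shape[1], dim)
    return fc(flat)                                # (N, 512)

v1 = to_feature_vector(torch.randn(2, 256, 14, 14))
print(v1.shape)   # torch.Size([2, 512])
```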
In step S330, a ternary loss function is calculated for the feature vector corresponding to each upsampling feature according to the non-live sample vector and the reference live sample vector, so as to obtain a ternary loss function corresponding to each feature vector.
Specifically, the non-living sample vector (Vspoof) and the reference living sample vector (Vanchor) may be preset vectors; correspondingly, the feature vector V1, the feature vector V2, the feature vector V3, the feature vector V4 and the feature vector V5 may all participate in the computation of the ternary loss function as the face sample vector Vlive. Optionally, the expression corresponding to the ternary loss function (TripletLoss) may be TripletLoss = min(||Vanchor - Vlive||2 - ||Vanchor - Vspoof||2), where ||·||2 denotes the L2 norm.
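A PyTorch sketch of the ternary (triplet) loss based on the expression above; the margin and the clamping at zero follow the usual hinge formulation and are assumptions beyond the quoted expression.

```python
import torch

def triplet_loss(v_live, v_anchor, v_spoof, margin=0.0):
    """Ternary (triplet) loss over a batch of feature vectors.

    The distance between the reference living sample vector Vanchor and the face
    sample vector Vlive should be smaller than the distance between Vanchor and
    the non-living sample vector Vspoof.  The margin and the clamping at zero
    follow the usual hinge form and go beyond the expression quoted above.
    """
    d_pos = torch.norm(v_anchor - v_live, p=2, dim=-1)
    d_neg = torch.norm(v_anchor - v_spoof, p=2, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

loss1 = triplet_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```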
In step S340, a classification loss function of each feature vector is determined by the multilayer perceptron in the human face living body detection network.
Specifically, determining the classification loss function of each feature vector through the multilayer perceptrons in the face living body detection network includes the following steps: respectively determining, through the plurality of multilayer perceptrons in the face living body detection network, the classification result of the feature vector corresponding to each multilayer perceptron, so as to obtain the classification results corresponding to the plurality of multilayer perceptrons; and calculating a classification loss function (CrossEntropyLoss) for each classification result.
The classification result and the feature vector are in a one-to-one correspondence relationship, the classification result can be represented as a probability set, and each probability in the probability set is used for representing the probability that the face region belongs to each preset category (such as a living body face category and a non-living body face category); the multi-layered perceptron (MLP) is a feedforward artificial neural network model, which is used for feature classification.
The classification loss function is preferably a cross-entropy loss function (CrossEntropyLoss); alternatively, the classification loss function may be a 0-1 loss function (zero-one loss), an absolute loss function, a logarithmic (log) loss function, a square loss function, an exponential loss function, a hinge loss function, or a perceptual loss function, which is not limited in this embodiment.
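A sketch of one multilayer perceptron head and the per-vector cross-entropy loss follows; the hidden width and the number of heads shown are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """A small multilayer perceptron classifying one 512-d feature vector as
    living / non-living; the hidden width of 128 is an illustrative choice."""
    def __init__(self, in_dim=512, hidden=128, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, num_classes))

    def forward(self, v):
        return self.net(v)            # logits; softmax gives the probability set

heads = nn.ModuleList([MLPHead() for _ in range(5)])   # one perceptron per feature vector
criterion = nn.CrossEntropyLoss()
labels = torch.tensor([1, 0])                           # 1 = living face, 0 = non-living face
vectors = [torch.randn(2, 512) for _ in range(5)]
cls_losses = [criterion(head(v), labels) for head, v in zip(heads, vectors)]
```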
In step S350, a face living body detection network is trained according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and face living body detection is performed on the received image to be recognized through the trained face living body detection network.
As an optional embodiment, the training of the human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector includes: calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector; and adjusting network parameters in the face living body detection network through the target loss function until the target loss function converges to a preset loss range, so as to realize training of the face living body detection network.
Specifically, the network parameters in the human face living body detection network at least include a weight value and a bias item, and the preset loss range is a numerical range set artificially.
Therefore, by implementing the optional embodiment, the convergence speed of the face living body detection network can be improved and the identification precision of the face living body detection network can be improved by combining the loss functions.
As an optional embodiment, calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector includes: carrying out mean value calculation on the ternary loss functions corresponding to the feature vectors to obtain a first mean value; carrying out mean value calculation on the classification loss functions corresponding to the feature vectors to obtain a second mean value; a weighted sum of the first mean and the second mean is calculated and determined as the target loss function.
Specifically, calculating the mean of the ternary loss functions corresponding to the feature vectors to obtain a first mean value includes: calculating (TripletLoss1 + TripletLoss2 + … + TripletLossn)/n from the ternary loss functions corresponding to the feature vectors to obtain the first mean value, where n is a positive integer. Calculating the mean of the classification loss functions corresponding to the feature vectors to obtain a second mean value includes: calculating (CrossEntropyLoss1 + CrossEntropyLoss2 + … + CrossEntropyLossn)/n from the classification loss functions corresponding to the feature vectors to obtain the second mean value, where n is a positive integer. Calculating a weighted sum of the first mean value and the second mean value and determining the weighted sum as the target loss function includes: calculating the target loss function Loss = W1 × first mean value + W2 × second mean value from the first mean value and the second mean value, where W1 and W2 are constants, W1 is the weight of the ternary loss function and W2 is the weight of the classification loss function.
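For concreteness, the mean-and-weighted-sum combination described above may be sketched as follows; the default weights W1 = W2 = 1.0 are illustrative placeholders, since the present disclosure only states that they are constants.

import torch

def target_loss(triplet_losses, classification_losses, w1=1.0, w2=1.0):
    # triplet_losses / classification_losses: lists with one scalar tensor per
    # feature vector (i.e. per upsampling branch).
    first_mean = torch.stack(triplet_losses).mean()          # (TripletLoss1 + ... + TripletLossn) / n
    second_mean = torch.stack(classification_losses).mean()  # (CrossEntropyLoss1 + ... + CrossEntropyLossn) / n
    return w1 * first_mean + w2 * second_mean                # Loss = W1 x first mean + W2 x second mean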
Therefore, by implementing the optional embodiment, the target loss function suitable for training the face living body detection network can be calculated, so that the identification precision of the face living body detection network is improved.
As an alternative embodiment, performing living human face detection on a received image to be recognized through a trained living human face detection network includes: inputting an image to be recognized into a trained human face living body detection network; generating a plurality of classification results corresponding to the images to be recognized through the trained human face living body detection network; fusing various classification results to obtain a reference result; and determining a threshold range to which the reference result belongs, and generating an identification result of the image to be identified according to a label corresponding to the threshold range.
Specifically, the image to be recognized is input into the trained face living body detection network, and a plurality of classification results C1, C2, C3, C4 and C5 corresponding to the image to be recognized are generated through the trained face living body detection network. Further, fusing the plurality of classification results to obtain a reference result includes: calculating the reference result from C1 to C5 according to a preset fusion expression. Further, determining the threshold range to which the reference result belongs and generating the recognition result of the image to be recognized according to the label corresponding to that threshold range includes: if the reference result falls within the first threshold range, generating the recognition result of the image to be recognized according to the label corresponding to that range (for example, the image to be recognized contains a living body); if the reference result falls within the second threshold range, generating the recognition result of the image to be recognized according to the label corresponding to that range (for example, the image to be recognized does not contain a living body). If the image to be recognized does not contain a living body, the method may further include: uploading the image to be recognized to a cloud server for storage and feeding back alarm information, so as to remind relevant personnel to pay attention to the current abnormal situation.
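As a minimal sketch of this inference step, assuming purely for illustration that the fusion expression is the arithmetic mean of the per-branch living body probabilities and that 0.5 separates the two threshold ranges (the actual expression and thresholds are those defined by the present disclosure), the fusion and labelling could be written as:

import torch

def recognize(classification_results, threshold=0.5):
    # classification_results: list of probability tensors of shape (2,), e.g.
    # [C1, C2, C3, C4, C5]; index 1 is assumed to be the living body face class.
    live_probs = torch.stack([c[1] for c in classification_results])
    reference = live_probs.mean()            # fused reference result (assumption: arithmetic mean)
    if reference.item() >= threshold:
        return "the image to be recognized contains a living body"
    return "the image to be recognized does not contain a living body"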
Therefore, by implementing this optional embodiment, improving the recognition accuracy of the face living body detection network helps close security loopholes in the field of face living body detection, improves the security of face living body detection, and improves system robustness when the network is applied in a face living body detection system.
Referring to fig. 5, fig. 5 schematically shows a flowchart of a face liveness detection method according to an embodiment of the present disclosure. As shown in fig. 5, the face live detection method includes: step S510 to step S590.
Step S510: receiving a plurality of annotation operations aiming at the image to be annotated, determining annotation results corresponding to the plurality of annotation operations respectively, determining the same annotation result with the largest quantity as a final annotation result corresponding to the image to be annotated, further establishing an association relation for associating the image to be annotated and the final annotation result, and determining the image to be annotated corresponding to the association relation as a sample image.
Step S520: identifying the face region in the sample image, performing feature transformation on the face region through each layer in the convolutional neural network to obtain the reference feature corresponding to each layer, where, based on the arrangement order of the layers, the input of each layer is the output of the previous layer, and further determining the reference feature corresponding to the last layer as the face feature.
Step S530: performing up-sampling processing on the face features through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling features corresponding to the up-sampling layers.
Step S540: performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the face living body detection network to obtain feature vectors corresponding to each up-sampling feature.
Step S550: calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector to obtain the ternary loss function corresponding to each feature vector.
Step S560: determining the classification loss function of each feature vector through the multilayer perceptrons in the face living body detection network.
Step S570: calculating the mean of the ternary loss functions corresponding to the feature vectors to obtain a first mean value, calculating the mean of the classification loss functions corresponding to the feature vectors to obtain a second mean value, further calculating the weighted sum of the first mean value and the second mean value, and determining the weighted sum as a target loss function.
Step S580: adjusting network parameters in the face living body detection network through the target loss function until the target loss function converges to a preset loss range, so as to complete training of the face living body detection network.
Step S590: inputting the image to be recognized into the trained face living body detection network, generating a plurality of classification results corresponding to the image to be recognized through the trained face living body detection network, fusing the classification results to obtain a reference result, determining a threshold range to which the reference result belongs, and generating a recognition result of the image to be recognized according to a label corresponding to the threshold range.
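To make steps S530 to S560 above concrete, a minimal PyTorch sketch of the detection head is given below; the number of upsampling layers, channel counts, feature dimensions and the spatial pooling before the full connection layer are assumptions rather than values fixed by the present disclosure.

import torch
import torch.nn as nn

class LivenessHead(nn.Module):
    # Sketch of steps S530-S560: several upsampling layers, a full connection
    # layer producing one feature vector per layer (read here as shared), and
    # one MLP per feature vector for classification. Sizes are illustrative.
    def __init__(self, in_channels=512, num_layers=3, feat_dim=256, num_classes=2):
        super().__init__()
        self.upsamplers = nn.ModuleList([
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_layers)
        ])
        self.fc = nn.Linear(in_channels, feat_dim)
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(inplace=True), nn.Linear(128, num_classes))
            for _ in range(num_layers)
        ])

    def forward(self, face_features):
        feature_vectors, logits = [], []
        x = face_features
        for upsample, mlp in zip(self.upsamplers, self.mlps):
            x = upsample(x)                        # step S530: upsampling feature for this layer
            v = self.fc(x.mean(dim=(2, 3)))        # step S540: pooled (assumption), then full connection layer -> feature vector
            feature_vectors.append(v)
            logits.append(mlp(v))                  # step S560: per-branch classification
        return feature_vectors, logits

The ternary loss of step S550 would then be computed per feature vector against a reference living body sample vector and a non-living body sample vector, for example with torch.nn.TripletMarginLoss.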
It should be noted that steps S510 to S590 correspond to the steps and the embodiment shown in fig. 3, and for the specific implementation of steps S510 to S590, please refer to the steps and the embodiment shown in fig. 3, which is not described herein again.
Therefore, by implementing the face in-vivo detection method shown in fig. 5, the face in-vivo detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition accuracy of the face in-vivo detection network is improved, and the living face is recognized. In addition, the data safety in the face living body detection scene can be improved based on correct recognition of the living body face, and the user rights and interests are guaranteed.
Further, in the present exemplary embodiment, there is also provided a face liveness detection apparatus, and as shown in fig. 6, the face liveness detection apparatus 600 may include:
the feature sampling unit 601 is configured to perform upsampling processing on the face features through multiple upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers;
a feature transformation unit 602, configured to perform feature transformation on the upsampling features corresponding to each upsampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to each upsampling feature;
a ternary loss function determining unit 603, configured to calculate a ternary loss function for the feature vector corresponding to each upsampling feature according to the non-living sample vector and the reference living sample vector, so as to obtain a ternary loss function corresponding to each feature vector;
a classification loss function determining unit 604, configured to determine a classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network;
and the face in-vivo detection unit 605 is configured to train a face in-vivo detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and perform face in-vivo detection on the received image to be recognized through the trained face in-vivo detection network.
Therefore, by implementing the device shown in fig. 6, the face living body detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition accuracy of the face living body detection network is improved, and the living body face is recognized. In addition, the data safety in the face living body detection scene can be improved based on correct recognition of the living body face, and the user rights and interests are guaranteed.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
a face region identification unit (not shown) configured to identify a face region in the sample image before the feature sampling unit 601 performs upsampling processing on the face features through a plurality of upsampling layers in the face live detection network to obtain upsampling features corresponding to the upsampling layers;
and the feature extraction unit (not shown) is used for performing convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face features.
Therefore, by implementing the optional embodiment, the training efficiency can be improved by identifying the face region in the sample image and taking the face region as the training sample, the influence of the non-face region in the sample image on the training result is avoided, and the identification precision of the face living body detection network is favorably improved.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
an operation receiving unit (not shown) for receiving a plurality of annotation operations for the image to be annotated before the face region identification unit identifies the face region in the sample image;
an annotation result determining unit (not shown) configured to determine annotation results corresponding to the multiple annotation operations, and determine the same annotation result with the largest number as a final annotation result corresponding to the image to be annotated;
an association relationship establishing unit (not shown) for establishing an association relationship for associating the image to be annotated and the final annotation result;
and a sample image determining unit (not shown) for determining the image to be annotated corresponding to the association relationship as a sample image.
Therefore, by implementing the optional embodiment, the same annotation result with the largest quantity can be selected in the multi-person evaluation of the image to be annotated as the final annotation result of the image to be annotated, so that the reasonability of the annotation result of the image to be annotated can be improved, the annotation accuracy of the sample image can be ensured, and the identification precision of the human face living body detection network trained based on the sample image can be improved.
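A minimal sketch of this majority-vote rule (plain Python; the label strings are hypothetical):

from collections import Counter

def final_annotation(annotation_results):
    # Return the identical annotation result that occurs most often, e.g.
    # ["living body", "living body", "non-living body"] -> "living body".
    label, _ = Counter(annotation_results).most_common(1)[0]
    return label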
In an exemplary embodiment of the present disclosure, the feature extraction unit performing convolution processing on the face region through the convolutional neural network in the face living body detection network to obtain the face features includes:
performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of the layers, the input of each layer is the output of the previous layer;
and determining the reference feature corresponding to the last layer in each layer as the face feature.
Therefore, by implementing the optional embodiment, the face features of different scales can be extracted through the multilayer convolution layers, and the face features most suitable for network training can be conveniently obtained, so that the identification precision of the face living body detection network is improved.
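As one hedged illustration of this layer-by-layer transformation (the backbone depth and channel sizes are assumptions, not values fixed by the present disclosure), the face feature could be taken from the last layer as follows:

import torch.nn as nn

class Backbone(nn.Module):
    # Each layer transforms the output of the previous layer; the reference
    # feature of the last layer is returned as the face feature.
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True))
            for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]
        ])

    def forward(self, face_region):
        x = face_region
        for layer in self.layers:      # the input of each layer is the output of the previous layer
            x = layer(x)
        return x                       # reference feature of the last layer = face feature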
In an exemplary embodiment of the present disclosure, the training of the face in-vivo detection network by the face in-vivo detection unit 605 according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector includes:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the face living body detection network through the target loss function until the target loss function converges to a preset loss range, so as to realize training of the face living body detection network.
Therefore, by implementing the optional embodiment, the convergence speed of the face living body detection network can be improved and the identification precision of the face living body detection network can be improved by combining the loss functions.
In an exemplary embodiment of the present disclosure, the face living body detection unit 605 calculates a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector, including:
carrying out mean value calculation on the ternary loss functions corresponding to the feature vectors to obtain a first mean value;
carrying out mean value calculation on the classification loss functions corresponding to the feature vectors to obtain a second mean value;
a weighted sum of the first mean and the second mean is calculated and determined as the target loss function.
Therefore, by implementing the optional embodiment, the target loss function suitable for training the face living body detection network can be calculated, so that the identification precision of the face living body detection network is improved.
In an exemplary embodiment of the present disclosure, the face live detection unit 605 performs face live detection on the received image to be recognized through the trained face live detection network, including:
inputting an image to be recognized into a trained human face living body detection network;
generating a plurality of classification results corresponding to the images to be recognized through the trained human face living body detection network;
fusing various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating an identification result of the image to be identified according to a label corresponding to the threshold range.
Therefore, by implementing the optional embodiment, the security loophole in the field of face living body detection can be perfected by improving the identification precision of the face living body detection network, the face living body detection security can be improved, and the system robustness can be improved when the face living body detection network is applied to a face living body detection system.
It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
As each functional module of the face live-body detection device of the exemplary embodiment of the present disclosure corresponds to the steps of the exemplary embodiment of the face live-body detection method, please refer to the embodiment of the face live-body detection method of the present disclosure for details that are not disclosed in the embodiment of the apparatus of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A face living body detection method is characterized by comprising the following steps:
carrying out up-sampling processing on the face features through a plurality of up-sampling layers in the face in-vivo detection network to obtain up-sampling features corresponding to the up-sampling layers;
performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to each up-sampling feature;
calculating a ternary loss function for the feature vector corresponding to each upsampling feature according to the non-living sample vector and the reference living sample vector to obtain a ternary loss function corresponding to each feature vector;
determining a classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network;
and training the face in-vivo detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and performing face in-vivo detection on the received image to be recognized through the trained face in-vivo detection network.
2. The method according to claim 1, wherein before the upsampling processing is performed on the face features through a plurality of upsampling layers in the face live detection network to obtain the upsampling features corresponding to each upsampling layer, the method further comprises:
identifying a face region in a sample image;
and carrying out convolution processing on the face area through a convolution neural network in the face living body detection network to obtain the face features.
3. The method of claim 2, wherein prior to identifying the face region in the sample image, the method further comprises:
receiving a plurality of marking operations aiming at an image to be marked;
determining labeling results corresponding to a plurality of labeling operations respectively, and determining the same labeling result with the largest quantity as a final labeling result corresponding to the image to be labeled;
establishing an association relation for associating the image to be annotated and the final annotation result;
and determining the image to be annotated corresponding to the incidence relation as the sample image.
4. The method of claim 2, wherein performing convolution processing on the face region through a convolutional neural network in the face in-vivo detection network to obtain the face features comprises:
performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of the layers, the input of each layer is the output of the previous layer;
and determining the reference feature corresponding to the last layer in the layers as the face feature.
5. The method of claim 1, wherein training the face in-vivo detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector comprises:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the face living body detection network through the target loss function until the target loss function converges to a preset loss range, so as to realize the training of the face living body detection network.
6. The method of claim 1, wherein computing a target loss function based on the ternary loss function for each feature vector and the classification loss function for each feature vector comprises:
carrying out mean value calculation on the ternary loss functions corresponding to the feature vectors to obtain a first mean value;
performing mean value calculation on the classification loss functions corresponding to the feature vectors to obtain a second mean value;
calculating a weighted sum of the first mean and the second mean, and determining the weighted sum as the objective loss function.
7. The method of claim 6, wherein the face live detection of the received image to be recognized through the trained face live detection network comprises:
inputting the image to be recognized into a trained human face living body detection network;
generating a plurality of classification results corresponding to the images to be recognized through a trained human face living body detection network;
fusing the various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating an identification result of the image to be identified according to a label corresponding to the threshold range.
8. A face liveness detection device, comprising:
the characteristic sampling unit is used for performing up-sampling processing on the face characteristics through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling characteristics corresponding to the up-sampling layers;
the feature transformation unit is used for carrying out feature transformation on the up-sampling features corresponding to the up-sampling layers through the full connection layer in the human face living body detection network to obtain feature vectors corresponding to the up-sampling features;
the ternary loss function determining unit is used for calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living body sample vector and the reference living body sample vector to obtain a ternary loss function corresponding to each feature vector;
the classification loss function determining unit is used for determining the classification loss function of each feature vector through a multilayer perceptron in the human face living body detection network;
and the face living body detection unit is used for training the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out face living body detection on the received image to be recognized through the trained face living body detection network.
9. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, implements the face liveness detection method according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of live human face detection as claimed in any one of claims 1 to 7.
CN202110648875.2A 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment Active CN113283376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110648875.2A CN113283376B (en) 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110648875.2A CN113283376B (en) 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment

Publications (2)

Publication Number Publication Date
CN113283376A true CN113283376A (en) 2021-08-20
CN113283376B CN113283376B (en) 2024-02-09

Family

ID=77284148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110648875.2A Active CN113283376B (en) 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment

Country Status (1)

Country Link
CN (1) CN113283376B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120179742A1 (en) * 2011-01-11 2012-07-12 Videonetics Technology Private Limited Integrated intelligent server based system and method/systems adapted to facilitate fail-safe integration and/or optimized utilization of various sensory inputs
CN108416324A (en) * 2018-03-27 2018-08-17 百度在线网络技术(北京)有限公司 Method and apparatus for detecting live body
CN112070058A (en) * 2020-09-18 2020-12-11 深延科技(北京)有限公司 Face and face composite emotional expression recognition method and system
CN112597885A (en) * 2020-12-22 2021-04-02 北京华捷艾米科技有限公司 Face living body detection method and device, electronic equipment and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
钟跃崎 (Zhong Yueqi), ed.: "人工智能技术原理与应用" (Principles and Applications of Artificial Intelligence Technology), 东华大学出版社 (Donghua University Press), pages 226-228 *

Also Published As

Publication number Publication date
CN113283376B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN111191791B (en) Picture classification method, device and equipment based on machine learning model
CN108427939B (en) Model generation method and device
CN108509915B (en) Method and device for generating face recognition model
CN112766158B (en) Multi-task cascading type face shielding expression recognition method
US10936912B2 (en) Image classification using a mask image and neural networks
WO2018166114A1 (en) Picture identification method and system, electronic device, and medium
CN110866471A (en) Face image quality evaluation method and device, computer readable medium and communication terminal
JP2022058915A (en) Method and device for training image recognition model, method and device for recognizing image, electronic device, storage medium, and computer program
CN110363084A (en) A kind of class state detection method, device, storage medium and electronics
CN108229375B (en) Method and device for detecting face image
WO2022188697A1 (en) Biological feature extraction method and apparatus, device, medium, and program product
CN109214501B (en) Method and apparatus for identifying information
CN108388889B (en) Method and device for analyzing face image
CN114549840A (en) Training method of semantic segmentation model and semantic segmentation method and device
CN108399401B (en) Method and device for detecting face image
CN108229680B (en) Neural network system, remote sensing image recognition method, device, equipment and medium
CN112651311A (en) Face recognition method and related equipment
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
CN113283376B (en) Face living body detection method, face living body detection device, medium and equipment
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
CN111259698A (en) Method and device for acquiring image
CN114332993A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN115273224A (en) High-low resolution bimodal distillation-based video human body behavior identification method
CN113971830A (en) Face recognition method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant