CN113283376B - Face living body detection method, face living body detection device, medium and equipment - Google Patents

Face living body detection method, face living body detection device, medium and equipment

Info

Publication number
CN113283376B
CN113283376B
Authority
CN
China
Prior art keywords
living body
loss function
body detection
face
feature
Prior art date
Legal status
Active
Application number
CN202110648875.2A
Other languages
Chinese (zh)
Other versions
CN113283376A (en)
Inventor
喻庐军 (Yu Lujun)
韩森尧 (Han Senyao)
李驰 (Li Chi)
刘岩 (Liu Yan)
Current Assignee
Taikang Insurance Group Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd
Priority to CN202110648875.2A
Publication of CN113283376A
Application granted
Publication of CN113283376B
Status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 - Spoof detection, e.g. liveness detection
    • G06V40/45 - Detection of the body part being alive

Abstract

The embodiments of the disclosure provide a face living body detection method, a face living body detection device, a computer-readable medium and an electronic device, relating to the technical field of artificial intelligence. The method includes the following steps: performing upsampling on the face features through a plurality of upsampling layers in the face living body detection network; performing feature transformation on the upsampling features corresponding to each upsampling layer through the full connection layer; calculating a ternary (triplet) loss function for the feature vectors corresponding to the upsampling features according to the non-living sample vector and the reference living sample vector; determining a classification loss function for each feature vector through a multi-layer perceptron; and training the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, then performing face living body detection on a received image to be identified through the trained network. By implementing this technical scheme, the recognition accuracy of the face living body detection network can be improved so that living faces are correctly recognized.

Description

Face living body detection method, face living body detection device, medium and equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to a face living body detection method, a face living body detection device, a computer readable medium and electronic equipment.
Background
With the continuous development of computer technology, face living body detection is applied in more and more scenarios, for example in attendance-checking software, payment software and social software. However, some illegal users abuse the face living body detection function through abnormal means to pass the legality verification, which can compromise the data security of the software. For example, an illegal user may present a photograph of a legitimate user to pass face living body detection verification. Given these problems, how to recognize a living face has become a problem that urgently needs to be solved.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of an embodiment of the present disclosure is to provide a face living body detection method, a face living body detection apparatus, a computer readable medium, and an electronic device, which can train a face living body detection network based on a fusion result of a ternary loss function and a classification loss function, thereby improving recognition accuracy of the face living body detection network to recognize a living body face.
A first aspect of an embodiment of the present disclosure provides a face living body detection method, including:
performing upsampling processing on the face features through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers;
performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to the up-sampling features;
according to the non-living body sample vector and the reference living body sample vector, calculating a ternary loss function for the feature vector corresponding to each up-sampling feature, and obtaining a ternary loss function corresponding to each feature vector;
determining a classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network;
training a human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out human face living body detection on the received image to be identified through the trained human face living body detection network.
In an exemplary embodiment of the present disclosure, before performing upsampling processing on a face feature through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers, the method further includes:
identifying a face region in the sample image;
and carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face characteristics.
In an exemplary embodiment of the present disclosure, before identifying the face region in the sample image, the method further includes:
receiving a plurality of labeling operations for an image to be labeled;
determining the labeling result corresponding to each labeling operation, and determining the labeling result that occurs most frequently as the final labeling result corresponding to the image to be labeled;
establishing an association relation that associates the image to be labeled with the final labeling result;
and determining the image to be labeled that corresponds to the association relation as a sample image.
In an exemplary embodiment of the present disclosure, a face region is convolved by a convolutional neural network in a face living body detection network to obtain a face feature, including:
carrying out feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of each layer, the input of each layer is the output of the upper layer;
and determining the reference characteristic corresponding to the last layer in each layer as the face characteristic.
In one exemplary embodiment of the present disclosure, training the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector includes:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the human face living body detection network through the target loss function until the target loss function is converged to a preset loss range, so as to realize training of the human face living body detection network.
In one exemplary embodiment of the present disclosure, calculating the target loss function from the ternary loss function for each feature vector and the classification loss function for each feature vector includes:
carrying out average value calculation on the ternary loss function corresponding to each feature vector to obtain a first average value;
carrying out average value calculation on the classification loss function corresponding to each feature vector to obtain a second average value;
a weighted sum of the first mean and the second mean is calculated and the weighted sum is determined as a target loss function.
In an exemplary embodiment of the present disclosure, performing face living body detection on a received image to be recognized through the trained face living body detection network includes:
inputting an image to be identified into the trained face living body detection network;
generating a plurality of classification results corresponding to the images to be identified through the trained human face living body detection network;
fusing various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating a recognition result of the image to be recognized according to the label corresponding to the threshold range.
According to a second aspect of the embodiments of the present disclosure, there is provided a face living body detection apparatus, the apparatus including:
the feature sampling unit is used for carrying out up-sampling processing on the face features through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling features corresponding to the up-sampling layers;
the feature transformation unit is used for carrying out feature transformation on the up-sampling features corresponding to each up-sampling layer through the full connection layer in the human face living body detection network to obtain feature vectors corresponding to the up-sampling features;
the ternary loss function determining unit is used for calculating a ternary loss function for the feature vectors corresponding to the up-sampling features according to the non-living body sample vector and the reference living body sample vector to obtain a ternary loss function corresponding to each feature vector;
the classification loss function determining unit is used for determining the classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network;
the face living body detection unit is used for training the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and for performing face living body detection on the received image to be identified through the trained face living body detection network.
In an exemplary embodiment of the present disclosure, the above apparatus further includes:
the face region identification unit is used for identifying the face region in the sample image before the feature sampling unit carries out up-sampling processing on the face features through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling features corresponding to the up-sampling layers;
the feature extraction unit is used for carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain face features.
In an exemplary embodiment of the present disclosure, the above apparatus further includes:
an operation receiving unit for receiving a plurality of labeling operations for the image to be labeled before the face region recognition unit recognizes the face region in the sample image;
a labeling result determining unit for determining the labeling result corresponding to each labeling operation, and for determining the labeling result that occurs most frequently as the final labeling result corresponding to the image to be labeled;
an association relation establishing unit for establishing an association relation that associates the image to be labeled with the final labeling result;
and a sample image determining unit for determining the image to be labeled that corresponds to the association relation as a sample image.
In an exemplary embodiment of the present disclosure, a feature extraction unit performs convolution processing on a face region through a convolutional neural network in a face living body detection network to obtain a face feature, including:
carrying out feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of each layer, the input of each layer is the output of the upper layer;
and determining the reference characteristic corresponding to the last layer in each layer as the face characteristic.
In an exemplary embodiment of the present disclosure, the face living body detection unit trains the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, including:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the human face living body detection network through the target loss function until the target loss function is converged to a preset loss range, so as to realize training of the human face living body detection network.
In an exemplary embodiment of the present disclosure, the face living body detection unit calculates the target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector, including:
carrying out average value calculation on the ternary loss function corresponding to each feature vector to obtain a first average value;
carrying out average value calculation on the classification loss function corresponding to each feature vector to obtain a second average value;
a weighted sum of the first mean and the second mean is calculated and the weighted sum is determined as a target loss function.
In an exemplary embodiment of the present disclosure, the face living body detection unit performs face living body detection on the received image to be recognized through the trained face living body detection network, including:
inputting an image to be identified into the trained face living body detection network;
generating a plurality of classification results corresponding to the images to be identified through the trained human face living body detection network;
fusing various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating a recognition result of the image to be recognized according to the label corresponding to the threshold range.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the face living body detection method of the first aspect in the above embodiments.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the face living body detection method of the first aspect in the above embodiments.
According to a fifth aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in the various alternative implementations described above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
the technical solutions provided in some embodiments of the present disclosure specifically include: performing upsampling processing on the face features through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers; performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to the up-sampling features; according to the non-living body sample vector and the reference living body sample vector, calculating a ternary loss function for the feature vector corresponding to each up-sampling feature, and obtaining a ternary loss function corresponding to each feature vector; determining a classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network; training a human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out human face living body detection on the received image to be identified through the trained human face living body detection network. By implementing the embodiment of the disclosure, on one hand, the face living body detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition accuracy of the face living body detection network is improved, and the living body face is recognized. On the other hand, based on correct identification of the living body face, the data security in the living body face detection scene can be improved, and the user rights and interests are ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates an exemplary system architecture to which the face living body detection method and the face living body detection apparatus of embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a structural schematic of a computer system suitable for use in implementing electronic devices of embodiments of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a face living body detection method according to one embodiment of the present disclosure;
FIG. 4 schematically illustrates a structural diagram of a face living body detection network according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a face living body detection method according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a face living body detection apparatus according to an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 illustrates a schematic diagram of the system architecture of an exemplary application environment to which the face living body detection method and the face living body detection apparatus according to embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The terminal devices 101, 102, 103 may be various electronic devices with display screens including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers. Wherein the server 105 is configured to perform: performing upsampling processing on the face features through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers; performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to the up-sampling features; according to the non-living body sample vector and the reference living body sample vector, calculating a ternary loss function for the feature vector corresponding to each up-sampling feature, and obtaining a ternary loss function corresponding to each feature vector; determining a classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network; training a human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out human face living body detection on the received image to be identified through the trained human face living body detection network.
Fig. 2 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In (RAM) 203, various programs and data required for system operation are also stored. The (CPU) 201, (ROM) 202, and (RAM) 203 are connected to each other through a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the (I/O) interface 205: an input section 206 including a keyboard, a mouse, and the like; an output portion 207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage section 208 including a hard disk or the like; and a communication section 209 including a network interface card such as a LAN card, a modem, and the like. The communication section 209 performs communication processing via a network such as the internet. The drive 210 is also connected to the (I/O) interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 210 as needed, so that a computer program read therefrom is installed into the storage section 208 as needed.
In particular, according to embodiments of the present disclosure, the processes described below with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 209, and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU) 201, performs the various functions defined in the methods and apparatus of the present application.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by one of the electronic devices, cause the electronic device to implement the methods described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 3, and so on.
The present exemplary embodiment provides a face living body detection method, which may include the following steps S310 to S350, specifically, referring to fig. 3:
step S310: and carrying out upsampling processing on the face features through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers.
Step S320: and carrying out feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to each up-sampling feature.
Step S330: and calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector, and obtaining the ternary loss function corresponding to each feature vector.
Step S340: and determining the classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network.
Step S350: training a human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out human face living body detection on the received image to be identified through the trained human face living body detection network.
It should be noted that the technical solution formed by steps S310 to S350 may be applied to a face living body detection platform, and the platform may provide functional interfaces for application scenarios such as online attendance checking and online payment.
By implementing the face living body detection method shown in fig. 3, the face living body detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition accuracy of the face living body detection network is improved, and the living body face is recognized. In addition, based on correct identification of the living body face, the data safety in the living body face detection scene can be improved, and the user rights and interests are ensured.
Next, the above steps of the present exemplary embodiment will be described in more detail.
As an optional embodiment, before performing upsampling processing on the face feature by using a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers, the method further includes: identifying a face region in the sample image; and carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face characteristics.
Specifically, the area of the face region is equal to or smaller than that of the sample image, and the face region contains a plurality of face features, such as eye features, nose features, mouth features, and the like.
Therefore, by implementing this optional embodiment, recognizing the face region in the sample image and using only that region as the training sample improves training efficiency and avoids the influence of non-face regions of the sample image on the training result, which helps improve the recognition accuracy of the face living body detection network.
As an alternative embodiment, before identifying the face region in the sample image, the method further includes: receiving a plurality of labeling operations for an image to be labeled; determining the labeling result corresponding to each labeling operation, and determining the labeling result that occurs most frequently as the final labeling result of the image to be labeled; establishing an association relation that associates the image to be labeled with the final labeling result; and determining the image to be labeled that corresponds to the association relation as a sample image.
Specifically, the image to be labeled may be an image containing a living face or a non-living face, where a living face can be understood as a real human face. The number of images to be labeled may be one or more, which is not limited in the embodiments of the present application. A labeling operation may be a manual operation, such as a click operation, a sliding operation, a long-press operation, a drag operation, a voice-control operation or a gesture operation. Preferably, the number of labeling operations for the same image to be labeled may be 3, so that a tie, in which several different labeling results receive the same number of votes, can be avoided.
Specifically, determining the labeling result that occurs most frequently as the final labeling result of the image to be labeled includes: grouping identical results according to the labeling results corresponding to the labeling operations to obtain at least one labeling set, where each labeling set contains identical labeling results (e.g., living face) and different labeling sets correspond to different labeling results (e.g., labeling set 1 contains living-face results and labeling set 2 contains non-living-face results; the face region corresponding to a non-living face may come from a video picture, a printed picture or a 3D picture); further, counting the number of labeling results in each labeling set, and determining the labeling result of the set with the largest count as the final labeling result of the image to be labeled.
Further, if the labeling sets contain equal numbers of labeling results, the method may further include: feeding back prompt information of a labeling abnormality to prompt the labeling personnel to re-label the image; or determining the labeling result of the Nth labeling operation on the image as its final labeling result, where N is a positive integer.
Specifically, establishing the association relation that associates the image to be labeled with the final labeling result includes: establishing the association in key-value form, where the key represents the image to be labeled and the value represents the final labeling result.
Therefore, by implementing this alternative embodiment, the most frequent labeling result among multiple independent evaluations of one image can be selected as its final labeling result, which improves the reliability of the labels, ensures the labeling accuracy of the sample images, and thus helps improve the recognition accuracy of the face living body detection network trained on them.
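As an illustration of this majority-vote labeling rule, a minimal Python sketch follows. The function name, label strings and the tie fallback to the Nth annotation are hypothetical choices, not taken from the patent text.

```python
from collections import Counter

def final_label(annotations, tie_fallback_index=0):
    """Pick the most frequent labeling result for one image.

    annotations: labels from independent labeling operations,
                 e.g. ["live", "live", "spoof"].
    tie_fallback_index: which annotation to fall back on when labels
                        are tied (the patent's "Nth labeling operation").
    """
    counts = Counter(annotations)
    best, best_count = counts.most_common(1)[0]
    tied = [lab for lab, c in counts.items() if c == best_count]
    if len(tied) > 1:
        # Alternative per the patent: flag the image for re-labeling instead.
        return annotations[tie_fallback_index]
    return best

# Build the sample set as {image: final_label} key-value associations.
to_label = {"img_001.png": ["live", "live", "spoof"],
            "img_002.png": ["spoof", "spoof", "spoof"]}
samples = {img: final_label(anns) for img, anns in to_label.items()}
print(samples)  # {'img_001.png': 'live', 'img_002.png': 'spoof'}
```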
As an optional embodiment, the convolution processing is performed on the face region by using a convolution neural network in the face living body detection network to obtain a face feature, including: carrying out feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of each layer, the input of each layer is the output of the upper layer; and determining the reference characteristic corresponding to the last layer in each layer as the face characteristic.
Specifically, the face region may be an RGB (red, green and blue) three-channel image, and the face region is bound to the labeling result of the sample image to which it belongs. The convolutional neural network may be a residual network (ResNet), and the residual network may serve as the backbone of the face living body detection network; preferably, the number of residual units in the residual network may be 18. Alternatively, the convolutional neural network may be another type of network, which is not limited in the embodiments of the present application.
Optionally, before the feature transformation is performed on the face region by each layer of the convolutional neural network to obtain the reference features corresponding to each layer, the method may further include: if the size of the face region is detected to be larger than the target size, compressing the face region to the target size (e.g., 224x224), where the target size can be expressed in image pixels; this improves training efficiency.
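This size check can be sketched as follows, assuming PIL-style images and a square 224x224 target; the patent does not specify the resizing method, so bilinear interpolation here is an assumption.

```python
from PIL import Image

TARGET = (224, 224)  # target size in pixels (assumption: square input)

def compress_to_target(face_region: Image.Image) -> Image.Image:
    # Only compress when the detected face region exceeds the target size.
    if face_region.width > TARGET[0] or face_region.height > TARGET[1]:
        return face_region.resize(TARGET, Image.BILINEAR)
    return face_region
```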
Specifically, the layers of the convolutional neural network can be expressed as: 1 convolutional layer, 1 max pooling layer (maxpool), 2 bottleneck layers a1 (bottleneck), 2 bottleneck layers b1 (bottleneck), 2 bottleneck layers c1 (bottleneck) and 2 bottleneck layers d1 (bottleneck).
The face region, of size 224x224x3, can be input into the convolutional layer, whose convolution kernel size is 7x7, sampling interval is 2 and channel number is 64; its output is the convolution feature conv1. The convolution feature conv1 may then be input into the max pooling layer, whose sampling interval is 2 and whose output is the pooling feature maxpool. The pooling feature maxpool can then be input into the 2 bottleneck layers a1, whose convolution kernel size is 3x3, sampling interval is 1 and channel number is 64; the output of the later of the 2 bottleneck layers a1 is the convolution feature conv2. The convolution feature conv2 may then be input into the 2 bottleneck layers b1, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 128; the output of the later of the 2 bottleneck layers b1 is the convolution feature conv3. The convolution feature conv3 may then be input into the 2 bottleneck layers c1, whose convolution kernel size is 3x3, sampling interval is 1 and channel number is 256; the output of the later of the 2 bottleneck layers c1 is the convolution feature conv4. Finally, the convolution feature conv4 may be input into the 2 bottleneck layers d1, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 512; the output of the later of the 2 bottleneck layers d1 is the convolution feature conv5.
Therefore, by implementing this alternative embodiment, face features at different scales can be extracted through the stacked convolutional layers, which makes it convenient to obtain the face features best suited to network training and helps improve the recognition accuracy of the face living body detection network.
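The layer specification above translates into a compact PyTorch sketch of the backbone. This is a sketch under assumptions, not the patent's exact network: the residual unit design (two 3x3 convolutions with an identity or projection shortcut) is borrowed from the ResNet reference, and the class names, padding and batch-norm placement are the editor's guesses.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual unit with two 3x3 convs (ResNet basic-block style, assumed)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

class Backbone(nn.Module):
    """conv1 -> maxpool -> a1 -> b1 -> c1 -> d1, per the text above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3)  # 7x7, stride 2, 64 ch
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)     # stride 2
        self.a1 = nn.Sequential(Bottleneck(64, 64, 1), Bottleneck(64, 64))      # conv2
        self.b1 = nn.Sequential(Bottleneck(64, 128, 2), Bottleneck(128, 128))   # conv3
        self.c1 = nn.Sequential(Bottleneck(128, 256, 1), Bottleneck(256, 256))  # conv4
        self.d1 = nn.Sequential(Bottleneck(256, 512, 2), Bottleneck(512, 512))  # conv5

    def forward(self, x):
        x = self.maxpool(self.conv1(x))
        conv2 = self.a1(x)
        conv3 = self.b1(conv2)
        conv4 = self.c1(conv3)
        conv5 = self.d1(conv4)
        return conv2, conv3, conv4, conv5  # kept for skip connections

feats = Backbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])
```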
In step S310, the face features are up-sampled by a plurality of up-sampling layers in the face living body detection network, so as to obtain up-sampling features corresponding to each up-sampling layer.
In particular, the face living body detection network may be composed of a fully convolutional network (UNet) and multi-layer perceptrons (MLP). UNet is a fully convolutional network comprising 4 downsampling layers, 4 upsampling layers and a skip connection structure, characterized in that the convolutional layers in the downsampling and upsampling parts are completely symmetrical. An MLP can be understood as a multi-layer fully connected feed-forward network; typically, after a sample is input to the MLP, it is fed forward layer by layer (i.e., the result is computed layer by layer from the input layer through the hidden layer to the output layer) to obtain the final output value.
Specifically, the plurality of upsampling layers includes: upsampling layer 1 containing 2 bottleneck layers a2 (bottleneck), upsampling layer 2 containing 2 bottleneck layers b2 (bottleneck), upsampling layer 3 containing 2 bottleneck layers c2 (bottleneck), upsampling layer 4 containing 2 bottleneck layers d2 (bottleneck) and upsampling layer 5 containing 2 bottleneck layers e (bottleneck). The convolution feature conv5 output by the bottleneck layers d1 can serve as the input of upsampling layer 1, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 256; its output is the upsampling feature deconv1. The upsampling feature deconv1 output by upsampling layer 1 can serve as the input of upsampling layer 2, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 128; its output is the upsampling feature deconv2. The upsampling feature deconv2 output by upsampling layer 2 can serve as the input of upsampling layer 3, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 64; its output is the upsampling feature deconv3. The upsampling feature deconv3 output by upsampling layer 3 can serve as the input of upsampling layer 4, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 64; its output is the upsampling feature deconv4. The upsampling feature deconv4 output by upsampling layer 4 can serve as the input of upsampling layer 5, whose convolution kernel size is 3x3, sampling interval is 2 and channel number is 1; its output is the upsampling feature deconv5.
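A corresponding decoder sketch follows, under the same caveats. Reading "sampling interval 2" as a stride-2 transposed 3x3 convolution is one plausible interpretation, and the Decoder/up_block names are hypothetical; each stage stands in for the pair of bottleneck layers described above.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    # One upsampling layer: transposed 3x3 conv, stride 2 (assumed reading).
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.ReLU(inplace=True))

class Decoder(nn.Module):
    """deconv1..deconv5 with channel numbers 256, 128, 64, 64, 1."""
    def __init__(self):
        super().__init__()
        self.up1 = up_block(512, 256)  # conv5 -> deconv1
        self.up2 = up_block(256, 128)  # deconv1 -> deconv2
        self.up3 = up_block(128, 64)   # deconv2 -> deconv3
        self.up4 = up_block(64, 64)    # deconv3 -> deconv4
        self.up5 = up_block(64, 1)     # deconv4 -> deconv5 (spoof cue map)

    def forward(self, conv5):
        d1 = self.up1(conv5)
        d2 = self.up2(d1)
        d3 = self.up3(d2)
        d4 = self.up4(d3)
        d5 = self.up5(d4)
        return d1, d2, d3, d4, d5

outs = Decoder()(torch.randn(1, 512, 14, 14))
print([o.shape for o in outs])
```

In the full network of fig. 4, each decoding module would additionally take the skip-connected output of the symmetric feature generation module; that wiring is omitted here for brevity.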
Referring to fig. 4, fig. 4 schematically illustrates the structure of a face living body detection network according to an embodiment of the present disclosure. The face living body detection network is used to learn the difference between a living face and a non-living face, and this difference may be represented as a fraud matrix (spoof cue map); generally, the fraud matrix corresponding to a face region containing a living face is a zero matrix, and the fraud matrix corresponding to a face region containing a non-living face is a non-zero matrix. As shown in fig. 4, the face living body detection network includes a feature generation module E1 410, a feature generation module E2 420, a feature generation module E3 430, a feature generation module E4 440, a feature generation module E5 450, a network decoding module D1 460, a network decoding module D2 470, a network decoding module D3 480 and a network decoding module D4 490.
The feature generation modules E1 410, E2 420, E3 430, E4 440 and E5 450 are used to perform feature convolution at different scales; each feature generation module may be understood as one convolutional neural network, or alternatively as one layer in the convolutional neural network. The network decoding modules D1 460, D2 470, D3 480 and D4 490 are used to perform feature upsampling at different scales. The numbers of feature generation modules and network decoding modules shown in fig. 4 are merely illustrative; in practical applications, these numbers are not limited in this application.
Specifically, when the face region is acquired, it may be input into the feature generation modules, so that the feature generation module E1 410, the feature generation module E2 420, the feature generation module E3 430, the feature generation module E4 440 and the feature generation module E5 450 sequentially perform feature transformation, where the input of each feature generation module is the output of the previous one. The last feature generation module may then feed its feature extraction result into the network decoding modules, so that the network decoding module D1 460, the network decoding module D2 470, the network decoding module D3 480 and the network decoding module D4 490 sequentially perform feature upsampling; the input of each network decoding module is the output of the previous network decoding module together with the output of the feature generation module at the symmetric network layer. The upsampling results of the network decoding modules D1 460, D2 470, D3 480 and D4 490 may then be classified by the MLP to obtain the classification results F1, F2, F3 and F4, so that the ternary loss function and the classification loss function can be calculated from F1, F2, F3 and F4, and the face living body detection network can be trained with these loss functions to learn the difference between a living face and a non-living face expressed as the fraud matrix (spoof cue map).
In step S320, feature transformation is performed on the upsampling features corresponding to each upsampling layer through the full connection layer in the face living body detection network, so as to obtain feature vectors corresponding to each upsampling feature.
Specifically, performing feature transformation on the upsampling features corresponding to each upsampling layer through the full connection layer in the face living body detection network includes: performing feature transformation on the upsampling features deconv1, deconv2, deconv3, deconv4 and deconv5 corresponding to the upsampling layers to obtain feature vectors V1, V2, V3, V4 and V5, each of dimension 512. In this embodiment of the present application, 512 is the preferred dimension of each feature vector; the dimension used in practical applications is not limited.
In step S330, a ternary loss function is calculated for each feature vector corresponding to the up-sampled feature according to the non-living sample vector and the reference living sample vector, and the ternary loss function corresponding to each feature vector is obtained.
Specifically, the non-living sample vector (Vspoof) and the reference living sample vector (Vanchor) may be preset vectors; correspondingly, the feature vectors V1, V2, V3, V4 and V5 may each participate in the computation of the ternary loss function as the living face sample vector Vlive. Optionally, the ternary loss function (triplet loss) may be expressed as: TripletLoss = min(‖Vanchor − Vlive‖2 − ‖Vanchor − Vspoof‖2).
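A sketch of this ternary (triplet) loss in PyTorch follows. Note that the common triplet formulation clamps the distance difference at zero with a margin, max(d_pos − d_neg + m, 0); the margin and the clamp below are additions beyond the patent's literal min(...) expression and are labeled as such.

```python
import torch

def ternary_loss(v_anchor, v_live, v_spoof, margin=0.5):
    """Triplet loss over 512-d feature vectors (batch x 512).

    Pulls the feature vector V_live toward the reference living sample
    vector V_anchor and pushes it away from the non-living vector V_spoof.
    The margin and the clamp at zero are standard additions (assumption).
    """
    d_pos = torch.norm(v_anchor - v_live, p=2, dim=-1)
    d_neg = torch.norm(v_anchor - v_spoof, p=2, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

v_anchor = torch.randn(8, 512)  # preset reference living sample vector
v_spoof = torch.randn(8, 512)   # preset non-living sample vector
v_live = torch.randn(8, 512, requires_grad=True)  # a feature vector Vi
print(ternary_loss(v_anchor, v_live, v_spoof))
```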
In step S340, a classification loss function of each feature vector is determined by the multi-layer perceptron in the face living body detection network.
Specifically, determining the classification loss function of each feature vector through the multi-layer perceptrons in the face living body detection network includes: determining the classification result of the corresponding feature vector through each of the plurality of multi-layer perceptrons in the face living body detection network to obtain the classification result corresponding to each multi-layer perceptron; and calculating a classification loss function (CrossEntropyLoss) for each classification result.
The classification results correspond one-to-one with the feature vectors. A classification result can be expressed as a set of probabilities, where each probability expresses the probability that the face region belongs to one of the preset categories (for example, the living face category and the non-living face category). The multi-layer perceptron (MLP) is a feed-forward artificial neural network model used to classify features.
In addition, the classification loss function is preferably a cross-entropy loss function (CrossEntropy); alternatively, it may be a 0-1 loss function (zero-one loss), an absolute loss function, a log loss function, a square loss function, an exponential loss function (exponential loss), a hinge loss function, or a perceptron loss function, which is not limited in the embodiments of the present application.
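A minimal MLP classification head with cross-entropy loss, one head per feature vector V1 to V5, is sketched below; the hidden width (256), the two-class output and the label convention are assumptions consistent with the living/non-living categories above.

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Feed-forward classifier for one 512-d feature vector."""
    def __init__(self, in_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes))

    def forward(self, v):
        return self.net(v)  # logits over {non-living, living}

heads = nn.ModuleList([MLPHead() for _ in range(5)])  # one per V1..V5
criterion = nn.CrossEntropyLoss()

features = [torch.randn(8, 512) for _ in range(5)]    # V1..V5
labels = torch.randint(0, 2, (8,))                    # 1 = living (assumed)
ce_losses = [criterion(head(v), labels) for head, v in zip(heads, features)]
print([float(l) for l in ce_losses])
```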
In step S350, the face living body detection network is trained according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and the face living body detection is performed on the received image to be identified through the trained face living body detection network.
As an alternative embodiment, training the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector includes: calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector; and adjusting network parameters in the human face living body detection network through the target loss function until the target loss function is converged to a preset loss range, so as to realize training of the human face living body detection network.
Specifically, the network parameters in the human face living body detection network at least comprise a weight value and a bias item, and the preset loss range is a manually set numerical range.
Therefore, by implementing the alternative embodiment, the convergence speed of the face living body detection network can be improved through the combination of the loss functions, and the recognition accuracy of the face living body detection network can be improved.
As an alternative embodiment, calculating the target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector includes: performing mean calculation on the ternary loss functions corresponding to the feature vectors to obtain a first mean; performing mean calculation on the classification loss functions corresponding to the feature vectors to obtain a second mean; calculating a weighted sum of the first mean and the second mean, and determining the weighted sum as the target loss function.
Specifically, performing mean calculation on the ternary loss functions corresponding to the feature vectors to obtain the first mean includes: calculating (TripletLoss1 + TripletLoss2 + … + TripletLossN)/N from the ternary loss function corresponding to each feature vector to obtain the first mean, where N is a positive integer. Performing mean calculation on the classification loss functions corresponding to the feature vectors to obtain the second mean includes: calculating (CrossEntropyLoss1 + CrossEntropyLoss2 + … + CrossEntropyLossN)/N from the classification loss function corresponding to each feature vector to obtain the second mean, where N is a positive integer. Calculating the weighted sum of the first mean and the second mean and determining the weighted sum as the target loss function includes: calculating the target loss function (Loss) from the first mean and the second mean as Loss = W1 x first mean + W2 x second mean, where W1 and W2 are constants, W1 is the weight of the ternary loss function and W2 is the weight of the classification loss function.
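Putting the two means together, a sketch of the target loss follows; the values of the weights W1 and W2 are hyperparameters the patent does not fix, so the numbers below are placeholders.

```python
import torch

def target_loss(triplet_losses, ce_losses, w1=1.0, w2=1.0):
    """Loss = W1 * mean(ternary losses) + W2 * mean(classification losses).

    triplet_losses, ce_losses: lists of scalar tensors, one per feature
    vector V1..Vn. The w1/w2 values here are assumptions.
    """
    first_mean = torch.stack(list(triplet_losses)).mean()
    second_mean = torch.stack(list(ce_losses)).mean()
    return w1 * first_mean + w2 * second_mean

# e.g. five per-branch losses from the previous sketches:
tl = [torch.rand(()) for _ in range(5)]
cl = [torch.rand(()) for _ in range(5)]
loss = target_loss(tl, cl, w1=1.0, w2=0.5)
# In training: loss.backward(); optimizer.step(); repeat until the loss
# converges to the preset loss range.
print(float(loss))
```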
Therefore, by implementing the alternative embodiment, the target loss function suitable for training the human face living body detection network can be calculated, so that the recognition accuracy of the human face living body detection network is improved.
As an alternative embodiment, performing face living body detection on the received image to be identified through the trained face living body detection network includes: inputting the image to be identified into the trained face living body detection network; generating a plurality of classification results corresponding to the image to be identified through the trained face living body detection network; fusing the plurality of classification results to obtain a reference result; and determining the threshold range to which the reference result belongs, and generating the recognition result of the image to be identified according to the label corresponding to the threshold range.
Specifically, the image to be recognized is input into the trained face living body detection network, and a plurality of classification results C1, C2, C3, C4 and C5 corresponding to the image to be recognized are generated through the trained face living body detection network. Further, fusing the plurality of classification results to obtain a reference result includes: calculating the reference result by fusing C1 to C5, for example, as their mean (C1 + C2 + C3 + C4 + C5)/5. Further, determining the threshold range to which the reference result belongs and generating the recognition result of the image to be recognized according to the label corresponding to that threshold range includes: if the reference result falls within the threshold range whose label indicates a living body, generating the recognition result that the image to be recognized contains a living body according to that label; if the reference result falls within the threshold range whose label indicates a non-living body, generating the recognition result that the image to be recognized does not contain a living body according to that label. If the image to be recognized does not contain a living body, the method may further include: uploading the image to be recognized to a cloud server for storage and feeding back alarm information, so as to remind related personnel to pay attention to the current abnormal situation.
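By way of illustration only, the fusion and threshold decision may be sketched as follows; the mean fusion and the 0.5 threshold are assumptions, since the original expressions are not reproduced in the text above.

```python
# Illustrative only: mean fusion of the classification results C1..C5 and a
# threshold decision. The mean and the 0.5 threshold are assumptions, since
# the original expressions are not reproduced in the text.
def fuse_and_decide(scores, threshold=0.5):
    reference = sum(scores) / len(scores)   # fuse C1..C5 into the reference result
    if reference >= threshold:              # threshold range labeled "living body"
        return "image to be identified contains a living body"
    return "image to be identified does not contain a living body"

print(fuse_and_decide([0.91, 0.84, 0.77, 0.95, 0.88]))
```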
Therefore, by implementing this alternative embodiment, improving the recognition accuracy of the face living body detection network helps close security loopholes in the field of face living body detection and improves the security of face living body detection; when applied to a face living body detection system, it can also improve the robustness of the system.
Referring to fig. 5, fig. 5 schematically illustrates a flowchart of a face in-vivo detection method according to one embodiment of the present disclosure. As shown in fig. 5, the face living body detection method includes: step S510 to step S590.
Step S510: receiving a plurality of labeling operations aiming at the image to be labeled, determining labeling results corresponding to the labeling operations respectively, determining the same labeling result with the largest quantity as a final labeling result corresponding to the image to be labeled, further establishing an association relation for associating the image to be labeled with the final labeling result, and determining the image to be labeled corresponding to the association relation as a sample image.
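By way of illustration only, the majority-vote labeling of step S510 may be sketched as follows; the input format (one label per annotator) is an assumption, and the tie handling mirrors the labeling-abnormality prompt described in the claims.

```python
# Illustrative only: majority-vote labeling for step S510. The input format
# (one label per annotator) is an assumption; ties trigger the
# labeling-abnormality prompt described in the claims.
from collections import Counter

def final_label(annotations):
    counts = Counter(annotations).most_common()
    (label, top), rest = counts[0], counts[1:]
    if rest and rest[0][1] == top:   # equal counts: no single largest labeling set
        raise ValueError("labeling abnormality: please re-label the image")
    return label

print(final_label(["live", "live", "spoof"]))   # -> "live"
```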
Step S520: and identifying a face region in the sample image, and carrying out feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer, wherein the input of each layer is the output of the last layer based on the arrangement sequence of each layer, and further, the reference features corresponding to the last layer in each layer are determined to be the face features.
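By way of illustration only, the layer-by-layer feature transformation of step S520 may be sketched as follows; the number of convolutional layers and their channel sizes are assumptions, as the embodiments do not fix a particular backbone.

```python
# Illustrative only: layer-by-layer feature transformation for step S520,
# where each layer's input is the previous layer's output and the last
# layer's reference feature is taken as the face feature. Layer count and
# channel sizes are assumptions.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)

face_region = torch.randn(1, 3, 224, 224)   # cropped face region from a sample image
face_feature = backbone(face_region)        # reference feature of the last layer
print(face_feature.shape)                   # torch.Size([1, 128, 28, 28])
```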
Step S530: and carrying out upsampling processing on the face features through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers.
Step S540: and carrying out feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to each up-sampling feature.
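By way of illustration only, steps S530 and S540 may be sketched as follows; the number of up-sampling layers, their scale factors, the global average pooling used to fix the full connection layer's input size, and applying the up-sampling layers in parallel over the face feature are all assumptions.

```python
# Illustrative only: steps S530-S540. Each up-sampling layer is applied to
# the face feature, and a shared full connection layer maps each up-sampling
# feature to a feature vector. Layer count, scale factors, and the global
# average pooling (which fixes the linear layer's input size) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

face_feature = torch.randn(1, 128, 28, 28)   # output of the convolutional backbone

up_layers = [nn.Upsample(scale_factor=s, mode="bilinear", align_corners=False)
             for s in (2, 4)]
fc = nn.Linear(128, 64)                      # full connection layer

feature_vectors = []
for up in up_layers:
    up_feature = up(face_feature)                               # up-sampling feature
    pooled = F.adaptive_avg_pool2d(up_feature, 1).flatten(1)    # (1, 128)
    feature_vectors.append(fc(pooled))                          # feature vector (1, 64)
```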
Step S550: and calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living sample vector and the reference living sample vector, and obtaining the ternary loss function corresponding to each feature vector.
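By way of illustration only, the ternary loss of step S550 may be sketched as follows, following the expression TripletLoss = min(||Vanchor - Vlive||^2 - ||Vanchor - Vspoof||^2) given in the claims; taking the minimum over a batch dimension is an interpretive assumption.

```python
# Illustrative only: the ternary loss of step S550, following the claimed
# expression TripletLoss = min(||Vanchor - Vlive||^2 - ||Vanchor - Vspoof||^2).
# Taking the minimum over a batch dimension is an interpretive assumption.
import torch

def ternary_loss(v_anchor, v_live, v_spoof):
    d_live = (v_anchor - v_live).pow(2).sum(dim=-1)     # ||Vanchor - Vlive||^2
    d_spoof = (v_anchor - v_spoof).pow(2).sum(dim=-1)   # ||Vanchor - Vspoof||^2
    return (d_live - d_spoof).min()

v_anchor = torch.randn(8, 64)                     # reference living body sample vectors
v_live = torch.randn(8, 64, requires_grad=True)   # feature vectors from the network
v_spoof = torch.randn(8, 64)                      # non-living body sample vectors
print(ternary_loss(v_anchor, v_live, v_spoof))
```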
Step S560: and determining the classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network.
Step S570: and carrying out average value calculation on the ternary loss function corresponding to each feature vector to obtain a first average value, carrying out average value calculation on the classification loss function corresponding to each feature vector to obtain a second average value, further calculating the weighted sum of the first average value and the second average value, and determining the weighted sum as a target loss function.
Step S580: and adjusting network parameters in the human face living body detection network through the target loss function until the target loss function is converged to a preset loss range, so as to realize training of the human face living body detection network.
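By way of illustration only, the parameter adjustment of step S580 may be sketched as follows; the stand-in network and loss, the Adam optimizer, the learning rate, and the preset loss range value are all assumptions.

```python
# Illustrative only: adjusting network parameters (weights and bias terms)
# with the target loss until it converges into the preset loss range. The
# stand-in network and loss, the Adam optimizer, the learning rate, and the
# preset range value are all assumptions.
import torch
import torch.nn as nn

network = nn.Linear(64, 2)                       # stand-in for the detection network
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                # stand-in for the target loss
preset_loss_range = 0.05                         # manually set numerical range

features = torch.randn(32, 64)
labels = torch.randint(0, 2, (32,))

for epoch in range(1000):
    loss = criterion(network(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= preset_loss_range:         # converged into the preset range
        break
```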
Step S590: inputting the image to be identified into a trained human face living body detection network, generating a plurality of classification results corresponding to the image to be identified through the trained human face living body detection network, fusing the plurality of classification results to obtain a reference result, determining a threshold range to which the reference result belongs, and generating an identification result of the image to be identified according to a label corresponding to the threshold range.
It should be noted that, steps S510 to S590 correspond to the steps and embodiments shown in fig. 3, and for the specific implementation of steps S510 to S590, please refer to the steps and embodiments shown in fig. 3, and the description thereof is omitted here.
Therefore, by implementing the face living body detection method shown in fig. 5, the face living body detection network can be trained based on the fusion result of the ternary loss function and the classification loss function, so that the recognition accuracy of the face living body detection network is improved, and the living body face is recognized. In addition, based on correct identification of the living body face, the data safety in the living body face detection scene can be improved, and the user rights and interests are ensured.
Further, in this example embodiment, there is also provided a face living body detection apparatus, referring to fig. 6, the face living body detection apparatus 600 may include:
the feature sampling unit 601 is configured to perform upsampling processing on the face feature through a plurality of upsampling layers in the face living body detection network, so as to obtain upsampling features corresponding to the upsampling layers;
the feature transformation unit 602 is configured to perform feature transformation on the upsampling features corresponding to each upsampling layer through a full connection layer in the face living body detection network, so as to obtain feature vectors corresponding to each upsampling feature;
A ternary loss function determining unit 603, configured to calculate a ternary loss function for feature vectors corresponding to the upsampled features according to the non-living sample vector and the reference living sample vector, so as to obtain a ternary loss function corresponding to each feature vector;
a classification loss function determining unit 604, configured to determine a classification loss function of each feature vector by using a multi-layer perceptron in the face living body detection network;
the face living body detection unit 605 is configured to train a face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and perform face living body detection on the received image to be identified through the trained face living body detection network.
Therefore, the device shown in fig. 6 can train the face living body detection network based on the fusion result of the ternary loss function and the classification loss function, so as to improve the recognition accuracy of the face living body detection network and recognize the living body face. In addition, based on correct identification of the living body face, the data safety in the living body face detection scene can be improved, and the user rights and interests are ensured.
In an exemplary embodiment of the present disclosure, the above apparatus further includes:
A face region identifying unit (not shown) configured to identify a face region in the sample image before the feature sampling unit 601 performs upsampling processing on the face feature through a plurality of upsampling layers in the face living body detection network to obtain upsampled features corresponding to the upsampling layers;
and the feature extraction unit (not shown) is used for carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face features.
Therefore, by implementing this alternative embodiment, recognizing the face region in the sample image and using that region as the training sample improves training efficiency and avoids the influence of non-face regions in the sample image on the training result, which is conducive to improving the recognition accuracy of the face living body detection network.
In an exemplary embodiment of the present disclosure, the above apparatus further includes:
an operation receiving unit (not shown) for receiving a plurality of labeling operations for the image to be labeled before the face region recognition unit recognizes the face region in the sample image;
the labeling result determining unit (not shown) is used for determining labeling results corresponding to a plurality of labeling operations respectively, and determining the same labeling result with the largest quantity as a final labeling result corresponding to the image to be labeled;
The association relation establishing unit (not shown) is used for establishing an association relation for associating the image to be annotated with the final annotation result;
a sample image determining unit (not shown) for determining an image to be annotated corresponding to the association relationship as a sample image.
Therefore, by implementing the alternative embodiment, the same labeling result with the largest quantity can be selected from the multi-person evaluation of one image to be labeled and used as the final labeling result of the image to be labeled, so that the rationality of the labeling result of the image to be labeled can be improved, the labeling accuracy of the sample image can be ensured, and the identification accuracy of the human face living detection network trained based on the sample image can be improved.
In an exemplary embodiment of the present disclosure, a feature extraction unit performs convolution processing on a face region through a convolutional neural network in a face living body detection network to obtain a face feature, including:
carrying out feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of each layer, the input of each layer is the output of the upper layer;
and determining the reference characteristic corresponding to the last layer in each layer as the face characteristic.
Therefore, by implementing the alternative embodiment, the face features with different scales can be extracted through the multi-layer convolution layers, so that the face features which are most suitable for network training can be conveniently obtained, and the recognition accuracy of the face living body detection network can be improved.
In an exemplary embodiment of the present disclosure, the face living body detection unit 605 trains the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, including:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the human face living body detection network through the target loss function until the target loss function is converged to a preset loss range, so as to realize training of the human face living body detection network.
Therefore, by implementing the alternative embodiment, the convergence speed of the face living body detection network can be improved through the combination of the loss functions, and the recognition accuracy of the face living body detection network can be improved.
In an exemplary embodiment of the present disclosure, the face living body detection unit 605 calculates a target loss function from a ternary loss function corresponding to each feature vector and a classification loss function corresponding to each feature vector, including:
Carrying out average value calculation on the ternary loss function corresponding to each feature vector to obtain a first average value;
carrying out average value calculation on the classification loss function corresponding to each feature vector to obtain a second average value;
a weighted sum of the first mean and the second mean is calculated and the weighted sum is determined as a target loss function.
Therefore, by implementing the alternative embodiment, the target loss function suitable for training the human face living body detection network can be calculated, so that the recognition accuracy of the human face living body detection network is improved.
In an exemplary embodiment of the present disclosure, the face living body detection unit 605 performs face living body detection on the received image to be recognized through the trained face living body detection network, including:
inputting an image to be identified into a trained human face living body detection network;
generating a plurality of classification results corresponding to the images to be identified through the trained human face living body detection network;
fusing various classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating a recognition result of the image to be recognized according to the label corresponding to the threshold range.
Therefore, by implementing this alternative embodiment, improving the recognition accuracy of the face living body detection network helps close security loopholes in the field of face living body detection and improves the security of face living body detection; when applied to a face living body detection system, it can also improve the robustness of the system.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Since each functional module of the face living body detection apparatus according to the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the face living body detection method described above, for details not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the face living body detection method described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A face living body detection method, characterized by comprising:
performing upsampling processing on the face features through a plurality of upsampling layers in the face living body detection network to obtain upsampling features corresponding to the upsampling layers;
performing feature transformation on the up-sampling features corresponding to each up-sampling layer through a full connection layer in the human face living body detection network to obtain feature vectors corresponding to each up-sampling feature;
according to the non-living body sample vector and the reference living body sample vector, calculating a ternary loss function for the feature vector corresponding to each up-sampling feature, and obtaining a ternary loss function corresponding to each feature vector; wherein the ternary loss function is expressed as TripletLoss = min(||Vanchor - Vlive||^2 - ||Vanchor - Vspoof||^2), where Vanchor refers to the reference living body sample vector, Vlive refers to the feature vector corresponding to each up-sampling feature, and Vspoof refers to the non-living body sample vector;
determining a classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network;
Training the human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and performing human face living body detection on the received image to be identified through the trained human face living body detection network;
wherein the method further comprises:
receiving a plurality of labeling operations for an image to be labeled;
carrying out the same result statistics according to the marking results respectively corresponding to the marking operations to obtain at least one marking set, wherein the marking results are the same in each marking set, and different marking sets correspond to different marking results;
calculating the number of the labeling results of each labeling set, and determining the labeling result with the largest number of the labeling results as the final labeling result corresponding to the image to be labeled;
if the number of the labeling results of each labeling set is equal, feeding back prompt information of a labeling abnormality to prompt labeling personnel to re-label the image to be labeled;
establishing an association relation for associating the image to be annotated with the final annotation result;
determining the image to be annotated corresponding to the association relation as a sample image;
Identifying a face region in the sample image;
and carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face characteristics.
2. The method according to claim 1, wherein the face feature is obtained by performing convolution processing on the face region through a convolutional neural network in the face living body detection network, including:
performing feature transformation on the face region through each layer in the convolutional neural network to obtain reference features corresponding to each layer; wherein, based on the arrangement sequence of each layer, the input of each layer is the output of the upper layer;
and determining the corresponding reference feature of the last layer in the layers as the face feature.
3. The method of claim 1, wherein training the face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector comprises:
calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector;
and adjusting network parameters in the human face living body detection network through the target loss function until the target loss function is converged to a preset loss range, so as to realize training of the human face living body detection network.
4. The method of claim 1, wherein calculating a target loss function according to the ternary loss function corresponding to each feature vector and the classification loss function corresponding to each feature vector comprises:
carrying out average value calculation on the ternary loss function corresponding to each feature vector to obtain a first average value;
carrying out average value calculation on the classification loss function corresponding to each feature vector to obtain a second average value;
a weighted sum of the first mean and the second mean is calculated and determined as the target loss function.
5. The method of claim 4, wherein performing face living body detection on the received image to be identified through the trained face living body detection network comprises:
inputting the image to be recognized into a trained human face living body detection network;
generating a plurality of classification results corresponding to the images to be identified through the trained human face living body detection network;
fusing the multiple classification results to obtain a reference result;
and determining a threshold range to which the reference result belongs, and generating a recognition result of the image to be recognized according to a label corresponding to the threshold range.
6. A human face living body detection apparatus, characterized by comprising:
the feature sampling unit is used for carrying out up-sampling processing on the face features through a plurality of up-sampling layers in the face living body detection network to obtain up-sampling features corresponding to the up-sampling layers;
the feature transformation unit is used for carrying out feature transformation on the up-sampling features corresponding to each up-sampling layer through the full connection layer in the human face living body detection network to obtain feature vectors corresponding to each up-sampling feature;
the ternary loss function determining unit is used for calculating a ternary loss function for the feature vector corresponding to each up-sampling feature according to the non-living body sample vector and the reference living body sample vector, to obtain a ternary loss function corresponding to each feature vector; wherein the ternary loss function is expressed as TripletLoss = min(||Vanchor - Vlive||^2 - ||Vanchor - Vspoof||^2), where Vanchor refers to the reference living body sample vector, Vlive refers to the feature vector corresponding to each up-sampling feature, and Vspoof refers to the non-living body sample vector;
the classification loss function determining unit is used for determining the classification loss function of each feature vector through a multi-layer perceptron in the human face living body detection network;
The human face living body detection unit is used for training the human face living body detection network according to the ternary loss function corresponding to each feature vector and the classification loss function of each feature vector, and carrying out human face living body detection on the received image to be identified through the trained human face living body detection network;
wherein the device is further for:
receiving a plurality of labeling operations for an image to be labeled;
carrying out the same result statistics according to the marking results respectively corresponding to the marking operations to obtain at least one marking set, wherein the marking results are the same in each marking set, and different marking sets correspond to different marking results;
calculating the number of the labeling results of each labeling set, and determining the labeling result with the largest number of the labeling results as the final labeling result corresponding to the image to be labeled;
if the number of the labeling results of each labeling set is equal, feeding back prompt information of a labeling abnormality to prompt labeling personnel to re-label the image to be labeled;
establishing an association relation for associating the image to be annotated with the final annotation result;
determining the image to be annotated corresponding to the association relation as a sample image;
Identifying a face region in the sample image;
and carrying out convolution processing on the face region through a convolution neural network in the face living body detection network to obtain the face characteristics.
7. A computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the face living body detection method according to any one of claims 1 to 5.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the face living body detection method according to any one of claims 1 to 5.
CN202110648875.2A 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment Active CN113283376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110648875.2A CN113283376B (en) 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110648875.2A CN113283376B (en) 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment

Publications (2)

Publication Number Publication Date
CN113283376A CN113283376A (en) 2021-08-20
CN113283376B true CN113283376B (en) 2024-02-09

Family

ID=77284148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110648875.2A Active CN113283376B (en) 2021-06-10 2021-06-10 Face living body detection method, face living body detection device, medium and equipment

Country Status (1)

Country Link
CN (1) CN113283376B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416324A (en) * 2018-03-27 2018-08-17 百度在线网络技术(北京)有限公司 Method and apparatus for detecting live body
CN112070058A (en) * 2020-09-18 2020-12-11 深延科技(北京)有限公司 Face and face composite emotional expression recognition method and system
CN112597885A (en) * 2020-12-22 2021-04-02 北京华捷艾米科技有限公司 Face living body detection method and device, electronic equipment and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9704393B2 (en) * 2011-01-11 2017-07-11 Videonetics Technology Private Limited Integrated intelligent server based system and method/systems adapted to facilitate fail-safe integration and/or optimized utilization of various sensory inputs

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416324A (en) * 2018-03-27 2018-08-17 百度在线网络技术(北京)有限公司 Method and apparatus for detecting live body
CN112070058A (en) * 2020-09-18 2020-12-11 深延科技(北京)有限公司 Face and face composite emotional expression recognition method and system
CN112597885A (en) * 2020-12-22 2021-04-02 北京华捷艾米科技有限公司 Face living body detection method and device, electronic equipment and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhong Yueqi (ed.). Principles and Applications of Artificial Intelligence Technology. Donghua University Press, 2020, pp. 226-228. *

Also Published As

Publication number Publication date
CN113283376A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
US10902245B2 (en) Method and apparatus for facial recognition
CN108427939B (en) Model generation method and device
US11586851B2 (en) Image classification using a mask image and neural networks
WO2020000879A1 (en) Image recognition method and apparatus
US10719693B2 (en) Method and apparatus for outputting information of object relationship
CN110232340A (en) Establish the method, apparatus of video classification model and visual classification
CN108133197B (en) Method and apparatus for generating information
CN109214501B (en) Method and apparatus for identifying information
CN113177449B (en) Face recognition method, device, computer equipment and storage medium
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN112417947B (en) Method and device for optimizing key point detection model and detecting face key points
CN111522979B (en) Picture sorting recommendation method and device, electronic equipment and storage medium
CN108229680B (en) Neural network system, remote sensing image recognition method, device, equipment and medium
CN112651311A (en) Face recognition method and related equipment
CN116127080A (en) Method for extracting attribute value of description object and related equipment
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN113283376B (en) Face living body detection method, face living body detection device, medium and equipment
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
CN115273224A (en) High-low resolution bimodal distillation-based video human body behavior identification method
CN114332993A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN116704593A (en) Predictive model training method, apparatus, electronic device, and computer-readable medium
CN113971830A (en) Face recognition method and device, storage medium and electronic equipment
CN116848547A (en) Image processing method and system
CN113255819A (en) Method and apparatus for identifying information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant