CN115909514A - Training method of living body attack detection model and living body attack detection method - Google Patents

Training method of living body attack detection model and living body attack detection method

Info

Publication number
CN115909514A
Authority
CN
China
Prior art keywords
face image
sample
image
attack detection
features
Prior art date
Legal status
Pending
Application number
CN202211531496.6A
Other languages
Chinese (zh)
Inventor
武文琦
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211531496.6A
Publication of CN115909514A

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The embodiments of this specification disclose a training method for a living body attack detection model, a living body attack detection method, a living body attack detection apparatus, a storage medium, and an electronic device. A sample face image is input into the living body attack detection model. Feature extraction and feature splitting are performed on the sample face image by the living body attack detection model to obtain content features of the sample face image, where the content features represent the image content of the sample face image. The content features of the sample face image are combined with style features of a reference face image to obtain a sample combination feature, where the style features represent the image style of the reference face image. The living body attack detection model is trained based on first difference information between a prediction label and an annotation label of the reference face image, where the prediction label is determined by the living body attack detection model based on the sample combination feature, and the annotation label indicates whether the reference face image is a living body attack image.

Description

Training method of living body attack detection model and living body attack detection method
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method for a living attack detection model, a living attack detection method, a living attack detection device, a storage medium, and an electronic device.
Background
With the development of computer technology, face recognition has been widely applied in recent years. For example, face recognition systems are widely deployed on payment platforms, where users can quickly complete payments through face recognition.
However, while bringing convenience to people's work and life, face recognition systems are also tested by various attacks. A living attack, for example, is a highly threatening type of attack in which identity is imitated at the face recognition stage using a photo, a mobile phone screen, a mask, and the like. Once a living body attack succeeds, it causes a huge loss. Detecting living attacks with a living body attack detection model is the current development trend, and how to train such a model is a research hotspot.
Disclosure of Invention
The present specification provides a training method for a living body attack detection model, a living body attack detection method, a living body attack detection apparatus, a storage medium, and an electronic device. The training method can train a living body attack detection model, and the trained model can detect living attacks during face recognition, thereby improving the safety of face recognition.
In one aspect, an embodiment of the present specification provides a method for training a live attack detection model, including:
inputting a sample face image into a living body attack detection model, wherein the living body attack detection model is used for determining whether the input face image is a living body attack image;
performing feature extraction and feature splitting on the sample face image through the living body attack detection model to obtain the content features of the sample face image, wherein the content features of the sample face image are used for representing the image content of the sample face image;
combining the content characteristics of the sample face image with the style characteristics of the reference face image to obtain sample combination characteristics, wherein the style characteristics are used for expressing the image style of the reference face image;
training the living body attack detection model based on first difference information between a prediction label and an annotation label of the reference face image, wherein the prediction label is determined by the living body attack detection model based on the sample combination characteristics, and the annotation label is used for indicating whether the reference face image is a living body attack image.
In one aspect, an embodiment of the present specification provides a method for detecting a living body attack, including:
inputting a target face image into a living body attack detection model;
performing feature extraction on the target face image through the living body attack detection model to obtain image features of the target face image;
predicting based on the image characteristics of the target face image through the living body attack detection model, and outputting a label of the target face image, wherein the label is used for indicating whether the target face image is a living body attack image;
the living attack detection model is obtained by training based on first difference information between a prediction label and an annotation label of a reference face image, the prediction label is determined based on sample combination characteristics, and the sample combination characteristics are obtained by combining content characteristics of the sample face image and style characteristics of the reference face image.
In one aspect, an embodiment of the present specification provides a training apparatus for a living body attack detection model, including:
the system comprises a sample face image input module, a living body attack detection module and a living body attack detection module, wherein the sample face image input module is used for inputting a sample face image into a living body attack detection model, and the living body attack detection model is used for determining whether the input face image is a living body attack image;
the characteristic processing module is used for carrying out characteristic extraction and characteristic splitting on the sample face image through the living body attack detection model to obtain the content characteristic of the sample face image, and the content characteristic of the sample face image is used for expressing the image content of the sample face image;
the characteristic combination module is used for combining the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics, and the style characteristics are used for expressing the image style of the reference face image;
a training module, configured to train the living attack detection model based on first difference information between a prediction tag and an annotation tag of the reference face image, where the prediction tag is determined by the living attack detection model based on the sample combination feature, and the annotation tag is used to indicate whether the reference face image is a living attack image.
In a possible implementation manner, the feature processing module is configured to perform feature extraction on the sample face image to obtain an image feature of the sample face image, where the image feature includes a content feature and a style feature; and carrying out feature splitting on the image features of the sample face image to obtain the content features of the sample face image.
In one possible implementation, the feature processing module is configured to perform any one of:
performing at least one convolution on the sample face image to obtain the image features of the sample face image;
performing at least one full connection on the sample face image to obtain the image features of the sample face image;
and coding the sample face image based on an attention mechanism to obtain the image characteristics of the sample face image.
In a possible implementation manner, the feature processing module is configured to perform at least one convolution on the image features of the sample face image to obtain the content features of the sample face image.
In a possible embodiment, the training module is further configured to perform at least one of:
training the living attack detection model based on second difference information between the sample combination features and positive sample combination features and third difference information between the sample combination features and negative sample combination features, wherein the positive sample combination features and the sample combination features have the same style features, and the negative sample combination features and the sample combination features have different style features;
inputting the content characteristics of the sample face image into a style discrimination unit, predicting based on the content characteristics of the sample face image through the style discrimination unit, and outputting the predicted style of the sample face image; and training the living attack detection model based on fourth difference information between the prediction style and the labeling style of the sample face image.
In a possible implementation manner, the training module is further configured to determine a gradient value corresponding to the current iteration training based on the first difference information, the second difference information, the third difference information, and the fourth difference information; and training the living body attack detection model based on the gradient value.
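The specification does not fix the exact form of these difference terms. As a hedged sketch, one consistent reading combines a label loss (first difference), a triplet-style contrast between the sample combination feature and its positive and negative counterparts (second and third differences), and a style-discrimination loss (fourth difference) into one weighted objective whose gradient drives an iteration; the weights, margin, and distance choice below are assumptions, not values given by the specification:

import torch
import torch.nn.functional as F

def total_loss(pred, label, combo, pos_combo, neg_combo, style_pred, style_label,
               w1=1.0, w2=0.5, w3=0.5, margin=0.2):
    """Combine the four difference terms into one scalar used for the gradient update (illustrative)."""
    first = F.binary_cross_entropy(pred, label)                                   # prediction vs annotation label
    second = F.pairwise_distance(combo.flatten(1), pos_combo.flatten(1)).mean()   # same style: pull together
    third = F.pairwise_distance(combo.flatten(1), neg_combo.flatten(1)).mean()    # different style: push apart
    fourth = F.cross_entropy(style_pred, style_label)                             # style discrimination unit
    return w1 * first + w2 * torch.clamp(second - third + margin, min=0.0) + w3 * fourth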
In a possible implementation manner, the apparatus further includes a prediction tag determination module, configured to perform full connection and normalization on the sample combination features through the living attack detection model, and output a classification value of the sample combination features; and determining a prediction label corresponding to the sample combination characteristic based on the classification value and the classification value threshold.
In a possible implementation manner, the feature processing module is further configured to input the reference face image into the living attack detection model; and performing feature extraction and feature splitting on the reference face image through the living body attack detection model to obtain style features of the reference face image.
In a possible implementation manner, the feature processing module is further configured to perform feature extraction on the reference face image to obtain image features of the reference face image; perform feature splitting on the image features of the reference face image to obtain initial style features of the reference face image; and encode the initial style features of the reference face image based on an attention mechanism to obtain the style features of the reference face image.
In one aspect, an embodiment of the present specification provides a living body attack detection apparatus, including:
the target face image input module is used for inputting the target face image into the living body attack detection model;
the feature extraction module is used for extracting features of the target face image through the living body attack detection model to obtain image features of the target face image;
the prediction module is used for predicting based on the image characteristics of the target face image through the living body attack detection model and outputting a label of the target face image, wherein the label is used for indicating whether the target face image is a living body attack image;
the living attack detection model is obtained by training based on first difference information between a prediction label and an annotation label of a reference face image, the prediction label is determined based on sample combination characteristics, and the sample combination characteristics are obtained by combining content characteristics of the sample face image and style characteristics of the reference face image.
In one aspect, embodiments of the present specification provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method.
In one aspect, an embodiment of the present specification provides an electronic device, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method.
In one aspect, embodiments of the present specification provide a computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method described above.
According to the technical scheme provided by the embodiment of the specification, the sample face image is input into the living body attack detection model, and the living body attack detection model is used for carrying out feature extraction and feature splitting on the sample face image to obtain the content features of the sample face image. And combining the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics. Training a living attack detection model based on first difference information between a prediction label corresponding to the sample combination characteristics and an annotation label of the reference face image, so that the living attack detection model has the capability of detecting the living attack. The living attack detection model is used before face recognition, so that living attacks can be recognized in time, and the safety of face recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present specification, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of an implementation environment of a training method for a living body attack detection model provided in an embodiment of the present specification;
FIG. 2 is a flowchart of a training method for a living body attack detection model provided in an embodiment of the present specification;
FIG. 3 is a flowchart of another training method for a live attack detection model provided in an embodiment of the present disclosure;
fig. 4 is an architecture diagram of a training method of a living body attack detection model provided in an embodiment of the present specification;
FIG. 5 is a flowchart of a method for detecting a live attack provided by an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a training apparatus for a living body attack detection model provided in an embodiment of the present specification;
FIG. 7 is a schematic structural diagram of a living body attack detection apparatus provided in an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
In order to make the features and advantages of the present specification more apparent and understandable, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only a part of the embodiments of the present specification, and not all the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without making any creative effort fall within the protection scope of the present specification.
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Biometric recognition: biometric identification technology closely combines computers with high-tech means such as optics, acoustics, biological sensors, and biostatistics, and identifies individuals by using the inherent physiological characteristics of the human body (such as fingerprints, facial images, and irises) and behavioral characteristics (such as handwriting, voice, and gait).
Face recognition: face recognition is a biometric technique that identifies a person based on facial feature information. It covers a series of related technologies, also commonly called portrait recognition or facial recognition, that use a camera or video camera to collect images or video streams containing faces, automatically detect and track the faces in the images, and then recognize the detected faces.
Living body detection: a method for determining the real physiological characteristics of a subject in certain identity verification scenarios. In face recognition applications, living body detection can verify whether the user is a real live person by combining actions such as blinking, opening the mouth, shaking the head, and nodding with technologies such as face key point positioning and face tracking. It can effectively resist common attack means such as photos, videos, face swapping, masks, occlusions, 3D animations, and screen re-shooting, thereby helping to screen out fraudulent behavior and safeguard users' interests.
Normalization: mapping arrays with different value ranges to the (0, 1) interval, which facilitates data processing. In some cases, the normalized values can be used directly as probabilities.
Random inactivation (Dropout): a method for optimizing artificial neural networks with deep structures. It reduces interdependence among nodes by randomly zeroing part of the weights or outputs of hidden layers during learning, thereby regularizing the neural network and reducing its structural risk. For example, during model training, if the vector (1, 2, 3, 4) is input into a dropout layer, the layer may randomly set one of its entries to 0, for example turning 2 into 0, so that the vector becomes (1, 0, 3, 4).
Learning Rate: in gradient descent, the learning rate governs how the model uses the gradient of the loss function to adjust the network weights. If the learning rate is too large, the loss function may step directly over the global optimum, and the loss remains too large; if the learning rate is too small, the loss function changes slowly, which greatly increases the convergence complexity of the network and easily traps it in a local minimum or saddle point.
Embedded Coding: embedded coding mathematically represents a correspondence, that is, data in a space X is mapped to a space Y through a function F, where F is injective and structure-preserving. Injective means that each mapped data point corresponds uniquely to a data point before mapping; structure-preserving means that the order relationship of the data is preserved by the mapping. For example, if data X1 and X2 exist before mapping and are mapped to Y1 and Y2 respectively, then X1 > X2 implies Y1 > Y2. For words, embedding maps them into another space to facilitate subsequent machine learning and processing.
Attention weight: may represent the importance of certain data in the training or prediction process, the importance representing the magnitude of the impact of the input data on the output data. The data of high importance has a high value of attention weight, and the data of low importance has a low value of attention weight. Under different scenes, the importance of the data is different, and the process of training attention weight of the model is the process of determining the importance of the data.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals referred to in the embodiments of the present description are authorized by the user or fully authorized by various parties, and the collection, use and processing of the relevant data need to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the face image referred to in the embodiments of the present specification is acquired with sufficient authorization.
Next, an environment for implementing the technical solution provided in the embodiments of the present specification will be described.
Fig. 1 is a schematic diagram of an implementation environment of a training method for a live attack detection model provided in an embodiment of the present specification, and referring to fig. 1, the implementation environment includes a terminal 110 and a server 120.
The terminal 110 is connected to the server 120 through a wireless network or a wired network. Optionally, the terminal 110 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, etc., but is not limited thereto. The terminal 110 is installed and operated with an application program supporting face recognition.
The server 120 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. The server 120 provides background services for applications running on the terminal 110. In the embodiments of this specification, the server 120 provides background services for the application that runs on the terminal and supports face recognition, for example, training a living attack detection model for living attack detection, or using the living attack detection model to determine whether a living attack has occurred.
Those skilled in the art will appreciate that the number of terminals 110 and servers 120 described above may be greater or fewer. For example, only one terminal 110 and one server 120 are provided, or several tens or hundreds of terminals 110 and servers 120 are provided, or more, at this time, other terminals and servers are also included in the implementation environment, and the number of terminals and the type of the device are not limited in the embodiments of the present specification.
After the implementation environment of the embodiment of the present specification is described, an application scenario of the embodiment of the present specification will be described below with reference to the implementation environment, in the following description, a terminal is a terminal 110 in the implementation environment, and a server is a server 120 in the implementation environment. The technical solution provided in the embodiment of the present specification can be applied to various scenarios in which a face recognition system is applied, for example, to various payment applications that provide a face-brushing payment function, or to various payment devices that provide a face-brushing payment function, or to various vending machines with a face-brushing payment function, or to various access control devices with face recognition, which is not limited in the embodiment of the present specification.
The technical scheme provided by the embodiment of the specification is applied to various payment applications providing a face-brushing payment function as an example, when the face-brushing payment function provided by the payment applications is used, a terminal acquires a target face image of a target object, the terminal is a terminal for running the payment applications, and the target object is a user using the terminal. And the terminal sends the target face image of the target object to the server, and the server acquires the target face image of the target object. The server inputs the target face image into a living body attack detection model obtained by training by adopting the technical scheme provided by the specification, classifies the target face image through the living body attack detection model, and determines whether the target face image is a living body attack image, so that the detection of the living body attack is realized.
In the above description, the technical solution provided in the embodiment of the present disclosure is applied to various payment applications providing a face-brushing payment function, and in the above other application scenarios, the living attack can be detected in the above manner, and the specific process is not described herein again.
After the implementation environment and the application scenario of the embodiment of the present specification are introduced, the technical solutions provided by the embodiments of the present specification are introduced below, with reference to fig. 2, where an execution subject is a server, and the method includes the following steps.
202. And the server inputs the sample face image into a living body attack detection model, and the living body attack detection model is used for determining whether the input face image is a living body attack image.
The sample face image is used for model training, and the collection and the use of the sample face image are fully authorized by a corresponding object of the sample face image. The live attack detection model is used for carrying out live attack detection based on an input image, namely the live attack detection model can identify whether the input image is a live attack image or not.
204. And the server performs feature extraction and feature splitting on the sample face image through the living body attack detection model to obtain the content features of the sample face image, wherein the content features of the sample face image are used for expressing the image content of the sample face image.
The process of extracting the characteristics of the sample face image is also the process of abstractly expressing the sample face image, and the extraction of the characteristics of the sample face image is beneficial to the processing of the sample face image by the living attack detection model. The feature splitting is used for splitting the image features obtained by feature extraction so as to utilize the image features with finer granularity. The content features of the sample face image are used to represent the image content of the sample face image, for example, the content features are semantic features or face attribute features.
206. And the server combines the content characteristics of the sample face image with the style characteristics of the reference face image to obtain sample combination characteristics, wherein the style characteristics are used for expressing the image style of the reference face image.
The reference face image is a face image different from the sample face image, and the image style of the reference face image may be the same as or different from that of the sample face image, which is not limited in this specification. In the embodiments of the present specification, the image style of the reference face image is represented by the style feature. In some embodiments, the image style is also referred to as an image domain, and accordingly, two images having the same image style may also be referred to as two images in the same image domain.
208. The server trains the living attack detection model based on first difference information between a prediction label and an annotation label of the reference face image, wherein the prediction label is determined by the living attack detection model based on the sample combination characteristics, and the annotation label is used for indicating whether the reference face image is a living attack image or not.
The annotation label is configured by technical personnel according to the actual type of the reference face image, which is either a living attack image or a non-living attack image. The prediction label is determined by the living body attack detection model based on the sample combination feature. Because the style features of an image are strongly correlated with the living attack detection result, and the sample combination feature carries the style features of the reference face image, the living body attack detection model can be trained based on the prediction label and the annotation label of the reference face image. The purpose of training is to reduce the first difference information as much as possible, that is, to make the prediction label and the annotation label as close as possible, so that the living body attack detection model acquires the capability of detecting living attacks.
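As a hedged sketch of one training iteration, assuming the first difference information is a binary cross-entropy between the classification value and the annotation label; the helper methods content_features, style_features, and classify are hypothetical names for the steps described above, not interfaces defined by the specification:

import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample_image, reference_image, reference_label):
    """One iteration: combine sample content with reference style, then fit the reference label."""
    content = model.content_features(sample_image)        # hypothetical helper: feature extraction + splitting
    style = model.style_features(reference_image)          # hypothetical helper: feature extraction + splitting
    combined = content + style                              # sample combination feature (element-wise addition)
    pred = model.classify(combined)                          # classification value in (0, 1)
    # reference_label: float tensor of 0./1. with the same shape as pred
    loss = F.binary_cross_entropy(pred, reference_label)    # first difference information
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()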
It should be noted that, the above steps are described by taking a round of iterative training of the living attack detection model as an example, multiple rounds of iterations are required to train the living attack detection model, the process of each iteration and the above description belong to the same inventive concept, and different sample face images and reference face images can be replaced, which is not described herein again.
According to the technical scheme provided by the embodiment of the specification, the sample face image is input into the living body attack detection model, and the living body attack detection model is used for carrying out feature extraction and feature splitting on the sample face image to obtain the content features of the sample face image. And combining the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics. Training a living attack detection model based on first difference information between a prediction label corresponding to the sample combination characteristics and an annotation label of the reference face image, so that the living attack detection model has the capability of detecting the living attack. The living attack detection model is used before face recognition, so that living attacks can be recognized in time, and the safety of face recognition is improved.
Steps 202 to 208 above are a brief description of the technical solution provided by the embodiments of this specification. To describe the technical solution more clearly, it is explained below with reference to some examples. Training the model involves multiple iterations; one iteration is described below as an example, and the other iterations belong to the same inventive concept. Referring to fig. 3, the method includes the following steps.
302. The server obtains a sample face image.
The sample face image is used for model training, and the collection and the use of the sample face image are fully authorized by the corresponding object of the sample face image. In some embodiments, the sample face image is a face image subjected to anonymization, and the sample face image cannot be associated with an object corresponding to the sample face image.
It should be noted that training a living attack detection model requires a plurality of iterative training processes, and the sample face image is a training sample used in one iterative training process.
In one possible implementation, the server obtains a sample face image from the terminal.
The terminal is used by a technician, and the technician can upload the sample face image to the server through the terminal.
In the embodiment, the server can acquire the sample face image from the terminal, and provides more abundant choices for technicians when training the living body attack detection model.
For example, the terminal displays a sample selection interface for selecting a sample face image. Responding to the operation on the sample selection interface, the terminal uploads the sample face image selected on the sample selection interface to the server, and the server acquires the sample face image.
In one possible implementation, the server obtains the sample face image from a face image database in which a plurality of face images are stored. It should be noted that the face images stored in the face image database are all anonymized face images.
In this embodiment, the server can acquire the sample face image from the face image database, and the acquisition efficiency of the sample face image is high.
It should be noted that the server can acquire the sample face image by using any of the above manners, which is not limited in the embodiment of the present specification.
In some embodiments, after the server acquires the sample face image, the server can further perform quality scoring on the sample face image to obtain an image quality score of the sample face image. And in the case that the image quality score of the sample face image is greater than or equal to the image quality score threshold value, the server performs subsequent model training based on the sample face image. And under the condition that the image quality score of the sample face image is smaller than the image quality score threshold value, the server acquires the sample face image again. The image quality score is used for representing the quality of the sample face image, the higher the image quality score is, the better the quality of the sample face image is, and the better the model training effect based on the sample face image is. The lower the image quality score, the poorer the quality of the sample face image, and the poorer the effect of model training based on the sample face image. The image quality score threshold is set by a technician according to actual conditions, and is not limited by the embodiment of the present specification.
Through the implementation mode, the server can perform quality grading on the sample face image before training the living body attack detection model based on the sample face image, and filter the sample face image based on the image quality grading of the sample face image, so that the living body attack detection model is trained by the sample face image with higher quality, and the training effect of the living body attack detection model is ensured.
For example, the server determines the Image Quality score of the sample face Image by Subjective Image Quality Assessment (S-IQA) or Objective Image Quality Assessment (O-IQA). And under the condition that the image quality score of the sample face image is greater than or equal to the image quality score threshold value, the server executes a step of subsequently training a living body attack detection model based on the sample face image. And under the condition that the image quality score of the sample face image is smaller than the image quality score threshold value, the server acquires the sample face image again.
For example, the server determines the image quality score of the sample face image by means of a Mean Opinion Score (MOS), full-reference (FR-IQA), reduced-reference (RR-IQA), or no-reference (NR-IQA) assessment, which is not limited in the embodiments of the present specification.
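A minimal sketch of the quality-gating logic described above; score_fn and the threshold are placeholders, and any of the IQA approaches mentioned could stand behind score_fn:

def filter_by_quality(images, score_fn, threshold=0.6):
    """Keep only sample face images whose quality score reaches the threshold (illustrative)."""
    kept, rejected = [], []
    for img in images:
        if score_fn(img) >= threshold:   # score_fn is a placeholder IQA scorer, e.g. an NR-IQA model
            kept.append(img)
        else:
            rejected.append(img)         # these samples would be re-acquired by the server
    return kept, rejected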
304. And the server inputs the sample face image into a living body attack detection model, and the living body attack detection model is used for determining whether the input face image is a living body attack image.
The live attack detection model is used for carrying out live attack detection based on the input face image, namely the live attack detection model can identify whether the input face image is a live attack face image or not. The face image is a live attack face image, which means that the face image is a face image for live attack and needs to be discovered and processed in time; the fact that the face image is not a live attack face image means that the face image is not a face image for live attack, and normal face recognition can be performed based on the face image.
In some embodiments, the living attack detection model includes a feature extraction unit and a classification unit, and the feature extraction unit is configured to perform feature extraction on an input face image to obtain a face image feature. The classification unit is used for classifying the face image based on the face image characteristics, namely determining whether the face image is a living attack face image. In the embodiment of the present specification, the training of the living attack detection model is to train the feature extraction unit and the classification unit.
306. And the server performs feature extraction and feature splitting on the sample face image through the living body attack detection model to obtain the content features of the sample face image, wherein the content features of the sample face image are used for expressing the image content of the sample face image.
The process of extracting the characteristics of the sample face image is also the process of performing abstract expression on the sample face image, and the process of extracting the characteristics of the sample face image is beneficial to the living body attack detection model to process the sample face image. The feature splitting is used for splitting the image features obtained by feature extraction so as to utilize the image features with finer granularity. The content feature of the sample face image is used to represent the image content of the sample face image, for example, the content feature is a semantic feature or a face attribute feature.
In some embodiments, feature splitting is to split an image feature into a content feature and a style feature, the content feature being used to represent the image content of the image, the content feature also being referred to as a generic feature. The style feature is used to represent a style of the image, which in some embodiments is also referred to as a domain. For example, for two images with the same content but different styles, the content features of the two images are the same, and the style features are different; for two images with different contents but the same style, the style characteristics of the two images are the same, and the content characteristics are different. By carrying out feature splitting on the image features, the image features can be divided into content features and style features with finer granularity, the image features can be used under the finer granularity based on the content features and the style features, and the generalization capability of the living body attack detection model is improved.
In a possible implementation manner, the server performs feature extraction on the sample face image through the living body attack detection model to obtain image features of the sample face image, where the image features include content features and style features. And the server performs characteristic splitting on the image characteristics of the sample face image to obtain the content characteristics of the sample face image.
The feature extraction of the sample face image is performed by the server through the living body attack detection model, and the feature splitting of the obtained image feature may be performed by the server through the living body attack detection model or may be performed directly by the server, which is not limited in the embodiment of the present specification. The image features comprise content features and style features, which means that the image features carry the content of the sample face image and the style of the sample face image.
In this embodiment, the server can perform feature extraction on the sample face image through the living body attack detection model, perform feature splitting on the obtained image features, and obtain the content features of the sample face image, so that the living body attack detection model can be trained subsequently based on the content features of the sample face image.
In order to more clearly explain the above embodiment, the following description will be divided into two parts.
The first part is that the server performs feature extraction on the sample face image through the living body attack detection model to obtain the image features of the sample face image, wherein the image features comprise content features and style features.
In a possible implementation manner, the server performs at least one convolution on the sample face image through the living body attack detection model to obtain the image characteristics of the sample face image.
In the above embodiment, the server can extract the image features of the sample face image by convolution operation, and since the convolution operation is fast, the server can also quickly complete feature extraction. Meanwhile, the convolution operation can carry out deep feature extraction on the sample face image, and the obtained image features have strong expression capability.
For example, the server inputs the sample face image into the feature extraction unit of the living body attack detection model, and performs convolution operation on the sample face image for a plurality of times through a plurality of convolution layers of the feature extraction unit to obtain the image feature of the sample face image. In some embodiments, the feature extraction unit is the base network of the ResNet 18.
For example, for a first convolution layer of the plurality of convolution layers, the server performs a sliding operation on the sample face image by using a plurality of convolution kernels through the first convolution layer, and performs a convolution operation with the covered portion in the sliding process to obtain a plurality of convolution features corresponding to the plurality of convolution kernels respectively, wherein the convolution features correspond to the convolution kernels one to one. And the server fuses the plurality of convolution characteristics through the first convolution layer to obtain a first characteristic diagram of the sample face image, wherein the first characteristic diagram is the image characteristic extracted by the first convolution layer. For a second convolutional layer in the plurality of convolutional layers, the server inputs the first feature map into the second convolutional layer, slides on the first feature map by adopting a plurality of convolutional kernels through the second convolutional layer, and performs convolutional operation with the covered part in the sliding process to obtain a plurality of convolutional features respectively corresponding to the plurality of convolutional kernels, wherein the convolutional features correspond to the convolutional kernels one to one. And the server fuses the plurality of convolution characteristics through the second convolution layer to obtain a second characteristic diagram of the sample face image, wherein the second characteristic diagram is the image characteristic extracted by the second convolution layer. And in the same way, the feature map output by the last convolutional layer in the plurality of convolutional layers is the image feature of the sample face image. In some embodiments, the number of the plurality of convolution kernels is an integer multiple of the number of color channels of the sample face image.
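The specification mentions that in some embodiments the feature extraction unit is the ResNet-18 base network. A minimal sketch under that assumption (the input size and the exact layers kept are illustrative, not mandated):

import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractionUnit(nn.Module):
    """Stacked convolutions that map a face image to its image features."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)                                   # ResNet-18 base network
        self.conv_layers = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc head

    def forward(self, face_image):
        return self.conv_layers(face_image)                                 # feature map of the face image

extractor = FeatureExtractionUnit()
sample = torch.randn(1, 3, 224, 224)        # one RGB sample face image (assumed resolution)
image_features = extractor(sample)           # e.g. shape (1, 512, 7, 7)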
In a possible implementation manner, the server performs at least one full connection on the sample face image through the living body attack detection model to obtain the image features of the sample face image.
In the above embodiment, the server can extract the image features of the sample face image in a full-connection manner, and since the full-connection speed is high, the server can also quickly complete feature extraction.
For example, the server inputs the sample face image into the feature extraction unit of the living body attack detection model, and performs full connection on the sample face image for multiple times through multiple full connection layers of the feature extraction unit to obtain the image features of the sample face image.
For example, for a first full-connected layer of the multiple full-connected layers, the server multiplies the sample face image by the full-connected matrix of the first full-connected layer to obtain a first feature map of the sample face image, where the first feature map is an image feature extracted by the first full-connected layer. For a second full-connection layer in the multiple full-connection layers, the server multiplies the first feature map by a full-connection matrix of the second full-connection layer to obtain a second feature map of the sample face image, wherein the second feature map is the image feature extracted by the second full-connection layer. And by analogy, the feature map output by the last full connection layer in the multiple full connection layers is the image feature of the sample face image.
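A minimal sketch of the fully connected alternative, assuming the face image is first flattened into a vector; the input resolution and layer widths are illustrative only:

import torch
import torch.nn as nn

fc_extractor = nn.Sequential(
    nn.Flatten(),                      # flatten the face image into a vector
    nn.Linear(3 * 112 * 112, 1024),    # first full-connection layer (illustrative sizes)
    nn.ReLU(),
    nn.Linear(1024, 512),              # second full-connection layer
)
sample = torch.randn(1, 3, 112, 112)
image_features = fc_extractor(sample)  # feature vector of the sample face image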
In a possible implementation manner, the server encodes the sample face image based on an attention mechanism through the living body attack detection model to obtain the image features of the sample face image.
In this embodiment, the sample face image can be encoded by using an attention mechanism, so that the obtained image features can reflect the sample face image more accurately by using the association between different parts of the sample face image.
For example, the server divides the sample face image into a plurality of face image blocks. The server carries out embedded coding on the plurality of face image blocks to obtain a plurality of embedded features of the plurality of face image blocks, and one face image block corresponds to one embedded feature. And the server determines the attention weight between every two face image blocks based on a plurality of embedded features of the face image blocks through the attention coding layer of the living body attack detection model. The server determines a plurality of attention features of the plurality of face image blocks through an attention coding layer of the living body attack detection model based on attention weights between every two face image blocks and a plurality of embedded features of the plurality of face image blocks, wherein one attention feature corresponds to one face image block. And the server fuses the attention features of the human face image blocks through the attention coding layer of the living body attack detection model to obtain the image features of the sample human face image.
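A hedged sketch of this attention-based alternative, assuming a ViT-style reading in which the face image is divided into patch blocks, each block is embedded, attention weights are computed between every two blocks, and the attention features are fused; patch size, embedding dimension, and head count are assumptions:

import torch
import torch.nn as nn

class PatchAttentionEncoder(nn.Module):
    """Split a face image into patches, embed them, and fuse them with self-attention."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # embedded coding per patch
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    def forward(self, face_image):
        x = self.to_patches(face_image)              # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)             # (B, num_patches, dim) embedded features
        attended, weights = self.attn(x, x, x)       # attention weights between every two patch blocks
        return attended.mean(dim=1)                  # fuse the attention features into image features

encoder = PatchAttentionEncoder()
image_features = encoder(torch.randn(1, 3, 224, 224))   # (1, 256)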
The server may extract the image features of the sample face image by any of the above methods, which is not limited in the embodiment of the present specification.
And a second part, performing feature splitting on the image features of the sample face image by the server to obtain the content features of the sample face image.
In a possible implementation manner, the server performs at least one convolution on the image features of the sample face image to obtain the content features of the sample face image.
In this embodiment, the server can perform feature splitting on the image features in a convolution manner to obtain the content features of the sample face image, and the convolution can perform deeper processing on the image features to realize accurate splitting of the image features.
For example, the server inputs the image features of the sample face image into a content feature extraction unit, and performs at least one convolution on the image features through a convolution layer of the content feature extraction unit to obtain the content features of the sample face image, where the content feature extraction unit is a trained content feature extraction unit and is capable of further extracting the content features from the image features, and the embodiment of the present specification does not limit the type and structure of the content feature extractor. In some embodiments, the content feature extraction unit belongs to the live attack detection model, and the content feature extraction unit is used only when the live attack detection model is trained, and is not used when the live attack detection model is used.
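A minimal sketch of a content feature extraction unit operating on the image features; the channel counts are assumptions rather than values fixed by the specification:

import torch
import torch.nn as nn

content_head = nn.Sequential(                            # content feature extraction unit (illustrative)
    nn.Conv2d(512, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),       # at least one convolution on the image features
)
image_features = torch.randn(1, 512, 7, 7)               # image features from the feature extraction unit
content_features = content_head(image_features)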
Optionally, in the feature splitting process, the server can also extract style features of the sample face image.
In a possible implementation manner, the server performs at least one convolution on the image features of the sample face image to obtain the style features of the sample face image, where this convolution uses parameters different from those used to extract the content features.
In this embodiment, the server can perform feature splitting on the image features in a convolution mode to obtain style features of the sample face image, and the convolution can perform deeper processing on the image features to realize accurate splitting of the image features.
For example, the server inputs the image features of the sample face image into a style feature extraction unit, and performs at least one convolution on the image features through the convolution layers of the style feature extraction unit to obtain the initial style features of the sample face image. The server then encodes the initial style features of the sample face image based on an attention mechanism to obtain the style features of the sample face image. The style feature extraction unit is a trained unit that can further extract style features from the image features; its type and structure are not limited in the embodiments of the present specification. In some embodiments, the style feature extraction unit belongs to the living attack detection model and is only used when the living attack detection model is trained, not when the model is used for detection.
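A corresponding sketch of a style feature extraction unit: a convolution produces the initial style features, which are then re-weighted by a simple channel-attention block; this channel-attention reading of the attention encoding is an assumption, not the specification's required design:

import torch
import torch.nn as nn

class StyleHead(nn.Module):
    """Convolve image features into initial style features, then re-weight them with attention."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # initial style features
        self.attn = nn.Sequential(                                            # attention over channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, image_features):
        style = self.conv(image_features)
        return style * self.attn(style)        # attention-encoded style features

style_head = StyleHead()
style_features = style_head(torch.randn(1, 512, 7, 7))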
It should be noted that, after the content features of the sample face image are extracted, the server can train the living attack detection model based on the content features through subsequent steps; after extracting the style features of the sample face image, the server can also train the living body attack detection model by using the style features, for example, the server combines the style features with the content features of other sample face images to obtain new sample combination features to train the living body attack detection model.
In some embodiments, after the server extracts the content features and the style features of any sample face image, the content features and the style features can be stored in a feature pool, so that the content features and the style features can be randomly acquired from the feature pool when the living body attack detection model is trained subsequently, and the diversity of the features is improved.
308. And the server combines the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics, wherein the style characteristics are used for expressing the image style of the reference face image.
The reference face image is a face image different from the sample face image, and the image style of the reference face image may be the same as or different from that of the sample face image, which is not limited in this specification. The style characteristics of the reference face image are obtained by the server based on the living body attack detection model. In the embodiment of the present specification, the image style of the reference face image is represented by the style feature, and the style feature includes a domain-related distinctive feature and a living body-related style feature. In some embodiments, the image style is also referred to as image domain, and accordingly, two images having the same image style may also be referred to as the two images being in the same image domain. In some embodiments, the reference face image is a randomly determined sample face image.
In a possible implementation manner, the server adds the content features of the sample face image and the style features of the reference face image to obtain the sample combination features.
Wherein, the image represented by the sample combination feature has the content of the sample face image and the style of the reference face image simultaneously. By generating the sample combination characteristics, the combination mode between the content characteristics and the style characteristics can be enriched, so that the adaptability of the in-vivo attack detection model to images of different styles is improved, namely the generalization capability of the in-vivo attack detection model is improved.
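A minimal sketch of the combination step, assuming the content features and style features have identical shapes so they can be added element-wise; the tensor shapes are illustrative:

import torch

def combine(content_features, style_features):
    """Sample combination feature: content of one image plus the style of a reference image."""
    return content_features + style_features    # element-wise addition; identical shapes assumed

content = torch.randn(1, 512, 7, 7)             # content features of the sample face image
reference_style = torch.randn(1, 512, 7, 7)     # style features of the reference face image
sample_combination = combine(content, reference_style)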
Optionally, the method for acquiring the style characteristics of the reference face image includes the following steps.
In one possible embodiment, the server inputs the reference face image into the live attack detection model. And performing feature extraction and feature splitting on the reference face image through the living body attack detection model to obtain the style feature of the reference face image.
For example, the server inputs the reference face image into the living body attack detection model. The server performs feature extraction on the reference face image through the living body attack detection model to obtain the image features of the reference face image. The server then performs feature splitting on the image features of the reference face image to obtain the initial style features of the reference face image, and encodes the initial style features of the reference face image based on an attention mechanism to obtain the style features of the reference face image. Encoding the initial style features based on an attention mechanism further enhances their separability.
The method for extracting the features of the reference face image by the server through the living attack detection model and the method for extracting the features of the sample face image in the step 306 belong to the same inventive concept, and the implementation process refers to the related description of the step 306, which is not repeated herein. A method for acquiring the style characteristics of the reference face image based on the image characteristics of the reference face image by the server is described below.
For example, the server inputs the image features of the reference face image into a style feature extraction unit, and performs at least one convolution on the image features through a convolution layer of the style feature extraction unit to obtain the initial style features of the reference face image. The server then encodes the initial style features of the reference face image based on an attention mechanism to obtain the style features of the reference face image. The style feature extraction unit is a trained unit capable of further extracting style features from the image features, and the type and structure of the style feature extraction unit are not limited in the embodiments of the present specification.
It should be noted that, the server may perform feature extraction and feature splitting on the reference face image after step 306 or before step 302, and the embodiment of this specification is not limited thereto. In the case that the feature pool exists, the server can also directly acquire the style feature of the reference face image from the feature pool, which is not limited in the embodiment of the present specification.
310. And the server predicts based on the sample combination characteristics through the living body attack detection model to obtain a prediction label corresponding to the sample combination characteristics.
In a possible implementation mode, the server performs full connection and normalization on the sample combination features through the living body attack detection model, and outputs classification values of the sample combination features. And the server determines a prediction label corresponding to the sample combination characteristic based on the classification value and the classification value threshold.
The prediction label comprises two types, namely living body attack image and non-living-body attack image; that is, the living body attack detection model is a binary classification model.
In this embodiment, the server can classify the sample combination features through the living attack detection model to obtain a final prediction label of the sample combination features, and then can train the living attack detection model based on the prediction label.
For example, the server performs full connection and normalization on the sample combination feature through the living body attack detection model, and outputs the classification value of the sample combination feature. If the classification value is greater than or equal to the classification value threshold, the prediction label corresponding to the sample combination feature is determined to be a living body attack image; if the classification value is smaller than the classification value threshold, the prediction label corresponding to the sample combination feature is determined to be a non-living-body attack image.
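A minimal sketch of this classification step is given below, assuming a single fully connected layer with sigmoid normalization and a classification value threshold of 0.5; these choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LivenessClassifier(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)  # full connection over the sample combination feature

    def forward(self, combined_feature, threshold=0.5):
        score = torch.sigmoid(self.fc(combined_feature)).squeeze(-1)  # normalization -> classification value
        # classification value >= threshold -> predicted as living body attack image (label 1)
        prediction_label = (score >= threshold).long()
        return score, prediction_label
```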
When the trained living body attack detection model is later used to perform living body attack detection, images are classified in the same manner; that is, the above describes how the living body attack detection model acquires the capability of classifying images.
312. The server trains the living attack detection model based on first difference information between a prediction label and an annotation label of the reference face image, wherein the prediction label is determined by the living attack detection model based on the sample combination characteristics, and the annotation label is used for indicating whether the reference face image is a living attack image or not.
The annotation label is configured by technical personnel according to the actual type of the reference face image, and the type of the reference face image comprises a living attack image and a non-living attack image. The prediction label is determined by the living body attack detection model based on the sample combination characteristic, and because the relevance between the style characteristic of the image and the living body attack detection result is strong in the living body attack detection task, and the style characteristic of the reference face image is carried by the sample combination characteristic, the training of the living body attack detection model can be realized based on the prediction label and the annotation label of the reference face image. The purpose of training the in-vivo attack detection model is to reduce the first difference information as much as possible, that is, the prediction tag and the labeling tag are as close as possible, so that the in-vivo attack detection model has the capability of detecting in-vivo attacks.
In one possible embodiment, the server substitutes the prediction label and the annotation label into a first loss function, which is a cross-entropy loss function. And the server determines a first gradient of the model training based on the first difference information through the first loss function. And the server trains the living attack detection model based on the first gradient, namely, the model parameters of the living attack detection model are adjusted.
For example, the server substitutes the prediction label and the annotation label into a cross entropy loss function. And the server determines a first gradient of the model training based on the first difference information through the cross entropy loss function. The server performs back propagation in the in-vivo attack detection model based on a gradient descent mode, and trains a feature extraction unit and a classification unit of the in-vivo attack detection model.
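The training step on the first difference information could look like the following minimal sketch, assuming the classification value is a sigmoid output and that an optimizer over the model parameters has already been constructed; the optimizer choice and learning rate are assumptions.

```python
import torch.nn.functional as F

def train_step_first_difference(score, annotation_label, optimizer):
    # first difference information: cross entropy between the classification value for the
    # sample combination feature and the annotation label of the reference face image
    loss_cls = F.binary_cross_entropy(score, annotation_label.float())
    optimizer.zero_grad()
    loss_cls.backward()   # back propagation based on gradient descent (first gradient)
    optimizer.step()      # adjust the model parameters of the living body attack detection model
    return loss_cls.item()
```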
Optionally, in addition to training the living body attack detection model using the first difference information, the server may also train the living body attack detection model in at least one of the following ways.
In one possible implementation, the server trains the living body attack detection model based on second difference information between the sample combination feature and a positive sample combination feature, and third difference information between the sample combination feature and a negative sample combination feature, wherein the positive sample combination feature and the sample combination feature have the same style feature, and the negative sample combination feature and the sample combination feature have different style features.
The purpose of training the living attack detection model based on the second difference information and the third difference information is to reduce the second difference information as much as possible, and increase the third difference information as much as possible, that is, to shorten the distance between the sample combination feature and the positive sample combination feature and to lengthen the distance between the sample combination feature and the negative sample combination feature.
In some embodiments, the positive sample combination feature and the sample combination feature have different content features, and the negative sample combination feature and the sample combination feature have the same content features, and by configuring the positive sample combination feature and the negative sample combination feature in this way, the adaptability of the in-vivo attack detection model to the style can be further improved.
In this case, the purpose of training the live attack detection model based on the second difference information and the third difference information is to shorten the distance between the sample combination feature and a positive sample combination feature of a different content feature but a same style feature, and to lengthen the distance between the sample combination feature and a negative sample combination feature of the same content feature but a different style feature.
For example, the server substitutes the sample combination feature, the positive sample combination feature and the negative sample combination feature into a second loss function, which is a contrastive loss function. The server determines a second gradient for model training based on the second difference information and the third difference information through the second loss function. The server then trains the living body attack detection model based on the second gradient, that is, adjusts the model parameters of the living body attack detection model.
For example, the server substitutes the sample combination feature, the positive sample combination feature and the negative sample combination feature into the contrastive loss function, and determines the second gradient of the model training based on the second difference information and the third difference information through the contrastive loss function. The server performs back propagation in the living body attack detection model based on gradient descent, and trains the feature extraction unit and the classification unit of the living body attack detection model.
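One possible form of the second loss function is the triplet-style contrastive loss sketched below, where the second difference information is the distance to the positive sample combination feature and the third difference information is the distance to the negative sample combination feature; the use of cosine distance and the margin value are assumptions.

```python
import torch.nn.functional as F

def contrastive_loss(sample_feature, positive_feature, negative_feature, margin=0.3):
    d_pos = 1.0 - F.cosine_similarity(sample_feature, positive_feature, dim=-1)  # second difference information
    d_neg = 1.0 - F.cosine_similarity(sample_feature, negative_feature, dim=-1)  # third difference information
    # minimizing this loss shortens the distance to the positive combination
    # and lengthens the distance to the negative combination
    return F.relu(d_pos - d_neg + margin).mean()
```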
In one possible embodiment, the server inputs the content features of the sample face image into a style discrimination unit, performs prediction based on the content features of the sample face image through the style discrimination unit, and outputs the predicted style of the sample face image. The server then trains the living body attack detection model based on fourth difference information between the predicted style and the annotation style of the sample face image.
The style discrimination unit is used for predicting the style based on the content features of the image. The gradient is reversed between the style discrimination unit and the content feature extraction unit, so that the two units form an adversarial pair: the content feature extraction unit tries to extract content features whose style the style discrimination unit cannot determine, while the style discrimination unit tries to determine the style of the content features as accurately as possible. The purpose of training the living body attack detection model based on the fourth difference information is therefore to make the fourth difference information as large as possible, that is, the style corresponding to the content features becomes increasingly difficult for the style discrimination unit to determine.
For example, the server substitutes the predicted style and the annotation style into a third loss function, which is an adversarial loss function. The server determines a third gradient for model training based on the fourth difference information through the third loss function. The server then trains the living body attack detection model based on the third gradient, that is, adjusts the model parameters of the living body attack detection model.
For example, the server substitutes the predicted style and the annotation style into the adversarial loss function, and determines the third gradient of the model training based on the fourth difference information through the adversarial loss function. The server performs back propagation in the living body attack detection model based on gradient descent, and trains the content feature extraction unit of the living body attack detection model.
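The gradient reversal between the content feature extraction unit and the style discrimination unit can be sketched as follows, with the style discrimination unit shown as a simple linear classifier over the content features; the number of styles and the reversal coefficient are assumptions.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # reverse the gradient flowing back into the content feature extraction unit
        return -ctx.alpha * grad_output, None

class StyleDiscriminationUnit(nn.Module):
    def __init__(self, feat_dim=256, num_styles=4):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_styles)

    def forward(self, content_feature):
        reversed_feature = GradientReversal.apply(content_feature)
        return self.head(reversed_feature)  # predicted style logits, compared with the annotation style
```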
In addition, the server may train the living body attack detection model based on each of the first difference information, the second difference information, the third difference information and the fourth difference information separately, or may train the living body attack detection model based on all of them at the same time, as follows.
In a possible implementation manner, the server determines gradient values corresponding to the current iteration based on the first difference information, the second difference information, the third difference information, and the fourth difference information. And the server trains the living attack detection model based on the gradient value.
In this embodiment, the server can determine the gradient value by using multiple kinds of difference information at the same time, and the training speed of the model is increased.
For example, the server substitutes the first difference information, the second difference information, the third difference information, and the fourth difference information into a joint loss function, and determines a gradient value corresponding to the current iteration training through the joint loss function. And the server adjusts the model parameters of the living attack detection model in a back propagation mode based on the gradient value. In some embodiments, the live attack detection model is trained in an end-to-end manner.
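The joint loss function can be sketched as a weighted sum of the individual terms, as below; the weighting coefficients are assumptions and would be tuned in practice.

```python
def joint_loss(loss_classification, loss_contrastive, loss_adversarial,
               w_cls=1.0, w_con=0.5, w_adv=0.1):
    # one backward pass over this sum yields the gradient value for the current iteration
    return w_cls * loss_classification + w_con * loss_contrastive + w_adv * loss_adversarial
```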
All the above optional technical solutions may be combined arbitrarily to form an optional embodiment of the present specification, and are not described herein again.
It should be noted that, the above steps are described by taking a round of iterative training of the living attack detection model as an example, multiple rounds of iterations are required to train the living attack detection model, the process of each iteration and the above description belong to the same inventive concept, and different sample face images and reference face images can be replaced, which is not described herein again.
Referring to fig. 4, the server obtains a sample face image, where the sample face image includes two parts, namely content and style; the solid graphic in the sample face image represents the content and the dotted box represents the style. The server inputs the sample face image into the living body attack detection model, and performs feature extraction on the sample face image through the feature extraction unit of the living body attack detection model to obtain the image features of the sample face image. The server inputs the image features into the content feature extraction unit and the style feature extraction unit respectively. The server performs feature splitting on the image features through the content feature extraction unit to obtain the content features of the sample face image, and performs feature splitting on the image features through the style feature extraction unit to obtain the initial style features of the sample face image. The server encodes the initial style features based on an attention mechanism to obtain the style features of the sample face image. The server sends the content features and the style features of the sample face image into a feature pool, acquires the style features of a reference face image from the feature pool, and combines the content features of the sample face image with the style features of the reference face image to obtain the sample combination features. The server classifies the sample combination features through the classification unit of the living body attack detection model to obtain the prediction label corresponding to the sample combination features. The server trains the living body attack detection model based on first difference information between the prediction label and the annotation label of the reference face image, second difference information between the sample combination feature and a positive sample combination feature, third difference information between the sample combination feature and a negative sample combination feature, and fourth difference information between the predicted style and the annotation style of the sample face image, where the predicted style is determined based on the content features of the sample face image. Training based on the first difference information is training based on a cross-entropy loss function, also called a classification loss function; training based on the second difference information and the third difference information is training based on a contrastive loss function; training based on the fourth difference information is training based on an adversarial loss function, and the adversarial mechanism in the embodiments of the present specification is implemented by a GRL (Gradient Reversal Layer).
According to the technical scheme provided by the embodiment of the specification, the sample face image is input into the living body attack detection model, and the living body attack detection model is used for carrying out feature extraction and feature splitting on the sample face image to obtain the content features of the sample face image. And combining the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics. Training a living attack detection model based on first difference information between a prediction label corresponding to the sample combination characteristics and an annotation label of the reference face image, so that the living attack detection model has the capability of detecting living attacks. The living attack detection model is used before face recognition, so that living attacks can be recognized in time, and the safety of face recognition is improved.
In other words, the technical solution provided in the embodiments of the present specification separates general content features and style features from the basic image features, where the content features are characterized as semantic features and human body attribute features, and the style features are characterized as discriminative features that assist the living body attack detection classification, such as domain-related distinctive features and living-body-related style features. In addition, the two separated kinds of features are randomly combined in a hybrid manner, which enriches the feature diversity between them. Finally, through the joint optimization of adversarial learning and contrastive learning, the model acquires better adaptability to different styles, so as to achieve cross-domain generalization in the final actual deployment.
The embodiment of the present specification further provides a method for detecting a living body attack, and referring to fig. 5, taking an execution subject as a server as an example, the method includes the following steps.
502. And the server inputs the target face image into the living body attack detection model.
The living attack detection model is obtained by training based on first difference information between a prediction label and an annotation label of a reference face image, the prediction label is determined based on sample combination characteristics, and the sample combination characteristics are obtained by combining content characteristics of the sample face image and style characteristics of the reference face image. The training method of the living attack detection model is described in the relevant description of the above steps 302-312. The target face image is a face image of a target object, the target face image is used for using face recognition service, the face recognition service is used for identity verification, and the target object is a user using the face recognition service.
In some embodiments, the target face image is acquired while the target object is in use with a face recognition service, the acquisition and use of the target face image being sufficiently authorized by the target object. In addition, the server does not store the target face image of the target object, and after face recognition is completed based on the target face image, the server deletes the target face image or anonymizes the target face image, so that the association between the target face image and the target object is eliminated, and the privacy of the target object is protected.
In one possible implementation mode, the server obtains a target face image of a target object through the terminal.
The terminal is a terminal running with payment applications, or a payment device providing a face-brushing payment function, or an automatic vending machine with a face-brushing payment function, or an access control device with face recognition, and the like, and the embodiment of the specification does not limit the terminal.
In this embodiment, the server can quickly acquire the target face image of the target object from the terminal, and can subsequently perform authentication on the target object based on the target face image.
For example, the server obtains a verification image uploaded by the terminal, and the verification image is used for identity verification. And carrying out face recognition on the verification image to obtain a face area in the verification image. The server cuts out the face area from the verification image to obtain the target face image. The verification image is an image acquired by a terminal when a target object uses a face recognition service, and comprises a background area and a face area, wherein the face area is more concerned than the background area in the face recognition process. Through the technical scheme provided in the example, the server can acquire the verification image from the terminal and cut the verification image, so that the target face image is obtained, and the living attack detection of a subsequent living attack detection model is facilitated.
For example, in response to a face recognition operation, the terminal displays a face recognition interface for prompting the target object that face recognition is about to start. After a target duration, the terminal acquires a verification image of the target object, where the target duration is set by a technician according to the actual situation and is not limited in the embodiments of the present specification. The terminal sends the captured verification image of the target object to the server, and the server obtains the verification image of the target object. The server inputs the verification image into a face region recognition model, and processes the verification image through the face region recognition model to obtain the position of the face region in the verification image, where the face region recognition model is used for recognizing the region where the face is located in an image; the face region recognition model may adopt a model of any structure, which is not limited in the embodiments of the present specification. The server crops the verification image based on the position of the face region in the verification image to obtain the face region in the verification image, where the face region is the target face image of the target object.
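A minimal sketch of cropping the target face image from the verification image is shown below; the detect_face_region callable stands in for the face region recognition model and is a hypothetical placeholder, as is the (left, top, right, bottom) box convention.

```python
from PIL import Image

def crop_target_face(verification_image_path, detect_face_region):
    verification_image = Image.open(verification_image_path).convert("RGB")
    # detect_face_region is assumed to return the face region as (left, top, right, bottom)
    left, top, right, bottom = detect_face_region(verification_image)
    target_face_image = verification_image.crop((left, top, right, bottom))
    return target_face_image
```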
In one possible embodiment, the server obtains a target face image of the target object from an object image database storing face images of a plurality of objects. In some embodiments, for any one of the plurality of objects, after the terminal used by the object acquires the facial image of the object, the facial image of the object is uploaded to an object image database, the server acquires the facial image of the object from the object image database, and a subsequent face recognition operation is performed based on the facial image of the object. Certainly, after the server processes the face images in the object image database, the face images in the object image database are deleted, so that the privacy of the object is protected, and the abuse of the face images is avoided.
In this embodiment, the object image database serves as a relay for the face images, so that data loss caused by the server being unable to process a large number of concurrent face recognition and living body attack detection tasks in time is avoided, and the success rate of face recognition and living body attack detection is improved.
504. And the server performs feature extraction on the target face image through the living body attack detection model to obtain the image features of the target face image.
The process of extracting the features of the target face image through the living attack detection model and the process of extracting the features in the step 306 belong to the same inventive concept, and the implementation process refers to the related description of the step 306, which is not described herein again.
506. And the server predicts based on the image characteristics of the target face image through the living body attack detection model and outputs a label of the target face image, wherein the label is used for indicating whether the target face image is a living body attack image.
The method for predicting based on the image features of the target face image and the process for predicting based on the sample combination features in step 310 belong to the same inventive concept, and the implementation process refers to the related description of step 310, which is not described herein again.
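Putting steps 502-506 together, the detection flow could be sketched as follows, assuming the trained model takes a preprocessed image tensor and returns a single classification value; the input size, the threshold and the model interface are assumptions for illustration.

```python
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@torch.no_grad()
def detect_living_body_attack(model, target_face_image, threshold=0.5):
    x = preprocess(target_face_image).unsqueeze(0)  # (1, 3, 224, 224)
    classification_value = model(x)                 # feature extraction and prediction inside the model
    is_attack = classification_value.item() >= threshold
    return "living body attack image" if is_attack else "non-living-body attack image"
```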
Through the technical scheme provided by the embodiment of the specification, the living attack detection can be quickly realized through the living attack detection model, the living attack detection efficiency and accuracy are high, and the safety of face recognition is improved.
Fig. 6 is a schematic structural diagram of a training apparatus for a living body attack detection model provided in an embodiment of the present specification, and referring to fig. 6, the apparatus includes: a sample face image input module 601, a feature processing module 602, a feature combination module 603, and a training module 604.
The sample face image input module 601 is configured to input a sample face image into a living body attack detection model, where the living body attack detection model is configured to determine whether the input face image is a living body attack image.
The feature processing module 602 is configured to perform feature extraction and feature splitting on the sample face image through the living attack detection model to obtain a content feature of the sample face image, where the content feature of the sample face image is used to represent image content of the sample face image.
The feature combination module 603 is configured to combine the content features of the sample face image with the style features of the reference face image to obtain sample combination features, where the style features are used to represent an image style of the reference face image.
A training module 604, configured to train the living body attack detection model based on first difference information between a prediction label and an annotation label of the reference face image, where the prediction label is determined by the living body attack detection model based on the sample combination feature, and the annotation label is used to indicate whether the reference face image is a living body attack image.
In a possible implementation manner, the feature processing module 602 is configured to perform feature extraction on the sample face image to obtain image features of the sample face image, where the image features include content features and style features. And carrying out feature splitting on the image features of the sample face image to obtain the content features of the sample face image.
In one possible implementation, the feature processing module 602 is configured to perform any one of the following:
and performing convolution for at least once on the sample face image to obtain the image characteristics of the sample face image.
And carrying out at least one-time full connection on the sample face image to obtain the image characteristics of the sample face image.
And coding the sample face image based on an attention mechanism to obtain the image characteristics of the sample face image.
In a possible implementation manner, the feature processing module 602 is configured to perform at least one convolution on the image features of the sample face image to obtain the content features of the sample face image.
In a possible embodiment, the training module 604 is further configured to perform at least one of the following:
training the living attack detection model based on second difference information between the sample combination characteristic and a positive sample combination characteristic and third difference information between the sample combination characteristic and a negative sample combination characteristic, wherein the positive sample combination characteristic and the sample combination characteristic have the same style characteristic, and the negative sample combination characteristic and the sample combination characteristic have different style characteristics.
Inputting the content characteristics of the sample face image into a style discrimination unit, predicting based on the content characteristics of the sample face image through the style discrimination unit, and outputting the predicted style of the sample face image. And training the living attack detection model based on fourth difference information between the prediction style and the annotation style of the sample face image.
In a possible implementation manner, the training module 604 is further configured to determine a gradient value corresponding to the current iteration based on the first difference information, the second difference information, the third difference information, and the fourth difference information. Based on the gradient value, the living body attack detection model is trained.
In a possible implementation manner, the apparatus further includes a prediction tag determination module, configured to perform full connection and normalization on the sample combination feature through the living attack detection model, and output a classification value of the sample combination feature. And determining a prediction label corresponding to the sample combination characteristic based on the classification value and the classification value threshold.
In a possible implementation, the feature processing module 602 is further configured to input the reference face image into the live attack detection model. And performing feature extraction and feature splitting on the reference face image through the living body attack detection model to obtain the style features of the reference face image.
In a possible implementation manner, the feature processing module 602 is further configured to perform feature extraction on the reference face image to obtain image features of the reference face image, perform feature splitting on the image features of the reference face image to obtain initial style features of the reference face image, and encode the initial style features of the reference face image based on an attention mechanism to obtain the style features of the reference face image.
It should be noted that: in the training apparatus for a living body attack detection model provided in the foregoing embodiment, when training the living body attack detection model, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the training device of the living body attack detection model and the training method embodiment of the living body attack detection model provided by the above embodiments belong to the same inventive concept, and the specific implementation process is detailed in the method embodiments and will not be described herein again.
According to the technical scheme provided by the embodiment of the specification, the sample face image is input into the living body attack detection model, and the living body attack detection model is used for carrying out feature extraction and feature splitting on the sample face image to obtain the content features of the sample face image. And combining the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics. Training a living attack detection model based on first difference information between a prediction label corresponding to the sample combination characteristics and an annotation label of the reference face image, so that the living attack detection model has the capability of detecting living attacks. The living attack detection model is used before face recognition, so that living attacks can be recognized in time, and the safety of face recognition is improved.
Fig. 7 is a schematic structural diagram of a living body attack detection apparatus provided in an embodiment of the present specification, and referring to fig. 7, the apparatus includes: a target face image input module 701, a feature extraction module 702 and a prediction module 703.
And a target face image input module 701, configured to input the target face image into the living body attack detection model.
A feature extraction module 702, configured to perform feature extraction on the target face image through the living attack detection model, so as to obtain an image feature of the target face image.
The predicting module 703 is configured to perform prediction based on the image characteristics of the target face image through the living body attack detection model, and output a label of the target face image, where the label is used to indicate whether the target face image is a living body attack image.
The living attack detection model is obtained by training based on first difference information between a prediction label and an annotation label of a reference face image, the prediction label is determined based on sample combination characteristics, and the sample combination characteristics are obtained by combining content characteristics of the sample face image and style characteristics of the reference face image.
It should be noted that: in the above embodiment, when detecting a living body attack, the living body attack detection apparatus is exemplified by only the division of the functional modules, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the training device of the living body attack detection model and the training method embodiment of the living body attack detection model provided by the above embodiments belong to the same inventive concept, and the specific implementation process is detailed in the method embodiments and will not be described herein again.
Through the technical scheme provided by the embodiment of the specification, the living body attack detection can be quickly realized through the living body attack detection model, the living body attack detection efficiency and accuracy are higher, and the safety of face recognition is improved.
The embodiments of the present disclosure also provide a computer storage medium, where multiple program instructions may be stored in the computer storage medium, and the program instructions are suitable for being loaded by a processor and executing the scheme described in the foregoing method embodiments, and are not described herein again.
An embodiment of the present specification further provides a computer program product, where the computer program product stores at least one instruction, and the at least one instruction is loaded by the processor and executes the scheme described in the foregoing method embodiment, which is not described herein again.
Referring to fig. 8, a schematic structural diagram of an electronic device provided in an exemplary embodiment of the present disclosure is shown, where the electronic device may be provided as a server or a terminal. The electronic device in this specification may include one or more of the following components: a processor 810, a memory 820, an input device 830, an output device 840, and a bus 860. The processor 810, memory 820, input device 830, and output device 840 may be coupled by a bus 860.
Processor 810 may include one or more processing cores. The processor 810 connects the various parts of the electronic device through various interfaces and circuits, and performs the various functions of the electronic device 800 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 820 and invoking data stored in the memory 820. Alternatively, the processor 810 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 810 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is understood that the modem may also not be integrated into the processor 810 and may instead be implemented by a separate communication chip.
The memory 820 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 820 includes a non-transitory computer-readable storage medium. The memory 820 may be used to store instructions, programs, code sets, or instruction sets. The memory 820 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The operating system may be an Android system, including a system deeply developed based on the Android system, an iOS system developed by Apple, including a system deeply developed based on the iOS system, or another system.
In order to enable the operating system to distinguish a specific application scenario of the third-party application program, data communication between the third-party application program and the operating system needs to be opened, so that the operating system can acquire current scenario information of the third-party application program at any time, and further perform targeted system resource adaptation based on the current scenario.
The input device 830 is used for receiving input commands or data, and the input device 830 includes but is not limited to a keyboard, a mouse, a camera, a microphone, or a touch device. Output device 840 is used to output instructions or data, and output device 840 includes, but is not limited to, a display device, a speaker, and the like. In one example, the input device 830 and the output device 840 may be combined, the input device 830 and the output device 840 being a touch display screen.
In addition, those skilled in the art will appreciate that the configurations of the electronic devices illustrated in the above-described figures do not constitute limitations on the electronic devices, which may include more or fewer components than illustrated, or some components may be combined, or a different arrangement of components. For example, the electronic device further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a Wireless Fidelity (WiFi) module, a power supply, a bluetooth module, and other components, which are not described herein again.
In the electronic device shown in fig. 8, the processor 810 may be configured to invoke an application program stored in the memory 820 for live attack detection or training of a live attack detection model, for executing the method described in the above method embodiment.
The above is a schematic scheme of an electronic device according to an embodiment of the present specification. It should be noted that the technical solution of the electronic device is the same as the technical solution of the above-mentioned training method of the living body attack detection model and the technical solution of the living body attack detection method, and details that are not described in detail in the technical solution of the electronic device can be referred to the description of the technical solution of the above-mentioned training method of the living body attack detection model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium of the computer program may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only an example of the alternative embodiments of the present disclosure, and not intended to limit the present disclosure, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (15)

1. A method of training a live attack detection model, comprising:
inputting a sample face image into a living body attack detection model, wherein the living body attack detection model is used for determining whether the input face image is a living body attack image;
performing feature extraction and feature splitting on the sample face image through the living body attack detection model to obtain the content features of the sample face image, wherein the content features of the sample face image are used for representing the image content of the sample face image;
combining the content characteristics of the sample face image with the style characteristics of the reference face image to obtain sample combination characteristics, wherein the style characteristics are used for expressing the image style of the reference face image;
training the living body attack detection model based on first difference information between a prediction label and an annotation label of the reference face image, wherein the prediction label is determined by the living body attack detection model based on the sample combination characteristics, and the annotation label is used for indicating whether the reference face image is a living body attack image.
2. The method according to claim 1, wherein the performing feature extraction and feature splitting on the sample face image to obtain the content features of the sample face image comprises:
extracting the characteristics of the sample face image to obtain the image characteristics of the sample face image, wherein the image characteristics comprise content characteristics and style characteristics;
and carrying out feature splitting on the image features of the sample face image to obtain the content features of the sample face image.
3. The method according to claim 2, wherein the extracting the features of the sample face image to obtain the image features of the sample face image comprises any one of the following steps:
performing at least one convolution on the sample face image to obtain the image characteristics of the sample face image;
carrying out at least one-time full connection on the sample face image to obtain the image characteristics of the sample face image;
and coding the sample face image based on an attention mechanism to obtain the image characteristics of the sample face image.
4. The method according to claim 2, wherein the performing feature splitting on the image features of the sample face image to obtain the content features of the sample face image includes:
and performing convolution for at least once on the image characteristics of the sample face image to obtain the content characteristics of the sample face image.
5. The method according to claim 1, after combining the content features of the sample face image with the style features of the reference face image to obtain sample combined features, the method further comprises at least one of:
training the living attack detection model based on second difference information between the sample combination features and positive sample combination features and third difference information between the sample combination features and negative sample combination features, wherein the positive sample combination features and the sample combination features have the same style features, and the negative sample combination features and the sample combination features have different style features;
inputting the content characteristics of the sample face image into a style discrimination unit, predicting based on the content characteristics of the sample face image through the style discrimination unit, and outputting the predicted style of the sample face image; and training the living attack detection model based on fourth difference information between the prediction style and the labeling style of the sample face image.
6. The method of claim 5, after combining the content features of the sample facial image with the style features of the reference facial image to obtain sample combined features, the method further comprises:
determining a gradient value corresponding to the current iteration training based on the first difference information, the second difference information, the third difference information and the fourth difference information;
and training the living body attack detection model based on the gradient value.
7. The method of claim 1, before training the live-attack detection model based on the first difference information between the predicted label and the annotation label of the reference face image, the method further comprising:
fully connecting and normalizing the sample combination characteristics through the living body attack detection model, and outputting classification values of the sample combination characteristics;
and determining a prediction label corresponding to the sample combination characteristic based on the classification value and the classification value threshold.
8. The method of claim 1, wherein before combining the content features of the sample face image with the style features of the reference face image to obtain sample combined features, the method further comprises:
inputting the reference face image into the living attack detection model;
and performing feature extraction and feature splitting on the reference face image through the living body attack detection model to obtain style features of the reference face image.
9. The method of claim 8, wherein the performing feature extraction and feature splitting on the reference face image to obtain the style features of the reference face image comprises:
extracting the features of the reference face image to obtain the image features of the reference face image;
carrying out feature splitting on the image features of the reference face image to obtain the initial style features of the reference face image;
and coding the initial style characteristics of the reference face image based on an attention mechanism to obtain the style characteristics of the reference face image.
10. A method of in vivo attack detection, comprising:
inputting a target face image into a living body attack detection model;
performing feature extraction on the target face image through the living body attack detection model to obtain the image features of the target face image;
predicting based on the image characteristics of the target face image through the living body attack detection model, and outputting a label of the target face image, wherein the label is used for indicating whether the target face image is a living body attack image;
the living body attack detection model is obtained by training based on first difference information between a prediction label and an annotation label of a reference face image, the prediction label is determined based on sample combination characteristics, and the sample combination characteristics are obtained by combining content characteristics of the sample face image and style characteristics of the reference face image.
11. A training apparatus for a live attack detection model, comprising:
the system comprises a sample face image input module, a living body attack detection module and a living body attack detection module, wherein the sample face image input module is used for inputting a sample face image into a living body attack detection model, and the living body attack detection model is used for determining whether the input face image is a living body attack image;
the characteristic processing module is used for carrying out characteristic extraction and characteristic splitting on the sample face image through the living body attack detection model to obtain the content characteristic of the sample face image, and the content characteristic of the sample face image is used for expressing the image content of the sample face image;
the characteristic combination module is used for combining the content characteristics of the sample face image and the style characteristics of the reference face image to obtain sample combination characteristics, and the style characteristics are used for expressing the image style of the reference face image;
a training module, configured to train the living attack detection model based on first difference information between a prediction tag and an annotation tag of the reference face image, where the prediction tag is determined by the living attack detection model based on the sample combination feature, and the annotation tag is used to indicate whether the reference face image is a living attack image.
12. A living body attack detection apparatus comprising:
the target face image input module is used for inputting the target face image into the living body attack detection model;
the feature extraction module is used for extracting features of the target face image through the living body attack detection model to obtain image features of the target face image;
the prediction module is used for predicting based on the image characteristics of the target face image through the living body attack detection model and outputting a label of the target face image, wherein the label is used for indicating whether the target face image is a living body attack image;
the living attack detection model is obtained by training based on first difference information between a prediction label and an annotation label of a reference face image, the prediction label is determined based on sample combination characteristics, and the sample combination characteristics are obtained by combining content characteristics of the sample face image and style characteristics of the reference face image.
13. A computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to carry out the method according to any one of claims 1 to 10.
14. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method according to any of claims 1-10.
15. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to carry out the method according to any one of claims 1 to 10.