CN116129534A - Image living body detection method and device, storage medium and electronic equipment - Google Patents

Image living body detection method and device, storage medium and electronic equipment

Info

Publication number
CN116129534A
Authority
CN
China
Prior art keywords: depth, image, sample, map, model
Prior art date
Legal status: Pending
Application number
CN202211089114.9A
Other languages
Chinese (zh)
Inventor
曹佳炯
丁菁汀
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211089114.9A
Publication of CN116129534A

Classifications

    • G06V40/40: Recognition of biometric, human-related or animal-related patterns in image or video data; Spoof detection, e.g. liveness detection
    • G06N3/084: Computing arrangements based on biological models; Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06V10/774: Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses an image living body detection method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: acquiring target color images of a target object; performing depth estimation processing based on a first depth model to obtain a first depth image corresponding to each frame of target color image; performing inter-frame depth fusion processing based on a second depth model to obtain a second depth image for the target object; and performing image living body detection processing on the target object based on the second depth image and the target color images.

Description

Image living body detection method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image living body detection method, an image living body detection device, a storage medium, and an electronic device.
Background
In recent years, biometric recognition technology has been widely used in people's production and daily life. For example, technologies such as face-scanning payment, face-based access control, face-based attendance, and face-based check-in all rely on biometric recognition. In these biometric scenarios, image living body detection is attracting more and more attention: it verifies whether the user performing the operation is a real living body, so that common attack means such as photos, face swapping, masks, occlusion, and screen replay can be effectively resisted, which helps to screen out fraudulent behavior and protect users' rights and interests.
Disclosure of Invention
The specification provides an image living body detection method and apparatus, a storage medium, and an electronic device. The technical solution is as follows:
In a first aspect, the present specification provides an image living body detection method, the method comprising:
acquiring at least two frames of target color images for a target object;
performing depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image;
performing inter-frame depth fusion processing on each first depth image based on a second depth model to obtain a second depth image for the target object;
and performing image living body detection processing on the target object based on the second depth image and the target color image.
In a second aspect, the present specification provides an image living body detection apparatus, the apparatus comprising:
an image acquisition module, configured to acquire at least two frames of target color images for a target object;
a depth estimation module, configured to perform depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image;
a depth fusion module, configured to perform inter-frame depth fusion processing on each first depth image based on a second depth model to obtain a second depth image for the target object;
and a living body detection module, configured to perform image living body detection processing on the target object based on the second depth image and the target color image.
In a third aspect, the present description provides a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-described method steps.
In a fourth aspect, the present description provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical solutions provided by some embodiments of the present specification have the following beneficial effects:
In one or more embodiments of the present specification, the electronic device performs depth estimation on a plurality of target color images based on the first depth model, and performs depth fusion through the second depth model, which mines and focuses on the inter-frame depth relations among the plurality of first depth images of the same object, so that a second depth image of higher precision can be obtained from the depth estimates. This reduces the requirements on image precision and image quality when the target color images are acquired and resists the detection interference of complex application environments, so that a high-precision second depth image can be obtained even from color two-dimensional images of lower precision or lower quality. Image living body detection can then be performed based on the higher-precision second depth image and the target color images, which improves the detection effect of image living body detection in complex environments and on lower-performance hardware, and improves the robustness of living body detection.
Drawings
To describe the technical solutions in the present specification or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description are only some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic view of an image live detection system provided herein;
FIG. 2 is a schematic flow chart of an image live detection method provided in the present specification;
FIG. 3 is a flow chart of another image live detection method provided in the present specification;
FIG. 4 is a flow chart of another image live detection method provided in the present specification;
fig. 5 is a schematic structural view of an image living body detection apparatus provided in the present specification;
FIG. 6 is a schematic diagram of a depth estimation module provided herein;
FIG. 7 is a schematic structural diagram of a depth fusion module provided in the present specification;
fig. 8 is a schematic structural view of another image living body detection apparatus provided in the present specification;
Fig. 9 is a schematic structural view of an electronic device provided in the present specification;
FIG. 10 is a schematic diagram of the architecture of the operating system and user space provided herein;
FIG. 11 is an architecture diagram of the android operating system of FIG. 10;
FIG. 12 is an architecture diagram of the IOS operating system of FIG. 10.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without inventive effort fall within the protection scope of the present disclosure.
In the description of the present application, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and shall not be construed as indicating or implying relative importance. It should also be understood that, unless otherwise specified and limited, the terms "comprise" and "have", and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, product, or device that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements that are not listed or that are inherent to such process, method, product, or device. The specific meanings of the terms in this application can be understood by a person of ordinary skill in the art according to the specific context. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In the related art, in image living body detection scenarios such as image living body detection and interactive recognition detection, accurate detection is often achieved by combining multi-modal image data: more modalities are added to the camera, such as an NIR modality and a 3D modality on top of the RGB modality, or even a thermal imaging modality. After multiple modalities are added, the performance of the whole living body detection system is significantly enhanced, and various types of attacks are better prevented. However, such methods have the disadvantage that the cost and equipment requirements of the whole image living body detection are significantly increased, so they cannot be applied in scenarios with low cost and low equipment requirements. In addition, related technical means may also require the user, under prompts, to perform highly cooperative operations such as head shaking and blinking in order to accurately perform image living body detection, while the actual application environment of image living body detection is often far from ideal. For these reasons, image living body detection in the related art has considerable limitations.
the present application is described in detail with reference to specific examples.
Please refer to fig. 1, which is a schematic diagram of a scenario of an image living body detection system provided in the present specification. As shown in fig. 1, the image living body detection system may include at least a client cluster and a service platform 100.
The client cluster may include at least one client, as shown in fig. 1, specifically including a client 1 corresponding to a user 1, a client 2 corresponding to a user 2, …, and a client n corresponding to a user n, where n is an integer greater than 0.
Each client in the client cluster may be a communication-enabled electronic device including, but not limited to: wearable devices, handheld devices, personal computers, tablet computers, vehicle-mounted devices, smart phones, computing devices, or other processing devices connected to a wireless modem, etc. Electronic devices in different networks may be called different names, for example: a user equipment, an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent or user equipment, a cellular telephone, a cordless telephone, a personal digital assistant (personal digital assistant, PDA), an electronic device in a 5G network or future evolution network, and the like.
The service platform 100 may be a standalone server device, such as a rack-mounted, blade, tower, or cabinet server, or hardware with strong computing capability such as a workstation or a mainframe computer; it may also be a server cluster composed of multiple servers, where the servers in the cluster may be arranged symmetrically, each server being functionally equivalent and playing an equivalent role in the transaction link, and each server being able to provide services to the outside independently, that is, without the assistance of another server.
In one or more embodiments of the present disclosure, the service platform 100 may establish a communication connection with at least one client in the client cluster and perform the data interaction of the image living body detection process based on this connection, such as the online interaction of at least two frames of target color images of a target object. Illustratively, a client may collect at least two frames of target color images of the target object and send them to the service platform 100, and the service platform 100 executes the image living body detection method described in this specification to obtain an image living body detection result and feeds it back to the client. As another example, the service platform 100 may issue the depth models relevant to image living body detection, such as the first depth model, the second depth model, and the third depth model, to a plurality of clients to instruct the clients to execute the image living body detection method described in this specification and obtain the image living body detection result. As another example, the service platform 100 may obtain training sample data, such as sample images, from the clients for training the relevant depth models.
It should be noted that, the service platform 100 establishes a communication connection with at least one client in the client cluster through a network for interactive communication, where the network may be a wireless network, or may be a wired network, where the wireless network includes, but is not limited to, a cellular network, a wireless local area network, an infrared network, or a bluetooth network, and the wired network includes, but is not limited to, an ethernet network, a universal serial bus (universal serial bus, USB), or a controller area network. In one or more embodiments of the specification, techniques and/or formats including HyperText Mark-up Language (HTML), extensible markup Language (Extensible Markup Language, XML), and the like are used to represent data exchanged over a network (e.g., target compression packages). All or some of the links may also be encrypted using conventional encryption techniques such as secure socket layer (Secure Socket Layer, SSL), transport layer security (Transport Layer Security, TLS), virtual private network (Virtual Private Network, VPN), internet protocol security (Internet Protocol Security, IPsec), and the like. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
The embodiments of the image living body detection system provided in the present specification and the image living body detection method in one or more embodiments belong to the same concept. The execution subject corresponding to the image living body detection method in one or more embodiments of the specification may be the service platform 100 described above, or it may be a client, as determined by the actual application environment. The implementation process of the system embodiment is described in detail in the following method embodiments and is not repeated here.
Based on the schematic view of the scenario shown in fig. 1, the image living body detection method provided in one or more embodiments of the present specification is described in detail below.
Referring to fig. 2, a schematic flowchart of an image living body detection method is provided for one or more embodiments of the present disclosure. The method may be implemented by a computer program and may run on a von Neumann architecture-based image living body detection apparatus. The computer program may be integrated in an application or may run as a standalone utility application. The image living body detection apparatus may be an electronic device.
Specifically, the image living body detection method comprises the following steps:
S102: acquiring at least two frames of target color images for a target object;
While biometric recognition brings convenience to people, it also introduces new risk challenges. The most common means of threatening the security of a biometric system is the living body attack, that is, attempting to bypass image-based biometric verification by means of a device screen, a printed photograph, and the like. To detect living body attacks, living body anti-attack technology is an indispensable link in biometric recognition scenarios, and the image living body detection in one or more embodiments of the present disclosure is likewise an important link in such scenarios.
In the related art, image living body detection is a detection method for determining the real physiological characteristics of an object in certain authentication scenarios; in facial recognition applications, image living body detection needs to verify whether the target object is a real living body. Image living body detection must effectively resist common living body attack means such as photos, face swapping, masks, occlusion, and screen replay, thereby helping to screen out fraudulent behavior and protecting users' rights and interests.
In one or more embodiments of the present description, the image living body detection task should allow low-cost image acquisition while ensuring detection accuracy. In the related art, in scenarios such as image living body detection and interactive recognition detection, accurate detection is often achieved by combining multi-modal image data: more modalities are added to the camera, such as an NIR modality and a 3D modality on top of the RGB modality, or even a thermal imaging modality. After multiple modalities are added, the performance of the whole living body detection system is obviously enhanced and various types of attacks are better prevented; however, the cost and equipment requirements of the whole image living body detection are also obviously increased, which is a significant limitation. By executing the image living body detection method described here, the first depth image corresponding to each frame of target color image can be predicted or estimated from at least two frames of target color images, the plurality of first depth images are then combined by depth fusion to obtain a second depth image, and image living body detection processing is finally carried out based on the target color images and the second depth image.
The target color image is a two-dimensional color image of the target object, such as an RGB image of the target object, acquired for the image living body detection task.
Illustratively, in an actual application scenario, at least two frames of target color images of the target object to be identified or detected may be acquired, for example by an RGB camera or a monocular camera, based on the corresponding image living body detection task; the target color images are typically two-dimensional color images.
Optionally, the at least two frames of target color images may be consecutive images, or may be images acquired continuously at a target frame interval within a preset time (e.g., 2 s).
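As a purely illustrative example of such acquisition (camera capture itself is not part of the claimed method), the following OpenCV sketch samples frames at an assumed frame interval within a 2 s window; the parameter values are placeholders:

```python
# Illustrative frame acquisition: keep every N-th camera frame within a 2 s window.
import cv2
import time

def capture_target_frames(frame_interval=5, duration_s=2.0, min_frames=2):
    cap = cv2.VideoCapture(0)          # RGB / monocular camera
    frames, idx, start = [], 0, time.time()
    while time.time() - start < duration_s:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_interval == 0:  # sample at the target frame interval
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    if len(frames) < min_frames:
        raise RuntimeError("need at least two target color images")
    return frames
```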
S104: performing depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image;
it can be understood that a multi-frame target color image for a target object is input to a first depth model, and the first depth model is adopted to perform depth estimation processing on the target color image, so as to obtain a first depth image corresponding to each frame of target color image.
In one or more embodiments of the present disclosure, an initial first depth model is constructed in advance and trained for depth estimation; after the model training end condition is satisfied, the trained first depth model is obtained and can be applied to the actual image living body detection task, where depth estimation is performed on consecutive multi-frame target color images of the same target object to obtain the first depth image corresponding to each target color image.
S106: performing interframe depth fusion processing on each first depth image based on a second depth model to obtain a second depth image aiming at the target object;
the second depth model is used for carrying out interframe depth fusion on first depth images corresponding to multi-frame target color images of the same target object under an actual image living detection scene so as to obtain a second depth image aiming at the target object, wherein the second depth image is a depth image after interframe depth fusion on the same target object; in short, the input of the second depth model is "each first depth image corresponding to a plurality of frame target color images of the same target object", and the output of the second depth model is "the second depth image for the target object".
In one possible implementation, an initial first depth model for single-frame depth estimation and an initial second depth model for multi-frame inter-frame depth fusion may be created in advance, and at least one group of image sample data comprising at least two consecutive frames of sample images of the same sample object is acquired;
the electronic equipment performs depth estimation training on the initial first depth model and interframe depth fusion training on the initial second depth model based on the image sample data, and obtains the trained first depth model and second depth model after the training conditions of the model are met.
Optionally, the image sample data may be public image data obtained from a related database, with multiple image samples of the same sample object grouped to form the image sample data corresponding to that object; the related database may be one or more of CIFAR-10, CIFAR-100, Tiny ImageNet, and the like. Alternatively, the image sample data may be multiple groups of image sample data for different sample objects, collected in a customized way in the transaction scenario of the actual image detection task. A group of image sample data for the same sample object consists of multiple images, typically multiple consecutive images of that sample object; for example, image data collected from the Internet may be annotated with corresponding labels to form a complete image data set.
Illustratively, the multiple groups of image sample data may be acquired as follows: an RGB camera collects image sample data of users during the face-scanning stage, collecting 1 s-3 s of data per user at roughly 25-30 frames per second; the collected users may cover various ages, genders, and so on. Meanwhile, image sample data of various image attack types are collected, such as the device-screen type (the sample object is displayed on a device screen), the printed-photo type (the printed photo contains the sample object), and the object-model type (an object model, such as a figure model, serves as the sample object), again collecting 1-3 s of images at about 25-30 frames per second. In this way, the multiple groups of image sample data can cover sample images of various image attack types.
Illustratively, the initial first depth model and the initial second depth model may be constructed based on machine learning models, which may include one or more of a convolutional neural network (Convolutional Neural Network, CNN) model, a deep neural network (Deep Neural Network, DNN) model, a recurrent neural network (Recurrent Neural Network, RNN) model, an embedding model, a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model, a logistic regression (Logistic Regression, LR) model, and the like. The model training process for the initial first depth model and the initial second depth model may be implemented by fitting one or more of these machine learning models; reference may be made to the definitions in other embodiments of the present specification.
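A hedged sketch of this two-stage training flow is given below. The loss functions (pixel_estimation_loss, reconstruction_loss) and the model interfaces are illustrative placeholders rather than the disclosed design; in practice the first model may also be trained to convergence before the second stage starts.

```python
# Sketch of training the initial first depth model (single-frame depth estimation)
# and the initial second depth model (inter-frame depth fusion). Assumed shapes:
# sample_frames (B, T, 3, H, W), label_depth_maps (B, T, 1, H, W).
import torch

def train_depth_models(sample_loader, first_model, second_model,
                       pixel_estimation_loss, reconstruction_loss,
                       epochs=10, lr=1e-4):
    opt1 = torch.optim.Adam(first_model.parameters(), lr=lr)
    opt2 = torch.optim.Adam(second_model.parameters(), lr=lr)
    for _ in range(epochs):
        for sample_frames, label_depth_maps in sample_loader:
            # Stage 1: depth estimation training against the label depth maps
            est = first_model(sample_frames.flatten(0, 1))          # (B*T, 1, H, W)
            loss_a = pixel_estimation_loss(est, label_depth_maps.flatten(0, 1))
            opt1.zero_grad()
            loss_a.backward()
            opt1.step()
            # Stage 2: inter-frame depth fusion training on the estimated maps
            with torch.no_grad():
                est = first_model(sample_frames.flatten(0, 1))
            est = est.view(sample_frames.shape[0], -1, *est.shape[1:])  # (B, T, 1, H, W)
            fused, recon = second_model(est)       # fusion map + reconstruction branch
            loss_b = reconstruction_loss(recon, est)
            opt2.zero_grad()
            loss_b.backward()
            opt2.step()
    return first_model, second_model
```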
S108: and performing image living body detection processing on the target object based on the second depth image and the target color image.
It is understood that after the second depth image is obtained based on the multi-frame target color image of the target object, an image living body detection process may be performed based on the second depth image and the target color image to determine a living body detection type of the target object.
In a possible implementation manner, the electronic device performs the image live detection processing on the target object based on the second depth image and the target color image, which may be:
The electronic device inputs the second depth image and the target color image into a living body detection model, and outputs a first living body probability value corresponding to the second depth image and a second living body probability value corresponding to the target color image; a living body detection type of the target object is then determined based on the first living body probability value and the second living body probability value.
It may be appreciated that the first living body probability value corresponding to the second depth image and the second living body probability value corresponding to the target color image may be output based on the pre-trained living body detection model with the "second depth image and the target color image" as inputs of the model.
The living body probability value may be understood as a living body classification probability of the living body detection model to the corresponding image.
After obtaining the first living body probability value corresponding to the second depth image and the second living body probability value corresponding to the target color image, the electronic device may determine the target living body probability based on the first living body probability value and the second living body probability value.
Alternatively, the determination rule of the target living probability may be: selecting one of the first living body probability value and the second living body probability value as a target living body probability, for example, selecting the maximum probability value of the first living body probability value and the second living body probability value as the target living body probability;
Alternatively, the determination rule of the target living probability may be: and presetting a first weight factor for the second depth image and a second weight factor for the target color image, and obtaining the final target living probability by adopting a weighted fusion mode.
Illustratively, assuming that the first living probability value is P1, the second living probability value is P2, the first weight factor is preset to be a for the second depth image, and the second weight factor is preset to be b for the target color image, the target living probability P may be calculated by the following formula:
P = P1 * a + P2 * b
Optionally, the first weight factor and the second weight factor sum to 1; for example, the first weight factor is 0.5 and the second weight factor is 0.5.
Optionally, since the number of target color images is usually more than one, only one of them may be taken into account when performing the image living body detection processing: one target color image is selected, and image living body detection processing is performed on the target object based on the second depth image and the selected target color image.
Optionally, all of the multiple target color images may instead be taken into account: the electronic device inputs the second depth image and all the target color images into the living body detection model, outputs the first living body probability value corresponding to the second depth image and a second living body probability value for each target color image, and then fits the second living body probability values of the target color images to obtain the optimal second living body probability value, which may be selected by computing the average, median, maximum, minimum, or the like of all the second living body probability values.
Further, after the target living body probability is obtained, it can be compared with a preset target threshold to determine the living body detection result;
optionally, if the target living probability is greater than a target threshold, determining that the target object is a living object type;
optionally, if the target living probability is less than or equal to a target threshold, determining that the target object is an attack object type.
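Putting the preceding score fusion and threshold comparison together, a minimal sketch follows; the averaging strategy, weight factors, and threshold value are illustrative assumptions rather than prescribed values:

```python
# Fusion of the first and second living body probability values and the final
# threshold decision, as described above.
def decide_liveness(p_depth, p_colors, a=0.5, b=0.5, target_threshold=0.5):
    """p_depth: first living body probability value (from the second depth image).
    p_colors: one or more second living body probability values (color frames)."""
    p_color = sum(p_colors) / len(p_colors)      # fit multiple frames by averaging
    p_target = p_depth * a + p_color * b         # weighted fusion, with a + b = 1
    return "living object" if p_target > target_threshold else "attack object"
```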
In one or more embodiments of the present description, the living body detection model may be created based on a machine learning model, and it may be implemented by fitting one or more of a convolutional neural network (Convolutional Neural Network, CNN) model, a deep neural network (Deep Neural Network, DNN) model, a recurrent neural network (Recurrent Neural Network, RNN) model, an embedding model, a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model, a logistic regression (Logistic Regression, LR) model, and the like.
It can be appreciated that an initial living body detection model can be created in advance, a large number of sample training images are obtained as a model image training set of the living body detection model, the model image training set comprises a depth sample image and a color sample image corresponding to the same sample object, then the initial living body detection model is trained based on the model image training set until a model training end condition is met, and a trained living body detection model is obtained.
In one or more embodiments of the present disclosure, the electronic device performs depth estimation on a plurality of target color images based on the first depth model, and performs depth fusion through the second depth model, which mines and focuses on the inter-frame depth relations among the plurality of first depth images of the same object, so that a second depth image of higher precision can be obtained from the depth estimates. This reduces the requirements on image precision and image quality when the target color images are acquired and resists the detection interference of complex application environments, so that a high-precision second depth image can be obtained even from color two-dimensional images of lower precision or lower quality. Image living body detection can then be performed based on the higher-precision second depth image and the target color images, which improves the detection effect of image living body detection in complex environments and on lower-performance hardware, and improves the robustness of living body detection.
Referring to fig. 3, fig. 3 is a flow chart illustrating another embodiment of an image live detection method according to one or more embodiments of the present disclosure. Specific:
S202: creating an initial first depth model and an initial second depth model;
s204: acquiring at least one set of image sample data comprising at least two consecutive frames of sample images for the same sample object;
reference should be made specifically to the relevant definitions of other embodiments of the present disclosure, and they are not repeated here.
S206: inputting each sample image in the image sample data into an initial first depth model to output a sample depth estimation image corresponding to each sample image;
illustratively, the initial first depth model may be created based on a machine learning model, and in some embodiments, the model architecture corresponding to the initial first depth model may be a UNET model architecture, and the initial first depth model based on the UNET model architecture may include at least two parts, an Encoder and a Decoder;
further, in the model training process, the input of the first depth model is a single frame sample image, the output of the first depth model is a sample depth estimation image corresponding to the sample image, and the sample depth estimation image can be understood as a depth image containing pixel depth information.
In one possible implementation, the training process for the initial first depth model may be:
The electronic equipment pre-determines a label depth map corresponding to each sample image;
during each round of training for the initial first depth model: inputting each sample image into an initial first depth model for depth estimation processing, and outputting sample depth estimation images corresponding to each sample image respectively; such as: in a training process of a certain round, b Zhang Yangben images (sample images are usually two-dimensional images) of the same sample object A are input into an initial first depth model to perform depth estimation processing, and pixel depth characteristics of each pixel point in the sample images are respectively predicted or estimated by the initial first depth model, so that a sample depth estimation image corresponding to the sample images is generated.
Further, in training of the initial first depth model, a label depth map corresponding to each image sample data is predetermined, the label depth map is used in a back propagation training process of the initial first depth model, and model parameters of the initial first depth model are adjusted in a back propagation mode based on the label depth map and a sample depth estimation map output by the current initial first depth model in the back propagation training process.
Illustratively, the setting of the label depth map corresponding to a sample image is described as follows:
In a possible implementation manner, the electronic device may set a label depth for the sample image after acquiring a plurality of sample images corresponding to the image sample data in advance, so that a label depth map corresponding to each sample image may be determined in a subsequent training process;
optionally, the label depth map may be determined for all the images of the sample objects based on the method for generating the image depth of the sample object in the related art, for example, three-dimensional depth reconstruction is performed on the sample images corresponding to all the sample objects by using three-dimensional object (such as face object) reconstruction (such as 3DMM model) and three-dimensional depth estimation technology, so that the obtained depth map is the label depth map of the sample image.
Optionally, after each sample image is collected based on the image living body detection task, the electronic device determines the label depth map corresponding to the sample image by adopting different depth map setting modes in combination with the image sample type during sample image collection.
Illustratively, in the living body detection scenario corresponding to an image living body detection task, the image sample types of the sample images may include at least an attack image type and a living body image type.
Sample images of the attack image type are those collected through common living body attack means such as photos, face swapping, masks, occlusion, and screen replay, in contrast to sample images collected in the environment where a real living body object is located.
The living body image type can be understood as the type corresponding to sample images acquired in the environment where a real living body object is located.
Further, when the electronic device executes the setting of the tag depth map corresponding to the image sample data based on the image sample type, the setting may be:
the electronic device acquires, from all collected sample images, the first sample images corresponding to the attack image type, and sets the depth pixel values of the first label depth map corresponding to each first sample image to a target depth pixel value;
Illustratively, for a first sample image corresponding to the attack image type, a first label depth map whose depth pixel values are all the target depth pixel value can be generated; for example, an image whose depth pixel values are all 0 can be generated as the depth map, i.e., the depth of the first sample image corresponding to the attack image type is considered to be 0;
the electronic device also acquires, from all collected sample images, the second sample images corresponding to the living body image type, and calls a target image depth service to determine the second label depth map corresponding to each second sample image.
It can be understood that the label depth map corresponding to the sample image includes a first label depth map and a second label depth map;
It can be understood that, based on the method of generating the image depth of the sample object in the related art, the second label depth map is determined for all the second sample images, for example, three-dimensional object (such as facial object) reconstruction (such as 3DMM model) is adopted, and three-dimensional depth prediction technology is adopted to perform three-dimensional depth reconstruction on the sample images corresponding to all the sample objects, so that the obtained depth map is the label depth map of the sample image.
In one or more embodiments of the present disclosure, a label depth map of a sample image is used as a supervisory signal in a model training process to optimize a model training effect.
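A hedged sketch of this label construction is shown below; estimate_depth_via_3d_reconstruction is a hypothetical stand-in for the target image depth service (e.g., 3DMM-based reconstruction) and is not an actual API:

```python
# Build the label depth map used as the supervisory signal: zero depth for
# attack-type samples, reconstructed depth for living-body-type samples.
import numpy as np

def build_label_depth_map(sample_image, sample_type,
                          estimate_depth_via_3d_reconstruction,
                          target_depth_pixel_value=0.0):
    h, w = sample_image.shape[:2]
    if sample_type == "attack":
        # First label depth map: every depth pixel set to the target value (0)
        return np.full((h, w), target_depth_pixel_value, dtype=np.float32)
    # Second label depth map: depth obtained from 3D reconstruction of the sample
    return estimate_depth_via_3d_reconstruction(sample_image).astype(np.float32)
```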
S208: performing depth estimation training on the initial first depth model based on the sample depth estimation map until the initial first depth model is trained, so as to obtain a trained first depth model;
and the sample depth estimation graph is an output result of the initial first depth model after the sample image is subjected to depth estimation processing.
In the training process of each round of initial first depth model, a label depth map corresponding to a sample image is determined, the sample image is input into the initial first depth model to carry out depth estimation processing to output a sample depth estimation map corresponding to the sample image, and model parameters of the initial first depth model are adjusted by back propagation by combining the output sample depth estimation map and the label depth map serving as a supervision signal.
Further, in the training process of each initial first depth model, the electronic device calculates pixel estimation loss focused to the pixel dimension by adopting a set loss calculation function based on the sample depth estimation graph and the label depth graph, and adjusts model parameters of the initial first depth model based on pixel estimation loss back propagation, such as performing back propagation adjustment on connection weight values and/or threshold values among neurons of each layer of the model based on pixel estimation loss.
The electronic device may set a first loss calculation formula for the initial first depth model, where the first loss calculation formula is a loss function of the initial first depth model, and the trained first depth model is obtained by inputting the sample depth estimation graph and the label depth graph into the first loss calculation formula in each round of model training, determining a pixel estimation loss, and performing model parameter adjustment on the initial first depth model based on the pixel estimation loss until the initial first depth model meets a model end training condition.
Optionally, the first loss calculation formula satisfies the following formula:

[Figure SMS_1: first loss calculation formula]

wherein Loss_A is the pixel estimation loss, p̂_i (Figure SMS_2) is the estimated depth value of the i-th depth pixel in the sample depth estimation map, p_i is the label depth value of the i-th depth pixel in the label depth map, γ is a loss adaptive parameter, i is an integer, and I is the total number of pixels in the sample depth estimation map.
Illustratively, the magnitude of the estimated depth value of the i-th depth pixel in the sample depth estimation map may represent a predicted probability value of the initial first depth model for the corresponding pixel, and in some embodiments, the predicted probability value may range between 0-1.
Illustratively, the γ is a loss adaptive parameter, which can be understood as an adjustment parameter for sample loss.
In one or more embodiments of the present disclosure, the first loss calculation formula takes the above form, and the pixel estimation loss it computes focuses on the pixel dimension of the depth map estimate. Compared with the distance reconstruction losses in the related art, the pixel estimation loss obtained by the first loss calculation formula can focus on, and even perceive, the regions of the input sample image that are difficult to reconstruct, and adaptively increase the weights of those regions, so that the depth estimation effect is improved during model training and a better output, namely the depth estimation map, is obtained.
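Since the formula itself appears only as an image in the published text, the sketch below is an assumed focal-style pixel loss that merely matches this description (a per-pixel error whose weight grows with the error via the adaptive parameter γ); it should not be read as the disclosed Loss_A:

```python
# Assumed pixel estimation loss: errors on hard-to-reconstruct pixels are
# up-weighted adaptively through the exponent gamma.
import torch

def pixel_estimation_loss(pred_depth, label_depth, gamma=2.0):
    """pred_depth, label_depth: tensors of shape (B, 1, H, W), values in [0, 1]."""
    err = (pred_depth - label_depth).abs()
    weight = err.detach() ** gamma        # larger weight where the error is larger
    return (weight * err).mean()          # averaged over the I pixels
```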
S210: inputting the sample depth estimation image corresponding to each sample image to an initial second depth model to output a sample depth fusion image;
The sample depth estimation maps are outputs of the initial first depth model. When the initial second depth model performs depth reconstruction, the sample depth estimation maps corresponding to the multiple frames of sample images of the same sample object are first accumulated from the initial first depth model, and then these sample depth estimation maps of the same sample object are input into the initial second depth model.
It can be understood that in the training process of each round of initial second depth model, the electronic device inputs the sample depth estimation image corresponding to each sample image of the same sample object to the initial second depth model to perform depth reconstruction processing, so as to obtain a depth image estimation feature corresponding to the sample depth estimation image and a sample depth reconstruction image corresponding to the sample depth estimation image, and in each round of training process, the sample depth reconstruction image is output, and the model parameters of the initial second depth model are adjusted by back propagation in combination with the output sample depth reconstruction image and the depth image estimation feature.
Illustratively, the depth map estimation feature of each frame of sample depth estimation map is obtained by extracting depth features from that sample depth estimation map.
In one or more embodiments of the present disclosure, the initial second depth model may be regarded as an inter-frame relationship depth model or network: during its training, the features of the sample depth estimation maps corresponding to the multi-frame sample images are computed, the depth relationships between the multi-frame depth estimation maps are then mined to obtain an inter-frame relation matrix, and this matrix is used in the model's depth reconstruction process to fuse the multi-frame depth estimation maps.
S212: and performing interframe depth fusion training on the initial second depth model based on the sample depth fusion map until the initial second depth model is trained, so as to obtain a trained second depth model.
It can be understood that in the training process of each round of initial second depth model, interframe depth fusion is performed based on the depth map estimation features corresponding to each sample depth estimation map so as to output a sample depth fusion map, and model parameter adjustment is performed on the initial second depth model based on the sample depth reconstruction map and the sample depth estimation map.
In a possible embodiment, the initial second depth model structure composition includes at least a first depth coding network, a second depth decoding network, and a third self-attention network;
Illustratively, the first depth coding network may be used as a depth feature coding, e.g., the first depth coding network may be a ResNet18 network for extracting depth features from model inputs;
illustratively, the second depth decoding network may be used as a depth reconstruction, in some embodiments the second depth decoding network may employ a Decoder, and the output of the first depth encoding network is used as an input to the second depth decoding network to obtain a reconstructed depth map, which is used herein primarily for subsequent network parameter adjustment.
Illustratively, the third self-attention network can be understood as a self-attention-based network module; in some embodiments, it may be a non-local self-attention network module. Its input is the depth map estimation features of the multi-frame depth estimation maps, inter-frame relation prediction is performed based on these input features, and its output is an inter-frame relation matrix and a reference depth map.
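A structural sketch of such a second depth model is given below, assuming a ResNet18-style encoder, a small decoder for the reconstruction branch, and standard multi-head attention as a stand-in for the non-local self-attention module; the per-pixel inter-frame relation matrix of the description is simplified here to per-frame attention weights, so this is an illustrative reading rather than the disclosed network:

```python
# Second depth model sketch: encoder (depth feature coding), decoder (depth
# reconstruction branch), attention (inter-frame relation prediction + fusion).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SecondDepthModel(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        encoder = resnet18(weights=None)
        encoder.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)  # depth-map input
        self.encoder = nn.Sequential(*list(encoder.children())[:-1])          # first depth coding network
        self.decoder = nn.Sequential(                                         # second depth decoding network
            nn.Linear(feat_dim, 64 * 64), nn.ReLU(),
            nn.Unflatten(1, (1, 64, 64)),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))
        self.attention = nn.MultiheadAttention(feat_dim, num_heads=4,         # stand-in for the
                                               batch_first=True)              # non-local self-attention

    def forward(self, depth_maps):                       # depth_maps: (B, T, 1, 128, 128)
        b, t = depth_maps.shape[:2]
        feats = self.encoder(depth_maps.flatten(0, 1)).flatten(1)             # (B*T, feat_dim)
        recon = self.decoder(feats).view(b, t, 1, 128, 128)                   # reconstruction branch
        feats = feats.view(b, t, -1)
        # frame 0 plays the role of the reference depth map / attention query
        _, relation = self.attention(feats[:, :1], feats, feats)              # relation: (B, 1, T)
        weights = relation.squeeze(1)[:, :, None, None, None]                 # per-frame weights
        fused = (weights * depth_maps).sum(dim=1)                             # sample depth fusion map
        return fused, recon

fused, recon = SecondDepthModel()(torch.rand(2, 4, 1, 128, 128))
```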
The model training process of the initial second depth model is explained as follows:
during each round of training for the initial second depth model: respectively inputting the sample depth estimation images corresponding to the sample images of the same sample object into a first depth coding network, and extracting features from each sample depth estimation image by the first depth coding network to obtain depth image estimation features corresponding to each sample depth estimation image;
Illustratively, the sample depth estimation maps corresponding to the sample images of the same sample object are respectively input into the initial second depth model, the first depth coding network of the initial second depth model extracts features from each sample depth estimation map, and the depth map estimation features corresponding to all sample depth estimation maps are obtained in sequence; in other words, the depth map estimation features are the output of the first depth coding network (such as ResNet18), whose input is the sample depth estimation maps.
Further, the electronic equipment controls the initial second depth model to input depth map estimation features into a second depth decoding network to obtain a sample depth reconstruction map corresponding to the sample depth estimation map;
schematically, the electronic device controls the initial second depth model to perform depth reconstruction by a second depth decoding network (such as a Decoder) based on a plurality of depth map estimation features, so as to obtain a sample depth reconstruction map corresponding to the reconstructed sample depth estimation map, where the sample depth reconstruction map is mainly used for subsequent network parameter adjustment, and can be understood as performing subsequent network parameter adjustment based on the calculated model loss of the sample depth reconstruction map after depth reconstruction.
Further, the electronic device inputs depth map estimation features corresponding to the sample depth estimation maps into a third self-attention network to obtain a reference depth map and an inter-frame relation matrix, wherein the reference depth map is one of the sample depth estimation maps;
Illustratively, the inter-frame relationship between the multi-frame sample images is mined by controlling the initial second depth model through a third self-attention network (such as a non-local self-attention network) based on depth map estimation features corresponding to each sample depth estimation map of the same sample object, so as to obtain a reference depth map and an inter-frame relationship matrix.
Illustratively, the reference depth map is one of the plurality of input sample depth estimation maps: the selected sample depth estimation map is taken as the reference depth image, the inter-frame relations of the other sample depth estimation maps relative to the reference depth image are mined, and they are represented as an inter-frame relation matrix w. In w_{n,j}, the index n denotes the n-th row of the inter-frame relation matrix and the index j denotes its j-th column; w_{n,j} represents the relation coefficient of the j-th pixel of a fusion depth map relative to the n-th pixel of the selected reference depth map, where a fusion depth map is any depth map other than the reference depth map among all sample depth estimation maps corresponding to the same sample object.
In addition, n, j can be understood as the reference numerals of the pixels in the corresponding depth map.
Further, the electronic device performs inter-frame depth fusion based on the inter-frame relation matrix and the reference depth map through the initial second depth model to output a sample depth fusion map.
Illustratively, in the training process for the initial second depth model, after determining the inter-frame relation matrix and the reference depth image through the initial second depth model, the inter-frame relation matrix feeds back the weight corresponding to the fusion depth image with the reference depth image as a benchmark, and the weighted fusion is performed by combining the parameters of the inter-frame relation matrix with the reference depth image, so as to obtain the weighted fused sample depth fusion image.
In a possible implementation manner, the performing, based on the inter-frame relation matrix and the reference depth map, inter-frame depth fusion to output a sample depth fusion map may be:
weighted fusion is carried out using a second inter-frame fusion calculation formula based on the inter-frame relation matrix and the reference depth map, and the sample depth fusion map is obtained;
the second inter-frame fusion calculation formula is used to realize the weighted fusion of the depth maps corresponding to the multiple sample images based on the inter-frame relation matrix and the reference depth map, so as to obtain the sample depth fusion map.
The second inter-frame fusion calculation formula satisfies the following formula:

[Figure SMS_3: second inter-frame fusion calculation formula]

wherein depth_multi is the sample depth fusion map, N is the total number of pixels in the reference depth map, n is an integer, depth_n is the depth pixel value of the n-th pixel of the reference depth map, the index n in w_{n,j} denotes the n-th row of the inter-frame relation matrix and the index j denotes its j-th column, and w_{n,j} represents the relation coefficient of the j-th pixel of a fusion depth map relative to the n-th pixel of the reference depth map, the fusion depth map being any depth map other than the reference depth map among all sample depth estimation maps corresponding to the same sample object.
Illustratively, the second interframe fusion calculation formula takes each pixel point of the reference depth map as the reference and combines it with the interframe relation matrix: since $w_{n,j}$ in the interframe relation matrix represents the relation coefficient of the $j$-th pixel point of the fusion depth map relative to the $n$-th pixel point of the reference depth map, the weight of each pixel point can be obtained; the fused depth value of each fused depth pixel point is then obtained based on the second interframe fusion calculation formula, and all fused depth values together constitute the sample depth fusion map.
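Illustratively, the weighted fusion described above can be sketched as follows; the use of NumPy and of flattened one-dimensional depth maps are illustrative choices, not details fixed by this embodiment.

```python
import numpy as np

def fuse_interframe_depth(ref_depth, w):
    """ref_depth: (N,) flattened depth pixel values of the reference depth map.
    w:           (N, N) inter-frame relation matrix, w[n, j] being the relation
                 coefficient of pixel j of the fusion depth maps with respect
                 to pixel n of the reference depth map.
    Returns depth_multi: (N,) sample depth fusion map, where each pixel j is a
    weighted sum over the reference pixels n."""
    # depth_multi[j] = sum_n w[n, j] * ref_depth[n]
    return w.T @ ref_depth
```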
It will be appreciated that, in the manner described above, a sample depth fusion map can be output during each round of model training for the initial second depth model, while model parameter adjustment is performed on the initial second depth model based on the sample depth reconstruction map and the sample depth estimation map.
In a possible implementation manner, the model parameter adjustment of the initial second depth model based on the sample depth reconstruction map and the sample depth estimation map may be:
the electronic device can input each sample depth estimation map and each sample depth reconstruction map for the same sample object into a third loss calculation formula to determine the depth reconstruction loss; model parameter adjustment is then performed on the initial second depth model based on the depth reconstruction loss, and the model parameters of the initial second depth model are adjusted through back propagation, for example, the connection weights and/or thresholds among the neurons of each layer of the model are adjusted through back propagation based on the depth reconstruction loss, until the initial second depth model meets the training completion condition, so as to obtain the trained second depth model.
The third loss calculation satisfies the following formula:
$$\mathrm{Loss}_B=\frac{1}{L}\sum_{l=1}^{L}\left\|I_{pred\text{-}l}-I_{GT\text{-}l}\right\|^{2}$$
wherein $\mathrm{Loss}_B$ is the depth reconstruction loss, $l$ is an integer, $L$ is the total number of sample depth estimation maps corresponding to the same sample object, $I_{pred\text{-}l}$ is the $l$-th sample depth reconstruction map, and $I_{GT\text{-}l}$ is the $l$-th sample depth estimation map.
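Illustratively, the depth reconstruction loss of one training round could be computed as sketched below; the mean-squared form, the framework, and the tensor shapes are illustrative assumptions.

```python
import torch

def depth_reconstruction_loss(recon_maps, est_maps):
    """recon_maps: list of L tensors (H, W) - sample depth reconstruction maps
    est_maps:   list of L tensors (H, W) - sample depth estimation maps
    Both lists correspond to the same sample object, indexed l = 1..L."""
    assert len(recon_maps) == len(est_maps)
    per_map = [torch.mean((r - e) ** 2) for r, e in zip(recon_maps, est_maps)]
    return torch.stack(per_map).mean()   # Loss_B averaged over the L maps
```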
S214: acquiring at least two frames of target color images for a target object based on an image living body detection task;
s216: performing depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image; performing interframe depth fusion processing on each first depth image based on a second depth model to obtain a second depth image aiming at the target object; and performing image living body detection processing on the target object based on the second depth image and the target color image.
Reference may be made specifically to the method steps of other embodiments of the present disclosure, which are not described here in detail.
In one or more embodiments of the present disclosure, the electronic device performs depth estimation on a plurality of target color images based on the first depth model, and uses the second depth model to mine and focus on the inter-frame depth relations among the plurality of first depth images of the same object for depth fusion, so as to obtain a second depth image corresponding to a higher-precision depth estimation. This reduces the detection requirements on image precision and image quality when the target color images are acquired, resists the detection interference of complex application environments, and allows a higher-precision second depth image to be obtained from color two-dimensional images of lower image precision or lower image quality, so that image living body detection can be performed based on the higher-precision second depth image and the target color images, thereby improving the detection effect of image living body detection in complex environments and on lower-performance hardware and improving the living body detection effect and the robustness of living body detection. In addition, an innovative first loss calculation formula is introduced to focus the pixel estimation loss during single-frame depth estimation, so that the regions of each image that are difficult to fit are better attended to and a better single-frame depth quality is achieved. During inter-frame depth fusion, the initial second depth model is instructed to compute the features of the multi-frame depth estimation maps and mine the relations among the frames to obtain an inter-frame relation matrix, and the multi-frame depth maps are fused based on the inter-frame relation matrix; the depth fusion effect is good, the output depth quality of the model is improved, and a second depth map with a good depth estimation effect can be obtained.
Referring to fig. 4, fig. 4 is a flow chart illustrating another embodiment of an image live detection method according to one or more embodiments of the present disclosure. Specifically:
s302: acquiring at least two frames of target color images aiming at a target object;
s304: performing depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image; performing interframe depth fusion processing on each first depth image based on a second depth model to obtain a second depth image aiming at the target object;
s306: performing quality enhancement processing on the second depth image based on a third depth model to obtain a third depth image after the quality enhancement processing;
it can be understood that, in an actual image living body detection scene, a fused second depth image is obtained after depth estimation and interframe depth fusion are performed by the first depth model and the second depth model. Considering objective factors such as the limited image quality of the multi-frame color images of the same object and the bottleneck of the model's recognition processing, there is a certain probability that the fused second depth image contains local or small-range discontinuities in pixel depth values. On this basis, depth quality enhancement can be performed on the fused second depth image: a third depth model is adopted for depth quality enhancement to mitigate the subsequent detection interference caused by these objective factors, the data quality of the fused depth image is further improved, a depth estimation image with higher depth quality, namely the third depth image, can be obtained, and the accuracy of the subsequent image living body detection is improved.
It can be understood that the third depth model is configured to perform quality enhancement on the depth image obtained by fusing the depth estimation images corresponding to the multi-frame color images, so as to resist the detection interference caused by objective factors and improve the quality of the fused depth image; the input of the third depth model is the second depth image at the output end of the second depth model, and the output of the third depth model is the third depth image after quality enhancement processing.
In a possible implementation manner, an initial third depth model is created in advance, and at least one sample depth fusion map corresponding to the initial second depth model is acquired; the sample depth fusion maps output in all or part of the training rounds of the initial second depth model are used as model training samples, and quality enhancement training is performed on the initial third depth model based on each sample depth fusion map until the initial third depth model completes model training, so as to obtain a trained third depth model.
In one or more embodiments of the present disclosure, the initial third depth model may be constructed based on a machine learning model, where the machine learning model may include one or more of a convolutional neural network (Convolutional Neural Network, CNN) model, a deep neural network (Deep Neural Network, DNN) model, a recurrent neural network (Recurrent Neural Network, RNN) model, an embedding (Embedding) model, a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model, a logistic regression (Logistic Regression, LR) model, and the like, and where an error back propagation algorithm is introduced during the training of the initial third depth model to perform parameter optimization in combination with the model loss, thereby improving the processing effect of the machine learning model.
Illustratively, the initial third depth model may employ a UNET model structure constructed based on a machine learning model.
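Illustratively, a compact UNET-style encoder-decoder such as the one sketched below could serve as the initial third depth model; the channel widths, network depth, and framework are placeholders chosen for brevity, not the configuration of this embodiment. For an input fused depth map of even height and width, the output keeps the same spatial size.

```python
import torch
import torch.nn as nn

class TinyDepthUNet(nn.Module):
    """Minimal UNET-style sketch for the quality-enhancement (third) model."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):                  # x: (B, 1, H, W) fused depth map
        e1 = self.enc1(x)                  # full-resolution features
        e2 = self.enc2(e1)                 # downsampled features
        d = self.up(e2)                    # back to full resolution
        d = torch.cat([d, e1], dim=1)      # skip connection
        return self.dec(d)                 # enhanced depth map, same size
```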
Optionally, the performing quality enhancement training on the initial third depth model based on each sample depth fusion map to obtain a trained third depth model may be:
in each round of model training for the initial third depth model, the electronic device can acquire a depth map enhancement label corresponding to the sample depth fusion map;
in some embodiments, the depth map enhancement label may be the label depth map corresponding to a sample image of the sample object. Since the sample depth fusion map is generated based on multiple frames of sample images of the same sample object, the label depth map may be that of one of those frames; for example, the label depth map of the sample image corresponding to the reference depth map in the initial second depth model may be selected, it being understood that the reference depth map is typically one of the sample depth estimation maps corresponding to the multiple frames of sample images. For the initial third depth model, the depth map enhancement label serves as the enhancement optimization target of the processing stage of the initial third depth model.
The electronic device performs model training on the initial third depth model: firstly, carrying out pixel disturbance processing on each sample depth fusion map to obtain the sample depth fusion map after pixel disturbance; the electronic equipment inputs each sample depth fusion image into an initial third depth model respectively to carry out quality enhancement processing, outputs a sample enhancement depth image corresponding to the sample depth fusion image, and carries out model parameter adjustment on the initial third depth model based on the sample enhancement depth image and the depth image enhancement label until the initial third depth model finishes training, so as to obtain a trained third depth model;
it can be understood that pixel perturbation processing is first performed on the sample depth fusion maps, so as to obtain the pixel-perturbed sample depth fusion maps; illustratively, applying pixel perturbation to the input depth maps can promote the depth reconstruction and quality enhancement effects of the depth model during training and simulate attack interference in a real environment, and since the perturbed depth maps need to be reconstructed in the model training stage, the model has a better depth quality enhancement capability after training is completed.
Alternatively, the pixel perturbation processing may be performing pixel perturbation processing on the sample depth estimation map by using a pixel perturbation algorithm in the related art, for example, a differential evolution method may be used to perturb the depth values of a few pixels in the sample depth estimation map (for example, only a few pixels in 1024 pixels are perturbed).
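Illustratively, a simple random variant of such pixel perturbation is sketched below (a full differential-evolution search is omitted for brevity); the number of perturbed pixels and the perturbation magnitude are illustrative assumptions.

```python
import numpy as np

def perturb_few_pixels(depth_map, num_pixels=8, max_delta=0.05, rng=None):
    """Perturb the depth values of only a handful of pixels (e.g. a few out of
    32 x 32 = 1024), leaving the rest of the depth map unchanged."""
    rng = rng or np.random.default_rng()
    out = depth_map.copy()
    h, w = out.shape
    idx = rng.choice(h * w, size=num_pixels, replace=False)   # pick pixel indices
    ys, xs = np.unravel_index(idx, (h, w))
    out[ys, xs] += rng.uniform(-max_delta, max_delta, size=num_pixels)
    return out
```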
Illustratively, the model parameter adjustment for the initial third depth model based on the sample enhanced depth map and the depth map enhanced label may be:
during each round of training of the initial third depth model: after the initial third depth network outputs the sample strengthening depth map, inputting the sample strengthening depth map and the depth map strengthening label into a fourth loss calculation formula, and determining quality strengthening loss; performing model parameter adjustment on the initial third depth model based on the quality enhancement loss;
the fourth loss calculation satisfies the following formula:
$$\mathrm{Loss}_C=\left\|I_{re}-I_{GT}\right\|^{2}$$
wherein $\mathrm{Loss}_C$ is the quality enhancement loss, $I_{re}$ is the sample enhanced depth map, and $I_{GT}$ is the depth map enhancement label.
Illustratively, in each round, after the initial third depth model outputs the sample enhanced depth map, the quality enhancement loss is calculated from the sample enhanced depth map and the depth map enhancement label, and model parameter adjustment is then performed on the initial third depth model based on the quality enhancement loss; the model parameters of the initial third depth model are adjusted through back propagation, for example, the connection weights and/or thresholds among the neurons of each layer of the model are adjusted through back propagation based on the quality enhancement loss, until the initial third depth model meets the training completion condition, thereby obtaining the trained third depth model.
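Illustratively, one training round of the quality enhancement stage could look like the sketch below; the squared-error form of the fourth loss and the optimizer interface are illustrative assumptions.

```python
import torch

def quality_enhancement_step(model, optimizer, fused_depth, label_depth):
    """One training round for an (assumed) third depth model.
    fused_depth: (B, 1, H, W) pixel-perturbed sample depth fusion maps
    label_depth: (B, 1, H, W) depth map enhancement labels"""
    optimizer.zero_grad()
    enhanced = model(fused_depth)                     # sample enhanced depth map
    loss_c = torch.mean((enhanced - label_depth) ** 2)
    loss_c.backward()                                 # back-propagate to adjust
    optimizer.step()                                  # weights and/or thresholds
    return loss_c.item()
```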
It can be understood that after the third depth model is generated by training, the second depth image is input into the third depth model to perform quality enhancement processing after the second depth image for the target object is obtained in the actual application stage, so as to obtain the third depth image after the quality enhancement processing.
S308: and executing image living detection processing on the target object based on the second depth image and the target color image by taking the third depth image as the second depth image.
It may be appreciated that, after the electronic device obtains the third depth image after the quality enhancement processing, the electronic device may use the third depth image as the second depth image, and execute a step of performing image living detection processing on the target object based on the second depth image and the target color image, and specific reference may be made to the method steps in other embodiments of the present disclosure, which will not be repeated herein.
In one or more embodiments of the present disclosure, the electronic device performs depth estimation on a plurality of target color images based on the first depth model, and uses the second depth model to mine and focus on the inter-frame depth relations among the plurality of first depth images of the same object for depth fusion, so as to obtain a second depth image corresponding to a higher-precision depth estimation. This reduces the detection requirements on image precision and image quality when the target color images are acquired, resists the detection interference of complex application environments, and allows a higher-precision second depth image to be obtained from color two-dimensional images of lower image precision or lower image quality, so that image living body detection can be performed based on the higher-precision second depth image and the target color images, thereby improving the detection effect of image living body detection in complex environments and on lower-performance hardware and improving the living body detection effect and the robustness of living body detection. In addition, a third depth model is introduced to perform quality enhancement after the interframe depth fusion, so that the data quality of the fused depth estimation image can be further improved, environmental interference can be effectively resisted, and the stability and accuracy of living body detection can be improved.
The image living body detection apparatus provided in the present specification will be described in detail with reference to fig. 5. Note that, the image living body detection apparatus shown in fig. 5 is used to perform the method of the embodiment shown in fig. 1 to 4 of the present application, and for convenience of explanation, only the portion relevant to the present specification is shown, and specific technical details are not disclosed, please refer to the embodiment shown in fig. 1 to 4 of the present application.
Referring to fig. 5, a schematic structural diagram of the image living body detection apparatus of the present specification is shown. The image living body detection apparatus 1 may be implemented as all or a part of a user terminal by software, hardware, or a combination of both. According to some embodiments, the image living body detection apparatus 1 comprises an image acquisition module 11, a depth estimation module 12, a depth fusion module 13 and a living body detection module 14, in particular for:
an image acquisition module 11 for acquiring at least two frames of target color images for a target object based on an image living body detection task;
the depth estimation module 12 is configured to perform depth estimation processing on each of the target color images based on a first depth model, so as to obtain a first depth image corresponding to each frame of the target color image;
The depth fusion module 13 is configured to perform inter-frame depth fusion processing on each of the first depth images based on a second depth model, so as to obtain a second depth image for the target object;
a living body detection module 14, configured to perform image living body detection processing on the target object based on the second depth image and the target color image.
Optionally, as shown in fig. 8, the apparatus 1 includes:
a model training module 15 for creating an initial first depth model and an initial second depth model;
the model training module 15 is configured to acquire at least one set of image sample data, where the image sample data includes at least two consecutive sample images for the same sample object;
the model training module 15 is configured to instruct, based on the image sample data, the depth estimation module 12 to perform depth estimation training on an initial first depth model and instruct the depth fusion module 13 to perform inter-frame depth fusion training on an initial second depth model, so as to obtain a trained first depth model and second depth model.
Optionally, the depth estimation module 12 is configured to input each sample image in the image sample data into an initial first depth model, output a sample depth estimation map corresponding to each sample image, and perform depth estimation training on the initial first depth model based on the sample depth estimation map until the initial first depth model completes training, so as to obtain a trained first depth model;
The depth fusion module 13 is configured to input the sample depth estimation map corresponding to each sample image to an initial second depth model to output a sample depth fusion map, and perform inter-frame depth fusion training on the initial second depth model based on the sample depth fusion map until the initial second depth model completes training, so as to obtain a trained second depth model.
Optionally, as shown in fig. 6, the depth estimation module 12 includes:
a depth reconstruction unit 121, configured to determine a label depth map corresponding to each sample image; inputting each sample image into an initial first depth model for depth estimation processing, and outputting a sample depth estimation image corresponding to each sample image;
a parameter adjustment unit 122, configured to determine a pixel estimation loss based on the label depth map, and perform model parameter adjustment on the initial first depth model based on the pixel estimation loss.
Optionally, the parameter adjusting unit 122 is configured to:
inputting the sample depth estimation image and the label depth image into a first loss calculation formula, and determining pixel estimation loss;
the first loss calculation satisfies the following formula:
$$\mathrm{Loss}_A=\frac{1}{I}\sum_{i=1}^{I}\left|\hat{p}_i-p_i\right|^{\gamma}$$
wherein $\mathrm{Loss}_A$ is the pixel estimation loss, $\hat{p}_i$ is the estimated depth value of the $i$-th depth pixel point in the sample depth estimation map, $p_i$ is the label depth value of the $i$-th depth pixel point in the label depth map, $\gamma$ is a loss adaptive parameter, $i$ is an integer, and $I$ is the total number of pixel points of the sample depth estimation map.
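Illustratively, a sketch of such a focal-style pixel estimation loss is given below; the framework, tensor shapes, and the exact weighting form are illustrative assumptions.

```python
import torch

def pixel_estimation_loss(pred_depth, label_depth, gamma=2.0):
    """pred_depth, label_depth: (H, W) sample depth estimation map and label
    depth map. gamma is the loss adaptive parameter; larger per-pixel errors
    are emphasized, focusing training on hard-to-fit regions."""
    err = torch.abs(pred_depth - label_depth)   # |p_hat_i - p_i| per pixel
    return torch.mean(err.pow(gamma))           # averaged over the I pixels
```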
Optionally, the model training module 15 is configured to:
acquiring each sample image based on an image living body detection task, and determining an image sample type of the sample image;
and setting a label depth map corresponding to the sample image based on the image sample type.
Optionally, the image sample types include an attack image type and a living body image type, and the model training module 15 is configured to:
acquiring a first sample image corresponding to the attack image type, and setting a depth pixel value of a first label depth image corresponding to the first sample image as a target depth pixel value;
and acquiring a second sample image corresponding to the living body image type, and calling a target image depth service to determine a second label depth map corresponding to the second sample image.
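Illustratively, setting the label depth map by image sample type could be sketched as below; the constant target depth value for attack samples, the depth_service interface, and the map size are hypothetical choices, not details fixed by the text.

```python
import numpy as np

def build_label_depth(sample_image, sample_type, depth_service, size=(32, 32)):
    """Return the label depth map for one sample image."""
    if sample_type == "attack":
        # attack (non-living) samples: a flat target depth value, e.g. all zeros
        return np.zeros(size, dtype=np.float32)
    # living samples: call an existing image depth service for a pseudo label
    return np.asarray(depth_service(sample_image), dtype=np.float32)
```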
Optionally, as shown in fig. 7, the depth fusion module 13 includes:
The depth reconstruction unit 131 is configured to input the sample depth estimation map corresponding to each sample image to an initial second depth model for performing depth reconstruction processing, so as to obtain a depth map estimation feature corresponding to the sample depth estimation map and a sample depth reconstruction map corresponding to the sample depth estimation map;
the parameter adjustment unit 132 is configured to perform inter-frame depth fusion based on the depth map estimation feature corresponding to each sample depth estimation map to output a sample depth fusion map, and perform model parameter adjustment on the initial second depth model based on the sample depth reconstruction map and the sample depth estimation map.
Optionally, the initial second depth model comprises at least a first depth coding network, a second depth decoding network and a third self-attention network,
the depth reconstruction unit 131 is configured to: respectively inputting the sample depth estimation images corresponding to the sample images of the same sample object into the first depth coding network to obtain depth image estimation characteristics corresponding to the sample depth estimation images; inputting the depth map estimation characteristics into the second depth decoding network to obtain a sample depth reconstruction map corresponding to the sample depth estimation map;
The parameter adjustment unit 132 is configured to input a depth map estimation feature corresponding to each sample depth estimation map into a third self-attention network, so as to obtain a reference depth map and an inter-frame relationship matrix, where the reference depth map is one of the sample depth estimation maps;
and carrying out interframe depth fusion based on the interframe relation matrix and the reference depth map to output a sample depth fusion map.
Optionally, the parameter adjusting unit 132 is configured to: performing weighted fusion by adopting a second interframe fusion calculation mode based on the interframe relation matrix and the reference depth map to obtain a sample depth fusion map;
the second interframe fusion calculation formula satisfies the following formula:
$$\mathrm{depth}_{multi}(j)=\sum_{n=1}^{N} w_{n,j}\,\mathrm{depth}_{n},\qquad j=1,\dots,N$$
wherein $\mathrm{depth}_{multi}$ is the sample depth fusion map, $N$ is the total number of pixel points of the reference depth map, $n$ is an integer, $\mathrm{depth}_{n}$ is the depth pixel value of the $n$-th pixel point of the reference depth map, the subscript $n$ of $w_{n,j}$ refers to the $n$-th row of the inter-frame relation matrix and the subscript $j$ refers to the $j$-th column of the inter-frame relation matrix, $w_{n,j}$ represents the relation coefficient of the $j$-th pixel point of the fusion depth map relative to the $n$-th pixel point of the reference depth map, and the fusion depth map refers to the depth maps, other than the reference depth map, among all sample depth estimation maps corresponding to the same sample object.
Optionally, the parameter adjusting unit 132 is configured to: inputting each sample depth estimation image and each sample depth reconstruction image aiming at the same sample object into a third loss calculation formula to determine depth reconstruction loss;
model parameter adjustment is carried out on the initial second depth model based on the depth reconstruction loss;
the third loss calculation satisfies the following formula:
$$\mathrm{Loss}_B=\frac{1}{L}\sum_{l=1}^{L}\left\|I_{pred\text{-}l}-I_{GT\text{-}l}\right\|^{2}$$
wherein $\mathrm{Loss}_B$ is the depth reconstruction loss, $l$ is an integer, $L$ is the total number of sample depth estimation maps corresponding to the same sample object, $I_{pred\text{-}l}$ is the $l$-th sample depth reconstruction map, and $I_{GT\text{-}l}$ is the $l$-th sample depth estimation map.
Optionally, the device 1 is further configured to: performing quality enhancement processing on the second depth image based on a third depth model to obtain a third depth image after the quality enhancement processing;
the living body detection module 14 is further configured to:
and executing image living detection processing on the target object based on the second depth image and the target color image by taking the third depth image as the second depth image.
Optionally, the device 1 is further configured to:
Creating an initial third depth model;
acquiring at least one sample depth fusion map of the second depth model corresponding to the initial second depth model;
and performing quality enhancement training on the initial third depth model based on each sample depth fusion map to obtain a trained third depth model.
Optionally, the device 1 is further configured to: acquiring a depth map strengthening label corresponding to the sample depth fusion map;
performing pixel disturbance processing on each sample depth fusion map to obtain the sample depth fusion map after pixel disturbance;
and respectively inputting each sample depth fusion image into an initial third depth model for quality enhancement processing, outputting a sample enhancement depth image corresponding to the sample depth fusion image, and carrying out model parameter adjustment on the initial third depth model based on the sample enhancement depth image and the depth image enhancement label until the initial third depth model is trained, so as to obtain a trained third depth model.
Optionally, the device 1 is further configured to: inputting the sample strengthening depth map and the depth map strengthening label into a fourth loss calculation formula, and determining quality strengthening loss;
Model parameter adjustment is carried out on the initial third depth model based on the quality enhancement loss;
the fourth loss calculation satisfies the following formula:
$$\mathrm{Loss}_C=\left\|I_{re}-I_{GT}\right\|^{2}$$
wherein $\mathrm{Loss}_C$ is the quality enhancement loss, $I_{re}$ is the sample enhanced depth map, and $I_{GT}$ is the depth map enhancement label.
Optionally, the living body detecting module 14 is configured to:
inputting the second depth image and the target color image into a living body detection model, and outputting a first living body probability value corresponding to the second depth image and a second living body probability value corresponding to the target color image;
a living body detection type of the target object is determined based on the first living body probability value and the second living body probability value.
Optionally, the living body detecting module 14 is configured to:
determining a target living probability based on the first living probability value and the second living probability value;
if the target living body probability is greater than a target threshold value, determining that the target object is a living body object type;
and if the target living probability is smaller than or equal to a target threshold value, determining that the target object is an attack object type.
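Illustratively, the decision step could be sketched as below; how the two probability values are combined into the target living probability is not fixed by the text, so the simple average used here is an assumption.

```python
def decide_liveness(first_living_prob, second_living_prob, target_threshold=0.5):
    """Combine the probability from the second depth image and the probability
    from the target color image, then compare with the target threshold."""
    target_living_prob = 0.5 * (first_living_prob + second_living_prob)
    if target_living_prob > target_threshold:
        return "living object type"
    return "attack object type"
```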
It should be noted that, when the image living body detection apparatus provided in the above embodiment performs the image living body detection method, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the image living body detection device and the image living body detection method provided in the above embodiments belong to the same concept, which embody the detailed implementation process and are not described herein.
The foregoing description is provided for the purpose of illustration only and does not represent the advantages or disadvantages of the embodiments.
In one or more embodiments of the present disclosure, the electronic device performs depth estimation on a plurality of target color images based on the first depth model, and uses the second depth model to mine and focus on the inter-frame depth relations among the plurality of first depth images of the same object for depth fusion, so as to obtain a second depth image corresponding to a higher-precision depth estimation. This reduces the detection requirements on image precision and image quality when the target color images are acquired, resists the detection interference of complex application environments, and allows a higher-precision second depth image to be obtained from color two-dimensional images of lower image precision or lower image quality, so that image living body detection can be performed based on the higher-precision second depth image and the target color images, thereby improving the detection effect of image living body detection in complex environments and on lower-performance hardware and improving the robustness of living body detection.
The present disclosure further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are adapted to be loaded by a processor and executed by the processor, where the specific execution process may refer to the specific description of the embodiment shown in fig. 1 to 4, and the details are not repeated herein.
The application further provides a computer program product, where at least one instruction is stored, where the at least one instruction is loaded by the processor and executed by the processor, and the specific execution process may refer to the specific description of the embodiment shown in fig. 1 to 4, and details are not repeated herein.
Referring to fig. 9, a block diagram of an electronic device according to an exemplary embodiment of the present application is shown. An electronic device in the present application may include one or more of the following components: processor 110, memory 120, input device 130, output device 140, and bus 150. The processor 110, the memory 120, the input device 130, and the output device 140 may be connected by a bus 150.
Processor 110 may include one or more processing cores. The processor 110 utilizes various interfaces and lines to connect various portions of the overall electronic device, perform various functions of the electronic device 100, and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 120, and invoking data stored in the memory 120. Alternatively, the processor 110 may be implemented in at least one hardware form of digital signal processing (digital signal processing, DSP), field-programmable gate array (field-programmable gate array, FPGA), programmable logic array (programmable logic Array, PLA). The processor 110 may integrate one or a combination of several of a central processor (central processing unit, CPU), an image processor (graphics processing unit, GPU), and a modem, etc. The CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for being responsible for rendering and drawing of display content; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 110 and may be implemented solely by a single communication chip.
The memory 120 may include a random access memory (random Access Memory, RAM) or a read-only memory (ROM). Optionally, the memory 120 includes a non-transitory computer readable medium (non-transitory computer-readable storage medium). Memory 120 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 120 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, which may be an Android (Android) system, including an Android system-based deep development system, an IOS system developed by apple corporation, including an IOS system-based deep development system, or other systems, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like. The storage data area may also store data created by the electronic device in use, such as phonebooks, audiovisual data, chat log data, and the like.
Referring to FIG. 10, the memory 120 may be divided into an operating system space in which the operating system is running and a user space in which native and third party applications are running. In order to ensure that different third party application programs can achieve better operation effects, the operating system allocates corresponding system resources for the different third party application programs. However, the requirements of different application scenarios in the same third party application program on system resources are different, for example, under the local resource loading scenario, the third party application program has higher requirement on the disk reading speed; in the animation rendering scene, the third party application program has higher requirements on the GPU performance. The operating system and the third party application program are mutually independent, and the operating system often cannot timely sense the current application scene of the third party application program, so that the operating system cannot perform targeted system resource adaptation according to the specific application scene of the third party application program.
In order to enable the operating system to distinguish specific application scenes of the third-party application program, data communication between the third-party application program and the operating system needs to be communicated, so that the operating system can acquire current scene information of the third-party application program at any time, and targeted system resource adaptation is performed based on the current scene.
Taking the Android system as an example of the operating system, as shown in fig. 11, the programs and data stored in the memory 120 may be divided into a Linux kernel layer 320, a system runtime library layer 340, an application framework layer 360 and an application layer 380, where the Linux kernel layer 320, the system runtime library layer 340 and the application framework layer 360 belong to the operating system space, and the application layer 380 belongs to the user space. The Linux kernel layer 320 provides the underlying drivers for various hardware of the electronic device, such as display drivers, audio drivers, camera drivers, bluetooth drivers, Wi-Fi drivers, power management, and the like. The system runtime library layer 340 provides the main feature support for the Android system through some C/C++ libraries. For example, the SQLite library provides support for databases, the OpenGL/ES library provides support for 3D graphics, the Webkit library provides support for the browser kernel, and the like. Also provided in the system runtime library layer 340 is an Android runtime library (Android Runtime), which mainly provides some core libraries that allow developers to write Android applications using the Java language. The application framework layer 360 provides various APIs that may be used in building applications, and developers can also build their own applications by using these APIs, for example, activity management, window management, view management, notification management, content providers, package management, call management, resource management, and location management. At least one application program runs in the application layer 380; these application programs may be native applications of the operating system, such as a contacts program, a short message program, a clock program, a camera application, etc., or third-party applications developed by third-party developers, such as game applications, instant messaging programs, photo beautification programs, etc.
Taking an operating system as an IOS system as an example, the program and data stored in the memory 120 are shown in fig. 9, the IOS system includes: core operating system layer 420 (Core OS layer), core service layer 440 (Core Services layer), media layer 460 (Media layer), and touchable layer 480 (Cocoa Touch Layer). The core operating system layer 420 includes an operating system kernel, drivers, and underlying program frameworks that provide more hardware-like functionality for use by the program frameworks at the core services layer 440. The core services layer 440 provides system services and/or program frameworks required by the application, such as a Foundation (Foundation) framework, an account framework, an advertisement framework, a data storage framework, a network connection framework, a geographic location framework, a sports framework, and the like. The media layer 460 provides an interface for applications related to audiovisual aspects, such as a graphics-image related interface, an audio technology related interface, a video technology related interface, an audio video transmission technology wireless play (AirPlay) interface, and so forth. The touchable layer 480 provides various commonly used interface-related frameworks for application development, with the touchable layer 480 being responsible for user touch interactions on the electronic device. Such as a local notification service, a remote push service, an advertisement framework, a game tool framework, a message User Interface (UI) framework, a User Interface UIKit framework, a map framework, and so forth.
Among the frameworks illustrated in fig. 12, frameworks related to most applications include, but are not limited to: the infrastructure in core services layer 440 and the UIKit framework in touchable layer 480. The infrastructure provides many basic object classes and data types, providing the most basic system services for all applications, independent of the UI. While the class provided by the UIKit framework is a basic UI class library for creating touch-based user interfaces, iOS applications can provide UIs based on the UIKit framework, so it provides the infrastructure for applications to build user interfaces, draw, process and user interaction events, respond to gestures, and so on.
The manner and principle of implementing data communication between the third party application program and the operating system in the IOS system may refer to the Android system, which is not described herein.
The input device 130 is configured to receive input instructions or data, and the input device 130 includes, but is not limited to, a keyboard, a mouse, a camera, a microphone, or a touch device. The output device 140 is used to output instructions or data, and the output device 140 includes, but is not limited to, a display device, a speaker, and the like. In one example, the input device 130 and the output device 140 may be combined, and the input device 130 and the output device 140 are a touch display screen for receiving a touch operation thereon or thereabout by a user using a finger, a touch pen, or any other suitable object, and displaying a user interface of each application program. Touch display screens are typically provided on the front panel of an electronic device. The touch display screen may be designed as a full screen, a curved screen, or a contoured screen. The touch display screen can also be designed to be a combination of a full screen and a curved screen, and a combination of a special-shaped screen and a curved screen is not limited in this specification.
In addition, those skilled in the art will appreciate that the configuration of the electronic device shown in the above-described figures does not constitute a limitation of the electronic device, and the electronic device may include more or less components than illustrated, or may combine certain components, or may have a different arrangement of components. For example, the electronic device further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a wireless fidelity (wireless fidelity, wiFi) module, a power supply, and a bluetooth module, which are not described herein.
In this specification, the execution subject of each step may be the electronic device described above. Optionally, the execution subject of each step is an operating system of the electronic device. The operating system may be an android system, an IOS system, or other operating systems, which is not limited in this specification.
The electronic device of the present specification may further have a display device mounted thereon, and the display device may be various devices capable of realizing a display function, for example: cathode ray tube displays (cathode ray tubedisplay, CR), light-emitting diode displays (light-emitting diode display, LED), electronic ink screens, liquid crystal displays (liquid crystal display, LCD), plasma display panels (plasma display panel, PDP), and the like. A user may utilize a display device on electronic device 101 to view displayed text, images, video, etc. The electronic device may be a smart phone, a tablet computer, a gaming device, an AR (Augmented Reality ) device, an automobile, a data storage device, an audio playing device, a video playing device, a notebook, a desktop computing device, a wearable device such as an electronic watch, electronic glasses, an electronic helmet, an electronic bracelet, an electronic necklace, an electronic article of clothing, etc.
In the electronic device shown in fig. 9, where the electronic device may be a terminal, the processor 110 may be configured to invoke an application program stored in the memory 120 and specifically perform the following operations:
acquiring at least two frames of target color images aiming at a target object;
performing depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image;
performing interframe depth fusion processing on each first depth image based on a second depth model to obtain a second depth image aiming at the target object;
and performing image living body detection processing on the target object based on the second depth image and the target color image.
In one embodiment, the processor 110, prior to performing the image live detection method, further performs the following:
creating an initial first depth model and an initial second depth model;
acquiring at least one set of image sample data comprising at least two consecutive frames of sample images for the same sample object;
and performing depth estimation training on the initial first depth model and inter-frame depth fusion training on the initial second depth model based on the image sample data to obtain a trained first depth model and a trained second depth model.
In one embodiment, the processor 110 performs the following operations when performing depth estimation training on the initial first depth model and interframe depth fusion training on the initial second depth model based on the image sample data to obtain the trained first depth model and second depth model:
inputting each sample image in the image sample data into an initial first depth model to output a sample depth estimation image corresponding to each sample image, and performing depth estimation training on the initial first depth model based on the sample depth estimation image until the initial first depth model is trained, so as to obtain a trained first depth model;
inputting the sample depth estimation image corresponding to each sample image to an initial second depth model to output a sample depth fusion image, and performing interframe depth fusion training on the initial second depth model based on the sample depth fusion image until the initial second depth model is trained, so as to obtain a trained second depth model.
In one embodiment, the processor 110 outputs a sample depth estimation map corresponding to each of the sample images after executing the inputting the sample images in the image sample data into an initial first depth model, and performs depth estimation training on the initial first depth model based on the sample depth estimation map, specifically performing the following steps:
Determining a label depth map corresponding to each sample image;
inputting each sample image into an initial first depth model for depth estimation processing, and outputting a sample depth estimation image corresponding to each sample image;
determining a pixel estimation loss based on the label depth map, and performing model parameter adjustment on the initial first depth model based on the pixel estimation loss.
In one embodiment, the processor 110, when executing the determining pixel estimation loss based on the sample depth estimation map and the label depth map, specifically executes the following steps:
inputting the sample depth estimation image and the label depth image into a first loss calculation formula, and determining pixel estimation loss;
the first loss calculation satisfies the following formula:
$$\mathrm{Loss}_A=\frac{1}{I}\sum_{i=1}^{I}\left|\hat{p}_i-p_i\right|^{\gamma}$$
wherein $\mathrm{Loss}_A$ is the pixel estimation loss, $\hat{p}_i$ is the estimated depth value of the $i$-th depth pixel point in the sample depth estimation map, $p_i$ is the label depth value of the $i$-th depth pixel point in the label depth map, $\gamma$ is a loss adaptive parameter, $i$ is an integer, and $I$ is the total number of pixel points of the sample depth estimation map.
In one embodiment, before executing the determining the tag depth map corresponding to each sample image, the processor 110 further includes:
Acquiring each sample image based on an image living body detection task, and determining an image sample type of the sample image;
and setting a label depth map corresponding to the sample image based on the image sample type.
In one embodiment, the image sample type includes an attack image type and a living body image type, and the processor 110 specifically performs the following steps when executing the setting of the label depth map corresponding to the sample image based on the image sample type:
acquiring a first sample image corresponding to the attack image type, and setting a depth pixel value of a first label depth image corresponding to the first sample image as a target depth pixel value;
and acquiring a second sample image corresponding to the living body image type, and calling a target image depth service to determine a second label depth map corresponding to the second sample image.
In one embodiment, the processor 110 performs the input of the sample depth estimation map corresponding to each of the sample images to an initial second depth model to output a sample depth fusion map, and performs inter-frame depth fusion training on the initial second depth model based on the sample depth fusion map, specifically performs the following steps:
Inputting the sample depth estimation image corresponding to each sample image into an initial second depth model for depth reconstruction processing to obtain depth image estimation characteristics corresponding to the sample depth estimation image and a sample depth reconstruction image corresponding to the sample depth estimation image;
and carrying out interframe depth fusion on the depth map estimation characteristics corresponding to each sample depth estimation map to output a sample depth fusion map, and carrying out model parameter adjustment on the initial second depth model on the basis of the sample depth reconstruction map and the sample depth estimation map.
In one embodiment, the initial second depth model at least includes a first depth coding network, a second depth decoding network, and a third self-attention network, and the processor 110 performs the depth reconstruction processing by inputting the sample depth estimation image corresponding to each sample image to the initial second depth model, to obtain a depth image estimation feature corresponding to the sample depth estimation image, and a sample depth reconstruction image corresponding to the sample depth estimation image, and performs inter-frame depth fusion based on the depth image estimation feature corresponding to each sample depth estimation image to output a sample depth fusion image, and specifically performs the following steps:
Respectively inputting the sample depth estimation images corresponding to the sample images of the same sample object into the first depth coding network to obtain depth image estimation characteristics corresponding to the sample depth estimation images;
inputting the depth map estimation characteristics into the second depth decoding network to obtain a sample depth reconstruction map corresponding to the sample depth estimation map;
inputting depth map estimation features corresponding to the sample depth estimation maps into a third self-attention network to obtain a reference depth map and an inter-frame relation matrix, wherein the reference depth map is one of the sample depth estimation maps;
and carrying out interframe depth fusion based on the interframe relation matrix and the reference depth map to output a sample depth fusion map.
In one embodiment, the processor 110 performs the inter-frame depth fusion based on the inter-frame relation matrix and the reference depth map to output a sample depth fusion map, specifically performing the following steps:
performing weighted fusion by adopting a second interframe fusion calculation formula based on the interframe relation matrix and the reference depth map, to obtain a sample depth fusion map;
the second interframe fusion calculation formula satisfies the following formula:
$$\mathrm{depth}_{multi}(j)=\sum_{n=1}^{N} w_{n,j}\,\mathrm{depth}_{n},\qquad j=1,\dots,N$$
wherein $\mathrm{depth}_{multi}$ is the sample depth fusion map, $N$ is the total number of pixel points of the reference depth map, $n$ is an integer, $\mathrm{depth}_{n}$ is the depth pixel value of the $n$-th pixel point of the reference depth map, the subscript $n$ of $w_{n,j}$ refers to the $n$-th row of the inter-frame relation matrix and the subscript $j$ refers to the $j$-th column of the inter-frame relation matrix, $w_{n,j}$ represents the relation coefficient of the $j$-th pixel point of the fusion depth map relative to the $n$-th pixel point of the reference depth map, and the fusion depth map refers to the depth maps, other than the reference depth map, among all sample depth estimation maps corresponding to the same sample object.
In one embodiment, the processor 110 performs the model parameter adjustment on the initial second depth model based on the sample depth reconstruction map and the sample depth estimation map, specifically performing the following steps:
inputting each sample depth estimation image and each sample depth reconstruction image aiming at the same sample object into a third loss calculation formula to determine depth reconstruction loss;
model parameter adjustment is carried out on the initial second depth model based on the depth reconstruction loss;
The third loss calculation satisfies the following formula:
$$\mathrm{Loss}_B=\frac{1}{L}\sum_{l=1}^{L}\left\|I_{pred\text{-}l}-I_{GT\text{-}l}\right\|^{2}$$
wherein $\mathrm{Loss}_B$ is the depth reconstruction loss, $l$ is an integer, $L$ is the total number of sample depth estimation maps corresponding to the same sample object, $I_{pred\text{-}l}$ is the $l$-th sample depth reconstruction map, and $I_{GT\text{-}l}$ is the $l$-th sample depth estimation map.
In one embodiment, after performing the inter-frame depth fusion processing on each of the first depth images based on the second depth model, the processor 110 further specifically performs the following steps:
performing quality enhancement processing on the second depth image based on a third depth model to obtain a third depth image after the quality enhancement processing;
the performing image living body detection processing on the target object based on the second depth image and the target color image includes:
and executing image living detection processing on the target object based on the second depth image and the target color image by taking the third depth image as the second depth image.
In one embodiment, the processor 110, when executing the image live detection method, further performs the steps of:
Creating an initial third depth model;
acquiring at least one sample depth fusion map of the second depth model corresponding to the initial second depth model;
and performing quality enhancement training on the initial third depth model based on each sample depth fusion map to obtain a trained third depth model.
In one embodiment, the processor 110 performs the quality enhancement training on the initial third depth model based on each of the sample depth fusion maps to obtain a trained third depth model, and specifically performs the following steps:
acquiring a depth map strengthening label corresponding to the sample depth fusion map;
performing pixel disturbance processing on each sample depth fusion map to obtain the sample depth fusion map after pixel disturbance;
and respectively inputting each sample depth fusion image into an initial third depth model for quality enhancement processing, outputting a sample enhancement depth image corresponding to the sample depth fusion image, and carrying out model parameter adjustment on the initial third depth model based on the sample enhancement depth image and the depth image enhancement label until the initial third depth model is trained, so as to obtain a trained third depth model.
In one embodiment, the processor 110 performs the model parameter adjustment on the initial third depth model based on the sample enhanced depth map and the depth map enhanced label, specifically performing the following steps:
inputting the sample strengthening depth map and the depth map strengthening label into a fourth loss calculation formula, and determining quality strengthening loss;
model parameter adjustment is carried out on the initial third depth model based on the quality enhancement loss;
the fourth loss calculation satisfies the following formula:
$$\mathrm{Loss}_C=\left\|I_{re}-I_{GT}\right\|^{2}$$
wherein $\mathrm{Loss}_C$ is the quality enhancement loss, $I_{re}$ is the sample enhanced depth map, and $I_{GT}$ is the depth map enhancement label.
In one embodiment, the processor 110 performs the image live detection processing on the target object based on the second depth image and the target color image, specifically performs the following steps:
inputting the second depth image and the target color image into a living body detection model, and outputting a first living body probability value corresponding to the second depth image and a second living body probability value corresponding to the target color image;
A living body detection type of the target object is determined based on the first living body probability value and the second living body probability value.
In one embodiment, the processor 110, when executing the determining the living detection type of the target object based on the first living probability value and the second living probability value, specifically executes the following steps:
determining a target living probability based on the first living probability value and the second living probability value;
if the target living body probability is greater than a target threshold value, determining that the target object is a living body object type;
and if the target living probability is smaller than or equal to a target threshold value, determining that the target object is an attack object type.
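A compact decision rule matching these steps might look as follows; the averaging of the two probability values and the 0.5 threshold are assumptions, since the embodiment only requires that a target living probability be derived from both values and compared with a target threshold.

```python
def decide_liveness(p_depth: float, p_color: float, threshold: float = 0.5) -> str:
    """Fuse the depth-branch and color-branch liveness probabilities and threshold the result."""
    p_target = 0.5 * (p_depth + p_color)   # assumed fusion rule: simple mean
    return "living object" if p_target > threshold else "attack object"
```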
In one or more embodiments of the present disclosure, the electronic device performs depth estimation on a plurality of target color images based on the first depth model, and uses the second depth model to mine the inter-frame depth relations among the plurality of first depth images of the same object and fuse them, so that a second depth image with higher-precision depth estimation can be obtained. This reduces the requirements on image precision and image quality when the target color images are acquired and resists the detection interference of complex application environments, so that a higher-precision second depth image can be obtained from color two-dimensional images of lower precision or lower quality. Image living body detection can then be performed based on the higher-precision second depth image and the target color image, which improves the detection effect of image living body detection in complex environments and on lower-performance hardware and improves the robustness of living body detection.
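Putting the stages above together, an end-to-end inference sketch could look like the following; every model object is a hypothetical callable standing in for the trained first, second and third depth models and the living body detection model, and the probability fusion rule is assumed as before.

```python
import torch

def liveness_pipeline(color_frames, depth_model_1, fusion_model_2,
                      enhance_model_3, liveness_model, threshold=0.5):
    """Sketch of the inference flow: per-frame depth estimation, inter-frame depth
    fusion, optional quality enhancement, then liveness detection on depth + color."""
    with torch.no_grad():
        first_depth_maps = [depth_model_1(f) for f in color_frames]     # first depth images
        second_depth = fusion_model_2(torch.stack(first_depth_maps))    # second depth image
        second_depth = enhance_model_3(second_depth)                    # third depth image used as second
        p_depth, p_color = liveness_model(second_depth, color_frames[-1])
    p_target = 0.5 * (p_depth + p_color)                                # assumed fusion rule
    return "living object" if p_target > threshold else "attack object"
```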
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium which, when executed, may perform the steps of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
The foregoing disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the present application; equivalent variations made according to the claims of the present application shall still fall within the scope of the present application.

Claims (21)

1. An image living body detection method, the method comprising:
acquiring at least two frames of target color images aiming at a target object;
performing depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image;
performing interframe depth fusion processing on each first depth image based on a second depth model to obtain a second depth image aiming at the target object;
and performing image living body detection processing on the target object based on the second depth image and the target color image.
2. The method of claim 1, the method further comprising:
creating an initial first depth model and an initial second depth model;
acquiring at least one set of image sample data comprising at least two consecutive frames of sample images for the same sample object;
and performing depth estimation training on the initial first depth model and inter-frame depth fusion training on the initial second depth model based on the image sample data to obtain a trained first depth model and a trained second depth model.
3. The method of claim 2, wherein the training for depth estimation of the initial first depth model and the training for interframe depth fusion of the initial second depth model based on the image sample data, to obtain the trained first depth model and second depth model, comprises:
inputting each sample image in the image sample data into an initial first depth model to output a sample depth estimation image corresponding to each sample image, and performing depth estimation training on the initial first depth model based on the sample depth estimation image until the initial first depth model is trained, so as to obtain a trained first depth model;
Inputting the sample depth estimation image corresponding to each sample image to an initial second depth model to output a sample depth fusion image, and performing interframe depth fusion training on the initial second depth model based on the sample depth fusion image until the initial second depth model is trained, so as to obtain a trained second depth model.
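As a rough orchestration of claims 2-3, the two training phases could be wired up as below; the argument names, the decision to freeze the first model during the second phase, and the loss callables are assumptions, with the concrete loss forms elaborated in claims 4-11. sample_batches is assumed to be a list of (frames, label_depths) pairs.

```python
import torch

def train_two_stage(sample_batches, model_1, model_2, opt_1, opt_2,
                    depth_estimation_loss, depth_fusion_loss):
    """Stage 1: depth estimation training of the initial first depth model.
    Stage 2: inter-frame depth fusion training of the initial second depth model
    on the first model's depth estimates. All arguments are hypothetical stand-ins."""
    # Stage 1: per-frame depth estimation against the label depth maps
    for frames, label_depths in sample_batches:
        preds = [model_1(f) for f in frames]
        loss_a = sum(depth_estimation_loss(p, d) for p, d in zip(preds, label_depths))
        opt_1.zero_grad(); loss_a.backward(); opt_1.step()
    # Stage 2: fuse the (frozen) first model's estimates and train the second model
    for frames, _ in sample_batches:
        with torch.no_grad():
            estimates = torch.stack([model_1(f) for f in frames])
        _fusion_map, recon_maps = model_2(estimates)     # reconstruction maps drive the loss
        loss_b = depth_fusion_loss(recon_maps, estimates)
        opt_2.zero_grad(); loss_b.backward(); opt_2.step()
```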
4. A method according to claim 3, wherein inputting each sample image in the image sample data into an initial first depth model outputs a sample depth estimation map corresponding to each sample image, and performing depth estimation training on the initial first depth model based on the sample depth estimation map, comprising:
determining a label depth map corresponding to each sample image;
inputting each sample image into an initial first depth model for depth estimation processing, and outputting a sample depth estimation image corresponding to each sample image;
determining a pixel estimation loss based on the sample depth estimation map and the label depth map, and performing model parameter adjustment on the initial first depth model based on the pixel estimation loss.
5. The method of claim 4, the determining pixel estimation loss based on the sample depth estimation map and the label depth map, comprising:
Inputting the sample depth estimation image and the label depth image into a first loss calculation formula, and determining pixel estimation loss;
the first loss calculation satisfies the following formula:
$$\mathrm{Loss}_A = \frac{1}{I}\sum_{i=1}^{I}\left|\hat{p}_i - p_i\right|^{\gamma}$$

wherein Loss_A is the pixel estimation loss, \hat{p}_i is the estimated depth value of the i-th depth pixel point in the sample depth estimation map, p_i is the label depth value of the i-th depth pixel point in the label depth map, γ is a loss adaptive parameter, i is an integer, and I is the total number of pixel points in the sample depth estimation map.
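A direct implementation of Loss_A as reconstructed above; the 1/I averaging and the use of γ as an exponent follow that reconstruction and should be read as assumptions.

```python
import torch

def pixel_estimation_loss(pred_depth: torch.Tensor,
                          label_depth: torch.Tensor,
                          gamma: float = 1.0) -> torch.Tensor:
    """Loss_A: per-pixel absolute depth error raised to the adaptive parameter gamma,
    averaged over all I pixels of the sample depth estimation map."""
    return (torch.abs(pred_depth - label_depth) ** gamma).mean()
```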
6. The method of claim 4, further comprising, prior to determining the label depth map for each of the sample images:
acquiring each sample image based on an image living body detection task, and determining an image sample type of the sample image;
and setting a label depth map corresponding to the sample image based on the image sample type.
7. The method of claim 6, wherein the image sample types include an attack image type and a living image type,
the setting the label depth map corresponding to the sample image based on the image sample type comprises the following steps:
acquiring a first sample image corresponding to the attack image type, and setting a depth pixel value of a first label depth image corresponding to the first sample image as a target depth pixel value;
And acquiring a second sample image corresponding to the living body image type, and calling a target image depth service to determine a second label depth map corresponding to the second sample image.
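For concreteness, claim 7's label construction could be sketched as follows; the constant 0.0 used as the "target depth pixel value" for attack samples and the depth_service callable (e.g., an off-the-shelf monocular depth estimator) are assumptions.

```python
import torch

def make_label_depth_map(sample_image: torch.Tensor, sample_type: str,
                         depth_service, target_depth_value: float = 0.0) -> torch.Tensor:
    """Attack-type samples get a constant-valued first label depth map; living-type
    samples get a second label depth map from an external image depth service."""
    h, w = sample_image.shape[-2:]
    if sample_type == "attack":
        return torch.full((1, h, w), target_depth_value)   # flat depth for spoof media
    return depth_service(sample_image)                     # hypothetical depth service call
```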
8. The method of claim 4, wherein inputting the sample depth estimation map corresponding to each sample image to an initial second depth model to output a sample depth fusion map, and performing inter-frame depth fusion training on the initial second depth model based on the sample depth fusion map, comprises:
inputting the sample depth estimation image corresponding to each sample image into an initial second depth model for depth reconstruction processing to obtain depth image estimation characteristics corresponding to the sample depth estimation image and a sample depth reconstruction image corresponding to the sample depth estimation image;
and carrying out interframe depth fusion on the depth map estimation characteristics corresponding to each sample depth estimation map to output a sample depth fusion map, and carrying out model parameter adjustment on the initial second depth model on the basis of the sample depth reconstruction map and the sample depth estimation map.
9. The method of claim 8, wherein the initial second depth model comprises at least a first depth coding network, a second depth decoding network, and a third self-attention network,
Inputting the sample depth estimation image corresponding to each sample image to an initial second depth model for depth reconstruction processing to obtain a depth image estimation feature corresponding to the sample depth estimation image and a sample depth reconstruction image corresponding to the sample depth estimation image, and performing interframe depth fusion based on the depth image estimation feature corresponding to each sample depth estimation image to output a sample depth fusion image, including:
respectively inputting the sample depth estimation images corresponding to the sample images of the same sample object into the first depth coding network to obtain depth image estimation characteristics corresponding to the sample depth estimation images;
inputting the depth map estimation characteristics into the second depth decoding network to obtain a sample depth reconstruction map corresponding to the sample depth estimation map;
inputting depth map estimation features corresponding to the sample depth estimation maps into a third self-attention network to obtain a reference depth map and an inter-frame relation matrix, wherein the reference depth map is one of the sample depth estimation maps;
and carrying out interframe depth fusion based on the interframe relation matrix and the reference depth map to output a sample depth fusion map.
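A structural sketch of such a second depth model is given below. Only the three sub-networks are taken from claim 9; the layer widths, the choice of the first frame as the reference depth map, and the scaled dot-product attention used to form the inter-frame relation matrix are assumptions.

```python
import torch
import torch.nn as nn

class SecondDepthModel(nn.Module):
    """Encoder (first depth coding network), decoder (second depth decoding network),
    and a self-attention head (third self-attention network) that returns a reference
    depth map and an N x N inter-frame relation matrix."""
    def __init__(self, feat=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 1, 3, padding=1),
        )
        self.query = nn.Conv2d(feat, feat, 1)
        self.key = nn.Conv2d(feat, feat, 1)

    def forward(self, depth_estimation_maps):
        # depth_estimation_maps: (L, 1, H, W) -- all sample depth estimation maps of one object
        feats = self.encoder(depth_estimation_maps)            # depth map estimation features
        recon_maps = self.decoder(feats)                       # sample depth reconstruction maps
        num_frames, channels, h, w = feats.shape
        ref_depth = depth_estimation_maps[0]                   # reference depth map (frame 0 assumed)
        q = self.query(feats[0:1]).reshape(channels, h * w)    # reference-pixel features
        k = self.key(feats[1:]).reshape(num_frames - 1, channels, h * w).mean(dim=0)  # fusion-map features
        relation = torch.softmax(q.T @ k / channels ** 0.5, dim=0)  # w[n, j]; each column sums to 1 over n
        return ref_depth, relation, recon_maps
```

The reference depth map and relation matrix returned here feed the weighted fusion of claim 10, sketched after that claim.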
10. The method of claim 9, the interframe depth fusing based on the interframe relationship matrix and the reference depth map to output a sample depth fusion map, comprising:
performing weighted fusion by adopting a second interframe fusion calculation formula based on the interframe relation matrix and the reference depth map to obtain a sample depth fusion map;
the second interframe fusion calculation satisfies the following formula:

$$\mathrm{depth}_{multi,j} = \sum_{n=1}^{N} w_{n,j}\,\mathrm{depth}_n$$

wherein depth_multi is the sample depth fusion map and depth_multi,j is its j-th pixel point, N is the total number of pixel points of the reference depth map, n is an integer, depth_n is the depth pixel value of the n-th pixel point of the reference depth map, the subscript n of w_{n,j} indexes the n-th row of the inter-frame relation matrix and the subscript j indexes its j-th column, w_{n,j} represents the relation coefficient of the j-th pixel point of a fusion depth map relative to the n-th pixel point of the reference depth map, and a fusion depth map is a depth map other than the reference depth map among all the sample depth estimation maps corresponding to the same sample object.
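Applying the reconstructed formula above with the outputs of the SecondDepthModel sketch; the per-pixel weighted sum follows that reconstruction and the exact weighting scheme should be treated as an assumption.

```python
import torch

def interframe_fusion(ref_depth: torch.Tensor, relation: torch.Tensor) -> torch.Tensor:
    """Sample depth fusion map: each pixel j is a relation-weighted sum over the N
    pixels of the reference depth map, i.e. depth_multi[j] = sum_n w[n, j] * depth[n]."""
    h, w = ref_depth.shape[-2:]
    ref_flat = ref_depth.reshape(-1)          # depth_n for n = 1..N
    fused = relation.T @ ref_flat             # (N,) fused pixel values
    return fused.reshape(1, h, w)
```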
11. The method of claim 8, the model parameter adjustment of the initial second depth model based on the sample depth reconstruction map and the sample depth estimation map, comprising:
Inputting each sample depth estimation image and each sample depth reconstruction image aiming at the same sample object into a third loss calculation formula to determine depth reconstruction loss;
model parameter adjustment is carried out on the initial second depth model based on the depth reconstruction loss;
the third loss calculation satisfies the following formula:

$$\mathrm{Loss}_B = \frac{1}{L}\sum_{l=1}^{L}\left\| I_{pred\text{-}l} - I_{GT\text{-}l} \right\|^2$$

wherein Loss_B is the depth reconstruction loss, l is an integer, L is the total number of sample depth estimation maps corresponding to the same sample object, I_pred-l is the l-th sample depth reconstruction map, and I_GT-l is the l-th sample depth estimation map.
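A matching implementation of Loss_B as reconstructed above; averaging over pixels as well as over the L frames is an assumption of convenience.

```python
import torch

def depth_reconstruction_loss(recon_maps: torch.Tensor,
                              estimation_maps: torch.Tensor) -> torch.Tensor:
    """Loss_B: squared error between each of the L sample depth reconstruction maps
    and its corresponding sample depth estimation map, averaged over frames and pixels."""
    return torch.mean((recon_maps - estimation_maps) ** 2)
```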
12. The method of claim 1, wherein after performing the inter-frame depth fusion processing on each of the first depth images based on the second depth model to obtain the second depth image for the target object, the method further comprises:
performing quality enhancement processing on the second depth image based on a third depth model to obtain a third depth image after the quality enhancement processing;
the performing image living body detection processing on the target object based on the second depth image and the target color image includes:
and performing image living body detection processing on the target object based on the second depth image and the target color image with the third depth image taken as the second depth image.
13. The method of claim 12, the method further comprising:
creating an initial third depth model;
acquiring at least one sample depth fusion map output by the trained second depth model corresponding to the initial second depth model;
and performing quality enhancement training on the initial third depth model based on each sample depth fusion map to obtain a trained third depth model.
14. The method of claim 13, wherein the performing quality enhancement training on the initial third depth model based on each of the sample depth fusion maps to obtain a trained third depth model comprises:
acquiring a depth map enhancement label corresponding to the sample depth fusion map;
performing pixel disturbance processing on each sample depth fusion map to obtain the sample depth fusion map after pixel disturbance;
and inputting each sample depth fusion map into the initial third depth model for quality enhancement processing, outputting a sample enhanced depth map corresponding to the sample depth fusion map, and performing model parameter adjustment on the initial third depth model based on the sample enhanced depth map and the depth map enhancement label until training of the initial third depth model is completed, to obtain the trained third depth model.
15. The method of claim 14, wherein the model parameter adjustment of the initial third depth model based on the sample enhanced depth map and the depth map enhancement label comprises:
inputting the sample enhanced depth map and the depth map enhancement label into a fourth loss calculation formula, and determining a quality enhancement loss;
performing model parameter adjustment on the initial third depth model based on the quality enhancement loss;
the fourth loss calculation satisfies the following formula:

$$\mathrm{Loss}_C = \left\| I_{re} - I_{GT} \right\|^2$$

where Loss_C is the quality enhancement loss, I_re is the sample enhanced depth map, and I_GT is the depth map enhancement label.
16. The method according to any one of claims 1-15, the performing image live detection processing on the target object based on the second depth image and the target color image, comprising:
inputting the second depth image and the target color image into a living body detection model, and outputting a first living body probability value corresponding to the second depth image and a second living body probability value corresponding to the target color image;
a living body detection type of the target object is determined based on the first living body probability value and the second living body probability value.
17. The method of claim 16, wherein the determining the living body detection type of the target object based on the first living body probability value and the second living body probability value comprises:
determining a target living probability based on the first living probability value and the second living probability value;
if the target living body probability is greater than a target threshold value, determining that the target object is a living body object type;
and if the target living probability is smaller than or equal to a target threshold value, determining that the target object is an attack object type.
18. An image living body detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring at least two frames of target color images aiming at the target object based on the image living body detection task;
the depth estimation module is used for carrying out depth estimation processing on each target color image based on a first depth model to obtain a first depth image corresponding to each frame of the target color image;
the depth fusion module is used for carrying out interframe depth fusion processing on each first depth image based on a second depth model to obtain a second depth image aiming at the target object;
and the living body detection module is used for carrying out image living body detection processing on the target object based on the second depth image and the target color image.
19. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of claims 1 to 17.
20. A computer program product storing at least one instruction adapted to be loaded by a processor and to perform the method steps of any one of claims 1 to 17.
21. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-17.
CN202211089114.9A 2022-09-07 2022-09-07 Image living body detection method and device, storage medium and electronic equipment Pending CN116129534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211089114.9A CN116129534A (en) 2022-09-07 2022-09-07 Image living body detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211089114.9A CN116129534A (en) 2022-09-07 2022-09-07 Image living body detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116129534A true CN116129534A (en) 2023-05-16

Family

ID=86305039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211089114.9A Pending CN116129534A (en) 2022-09-07 2022-09-07 Image living body detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116129534A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721143A (en) * 2023-08-04 2023-09-08 南京诺源医疗器械有限公司 Depth information processing device and method for 3D medical image
CN116721143B (en) * 2023-08-04 2023-10-20 南京诺源医疗器械有限公司 Depth information processing device and method for 3D medical image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination