CN116524609A - Living body detection method and system - Google Patents

Living body detection method and system

Info

Publication number
CN116524609A
CN116524609A (application CN202310458589.9A)
Authority
CN
China
Prior art keywords
living body
target
body detection
detection model
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310458589.9A
Other languages
Chinese (zh)
Inventor
曹佳炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310458589.9A
Publication of CN116524609A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The specification provides a living body detection method and system, which acquire a single-mode target image of a target user, perform living body detection with multi-mode performance on the target image based on a target living body detection model, and output the obtained target living body detection result. The target living body detection model is a single-mode living body detection model obtained by knowledge distillation from a multi-mode living body detection model, so that the single-mode model inherits the living body detection performance of the multi-mode model. This improves the accuracy of living body detection based on a single mode and saves the cost of the image acquisition module.

Description

Living body detection method and system
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a living body detection method and system.
Background
Face recognition has become a mainstream identity authentication method and is widely applied in scenes such as finance and payment. With the widespread use of face recognition systems, their security is challenged by living body attacks. It is therefore necessary to add a living body detection step to the face recognition system. To improve the security of living body detection algorithms, two schemes are common. One scheme collects multi-modal images by adding hardware devices and performs living body detection based on the multi-modal images; its hardware cost is high. The other scheme is a living body detection method based on multiple user actions: the user must cooperate by completing specified actions while multiple user images are collected over a period of time to increase the amount of information. Both schemes consume more time and computation, and place higher requirements on the terminal equipment.
In summary, it is desirable to provide a new living body detection method and system that can improve living body detection performance without adding hardware cost.
Disclosure of Invention
The present specification provides a living body detection method and system capable of improving living body detection performance without adding hardware cost.
In a first aspect, the present specification provides a living body detection method, comprising: acquiring a target image of a target user, wherein the target image is a single-mode image acquired for the target user in a target mode; performing multi-mode living body detection on the target image based on a target living body detection model to obtain a target living body detection result, wherein the target living body detection model is a single-mode living body detection model obtained by performing knowledge distillation based on the multi-mode living body detection model; and outputting the target living body detection result.
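For concreteness, the following is a minimal sketch of the first-aspect inference pipeline, assuming a PyTorch implementation. The class and function names (TargetLivenessModel, detect_liveness) and the label convention (index 0 = living) are illustrative assumptions, not identifiers from this specification.

```python
import torch
import torch.nn as nn

class TargetLivenessModel(nn.Module):
    """Hypothetical distilled single-mode student: encoder + liveness head."""
    def __init__(self, encoder: nn.Module, classifier: nn.Module):
        super().__init__()
        self.encoder = encoder        # e.g. a lightweight CNN backbone
        self.classifier = classifier  # e.g. a small MLP producing [living, attack] logits

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feature = self.encoder(image)    # single-mode feature
        return self.classifier(feature)  # classification logits

def detect_liveness(model: TargetLivenessModel, image: torch.Tensor) -> str:
    """Run single-mode living body detection on one target image (C, H, W)."""
    model.eval()
    with torch.no_grad():
        logits = model(image.unsqueeze(0))  # add a batch dimension
    return "living" if logits.argmax(dim=1).item() == 0 else "attack"
```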
In some embodiments, the target modality is one of a plurality of modalities corresponding to the multi-modality living body detection model.
In some embodiments, the knowledge distillation comprises: performing predictive recognition distillation on the target living body detection model based on the multi-mode living body detection model, wherein a training target of the predictive recognition distillation comprises that a first predicted living body classification result output by the target living body detection model is consistent with a predicted multi-mode living body classification result output by the multi-mode living body detection model.
In some embodiments, the training objectives of predictive recognition distillation further include at least one of: the first predicted living body classification result output by the target living body detection model is consistent with the corresponding real living body classification result; and the correlation between the first prediction feature of the target mode output by the target living body detection model and the prediction multi-mode fusion feature of the multiple modes output by the multi-mode living body detection model meets the preset requirement.
In some embodiments, the preset requirements include at least one of: the mutual information between the first prediction feature of the target mode output by the target living body detection model and the predicted multi-mode fusion feature output by the multi-mode living body detection model approaches a first preset value; and, in the case that the target living body detection model outputs the first prediction feature of the target mode, the conditional self-information of that first prediction feature given the output of the multi-mode living body detection model approaches a second preset value.
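A minimal sketch of how these prediction-distillation objectives might be combined into a single training loss, assuming PyTorch: consistency with the teacher is realized as a temperature-softened KL divergence, and cosine similarity stands in for the mutual-information requirement, which in practice would need a dedicated estimator. All names and weightings are illustrative.

```python
import torch
import torch.nn.functional as F

def prediction_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 labels: torch.Tensor,
                                 student_feat: torch.Tensor,
                                 teacher_fused_feat: torch.Tensor,
                                 t: float = 2.0) -> torch.Tensor:
    # Student prediction consistent with the teacher's multi-mode prediction:
    # KL divergence between temperature-softened class distributions.
    kd = F.kl_div(F.log_softmax(student_logits / t, dim=1),
                  F.softmax(teacher_logits / t, dim=1),
                  reduction="batchmean") * (t * t)
    # Student prediction consistent with the real living body classification label.
    ce = F.cross_entropy(student_logits, labels)
    # Correlation between the student's target-mode feature and the teacher's
    # fused multi-mode feature; cosine similarity is this sketch's stand-in
    # for the mutual-information constraint described above.
    corr = 1.0 - F.cosine_similarity(student_feat, teacher_fused_feat, dim=1).mean()
    return kd + ce + corr
```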
In some embodiments, the knowledge distillation further comprises: performing advanced knowledge distillation on the target living body detection model based on the multi-mode living body detection model, a training target of the advanced knowledge distillation comprising a predicted distribution consistency, the predicted distribution consistency comprising: the distribution of the predicted objects under the living body category output by the target living body detection model is consistent with the distribution of the predicted objects under the living body category output by the multi-mode living body detection model; and the distribution of the predicted objects under the attack category output by the target living body detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living body detection model.
In some embodiments, the distribution of the predicted objects under the living body category output by the target living body detection model being consistent with the distribution of the predicted objects under the living body category output by the multi-mode living body detection model includes at least one of: the class center of the predicted objects under the living body category output by the target living body detection model is consistent with the class center of the predicted objects under the living body category output by the multi-mode living body detection model; and the distance between each predicted object under the living body category output by the target living body detection model and the class center of its living body category is consistent with the distance between each predicted object under the living body category output by the multi-mode living body detection model and the class center of its living body category.
In some embodiments, the distribution of the predicted objects under the attack category output by the target living body detection model being consistent with the distribution of the predicted objects under the attack category output by the multi-mode living body detection model includes at least one of: the class center of the predicted objects under the attack category output by the target living body detection model is consistent with the class center of the predicted objects under the attack category output by the multi-mode living body detection model; and the distance between each predicted object under the attack category output by the target living body detection model and the class center of its attack category is consistent with the distance between each predicted object under the attack category output by the multi-mode living body detection model and the class center of its attack category.
In some embodiments, the prediction object comprises a prediction feature and/or a predicted living body classification result output by the model.
In some embodiments, the training objectives of the advanced knowledge distillation further comprise: the second predicted living body classification result output by the target living body detection model is consistent with the corresponding real living body classification result.
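The distribution-consistency objective of the advanced knowledge distillation could be sketched as follows, taking the predicted objects to be feature vectors, using squared L2 as the (unspecified) distance metric, and labeling the living category 0 and the attack category 1; all of these choices are assumptions of the sketch.

```python
import torch

def distribution_consistency_loss(student_feats: torch.Tensor,
                                  teacher_feats: torch.Tensor,
                                  labels: torch.Tensor) -> torch.Tensor:
    """Align class centers and center distances between student and teacher.

    student_feats, teacher_feats: [N, D] predicted features for the same batch;
    labels: [N], with 0 = living category, 1 = attack category.
    """
    loss = torch.zeros((), device=student_feats.device)
    for cls in (0, 1):
        mask = labels == cls
        if not mask.any():
            continue
        s, t = student_feats[mask], teacher_feats[mask]
        s_center, t_center = s.mean(dim=0), t.mean(dim=0)
        # The class center of the student's predicted objects should be
        # consistent with the class center of the teacher's predicted objects.
        loss = loss + (s_center - t_center).pow(2).sum()
        # The distances of predicted objects to their own class center should
        # also be consistent between student and teacher.
        s_dist = (s - s_center).pow(2).sum(dim=1)
        t_dist = (t - t_center).pow(2).sum(dim=1)
        loss = loss + (s_dist - t_dist).pow(2).mean()
    return loss
```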
In some embodiments, the training objectives of the multi-mode living body detection model include: a plurality of predicted living body classification results of a plurality of modes output by the multi-mode living body detection model are consistent with the corresponding real living body classification results.
In some embodiments, the training targets of the multi-mode living body detection model further comprise at least one of: the plurality of predicted living body classification results of the multiple modes output by the multi-mode living body detection model are consistent with the fused predicted multi-mode living body classification result; and the multiple prediction features of the multiple modes output by the multi-mode living body detection model are consistent with one another.
In a second aspect, the present specification also provides a living body detection system including: at least one storage medium storing at least one set of instructions for living body detection; and at least one processor communicatively coupled to the at least one storage medium, wherein, when the living body detection system is running, the at least one processor reads the at least one instruction set and performs the method of any of the first aspects as directed by the at least one instruction set.
According to the above technical scheme, the living body detection method and system provided by this specification collect a single-mode target image of a target user, perform living body detection with multi-mode performance on the target image through the target living body detection model, and output the obtained target living body detection result. In this scheme, the multi-mode living body detection model is used to perform knowledge distillation on a single-mode model to obtain the target living body detection model. The target living body detection model can represent the mapping relationship between a single-mode image and the multi-mode living body detection result obtained from multi-mode images, so that the single-mode living body detection model has the living body detection performance of the multi-mode living body detection model and achieves an effect close to the multi-mode detection result when performing living body detection based on a single-mode image. In addition, because living body detection is performed based on a single-mode image, computing power is saved, detection efficiency is improved, and the hardware cost of the image acquisition module is reduced. Meanwhile, the target living body detection model is a lightweight model for detection based on single-mode images and places low computing-power requirements on the computing device; it can be deployed on the terminal device or on a remote server. When deployed on the terminal device, image transmission can be avoided, which further saves computation time and avoids the risk of privacy leakage during data transmission.
Additional functionality of the living body detection method and system provided in this specification will be set forth in part in the description that follows, and the remainder will become apparent to those of ordinary skill in the art upon examination of the description and examples presented below. The inventive aspects of the living body detection methods and systems provided herein may be fully explained by practicing or using the methods, devices, and combinations described in the detailed examples below.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present description, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a schematic diagram of an application scenario of a living body detection system provided according to an embodiment of the present specification;
FIG. 2 shows a hardware architecture diagram of a computing device provided according to an embodiment of the present specification;
FIG. 3 shows a flowchart of a living body detection method provided according to an embodiment of the present specification;
FIG. 4 shows a network structure schematic diagram of a preset multi-modal living body detection model provided according to an embodiment of the present specification; and
FIG. 5 shows a network structure schematic diagram of a preset target living body detection model provided according to an embodiment of the present specification.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosure. Thus, the present description is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. For example, as used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. The terms "comprises," "comprising," "includes," and/or "including," when used in this specification, are taken to specify the presence of stated integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
These and other features of the present specification, as well as the operation and function of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description, with reference to the accompanying drawings, all of which form a part of this specification. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the specification. It should also be understood that the drawings are not drawn to scale.
The flowcharts used in this specification illustrate operations implemented by systems according to some embodiments in this specification. It should be clearly understood that the operations of the flow diagrams may be implemented out of order. Rather, operations may be performed in reverse order or concurrently. Further, one or more other operations may be added to the flowchart. One or more operations may be removed from the flowchart.
For convenience of description, terms appearing in the specification are explained first as follows:
living body attack: for example, by presenting with a screen, paper, mask, etc., an attack that attempts to bypass the face recognition system.
Living body detection: also referred to as living body attack prevention, refers to a technology for detecting and intercepting living body attacks by using an artificial intelligent model, such as a mobile phone screen, printing paper, and the like.
Distillation learning: refers to a method for guiding a lightweight network to learn by using a powerful teacher network. The model learned using distillation can result in better performance of the model without using the method.
Cross-modal distillation: the method refers to a distillation learning method with inconsistent modes input by a teacher network and a student network.
Before describing the specific embodiments of the present specification, the application scenario of the present specification will be described as follows:
the living body detection method provided by this specification can be applied to living body detection in any biometric recognition process. For example, in scenes such as face payment or face recognition, living body detection can be performed by the method of this specification on the collected original image of the biometric feature of the user to be paid or recognized; in an identity authentication scene, living body detection can likewise be performed on the collected original image of the user's biometric feature. The method can also be applied to any other living body detection scene, which is not repeated here. The biometric features may include, but are not limited to, one or more of facial images, irises, sclera, fingerprints, palmprints, voiceprints, and bone projections. For convenience of description, this application is described by taking as an example the application of the living body detection method to living body detection of a face in a face recognition scenario.
Those skilled in the art will appreciate that the living body detection method and system described herein are applicable to other usage scenarios and are within the scope of the present disclosure.
Fig. 1 shows a schematic diagram of an application scenario of a living body detection system 001 provided according to an embodiment of the present specification. The living body detection system 001 (hereinafter referred to as system 001) may be applied to living body detection in any scene, such as living body detection in a face payment scene, living body detection in an identity authentication scene, or living body detection in other face recognition scenes. As shown in fig. 1, the system 001 may include a terminal device 200 and a server 300. The application scenario of system 001 may include the target user 100, the system 001, and a network 400.
The target user 100 may be a user who needs biometric identification or a user who is performing biometric identification. The target user 100 may be the object detected by the system 001.
The terminal device 200 may be a device that performs living body detection on the target user 100. In some embodiments, the living body detection method may be performed on the terminal device 200. At this time, the terminal device 200 may store data or instructions to perform the living body detection method described in the present specification, and may execute or be used to execute the data or instructions. In some embodiments, the terminal device 200 may include a hardware device having a data information processing function and the programs necessary to drive the hardware device to operate. In some embodiments, the terminal device 200 may include a mobile device, a tablet, a laptop, a built-in device of a motor vehicle, or the like, or any combination thereof. In some embodiments, the mobile device may include a smart home device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart television, a desktop computer, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smart phone, a personal digital assistant, a gaming device, a navigation device, or the like, or any combination thereof. In some embodiments, the virtual reality device or augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device or the augmented reality device may include Google Glass, a head mounted display, VR, or the like. In some embodiments, the built-in devices in the motor vehicle may include an on-board computer, an on-board television, and the like. In some embodiments, the terminal device 200 may comprise an image acquisition device for acquiring a single-mode target image of the target user 100. In some embodiments, the image acquisition device may be a two-dimensional image acquisition device alone (such as an RGB camera), or a two-dimensional image acquisition device (such as an RGB camera) together with a depth image acquisition device (such as a 3D structured light camera or a laser detector). In some embodiments, the terminal device 200 may be a device with positioning technology for positioning the location of the terminal device 200.
In some embodiments, the terminal device 200 may be installed with one or more applications (APPs). An APP can provide the target user 100 with an interface and the ability to interact with the outside world via the network 400. APPs include, but are not limited to: web browser APPs, search APPs, chat APPs, shopping APPs, video APPs, financial management APPs, instant messaging tools, mailbox clients, social platform software, and the like. In some embodiments, a target APP may be installed on the terminal device 200. The target APP enables the terminal device 200 to acquire a target image of the target user's biometric feature in the target mode. In some embodiments, the target user 100 may also trigger a living body detection request through the target APP. The target APP may perform the living body detection method described in the present specification in response to the living body detection request. The living body detection method will be described in detail later.
As shown in fig. 1, the terminal device 200 may be communicatively connected to a server 300. In some embodiments, the server 300 may be communicatively coupled to a plurality of terminal devices 200 and receive data transmitted by the terminal devices 200. In some embodiments, terminal device 200 may interact with server 300 through network 400 to receive or send messages, etc. The server 300 may be a server providing various services, such as a background server providing support for a living body detection method deployed on a plurality of terminal apparatuses 200. In some embodiments, the in-vivo detection method may be performed on the server 300. At this time, the server 300 may store data or instructions to perform the living body detection method described in the present specification, and may execute or be used to execute the data or instructions. In some embodiments, the server 300 may include a hardware device having a data information processing function and a program necessary to drive the hardware device to operate.
The network 400 is a medium used to provide a communication connection between the terminal device 200 and the server 300. The network 400 may facilitate the exchange of information or data. As shown in fig. 1, the terminal device 200 and the server 300 may be connected to the network 400 and mutually transmit information or data through the network 400. In some embodiments, the network 400 may be any type of wired or wireless network, or a combination thereof. For example, the network 400 may comprise a cable network, a wired network, an optical fiber network, a telecommunication network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, or the like. In some embodiments, the network 400 may include one or more network access points. For example, the network 400 may include a wired or wireless network access point, such as a base station or an internet switching point, through which one or more components of the terminal device 200 and the server 300 may connect to the network 400 to exchange data or information.
It should be understood that the number of terminal devices 200, servers 300, and networks 400 in fig. 1 are merely illustrative. There may be any number of terminal devices 200, servers 300, and networks 400, as desired for implementation.
The living body detection method may be performed entirely on the terminal device 200, entirely on the server 300, or partially on the terminal device 200 and partially on the server 300. Notably, the living body detection method uses the multi-mode living body detection model to perform knowledge distillation on the single-mode target living body detection model, so that the single-mode model has the living body detection performance of the multi-mode model and achieves an effect close to the multi-mode detection result when performing living body detection based on a single-mode image. Therefore, the living body detection method can save computing power and improve detection efficiency. The single-mode living body detection model adopted is a lightweight model with low computing-power requirements on the computing device; it can be deployed on the terminal device 200, which further reduces image transmission, saves computation time, and avoids the risk of privacy leakage in the data transmission process. The following description takes execution of the living body detection method on the terminal device 200 as an example.
Fig. 2 illustrates a hardware architecture diagram of a computing device 600 provided in accordance with an embodiment of the present description. The computing device 600 may perform the living body detection method described herein, which is described in other parts of this specification. When the living body detection method is performed on the terminal device 200, the computing device 600 may be the terminal device 200. When the living body detection method is performed on the server 300, the computing device 600 may be the server 300. When the living body detection method is performed partially on the terminal device 200 and partially on the server 300, the computing device 600 may be either the terminal device 200 or the server 300.
As shown in fig. 2, computing device 600 may include at least one storage medium 630 and at least one processor 620. In some embodiments, computing device 600 may also include a communication port 650 and an internal communication bus 610. Meanwhile, computing device 600 may also include I/O component 660.
Internal communication bus 610 may connect the various system components including storage medium 630, processor 620, and communication ports 650.
I/O component 660 supports input/output between computing device 600 and other components.
The communication port 650 is used for data communication between the computing device 600 and the outside world, for example, the communication port 650 may be used for data communication between the computing device 600 and the network 400. The communication port 650 may be a wired communication port or a wireless communication port.
The storage medium 630 may include a data storage device. The data storage device may be a non-transitory storage medium or a transitory storage medium. For example, the data storage device may include one or more of a magnetic disk 632, a read-only memory (ROM) 634, or a random access memory (RAM) 636. The storage medium 630 further includes at least one set of instructions stored in the data storage device. The instructions are computer program code that may include programs, routines, objects, components, data structures, procedures, modules, etc. that perform the living body detection methods provided herein.
The at least one processor 620 may be communicatively coupled with the at least one storage medium 630 and the communication port 650 via the internal communication bus 610. The at least one processor 620 is configured to execute the at least one instruction set. When the computing device 600 is running, the at least one processor 620 reads the at least one instruction set and performs the living body detection method provided herein according to the instructions of the at least one instruction set. The processor 620 may perform all the steps involved in the living body detection method. The processor 620 may be in the form of one or more processors, and in some embodiments, the processor 620 may include one or more hardware processors, such as microcontrollers, microprocessors, reduced instruction set computers (RISC), application-specific integrated circuits (ASICs), application-specific instruction-set processors (ASIPs), central processing units (CPUs), graphics processing units (GPUs), physics processing units (PPUs), microcontroller units, digital signal processors (DSPs), field programmable gate arrays (FPGAs), advanced RISC machines (ARM), programmable logic devices (PLDs), any circuit or processor capable of executing one or more functions, or the like, or any combination thereof. For illustrative purposes only, only one processor 620 is depicted in the computing device 600 in this specification. It should be noted, however, that the computing device 600 may also include multiple processors; thus, operations and/or method steps disclosed in this specification may be performed by one processor as described herein, or jointly by multiple processors. For example, if the processor 620 of the computing device 600 performs steps A and B in this specification, it should be understood that steps A and B may also be performed by two different processors 620 jointly or separately (e.g., a first processor performs step A and a second processor performs step B, or the first and second processors perform steps A and B together).
Fig. 3 shows a flowchart of a living body detection method P100 provided according to an embodiment of the present specification. As described above, the computing device 600 may perform the living body detection method P100 of the present specification. Specifically, the computing device 600 may read an instruction set stored in its local storage medium and then execute the living body detection method P100 as directed by the instruction set. As shown in fig. 3, the method P100 may include:
s120: and acquiring a target image of the target user.
The target user is a user to be subjected to or being subjected to living detection.
The target image is a single-mode image acquired for the target user in the target mode. The target image includes a biometric feature of the target user. Biometric features are physiological characteristics inherent to the human body, and may include at least one of a face, iris, sclera, fingerprint, palmprint, voiceprint, bone projection, and other inherent physiological characteristics usable for identity recognition. For convenience of description, a human face is taken as the example biometric feature in this specification. It should be understood by those skilled in the art that other biometric features are also within the scope of the present description.
The target image may be acquired by the terminal device 200. The terminal device 200 may be a device that performs liveness authentication or face verification. In some embodiments, the target user may complete authentication at the terminal device 200 by performing in-vivo authentication or face authentication at the terminal device 200.
The terminal device 200 has an image acquisition module corresponding to the target mode. The target mode is a single modality, that is, one imaging mode. The target mode may be any one of a plurality of modes, where the plurality of modes refers to a plurality of different imaging modes. The different imaging modes may include imaging in different visual fields, imaging in different dimensions, thermal imaging, and the like. Correspondingly, different modes correspond to different image acquisition modules. The visual field refers to the spectral range in which an image is captured, such as the ultraviolet, visible, near infrared, mid-infrared, and far infrared ranges. The image acquisition modules for different visual fields are, for example, an ultraviolet camera, a visible light camera, a near infrared camera, a mid-infrared camera, a far infrared camera, and the like, and their working principles differ. For example, a visible light camera mainly includes a lens, an image sensor, and an image processor. The lens projects the photographed object onto the image sensor; the image processor calculates appropriate parameters through photometry and ranging and instructs the lens to focus; and when a shooting instruction is detected (for example, the face of the target user 100 is detected to be completely placed in the face viewfinder frame), the image sensor completes one exposure and the image processor converts the result into an image. A near infrared camera mainly comprises an infrared emitting device and an infrared receiving device: the infrared emitting device emits infrared rays to irradiate the photographed object, and the infrared receiving device receives the infrared rays reflected by the object, thereby forming a near infrared image. The modal images captured by image acquisition modules in different visual fields differ accordingly, for example ultraviolet images, visible light images, near infrared images, mid-infrared images, and far infrared images.
The image acquisition modules in different dimensions can acquire images in different dimensions, such as a 2D camera, a 3D camera (or a depth camera). The 2D camera may acquire a planar image of the target user 100. The 3D camera may acquire a depth image of the target user 100, where the depth image includes depth information of the target user 100, such as a distance between the target user 100 and the 3D camera. 3D cameras such as structured light cameras, TOF cameras, binocular stereo cameras, laser detectors, etc. The thermal imaging camera may also be referred to as a thermal imager, and is configured to passively receive infrared radiation energy (heat) emitted by a measured object, and convert the heat energy into a visual image with temperature data, where the visual image is a thermal imaging image, and the thermal imaging image shows a temperature distribution of the surface of the measured object.
The terminal device 200 performs image acquisition on the target user through the image acquisition module, so as to obtain a target image corresponding to the target mode. For example, the image acquisition module may include one of an RGB camera module, an infrared camera module, an NIR camera module, a 3D camera module, and a thermal imaging camera module. Correspondingly, the target modality may be any one of a plurality of modalities such as RGB modality, infrared modality, NIR modality, 3D modality, and thermal imaging modality. The terminal device 200 performs image acquisition on a target user through one of the RGB camera module, the infrared camera module, the NIR camera module, the 3D camera module, and the thermal imaging camera module, and may obtain a target image.
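As an illustration of the one-modality-one-module correspondence described above, the following sketch dispatches image acquisition by target modality; the capture functions are hypothetical placeholders for the respective camera-module drivers, not real device APIs.

```python
from typing import Callable, Dict
import torch

def capture_rgb() -> torch.Tensor:     # placeholder for the RGB camera module
    return torch.rand(3, 224, 224)

def capture_nir() -> torch.Tensor:     # placeholder for the NIR camera module
    return torch.rand(1, 224, 224)

def capture_depth() -> torch.Tensor:   # placeholder for the 3D/depth camera module
    return torch.rand(1, 224, 224)

# Each modality corresponds to a different image acquisition module.
CAPTURE_BY_MODALITY: Dict[str, Callable[[], torch.Tensor]] = {
    "rgb": capture_rgb,
    "nir": capture_nir,
    "depth": capture_depth,
}

def acquire_target_image(target_modality: str) -> torch.Tensor:
    """Acquire a single-mode target image using the matching module."""
    return CAPTURE_BY_MODALITY[target_modality]()
```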
S140: perform multi-mode living body detection on the target image based on the target living body detection model to obtain a target living body detection result.
The target living body detection model is a single-mode living body detection model obtained by knowledge distillation based on a multi-mode living body detection model. That is, the target living body detection model is a model for living body detection based on a single-mode image, and the result or accuracy of living body detection can reach the result or accuracy of a model for living body detection based on a multi-mode image. The multi-modality living body detection model is a model for living body detection based on images of a plurality of modalities. For example, the terminal device 200 acquires images in a plurality of modes, and inputs the images to a multi-mode living body detection model to perform multi-mode living body detection, thereby obtaining a multi-mode living body detection result. The multi-modal living body detection result can improve the accuracy of living body detection.
The knowledge distillation can transfer the knowledge learned by the multi-mode living body detection model to the target living body detection model, so that the target living body detection model can represent the mapping relation between the single-mode image and the multi-mode living body detection result, living body detection is carried out by utilizing the single-mode target image, and the effect identical to or close to the multi-mode living body detection result is achieved.
The computing device 600 needs to acquire the multi-modal living body detection model before performing knowledge distillation on the target living body detection model based on it. The training process of the multi-modal living body detection model is described below.
The multi-mode living body detection model may be obtained by training as follows: for example, the computing device 600 obtains multi-mode training images and their corresponding labels, and trains a preset multi-mode living body detection model based on the multi-mode training images, their labels, and the training targets, thereby obtaining the multi-mode living body detection model.
The labels represent the real living body classification results corresponding to the multi-mode training images. A real living body classification result indicates whether the multi-mode training image belongs to the living body category or the attack category. The training targets of the multi-mode living body detection model may include a first target. The first target may be that the plurality of predicted living body classification results corresponding to the plurality of modes output by the multi-mode living body detection model are consistent with their corresponding real living body classification results, where consistency means that the difference between the two is within a first preset range. In the training process, the first target constrains the plurality of predicted living body classification results output by the multi-mode living body detection model to approach the corresponding real living body classification results, making the model more accurate.
In some embodiments, the training targets of the multi-mode living body detection model may further include a second target. The second target may be that the plurality of predicted living body classification results corresponding to the plurality of modes output by the model are consistent with the fused predicted multi-mode living body classification result, where consistency means that the difference between the two is within a second preset range. In the training process, the second target constrains the plurality of predicted living body classification results to approach the fused predicted multi-mode living body classification result, making the model more accurate.
In some embodiments, the training targets of the multi-mode living body detection model may further include a third target. The third target may include consistency among the plurality of prediction features corresponding to the plurality of modes output by the model, where consistency means that the differences among the plurality of prediction features are within a third preset range.
The training process of the multi-mode living body detection model is described below with reference to the accompanying drawings. Fig. 4 shows a schematic structural diagram of a preset multi-modal living body detection model provided according to an embodiment of the present specification. As shown in fig. 4, the preset multi-modal living body detection model may include a preset multi-modal feature encoding network, a preset multi-modal living body classification network, and a preset multi-modal relationship constraint network.
The preset multi-modal feature encoding network can be a convolutional neural network (CNN) such as ResNet (residual network) or DenseNet, configured to perform feature extraction on the multi-modal training images. For example, the computing device 600 inputs the training images of the plurality of modes to the preset multi-modal feature encoding network, so that the network performs feature extraction for each mode and extraction of the fusion feature of the plurality of modes, thereby obtaining the prediction features of the plurality of modes and the predicted multi-modal fusion feature. For example, the training images of the plurality of modes include a training image of the RGB (red-green-blue) mode, a training image of the NIR (near infrared) mode, and a training image of the Depth mode. The computing device 600 may derive the prediction feature of the RGB mode, the prediction feature of the NIR mode, and the prediction feature of the Depth mode from the corresponding training images, as well as a predicted multi-modal fusion feature based on the RGB, NIR, and Depth modes together. When extracting the predicted multi-modal fusion feature with the preset multi-modal feature encoding network, the computing device 600 may obtain it by feature fusion of the extracted prediction features of the multiple modes, or by fusing the training images of the multiple modes and extracting the fusion feature from the fused image.
The preset multi-modal living body classification network may be a multi-layer perceptron (MLP), also referred to as an artificial neural network (ANN), configured to perform living body detection based on the prediction features extracted by the preset multi-modal feature encoding network. For example, the computing device 600 inputs the prediction features of the plurality of modes and the predicted multi-modal fusion feature to the preset multi-modal living body classification network, so that the network performs living body detection based on the prediction features of the plurality of modes to obtain a plurality of predicted living body classification results corresponding to the plurality of modes, and performs living body detection based on the predicted multi-modal fusion feature to obtain the predicted multi-modal living body classification result. For example, the computing device 600 performs living body detection based on the prediction feature of the RGB mode, the prediction feature of the NIR mode, the prediction feature of the Depth mode, and the RGB+NIR+Depth predicted multi-modal fusion feature, respectively, thereby obtaining the predicted living body detection results of the RGB, NIR, and Depth modes and the multi-modal predicted living body detection result of RGB+NIR+Depth.
The preset multimodal relationship constraint network may also be a multi-layer perceptron (MLP), configured to constrain the prediction features of the plurality of modes extracted by the preset multi-modal feature encoding network, or to constrain the prediction features of the plurality of modes together with the predicted multi-modal fusion feature. For example, the computing device 600 inputs the prediction features of the plurality of modes to the preset multimodal relationship constraint network, so that the network outputs the similarities between the prediction features of the plurality of modes. Alternatively, the computing device 600 inputs the prediction features of the plurality of modes and the predicted multi-modal fusion feature to the preset multimodal relationship constraint network, so that the network outputs the similarities between the prediction features of the plurality of modes and the predicted multi-modal fusion feature.
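To make the data flow of fig. 4 concrete, here is a minimal PyTorch-style sketch of the preset multi-modal living body detection model. The flatten-plus-linear backbones stand in for the ResNet/DenseNet encoders, and mean fusion is one assumed realization of the feature-fusion step (the specification equally allows concatenation or fusing the raw images before encoding).

```python
import torch
import torch.nn as nn

def make_backbone(out_dim: int) -> nn.Module:
    # Stand-in for a ResNet/DenseNet feature encoder.
    return nn.Sequential(nn.Flatten(), nn.LazyLinear(out_dim), nn.ReLU())

def make_mlp(in_dim: int) -> nn.Module:
    # Stand-in for an MLP living body classification head (2 classes).
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 2))

class MultiModalTeacher(nn.Module):
    """Per-modality encoders + fusion + per-modality and fused liveness heads."""
    MODALITIES = ("rgb", "nir", "depth")

    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict({m: make_backbone(dim) for m in self.MODALITIES})
        self.heads = nn.ModuleDict({m: make_mlp(dim) for m in self.MODALITIES})
        self.fused_head = make_mlp(dim)  # head on the fused multi-modal feature

    def forward(self, images: dict):
        feats = {m: self.encoders[m](images[m]) for m in self.MODALITIES}
        # Mean fusion keeps the fused feature in the same space as each modality.
        fused = torch.stack([feats[m] for m in self.MODALITIES]).mean(dim=0)
        logits = {m: self.heads[m](feats[m]) for m in self.MODALITIES}
        logits["fused"] = self.fused_head(fused)
        return feats, fused, logits
```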
After obtaining the outputs of the preset multi-modal feature encoding network, the preset multi-modal living body classification network, and the preset multi-modal relation constraint network, the computing device 600 may determine a first comprehensive loss based on those outputs, and converge the three networks based on the first comprehensive loss to obtain the multi-modal living body detection model. The first comprehensive loss may be expressed as the following equation (1):
Loss_total1 = Loss_cls1 + a × Loss_pred + b × Loss_feat    (1)
In equation (1), Loss_total1 represents the first comprehensive loss; Loss_cls1 represents the multi-modal living body classification loss, characterizing the first target of multi-modal living body detection model training; Loss_pred represents the multi-modal prediction consistency loss, characterizing the second target; and Loss_feat represents the multi-modal feature consistency loss, characterizing the third target. a and b are weights, each taking the value 0 or 1.
The multi-modal living body classification loss Loss_cls1 may be determined based on a weighted sum of the differences between the plurality of predicted living body classification results corresponding to the plurality of modes output by the preset multi-modal living body classification network and the corresponding real living body classification results, and the difference between the predicted multi-modal living body classification result and the corresponding real living body classification result. For example, the computing device 600 determines the sub-living-body classification loss corresponding to the RGB mode based on the difference between the predicted living body classification result of the RGB mode and its corresponding real living body classification result. Similarly, the computing device 600 may derive the sub-losses corresponding to the NIR mode and the Depth mode. In addition, the computing device 600 may determine the sub-loss corresponding to the multi-modal result based on the difference between the predicted multi-modal living body classification result and the corresponding real result, and obtain Loss_cls1 as a weighted sum of the sub-losses corresponding to the RGB mode, the NIR mode, the Depth mode, and the multi-modal result. Loss_cls1 is intended to constrain this weighted sum of differences to be within the first preset range. The computing device 600 performs back propagation based on the Loss_cls1 obtained in the current training iteration to update the parameters of the preset multi-modal feature encoding network and the preset multi-modal living body classification network until training ends.
The multi-modal prediction consistency loss Loss_pred may be determined based on the differences between the plurality of predicted living body classification results of the plurality of modes and the fused predicted multi-modal living body classification result. For example, the computing device 600 determines the sub-prediction consistency loss for the RGB mode based on the difference between the predicted living body classification result of the RGB mode and the predicted multi-modal living body classification result. Similarly, the computing device 600 may determine the sub-prediction consistency losses for the NIR mode and the Depth mode. Thereafter, the computing device 600 may derive the multi-modal prediction consistency loss as a weighted sum of the sub-prediction consistency losses of the RGB, NIR, and Depth modes. Because the fused multi-modal living body classification result is more accurate, constraining each per-mode predicted result to be within the second preset range of the multi-modal result based on Loss_pred can improve the accuracy of the per-mode predictions. The computing device 600 performs back propagation based on the Loss_pred obtained in the current training iteration to update the parameters of the preset multi-modal feature encoding network and the preset multi-modal living body classification network until training ends.
The multi-modal feature consistency loss Loss_feat may be determined based on the similarities between the prediction features of the plurality of modes, or based on the differences between the prediction features of the plurality of modes and the predicted multi-modal fusion feature. Taking the former as an example, the computing device 600 may determine the similarity corresponding to the RGB-NIR mode pair based on the similarity between the prediction feature of the RGB mode and the prediction feature of the NIR mode. Similarly, the computing device 600 may determine the similarities corresponding to the RGB-Depth and NIR-Depth mode pairs. The computing device 600 may then obtain the overall similarity between the prediction features of the plurality of modes as a weighted sum of the similarities of the RGB-NIR, RGB-Depth, and NIR-Depth pairs. Loss_feat aims to constrain the differences between the prediction features of the multiple modes to be within the third preset range. In the training process, the computing device 600 performs back propagation based on the Loss_feat obtained in the current iteration to update the parameters of the preset multi-modal feature encoding network and the preset multi-modal relation constraint network until training ends.
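Putting the three components of equation (1) together, a sketch that consumes the MultiModalTeacher outputs from the sketch above; cross-entropy, KL divergence, and cosine similarity are assumed realizations of the classification, prediction-consistency, and feature-consistency losses, and the weighted sums use uniform weights for brevity.

```python
import torch
import torch.nn.functional as F

def teacher_total_loss(logits: dict, feats: dict, labels: torch.Tensor,
                       a: float = 1.0, b: float = 1.0) -> torch.Tensor:
    """Loss_total1 = Loss_cls1 + a * Loss_pred + b * Loss_feat, per equation (1)."""
    mods = ("rgb", "nir", "depth")
    # Loss_cls1: per-mode and fused predictions vs. the real classification.
    loss_cls = sum(F.cross_entropy(logits[m], labels) for m in mods)
    loss_cls = loss_cls + F.cross_entropy(logits["fused"], labels)
    # Loss_pred: each per-mode prediction tracks the fused prediction.
    fused_soft = F.softmax(logits["fused"], dim=1).detach()
    loss_pred = sum(F.kl_div(F.log_softmax(logits[m], dim=1), fused_soft,
                             reduction="batchmean") for m in mods)
    # Loss_feat: pairwise consistency between per-mode prediction features.
    pairs = (("rgb", "nir"), ("rgb", "depth"), ("nir", "depth"))
    loss_feat = sum(1.0 - F.cosine_similarity(feats[m1], feats[m2], dim=1).mean()
                    for m1, m2 in pairs)
    return loss_cls + a * loss_pred + b * loss_feat
```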
In the training process, the condition that the training is finished may be that the first comprehensive loss is smaller than the first loss value, or the training frequency reaches the preset training frequency, or the training precision of the multi-mode living body detection model reaches the preset precision, which is not limited in this specification.
After training the multi-modal living body detection model, the computing device 600 may use it as a teacher model and the single-modality preset target living body detection model as a student model, guiding the training of the latter. That is, the computing device 600 may use the multi-modal living body detection model to perform knowledge distillation on the preset target living body detection model, resulting in a trained target living body detection model. The trained target living body detection model can take a single-modality target image as input while achieving living body detection performance close to that of the multi-modal living body detection model.
With the multi-modal living body detection model as the teacher model, the computing device 600 may implement knowledge distillation on the preset target living body detection model in various ways, such as one-shot training or staged multi-round training.
In some embodiments, the computing device 600 may perform progressive knowledge distillation (staged, multi-round training) on the preset target living body detection model based on the multi-modal living body detection model. Progressive knowledge distillation may include predictive knowledge distillation and advanced knowledge distillation, which are described in turn below in connection with the network structure of the preset target living body detection model.
Fig. 5 shows a network structure schematic diagram of a preset target living body detection model provided according to an embodiment of the present specification. As shown in fig. 5, the preset target living body detection model may include a preset single-mode feature encoding network and a preset single-mode living body classification network.
The preset single-mode feature encoding network may be a convolutional neural network such as ResNet or DenseNet, configured to perform feature extraction on training images of the target modality. The target modality is one of the plurality of modalities employed by the multi-modal living body detection model, and the training image of the target modality is the corresponding one among the multi-modal training images, such as the training image of the RGB, NIR or Depth modality. During predictive knowledge distillation, the computing device 600 may input the training image of the target modality to the preset single-mode feature encoding network, which performs feature extraction on it to obtain the first predicted feature of the target modality.
The preset single-mode living body classification network may be an MLP configured to perform living body detection based on the first predicted feature of the target modality extracted by the preset single-mode feature encoding network. During predictive knowledge distillation, the computing device 600 inputs the first predicted feature of the target modality to the preset single-mode living body classification network, which performs living body detection on it to obtain the first predicted living body classification result.
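As an illustration, a minimal sketch of such a student model follows; the ResNet-18 backbone, the MLP dimensions, and the class names are assumptions for illustration, not a definitive implementation of the disclosure:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TargetLivenessModel(nn.Module):
    """Single-modality student: feature encoding network + MLP living body classifier."""

    def __init__(self, feat_dim=512, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights=None)  # ResNet-style encoder (assumed depth)
        backbone.fc = nn.Identity()               # expose the 512-d feature vector
        self.encoder = backbone
        self.classifier = nn.Sequential(          # preset single-mode MLP classifier
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, x):
        feat = self.encoder(x)          # first predicted feature of the target modality
        logits = self.classifier(feat)  # first predicted living body classification
        return feat, logits
```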
It should be noted that the preset single-mode feature encoding network and the preset multi-modal feature encoding network are both convolutional neural networks such as ResNet or DenseNet, and the preset single-mode living body classification network and the preset multi-modal living body classification network are both MLPs. The teacher model's network structure is more complex, giving it good living body detection performance and generalization capability; the student model's structure is simpler, with fewer parameters and less consumption of computing resources. By guiding the training of the student model with the teacher model, the computing device 600 enables the lightweight student model, despite its simpler network structure, to approach the living body detection performance of the teacher model. When the trained student model is deployed on the terminal device 200 for living body detection, it can therefore greatly save computing resources, improve living body detection efficiency, and reduce the hardware cost of the image acquisition module.
After obtaining the outputs of the preset single-mode feature encoding network and the preset single-mode living body classification network, the computing device 600 may determine a second comprehensive loss based on those outputs, and converge the two networks based on the second comprehensive loss to obtain an intermediate target living body detection model. The second comprehensive loss characterizes the training target of the target living body detection model in the predictive knowledge distillation stage, and can be expressed as the following equation (2):
Loss_total2=Loss_pred’+c×Loss_cls2+d×Loss_h+e×Loss_csi;(2)
In equation (2), Loss_total2 denotes the second comprehensive loss; Loss_pred' denotes the prediction fitting loss, characterizing the first target of the target living body detection model during training; Loss_cls2 denotes the first single-mode living body classification loss, characterizing the second target; Loss_h denotes the mutual information loss, characterizing the third target; and Loss_csi denotes the conditional self-information loss, characterizing the fourth target. c, d and e are weights, each taking the value 0 or 1. The training targets of the target living body detection model include the first target, and may further include at least one of the second, third and fourth targets.
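A one-line sketch of how equation (2) combines the component losses is given below; the function name is illustrative, and c, d, e act as 0/1 switches for the optional targets, as the text describes:

```python
def second_integrated_loss(loss_pred, loss_cls2, loss_h, loss_csi,
                           c=1, d=1, e=1):
    """Loss_total2 per equation (2); c, d, e each take the value 0 or 1,
    switching the optional second/third/fourth targets on or off."""
    return loss_pred + c * loss_cls2 + d * loss_h + e * loss_csi
```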
The first target may include that the first predicted living body classification result output by the target living body detection model is consistent with the predicted multi-modal living body classification result output by the multi-modal living body detection model, where consistency means that the difference between the two is within a fourth preset range. Loss_pred' may be determined based on the difference between the first predicted living body classification result output by the preset single-mode living body classification network and the predicted multi-modal living body classification result output by the multi-modal living body detection model, and aims to constrain the former to approach the latter. During training, the computing device 600 may perform back propagation based on the Loss_pred' obtained in the current iteration to update the parameters of the preset single-mode feature encoding network and the preset single-mode living body classification network until training ends.
The second target may include that the first predicted living body classification result output by the target living body detection model is consistent with its corresponding real living body classification result, where consistency means that the difference between the two is within a fifth preset range. Loss_cls2 may be determined based on the difference between the first predicted living body classification result and its corresponding real living body classification result, and aims to constrain the former to approach the latter. During training, the computing device 600 may perform back propagation based on the Loss_cls2 obtained in the current iteration to update the parameters of the preset single-mode feature encoding network and the preset single-mode living body classification network until training ends.
The third target may include that the first predicted feature of the target modality output by the target living body detection model approaches the predicted multi-modal fusion feature output by the multi-modal living body detection model, i.e., that the amount of mutual information between the two is maximized (approaches its maximum value). Loss_h may be determined based on the difference between this mutual information amount and a first preset value, where the first preset value may be the maximum value of the mutual information amount, for example the maximum allowable value after normalization. Loss_h aims to constrain this difference to be within a sixth preset range; equivalently, its constraint objective is to make the mutual information amount between the first predicted feature of the target modality and the predicted multi-modal fusion feature approach the first preset value, such as 1, thereby achieving mutual information maximization. Loss_h has the effect that, given the teacher model's output, the student model's output can be estimated from it as well as possible, so that the trained target living body detection model approaches the detection performance of the multi-modal living body detection model. During training, the computing device 600 may perform back propagation based on the Loss_h obtained in the current iteration to update the parameters of the preset single-mode feature encoding network until training ends.
The amount of mutual information between the first predicted feature of the target modality and the predicted multi-modal fusion feature may be determined as follows. For example, the computing device 600 marks the predicted multi-modal fusion feature output by the teacher model as X and the first predicted feature of the target modality output by the student model as Y, and determines the mutual information amount as I(X, Y) = H(X) + H(Y) - H(X, Y). Here, I(X, Y) represents the correlation between the teacher model and the student model; H(X) represents the entropy of the teacher model's output X; H(Y) represents the entropy of the student model's output Y; and H(X, Y) represents the joint entropy of X and Y. H(X) = -P_dos·log2(P_dos) - P_live·log2(P_live), where P_dos may be determined as the proportion of attack-class results among the predicted multi-modal living body classification results. Similarly, the computing device 600 may obtain H(Y). Notably, when computing H(X, Y), the computing device 600 determines it based on the joint probabilities of the teacher model and the student model, i.e., H(X, Y) = -P_dos·log2(P_dos) - P_live·log2(P_live) - P_live·log2(P_dos) - P_dos·log2(P_live).
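A sketch of this computation follows. It implements the standard I(X, Y) = H(X) + H(Y) - H(X, Y) identity, estimating the joint entropy from the four joint probabilities of the teacher/student predictions, which is one reading of the formula in the text; all names and the counting-based probability estimation are illustrative assumptions:

```python
import math

def binary_entropy(p_dos):
    """H = -P_dos*log2(P_dos) - P_live*log2(P_live), with P_live = 1 - P_dos."""
    p_live = 1.0 - p_dos
    return -sum(p * math.log2(p) for p in (p_dos, p_live) if p > 0)

def mutual_information(p_dos_teacher, p_dos_student, joint_probs):
    """I(X, Y) = H(X) + H(Y) - H(X, Y).

    `joint_probs` holds the four joint probabilities P(x, y) for the pairs
    (dos, dos), (live, live), (live, dos), (dos, live), estimated by counting
    how often the teacher and student predictions co-occur on the batch.
    """
    h_x = binary_entropy(p_dos_teacher)
    h_y = binary_entropy(p_dos_student)
    h_xy = -sum(p * math.log2(p) for p in joint_probs if p > 0)
    return h_x + h_y - h_xy
```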
The fourth target may include minimizing the conditional self-information amount of the first predicted feature of the target modality output by the multi-modal living body detection model, given the first predicted feature of the target modality output by the preset single-mode feature encoding network. Minimization here means the conditional self-information amount approaches its minimum value. Loss_csi may be determined based on the difference between this conditional self-information amount and a second preset value, where the second preset value may be the minimum value of the conditional self-information amount, for example the minimum allowable value after normalization. Loss_csi aims to constrain this difference to be within a seventh preset range; equivalently, its constraint objective is to make the conditional self-information amount of the first predicted feature output by the multi-modal living body detection model, given the first predicted feature output by the preset single-mode feature encoding network, approach the second preset value, such as 0, thereby achieving conditional self-information minimization. The larger the conditional self-information amount, the smaller the probability of estimating the teacher model's output from the student model's output when the latter is known. Loss_csi therefore has the effect that, given the student model's output, as little as possible of the teacher model's output remains to be estimated, so that the trained target living body detection model has low dependence on the teacher model. Based on Loss_h and Loss_csi together, the prior knowledge learned by the teacher model can be transferred to the student model while keeping the student model's dependence on the teacher model low, so that it can independently produce accurate living body detection results. During training, the computing device 600 may perform back propagation based on the Loss_csi obtained in the current iteration to update the parameters of the preset single-mode feature encoding network until training ends.
The conditional self-information amount can be expressed as I'(X, Y) = log2(P(Y_dos | X_dos)) + log2(P(Y_live | X_live)) + log2(P(Y_dos | X_live)) + log2(P(Y_live | X_dos)). P(Y_dos | X_dos) represents the probability of the attack class occurring in X among the samples where Y is known to be the attack class. P(Y_live | X_live) represents the probability of the living class occurring in X among the samples where Y is known to be the living class. P(Y_dos | X_live) represents the probability of the living class occurring in X among the samples where Y is known to be the attack class. P(Y_live | X_dos) represents the probability of the attack class occurring in X among the samples where Y is known to be the living class. Taking P(Y_live | X_dos) as an example, it may be determined as follows: among the samples where Y is the living class, the computing device 600 may take the proportion of the corresponding X samples discriminated as the attack class as P(Y_live | X_dos).
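As an illustration, the formula above can be sketched as follows, with each conditional probability estimated by counting in the manner of the text's example; the function name and the estimation helper are illustrative assumptions:

```python
import math

def conditional_self_information(cond_probs):
    """I'(X, Y): sum of log2 P(Y_* | X_*) over the four class pairs.

    Each entry of `cond_probs` is estimated by counting, e.g.
    P(Y_live | X_dos) = (samples whose X is discriminated as attack, among
    the samples where Y is the living class) / (samples where Y is living),
    following the counting example in the text.
    """
    return sum(math.log2(p) for p in cond_probs if p > 0)
```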
The third and fourth terms in equation (2) may be understood as constraining the correlation between the first predicted feature of the target modality output by the target living body detection model and the predicted multi-modal fusion feature output by the multi-modal living body detection model to meet a preset requirement. The preset requirement may include: the mutual information amount between the two approaches the first preset value; and, given the first predicted feature of the target modality output by the target living body detection model, the conditional self-information amount of the first predicted feature of the target modality output by the multi-modal living body detection model approaches the second preset value.
When determining the second comprehensive loss, the computing device 600 requires the single-modality training image to be one of the multi-modal training images corresponding to the predicted multi-modal living body classification result. That is, the training image corresponding to the first predicted living body classification result should be among the training images of the plurality of modalities corresponding to the predicted multi-modal living body classification result.
In the predictive knowledge distillation training process, the condition for ending training may be that the second comprehensive loss is smaller than a second loss value, that the number of training iterations reaches a preset number, or that the training precision of the preset target living body detection model reaches a preset precision, which is not limited in this specification.
The preset target living body detection model obtained when predictive knowledge distillation training ends is the trained intermediate target living body detection model, comprising an intermediate single-mode feature encoding network and an intermediate single-mode living body classification network. The computing device 600 may then perform advanced knowledge distillation on the intermediate target living body detection model based on the multi-modal living body detection model. Advanced knowledge distillation aims to further improve the living body detection performance of the student model.
The training targets of advanced knowledge distillation may include prediction distribution consistency, and may further include that the second predicted living body classification result output by the target living body detection model is consistent with its corresponding real living body classification result. Consistency here may mean that the difference between the two is within an eighth preset range, or that the second predicted living body classification result approaches its corresponding real living body classification result.
Prediction distribution consistency may mean that the distribution of prediction objects under a target category output by the target living body detection model is consistent with the distribution of prediction objects under that target category output by the multi-modal living body detection model, where the target category may be the attack category or the living category. Prediction distribution consistency may include: the distribution of prediction objects under the living category output by the target living body detection model is consistent with that output by the multi-modal living body detection model; and the distribution of prediction objects under the attack category output by the target living body detection model is consistent with that output by the multi-modal living body detection model. Consistency may mean that the difference between the two is within a ninth preset range. During training, prediction distribution consistency constrains the distribution of prediction objects under the living category output by the target living body detection model to approach the distribution of prediction objects under the living category output by the multi-modal living body detection model.
The prediction object includes a prediction feature and/or a prediction living body classification result output by the model. For the target living detection model, the prediction object may be a second prediction feature and/or a second prediction living body classification result output by the target living detection model. For a multi-modal living body detection model, the prediction object may be a prediction multi-modal fusion feature and/or a prediction multi-modal living body classification result output by the multi-modal living body detection model.
In some embodiments, prediction distribution consistency may include: the distribution of the second predicted features under the living category output by the target living body detection model is consistent with the distribution of the predicted multi-modal fusion features under the living category output by the multi-modal living body detection model (consistency meaning the difference between the two is within a tenth preset range), so as to constrain the distribution of the former to approach that of the latter; and the distribution of the second predicted features under the attack category output by the target living body detection model is consistent with the distribution of the predicted multi-modal fusion features under the attack category output by the multi-modal living body detection model (consistency meaning the difference is within an eleventh preset range), so as to constrain the distribution of the former to approach that of the latter.
In some embodiments, predicting distribution consistency may include: the distribution of the second predicted living body classification result under the living body category output by the target living body detection model is consistent with the distribution of the predicted multi-mode living body classification result under the living body category output by the multi-mode living body detection model. The agreement may be that the difference between the two is within a twelfth preset range to restrict the distribution rule of the second predicted living body classification result under the living body category output by the target living body detection model to approach the distribution rule of the predicted multi-mode living body classification result under the living body category output by the multi-mode living body detection model. And the distribution of the second predicted living body classification result under the attack category output by the target living body detection model is consistent with the distribution of the predicted multi-mode living body classification result under the attack category output by the multi-mode living body detection model. The agreement may be that the difference between the two is within a thirteenth preset range so as to restrict the distribution rule of the second predicted living body classification result under the attack category output by the target living body detection model to approach the distribution rule of the predicted multi-mode living body classification result under the attack category output by the multi-mode living body detection model.
In some embodiments, predicting distribution consistency may include: the distribution of the second prediction features under the living body category output by the target living body detection model is consistent with the distribution of the prediction multi-mode fusion features under the living body category output by the multi-mode living body detection model, and the distribution of the second prediction living body classification results under the living body category output by the target living body detection model is consistent with the distribution of the prediction multi-mode living body classification results under the living body category output by the multi-mode living body detection model; and the distribution of the second prediction features under the attack category output by the target living body detection model is consistent with the distribution of the prediction multi-mode fusion features under the attack category output by the multi-mode living body detection model, and the distribution of the second prediction living body classification results under the attack category output by the target living body detection model is consistent with the distribution of the prediction multi-mode living body classification results under the attack category output by the multi-mode living body detection model. Reference is made to the description of the relevant embodiments herein for consistency.
The distribution of predicted objects may be characterized by a class center. The computing device 600 may constrain the training process of advanced knowledge distillation from the class center of the living class and the class center of the attack class, respectively.
Wherein the distribution of the predicted objects under the living body category output by the target living body detection model is consistent with the distribution of the predicted objects under the living body category output by the multi-mode living body detection model, may include at least one of: the class center of the predicted object under the living body class output by the target living body detection model is consistent with the class center of the predicted object under the living body class output by the multi-mode living body detection model; and the distance between the predicted object under the living body category output by the target living body detection model and the class center of the corresponding living body category is consistent with the distance between the predicted object under the living body category output by the multi-mode living body detection model and the class center of the corresponding living body category. The agreement may be that the difference between the two is within a preset range.
That is, the distribution of the predicted objects under the living category output by the target living detection model is identical to the distribution of the predicted objects under the living category output by the multi-modal living detection model, and may include the following several implementations:
in some embodiments, the distribution of the predicted objects under the living category output by the target living detection model is consistent with the distribution of the predicted objects under the living category output by the multi-modal living detection model, may include: the class center of the predicted object under the living body class output by the target living body detection model is consistent with the class center of the predicted object under the living body class output by the multi-mode living body detection model. The agreement may be that the difference between the two is within a fourteenth preset range to restrict the class center of the predicted object under the living class output by the target living detection model to approach the class center of the predicted object under the living class output by the multi-mode living detection model.
In some embodiments, the distribution of the predicted objects under the living category output by the target living detection model is consistent with the distribution of the predicted objects under the living category output by the multi-modal living detection model, may include: the distance between the predicted object under the living body category output by the target living body detection model and the class center of the corresponding living body category is consistent with the distance between the predicted object under the living body category output by the multi-mode living body detection model and the class center of the corresponding living body category. The agreement may be that the difference between the two is within a fifteenth preset range to restrict the distance between the predicted object under the living body class output by the target living body detection model and the class center of the living body class corresponding thereto, approaching the distance between the predicted object under the living body class output by the multi-mode living body detection model and the class center of the living body class corresponding thereto.
In some embodiments, the distribution of the predicted objects under the living category output by the target living detection model is consistent with the distribution of the predicted objects under the living category output by the multi-modal living detection model, may include: the class center of the predicted object under the living body class output by the target living body detection model is consistent with the class center of the predicted object under the living body class output by the multi-mode living body detection model, and the distance between the predicted object under the living body class output by the target living body detection model and the class center of the corresponding living body class is consistent with the distance between the predicted object under the living body class output by the multi-mode living body detection model and the class center of the corresponding living body class. Reference is made to the description of the relevant content of the above embodiments.
In addition, the distribution of the predicted objects under the attack category output by the target living body detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living body detection model, and at least one of the following may be included: the class center of the predicted object under the attack class output by the target living body detection model is consistent with the class center of the predicted object under the attack class output by the multi-mode living body detection model; and the distance between the predicted object under the attack category output by the target living body detection model and the class center of the attack category corresponding to the predicted object is consistent with the distance between the predicted object under the attack category output by the multi-mode living body detection model and the class center of the attack category corresponding to the predicted object. The agreement may be that the difference between the two is within a preset range.
That is, the distribution of the predicted objects under the attack category output by the target living body detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living body detection model, and the following several implementation modes may be further included:
in some embodiments, the distribution of the predicted objects under the attack category output by the target living detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living detection model, and may further include: the class center of the predicted object under the attack class output by the target living body detection model is consistent with the class center of the predicted object under the attack class output by the multi-mode living body detection model. The agreement may be that the difference between the two is within a sixteenth preset range to restrict the class center of the predicted object under the attack class output by the target living body detection model to approach the class center of the predicted object under the attack class output by the multi-mode living body detection model.
In some embodiments, the distribution of the predicted objects under the attack category output by the target living detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living detection model, and may further include: the distance between the predicted object under the attack category output by the target living body detection model and the class center of the corresponding attack category is consistent with the distance between the predicted object under the attack category output by the multi-mode living body detection model and the class center of the corresponding attack category. The consistency may be that the difference between the two is within a seventeenth preset range so as to restrict the distance between the predicted object under the attack category output by the target living body detection model and the class center of the attack category corresponding to the predicted object, and approach the distance between the predicted object under the attack category output by the multi-mode living body detection model and the class center of the attack category corresponding to the predicted object.
In some embodiments, the distribution of the predicted objects under the attack category output by the target living detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living detection model, and may further include: the class center of the predicted object under the attack class output by the target living detection model is consistent with the class center of the predicted object under the attack class output by the multi-mode living detection model, and the distance between the predicted object under the attack class output by the target living detection model and the class center of the corresponding attack class is consistent with the distance between the predicted object under the attack class output by the multi-mode living detection model and the class center of the corresponding attack class. Reference is made to the description of the relevant content of the above embodiments.
During advanced knowledge distillation, the computing device 600 may input the training image of the target modality to the intermediate unimodal feature encoding network to perform feature extraction on the training image of the target modality using the intermediate unimodal feature encoding network to obtain a second prediction feature of the target modality, and input the second prediction feature of the target modality to the intermediate unimodal living body classification network to perform living body detection using the intermediate unimodal living body classification network based on the second prediction feature of the target modality, thereby obtaining a second prediction living body classification result.
After obtaining the second predicted feature and the second predicted living body classification result of the target modality, the computing device 600 may determine a third comprehensive loss based on the second predicted feature and/or the second predicted living body classification result. Taking the example of computing device 600 determining a third comprehensive loss based on the second predicted feature of the target modality and the second predicted living body classification result, the third comprehensive loss may be expressed as the following equation (3):
Loss_total3 = Loss_cls3 + x × Loss_c1 + y × Loss_c2    (3)
In equation (3), Loss_total3 denotes the third comprehensive loss; Loss_cls3 denotes the second single-mode living body classification loss, characterizing the fifth target of the target living body detection model during training; Loss_c1 denotes the class-center relationship constraint loss, characterizing the sixth target; and Loss_c2 denotes the fine-grained class-center relationship constraint loss, characterizing the seventh target. x and y are weights, each taking the value 0 or 1. The training targets of the target living body detection model include the fifth target, and may further include the sixth target and/or the seventh target.
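As with equation (2), the combination can be sketched in one line; the function name is illustrative, and x, y act as 0/1 switches for the optional class-center constraints:

```python
def third_integrated_loss(loss_cls3, loss_c1, loss_c2, x=1, y=1):
    """Loss_total3 per equation (3); x and y each take the value 0 or 1,
    enabling the optional class-center relationship constraints."""
    return loss_cls3 + x * loss_c1 + y * loss_c2
```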
The fifth target may include that the second predicted living body classification result is consistent with its corresponding real living body classification result, where consistency means the difference between the two is within a preset range. Loss_cls3 may be determined based on the difference between the second predicted living body classification result and its corresponding real living body classification result, and aims to constrain the former to approach the latter. As in predictive knowledge distillation, the single-modality training image must be among the multi-modal training images corresponding to the predicted multi-modal living body classification result. During training, the computing device 600 may perform back propagation based on the Loss_cls3 obtained in the current iteration to update the parameters of the intermediate single-mode feature encoding network and the intermediate single-mode living body classification network until training ends.
The sixth target may include that the class center of the prediction objects under the living category output by the target living body detection model is consistent with the class center of the prediction objects under the living category output by the multi-modal living body detection model, and that the class center of the prediction objects under the attack category output by the target living body detection model is consistent with the class center of the prediction objects under the attack category output by the multi-modal living body detection model, where consistency means the difference between the two is within a preset range. Loss_c1 may be determined based on at least one of the following: the difference between the living-category class centers output by the two models, and the difference between the attack-category class centers output by the two models. For example, the computing device 600 may determine Loss_c1 based on the living-category class-center difference alone, based on the attack-category class-center difference alone, or based on a weighted sum of the two differences. Loss_c1 aims to constrain the class center of the prediction objects under the living category output by the target living body detection model to approach that output by the multi-modal living body detection model, and likewise for the attack category.
Taking the class center of the living category as an example, the computing device 600 may determine the class center of the prediction objects under the living category output by the target living body detection model in various ways; for example, as the average of those prediction objects. Similarly, the computing device 600 may obtain the class center of the prediction objects under the attack category output by the target living body detection model, and the class centers of the prediction objects under the living and attack categories output by the multi-modal living body detection model. Loss_c1 constrains the class centers of the prediction objects output by the target living body detection model to be as consistent as possible with those of the multi-modal living body detection model, so that the target living body detection model fully learns the prior knowledge of the multi-modal living body detection model, improving its living body detection accuracy.
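As an illustration, a sketch of the class-center computation and a Loss_c1 of this kind follows. The averaging follows the text; grouping by training labels, the MSE distance between centers, and the label convention (0 = attack, 1 = living) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def class_center(features, labels, cls):
    """Class center = mean of the prediction features belonging to class `cls`."""
    return features[labels == cls].mean(dim=0)

def class_center_loss(student_feats, teacher_feats, labels):
    """Loss_c1: pull the student's living/attack class centers toward the
    teacher's; MSE is one plausible choice of class-center difference."""
    loss = 0.0
    for cls in (0, 1):  # assumed convention: 0 = attack class, 1 = living class
        loss = loss + F.mse_loss(
            class_center(student_feats, labels, cls),
            class_center(teacher_feats, labels, cls).detach())
    return loss
```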
The distance between the prediction objects under the living category output by the target living body detection model and their class center characterizes the compactness of those prediction objects: the larger this distance, the more dispersed their distribution and the more difficult the model is to fit. The trained multi-modal living body detection model has high living body detection performance, i.e., it fits its training objects well. By constraining the class center of the prediction objects under the living category output by the target living body detection model to be as consistent as possible with that output by the multi-modal living body detection model, the computing device 600 constrains the distribution of the former to approach the distribution of the latter; the same holds for the attack category. The class-center constraint makes the distribution of the student model's extracted features in feature space approach that of the teacher model's features, so that the student model's features are more compactly distributed, more robust to noise, and more amenable to living body classification.
The seventh target may include that the distance between each prediction object under the living category output by the target living body detection model and the class center of its corresponding living category is consistent with the distance between each prediction object under the living category output by the multi-modal living body detection model and the class center of its corresponding living category, where consistency means the difference between the two is within a preset range. Loss_c2 may be determined based on at least one of the following: the difference between the center distance of the prediction objects under the living category output by the target living body detection model and the corresponding center distance for the multi-modal living body detection model; and the difference between the center distance of the prediction objects under the attack category output by the target living body detection model and the corresponding center distance for the multi-modal living body detection model. For example, the computing device 600 may determine Loss_c2 based on the living-category distance difference alone, based on the attack-category distance difference alone, or based on a weighted sum of the two. Loss_c2 aims to further constrain the distribution of the prediction objects output by the target living body detection model relative to their class centers to approach that of the multi-modal living body detection model, so that the target living body detection model further learns the prior knowledge of the multi-modal living body detection model and its living body detection accuracy is further improved.
When determining the distance between a prediction object under the living category output by the target living body detection model and the class center of its corresponding living category, the computing device 600 may use the Euclidean (L2) distance or a similar metric between the prediction object and the class center.
The distance between a prediction object under the living category output by the target living body detection model and the class center of its corresponding living category characterizes the degree to which the prediction object deviates from that class center: the larger the deviation, the more dispersed the distribution of the prediction objects and the more difficult the model is to fit. The trained multi-modal living body detection model has high living body detection performance, i.e., it fits its training objects well. By constraining the degree to which the prediction objects output by the target living body detection model deviate from their class center to be as consistent as possible with the corresponding deviation for the multi-modal living body detection model, the computing device 600 further constrains the distribution of the former's prediction objects to approach that of the latter's. The effect of Loss_c2 for the attack category can be understood with reference to the above description for the living category and is not repeated here.
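A sketch of such a fine-grained deviation constraint follows; the per-sample L2 distance follows the text, while the squared-difference formulation and all names are illustrative assumptions:

```python
import torch

def center_distance(features, labels, centers):
    """Per-sample Euclidean (L2) distance to the class center of each label.
    `centers` is a (2, D) tensor of class centers indexed by label."""
    return (features - centers[labels]).norm(dim=-1)

def fine_grained_center_loss(student_feats, teacher_feats, labels,
                             student_centers, teacher_centers):
    """Loss_c2: make each student feature deviate from its class center by
    the same amount as the corresponding teacher feature deviates from its
    own class center (squared difference of the two distances)."""
    d_student = center_distance(student_feats, labels, student_centers)
    d_teacher = center_distance(teacher_feats, labels, teacher_centers).detach()
    return ((d_student - d_teacher) ** 2).mean()
```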
During training, the computing device 600 may perform back propagation based on the Loss_c1 obtained in the current iteration to update the parameters of the intermediate single-mode feature encoding network and/or the intermediate single-mode living body classification network, until the third comprehensive loss is smaller than a third loss value, the number of training iterations reaches a preset number, or the training precision of the preset target living body detection model reaches a preset precision, at which point training ends.
The fine-grained class center constraint can further improve the consistency of the distribution of the extracted features of the student model in the feature space and the distribution of the extracted features of the teacher model in the feature space, so that the distribution of the extracted features of the student model in the feature space is more compact, the robustness to noise is further improved, and the accuracy of living body classification is further improved.
The intermediate single-mode feature encoding network and the intermediate single-mode living body classification network obtained when advanced knowledge distillation training ends constitute the trained target living body detection model.
In some embodiments, a mapping network may also be added during progressive knowledge distillation. The mapping network may be a fully-connected layer configured to feature-map the second predicted feature output by the intermediate single-mode feature encoding network, yielding a remapped second predicted feature; the feature mapping may include rotation, scaling, translation, and the like. Accordingly, the prediction object may also include the remapped second predicted feature, and the computing device 600 may determine the third comprehensive loss based on it. The mapping network can further improve the performance and accuracy of the target living body detection model: it keeps the result of predictive knowledge distillation largely unchanged while still improving living body detection performance and accuracy.
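By way of illustration, such a mapping network can be sketched as a single fully-connected layer; the feature dimension and class name are assumptions:

```python
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Fully-connected remapping of the second predicted feature."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # a linear layer with bias can represent rotation-, scaling- and
        # translation-like transformations of the feature vector
        self.fc = nn.Linear(feat_dim, feat_dim)

    def forward(self, feat):
        return self.fc(feat)  # remapped second predicted feature
```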
The foregoing describes the implementation of progressive knowledge distillation. In some embodiments, the computing device 600 may instead perform one-shot knowledge distillation on the preset target living body detection model, i.e., joint knowledge distillation based on the second and third comprehensive losses. For example, the computing device 600 may determine the living body classification loss, the class-center relationship constraint loss, and the fine-grained class-center relationship constraint loss based on the first predicted feature and/or the first predicted living body classification result. That is, the computing device 600 may replace the second predicted feature and the second predicted living body classification result in equation (3) above with the first predicted feature and the first predicted living body classification result to determine the third comprehensive loss, and perform joint training based on the third and second comprehensive losses, thereby obtaining the trained target living body detection model.
After the trained target living body detection model is obtained, it may be output to the terminal device 200, so that the terminal device 200 performs living body detection based on the acquired target image to obtain the target living body detection result. For example, the terminal device 200 inputs the target image to the target living body detection model, which outputs a living probability P1 or an attack probability P2: P1 characterizes the probability that the target image is a living object, and P2 the probability that it is an attack object. The terminal device 200 may determine the target living body detection result based on P1 or P2. For example, if P1 is greater than a set threshold T1, the target image is identified as the living category; if P1 is less than T1, it is identified as the attack category. Likewise, if P2 is greater than a set threshold T2, the target image is identified as the attack category; if P2 is less than T2, it is identified as the living category.
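As an illustration, the terminal-side decision can be sketched as follows, reusing the student model sketched earlier; the threshold value and all names are placeholders, and the class-index convention is an assumption:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def detect(model, target_image, t1=0.5):
    """Liveness decision from a single-modality target image.

    `t1` stands in for the set threshold T1 on the living probability P1;
    its value here is only a placeholder.
    """
    _, logits = model(target_image.unsqueeze(0))  # add batch dimension
    p1 = F.softmax(logits, dim=-1)[0, 1].item()   # assumed: index 1 = living class
    return "living" if p1 > t1 else "attack"
```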
With continued reference to fig. 3, after step S140, the method P100 may further include step S160.
S160: and outputting a target living body detection result.
The computing device 600 may output the living body detection result in various ways; for example, by presenting it visually. The visual presentation may take various forms: the computing device 600 may show the living body detection result on a display device, or issue a prompt of the result through audible or visual signals, and so on.
In summary, the living body detection method P100 and system 001 provided in this specification collect a single-modality target image of a target user, perform living body detection on the target image through the target living body detection model, and output the resulting target living body detection result. In this scheme, the multi-modal living body detection model is used to perform knowledge distillation on the single-modality preset target living body detection model, yielding the target living body detection model. The target living body detection model can represent the mapping relationship between the single-modality image and the multi-modal living body detection result obtained from multi-modal images, so that the single-modality model possesses living body detection performance close to that of the multi-modal living body detection model, achieving an effect close to the multi-modal detection result while detecting from a single-modality image. In addition, because living body detection is performed on a single-modality image, computing power is saved, living body detection efficiency is improved, and the hardware cost of the image acquisition module is reduced. Meanwhile, the target living body detection model is a lightweight model for single-modality image detection with low computing-power requirements: it can be deployed on the terminal device or on a remote server, and when deployed on the terminal device it avoids image transmission, further saving computation time and avoiding the risk of privacy leakage during data transmission.
Another aspect of the present disclosure provides a non-transitory storage medium storing at least one set of executable instructions for living body detection. When executed by a processor, the executable instructions direct the processor to perform the steps of the living body detection method P100 described in the present specification. In some possible implementations, aspects of the present specification may also be implemented in the form of a program product including program code. When the program product runs on the computing device 600, the program code causes the computing device 600 to perform the steps of the living body detection method P100 described in the present specification.

The program product for implementing the above method may employ a portable compact disc read-only memory (CD-ROM) including program code and may run on the computing device 600. However, the program product of the present specification is not limited thereto. In the present specification, the readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system. The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing. Program code for carrying out operations of the present specification may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the C programming language. The program code may execute entirely on the computing device 600, partly on the computing device 600, as a stand-alone software package, partly on the computing device 600 and partly on a remote computing device, or entirely on the remote computing device.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In view of the foregoing, it will be evident to a person skilled in the art that the foregoing detailed disclosure is presented by way of example only and is not limiting. Although not explicitly stated herein, those skilled in the art will appreciate that the present specification is intended to encompass various reasonable alterations, improvements, and modifications of the embodiments. Such alterations, improvements, and modifications are intended to be suggested by the present specification and are within the spirit and scope of the exemplary embodiments of the present specification.
Furthermore, certain terms in the present specification have been used to describe embodiments of the present specification. For example, "one embodiment," "an embodiment," and/or "some embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present specification. Thus, it is emphasized and should be appreciated that two or more references to "an embodiment," "one embodiment," or "an alternative embodiment" in various portions of the present specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of the present specification.
It should be appreciated that in the foregoing description of the embodiments of the present specification, various features are sometimes combined in a single embodiment, drawing, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one feature. However, this does not mean that the combination of these features is required; it is entirely possible for a person skilled in the art, upon reading the present specification, to treat some of these features as separate embodiments. That is, embodiments of the present specification may also be understood as an integration of multiple secondary embodiments, where each secondary embodiment contains fewer than all of the features of a single foregoing disclosed embodiment.
Each patent, patent application, publication of a patent application, and other material, such as articles, books, specifications, publications, and documents, cited herein is hereby incorporated by reference in its entirety for all purposes, except for any prosecution file history associated with the same, any of the same that is inconsistent with or in conflict with the present document, and any of the same that may have a limiting effect on the broadest scope of the claims now or later associated with the present document. For example, if there is any inconsistency or conflict between the description, definition, and/or use of a term associated with any of the incorporated material and that associated with the present document, the description, definition, and/or use of the term in the present document shall prevail.
Finally, it should be understood that the embodiments disclosed herein are illustrative of the principles of the embodiments of the present specification. Other modified embodiments are also within the scope of the present specification. Accordingly, the embodiments disclosed in the present specification are by way of example only and not limitation. Those skilled in the art may adopt alternative configurations to implement the solutions of the present specification based on the embodiments herein. Therefore, the embodiments of the present specification are not limited to those precisely described in the application.

Claims (13)

1. A living body detection method, comprising:
acquiring a target image of a target user, wherein the target image is a single-mode image acquired for the target user in a target mode;
performing multi-mode living body detection on the target image based on a target living body detection model to obtain a target living body detection result, wherein the target living body detection model is a single-mode living body detection model obtained by performing knowledge distillation based on a multi-mode living body detection model; and
outputting the target living body detection result.
2. The method of claim 1, wherein the target modality is one of a plurality of modalities corresponding to the multi-modality living body detection model.
3. The method of claim 2, wherein the knowledge distillation comprises:
performing predictive recognition distillation on the target living body detection model based on the multi-mode living body detection model, wherein a training target of the predictive recognition distillation comprises that a first predicted living body classification result output by the target living body detection model is consistent with a predicted multi-mode living body classification result output by the multi-mode living body detection model.
4. The method of claim 3, wherein the training targets of the predictive recognition distillation further comprise at least one of:
the first predicted living body classification result output by the target living body detection model is consistent with the corresponding real living body classification result; and
a correlation between a first prediction feature of the target mode output by the target living body detection model and a predicted multi-mode fusion feature of the plurality of modes output by the multi-mode living body detection model meets a preset requirement.
5. The method of claim 4, wherein the preset requirement comprises at least one of:
a mutual information quantity between the first prediction feature of the target mode output by the target living body detection model and the predicted multi-mode fusion feature output by the multi-mode living body detection model approaches a first preset value; and
a conditional information quantity of the predicted multi-mode fusion feature output by the multi-mode living body detection model, conditioned on the first prediction feature of the target mode output by the target living body detection model, approaches a second preset value.
6. The method of claim 3, wherein the knowledge distillation further comprises:
performing advanced knowledge distillation on the target living body detection model based on the multi-mode living body detection model, wherein a training target of the advanced knowledge distillation comprises predicted distribution consistency, the predicted distribution consistency comprising:
the distribution of the predicted objects under the living body category output by the target living body detection model is consistent with the distribution of the predicted objects under the living body category output by the multi-mode living body detection model; and
the distribution of the predicted objects under the attack category output by the target living body detection model is consistent with the distribution of the predicted objects under the attack category output by the multi-mode living body detection model.
7. The method of claim 6, wherein the distribution of the predicted objects under the living body category output by the target living body detection model being consistent with the distribution of the predicted objects under the living body category output by the multi-mode living body detection model comprises at least one of:
the class center of the predicted object under the living body category output by the target living body detection model is consistent with the class center of the predicted object under the living body category output by the multi-mode living body detection model; and
the distance between the predicted object under the living body category output by the target living body detection model and the class center of the corresponding living body category is consistent with the distance between the predicted object under the living body category output by the multi-mode living body detection model and the class center of the corresponding living body category.
8. The method of claim 6, wherein the distribution of the predicted objects under the attack category output by the target living body detection model being consistent with the distribution of the predicted objects under the attack category output by the multi-mode living body detection model comprises at least one of:
the class center of the predicted object under the attack category output by the target living body detection model is consistent with the class center of the predicted object under the attack category output by the multi-mode living body detection model; and
the distance between the predicted object under the attack category output by the target living body detection model and the class center of the corresponding attack category is consistent with the distance between the predicted object under the attack category output by the multi-mode living body detection model and the class center of the corresponding attack category.
9. The method of claim 6, wherein the predicted object comprises a prediction feature and/or a predicted living body classification result output by the corresponding model.
10. The method of claim 6, wherein the training target of the advanced knowledge distillation further comprises:
a second predicted living body classification result output by the target living body detection model is consistent with the corresponding real living body classification result.
11. The method of claim 1, wherein a training target of the multi-mode living body detection model comprises: a plurality of predicted living body classification results of a plurality of modes output by the multi-mode living body detection model are consistent with the corresponding real living body classification results.
12. The method of claim 11, wherein the training target of the multi-mode living body detection model further comprises at least one of:
the plurality of predicted living body classification results of the plurality of modes output by the multi-mode living body detection model are consistent with a fused predicted multi-mode living body classification result; and
a plurality of prediction features of the plurality of modes output by the multi-mode living body detection model are consistent with one another.
13. A living body detection system, comprising:
at least one storage medium storing at least one instruction set for living body detection; and
at least one processor communicatively connected to the at least one storage medium,
wherein when the living body detection system runs, the at least one processor reads the at least one instruction set and performs the method of any one of claims 1-12 as directed by the at least one instruction set.
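As an illustrative aid only, and not language from the application, the correlation requirements of claim 5 and the distribution-consistency targets of claims 7 and 8 can be sketched in the following notation, where z_s is the target living body detection model's prediction feature of the target mode, z_t is the multi-mode living body detection model's predicted multi-mode fusion feature, mu_s^c and mu_t^c are the class centers of category c (living body or attack) for the two models, and c_1 and c_2 are the preset values:

```latex
% Assumed formalization for illustration; the notation is ours, not the application's.
% Claim 5: mutual information and conditional information approach preset values.
I(z_s;\, z_t) \rightarrow c_1, \qquad H(z_t \mid z_s) \rightarrow c_2

% Claims 7 and 8: for each category c in {living body, attack}, the class
% centers agree, and the distances to the corresponding class center agree.
\mu_s^{c} \approx \mu_t^{c}, \qquad
\lVert z_s - \mu_s^{c} \rVert \approx \lVert z_t - \mu_t^{c} \rVert
```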
CN202310458589.9A 2023-04-20 2023-04-20 Living body detection method and system Pending CN116524609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310458589.9A CN116524609A (en) 2023-04-20 2023-04-20 Living body detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310458589.9A CN116524609A (en) 2023-04-20 2023-04-20 Living body detection method and system

Publications (1)

Publication Number Publication Date
CN116524609A true CN116524609A (en) 2023-08-01

Family

ID=87395321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310458589.9A Pending CN116524609A (en) 2023-04-20 2023-04-20 Living body detection method and system

Country Status (1)

Country Link
CN (1) CN116524609A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274761A (en) * 2023-11-08 2023-12-22 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium
CN117274761B (en) * 2023-11-08 2024-03-12 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111260594A (en) Unsupervised multi-modal image fusion method
CN112052831B (en) Method, device and computer storage medium for face detection
CN109583304A (en) A kind of quick 3D face point cloud generation method and device based on structure optical mode group
CN112232163B (en) Fingerprint acquisition method and device, fingerprint comparison method and device, and equipment
CN110059579B (en) Method and apparatus for in vivo testing, electronic device, and storage medium
CN112016525A (en) Non-contact fingerprint acquisition method and device
CN112232155A (en) Non-contact fingerprint identification method and device, terminal and storage medium
CN112052830B (en) Method, device and computer storage medium for face detection
CN112818722A (en) Modular dynamically configurable living body face recognition system
CN109002748A (en) Mix iris and facial biological recognition system
CN111104833A (en) Method and apparatus for in vivo examination, storage medium, and electronic device
CN112446322B (en) Eyeball characteristic detection method, device, equipment and computer readable storage medium
CN116524609A (en) Living body detection method and system
CN112232159A (en) Fingerprint identification method, device, terminal and storage medium
CN116311400A (en) Palm print image processing method, electronic device and storage medium
CN112232157B (en) Fingerprint area detection method, device, equipment and storage medium
CN116468113A (en) Living body detection model training method, living body detection method and living body detection system
Huang et al. Securable networked scheme with face authentication
CN114581978A (en) Face recognition method and system
CN116311546A (en) Living body detection method and system
CN112232152B (en) Non-contact fingerprint identification method and device, terminal and storage medium
CN115482285A (en) Image alignment method, device, equipment and storage medium
CN116433955A (en) Method and system for detecting attack resistance
CN212569821U (en) Non-contact fingerprint acquisition device
CN116188845A (en) Method and system for detecting attack resistance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination