CN113642639A - Living body detection method, living body detection device, living body detection apparatus, and storage medium - Google Patents

Living body detection method, living body detection device, living body detection apparatus, and storage medium

Info

Publication number
CN113642639A
CN113642639A
Authority
CN
China
Prior art keywords
image
sample
living body
feature
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110927106.6A
Other languages
Chinese (zh)
Other versions
CN113642639B (en)
Inventor
胡炳然
刘青松
宁学成
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110927106.6A priority Critical patent/CN113642639B/en
Publication of CN113642639A publication Critical patent/CN113642639A/en
Application granted granted Critical
Publication of CN113642639B publication Critical patent/CN113642639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a living body detection method, a living body detection device, an electronic device, and a storage medium, applied in the technical field of living body (liveness) detection. The method comprises the following steps: acquiring a first image and a second image of a target object, wherein the first image and the second image are images of different modalities; extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, and judging whether the target object is a living body according to the fusion feature to obtain a judgment result; and determining the judgment result as the detection and identification result of the target object.

Description

Living body detection method, living body detection device, living body detection apparatus, and storage medium
Technical Field
The invention relates to the technical field of living body (liveness) detection, and in particular to a living body detection method, device, equipment, and storage medium.
Background
With the rapid development and wide application of artificial intelligence technology, the public pays more and more attention to security issues. Face recognition is a biometric technology that performs identity recognition based on a person's facial feature information, and it has the advantages of being non-compulsory and contactless. With the improvement of the accuracy of face recognition algorithms and the development of large-scale parallel computing technology, face recognition is now used in many real-world identity-authentication scenarios (for example, security, finance, e-commerce, and other settings that require identity verification, such as remote bank account opening, access control systems, and verification of remote transaction operations). Within face recognition technology, face anti-spoofing is the most critical and indispensable link.
Face anti-spoofing, also called liveness detection, is a technique for distinguishing whether the face in front of the camera comes from a living person or from a fake such as a paper photo, a screen photo, or a mask. By determining whether the detected object is a living individual rather than an inanimate object such as a photo or a video, it prevents malicious attackers from mounting attacks with recorded videos, captured photos, 3D face models, forged masks, and the like.
Disclosure of Invention
The invention provides a living body detection method, a living body detection device, an electronic device, and a storage medium, to address the low security and reliability of face-recognition-based identity verification systems in the prior art.
The technical scheme for solving the technical problems is as follows:
the invention provides a living body detection method, which comprises the following steps:
acquiring a first image and a second image of a target object, wherein the first image and the second image are images of different modalities;
extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, judging whether the target object is a living body according to the fusion feature, and obtaining a judgment result;
and determining the judgment result as the detection and identification result of the target object.
Further, in the above-mentioned living body detection method, the extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, and determining whether the target object is a living body according to the fusion feature to obtain a determination result includes:
inputting the first image and the second image into a dual-stream convolutional network model;
extracting the first object feature and the second object feature through the dual-stream convolutional network model, and performing feature fusion on the first object feature and the second object feature to obtain a fusion feature;
and judging whether the target object is a living body according to the fusion characteristics to obtain a judgment result.
Further, in the above living body detection method, the training process of the dual-stream convolutional network model includes:
acquiring a sample set, wherein the sample set comprises at least one group of sample data, the sample data comprises a first sample image, a second sample image, a modality category identifier and a living body category identifier, the living body category identifier is used for indicating whether the target sample corresponding to the first sample image and the second sample image is a living body, and the modality category identifier is used for indicating whether the first sample image and the second sample image correspond to a consistent target sample (i.e., whether they form a genuinely paired capture);
sequentially carrying out the following training process on each group of the sample data in the sample set:
inputting the sample data into an initial dual-stream convolutional network model;
extracting a first sample feature in the first sample image and extracting a second sample feature in the second sample image;
performing feature fusion on the first sample feature and the second sample feature to obtain a sample fusion feature;
obtaining a first modal prediction probability according to the first sample characteristic, obtaining a second modal prediction probability according to the second sample characteristic, and obtaining a fusion prediction probability according to the sample fusion characteristic;
calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identification and the living body category identification;
and back-propagating the gradient to each layer of the initial dual-stream convolutional network model according to the loss function value, optimizing the parameters of the initial dual-stream convolutional network model, acquiring the next group of sample data from the sample set, and repeating the above training process until the loss function value is smaller than a preset value, at which point the initial dual-stream convolutional network model is taken as the final dual-stream convolutional network model.
Further, in the above method for detecting a living body, the calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identifier, and the living body category identifier includes:
determining a first intermediate value according to the living body category identification and the first modal prediction probability;
determining a second intermediate value according to the living body category identification and the second modal prediction probability;
determining a third intermediate value according to the living body category identification and the fusion prediction probability;
determining a first adjustment factor and a second adjustment factor according to the first intermediate value and the second intermediate value;
and calculating to obtain the loss function value according to the first intermediate value, the second intermediate value, the third intermediate value, the first regulating factor, the second regulating factor and the living body class identifier.
Further, in the above method for detecting a living body, the calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identifier, and the living body category identifier includes:
substituting the first modal prediction probability, the second modal prediction probability, the fusion prediction probability, the modal class identifier and the living body class identifier into the following loss function formula to obtain the loss function value;
the loss function formula L is:
Figure BDA0003209639030000041
LCE=-log(mt)
Figure BDA0003209639030000042
Figure BDA0003209639030000043
Figure BDA0003209639030000044
Figure BDA0003209639030000045
wherein m represents the fusion prediction probability, p represents the first modality prediction probability, q represents the second modality prediction probability, η represents the modality category identifier, y represents the living body category identifier, and λ, α, γ are all preset parameters, where λ is greater than 0.5.
Further, the above-mentioned living body detecting method further includes:
when the sample data contains only one of the first sample image and the second sample image, acquiring the first living body category of the target sample corresponding to that image;
acquiring a sample image whose living body category is consistent with the first living body category to serve as the missing second sample image or first sample image;
and setting the modality category identifier in the sample data formed by the first sample image and the second sample image to indicate that the target samples are inconsistent.
Further, in the above-mentioned living body detecting method, the acquiring the first image and the second image of the target object includes:
shooting the target object by using a binocular camera to obtain a first original image and a second original image;
performing channel transformation on the first original image to obtain a first transformed image;
performing channel transformation on the second original image to obtain a second transformed image;
carrying out image scaling on the first transformation image to obtain a first image;
and carrying out image scaling on the second transformation image to obtain the second image.
The present invention also provides a living body detection apparatus comprising:
an acquisition module, configured to acquire a first image and a second image of a target object, wherein the first image and the second image are images of different modalities;
the detection module is used for extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, judging whether the target object is a living body according to the fusion feature, and obtaining a judgment result;
and the determining module is used for determining the judgment result as the detection and identification result of the target object.
The present invention also provides a living body detecting apparatus comprising: a processor and a memory;
the processor is configured to execute a living body detection program stored in the memory to implement the living body detection method of the first aspect.
The present invention also provides a storage medium storing one or more programs that when executed implement the method of living body detection of the first aspect.
The invention has the beneficial effects that:
A first image and a second image of a target object are acquired, the first image and the second image being images of different modalities; a first object feature is extracted from the first image and a second object feature from the second image, the two features are fused to obtain a fusion feature, and whether the target object is a living body is judged according to the fusion feature to obtain a judgment result; the judgment result is then determined as the detection and identification result of the target object. Liveness detection can therefore be performed on images of the target object in different modalities, which reduces the cost of detection; moreover, because the image features of the different modalities are fused before judging whether the target object is a living body, the correlation between the modalities is taken into account, improving the security and reliability of face-recognition-based identity verification systems.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic diagram of a hardware environment scenario of the living body detection method of the present invention;
FIG. 2 is a flowchart of an embodiment of the living body detection method of the present invention;
FIG. 3 is a structural diagram of an embodiment of the living body detection device of the present invention;
fig. 4 is a structural view of an embodiment of the living body detecting apparatus of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the method of the embodiment of the present invention may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In the case of such a distributed scenario, one device of the multiple devices may only perform one or more steps of the method according to the embodiment of the present invention, and the multiple devices interact with each other to complete the method.
Fig. 1 is a schematic diagram of a hardware environment scenario of the living body detection method of the present invention. In the embodiment of the present invention, the living body detection method described above can be applied to a hardware environment constituted by the terminal 101 and the server 102 as shown in fig. 1. As shown in fig. 1, the server 102 is connected to the terminal 101 through a network and may provide services (such as video services or application services) for the terminal or for a client installed on the terminal; a database may be provided on the server, or separately from the server, to provide data storage services for the server 102. The terminal 101 may be, but is not limited to, a PC, a mobile phone, a tablet computer, and the like.
The living body detection method according to the embodiment of the present invention may be executed by the server 102, the terminal 101, or both the server 102 and the terminal 101. The terminal 101 may execute the living body detection method according to the embodiment of the present invention, or may execute the living body detection method by a client installed thereon.
Taking the terminal as the example executor of the living body detection method of the embodiment of the present invention, the method may be applied to the terminal. Fig. 2 is a flowchart of an embodiment of the living body detection method of the present invention; as shown in fig. 2, the flow of the method may include the following steps:
201. a first image and a second image of a target object are acquired, the first image and the second image being images of different modalities.
In some embodiments, the target object may be any type of living being, for example a human or an animal. The first image and the second image can be captured by a binocular camera module arranged in the detection area; the binocular camera module can acquire an infrared image and an RGB image of the target object simultaneously. The present invention refers to images obtained by different imaging principles as images of different "modalities".
Illustratively, taking the target object as a person as an example, the first image and the second image may be, but are not limited to, face images of a photographed person.
The first image and the second image are typically of the same specification. A neural network model generally requires its inputs to be of a uniform size, so feeding it a first image and a second image that conform to its required specification allows the model to be applied directly and helps ensure the accuracy of the classification task. In addition, selecting a good ROI (region of interest), which is what the "alignment" operation actually does, improves the subsequent classification performance. Typically, an affine transformation or a custom transformation is applied based on the face key points obtained in the face detection stage, but inputs of the same size must be obtained in the end.
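As a minimal sketch of that alignment step, assuming OpenCV and a hypothetical 5-point landmark template (the patent fixes neither the template coordinates nor the 112×112 output size, so both are assumptions made for illustration):

```python
import cv2
import numpy as np

# Hypothetical 5-point template (eye centers, nose tip, mouth corners)
# for a 112x112 crop; these coordinates are illustrative only.
TEMPLATE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                       [41.5, 92.4], [70.7, 92.2]])

def align_face(image, landmarks, size=(112, 112)):
    """Warp a detected face to a canonical pose of fixed size using an
    affine (similarity) transform estimated from face key points."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
    return cv2.warpAffine(image, matrix, size)
```

Whatever transformation is chosen, the key design constraint stated above is preserved: both modalities end up as fixed-size inputs of the same specification.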
In an optional embodiment, a target object is shot by using a binocular camera to obtain a first original image and a second original image; performing channel transformation on the first original image to obtain a first transformed image; performing channel transformation on the second original image to obtain a second transformed image; carrying out image scaling on the first transformation image to obtain a first image; and carrying out image scaling on the second transformation image to obtain a second image.
In some embodiments, the channel transformation is a process of transforming the channels of an image. Take the first original image and the second original image to be an RGB image and an infrared image, respectively: the RGB image is channel-transformed into an HSV image, and the infrared image is channel-transformed into a grayscale image.
Specifically, the process of converting an RGB image into an HSV image is:
max = max(R, G, B); min = min(R, G, B)
if R = max: H = (G - B) / (max - min)
if G = max: H = 2 + (B - R) / (max - min)
if B = max: H = 4 + (R - G) / (max - min)
H = H × 60; if H < 0: H = H + 360
V = max(R, G, B)
S = (max - min) / max
The HSV image is then obtained from the H, S, and V values computed above.
Unlike the three-channel RGB image, the infrared image actually carries only single-channel information. The raw infrared data (for example, in YUYV format) is first converted into a three-channel image similar to RGB (with all three channels identical), and one of those channels is then taken as the grayscale image of the infrared image.
There are various ways to scale an image, for example scaling based on equally spaced sampling of image pixels, or scaling based on extraction of region sub-blocks.
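A minimal preprocessing sketch consistent with the steps above, assuming OpenCV: the 224×224 target size and the interpolation mode are choices made for illustration, and note that cv2.cvtColor implements the same RGB-to-HSV formulas as above except that it rescales H to [0, 180] for 8-bit images.

```python
import cv2

def preprocess_pair(rgb_original, ir_original, size=(224, 224)):
    """Channel-transform and rescale one (RGB, IR) capture pair."""
    # First original image: RGB -> HSV (channel transformation).
    first = cv2.cvtColor(rgb_original, cv2.COLOR_RGB2HSV)
    # Second original image: three identical IR channels -> one grayscale channel.
    second = cv2.cvtColor(ir_original, cv2.COLOR_RGB2GRAY)
    # Image scaling to the input size required by the network.
    first_image = cv2.resize(first, size, interpolation=cv2.INTER_AREA)
    second_image = cv2.resize(second, size, interpolation=cv2.INTER_AREA)
    return first_image, second_image
```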
202. And extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, judging whether the target object is a living body according to the fusion feature, and obtaining a judgment result.
In some embodiments, the object features of the first image and the second image may be extracted in various ways, for example, by a neural network model.
In an optional embodiment, extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, determining whether the target object is a living body according to the fusion feature, and obtaining a determination result, includes:
inputting the first image and the second image into a dual-stream convolutional network model; extracting the first object feature and the second object feature through the dual-stream convolutional network model, and performing feature fusion on the first object feature and the second object feature to obtain a fusion feature; and judging whether the target object is a living body according to the fusion feature to obtain a judgment result.
In some embodiments, the first image and the second image are input into the dual-stream convolutional network model; after the model extracts the first object feature and the second object feature, the two features are fused, which reduces the amount of computation in subsequent processing.
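For illustration, a dual-stream network of this kind might be sketched as follows; the patent does not specify a backbone, so the two small convolutional branches, concatenation as the fusion operation, and the three sigmoid heads (one per modality plus one on the fused feature, matching the three prediction probabilities used during training) are all assumptions.

```python
import torch
import torch.nn as nn

class DualStreamNet(nn.Module):
    """Two convolutional branches, concatenation fusion, three heads."""
    def __init__(self):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rgb_branch = branch(3)   # e.g. the HSV first image
        self.ir_branch = branch(1)    # e.g. the grayscale IR second image
        self.rgb_head = nn.Linear(64, 1)
        self.ir_head = nn.Linear(64, 1)
        self.fusion_head = nn.Linear(128, 1)

    def forward(self, x_rgb, x_ir):
        f1 = self.rgb_branch(x_rgb)           # first object feature
        f2 = self.ir_branch(x_ir)             # second object feature
        fused = torch.cat([f1, f2], dim=1)    # feature fusion
        # Sigmoid outputs: per-modality and fusion "living" probabilities.
        p = torch.sigmoid(self.rgb_head(f1)).squeeze(1)
        q = torch.sigmoid(self.ir_head(f2)).squeeze(1)
        m = torch.sigmoid(self.fusion_head(fused)).squeeze(1)
        return p, q, m
```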
In an alternative embodiment, the training process of the dual-stream convolutional network model includes:
acquiring a sample set, wherein the sample set comprises at least one group of sample data, the sample data comprises a first sample image, a second sample image, a modality category identifier and a living body category identifier, the living body category identifier is used for indicating whether the target sample corresponding to the first sample image and the second sample image is a living body, and the modality category identifier is used for indicating whether the first sample image and the second sample image correspond to a consistent target sample (i.e., whether they form a genuinely paired capture);
sequentially carrying out the following training process on each group of sample data in the sample set:
inputting sample data into an initial dual-stream convolutional network model;
extracting a first sample feature in the first sample image and extracting a second sample feature in the second sample image;
performing feature fusion on the first sample feature and the second sample feature to obtain a sample fusion feature;
obtaining a first modal prediction probability according to the first sample characteristic, obtaining a second modal prediction probability according to the second sample characteristic and obtaining a fusion prediction probability according to the sample fusion characteristic;
calculating to obtain a loss function value based on the first modal prediction probability, the second modal prediction probability, the fusion prediction probability, the modal class identifier and the living body class identifier;
and back-propagating the gradient to each layer of the initial dual-stream convolutional network model according to the loss function value, optimizing the parameters of the initial dual-stream convolutional network model, acquiring the next group of sample data from the sample set, and repeating the above training process until the loss function value is smaller than a preset value, at which point the initial dual-stream convolutional network model is taken as the final dual-stream convolutional network model.
In some embodiments, the living body category identifier of the sample object in the acquired sample data may be set manually: it is marked as 1 when the sample object is a living body, and as 0 when the sample object is a non-living body (a fake such as a paper photo, a screen photo, or a mask).
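A minimal sketch of this training loop is given below, assuming the DualStreamNet sketch above and a loss_fn with the signature of the loss discussed next; the SGD optimizer, learning rate, and per-group stopping check are assumptions.

```python
import torch

def train_dual_stream(model, sample_set, loss_fn, lr=1e-3, threshold=0.01):
    """Iterate over groups of sample data until the loss falls below
    a preset value, back-propagating through both branches."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    while True:
        for x_rgb, x_ir, eta, y in sample_set:  # one group of sample data
            p, q, m = model(x_rgb, x_ir)        # per-modality and fusion probabilities
            loss = loss_fn(m, p, q, y, eta)     # loss described below
            optimizer.zero_grad()
            loss.backward()                     # propagate gradients to each layer
            optimizer.step()                    # optimize model parameters
            if loss.item() < threshold:         # stop when below the preset value
                return model
```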
In an alternative embodiment, the calculating the loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identifier, and the living body category identifier includes:
determining a first intermediate value according to the living body category identification and the first modal prediction probability;
determining a second intermediate value according to the living body category identification and the second modal prediction probability;
determining a third intermediate value according to the living body category identification and the fusion prediction probability;
determining a first adjustment factor and a second adjustment factor according to the first intermediate value and the second intermediate value;
and calculating to obtain a loss function value according to the first intermediate value, the second intermediate value, the third intermediate value, the first regulating factor, the second regulating factor and the living body class identifier.
Specifically, the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identifier and the living body category identifier may be substituted into the following loss function formula to obtain a loss function value;
the loss function equation L is:
L_CE = -log(m_t)
(Only the cross-entropy term above is recoverable as text; the definitions of the intermediate values m_t, p_t, q_t and of the two adjustment factors appear only as equation images in the source document.)
wherein m represents the fusion prediction probability, p represents the first modality prediction probability, q represents the second modality prediction probability, η represents the modality category identifier, y represents the living body category identifier, and λ, α, γ are all preset parameters, where λ is greater than 0.5.
Where α may be, but is not limited to, 0.5, γ may be, but is not limited to, 3, and λ may be, but is not limited to, 0.8.
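Since the concrete formulas survive only as images, the following sketch is a hypothetical reconstruction, not the patent's actual loss: the intermediate values m_t, p_t, q_t follow the usual focal-loss convention, the adjustment factors are chosen only to match the prose (each one grows with the other modality's prediction probability and shrinks with its own), and the defaults λ=0.8, α=0.5, γ=3 come from the paragraph above.

```python
import torch

def liveness_loss(m, p, q, y, eta, lam=0.8, alpha=0.5, gamma=3.0):
    """Hypothetical form of the loss; the true formulas are rendered
    only as images in the source document.

    m, p, q : fusion / first-modality / second-modality predicted
              probabilities of the "living" class, shape (N,).
    y       : living body category identifier (1 living, 0 spoof).
    eta     : modality category identifier (1 genuinely paired capture,
              0 randomly matched pair -> fusion loss masked to zero).
    """
    # Intermediate values: probability assigned to the true class.
    m_t = y * m + (1 - y) * (1 - m)
    p_t = y * p + (1 - y) * (1 - p)
    q_t = y * q + (1 - y) * (1 - q)
    # Assumed adjustment factors: grow with the other modality's
    # confidence, shrink with their own (focal-style modulation).
    w_p = alpha * (1 + q_t - p_t) ** gamma
    w_q = alpha * (1 + p_t - q_t) ** gamma
    l_ce = -torch.log(m_t)  # cross-entropy on the fusion branch
    loss = lam * eta * l_ce \
        + (1 - lam) * (w_p * -torch.log(p_t) + w_q * -torch.log(q_t))
    return loss.mean()
```

Under this form, setting η = 0 zeroes the fusion term exactly as described in the embodiment below, while λ > 0.5 keeps the fusion branch dominant for genuinely paired data.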
In an optional embodiment, the method further comprises: when the sample data contains only one of the first sample image and the second sample image, acquiring the first living body category of the target sample corresponding to that image; acquiring a sample image whose living body category is consistent with the first living body category to serve as the missing second sample image or first sample image; and setting the modality category identifier in the sample data formed by the first sample image and the second sample image to indicate that the target samples are inconsistent.
In this embodiment, when the dual-stream convolutional network model is trained, an infrared image and an RGB image captured at the same shooting time are directly combined into an (IR, RGB) image pair, and the modality category identifier is marked as 1; when only single-modality data exists for a certain shooting time (i.e., only an infrared image or only an RGB image), data of the other modality whose living body category identifier is consistent is randomly matched with it to form an (IR, RGB) image pair, and the modality category identifier is marked as 0.
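The pairing rule can be sketched as follows; the (timestamp, image, label) tuple layout and the dictionary lookup are assumptions made for illustration, and the symmetric case (an RGB image with no matching infrared image) would be handled the same way.

```python
import random

def build_sample_pairs(ir_samples, rgb_samples):
    """ir_samples, rgb_samples: lists of (timestamp, image, y) tuples,
    with y the living body category identifier (1 living, 0 spoof).
    Returns (ir_image, rgb_image, y, eta) training tuples."""
    rgb_by_time = {t: (img, y) for t, img, y in rgb_samples}
    pairs = []
    for t, ir_img, y in ir_samples:
        if t in rgb_by_time:
            # Same shooting time: genuine pair, modality identifier eta = 1.
            pairs.append((ir_img, rgb_by_time[t][0], y, 1))
        else:
            # Single-modality data: randomly match an RGB sample with the
            # same living body category, and mark eta = 0.
            candidates = [img for _, img, y2 in rgb_samples if y2 == y]
            if candidates:
                pairs.append((ir_img, random.choice(candidates), y, 0))
    return pairs
```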
203. And determining the judgment result as the detection and identification result of the target object.
In some embodiments, after the determination result is obtained, the determination result is used as a detection recognition result of the target object, so that the living body detection of the target object is realized.
In the invention, by additionally annotating a "modality category" label, when the training data are not paired on the time axis, the loss of the fusion-feature branch is set to 0; that is, the fusion-feature parameters are not updated, and each modality branch is trained independently. Conversely, when the training data are consistent on the time axis, fusion-feature training is enabled again. This relaxes the restriction that training data must be paired multi-modal input: single-modality data can be used compatibly and effectively alongside normal multi-modal input.
In addition, an important component of the loss function is the adjustment factor (rendered only as an equation image in the source document). It takes the prediction probabilities of both modalities into account (in this embodiment, the infrared branch and the RGB branch): the factor increases with the prediction probability of the other modality and decreases with the prediction probability of its own modality. Compared with a plain cross-entropy loss or a single-modality loss function, this adjustment better addresses the over-fitting problem in multi-modal modeling.
Based on the same concept, an embodiment of the present invention provides a living body detection apparatus; fig. 3 is a structural diagram of an embodiment of the living body detection device of the present invention. For the specific implementation of the apparatus, reference may be made to the description of the method embodiment, and repeated descriptions are omitted. As shown in fig. 3, the apparatus mainly includes:
an acquiring module 31, configured to acquire a first image and a second image of a target object, where the first image and the second image are images of different modalities;
the detection module 32 is configured to extract a first object feature of the first image and a second object feature of the second image, perform feature fusion to obtain a fusion feature, determine whether the target object is a living body according to the fusion feature, and obtain a determination result;
and the determining module 33, configured to determine the judgment result as the detection and identification result of the target object.
Further, in the foregoing embodiment, the detection module 32 is specifically configured to:
the extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, and determining whether the target object is a living body according to the fusion feature to obtain a determination result, including:
inputting the first image and the second image into a dual-stream convolutional network model;
extracting the first object feature and the second object feature through the dual-stream convolutional network model, and performing feature fusion on the first object feature and the second object feature to obtain a fusion feature;
and judging whether the target object is a living body according to the fusion characteristics to obtain a judgment result.
Further, in the foregoing embodiment, the training process of the dual-stream convolutional network model includes:
acquiring a sample set, wherein the sample set comprises at least one group of sample data, the sample data comprises a first sample image, a second sample image, a modality category identifier and a living body category identifier, the living body category identifier is used for indicating whether the target sample corresponding to the first sample image and the second sample image is a living body, and the modality category identifier is used for indicating whether the first sample image and the second sample image correspond to a consistent target sample (i.e., whether they form a genuinely paired capture);
sequentially carrying out the following training process on each group of the sample data in the sample set:
inputting the sample data into an initial dual-stream convolutional network model;
extracting a first sample feature in the first sample image and extracting a second sample feature in the second sample image;
performing feature fusion on the first sample feature and the second sample feature to obtain a sample fusion feature;
obtaining a first modal prediction probability according to the first sample characteristic, obtaining a second modal prediction probability according to the second sample characteristic, and obtaining a fusion prediction probability according to the sample fusion characteristic;
calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identification and the living body category identification;
and back-propagating the gradient to each layer of the initial dual-stream convolutional network model according to the loss function value, optimizing the parameters of the initial dual-stream convolutional network model, acquiring the next group of sample data from the sample set, and repeating the above training process until the loss function value is smaller than a preset value, at which point the initial dual-stream convolutional network model is taken as the final dual-stream convolutional network model.
Further, in the above embodiment, the detecting module 32 is further configured to:
determining a first intermediate value according to the living body category identification and the first modal prediction probability;
determining a second intermediate value according to the living body category identification and the second modal prediction probability;
determining a third intermediate value according to the living body category identification and the fusion prediction probability;
determining a first adjustment factor and a second adjustment factor according to the first intermediate value and the second intermediate value;
and calculating to obtain the loss function value according to the first intermediate value, the second intermediate value, the third intermediate value, the first regulating factor, the second regulating factor and the living body class identifier.
Further, in the above embodiment, the detecting module 32 is further configured to:
substituting the first modal prediction probability, the second modal prediction probability, the fusion prediction probability, the modal class identifier and the living body class identifier into the following loss function formula to obtain the loss function value;
the loss function formula L is:
L_CE = -log(m_t)
(Only the cross-entropy term above is recoverable as text; the definitions of the intermediate values m_t, p_t, q_t and of the two adjustment factors appear only as equation images in the source document.)
wherein m represents the fusion prediction probability, p represents the first modality prediction probability, q represents the second modality prediction probability, η represents the modality category identifier, y represents the living body category identifier, and λ, α, γ are all preset parameters, where λ is greater than 0.5.
Further, in the above embodiment, the detecting module 32 is further configured to:
when the sample data contains only one of the first sample image and the second sample image, acquiring the first living body category of the target sample corresponding to that image;
acquiring a sample image whose living body category is consistent with the first living body category to serve as the missing second sample image or first sample image;
and setting the modality category identifier in the sample data formed by the first sample image and the second sample image to indicate that the target samples are inconsistent.
Further, in the foregoing embodiment, the obtaining module 31 is further configured to:
shooting the target object by using a binocular camera to obtain a first original image and a second original image;
performing channel transformation on the first original image to obtain a first transformed image;
performing channel transformation on the second original image to obtain a second transformed image;
carrying out image scaling on the first transformation image to obtain a first image;
and carrying out image scaling on the second transformation image to obtain the second image.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and specific implementation schemes thereof may refer to the method described in the foregoing embodiment and relevant descriptions in the method embodiment, and have beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 4 is a schematic structural diagram of an embodiment of the living body detection apparatus of the present invention. As shown in fig. 4, the apparatus of this embodiment may include: a processor 1010 and a memory 1020. Those skilled in the art will appreciate that the device may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, wherein the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively coupled to each other within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The present invention also provides a storage medium storing one or more programs that when executed implement the in-vivo detection method of the above-described embodiments.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of in vivo detection, comprising:
acquiring a first image and a second image of a target object, wherein the first image and the second image are images of different modalities;
extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, judging whether the target object is a living body according to the fusion feature, and obtaining a judgment result;
and determining the judgment result as the detection and identification result of the target object.
2. The living body detection method according to claim 1, wherein the extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, and determining whether the target object is a living body according to the fusion feature to obtain a determination result comprises:
inputting the first image and the second image into a dual-stream convolutional network model;
extracting the first object feature and the second object feature through the dual-stream convolutional network model, and performing feature fusion on the first object feature and the second object feature to obtain a fusion feature;
and judging whether the target object is a living body according to the fusion characteristics to obtain a judgment result.
3. The in-vivo detection method according to claim 2, wherein the training process of the dual-stream convolutional network model comprises:
acquiring a sample set, wherein the sample set comprises at least one group of sample data, the sample data comprises a first sample image, a second sample image, a modality category identifier and a living body category identifier, the living body category identifier is used for indicating whether the target sample corresponding to the first sample image and the second sample image is a living body, and the modality category identifier is used for indicating whether the first sample image and the second sample image correspond to a consistent target sample (i.e., whether they form a genuinely paired capture);
sequentially carrying out the following training process on each group of the sample data in the sample set:
inputting the sample data into an initial dual-stream convolutional network model;
extracting a first sample feature in the first sample image and extracting a second sample feature in the second sample image;
performing feature fusion on the first sample feature and the second sample feature to obtain a sample fusion feature;
obtaining a first modal prediction probability according to the first sample characteristic, obtaining a second modal prediction probability according to the second sample characteristic, and obtaining a fusion prediction probability according to the sample fusion characteristic;
calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identification and the living body category identification;
and back-propagating the gradient to each layer of the initial dual-stream convolutional network model according to the loss function value, optimizing the parameters of the initial dual-stream convolutional network model, acquiring the next group of sample data from the sample set, and repeating the above training process until the loss function value is smaller than a preset value, at which point the initial dual-stream convolutional network model is taken as the final dual-stream convolutional network model.
4. The in vivo detection method as defined in claim 3, wherein the calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identification, and the in vivo category identification comprises:
determining a first intermediate value according to the living body category identification and the first modal prediction probability;
determining a second intermediate value according to the living body category identification and the second modal prediction probability;
determining a third intermediate value according to the living body category identification and the fusion prediction probability;
determining a first adjustment factor and a second adjustment factor according to the first intermediate value and the second intermediate value;
and calculating to obtain the loss function value according to the first intermediate value, the second intermediate value, the third intermediate value, the first regulating factor, the second regulating factor and the living body class identifier.
5. The in vivo detection method as set forth in claim 3 or 4, wherein the calculating a loss function value based on the first modality prediction probability, the second modality prediction probability, the fusion prediction probability, the modality category identification, and the in vivo category identification comprises:
substituting the first modal prediction probability, the second modal prediction probability, the fusion prediction probability, the modal class identifier and the living body class identifier into the following loss function formula to obtain the loss function value;
the loss function formula L is:
L_CE = -log(m_t)
(Only the cross-entropy term above is recoverable as text; the definitions of the intermediate values m_t, p_t, q_t and of the two adjustment factors appear only as equation images in the source document.)
wherein m represents the fusion prediction probability, p represents the first modality prediction probability, q represents the second modality prediction probability, η represents the modality category identifier, y represents the living body category identifier, and λ, α, γ are all preset parameters, where λ is greater than 0.5.
6. The in-vivo detection method according to claim 3, further comprising:
when the sample data contains only one of the first sample image and the second sample image, acquiring the first living body category of the target sample corresponding to that image;
acquiring a sample image whose living body category is consistent with the first living body category to serve as the missing second sample image or first sample image;
and setting the modality category identifier in the sample data formed by the first sample image and the second sample image to indicate that the target samples are inconsistent.
7. The in-vivo detection method according to claim 1, wherein the acquiring the first image and the second image of the target object comprises:
shooting the target object by using a binocular camera to obtain a first original image and a second original image;
performing channel transformation on the first original image to obtain a first transformed image;
performing channel transformation on the second original image to obtain a second transformed image;
carrying out image scaling on the first transformation image to obtain a first image;
and carrying out image scaling on the second transformation image to obtain the second image.
8. A living body detection device, comprising:
an acquisition module, configured to acquire a first image and a second image of a target object, wherein the first image and the second image are images of different modalities;
the detection module is used for extracting a first object feature of the first image and a second object feature of the second image, performing feature fusion to obtain a fusion feature, judging whether the target object is a living body according to the fusion feature, and obtaining a judgment result;
and the determining module is used for determining the judgment result as the detection and identification result of the target object.
9. A living body examination apparatus, comprising: a processor and a memory;
the processor is configured to execute a liveness detection program stored in the memory to implement the liveness detection method of any one of claims 1-7.
10. A storage medium characterized in that the storage medium stores one or more programs that can be executed to implement the living body detecting method according to any one of claims 1 to 7.
CN202110927106.6A 2021-08-12 2021-08-12 Living body detection method, living body detection device, living body detection equipment and storage medium Active CN113642639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110927106.6A CN113642639B (en) 2021-08-12 2021-08-12 Living body detection method, living body detection device, living body detection equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110927106.6A CN113642639B (en) 2021-08-12 2021-08-12 Living body detection method, living body detection device, living body detection equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113642639A true CN113642639A (en) 2021-11-12
CN113642639B CN113642639B (en) 2024-03-01

Family

ID=78421254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110927106.6A Active CN113642639B (en) 2021-08-12 2021-08-12 Living body detection method, living body detection device, living body detection equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113642639B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202805A (en) * 2021-11-24 2022-03-18 北京百度网讯科技有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium
CN114581838A (en) * 2022-04-26 2022-06-03 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device and cloud equipment
CN114677573A (en) * 2022-05-30 2022-06-28 上海捷勃特机器人有限公司 Visual classification method, system, device and computer readable medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034102A (en) * 2018-08-14 2018-12-18 腾讯科技(深圳)有限公司 Human face in-vivo detection method, device, equipment and storage medium
CN110070030A (en) * 2019-04-18 2019-07-30 北京迈格威科技有限公司 Image recognition and the training method of neural network model, device and system
CN110348319A (en) * 2019-06-18 2019-10-18 武汉大学 A kind of face method for anti-counterfeit merged based on face depth information and edge image
CN110889312A (en) * 2018-09-07 2020-03-17 北京市商汤科技开发有限公司 Living body detection method and apparatus, electronic device, computer-readable storage medium
CN111368601A (en) * 2018-12-26 2020-07-03 北京市商汤科技开发有限公司 Living body detection method and apparatus, electronic device, and computer-readable storage medium
CN111597918A (en) * 2020-04-26 2020-08-28 北京金山云网络技术有限公司 Training and detecting method and device of human face living body detection model and electronic equipment
CN112489092A (en) * 2020-12-09 2021-03-12 浙江中控技术股份有限公司 Fine-grained industrial motion mode classification method, storage medium, equipment and device
WO2021114633A1 (en) * 2020-05-20 2021-06-17 平安科技(深圳)有限公司 Image confidence determination method, apparatus, electronic device, and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034102A (en) * 2018-08-14 2018-12-18 腾讯科技(深圳)有限公司 Human face in-vivo detection method, device, equipment and storage medium
CN110889312A (en) * 2018-09-07 2020-03-17 北京市商汤科技开发有限公司 Living body detection method and apparatus, electronic device, computer-readable storage medium
CN111368601A (en) * 2018-12-26 2020-07-03 北京市商汤科技开发有限公司 Living body detection method and apparatus, electronic device, and computer-readable storage medium
CN110070030A (en) * 2019-04-18 2019-07-30 北京迈格威科技有限公司 Image recognition and the training method of neural network model, device and system
CN110348319A (en) * 2019-06-18 2019-10-18 武汉大学 A kind of face method for anti-counterfeit merged based on face depth information and edge image
CN111597918A (en) * 2020-04-26 2020-08-28 北京金山云网络技术有限公司 Training and detecting method and device of human face living body detection model and electronic equipment
WO2021114633A1 (en) * 2020-05-20 2021-06-17 平安科技(深圳)有限公司 Image confidence determination method, apparatus, electronic device, and storage medium
CN112489092A (en) * 2020-12-09 2021-03-12 浙江中控技术股份有限公司 Fine-grained industrial motion mode classification method, storage medium, equipment and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202805A (en) * 2021-11-24 2022-03-18 北京百度网讯科技有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium
CN114581838A (en) * 2022-04-26 2022-06-03 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device and cloud equipment
CN114581838B (en) * 2022-04-26 2022-08-26 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device and cloud equipment
CN114677573A (en) * 2022-05-30 2022-06-28 上海捷勃特机器人有限公司 Visual classification method, system, device and computer readable medium
CN114677573B (en) * 2022-05-30 2022-08-26 上海捷勃特机器人有限公司 Visual classification method, system, device and computer readable medium

Also Published As

Publication number Publication date
CN113642639B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN109508694B (en) Face recognition method and recognition device
CN109948408B (en) Activity test method and apparatus
US10650261B2 (en) System and method for identifying re-photographed images
CN114913565B (en) Face image detection method, model training method, device and storage medium
US12014571B2 (en) Method and apparatus with liveness verification
CN110232369B (en) Face recognition method and electronic equipment
CN113642639B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
TW202026948A (en) Methods and devices for biological testing and storage medium thereof
US11227149B2 (en) Method and apparatus with liveness detection and object recognition
WO2021137946A1 (en) Forgery detection of face image
CN109886223B (en) Face recognition method, bottom library input method and device and electronic equipment
CN106056083B (en) A kind of information processing method and terminal
CN110532746B (en) Face checking method, device, server and readable storage medium
CN109948420B (en) Face comparison method and device and terminal equipment
CN108229375B (en) Method and device for detecting face image
CN111104833A (en) Method and apparatus for in vivo examination, storage medium, and electronic device
US20200218772A1 (en) Method and apparatus for dynamically identifying a user of an account for posting images
CN110059607B (en) Living body multiplex detection method, living body multiplex detection device, computer equipment and storage medium
CN114387548A (en) Video and liveness detection method, system, device, storage medium and program product
CN111582155B (en) Living body detection method, living body detection device, computer equipment and storage medium
CN113033305A (en) Living body detection method, living body detection device, terminal equipment and storage medium
CN111353325A (en) Key point detection model training method and device
CN110363111B (en) Face living body detection method, device and storage medium based on lens distortion principle
KR20210058882A (en) Facial recognition method and device
KR20200083188A (en) Method and apparatus for detecting liveness and object recognition method using same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant