CN111368811A - Living body detection method, living body detection device, living body detection equipment and storage medium

Living body detection method, living body detection device, living body detection equipment and storage medium

Info

Publication number: CN111368811A (application CN202010455648.3A)
Authority: CN (China)
Prior art keywords: target, initial, image, feature, sound wave
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202010455648.3A
Other languages: Chinese (zh)
Other versions: CN111368811B
Inventors: 郭子毅, 梁健, 白琨
Current and original assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority: CN202010455648.3A (the priority date is an assumption, not a legal conclusion)
Publication of CN111368811A; application granted; publication of CN111368811B
Legal status: Active (anticipated expiration not listed)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40: Spoof detection, e.g. liveness detection
    • G06V40/45: Detection of the body part being alive
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01V: GEOPHYSICS; GRAVITATIONAL MEASUREMENTS; DETECTING MASSES OR OBJECTS; TAGS
    • G01V11/00: Prospecting or detecting by methods combining techniques covered by two or more of main groups G01V1/00 - G01V9/00
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques

Abstract

The application discloses a living body detection method, device, equipment and storage medium. The method comprises the following steps: acquiring a target image and a target reflected sound wave corresponding to an object to be detected; extracting image features of the target image and sound wave features of the target reflected sound wave; classifying the image features and obtaining a first living body detection result of the object to be detected according to the classification result; classifying the sound wave features and obtaining a second living body detection result of the object to be detected according to the classification result; matching the image features and the sound wave features to obtain a matching result; and determining a target living body detection result of the object to be detected based on the first living body detection result, the second living body detection result and the matching result. In this process, in addition to image information, reflected sound wave information and the matching information between the two are considered, so the information taken into account is comprehensive, the ability to defend against attacks is improved, detection stability is high, and the accuracy of the living body detection result is high.

Description

Living body detection method, living body detection device, living body detection equipment and storage medium
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a living body detection method, device, equipment and storage medium.
Background
With the development of artificial intelligence technology, face recognition has been widely applied in fields such as security and finance. In the process of face recognition, in addition to identity verification, living body detection is also required. Living body detection is a method of verifying real physiological characteristics: it can verify whether a user is a living body, thereby effectively resisting attack means such as photos, videos and masks, helping to screen out fraudulent behavior, and safeguarding the user's interests.
In the related art, an image corresponding to an object to be detected is acquired, and living body detection is performed on the object based on that image to obtain a living body detection result. This living body detection process focuses only on image information; the information considered is limited and easy to attack, so the accuracy of the living body detection result is low and the living body detection effect is poor.
Disclosure of Invention
The embodiment of the application provides a living body detection method, a living body detection device, living body detection equipment and a storage medium, which can be used for improving the accuracy of a living body detection result.
In one aspect, an embodiment of the present application provides a method for detecting a living body, where the method includes:
acquiring a target image and a target reflected sound wave corresponding to an object to be detected;
extracting image features of the target image and sound wave features of the target reflected sound wave;
classifying the image features, and obtaining a first living body detection result of the object to be detected according to the classification result; classifying the sound wave features, and obtaining a second living body detection result of the object to be detected according to the classification result;
matching the image features and the sound wave features to obtain a matching result of the image features and the sound wave features;
and determining a target living body detection result of the object to be detected based on the first living body detection result, the second living body detection result and the matching result.
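For illustration, the overall flow of these steps can be sketched in Python as follows; every function and model name here is a hypothetical placeholder and the thresholds are assumptions, not part of the embodiments of the present application:

```python
# Illustrative sketch of the claimed pipeline; all names and thresholds
# below are hypothetical placeholders, not the patent's implementation.

def detect_liveness(target_image, target_reflected_wave,
                    image_model, wave_model,
                    image_classifier, wave_classifier,
                    discriminator, match_threshold=0.5):
    # Extract image features and sound wave features.
    image_feat = image_model(target_image)
    wave_feat = wave_model(target_reflected_wave)

    # Classify each feature: two preliminary liveness results
    # (classifiers are assumed to return the living-class probability).
    first_result_is_live = image_classifier(image_feat) > 0.5
    second_result_is_live = wave_classifier(wave_feat) > 0.5

    # Match the two features: do they come from the same object?
    match_ok = discriminator(image_feat, wave_feat) >= match_threshold

    # The target result passes only if all three conditions hold.
    return first_result_is_live and second_result_is_live and match_ok
```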
In another aspect, there is provided a living body detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring a target image and a target reflected sound wave corresponding to an object to be detected;
an extracting unit configured to extract an image feature of the target image and a sound wave feature of the target reflected sound wave;
the classification processing unit is used for performing classification processing on the image characteristics and obtaining a first living body detection result of the object to be detected according to a classification result; classifying the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to a classification result;
the matching processing unit is used for matching the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics;
a determination unit configured to determine a target in-vivo detection result of the object to be detected based on the first in-vivo detection result, the second in-vivo detection result, and the matching result.
In a possible implementation manner, the extracting unit is configured to invoke a target image feature extraction model to extract image features of the target image; calling a target sound wave feature extraction model to extract the sound wave features of the target reflected sound waves;
the classification processing unit is used for calling a first target classification model to perform classification processing on the image characteristics and obtaining a first living body detection result of the object to be detected according to a classification result; calling a second target classification model to classify the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result;
and the matching processing unit is used for calling a target discriminator model to match the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics.
In a possible implementation manner, the determining unit is configured to, in response to the first living body detection result indicating that the object to be detected is a living body, the second living body detection result indicating that the object to be detected is a living body, and the matching result indicating that the image feature and the sound wave feature are successfully matched, take passing the living body detection as the target living body detection result of the object to be detected.
In a possible implementation manner, the obtaining unit is further configured to obtain a training sample set, where the training sample set includes at least two training sample subsets, different training sample subsets correspond to different targets, any training sample subset includes at least one training sample corresponding to any target, and any training sample corresponding to any target includes any image corresponding to any target and a reflected sound wave corresponding to any image;
the device further comprises:
the training unit is used for selecting training samples from at least two training sample subsets to form target training samples respectively, training an initial image feature extraction model, an initial sound wave feature extraction model, a first initial classification model, a second initial classification model and an initial discriminator model in a to-be-trained living body detection model based on the target training samples to obtain the living body detection model, and the living body detection model comprises a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model.
In a possible implementation manner, the training unit is configured to invoke an initial image feature extraction model to extract initial image features of images in the target training sample; calling an initial sound wave feature extraction model to extract initial sound wave features of reflected sound waves in the target training sample; calling a first initial classification model to classify the initial image features, and obtaining a first initial living body detection result according to a classification result; calling a second initial classification model to classify the initial sound wave characteristics, and obtaining a second initial living body detection result according to a classification result; forming sample feature groups based on the initial image features and the initial sound wave features, wherein any sample feature group is formed by one initial image feature and one initial sound wave feature; training the first initial classification model based on the first initial in-vivo detection result; training the second initial classification model based on the second initial in-vivo detection result; training the initial discriminator model based on the sample feature group; training the initial image feature extraction model based on the first initial in-vivo detection result and the sample feature group; training the initial acoustic feature extraction model based on the second initial in-vivo detection result and the sample feature group; and responding to the training to obtain a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model, and obtaining a living body detection model.
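One way to read this joint training scheme is as a single optimization step that combines a classification loss for each branch with a matching loss from the discriminator. The following PyTorch-style sketch is an assumed reconstruction, not the patent's implementation; it presumes all submodels output probabilities and that sample feature groups are formed by pairing the i-th image with the i-th reflected sound wave under a precomputed match label:

```python
import torch
import torch.nn.functional as F

def training_step(images, waves, live_labels, match_labels,
                  img_extractor, wave_extractor,
                  img_classifier, wave_classifier, discriminator,
                  optimizer):
    # Initial features from both branches.
    img_feats = img_extractor(images)
    wave_feats = wave_extractor(waves)

    # First/second initial liveness results and their classification losses.
    loss_img_cls = F.binary_cross_entropy(img_classifier(img_feats), live_labels)
    loss_wave_cls = F.binary_cross_entropy(wave_classifier(wave_feats), live_labels)

    # Matching loss over the sample feature groups.
    loss_match = F.binary_cross_entropy(
        discriminator(img_feats, wave_feats), match_labels)

    # Each extractor is trained on its classification loss plus the
    # matching loss, as the text above describes; summing the terms
    # realizes that through shared backpropagation.
    loss = loss_img_cls + loss_wave_cls + loss_match
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```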
In one possible implementation manner, the sample feature groups include positive sample feature groups and negative sample feature groups, and the training unit is further configured to determine, based on the initial image features and the initial sound wave features, initial image feature-initial sound wave feature pairs that satisfy a positive matching condition and initial image feature-initial sound wave feature pairs that satisfy a negative matching condition; and to form positive sample feature groups based on the pairs satisfying the positive matching condition and negative sample feature groups based on the pairs satisfying the negative matching condition. A pair satisfies the positive matching condition when the image corresponding to the initial image feature and the reflected sound wave corresponding to the initial sound wave feature come from the same training sample among the target training samples; a pair satisfies the negative matching condition when the image and the reflected sound wave come from two training samples meeting a reference condition among the target training samples, the two training samples meeting the reference condition being two training samples from different training sample subsets.
In one possible implementation manner, the initial discriminator model includes an initial similarity operator model, and the training unit is further configured to invoke the initial similarity operator model, calculate a first similarity value between an initial image feature and an initial acoustic wave feature in the positive sample feature group, and calculate a second similarity value between an initial image feature and an initial acoustic wave feature in the negative sample feature group; calculating a first target loss function based on the first similarity value, the second similarity value, and a first reference threshold; updating parameters of the initial similarity operator model based on the first objective loss function.
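The patent does not give the exact form of this first target loss function; one plausible reading, with the first reference threshold acting as a margin, is a hinge loss over the two similarity values (a sketch, not the claimed formula):

```python
import numpy as np

def first_target_loss(s_pos, s_neg, margin):
    """Assumed hinge form: push the first similarity value (positive
    groups) above the second similarity value (negative groups) by at
    least the first reference threshold, used here as a margin."""
    return np.maximum(0.0, margin - (s_pos - s_neg)).mean()

# Example: positive pairs should score at least 0.3 higher than negative.
loss = first_target_loss(np.array([0.9, 0.6]), np.array([0.2, 0.5]), margin=0.3)
```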
In one possible implementation manner, the positive sample feature set corresponds to a first tag, the negative sample feature set corresponds to a second tag, and the initial discriminator model includes an initial feature fusion submodel and an initial classification processing submodel; the training unit is further used for inputting the initial image features and the initial sound wave features in the positive sample feature group into the initial feature fusion submodel for fusion to obtain first fusion features; calling the initial classification processing submodel to perform classification processing on the first fusion characteristics to obtain a first classification result; inputting the initial image feature and the initial sound wave feature in the negative sample feature group into the initial feature fusion sub-model for fusion to obtain a second fusion feature; calling the initial classification processing submodel to perform classification processing on the second fusion characteristics to obtain a second classification result; calculating a second target loss function based on a loss function between the first classification result and the first label and a loss function between the second classification result and the second label; and updating the parameters of the initial feature fusion sub-model and the initial classification processing sub-model based on the second target loss function.
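A minimal sketch of such a discriminator, with the feature fusion submodel realized as concatenation plus a linear layer and the classification submodel as a small sigmoid head (all dimensions and layer choices are assumptions):

```python
import torch
import torch.nn as nn

class FusionDiscriminator(nn.Module):
    """Assumed structure: concatenate the two features (fusion submodel),
    then a small MLP with sigmoid output (classification submodel)."""
    def __init__(self, img_dim=128, wave_dim=128, hidden=64):
        super().__init__()
        self.fuse = nn.Linear(img_dim + wave_dim, hidden)        # fusion
        self.classify = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1),
                                      nn.Sigmoid())              # classification
    def forward(self, img_feat, wave_feat):
        fused = self.fuse(torch.cat([img_feat, wave_feat], dim=-1))
        return self.classify(fused).squeeze(-1)

# Second target loss: BCE with the first label (1, positive groups)
# and the second label (0, negative groups).
model = FusionDiscriminator()
bce = nn.BCELoss()
img_pos, wave_pos = torch.randn(8, 128), torch.randn(8, 128)
img_neg, wave_neg = torch.randn(8, 128), torch.randn(8, 128)
loss = (bce(model(img_pos, wave_pos), torch.ones(8)) +
        bce(model(img_neg, wave_neg), torch.zeros(8)))
```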
In a possible implementation manner, the obtaining unit is further configured to obtain an attack sample set, where the attack sample set includes at least one attack sample subset, where different attack sample subsets correspond to different attackers, and any attack sample subset includes at least one attack sample corresponding to any attacker, and any attack sample corresponding to any attacker includes any image corresponding to any attacker and reflected sound waves corresponding to any image;
the device further comprises:
the optimization unit is used for selecting attack samples in the attack sample set to form target attack samples, and calling the target image feature extraction model to extract the target image features of the images in the target attack samples; calling the target sound wave feature extraction model to extract the target sound wave features of the reflected sound waves in the target attack sample; based on the target image characteristics and the target sound wave characteristics, attack characteristic groups are formed, and any attack characteristic group is formed by one target image characteristic and one target sound wave characteristic; and optimizing a target discriminator model in the in-vivo detection model based on the attack feature group, and obtaining the optimized in-vivo detection model based on the optimized target discriminator model.
In a possible implementation manner, the target discriminator model includes a target similarity degree operator model, and the optimization unit is further configured to obtain a reference feature group, where any one of the reference feature groups is composed of one reference image feature and one reference acoustic wave feature; calling the target similarity degree operator model, and calculating a third similarity value between the reference image feature and the reference sound wave feature in the reference feature group and a fourth similarity value between the target image feature and the target sound wave feature in the attack feature group; calculating a third target loss function based on the third similarity value, the fourth similarity value, and a second reference threshold; updating parameters of the target similarity degree operator model based on the third target loss function.
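Under the same reading as the first target loss function above, the third target loss function can be sketched as the analogous hinge, with the reference feature groups playing the positive role, the attack feature groups the negative role, and the second reference threshold as the margin (an assumed form):

```python
import numpy as np

def third_target_loss(s_reference, s_attack, margin):
    # Assumed form: push reference-pair similarity (third similarity
    # value) above attack-pair similarity (fourth similarity value)
    # by at least the second reference threshold.
    return np.maximum(0.0, margin - (s_reference - s_attack)).mean()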
In one possible implementation manner, the attack feature group corresponds to a second tag, and the target discriminator model includes a target feature fusion submodel and a target classification processing submodel; the optimization unit is further configured to input the target image features and the target sound wave features in the attack feature group into the target feature fusion sub-model for fusion to obtain third fusion features; calling the target classification processing sub-model to classify the third fusion characteristics to obtain a third classification result; and updating the parameters of the target feature fusion submodel and the target classification processing submodel based on the loss function between the third classification result and the second label.
In a possible implementation manner, the acquiring unit is further configured to acquire, for any target object, an image sequence of the target object and a reflected sound wave reflected by the target object; divide the reflected sound wave into at least one reflected sub-sound wave; align the at least one reflected sub-sound wave with at least one image in the image sequence to obtain the reflected sub-sound waves aligned with each image; for any image in the at least one image, construct the reflected sound wave corresponding to that image based on the reflected sub-sound waves aligned with it; form a training sample corresponding to the target object based on the image and its corresponding reflected sound wave; and form a training sample subset based on the at least one training sample corresponding to the target object formed from the at least one image.
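As a concrete illustration of this alignment (with assumed frame rate, sampling rate, and sub-wave length), the recorded reflected sound wave can be sliced into fixed-length sub-waves and each image paired with the sub-wave nearest its timestamp:

```python
import numpy as np

def align_subwaves(wave, sample_rate, frame_times, subwave_dur=0.01):
    """Split a recorded reflected wave into fixed-length sub-waves and
    pair each image frame with the sub-wave whose start time is closest
    to the frame timestamp. All parameters are illustrative assumptions."""
    step = int(subwave_dur * sample_rate)
    subwaves = [wave[i:i + step] for i in range(0, len(wave) - step + 1, step)]
    starts = np.arange(len(subwaves)) * subwave_dur
    return [subwaves[int(np.abs(starts - t).argmin())] for t in frame_times]

# Example: 1 s of audio at 48 kHz, image frames at 30 fps.
wave = np.random.randn(48000)
aligned = align_subwaves(wave, 48000, frame_times=np.arange(0, 1, 1 / 30))
```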
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein at least one program code, the at least one program code being loaded and executed by the processor to implement any of the above-mentioned liveness detection methods.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement any of the above-mentioned living body detection methods.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
the method comprises the steps of obtaining an image and a reflected sound wave corresponding to an object to be detected, further extracting image characteristics and sound wave characteristics, and determining a living body detection result of the object to be detected by comprehensively considering a living body detection result obtained based on the image characteristics, a living body detection result obtained based on the sound wave characteristics and a matching result of the image characteristics and the sound wave characteristics. In the process of the living body detection, besides information in the aspect of images, information in the aspect of reflected sound waves and matching information between the two aspects are considered, the considered information is comprehensive, the defense capability against attacks can be improved, the detection stability is high, the accuracy of the living body detection result is improved, and the living body detection effect is good.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a living body detection method provided by an embodiment of the present application;
FIG. 2 is a flowchart of a living body detection method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a process for acquiring reflected sound waves reflected by a real human face according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the installation of an image acquisition device, a sound wave transmitting device and a sound wave receiving device in a mobile phone according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for invoking a living body detection model to perform living body detection according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a password entry interface provided by an embodiment of the present application;
FIG. 7 is a flowchart of a method for training a living body detection model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a process for obtaining a training sample corresponding to any target object according to an embodiment of the present application;
FIG. 9 is a flowchart of training a living body detection model to be trained based on target training samples according to an embodiment of the present application;
FIG. 10 is a flowchart of optimizing a living body detection model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a process for optimizing a target discriminator model according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a process for obtaining an optimized living body detection model according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a living body detection apparatus provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of a living body detection apparatus provided by an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The solutions provided by the embodiments of the present application relate to the computer vision technology of artificial intelligence. Computer vision is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to recognize, track and measure targets, and further processes images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image inpainting, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three-dimension) technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric technologies such as face recognition and fingerprint recognition. The process of face recognition generally involves living body detection technology.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Referring to fig. 1, a schematic diagram of an implementation environment of the method provided in the embodiment of the present application is shown. The implementation environment may include: a terminal 11 and a server 12.
The terminal 11 is provided with a face recognition system, and in the process of performing face recognition on an object to be detected by using the face recognition system, the method provided by the embodiment of the present application can be applied to perform living body detection on the object to be detected. The terminal 11 may collect an image and a reflected sound wave corresponding to an object to be detected, and then perform living body detection on the object to be detected based on the image and the reflected sound wave corresponding to the object to be detected. Of course, the terminal 11 may also send the acquired image and reflected sound wave corresponding to the object to be detected to the server 12, and the server 12 performs living body detection on the object to be detected based on the image and reflected sound wave corresponding to the object to be detected. In one possible implementation, the server 12 transmits the result of the live body test to the terminal 11 after the live body test is performed.
In one possible implementation manner, the terminal 11 may be any electronic product capable of human-computer interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example a PC (Personal Computer), a mobile phone, a smartphone, a PDA (Personal Digital Assistant), a wearable device, a pocket PC, a tablet computer, a smart in-vehicle device, a smart television, a smart speaker, and the like. The server 12 may be a single server, a server cluster composed of a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
It should be understood by those skilled in the art that the above-mentioned terminal 11 and server 12 are only examples, and other existing or future terminals or servers may be suitable for the present application and are included within the scope of the present application and are herein incorporated by reference.
Based on the implementation environment shown in fig. 1, the embodiment of the present application provides a method for detecting a living body, which is applied to the terminal 11 as an example. As shown in fig. 2, the method provided by the embodiment of the present application may include the following steps.
In step 201, a target image and a target reflected sound wave corresponding to an object to be detected are acquired.
The object to be detected refers to any object that needs to undergo living body detection. It should be noted that the object to be detected may be a real human body or a non-real human body, which is not limited in the embodiments of the present application. When the object to be detected is a non-real human body, it may be a printed photograph, a screen displaying an image, a person wearing a mask, or the like; of course, in some exemplary embodiments, the object to be detected may also be an animal, a plant, or the like.
Whichever object is to be detected, during living body detection the object is located in the image acquisition area of the terminal, and after receiving a living body detection instruction, the terminal acquires a target image corresponding to the object to be detected. In a possible implementation manner, an image acquisition device is installed in the terminal, and the terminal uses the image acquisition device to acquire the target image corresponding to the object to be detected. For example, in a face recognition scenario, the object to be detected usually faces the terminal screen, so the image acquisition device used is usually the front-facing image acquisition device of the terminal. The image acquisition device is not limited in the embodiments of the present application as long as it can acquire images; illustratively, it is a camera.
The screen of the terminal may display an image acquisition frame, and the object to be detected can move so that the detection part used for living body detection falls within the image acquisition frame. The target image corresponding to the object to be detected may refer to an image of the detection part appearing in the image acquisition frame. For example, when the object to be detected is a real human body, the target image may be a facial image of the human body; when the object to be detected is a printed photograph of a human body, the target image may be the face image in that photograph.
In one possible implementation manner, the terminal acquires the target image corresponding to the object to be detected as follows: the terminal uses the image acquisition device to collect a video corresponding to the object to be detected, and captures one frame of the video as the target image. The frame may be captured at random or at a reference position in the video. The reference position may be set empirically or determined according to the emission period of the sound wave, which is not limited in the embodiments of the present application. For example, the reference position is the position corresponding to the 6 ms timestamp in the video.
In addition to acquiring the target image corresponding to the object to be detected, the target reflected sound wave corresponding to the object to be detected is acquired. During propagation, a sound wave is reflected when it encounters an obstacle or a surface, and the resulting reflected sound wave contains not only distance information but also information about the shape and material of the reflecting surface or obstacle. In one possible implementation manner, the terminal acquires the target reflected sound wave as follows: the terminal transmits a sound wave toward the object to be detected and takes the received reflected sound wave satisfying a condition as the target reflected sound wave. In one possible implementation, the reflected sound wave satisfying the condition may refer to the reflected sound wave received within a reference time interval after the sound wave is transmitted. The reference time interval may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. The reflected sound wave received within the reference time interval can be regarded as the sound wave reflected by the object to be detected, which reduces interference from other reflected sound waves. In a possible implementation manner, after the terminal transmits the sound wave, it may receive not only the reflected sound wave but also the direct sound wave from the transmitter; the direct sound wave travels a shorter path and is attenuated less than the reflected sound wave, so it is filtered out to avoid its adverse effects.
It should be noted that the target reflected sound wave corresponding to the object to be detected may refer to a real reflected sound wave obtained by reflecting the emitted sound wave by the object to be detected, or may refer to a prepared false reflected sound wave. For example, it is assumed that in the process of performing living body detection on an object to be detected, a transmitting port of an acoustic wave transmitting device of a terminal is blocked, and reflected acoustic waves of other objects prepared in advance are artificially played at the position of the object to be detected, in this case, a target reflected acoustic wave corresponding to the object to be detected acquired by the terminal is a false reflected acoustic wave prepared in advance.
It should be noted that, for the case that the object to be detected is a real human body, if the target reflected sound wave corresponding to the object to be detected is a real reflected sound wave obtained by reflecting the sound wave emitted by the object to be detected, the target reflected sound wave corresponding to the object to be detected is a reflected sound wave reflected by a real human face. The process of obtaining the reflected sound wave reflected by the real face may be as shown in fig. 3, where in (1) in fig. 3, the terminal transmits the sound wave; when the sound waves propagate in the air and encounter an obstacle, reflection occurs, and when the sound waves propagate to a real face, as shown in (2) in fig. 3, the real face reflects the sound waves transmitted by the terminal and propagates the reflected sound waves to the terminal. It should be noted that the reflected sound wave of the real human face is formed by the sound wave obtained by reflecting the transmitted sound wave on different planes of the real human face.
In one possible implementation, the terminal may be equipped with a sound wave transmitting device and a sound wave receiving device. In this case, the terminal acquires the target reflected sound wave corresponding to the object to be detected as follows: the terminal uses the sound wave transmitting device to transmit a sound wave toward the object to be detected, uses the sound wave receiving device to receive sound waves, and takes the received reflected sound wave satisfying the condition as the target reflected sound wave. The sound wave transmitting device and the sound wave receiving device are not limited in the embodiments of the present application; illustratively, the sound wave transmitting device is a speaker and the sound wave receiving device is a microphone. The embodiments of the present application also do not limit the installation positions of the two devices in the terminal; for example, both may be installed on the top of the terminal.
In a possible implementation manner, the sound wave emitted by the terminal is an ultrasonic wave; its frequency is high (not less than 20 kHz), so it is difficult for human ears to hear, achieving imperceptible living body detection. In a possible implementation manner, the sound wave transmitted by the terminal may be a mono sound wave or a multichannel sound wave, which is not limited in the embodiments of the present application. When the transmitted sound wave is a multichannel sound wave, it can be transmitted by a plurality of sound wave transmitting devices. In this case, the reflected sound wave is also a multichannel sound wave and can be received by a plurality of sound wave receiving devices.
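A toy numerical sketch of this acquisition scheme (sampling rate, carrier frequency, and window bounds are all assumptions): emit a short tone at 20 kHz or above, then keep only the samples received inside the reference time interval, which drops the direct sound wave described above:

```python
import numpy as np

FS = 48000        # assumed sampling rate (Hz)
F_TX = 20000      # ultrasonic carrier, >= 20 kHz so it is hard to hear

def make_probe(duration=0.005):
    """Generate a short mono ultrasonic probe signal."""
    t = np.arange(int(duration * FS)) / FS
    return np.sin(2 * np.pi * F_TX * t)

def gate_echo(recording, t_start=0.001, t_end=0.006):
    """Keep only the reflected sound received inside the reference time
    interval [t_start, t_end] after emission; earlier samples are treated
    as the direct wave and dropped. The window bounds are assumptions."""
    return recording[int(t_start * FS):int(t_end * FS)]
```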
Exemplarily, in order to obtain a target image and a target reflected sound wave corresponding to an object to be detected, the terminal needs to be provided with an image acquisition device, a sound wave emitting device and a sound wave receiving device. Taking a terminal as a mobile phone as an example, the installation situation of the image acquisition device, the sound wave transmitting device and the sound wave receiving device in the mobile phone can be as shown in fig. 4. Fig. 4 (1), (2), and (3) show the mounting positions of the acoustic wave receiving device 41, the acoustic wave transmitting device 42, and the image pickup device 43 in three different types of cellular phones.
It should be noted that the above description is only an exemplary description of acquiring a target image and a target reflected sound wave corresponding to an object to be detected. In the above description, the target image and the target reflected sound wave corresponding to the object to be detected are directly acquired. In a possible implementation manner, the image and the reflected sound wave directly acquired in the above description may be respectively preprocessed, the preprocessed image is used as a target image corresponding to the object to be detected, and the preprocessed reflected sound wave is used as a target reflected sound wave corresponding to the object to be detected.
In one possible implementation, preprocessing the image may refer to performing an augmentation operation on the image to increase the reliability of the image during the biopsy procedure. The embodiment of the present application does not limit the augmentation operation, and illustratively, the augmentation operation includes one or more of rotation, color change, blurring, random noise addition, centered cutting, and resolution reduction.
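For example, several of these augmentation operations could be applied with Pillow and NumPy (an illustrative sketch; the patent does not specify the implementation):

```python
import numpy as np
from PIL import Image, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    """Illustrative augmentation chain; angles, radii and noise scale
    are arbitrary assumptions."""
    img = img.rotate(5)                                       # rotation
    img = img.filter(ImageFilter.GaussianBlur(radius=1))      # blurring
    w, h = img.size
    img = img.crop((w // 8, h // 8, w - w // 8, h - h // 8))  # centered cut
    img = img.resize((w // 2, h // 2))                        # resolution reduction
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, 5, arr.shape)                  # random noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```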
The reflected sound waves are preprocessed, so that the reliability of the reflected sound waves in the living body detection process can be improved. The operation of preprocessing the reflected sound wave can be set empirically, and is not limited in the embodiments of the present application. Illustratively, the operation of preprocessing the reflected acoustic wave includes at least one of filtering, normalization, and wavelet transformation. Optionally, the process of filtering the reflected sound wave may be performed by using a filter, and the embodiment of the present application does not limit the type of the filter, and the filter is, for example, a time domain filter, a frequency domain filter, or a kalman filter. Alternatively, the normalization process of the reflected sound wave may be implemented based on a sliding window.
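These preprocessing steps might look as follows in a SciPy/NumPy sketch; the cutoff frequencies and window size are assumptions, and the optional wavelet transform step is omitted:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 48000  # assumed sampling rate (Hz)

def preprocess_wave(wave, low=18000, high=22000, win=256):
    # Band-pass filtering around the emitted ultrasonic band
    # (the band edges here are assumptions).
    sos = butter(4, [low, high], btype='bandpass', fs=FS, output='sos')
    filtered = sosfiltfilt(sos, wave)
    # Window-based normalization: zero mean, unit variance per window.
    out = np.copy(filtered)
    for i in range(0, len(out) - win + 1, win):
        seg = out[i:i + win]
        out[i:i + win] = (seg - seg.mean()) / (seg.std() + 1e-8)
    return out
```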
In step 202, image features of the target image and acoustic features of the target reflected acoustic wave are extracted.
After the target image and the target reflected sound wave are obtained, the terminal can extract the image characteristics of the target image and the sound wave characteristics of the target reflected sound wave, and therefore the subsequent living body detection process can be conveniently achieved by means of the image characteristics and the sound wave characteristics.
In one possible implementation, the living body detection process may be implemented by invoking a living body detection model. The living body detection model includes a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model, and a target discriminator model. The living body detection model is obtained through training, and the training process may be performed at a terminal or a server, which is not limited in the embodiments of the present application. The process of training the living body detection model is shown in detail in FIG. 7 and will not be repeated here.
In the living body detection model, the target image feature extraction model is used for extracting the features of images; the embodiments of the present application do not limit its model structure as long as it can extract image features. Illustratively, the model structure of the target image feature extraction model is a convolutional neural network. The target sound wave feature extraction model is used for extracting the features of sound waves, and its model structure is likewise not limited. It should be noted that even if the model structures of the two extraction models are both convolutional neural networks, the networks have different parameters because their feature extraction objects are different.
In one possible implementation manner, the manner of extracting the image feature of the target image and the sound wave feature of the target reflected sound wave is as follows: calling a target image feature extraction model to extract image features of a target image; and calling a target sound wave characteristic extraction model to extract the sound wave characteristics of the target reflected sound waves.
Since the image features and the sound wave features in the embodiments of the present application are used for living body detection, both carry information relevant to living body detection, so the subsequent living body detection process can be executed based on them. In a possible implementation manner, the sound wave feature may be a vector; by normalizing the vector corresponding to the sound wave feature, it can be determined whether the reflecting surface includes a complex surface and whether that complex surface is a real human face. In one possible implementation manner, in addition to living body detection information, the image feature and the sound wave feature may both carry distance information, further improving the reliability of living body detection based on them. The distance information indicates the distance between the terminal and the object to be detected.
For example, for the acoustic wave feature, the distance between the terminal and the object to be detected refers to the distance between the acoustic wave emitting device of the terminal and the object to be detected. Optionally, the distance between the sound wave emitting device of the terminal and the object to be detected is a distance between the sound wave emitting device of the terminal and a main reflection plane corresponding to the object to be detected. For the image characteristics, the distance between the terminal and the object to be detected refers to the distance between an image acquisition device of the terminal and the object to be detected. Optionally, when the target image corresponding to the object to be detected includes a face image, the distance between the image acquisition device of the terminal and the object to be detected is the distance between the image acquisition device of the terminal and the face corresponding to the face image. In a possible implementation manner, the distance between the image acquisition device of the terminal and the face corresponding to the face image may be calculated based on the face key point ratio in the face image. It should be noted that, when the object to be detected is a real human body (living body), the distance indicated by the information on the distance carried by the image feature is close to the distance indicated by the information on the distance carried by the acoustic wave feature.
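One standard way to realize the key-point-ratio calculation mentioned above (not specified by the patent) is a pinhole-camera, similar-triangles estimate using a landmark pair of known real-world size, such as the inter-pupil distance:

```python
def face_distance(focal_px, real_ipd_m, pixel_ipd):
    """Pinhole-camera estimate: distance = focal length (pixels) x real
    inter-pupil distance (m) / inter-pupil distance in the image (pixels).
    All parameter names and values are illustrative assumptions."""
    return focal_px * real_ipd_m / pixel_ipd

# Example: 1000 px focal length, 63 mm average inter-pupil distance,
# eyes 120 px apart in the image -> roughly 0.525 m.
print(face_distance(1000, 0.063, 120))
```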
In step 203, classifying the image features, and obtaining a first living body detection result of the object to be detected according to the classification result; and classifying the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result.
The image features carry information relevant to living body detection. After the image features are extracted, they are classified, and a first living body detection result of the object to be detected is obtained according to the classification result. The first living body detection result indicates the living body detection result of the object to be detected at the image level: the object to be detected is a living body, or the object to be detected is a non-living body.
The sound wave features also carry information relevant to living body detection. After the sound wave features are extracted, they are classified, and a second living body detection result of the object to be detected is obtained according to the classification result. The second living body detection result indicates the living body detection result of the object to be detected at the reflected sound wave level: the object to be detected is a living body, or the object to be detected is a non-living body.
It should be noted that, in the embodiments of the present application, the object to be detected being a living body means that it has a real human face, and being a non-living body means that it does not have a real human face.
For the case where the living body detection model is invoked to implement the living body detection process, the first target classification model in the living body detection model is used to classify the image features so as to obtain the living body detection result at the image level according to the classification result, and the second target classification model is used to classify the sound wave features so as to obtain the living body detection result at the reflected sound wave level. The embodiments of the present application do not limit the model structures of the first target classification model and the second target classification model, which may be the same or different. Illustratively, each may include an activation function layer, perform the classification process using the activation function layer, and output a living body detection result.
In one possible implementation manner, classifying the image features, and obtaining a first living body detection result of the object to be detected according to the classification result includes: and calling a first target classification model to classify the image characteristics, and obtaining a first living body detection result of the object to be detected according to the classification result. The acoustic wave characteristics are classified, and a second living body detection result of the object to be detected is obtained according to the classification result, and the method comprises the following steps: and calling a second target classification model to classify the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result.
After the features are classified, a classification result is obtained, which includes the probabilities of different categories. In the embodiments of the present application, the classification process can be regarded as binary classification, and the classification result includes the probability of the living body class and the probability of the non-living body class. Whether the image features or the sound wave features are classified, a classification result including these two probabilities is obtained, and the corresponding living body detection result of the object to be detected is then derived from it.
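Concretely, such a two-class result can come from a linear head followed by a softmax over the living and non-living classes (a toy sketch with assumed parameter shapes):

```python
import numpy as np

def classify(feature, W, b):
    """Toy two-class head: logits -> softmax -> (p_live, p_non_live).
    W and b are assumed trained parameters of shape (2, d) and (2,)."""
    logits = W @ feature + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    p_live, p_non_live = exp / exp.sum()
    return p_live, p_non_live
```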
In one possible implementation manner, the image features are classified, and the manner of obtaining the first living body detection result of the object to be detected according to the classification result includes, but is not limited to, the following two manners.
Mode 1: and classifying the image features to obtain a classification result which is used as a first living body detection result of the object to be detected.
In this way 1, the first living body detection result of the object to be detected also includes the probability of the living body class and the probability of the non-living body class, and the meaning indicated by the first living body detection result is determined according to the magnitude relationship between the probability of the living body class and the probability of the non-living body class.
Mode 2: when the probability of the living body class in the classification result is greater than the probability of the non-living body class after the image features are classified, taking the object to be detected as a living body as a first living body detection result of the object to be detected; and when the probability of the living body class in the classification result is less than that of the non-living body class after the image features are classified, taking the object to be detected as a non-living body as a first living body detection result of the object to be detected.
In this manner 2, the first living body detection result directly indicates whether the object to be detected is a non-living body.
Similarly, there are two ways of classifying the sound wave features and obtaining the second living body detection result of the object to be detected according to the classification result: taking the classification result obtained after classifying the sound wave features as the second living body detection result; or, when the probability of the living body class in the classification result is greater than that of the non-living body class, taking "the object to be detected is a living body" as the second living body detection result, and when the probability of the living body class is less than that of the non-living body class, taking "the object to be detected is a non-living body" as the second living body detection result.
The first living body detection result and the second living body detection result are living body detection results of the object to be detected at two different levels; they can be regarded as preliminary results, and it is difficult to accurately determine the final living body detection result directly from them alone. It is further necessary to determine, based on step 204, whether the target image and the target reflected sound wave come from the same object, and to determine the final living body detection result based on the first and second results on that basis; the living body detection result determined in this way is highly reliable.
In step 204, the image features and the acoustic wave features are matched to obtain a matching result of the image features and the acoustic wave features.
After the image features and the acoustic wave features are obtained based on step 202, matching processing is performed on the image features and the acoustic wave features to obtain a matching result of the image features and the acoustic wave features. The matching result of the image feature and the sound wave feature is used for indicating whether the image feature and the sound wave feature are successfully matched. Whether the image features and the acoustic features match successfully can be used to indicate whether the image features and the acoustic features are from the same object. When the matching result of the image characteristic and the sound wave characteristic indicates that the image characteristic and the sound wave characteristic are successfully matched, the image characteristic and the sound wave characteristic are from the same object; when the matching result of the image characteristic and the sound wave characteristic indicates that the image characteristic and the sound wave characteristic are failed to be matched, the fact that the image characteristic and the sound wave characteristic come from different objects is indicated.
It should be noted that although the target image and the target reflected sound wave both correspond to the object to be detected, they may not both come from the object to be detected. For example, when the emission port of the sound wave emission device of the terminal is blocked and a previously prepared reflected sound wave of another object is played at the position of the object to be detected, the target reflected sound wave acquired by the terminal is that previously prepared, false reflected sound wave. In this case the target image comes from the object to be detected while the target reflected sound wave comes from another object, so the image features corresponding to the target image and the sound wave features corresponding to the target reflected sound wave come from different objects.
Where the living body detection process is implemented by invoking the living body detection model, the target discriminator model in the living body detection model performs the matching between the image features and the sound wave features to judge whether they match. In one possible implementation manner, the matching result of the image features and the sound wave features is obtained by calling the target discriminator model to perform matching processing on the image features and the sound wave features.
In one possible implementation, the model structure of the target discriminator model includes, but is not limited to, the following two cases.
Case one: the target discriminator model includes a target similarity operator model.
The target similarity operator model is used to calculate a similarity value between the two features.
In a possible implementation manner, in this case, the process of calling the target discriminator model to perform matching processing on the image features and the acoustic wave features to obtain a matching result of the image features and the acoustic wave features is as follows: inputting the image characteristic and the sound wave characteristic into a target similarity operator model to obtain a similarity value between the image characteristic and the sound wave characteristic calculated by the target similarity operator model, and taking the similarity value between the image characteristic and the sound wave characteristic as a matching result of the image characteristic and the sound wave characteristic. At this time, the matching result is a similarity value.
In this case of obtaining the matching result, the matching result indicating that the image feature and the acoustic wave feature are successfully matched means that the similarity value indicates that the image feature and the acoustic wave feature are successfully matched. In one possible implementation, the similarity value indicating that the image feature and the sound wave feature are successfully matched means that the similarity value is not less than a target similarity threshold. When the image characteristics and the sound wave characteristics are matched successfully, the image characteristics and the sound wave characteristics are shown to come from the same object. Therefore, when the similarity value is not smaller than the target similarity threshold value, the image feature and the sound wave feature are both from the object to be detected. It should be noted that the target similarity threshold may be set empirically, or may be flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application.
In a possible implementation manner, the main body of the target similarity operator model is a similarity calculation matrix, and the image features and the sound wave features are vectors. In the target similarity operator model, the product of the vector corresponding to the image features, the similarity calculation matrix, and the vector corresponding to the sound wave features is calculated, and this product is taken as the similarity value between the image features and the sound wave features.
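Illustratively, such a bilinear similarity computation can be sketched with numpy as follows; the feature dimensions, the random matrix W standing in for the learned similarity calculation matrix, and the threshold value 0.5 are assumptions for illustration only:

```python
import numpy as np

def bilinear_similarity(wav_feat: np.ndarray, W: np.ndarray,
                        img_feat: np.ndarray) -> float:
    """Similarity value computed as (row vector) x (matrix) x (column vector):
    s = wav_feat . W . img_feat."""
    return float(wav_feat @ W @ img_feat)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256))     # similarity calculation matrix (learned in training)
s = bilinear_similarity(rng.normal(size=128), W, rng.normal(size=256))
matched = s >= 0.5                  # 0.5: illustrative target similarity threshold
```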
Case two: the target discriminator model comprises a target characteristic fusion submodel and a target classification processing submodel.
The target feature fusion sub-model is used for fusing the two features, and the target classification processing sub-model is used for classifying the fused features.
In a possible implementation manner, in this case two, the process of calling the target discriminator model to perform matching processing on the image features and the acoustic wave features to obtain a matching result of the image features and the acoustic wave features is as follows: inputting the image characteristics and the sound wave characteristics into a target characteristic fusion sub-model for fusion to obtain target fusion characteristics, calling a target classification processing sub-model to classify the target fusion characteristics to obtain a target classification result, and taking the target classification result as a matching result of the image characteristics and the sound wave characteristics. At this time, the matching result is the target classification result.
In the target feature fusion sub-model, the manner of fusing the image feature and the acoustic wave feature may refer to summing the image feature and the acoustic wave feature, or may refer to splicing the image feature and the acoustic wave feature, which is not limited in the embodiment of the present application. The target classification processing sub-model can comprise an activation function layer, and the activation function layer is used for performing classification processing on the target fusion features and obtaining a classification result.
In one possible implementation, the target classification result includes two probabilities, a first probability and a second probability, the first probability representing the probability that the image feature and the sound wave feature come from the same object, and the second probability representing the probability that the image feature and the sound wave feature come from different objects.
Under the condition of obtaining the matching result, the fact that the matching result indicates that the image feature and the sound wave feature are successfully matched means that the target classification result indicates that the image feature and the sound wave feature are successfully matched. In one possible implementation, the indication that the image feature and the acoustic feature are successfully matched by the target classification result means that the first probability is greater than the second probability in the target classification result. When the image characteristics and the sound wave characteristics are matched successfully, the image characteristics and the sound wave characteristics are shown to come from the same object. Therefore, when the first probability in the target classification result is larger than the second probability, the image feature and the sound wave feature are both from the object to be detected.
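Illustratively, case two can be sketched in torch as follows; the fusion by concatenation (summation would serve equally for equal feature dimensions), the hidden layer size, and the class ordering (first probability = same object) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class FusionDiscriminator(nn.Module):
    """Sketch of case two: fuse the two features, then classify the fusion."""
    def __init__(self, img_dim: int = 256, wav_dim: int = 128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + wav_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
            nn.Softmax(dim=-1),  # activation function layer producing the two probabilities
        )

    def forward(self, img_feat: torch.Tensor, wav_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_feat, wav_feat], dim=-1)  # target fusion feature
        return self.classifier(fused)                    # [p_same, p_different]

probs = FusionDiscriminator()(torch.randn(1, 256), torch.randn(1, 128))
matched = bool(probs[0, 0] > probs[0, 1])  # first probability > second probability
```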
In step 205, a target in-vivo detection result of the object to be detected is determined based on the first in-vivo detection result, the second in-vivo detection result, and the matching result.
After the first and second in-vivo detection results are obtained based on step 203 and the matching result is obtained based on step 204, the target in-vivo detection result of the object to be detected is determined by comprehensively considering the first and second in-vivo detection results and the matching result. In the method, the living body detection results of the object to be detected on the image layer and the reflected sound wave layer are considered, the matching result of the image characteristic corresponding to the image and the sound wave characteristic corresponding to the reflected sound wave is also considered, the determined target living body detection result of the object to be detected is accurate, and the attack can be effectively prevented.
In one possible implementation manner, the target living body detection result of the object to be detected is determined based on the first living body detection result, the second living body detection result, and the matching result as follows: in response to the first living body detection result indicating that the object to be detected is a living body, the second living body detection result indicating that the object to be detected is a living body, and the matching result indicating that the image features and the sound wave features are successfully matched, the living body detection is passed as the target living body detection result of the object to be detected. The first living body detection result indicating a living body means the object to be detected is determined to be a living body at the image level; the second living body detection result indicating a living body means the object to be detected is determined to be a living body at the reflected sound wave level; and the successful matching of the image features and the sound wave features indicates that they come from the same object (namely, both come from the object to be detected). That is, when the object to be detected is determined to be a living body at both the image level and the reflected sound wave level, and both the image features corresponding to the target image and the sound wave features corresponding to the target reflected sound wave are determined to come from the object to be detected, the living body detection of the object to be detected is considered passed.
In one possible implementation manner, the target living body detection result of the object to be detected is determined based on the first living body detection result, the second living body detection result, and the matching result as follows: in response to the first living body detection result indicating that the object to be detected is a living body, the second living body detection result indicating that the object to be detected is a living body, and the matching result indicating that the matching of the image features and the sound wave features has failed, the living body detection fails as the target living body detection result of the object to be detected. That is, although the object to be detected is determined to be a living body at both the image level and the reflected sound wave level, the image features and the sound wave features come from different objects, so the living-body determination at at least one level is unreliable, and the living body detection of the object to be detected is considered failed.
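Illustratively, the combination of the three results can be sketched in a few lines of Python (names are illustrative):

```python
def target_liveness_result(first_is_live: bool, second_is_live: bool,
                           features_matched: bool) -> bool:
    """Living body detection passes only when the image branch, the sound wave
    branch, and the feature matching all agree: the object is judged a living
    body at both levels and both features come from the same object."""
    return first_is_live and second_is_live and features_matched
```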
Illustratively, the process of invoking the liveness detection model for liveness detection may be as shown in fig. 5. Inputting the target image into a target image feature extraction model 51 to obtain image features; inputting the target reflected sound wave into the target sound wave feature extraction model 52 to obtain sound wave features; classifying the image features by using a first target classification model 53, and obtaining a first living body detection result according to the classification result; classifying the acoustic wave features by using a second target classification model 54, and obtaining a second living body detection result according to the classification result; the image feature and the acoustic wave feature are input to the target discriminator model 55, and the matching result of the image feature and the acoustic wave feature is obtained. And further determining a target in-vivo detection result based on the first in-vivo detection result, the second in-vivo detection result, and the matching result.
It should be noted that, in the process of performing face recognition on the object to be detected, in addition to performing living body detection on the object to be detected, identity verification needs to be performed on the object to be detected, and when both the living body detection and the identity verification pass, it is indicated that the face recognition of the object to be detected passes. The embodiment of the application only defines the process of carrying out the living body detection on the object to be detected, and does not limit the process of carrying out the identity verification on the object to be detected. It should be noted that, the order of the living body detection and the identity verification is not limited in the embodiments of the present application. Illustratively, the living body detection is carried out on the object to be detected, and after the living body detection is passed, the identity verification is carried out on the object to be detected. Of course, the identity of the object to be detected may be verified first, and then the living body of the object to be detected may be detected after the identity verification is passed.
Next, the living body detection process is described by taking an application scenario of Turing shield owner identification as an example. The Turing shield is a payment application product of an electronically encrypted digital economic entity, derived from an underlying Ethereum smart contract program. The user bound to the Turing shield is its owner. When a user inputs a password or a verification code on a mobile terminal, the user's face faces the camera (image acquisition device) at the top of the mobile terminal most of the time. At this moment, an image of the user is obtained through image acquisition or video recording by the camera at the top of the mobile terminal, sound waves are emitted by the loudspeaker, and the reflected sound waves are received by the microphone of the mobile terminal. After the image and the reflected sound wave are obtained, living body detection is performed on the user based on them. Besides living body detection, the mobile terminal also performs identity verification on the user, which comprises two levels: one level verifies whether the user's facial features match the facial features of the Turing shield owner, and the other verifies whether the user's password or verification code input habits match the input habits of the Turing shield owner.
The password input interface in the Turing shield owner identification process is shown in fig. 6; a password guard test is carried out on this interface, and the numbers of tests and of successful identifications are counted. The numeral 171718 in rectangular box 61 in fig. 6 is a given password, which the user needs to input at the numeral input position in rectangular box 62. While the user inputs the password through the numeric keyboard, the mobile terminal collects the user's input habits, such as the inclination angle of the mobile terminal during input and the dwell time of the user's fingers, and matches the collected input habits against the pre-stored input habits of the Turing shield owner, so as to verify whether the user's password or verification code input habits match those of the owner. This process realizes imperceptible living body detection of the user at low interaction cost while the user inputs the password.
The in-vivo detection method provided by the embodiment of the application can be applied to any scene needing in-vivo detection, and the application fields of the in-vivo detection method provided by the embodiment of the application include, but are not limited to, identity authentication, electronic commerce, data security, financial wind control, intelligent hardware and the like.
Compared with the image-based in-vivo detection in the related art, the in-vivo detection method provided by the embodiment of the application has the advantages that the additionally required equipment cost and the calculation cost are extremely low, only the loudspeaker and the microphone are needed, the application range is wide, and the introduced reflected sound waves can be directly used for in-vivo detection, so that the in-vivo detection method can be matched with the image for detection to obtain more robust detection performance.
In the embodiment of the application, the image and the reflected sound wave corresponding to the object to be detected are obtained, the image characteristic and the sound wave characteristic are further extracted, and the living body detection result of the object to be detected is determined by comprehensively considering the living body detection result obtained based on the image characteristic, the living body detection result obtained based on the sound wave characteristic and the matching result of the image characteristic and the sound wave characteristic. In the process of the living body detection, besides information in the aspect of images, information in the aspect of reflected sound waves and matching information between the two aspects are considered, the considered information is comprehensive, the defense capability against attacks can be improved, the detection stability is high, the accuracy of the living body detection result is improved, and the living body detection effect is good.
Based on the implementation environment shown in fig. 1, the embodiment of the present application provides a method for training to obtain a living body detection model, and takes application of the method to a terminal 11 as an example. As shown in fig. 7, the method provided by the embodiment of the present application may include the following steps.
In step 701, a training sample set is obtained, where the training sample set includes at least two training sample subsets, different training sample subsets correspond to different targets, any training sample subset includes at least one training sample corresponding to any target, and any training sample corresponding to any target includes any image corresponding to any target and reflected sound waves corresponding to any image.
The training sample set is used for training the living body detection model to be trained. The training sample set comprises at least two training sample subsets, and different training sample subsets correspond to different target objects. The target is an object used for acquiring a training sample. In one possible implementation, the target object is an object in a natural scene, such as a real human face, a plant, a fruit, white paper, a tablet, a sphere, and the like. The color of the object in the natural scene is simple, and confusion caused by model learning can be avoided.
One training sample subset corresponds to one target object, and different training sample subsets correspond to different target objects. The number of training sample subsets in the training sample set is at least two, that is, the number of targets for acquiring training samples is at least two. For any training sample subset, the any training sample subset comprises at least one training sample corresponding to any target object. That is, each of the training sample subsets includes at least one training sample, the training samples included in the same training sample subset are from the same target object, and the training samples included in different training sample subsets are from different target objects. It should be noted that the number of training samples included in different training sample subsets may be the same or different, and this is not limited in this embodiment of the application.
In at least one training sample corresponding to any target object, any training sample corresponding to any target object comprises any image corresponding to any target object and reflected sound waves corresponding to any image.
According to the above analysis, the training sample set includes at least two training samples, each training sample including an image and the reflected sound wave corresponding to that image, and different training samples may correspond to different target objects.
In one possible implementation, the acquisition process of any training sample subset includes the following steps 1 to 5.
Step 1: for any object, acquiring an image sequence of the object and a reflected sound wave reflected by the object.
Here, the image sequence of the target and the reflected acoustic wave sequence reflected by the target are both from the target. The image sequence comprises a plurality of images, and each image corresponds to a time stamp. In one possible implementation manner, the manner in which the terminal acquires the image sequence of any one target object is as follows: and the terminal acquires at least one image of any target object according to a fixed time interval, and the at least one image is sequentially sequenced according to the sequence of acquisition time to obtain an image sequence of any target object. The fixed time interval is not limited in the embodiment of the present application, and may be determined according to an image capturing frame rate of the image capturing device of the terminal, for example, if the image capturing frame rate of the image capturing device of the terminal is 50 fps (frames per second), the fixed time interval is 20 ms (milliseconds).
In one possible implementation manner, the terminal acquires the reflected sound wave reflected by any target object as follows: the terminal periodically emits sound waves toward the target object and takes the received reflected sound waves as the reflected sound waves reflected by the target object. Since the terminal emits sound waves periodically, the received reflected sound wave includes, for each period, the reflection of the sound wave emitted toward the target object. In one possible implementation manner, periodically emitting sound waves means that the terminal emits sound waves toward the target object at intervals. The interval duration between two emissions is not limited and may be set according to the reference distance between the object to be detected and the living body detection terminal and the sound wave propagation speed; for example, it may be set to 5 ms. The reference distance between the object to be detected and the living body detection terminal can be obtained empirically.
Step 2: the reflected acoustic wave is divided into at least one reflected sub-acoustic wave.
Since the reflected sound wave reflected by any target object includes, for each period, the reflection of the sound wave emitted toward the target object, the reflected sound wave can be divided according to a salient feature into at least one reflected sub sound wave, each reflected sub sound wave being regarded as the reflection of the sound wave emitted in one period. In a possible implementation manner, when the interval duration between two emissions is set according to the reference distance between the object to be detected and the living body detection terminal and the sound wave propagation speed, the salient feature used for division can be the period duration. It should be noted that the emitted sound wave itself has an emission duration, and the period duration is the sum of one emission duration and one interval duration. For example, assuming one emission duration is 1 ms and one interval duration is 5 ms, the period duration is 6 ms; dividing the reflected sound wave by the 6 ms period duration yields reflected sub sound waves that are each a 6 ms high-frequency sound wave segment. After the division, each reflected sub sound wave corresponds to time information. The time information corresponding to a reflected sub sound wave may be its start timestamp, its end timestamp, the timestamp of a certain position (for example, the middle position), or its timestamp range, which is not limited in the embodiment of the present application.
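Illustratively, the period-based division can be sketched with numpy as follows; the sampling rate and the recording are assumptions for illustration:

```python
import numpy as np

def split_into_sub_waves(reflected: np.ndarray, sample_rate: int,
                         period_ms: float = 6.0) -> list:
    """Divide the recorded reflected sound wave into per-period segments.

    period_ms = one emission duration (1 ms) + one interval duration (5 ms)
    in the example above; each returned segment is one reflected sub sound wave.
    """
    samples_per_period = int(sample_rate * period_ms / 1000)
    n_periods = len(reflected) // samples_per_period
    return [reflected[i * samples_per_period:(i + 1) * samples_per_period]
            for i in range(n_periods)]

wave = np.random.randn(48000)             # 1 second of recording at 48 kHz
subs = split_into_sub_waves(wave, 48000)  # 166 full sub waves of 6 ms each
```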
Step 3: align the at least one reflected sub sound wave with at least one image in the image sequence to obtain the reflected sub sound waves respectively aligned with the at least one image.
The frame rate of collection of the reflected sub-acoustic waves is different from the frame rate of collection of the images in the image sequence, for example, assuming that each reflected sub-acoustic wave is a high-frequency acoustic wave segment of 6 ms, the frame rate of collection of the reflected sub-acoustic waves is 167 reflected sub-acoustic waves per second, and the frame rate of collection of the images is typically 30-60 images per second. The difference between the collection frame rate of the reflected sub-acoustic waves and the collection frame rate of the images in the image sequence is large, and it is necessary to align at least one of the acquired reflected sub-acoustic waves with at least one of the images in the image sequence to obtain reflected sub-acoustic waves aligned with each of the images.
In one possible implementation, the process of aligning the at least one reflected sub-acoustic wave with at least one image in the image sequence to obtain reflected sub-acoustic waves respectively aligned with the at least one image includes: and determining reflected sub-sound waves respectively aligned with at least one image according to the time stamp of at least one image in the image sequence and the time information of at least one reflected sub-sound wave. In one possible implementation, the process of determining the reflected sub-sound waves respectively aligned with at least one image according to the time stamp of at least one image in the image sequence and the time information of at least one reflected sub-sound wave is as follows: for any image, the reflected sub-sound wave whose time information matches the time stamp of the any image is taken as the reflected sub-sound wave aligned with the any image.
It should be noted that the number of reflected sub sound waves aligned with any image may be one or more, which is not limited in the embodiment of the present application. The reflected sub sound waves respectively aligned with two adjacent images may or may not overlap, which is likewise not limited. When they do not overlap, there may be reflected sub sound waves that are aligned with no image; these reflected sub sound waves are discarded.
The condition for judging whether the time information matches the timestamp of an image may be set empirically. For example, it may be: judging whether the absolute value of the difference between the timestamp indicated by the time information and the timestamp of the image is not greater than a reference threshold; when it is not greater than the reference threshold, the time information matches the timestamp of the image; when it is greater than the reference threshold, the time information does not match. It should be noted that when the time information is a timestamp, the timestamp indicated by the time information is that timestamp; when the time information is a timestamp range, the timestamp indicated by the time information may be the timestamp of a certain reference position (e.g., the middle position) in that range. The reference threshold may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application.
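Illustratively, the timestamp-based alignment can be sketched in Python as follows; the 10 ms reference threshold and all timing values are assumptions for illustration:

```python
def align_sub_waves(image_ts, sub_wave_ts, ref_threshold_ms=10.0):
    """For each image, collect indices of reflected sub sound waves whose
    indicated timestamp differs from the image timestamp by at most
    ref_threshold_ms (all timestamps in milliseconds)."""
    return {
        i: [j for j, t_sub in enumerate(sub_wave_ts)
            if abs(t_sub - t_img) <= ref_threshold_ms]
        for i, t_img in enumerate(image_ts)
    }

# Images captured every 20 ms, sub waves every 6 ms:
mapping = align_sub_waves([0.0, 20.0, 40.0],
                          [0.0, 6.0, 12.0, 18.0, 24.0, 30.0, 36.0, 42.0])
# mapping[1] == [2, 3, 4, 5]: the sub waves at 12, 18, 24, and 30 ms
# align with the image captured at 20 ms; adjacent images may share sub waves.
```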
In a possible implementation manner, before the alignment processing is performed on the at least one reflected sub acoustic wave and the at least one image in the image sequence, the at least one reflected sub acoustic wave and the at least one image in the image sequence may be respectively preprocessed, and then the at least one preprocessed reflected sub acoustic wave and the at least one preprocessed image may be aligned.
In one possible implementation, preprocessing the image may refer to performing an augmentation operation on the image, and the augmentation operation is not limited in the embodiment of the present application, and exemplarily includes one or more of rotating, changing color, blurring, adding random noise, centering, and reducing resolution. The operation of preprocessing the reflected sub-acoustic waves can be set empirically, and is not limited in the embodiments of the present application. Illustratively, the operation of preprocessing the reflected sub-acoustic waves includes at least one of filtering, normalization processing, and wavelet transformation. Optionally, the process of filtering the reflected sub-acoustic wave may be performed by using a filter, and the embodiment of the present application does not limit the type of the filter, and the filter is, for example, a time domain filter, a frequency domain filter, or a kalman filter. Alternatively, the normalization process of the reflected sub-acoustic waves may be implemented based on a sliding window.
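Illustratively, the sliding-window normalization mentioned above can be sketched with numpy as follows; the window size is an assumption, as the embodiment does not fix one:

```python
import numpy as np

def sliding_window_normalize(wave: np.ndarray, window: int = 256) -> np.ndarray:
    """Normalize a reflected sub sound wave with local statistics: each sample
    is centered on the mean and scaled by the standard deviation of the
    samples inside a window around it."""
    out = np.empty(len(wave), dtype=float)
    half = window // 2
    for i in range(len(wave)):
        seg = wave[max(0, i - half):min(len(wave), i + half + 1)]
        out[i] = (wave[i] - seg.mean()) / (seg.std() + 1e-8)
    return out
```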
Step 4: for any image in the at least one image, construct the reflected sound wave corresponding to that image based on the reflected sub sound waves aligned with it; form a training sample corresponding to the target object based on the image and its corresponding reflected sound wave.
And after the reflected sub sound waves respectively aligned with at least one image are obtained, for any image, forming the reflected sound wave corresponding to any image based on the reflected sub sound waves aligned with any image. In one possible implementation, the reflected sound wave corresponding to any image is constructed based on the reflected sub-sound wave aligned with the image by: and splicing the reflected sub sound waves aligned with any image according to the sequence of the timestamps indicated by the time information of the reflected sub sound waves to obtain the reflected sound waves corresponding to any image. According to this way, the reflected sound waves corresponding to the at least one image can be obtained from the reflected sub sound waves aligned with the at least one image, respectively.
After the reflected sound wave corresponding to any image is obtained, a training sample corresponding to the target object is formed on the basis of that image and its corresponding reflected sound wave. It should be noted that each training sample formed in this way is an <image, reflected sound wave> pair in which the reflected sound wave corresponds to the image and both elements come from the same target object.
After the reflected sound waves corresponding to the at least one image are obtained, a training sample corresponding to any target object can be formed based on each image and the reflected sound waves corresponding to the image, and therefore at least one training sample corresponding to any target object formed by the at least one image is obtained.
For example, the process of obtaining the training sample corresponding to any target object may be as shown in 81 to 88 in fig. 8. 81. Acquiring reflected sound waves reflected by a target object; 82. dividing the reflected sound wave into at least one reflected sub-sound wave; 83. filtering and normalizing the reflected sub-sound waves; 84. acquiring an image sequence of a target object; 85. determining a timestamp of at least one image in the sequence of images; 86. carrying out augmentation operation on the image; 87. aligning the processed reflected sub-sound waves with the processed image to determine reflected sound waves corresponding to the image; 88. the < image, reflected sound wave > pair for the same image is taken as a training sample for the target object.
Step 5: form the training sample subset based on the at least one training sample corresponding to the target object formed from the at least one image.
After obtaining at least one training sample corresponding to any target object formed by at least one image, forming any training sample subset based on at least one training sample corresponding to any target object, wherein any training sample subset is a training sample subset corresponding to any target object.
It should be noted that, the above steps 1 to 5 describe a process of obtaining any training sample subset, and for each target object, one training sample subset may be obtained according to the above steps 1 to 5, and after obtaining at least two training sample subsets, a training sample set may be obtained.
In step 702, training samples are respectively selected from at least two training sample subsets to form target training samples, and an initial image feature extraction model, an initial acoustic feature extraction model, a first initial classification model, a second initial classification model and an initial discriminator model in a to-be-trained living body detection model are trained based on the target training samples to obtain a living body detection model.
The living body detection model comprises a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model.
The target training sample is used for training the living body detection model to be trained for one time. The training samples in the target training samples are respectively selected from at least two training sample subsets, and the target training samples comprise at least two training samples from at least two target objects because different training sample subsets correspond to different target objects. It should be noted that, in the process of forming the target training sample, the same number of training samples may be selected from different training sample subsets, and a different number of training samples may also be selected, which is not limited in the embodiment of the present application.
And after the target training sample is obtained, training the living body detection model to be trained based on the target training sample to obtain the living body detection model. In one possible implementation manner, referring to fig. 9, training an initial image feature extraction model, an initial acoustic feature extraction model, a first initial classification model, a second initial classification model, and an initial discriminator model in a to-be-trained living body detection model based on a target training sample includes the following steps 7021 to 7026.
Step 7021: and calling an initial image feature extraction model to extract initial image features of the images in the target training sample.
The initial image feature extraction model is an image feature extraction model to be trained and is used for extracting features of images. In one possible implementation manner, the process of invoking the initial image feature extraction model to extract the initial image features of the images in the target training sample is as follows: and inputting the images in the target training sample into the initial image feature extraction model to obtain the initial image features extracted by the initial image feature extraction model. It should be noted that, since the target training samples include at least two training samples from at least two target objects, and each training sample includes an image, this step 7021 invokes the initial image features extracted by the initial image feature extraction model to be the initial image features corresponding to each image. That is, this step 7021 calls for the initial image feature extraction model to extract the initial image features of which the number is at least two.
Step 7022: and calling an initial sound wave feature extraction model to extract the initial sound wave features of the reflected sound waves in the target training sample.
The initial sound wave feature extraction model is a sound wave feature extraction model to be trained and is used for extracting the features of the reflected sound waves. In one possible implementation manner, the process of calling the initial sound wave feature extraction model to extract the initial sound wave features of the reflected sound waves in the target training sample is as follows: and inputting the reflected sound waves in the target training sample into the initial sound wave feature extraction model to obtain the initial sound wave features extracted by the initial sound wave feature extraction model. It should be noted that, since the target training samples include at least two training samples from at least two target objects, and each training sample includes a reflected sound wave, this step 7022 invokes the initial sound wave feature extracted by the initial sound wave feature extraction model to be an initial sound wave feature corresponding to each reflected sound wave. That is, this step 7022 calls for the initial acoustic feature extraction model to extract at least two initial acoustic features.
Step 7023: calling a first initial classification model to classify the initial image features, and obtaining a first initial living body detection result according to a classification result; and calling a second initial classification model to classify the initial sound wave characteristics, and obtaining a second initial living body detection result according to the classification result.
The first initial classification model is a model to be trained for classifying the image features, and the second initial classification model is a model to be trained for classifying the sound wave features.
It should be noted that, in the process of calling the first initial classification model to perform classification processing on the initial image features, the first initial classification model is called to perform classification processing on each initial image feature. After the first initial classification model is called to classify each initial image feature, a first initial living body detection result can be obtained according to a classification result. Since the number of initial image features is at least two, the number of first initial living body detection results obtained in this step 7023 is also at least two.
And calling the second initial classification model to classify each initial sound wave characteristic respectively in the process of calling the second initial classification model to classify the initial sound wave characteristics. And calling a second initial classification model to classify each initial sound wave characteristic, and then obtaining a second initial living body detection result according to a classification result. Since the number of initial acoustic wave features is at least two, the number of second initial living body detection results obtained in this step 7023 is also at least two.
Step 7024: and forming sample feature groups based on the initial image features and the initial sound wave features, wherein any sample feature group is formed by one initial image feature and one initial sound wave feature.
As can be seen from steps 7021 and 7022, the number of initial image features is at least two, and the number of initial acoustic wave features is also at least two, and sample feature groups are formed based on the at least two initial image features and the at least two initial acoustic wave features, where any one sample feature group is formed by one initial image feature and one initial acoustic wave feature.
In one possible implementation manner, the sample feature set includes a positive sample feature set and a negative sample feature set, and based on the initial image feature and the initial acoustic wave feature, the process of constructing the sample feature set is as follows: determining an initial image feature-initial sound wave feature pair meeting a positive matching condition and an initial image feature-initial sound wave feature pair meeting a negative matching condition based on the initial image feature and the initial sound wave feature; and forming a positive sample feature group based on the initial image feature-initial sound wave feature pairs meeting the positive matching condition, and forming a negative sample feature group based on the initial image feature-initial sound wave feature pairs meeting the negative matching condition. The condition that the positive matching is met means that the image corresponding to the initial image characteristic and the reflected sound wave corresponding to the initial sound wave characteristic come from the same training sample in the target training samples; the condition that the negative matching is met means that the image corresponding to the initial image feature and the reflected sound wave corresponding to the initial sound wave feature come from two training samples meeting the reference condition in the target training samples, and the two training samples meeting the reference condition mean two training samples from different training sample subsets.
One training sample subset corresponds to one target object, so that after the positive sample feature set and the negative sample feature set are obtained based on the above manner, the initial image features and the initial sound wave features in the positive sample feature set come from the same target object, and the model can learn the associated features between the image features and the sound wave features from the same target object; the initial image features and the initial sound wave features in the negative sample feature set are from different objects, so that the model can learn the associated features between the image features and the sound wave features from different objects.
It should be noted that, because the target training sample includes at least two training samples from two training sample subsets, the initial image feature corresponding to the image in each training sample and the initial acoustic wave feature corresponding to the reflected acoustic wave may both form an initial image feature-initial acoustic wave feature pair that satisfies the positive matching condition. Each initial image feature-initial acoustic wave feature pair satisfying the positive matching condition may form a positive sample feature group, and therefore, the number of the positive sample feature groups is at least two. In addition, the initial image features corresponding to the images and the initial sound wave features corresponding to the reflected sound waves in at least two training samples from the two training sample subsets can form at least two initial image feature-initial sound wave feature pairs meeting the negative matching condition, and the number of the negative sample feature groups is also at least two.
For example, assume that the target training sample includes a training sample A from the training sample subset A and a training sample B from the training sample subset B. Step 7021 yields an initial image feature 1 corresponding to the image in training sample A and an initial image feature 2 corresponding to the image in training sample B, and step 7022 yields an initial sound wave feature 1 corresponding to the reflected sound wave in training sample A and an initial sound wave feature 2 corresponding to the reflected sound wave in training sample B. On this basis, the initial image feature-initial sound wave feature pairs satisfying the positive matching condition include the initial image feature 1-initial sound wave feature 1 pair and the initial image feature 2-initial sound wave feature 2 pair; the initial image feature-initial sound wave feature pairs satisfying the negative matching condition include the initial image feature 1-initial sound wave feature 2 pair and the initial image feature 2-initial sound wave feature 1 pair. Two positive sample feature groups, (initial image feature 1, initial sound wave feature 1) and (initial image feature 2, initial sound wave feature 2), can be formed from the former; two negative sample feature groups, (initial image feature 1, initial sound wave feature 2) and (initial image feature 2, initial sound wave feature 1), can be formed from the latter.
It should be noted that, in the process of forming the positive sample feature group and the negative sample feature group, the initial image feature-initial acoustic feature pairs that do not satisfy the positive matching condition nor the negative matching condition are ignored, and these initial image feature-initial acoustic feature pairs come from two different training samples in the same training subset. Although the training samples in the same training subset all correspond to the same target object, because two different training samples in the same training subset are training samples corresponding to the target object at different times, the initial image feature-initial sound wave feature pair comes from the same target object at different times, and the initial image feature-initial sound wave feature pair easily interferes with the learning process of the model, so the initial image feature-initial sound wave feature pair is ignored.
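Illustratively, the pairing rules of step 7024 can be sketched in Python as follows; the tuple layout of features is an assumption for illustration:

```python
def build_sample_feature_groups(features):
    """features: list of (subset_id, img_feat, wav_feat) tuples, one per
    training sample in the target training sample.

    Positive groups pair the image and sound wave features of the same
    training sample; negative groups pair features from samples belonging to
    different subsets (different target objects). Cross pairs from two
    different samples of the same subset are ignored, as described above.
    """
    positives, negatives = [], []
    for i, (sid_i, img_i, wav_i) in enumerate(features):
        positives.append((img_i, wav_i))
        for j, (sid_j, _img_j, wav_j) in enumerate(features):
            if i != j and sid_i != sid_j:
                negatives.append((img_i, wav_j))
    return positives, negatives
```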
Step 7025: training a first initial classification model based on the first initial in-vivo detection result; training a second initial classification model based on the second initial in-vivo detection result; training an initial discriminator model based on the sample characteristic group; training an initial image feature extraction model based on the first initial living body detection result and the sample feature group; and training the initial sound wave feature extraction model based on the second initial living body detection result and the sample feature group.
This step 7025 is a process of training each of the to-be-trained living body detection models, and since different models have different functions, data on which different models are trained are different. The manner in which each model is trained is described separately below.
1. The first initial classification model is trained based on the first initial in-vivo detection result.
The first initial living body detection result is obtained after the classification processing of the first initial classification model, and the first initial classification model is trained according to it. In one possible implementation manner, each training sample in the target training sample has a classification label indicating whether the target object corresponding to the training sample is a living body; the result indicated by the classification label is the real result. Each training sample corresponds to one first initial living body detection result, which indicates whether the target object corresponding to the training sample is a living body at the image level; the first initial living body detection result is the result predicted by the model.
In one possible implementation manner, the process of training the first initial classification model based on the first initial in-vivo detection result is as follows: acquiring a first classification loss function based on the first initial living body detection result and the classification label; parameters of the first initial classification model are updated based on the first classification loss function. It should be noted that, since each training sample corresponds to one first initial in-vivo detection result and one classification label, the process of obtaining the first classification loss function based on the first initial in-vivo detection result and the classification label may be: and respectively calculating a loss function between each first initial living body detection result and the corresponding classification label, and taking the average value of the calculated loss functions as a first classification loss function. The embodiment of the present application does not limit the type of the loss function between the first initial in-vivo detection result and the corresponding classification label, and the loss function between the first initial in-vivo detection result and the corresponding classification label is, for example, a cross-entropy loss function.
2. The second initial classification model is trained based on the second initial in-vivo detection result.
The second initial living body detection result is obtained after the classification processing of the second initial classification model, and the second initial classification model is trained according to it. In one possible implementation manner, each training sample in the target training sample has a classification label indicating whether the target object corresponding to the training sample is a living body; the result indicated by the classification label is the real result. Each training sample corresponds to one second initial living body detection result, which indicates whether the target object corresponding to the training sample is a living body at the reflected sound wave level; the second initial living body detection result is the result predicted by the model.
In one possible implementation manner, the process of training the second initial classification model based on the second initial in-vivo detection result is as follows: acquiring a second classification loss function based on the second initial living body detection result and the classification label; parameters of the second initial classification model are updated based on the second classification loss function. It should be noted that, since each training sample corresponds to one second initial in-vivo detection result and one classification label, the process of obtaining the second classification loss function based on the second initial in-vivo detection result and the classification label may be: and respectively calculating a loss function between each second initial living body detection result and the corresponding classification label, and taking the average value of the calculated loss functions as a second classification loss function. The embodiment of the present application does not limit the type of the loss function between the second initial in-vivo detection result and the corresponding classification label, and the loss function between the second initial in-vivo detection result and the corresponding classification label is, for example, a cross-entropy loss function.
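Illustratively, taking the cross-entropy loss mentioned above as an example, the averaged classification loss of either branch can be sketched in torch as follows; shapes and label encoding are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy between predicted detection results and
    classification labels.

    logits: (N, 2) scores from the first (or second) initial classification
    model; labels: (N,) with 1 = living body, 0 = non-living body.
    F.cross_entropy averages the per-sample losses by default.
    """
    return F.cross_entropy(logits, labels)

logits = torch.randn(8, 2, requires_grad=True)
loss = classification_loss(logits, torch.randint(0, 2, (8,)))
loss.backward()  # gradients then update the classification model parameters
```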
3. The initial discriminator model is trained based on the sample feature groups.
Each sample feature group comprises an initial image feature and an initial sound wave feature, the initial discriminator model is used for matching the image feature and the sound wave feature in the sample feature group, and the initial discriminator model can be trained based on the sample feature group. In one possible implementation, the set of sample features includes a set of positive sample features and a set of negative sample features. In one possible implementation, the structure of the initial discriminator model includes, but is not limited to, the following two cases.
Case 1: the initial discriminator model includes an initial similarity operator model.
In one possible implementation, in this case 1, the process of training the initial discriminator model based on the sample feature set is: calling an initial similarity calculation operator model, and calculating a first similarity value between an initial image feature and an initial sound wave feature in the positive sample feature group and a second similarity value between the initial image feature and the initial sound wave feature in the negative sample feature group; calculating a first target loss function based on the first similarity value, the second similarity value and a first reference threshold; parameters of the initial similarity operator model are updated based on the first objective loss function.
In one possible implementation manner, the main body of the initial similarity operator model is a similarity calculation matrix, the initial image feature and the initial sound wave feature in each sample feature group are vectors, in the initial similarity operator model, a product between a vector corresponding to the initial image feature, the similarity calculation matrix and a vector corresponding to the initial sound wave feature is calculated, and the product is used as a similarity value between the image feature and the sound wave feature.
Each positive sample feature set corresponds to a first similarity value, and each negative sample feature set corresponds to a second similarity value. In a possible implementation manner, in the process of calculating the first target loss function based on the first similarity value, the second similarity value and the first reference threshold, all the first similarity values and all the second similarity values may be directly used, or only the first similarity value satisfying the first similarity condition and the second similarity value satisfying the second similarity condition may be used, which is not limited in the embodiment of the present application. Satisfying the first similarity condition may mean that the similarity value is not greater than a first similarity threshold, and satisfying the second similarity condition may mean that the similarity value is not less than a second similarity threshold. The first similarity threshold is greater than the second similarity threshold, and the first similarity threshold and the second similarity threshold may be set empirically, which is not limited in this embodiment of the present application.
The first similarity value corresponds to a positive sample feature group, whose initial image feature and initial sound wave feature come from the same target object, so the first similarity value should be large; the second similarity value corresponds to a negative sample feature group, whose initial image feature and initial sound wave feature come from different target objects, so the second similarity value should be small. A first similarity value that is not larger than the first similarity threshold and a second similarity value that is not smaller than the second similarity threshold are inaccurately predicted similarity values; therefore, computing the first target loss function using only the first similarity values satisfying the first similarity condition and the second similarity values satisfying the second similarity condition updates the model parameters in a more targeted way and yields a better training effect.
In one possible implementation, the first target loss function may be calculated based on the following formula 1:

$$L_1 = \sum_{j \neq sig} \max\bigl(0,\; \Delta - f_{sig}\, M\, v_{img} + f_{j}\, M\, v_{img}\bigr) \qquad \text{(formula 1)}$$

where $L_1$ represents the first target loss function; $\Delta$ represents the first reference threshold; $f_{sig}$ represents the initial sound wave feature corresponding to the reflected sound wave $sig$; $v_{img}$ represents the initial image feature corresponding to the image $img$; $M$ represents the similarity calculation matrix; and $f_{j}$ represents the initial sound wave feature corresponding to a reflected sound wave $j$, where $j$ ranges over the reflected sound waves other than $sig$. In one possible implementation, the initial sound wave features may be row vectors, the initial image features may be column vectors, and $M$ may be a linear transformation matrix. In formula 1, $f_{sig}\, M\, v_{img}$ represents a first similarity value involved in the calculation of the first target loss function, and $f_{j}\, M\, v_{img}$ represents a second similarity value involved in the calculation of the first target loss function. The optimization direction embodied by the first target loss function calculated with formula 1 is: the correlation between an initial image feature and an initial sound wave feature from the same object should be as high as possible, the correlation between an initial image feature and an initial sound wave feature from different objects should be as low as possible, and the difference between the two correlations should be larger than the first reference threshold; data that does not satisfy this optimization direction is penalized.
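For illustration, a minimal PyTorch sketch of the hinge form of formula 1, applied to precomputed similarity values; the batch shapes and the per-anchor set of mismatched sound waves are assumptions:

```python
import torch

def first_target_loss(sim_pos: torch.Tensor,
                      sim_neg: torch.Tensor,
                      margin: float) -> torch.Tensor:
    # sim_pos: (N,)   first similarity values, one per positive group
    # sim_neg: (N, K) second similarity values, K mismatched sound waves
    #                 per anchor image
    # Penalise every pair whose same-object similarity does not exceed the
    # mismatched-object similarity by at least `margin` (the first
    # reference threshold).
    return torch.clamp(margin - sim_pos.unsqueeze(1) + sim_neg, min=0.0).sum()
```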
After the first target loss function is obtained, the parameters of the initial similarity calculation submodel are updated based on the first target loss function, thereby achieving the purpose of training the initial discriminator model.
Case 2: the initial discriminator model comprises an initial feature fusion submodel and an initial classification processing submodel.
In one possible implementation, the positive sample feature set corresponds to a first tag, and the negative sample feature set corresponds to a second tag, where the first tag is used to indicate that the initial image feature and the initial acoustic feature are from the same object, and the second tag is used to indicate that the initial image feature and the initial acoustic feature are from different objects. In this case 2, the process of training the initial discriminator model based on the sample feature group includes the following steps a to c.
Step a: inputting the initial image characteristic and the initial sound wave characteristic in the positive sample characteristic group into an initial characteristic fusion sub-model for fusion to obtain a first fusion characteristic; and calling the initial classification processing submodel to perform classification processing on the first fusion characteristics to obtain a first classification result.
The number of the positive sample feature groups is at least two, the initial image features and the initial sound wave features in the positive sample feature groups are input into the initial feature fusion submodel for fusion, and the process of obtaining the first fusion features is as follows: and respectively inputting the initial image characteristic and the initial sound wave characteristic in each positive sample characteristic group into the initial characteristic fusion sub-model for fusion to obtain a first fusion characteristic corresponding to each positive sample characteristic group. Calling the initial classification processing submodel to perform classification processing on the first fusion characteristics, wherein the process of obtaining a first classification result comprises the following steps: and calling the initial classification processing submodel to classify each first fusion feature respectively to obtain a first classification result corresponding to each first fusion feature. And the first classification result corresponding to each first fusion feature is used for indicating whether the initial image feature and the initial sound wave feature which form the first fusion feature are from the same target object or not, and the first classification result is a predicted result.
Step b: inputting the initial image features and the initial sound wave features in the negative sample feature group into the initial feature fusion submodel for fusion to obtain second fusion features, and calling the initial classification processing submodel to classify the second fusion features to obtain a second classification result.
The number of the negative sample feature groups is at least two, the image features and the sound wave features in the negative sample feature groups are input into the initial feature fusion submodel for fusion, and the process of obtaining the second fusion features is as follows: and inputting the image characteristics and the sound wave characteristics in each negative sample characteristic group into the initial characteristic fusion sub-model for fusion to obtain second fusion characteristics corresponding to each negative sample characteristic group. Calling the initial classification processing submodel to perform classification processing on the second fusion characteristics, wherein the process of obtaining a second classification result comprises the following steps: and calling the initial classification processing submodel to classify each second fusion feature respectively to obtain a second classification result corresponding to each second fusion feature. And the second classification result corresponding to each second fusion feature is used for indicating whether the initial image feature and the initial sound wave feature which form the second fusion feature are from the same target object or not, and the second classification result is a predicted result.
In a possible implementation manner, fusing the image feature and the sound wave feature may refer to summing the two features or to concatenating them, which is not limited in this embodiment of the present application. The initial classification processing submodel may include an activation function layer, and the activation function layer is used to classify the fused features and obtain a classification result.
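As one possible reading of case 2, a PyTorch sketch in which fusion is concatenation and the classification processing submodel is a single linear layer followed by a sigmoid activation layer; the layer sizes and the choice of concatenation over summation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionDiscriminator(nn.Module):
    """Feature fusion submodel (concatenation) plus classification
    processing submodel (linear layer + sigmoid activation layer)."""

    def __init__(self, img_dim: int = 256, wave_dim: int = 128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + wave_dim, 1),
            nn.Sigmoid(),  # activation layer yielding a same-object score
        )

    def forward(self, img_feat: torch.Tensor,
                wave_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_feat, wave_feat], dim=-1)  # concatenation
        return self.classifier(fused)
```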
Step c: calculating a second target loss function based on a loss function between the first classification result and the first label and a loss function between the second classification result and the second label; and updating the parameters of the initial feature fusion submodel and the initial classification processing submodel based on the second target loss function.
The first classification result is a prediction result corresponding to the positive sample feature set, the first label is a real result corresponding to the positive sample feature set, and the loss function between the first classification result and the first label is a loss function between each first classification result and the first label. The embodiment of the present application does not limit the type of the loss function between the first classification result and the first label, and the loss function between the first classification result and the first label is, for example, a cross entropy loss function.
The second classification result is a prediction result corresponding to the negative sample feature set, the second label is a real result corresponding to the negative sample feature set, and the loss function between the second classification result and the second label is a loss function between each second classification result and the second label. The embodiment of the present application does not limit the type of the loss function between the second classification result and the second label, and the loss function between the second classification result and the second label is, for example, a cross entropy loss function.
After obtaining a loss function between the first classification result and the first label and a loss function between the second classification result and the second label, a second target loss function is calculated based on the loss function between the first classification result and the first label and the loss function between the second classification result and the second label. In one possible implementation, based on the loss function between the first classification result and the first tag, and the loss function between the second classification result and the second tag, the process of calculating the second target loss function is: calculating a first average loss function of the loss functions between each first classification result and the first label and a second average loss function of the loss functions between each second classification result and the second label; a second target loss function is calculated based on the first average loss function and the second average loss function.
In one possible implementation, based on the first average loss function and the second average loss function, the process of calculating the second target loss function is: setting a first weight for the first average loss function and a second weight for the second average loss function; the sum of the first product and the second product is taken as a second target loss function. Wherein the first product is a product of a first average loss function and a first weight, and the second product is a product of a second average loss function and a second weight. The first weight and the second weight may be set empirically, which is not limited in the embodiment of the present application, for example, the first weight is set to 0.5, and the second weight is also set to 0.5.
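A small sketch of this weighting scheme, assuming binary cross entropy as the per-group loss (the text names cross entropy only as an example) and the 0.5/0.5 weights mentioned above:

```python
import torch
import torch.nn.functional as F

def second_target_loss(pos_scores: torch.Tensor,
                       neg_scores: torch.Tensor,
                       w_pos: float = 0.5,
                       w_neg: float = 0.5) -> torch.Tensor:
    # Scores are sigmoid outputs in [0, 1] from the discriminator.
    pos_labels = torch.ones_like(pos_scores)   # first label: same object
    neg_labels = torch.zeros_like(neg_scores)  # second label: different objects
    loss_pos = F.binary_cross_entropy(pos_scores, pos_labels)  # first average loss
    loss_neg = F.binary_cross_entropy(neg_scores, neg_labels)  # second average loss
    return w_pos * loss_pos + w_neg * loss_neg  # sum of the two weighted products
```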
And after the second target loss function is obtained, updating the parameters of the initial feature fusion submodel and the initial classification processing submodel based on the second target loss function, thereby achieving the purpose of training the initial discriminator model.
4. And training the initial image feature extraction model based on the first initial living body detection result and the sample feature group.
The features extracted by the initial image feature extraction model serve two purposes: on one hand, they are used to obtain the first initial living body detection result at the image level; on the other hand, they are used to form the sample feature groups to be input into the discriminator model. Therefore, the initial image feature extraction model is trained based on both the first initial living body detection result and the sample feature groups.
In one possible implementation manner, the sample feature groups include positive sample feature groups and negative sample feature groups. Corresponding to the two different model structures of the initial discriminator model, the process of training the initial image feature extraction model based on the first initial living body detection result and the sample feature groups includes the following two processes.
The first process is as follows: acquiring a first classification loss function based on the first initial living body detection result and the classification label; calling the initial similarity calculation submodel, and calculating a first similarity value between the initial image feature and the initial sound wave feature in the positive sample feature group and a second similarity value between the initial image feature and the initial sound wave feature in the negative sample feature group; calculating a first target loss function based on the first similarity value, the second similarity value and the first reference threshold; and updating the parameters of the initial image feature extraction model based on the first classification loss function and the first target loss function.
This first process occurs when the initial discriminator model includes the initial similarity calculation submodel. The process of obtaining the first classification loss function may refer to the process of training the first initial classification model based on the first initial living body detection result, and details are not repeated here. The process of calculating the first target loss function may refer to case 1 in the process of training the initial discriminator model based on the sample feature groups, and details are not repeated here.
And a second process: acquiring a first classification loss function based on the first initial living body detection result and the classification label; inputting the initial image characteristic and the initial sound wave characteristic in the positive sample characteristic group into an initial characteristic fusion sub-model for fusion to obtain a first fusion characteristic; calling an initial classification processing sub-model to perform classification processing on the first fusion characteristics to obtain a first classification result; inputting the initial image features and the initial sound wave features in the negative sample feature group into an initial feature fusion submodel for fusion to obtain second fusion features, and calling the initial classification processing submodel to classify the second fusion features to obtain a second classification result; calculating a second target loss function based on a loss function between the first classification result and the first label and a loss function between the second classification result and the second label; and updating parameters of the initial image feature extraction model based on the first classification loss function and the second target loss function.
The second process occurs when the initial discriminator model comprises an initial feature fusion submodel and an initial classification processing submodel. The process of obtaining the first classification loss function may refer to a process of training the first initial classification model based on the first initial in-vivo detection result, and details are not repeated here. The process of calculating the second target loss function may refer to case 2 in the process of training the initial discriminator model based on the sample feature set, and details are not repeated here.
According to the first process and the second process, the first classification loss function is used to update not only the parameters of the first initial classification model but also the parameters of the initial image feature extraction model; the target loss function (the first target loss function or the second target loss function) is used to update not only the parameters of the submodels in the initial discriminator model but also the parameters of the initial image feature extraction model.
5. And training the initial sound wave feature extraction model based on the second initial living body detection result and the sample feature group.
The features extracted by the initial sound wave feature extraction model serve two purposes: on one hand, they are used to obtain the second initial living body detection result at the reflected sound wave level; on the other hand, they are used to form the sample feature groups to be input into the discriminator model. Therefore, the initial sound wave feature extraction model is trained based on both the second initial living body detection result and the sample feature groups.
In one possible implementation manner, the sample feature groups include positive sample feature groups and negative sample feature groups. Corresponding to the two different model structures of the initial discriminator model, the process of training the initial sound wave feature extraction model based on the second initial living body detection result and the sample feature groups includes the following two processes.
Process 1: acquiring a second classification loss function based on the second initial living body detection result and the classification label; calling the initial similarity calculation submodel, and calculating a first similarity value between the initial image feature and the initial sound wave feature in the positive sample feature group and a second similarity value between the initial image feature and the initial sound wave feature in the negative sample feature group; calculating a first target loss function based on the first similarity value, the second similarity value and the first reference threshold; and updating the parameters of the initial sound wave feature extraction model based on the second classification loss function and the first target loss function.
This process 1 occurs when the initial discriminator model comprises the initial similarity calculation submodel. The process of obtaining the second classification loss function may refer to the process of training the second initial classification model based on the second initial living body detection result, and details are not repeated here. The process of calculating the first target loss function may refer to case 1 in the process of training the initial discriminator model based on the sample feature groups, and details are not repeated here.
Process 2: acquiring a second classification loss function based on the second initial living body detection result and the classification label; inputting the initial image feature and the initial sound wave feature in the positive sample feature group into the initial feature fusion submodel for fusion to obtain a first fusion feature; calling the initial classification processing submodel to classify the first fusion feature to obtain a first classification result; inputting the initial image feature and the initial sound wave feature in the negative sample feature group into the initial feature fusion submodel for fusion to obtain a second fusion feature, and calling the initial classification processing submodel to classify the second fusion feature to obtain a second classification result; calculating a second target loss function based on the loss function between the first classification result and the first label and the loss function between the second classification result and the second label; and updating the parameters of the initial sound wave feature extraction model based on the second classification loss function and the second target loss function.
This process 2 occurs when the initial discriminator model includes an initial feature fusion submodel and an initial classification processing submodel. The process of obtaining the second classification loss function may refer to a process of training the second initial classification model based on the second initial in-vivo detection result, and details are not repeated here. The process of calculating the second target loss function may refer to case 2 in the process of training the initial discriminator model based on the sample feature set, and details are not repeated here.
According to the above processes 1 and 2, the second classification loss function is used to update not only the parameters of the second initial classification model but also the parameters of the initial sound wave feature extraction model; the target loss function (the first target loss function or the second target loss function) is used to update not only the parameters of the submodels in the initial discriminator model but also the parameters of the initial sound wave feature extraction model.
According to the training process, the first classification loss function is used for updating parameters of the first initial classification model and parameters of the initial image feature extraction model, the second classification loss function is used for updating parameters of the second initial classification model and parameters of the initial sound wave feature extraction model, and the target loss function (the first target loss function or the second target loss function) is used for updating parameters of the submodel in the initial discriminator model, parameters of the initial image feature extraction model and parameters of the initial sound wave feature extraction model.
It should be noted that the above processes of training the individual models are performed simultaneously; that is, each model in the to-be-trained living body detection model is trained once, at the same time, according to the target training sample. After one round of training, whether the training termination condition is met is judged. When the training termination condition is not met, training samples are selected again from the at least two training sample subsets to form a new target training sample, and each model in the living body detection model is trained again according to the new target training sample; this process is repeated until the training termination condition is met. The image feature extraction model obtained when the training termination condition is met is taken as the target image feature extraction model, the sound wave feature extraction model obtained at that point is taken as the target sound wave feature extraction model, the first classification model obtained at that point is taken as the first target classification model, the second classification model obtained at that point is taken as the second target classification model, and the discriminator model obtained at that point is taken as the target discriminator model. At this point, the target image feature extraction model, the target sound wave feature extraction model, the first target classification model, the second target classification model and the target discriminator model have been obtained through training.
It should be noted that the new target training sample for training again may be the same as or different from the target training sample used in the previous training process, and this is not limited in the embodiment of the present application.
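Purely as an orientation aid, a skeleton of this outer training loop; the sampling, joint-update and termination callables are injected rather than defined, since they stand in for the procedures described above (the termination conditions are detailed below):

```python
def train_liveness_models(models, sample_batch, train_once, termination_met,
                          max_iters=100000):
    """Each iteration draws a fresh target training sample, updates every
    model in the to-be-trained living body detection model once, then
    checks the training termination condition."""
    for step in range(1, max_iters + 1):
        batch = sample_batch()              # may repeat or differ per round
        losses = train_once(models, batch)  # one simultaneous update of all models
        if termination_met(step, losses):
            break
    return models
```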
In one possible implementation, satisfying the training termination condition includes, but is not limited to, the following three cases.
In case 1, the number of iterative training rounds reaches a count threshold.
The count threshold may be set empirically, or may be flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application.
In case 2, each loss function is less than the corresponding loss threshold.
It should be noted that different loss functions may correspond to the same loss threshold or different loss thresholds, which is not limited in the embodiment of the present application.
In case 3, the loss functions all converge.
Convergence of a loss function means that, as the number of training iterations increases, the fluctuation of the loss function over the training results of a reference number of iterations stays within a reference range. For example, assume the reference range is $-10^{-3} \sim 10^{-3}$ and the reference number is 10: if the fluctuation range of the loss function over 10 iterative training results lies within $-10^{-3} \sim 10^{-3}$, the loss function is considered to have converged.
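A direct reading of this criterion as code, assuming "fluctuation within the reference range" means every one of the most recent loss values stays within the reference range of their mean:

```python
def has_converged(loss_history, reference_count=10, reference_range=1e-3):
    """True if the last `reference_count` loss values all lie within
    `reference_range` of their mean, i.e. the loss has stopped moving."""
    if len(loss_history) < reference_count:
        return False
    recent = loss_history[-reference_count:]
    mean = sum(recent) / reference_count
    return all(abs(value - mean) <= reference_range for value in recent)
```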
Step 7026: and responding to the training to obtain a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model, and obtaining a living body detection model.
And when a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model are obtained through training, a living body detection model can be obtained. The living body detection model obtained at this time can more accurately extract the image feature and the sound wave feature containing information in the aspect of living body detection, so that whether a certain object is a living body can be more accurately judged on the image level and the reflected sound wave level respectively, and whether the image feature and the sound wave feature are from the same object can be more accurately judged.
It should be noted that, in order to avoid confusing the model during learning, the training sample set is composed of training samples corresponding to simple objects in natural scenes; such a training sample set ensures that the features the model extracts in natural scenes have strong correlation. Complex scenes (such as goods carrying advertisements, often found in supermarkets) can have a large impact on model learning. For example, many commodities print the photo of a spokesperson on the commodity itself (for example, a celebrity on a milk powder can). The shape and surface of the commodity itself do not change, so from the reflected sound wave level it is one and the same object; but because of the surface advertisement, it looks very different at the image level. The surface image (or color) is decoupled from the real surface structure, so its correlation with the reflected sound wave is very low, and including such samples in the training stage would mislead model learning. For this reason, the training sample set adopted in the training process is composed of training samples corresponding to simple objects in natural scenes.

The main objective of the living body detection model is to remain stable under attack. Therefore, after the living body detection model is obtained, it can be optimized with attack samples, so that the optimized living body detection model can perform living body detection stably in complex scenes.
In one possible implementation, referring to fig. 10, the procedure of optimizing the in-vivo detection model includes the following steps 1001 to 1004.
Step 1001: acquiring an attack sample set, where the attack sample set comprises at least one attack sample subset, different attack sample subsets correspond to different attack objects, any attack sample subset comprises at least one attack sample corresponding to one attack object, and any attack sample corresponding to an attack object comprises an image corresponding to the attack object and the reflected sound wave corresponding to that image.
The attack sample set is used to optimize the living body detection model obtained through training; the attack sample set comprises at least one attack sample subset, and each attack sample subset corresponds to one attack object. An attack object is an object used for collecting attack samples. In a possible implementation manner, an attack object is an object commonly seen in visual living body detection attack scenes, for example, a printed face picture, a face photo or video displayed on a mobile phone or tablet computer, or a face mask covering a real face. Samples collected from such attack objects are difficult to distinguish from a real human face at the image level, but the reflecting surface of an attack object differs greatly from a real human face, so attack objects are highly distinguishable at the reflected sound wave level.
And aiming at each attacking object, acquiring an attack sample subset, wherein each attack sample subset comprises at least one attack sample corresponding to one attacking object, and each attack sample corresponding to one attacking object comprises an image corresponding to the attacking object and a reflected sound wave corresponding to the image. The process of obtaining any attack sample subset can refer to step 1 to step 5 in step 701, which is not described herein again.
Step 1002: selecting attack samples from the attack sample set to form target attack samples, and calling a target image feature extraction model to extract target image features of images in the target attack samples; and calling a target sound wave characteristic extraction model to extract the target sound wave characteristics of the reflected sound waves in the target attack sample.
The attack sample set includes at least one attack sample subset, and selecting an attack sample in the attack sample set may refer to selecting an attack sample in one attack sample subset, or selecting an attack sample in a plurality of attack sample subsets, which is not limited in the embodiment of the present application. And forming a target attack sample by using the selected attack sample, wherein the target attack sample is used for optimizing the living body detection model once. It should be noted that the target attack sample may include one attack sample or may include a plurality of attack samples, which is not limited in the embodiment of the present application.
The target image feature extraction model is a trained image feature extraction model and is used for extracting the features of the image. In one possible implementation manner, the process of invoking the target image feature extraction model to extract the target image features of the images in the target attack sample is as follows: and inputting the image in the target attack sample into the target image feature extraction model to obtain the target image feature extracted by the target image feature extraction model. It should be noted that the target attack sample may include one or more attack samples, each attack sample includes an image, and the target image features extracted by invoking the target image feature extraction model are target image features corresponding to each image.
The target sound wave characteristic extraction model is a trained sound wave characteristic extraction model and is used for extracting the characteristics of the reflected sound waves. In one possible implementation manner, the process of invoking the target sound wave feature extraction model to extract the target sound wave feature of the reflected sound wave in the target attack sample is as follows: and inputting the reflected sound waves in the target attack sample into the target sound wave characteristic extraction model to obtain the target sound wave characteristics extracted by the target sound wave characteristic extraction model. It should be noted that the target attack sample may include one or more attack samples, each attack sample includes a reflected sound wave, and the target sound wave feature extracted by invoking the target sound wave feature extraction model is a target sound wave feature corresponding to each reflected sound wave.
Step 1003: and forming attack characteristic groups based on the target image characteristics and the target sound wave characteristics, wherein any attack characteristic group is formed by one target image characteristic and one target sound wave characteristic.
In one possible implementation, based on the target image features and the target sound wave features, the attack feature groups are formed as follows: determining, based on the target image features and the target sound wave features, the target image feature-target sound wave feature pairs that satisfy the reference matching condition; and forming the attack feature groups based on the target image feature-target sound wave feature pairs that satisfy the reference matching condition. Satisfying the reference matching condition means that the image corresponding to the target image feature and the reflected sound wave corresponding to the target sound wave feature come from the same attack sample among the target attack samples. The number of attack feature groups is consistent with the number of attack samples included in the target attack sample.
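Because the reference matching condition simply pairs the two features extracted from the same attack sample, the grouping reduces to a positional pairing; a sketch, assuming both feature lists are index-aligned with the attack samples:

```python
def form_attack_feature_groups(target_image_feats, target_wave_feats):
    """One attack feature group per attack sample: the image feature and
    the sound wave feature extracted from that sample."""
    assert len(target_image_feats) == len(target_wave_feats)
    return list(zip(target_image_feats, target_wave_feats))
```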
Step 1004: and optimizing a target discriminator model in the living body detection model based on the attack characteristic group, and obtaining the optimized living body detection model based on the optimized target discriminator model.
And after the attack characteristic group is obtained, optimizing a target discriminator model in the living body detection model based on the attack characteristic group. In one possible implementation manner, the process of optimizing the target discriminator model in the living body detection model based on the attack feature group includes the following two processes corresponding to two different model structures of the target discriminator model.
The process a: acquiring reference feature groups, where any reference feature group is composed of one reference image feature and one reference sound wave feature; calling the target similarity calculation submodel, and calculating a third similarity value between the reference image feature and the reference sound wave feature in the reference feature group and a fourth similarity value between the target image feature and the target sound wave feature in the attack feature group; calculating a third target loss function based on the third similarity value, the fourth similarity value and a second reference threshold; and updating the parameters of the target similarity calculation submodel based on the third target loss function.
This process a occurs when the target discriminator model includes the target similarity calculation submodel. A reference feature group is a feature group used for forward guidance, which means that the similarity value between the reference image feature and the reference sound wave feature in the reference feature group is made high. The reference image feature and the reference sound wave feature in a reference feature group come from the same non-attack object. A non-attack object is an object for which the same living body detection result is obtained at the image level and at the reflected sound wave level; the non-attack object may be a target object used for acquiring the training sample set in the model training process, or an object in another simple scene, which is not limited in the embodiment of the present application.
The reference feature set may be obtained based on the positive sample feature set obtained in the model training process, or may be obtained again. For the case where the reference feature set is obtained based on the positive sample feature set obtained in the model training process, part or all of the positive sample feature set may be used as the reference feature set. For the case that the reference feature group is obtained again, the process of obtaining the reference feature group may refer to the process of obtaining the positive sample feature group in the model training process, and details are not repeated here.
The image corresponding to the target image feature and the reflected sound wave corresponding to the target sound wave feature in the attack feature group come from the same attack object, and the same attack object has different living body detection results on an image layer and a reflected sound wave layer, so that the attack feature group is used as negative direction guidance, and the negative direction guidance means that the similarity value between the target image feature and the target sound wave feature in the attack feature group is low.
After the reference feature groups are obtained, the target similarity calculation submodel is called to calculate a third similarity value between the reference image feature and the reference sound wave feature in the reference feature group and a fourth similarity value between the target image feature and the target sound wave feature in the attack feature group, and then a third target loss function is calculated based on the third similarity value, the fourth similarity value and the second reference threshold.
The third similarity value is the similarity value corresponding to the reference feature group; since the reference feature group is used for forward guidance, the third similarity value is required to be large. The fourth similarity value is the similarity value corresponding to the attack feature group; since the attack feature group is used for negative guidance, the fourth similarity value should be small. In one possible implementation, the third target loss function may be calculated based on formula 1. It should be noted that, in this step, $\Delta$ in formula 1 represents the second reference threshold, $f_{sig}\, M\, v_{img}$ represents a third similarity value involved in the calculation of the third target loss function, and $f_{j}\, M\, v_{img}$ represents a fourth similarity value involved in the calculation of the third target loss function. The optimization direction embodied by the third target loss function calculated with formula 1 is: the correlation between the reference image feature and the reference sound wave feature from a reference feature group should be as high as possible, the correlation between the target image feature and the target sound wave feature from an attack feature group should be as low as possible, and the difference between these two correlations should be larger than the second reference threshold.
The second reference threshold may be the same as or different from the first reference threshold, and is not limited in this embodiment of the application. In a possible implementation manner, the second reference threshold is larger than the first reference threshold, and the correlation between the target image feature and the target sound wave feature in the attack feature group can be deliberately reduced by adopting a larger reference threshold, so that the attack object can be better identified in the living body detection process.
After the third target loss function is obtained, the parameters of the target similarity calculation submodel are updated based on the third target loss function, thereby achieving the purpose of optimizing the target discriminator model.
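Continuing the earlier loss sketch, process a can be read as reusing `first_target_loss` with the reference-group similarities as the positive term, the attack-group similarities as the negative term, and the second reference threshold as the margin; all tensors and the threshold value here are invented placeholders:

```python
import torch

# Third similarity values, one per reference feature group.
sim_reference = torch.randn(8)
# Fourth similarity values, several attack feature groups per reference group.
sim_attack = torch.randn(8, 4)
second_reference_threshold = 0.5  # assumed; often larger than the first

loss3 = first_target_loss(sim_reference, sim_attack,
                          margin=second_reference_threshold)
```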
And a process b: inputting the target image characteristics and the target sound wave characteristics in the attack characteristic group into a target characteristic fusion sub-model for fusion to obtain third fusion characteristics; calling a target classification processing sub-model to classify the third fusion characteristics to obtain a third classification result; and updating parameters of the target feature fusion sub-model and the target classification processing sub-model based on a loss function between the third classification result and the second label.
This process b occurs when the target discriminator model includes the target feature fusion submodel and the target classification processing submodel. The attack feature group corresponds to the second label, and the second label is used to indicate that the image feature and the sound wave feature come from different objects; the target image feature and the target sound wave feature in an attack sample are regarded as coming from different objects, and supervising with the second label enables attack objects to be better identified.
The third classification result is the prediction result corresponding to the attack feature group, the second label is the real result corresponding to the attack feature group, and the loss function between the third classification result and the second label is the loss function between each third classification result and the second label. The embodiment of the present application does not limit the type of the loss function between the third classification result and the second label; for example, it may be a cross entropy loss function.
And after a loss function between the third classification result and the second label is obtained, updating parameters of the target feature fusion sub-model and the target classification processing sub-model based on the loss function between the third classification result and the second label, so that the aim of optimizing the target discriminator model is fulfilled.
The objective of optimizing the target discriminator model can be achieved in both the above-described process a and the above-described process b. It should be noted that, in the process of optimizing the target discriminator model in the living body detection model based on the attack feature group, other model parameters in the living body detection model are kept unchanged.
For example, as shown in fig. 11, the process of optimizing the target discriminator model may first acquire data for an attack object, and perform image preprocessing on the acquired image data to obtain an image in an attack sample; performing reflected sound wave preprocessing on the collected reflected sound wave data to obtain reflected sound waves in the attack sample; inputting the image into a target image feature extraction model 111 to obtain target image features; inputting the reflected sound waves into a target sound wave feature extraction model 112 to obtain target sound wave features; the target discriminator model 113 is optimized based on the target image features and the target acoustic wave features. It should be noted that, since the first object classification model and the second object classification model are not involved in the process of optimizing the object discriminator model, the first object classification model and the second object classification model are not shown in fig. 11.
The target discriminator model in the living body detection model is optimized based on the attack feature groups to obtain the optimized target discriminator model. An optimized living body detection model may then be obtained based on the optimized target discriminator model. In one possible implementation manner, based on the optimized target discriminator model, the optimized living body detection model is obtained as follows: in the living body detection model, the target discriminator model before optimization is replaced with the optimized target discriminator model, yielding the optimized living body detection model. In the case that the living body detection model is further optimized after being obtained, the optimized living body detection model is the one called in the actual process of performing living body detection on an object to be detected, so as to improve the stability and reliability of the living body detection result.
Optimizing the target discriminator model according to the attack sample set makes the living body detection model more robust against attack objects, so that the living body detection model can withstand attacks from attack objects to a greater extent.
Illustratively, the process of acquiring the optimized in-vivo detection model may be as illustrated by 1201 to 1204 in fig. 12. 1201. Collecting and processing image data and reflected sound wave data to obtain a training sample set and an attack sample set; 1202. training a living body detection model to be trained by utilizing a training sample set to obtain a living body detection model; 1203. optimizing the in-vivo detection model by using the attack sample set; 1204. and outputting the optimized in-vivo detection model.
In the embodiment of the application, because the images and the reflected sound waves can introduce more abundant information, a higher-precision and more robust living body detection effect can be realized, and besides training by using the training sample set, the living body detection model provided by the embodiment of the application can be optimized by using the attack sample, so that the living body detection model can have more excellent performance in some common visual attack scenes.
In the embodiment of the application, the in-vivo detection model is obtained through a training mode, in the process of in-vivo detection based on the in-vivo detection model obtained through training, besides information in the aspect of images, information in the aspect of reflected sound waves and matching information between the two aspects are considered, the considered information is comprehensive, the defense capability against attacks can be improved, the detection stability is high, the accuracy of in-vivo detection results is improved, and the in-vivo detection effect is good.
Referring to fig. 13, an embodiment of the present application provides a living body detection apparatus, including:
the acquiring unit 1301 is used for acquiring a target image and a target reflected sound wave corresponding to an object to be detected;
an extraction unit 1302 for extracting an image feature of the target image and an acoustic wave feature of the target reflected acoustic wave;
the classification processing unit 1303 is used for performing classification processing on the image features and obtaining a first living body detection result of the object to be detected according to the classification result; classifying the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result;
the matching processing unit 1304 is used for matching the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics;
a determining unit 1305 for determining a target live body detection result of the object to be detected based on the first live body detection result, the second live body detection result, and the matching result.
In one possible implementation manner, the extracting unit 1302 is configured to invoke a target image feature extraction model to extract an image feature of a target image; calling a target sound wave feature extraction model to extract sound wave features of target reflected sound waves;
the classification processing unit 1303 is used for calling the first target classification model to perform classification processing on the image features and obtaining a first living body detection result of the object to be detected according to the classification result; calling a second target classification model to classify the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result;
and the matching processing unit 1304 is used for calling the target discriminator model to perform matching processing on the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics.
In one possible implementation manner, the determining unit 1305 is configured to, in response to the first living body detection result indicating that the object to be detected is a living body, the second living body detection result indicating that the object to be detected is a living body, and the matching result indicating that the image feature and the sound wave feature are successfully matched, take passing the living body detection as the target living body detection result of the object to be detected.
In a possible implementation manner, the obtaining unit 1301 is further configured to obtain a training sample set, where the training sample set includes at least two training sample subsets, different training sample subsets correspond to different target objects, any training sample subset includes at least one training sample corresponding to any target object, and any training sample corresponding to any target object includes any image corresponding to any target object and a reflected sound wave corresponding to any image;
referring to fig. 14, the apparatus further comprises:
the training unit 1306 is configured to select training samples from the at least two training sample subsets to form target training samples, and train an initial image feature extraction model, an initial acoustic wave feature extraction model, a first initial classification model, a second initial classification model, and an initial discriminator model in a to-be-trained living body detection model based on the target training samples to obtain a living body detection model, where the living body detection model includes the target image feature extraction model, the target acoustic wave feature extraction model, the first target classification model, the second target classification model, and the target discriminator model.
In one possible implementation manner, the training unit 1306 is configured to invoke an initial image feature extraction model to extract initial image features of images in a target training sample; calling an initial sound wave feature extraction model to extract initial sound wave features of reflected sound waves in a target training sample; calling a first initial classification model to classify the initial image features, and obtaining a first initial living body detection result according to a classification result; calling a second initial classification model to classify the initial sound wave characteristics, and obtaining a second initial living body detection result according to the classification result; forming a sample feature group based on the initial image features and the initial sound wave features, wherein any sample feature group is formed by one initial image feature and one initial sound wave feature; training a first initial classification model based on the first initial in-vivo detection result; training a second initial classification model based on the second initial in-vivo detection result; training an initial discriminator model based on the sample characteristic group; training an initial image feature extraction model based on the first initial living body detection result and the sample feature group; training an initial sound wave feature extraction model based on a second initial living body detection result and the sample feature group; and responding to the training to obtain a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model, and obtaining a living body detection model.
In one possible implementation manner, the sample feature set includes a positive sample feature set and a negative sample feature set, and the training unit 1306 is further configured to determine, based on the initial image feature and the initial acoustic wave feature, an initial image feature-initial acoustic wave feature pair that satisfies a positive matching condition and an initial image feature-initial acoustic wave feature pair that satisfies a negative matching condition; forming a positive sample feature group based on the initial image feature-initial sound wave feature pairs meeting the positive matching condition, and forming a negative sample feature group based on the initial image feature-initial sound wave feature pairs meeting the negative matching condition; the condition that the positive matching is met means that the image corresponding to the initial image characteristic and the reflected sound wave corresponding to the initial sound wave characteristic come from the same training sample in the target training samples; the condition that the negative matching is met means that the image corresponding to the initial image feature and the reflected sound wave corresponding to the initial sound wave feature come from two training samples meeting the reference condition in the target training samples, and the two training samples meeting the reference condition mean two training samples from different training sample subsets.
In one possible implementation manner, the initial discriminator model includes the initial similarity calculation submodel, and the training unit 1306 is further configured to invoke the initial similarity calculation submodel to calculate a first similarity value between the initial image feature and the initial sound wave feature in the positive sample feature group and a second similarity value between the initial image feature and the initial sound wave feature in the negative sample feature group; calculate a first target loss function based on the first similarity value, the second similarity value and the first reference threshold; and update the parameters of the initial similarity calculation submodel based on the first target loss function.
In one possible implementation manner, the positive sample feature group corresponds to a first label, the negative sample feature group corresponds to a second label, and the initial discriminator model comprises an initial feature fusion submodel and an initial classification processing submodel; the training unit 1306 is further configured to input the initial image feature and the initial acoustic wave feature in the positive sample feature group into the initial feature fusion submodel for fusion, so as to obtain a first fusion feature; calling an initial classification processing sub-model to perform classification processing on the first fusion characteristics to obtain a first classification result; inputting the initial image characteristic and the initial sound wave characteristic in the negative sample characteristic group into an initial characteristic fusion sub-model for fusion to obtain a second fusion characteristic; calling the initial classification processing sub-model to classify the second fusion characteristics to obtain a second classification result; calculating a second target loss function based on a loss function between the first classification result and the first label and a loss function between the second classification result and the second label; and updating the parameters of the initial feature fusion submodel and the initial classification processing submodel based on the second target loss function.
In a possible implementation manner, the obtaining unit 1301 is further configured to obtain an attack sample set, where the attack sample set includes at least one attack sample subset, where different attack sample subsets correspond to different attackers, each attack sample subset includes at least one attack sample corresponding to any attacker, and any attack sample corresponding to any attacker includes any image corresponding to any attacker and reflected sound waves corresponding to any image;
referring to fig. 14, the apparatus further comprises:
an optimizing unit 1307, configured to select attack samples in the attack sample set to form target attack samples, and invoke a target image feature extraction model to extract target image features of images in the target attack samples; calling a target sound wave feature extraction model to extract target sound wave features of reflected sound waves in a target attack sample; forming attack characteristic groups based on the target image characteristics and the target sound wave characteristics, wherein any attack characteristic group is formed by one target image characteristic and one target sound wave characteristic; and optimizing a target discriminator model in the living body detection model based on the attack characteristic group, and obtaining the optimized living body detection model based on the optimized target discriminator model.
In a possible implementation manner, the target discriminator model includes the target similarity calculation submodel, and the optimizing unit 1307 is further configured to acquire reference feature groups, where any reference feature group is composed of one reference image feature and one reference sound wave feature; invoke the target similarity calculation submodel to calculate a third similarity value between the reference image feature and the reference sound wave feature in the reference feature group and a fourth similarity value between the target image feature and the target sound wave feature in the attack feature group; calculate a third target loss function based on the third similarity value, the fourth similarity value and the second reference threshold; and update the parameters of the target similarity calculation submodel based on the third target loss function.
In a possible implementation manner, the attack feature group corresponds to the second label, and the target discriminator model comprises a target feature fusion submodel and a target classification processing submodel; the optimizing unit 1307 is further configured to input the target image feature and the target sound wave feature in the attack feature group into the target feature fusion submodel for fusion to obtain a third fusion feature; call the target classification processing submodel to classify the third fusion feature to obtain a third classification result; and update the parameters of the target feature fusion submodel and the target classification processing submodel based on a loss function between the third classification result and the second label.
In a possible implementation manner, the obtaining unit 1301 is further configured to, for any target object, obtain an image sequence of the target object and a reflected sound wave reflected by the target object; divide the reflected sound wave into at least one reflected sub-sound wave; align the at least one reflected sub-sound wave with at least one image in the image sequence to obtain the reflected sub-sound wave aligned with each image; for any image of the at least one image, construct the reflected sound wave corresponding to that image based on the reflected sub-sound wave aligned with it, and form a training sample of the target object based on that image and its corresponding reflected sound wave; and form a training sample subset of the target object from the at least one training sample formed from the at least one image.
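A NumPy sketch of this division-and-alignment step is given below, under the assumption that each image frame carries a capture timestamp and that the sub-wave aligned with a frame is the slice of the recording around that timestamp; the window length and all interfaces are illustrative.

import numpy as np

def align_waves_to_frames(wave, sample_rate, frame_timestamps, window_s=0.1):
    # Divide one reflected sound wave into reflected sub-sound waves and align
    # each image in the sequence with the sub-wave recorded around its timestamp.
    half = int(window_s * sample_rate / 2)
    aligned = []
    for t in frame_timestamps:                    # one timestamp per image
        center = int(t * sample_rate)
        lo, hi = max(0, center - half), min(len(wave), center + half)
        aligned.append(np.asarray(wave[lo:hi]))   # reflected sub-sound wave for this image
    return aligned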
In the embodiment of the application, the image and the reflected sound wave corresponding to the object to be detected are obtained, the image feature and the sound wave feature are extracted, and the living body detection result of the object to be detected is determined by comprehensively considering the detection result obtained from the image feature, the detection result obtained from the sound wave feature, and the matching result between the two features. Because the process draws on image information, reflected sound wave information, and the matching information between the two, the information considered is comprehensive: the defense capability against attacks is improved, detection is stable, and the living body detection result is accurate.
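Claim 3 below makes the combination rule explicit: detection passes only when all three signals agree. A one-line sketch of that conservative AND-combination:

def target_liveness_result(first_is_live, second_is_live, features_matched):
    # Living body detection passes only when the image-based result, the
    # sound-wave-based result, and the feature matching result all agree.
    return first_is_live and second_is_live and features_matched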
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application. Servers may vary considerably in configuration and performance; the server may include one or more processors (CPUs) 1501 and one or more memories 1502, where the one or more memories 1502 store at least one program code that is loaded and executed by the one or more processors 1501 to implement the living body detection method provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing device functions, which are not described herein.
Fig. 16 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal may be: a smartphone, a tablet, a laptop, or a desktop computer. A terminal may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
Generally, a terminal includes: a processor 1601, and a memory 1602.
Processor 1601 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 1601 may be implemented in hardware as at least one of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). Processor 1601 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1601 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. The memory 1602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1602 is used to store at least one instruction for execution by the processor 1601 to implement the liveness detection method provided by the method embodiments of the present application.
In some embodiments, the terminal may further include: peripheral interface 1603 and at least one peripheral. Processor 1601, memory 1602 and peripheral interface 1603 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1603 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1604, a touch screen display 1605, a camera assembly 1606, audio circuitry 1607, a positioning assembly 1608, and a power supply 1609.
Peripheral interface 1603 can be used to connect at least one I/O (Input/Output) related peripheral to processor 1601 and memory 1602. In some embodiments, processor 1601, memory 1602, and peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1602 and the peripheral device interface 1603 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 1604 converts the electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1604 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 1605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1605 is a touch display screen, the display screen 1605 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1601 as a control signal for processing. At this point, the display 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1605, disposed on the front panel of the terminal; in other embodiments, there may be at least two displays 1605, disposed on different surfaces of the terminal or in a folded design; in still other embodiments, the display 1605 may be a flexible display disposed on a curved or folded surface of the terminal. The display 1605 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display 1605 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1606 is used to capture images or video. Optionally, camera assembly 1606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1606 can also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1601 for processing or inputting the electric signals to the radio frequency circuit 1604 to achieve voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones can be arranged at different parts of the terminal respectively. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1607 may also include a headphone jack.
The positioning component 1608 is used to locate the current geographic location of the terminal to implement navigation or LBS (Location Based Service). The positioning component 1608 may be a positioning component based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
A power supply 1609 is used to power the various components in the terminal. Power supply 1609 may be alternating current, direct current, disposable or rechargeable. When power supply 1609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal also includes one or more sensors 1610. The one or more sensors 1610 include, but are not limited to: acceleration sensor 1611, gyro sensor 1612, pressure sensor 1613, fingerprint sensor 1614, optical sensor 1615, and proximity sensor 1616.
The acceleration sensor 1611 may detect the magnitude of acceleration on three coordinate axes of a coordinate system established with the terminal. For example, the acceleration sensor 1611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1601 may control the touch display screen 1605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611. The acceleration sensor 1611 may also be used for acquisition of motion data of a game or a user.
The gyroscope sensor 1612 can detect the body direction and the rotation angle of the terminal, and the gyroscope sensor 1612 can cooperate with the acceleration sensor 1611 to acquire the 3D action of the user on the terminal. From the data collected by the gyro sensor 1612, the processor 1601 may perform the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 1613 may be disposed on a side bezel of the terminal and/or on an underlying layer of the touch display 1605. When the pressure sensor 1613 is disposed on the side frame of the terminal, a user's holding signal to the terminal may be detected, and the processor 1601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1613. When the pressure sensor 1613 is disposed at the lower layer of the touch display 1605, the processor 1601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 1605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1614 is configured to collect a fingerprint of the user, and the processor 1601 identifies the user's identity based on the fingerprint collected by the fingerprint sensor 1614, or the fingerprint sensor 1614 identifies the user's identity based on the collected fingerprint. Upon recognizing the user's identity as trusted, the processor 1601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal. When a physical key or vendor logo is provided on the terminal, the fingerprint sensor 1614 may be integrated with the physical key or vendor logo.
The optical sensor 1615 is used to collect ambient light intensity. In one embodiment, the processor 1601 may control the display brightness of the touch display screen 1605 based on the ambient light intensity collected by the optical sensor 1615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1605 is increased; when the ambient light intensity is low, the display brightness of the touch display 1605 is turned down. In another embodiment, the processor 1601 may also dynamically adjust the shooting parameters of the camera assembly 1606 based on the ambient light intensity collected by the optical sensor 1615.
The proximity sensor 1616, also called a distance sensor, is generally provided on the front panel of the terminal. The proximity sensor 1616 is used to collect the distance between the user and the front surface of the terminal. In one embodiment, when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal gradually decreases, the processor 1601 controls the touch display 1605 to switch from the bright-screen state to the screen-off state; when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal gradually increases, the processor 1601 controls the touch display 1605 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 16 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
In an exemplary embodiment, a computer device is also provided that includes a processor and a memory having at least one program code stored therein. The at least one program code is loaded into and executed by one or more processors to implement any of the above-described liveness detection methods.
In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor of a computer device to implement any of the above-described liveness detection methods.
Alternatively, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following associated objects.
It is noted that the terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar elements and do not necessarily describe a particular sequential or chronological order. It is to be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the application described herein can be implemented in sequences other than those illustrated or described herein. The embodiments described above are exemplary and do not represent all embodiments consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as recited in the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A living body detection method, the method comprising:
acquiring a target image and a target reflected sound wave corresponding to an object to be detected;
extracting image characteristics of the target image and sound wave characteristics of the target reflected sound wave;
classifying the image features, and obtaining a first living body detection result of the object to be detected according to a classification result; classifying the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to a classification result;
matching the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics;
and determining a target living body detection result of the object to be detected based on the first living body detection result, the second living body detection result and the matching result.
2. The method of claim 1, wherein said extracting image features of the target image and acoustic features of the target reflected acoustic waves comprises:
calling a target image feature extraction model to extract image features of the target image;
calling a target sound wave feature extraction model to extract the sound wave features of the target reflected sound waves;
classifying the image features, and obtaining a first living body detection result of the object to be detected according to a classification result; classifying the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result, wherein the method comprises the following steps:
calling a first target classification model to classify the image features, and obtaining a first living body detection result of the object to be detected according to a classification result; calling a second target classification model to classify the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to the classification result;
the matching processing of the image features and the sound wave features to obtain the matching result of the image features and the sound wave features comprises the following steps:
and calling a target discriminator model to perform matching processing on the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics.
3. The method according to claim 1 or 2, wherein the determining the target living body detection result of the object to be detected based on the first living body detection result, the second living body detection result and the matching result comprises:
in response to the first living body detection result indicating that the object to be detected is a living body, the second living body detection result indicating that the object to be detected is a living body, and the matching result indicating that the image features and the sound wave features are successfully matched, determining that the target living body detection result of the object to be detected is that living body detection is passed.
4. The method of claim 2, wherein prior to said invoking a target image feature extraction model to extract image features of the target image, the method further comprises:
acquiring a training sample set, wherein the training sample set comprises at least two training sample subsets, the different training sample subsets correspond to different target objects, any training sample subset comprises at least one training sample corresponding to any target object, and any training sample corresponding to any target object comprises any image corresponding to any target object and a reflected sound wave corresponding to any image;
respectively selecting training samples from at least two training sample subsets to form target training samples, and training an initial image feature extraction model, an initial sound wave feature extraction model, a first initial classification model, a second initial classification model and an initial discriminator model in a to-be-trained living body detection model based on the target training samples to obtain the living body detection model, wherein the living body detection model comprises a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model.
5. The method according to claim 4, wherein the training of an initial image feature extraction model, an initial sound wave feature extraction model, a first initial classification model, a second initial classification model and an initial discriminator model in a to-be-trained living body detection model based on the target training sample to obtain the living body detection model comprises:
calling an initial image feature extraction model to extract initial image features of images in the target training sample;
calling an initial sound wave feature extraction model to extract initial sound wave features of reflected sound waves in the target training sample;
calling a first initial classification model to classify the initial image features, and obtaining a first initial living body detection result according to a classification result; calling a second initial classification model to classify the initial sound wave characteristics, and obtaining a second initial living body detection result according to a classification result;
forming sample feature groups based on the initial image features and the initial sound wave features, wherein any sample feature group is formed by one initial image feature and one initial sound wave feature;
training the first initial classification model based on the first initial living body detection result; training the second initial classification model based on the second initial living body detection result; training the initial discriminator model based on the sample feature group; training the initial image feature extraction model based on the first initial living body detection result and the sample feature group; training the initial sound wave feature extraction model based on the second initial living body detection result and the sample feature group;
and responding to the training to obtain a target image feature extraction model, a target sound wave feature extraction model, a first target classification model, a second target classification model and a target discriminator model, and obtaining a living body detection model.
6. The method of claim 5, wherein the sample feature groups comprise positive sample feature groups and negative sample feature groups, and the forming of sample feature groups based on the initial image features and the initial sound wave features comprises:
determining an initial image feature-initial sound wave feature pair satisfying a positive matching condition and an initial image feature-initial sound wave feature pair satisfying a negative matching condition based on the initial image feature and the initial sound wave feature;
forming a positive sample feature group based on the initial image feature-initial sound wave feature pairs meeting the positive matching condition, and forming a negative sample feature group based on the initial image feature-initial sound wave feature pairs meeting the negative matching condition;
wherein the positive matching condition is that the image corresponding to the initial image feature and the reflected sound wave corresponding to the initial sound wave feature come from the same training sample in the target training samples; and the negative matching condition is that the image corresponding to the initial image feature and the reflected sound wave corresponding to the initial sound wave feature come from two training samples meeting a reference condition in the target training samples, the two training samples meeting the reference condition being two training samples from different training sample subsets.
7. The method of claim 6, wherein the initial discriminator model comprises an initial similarity measurement submodel, and the training of the initial discriminator model based on the sample feature groups comprises:
calling the initial similarity measurement submodel to calculate a first similarity value between the initial image feature and the initial sound wave feature in the positive sample feature group and a second similarity value between the initial image feature and the initial sound wave feature in the negative sample feature group;
calculating a first target loss function based on the first similarity value, the second similarity value, and a first reference threshold; and updating parameters of the initial similarity measurement submodel based on the first target loss function.
8. The method of claim 6, wherein the positive sample feature set corresponds to a first label, the negative sample feature set corresponds to a second label, and the initial discriminator model comprises an initial feature fusion submodel and an initial classification processing submodel; the training of the initial discriminator model based on the sample feature set comprises:
inputting the initial image feature and the initial sound wave feature in the positive sample feature group into the initial feature fusion sub-model for fusion to obtain a first fusion feature; calling the initial classification processing submodel to perform classification processing on the first fusion characteristics to obtain a first classification result;
inputting the initial image feature and the initial sound wave feature in the negative sample feature group into the initial feature fusion sub-model for fusion to obtain a second fusion feature; calling the initial classification processing submodel to perform classification processing on the second fusion characteristics to obtain a second classification result;
calculating a second target loss function based on a loss function between the first classification result and the first label and a loss function between the second classification result and the second label; and updating the parameters of the initial feature fusion sub-model and the initial classification processing sub-model based on the second target loss function.
9. The method of any of claims 4-8, wherein after the obtaining of the living body detection model, the method further comprises:
acquiring an attack sample set, wherein the attack sample set comprises at least one attack sample subset, different attack sample subsets correspond to different attackers, any attack sample subset comprises at least one attack sample corresponding to any attacker, and any attack sample corresponding to any attacker comprises any image corresponding to the attacker and a reflected sound wave corresponding to the image;
selecting attack samples from the attack sample set to form target attack samples, and calling the target image feature extraction model to extract target image features of images in the target attack samples; calling the target sound wave feature extraction model to extract the target sound wave features of the reflected sound waves in the target attack sample;
based on the target image characteristics and the target sound wave characteristics, attack characteristic groups are formed, and any attack characteristic group is formed by one target image characteristic and one target sound wave characteristic;
and optimizing a target discriminator model in the living body detection model based on the attack feature groups, and obtaining the optimized living body detection model based on the optimized target discriminator model.
10. The method of claim 9, wherein the target discriminator model comprises a target similarity measurement submodel, and the optimizing of the target discriminator model in the living body detection model based on the attack feature groups comprises:
acquiring a reference feature group, wherein any reference feature group is composed of a reference image feature and a reference sound wave feature;
calling the target similarity measurement submodel to calculate a third similarity value between the reference image feature and the reference sound wave feature in the reference feature group and a fourth similarity value between the target image feature and the target sound wave feature in the attack feature group;
calculating a third target loss function based on the third similarity value, the fourth similarity value, and a second reference threshold; and updating parameters of the target similarity measurement submodel based on the third target loss function.
11. The method of claim 9, wherein the attack feature group corresponds to the second label, and the target discriminator model comprises a target feature fusion submodel and a target classification processing submodel; and the optimizing of the target discriminator model in the living body detection model based on the attack feature group comprises:
inputting the target image characteristics and the target sound wave characteristics in the attack characteristic group into the target characteristic fusion sub-model for fusion to obtain third fusion characteristics; calling the target classification processing sub-model to classify the third fusion characteristics to obtain a third classification result;
and updating the parameters of the target feature fusion submodel and the target classification processing submodel based on the loss function between the third classification result and the second label.
12. The method according to any one of claims 4-8, wherein the obtaining of any one of the training sample subsets comprises:
for any target object, acquiring an image sequence of the target object and a reflected sound wave reflected by the target object;
dividing the reflected acoustic wave into at least one reflected sub-acoustic wave;
aligning the at least one reflected sub-sound wave with at least one image in the image sequence to obtain reflected sub-sound waves aligned with the at least one image respectively;
for any image in the at least one image, constructing a reflected sound wave corresponding to the any image based on the reflected sub sound waves aligned with the any image; forming any training sample corresponding to any target object based on any image and the reflected sound wave corresponding to any image;
and forming any training sample subset based on at least one training sample corresponding to any target object formed by the at least one image.
13. A living body detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring a target image and a target reflected sound wave corresponding to an object to be detected;
an extracting unit configured to extract an image feature of the target image and a sound wave feature of the target reflected sound wave;
the classification processing unit is used for performing classification processing on the image characteristics and obtaining a first living body detection result of the object to be detected according to a classification result; classifying the sound wave characteristics, and obtaining a second living body detection result of the object to be detected according to a classification result;
the matching processing unit is used for matching the image characteristics and the sound wave characteristics to obtain a matching result of the image characteristics and the sound wave characteristics;
a determination unit configured to determine a target in-vivo detection result of the object to be detected based on the first in-vivo detection result, the second in-vivo detection result, and the matching result.
14. A computer device, characterized in that the computer device comprises a processor and a memory, wherein at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor to implement the living body detection method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that at least one program code is stored therein, and the at least one program code is loaded and executed by a processor to implement the living body detection method according to any one of claims 1 to 12.
CN202010455648.3A 2020-05-26 2020-05-26 Living body detection method, living body detection device, living body detection equipment and storage medium Active CN111368811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010455648.3A CN111368811B (en) 2020-05-26 2020-05-26 Living body detection method, living body detection device, living body detection equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111368811A true CN111368811A (en) 2020-07-03
CN111368811B CN111368811B (en) 2020-09-18

Family

ID=71209672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010455648.3A Active CN111368811B (en) 2020-05-26 2020-05-26 Living body detection method, living body detection device, living body detection equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111368811B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622588A (en) * 2012-03-08 2012-08-01 无锡数字奥森科技有限公司 Dual-certification face anti-counterfeit method and device
WO2016037077A1 (en) * 2014-09-05 2016-03-10 Qualcomm Incorporated Image-based liveness detection for ultrasonic fingerprints
CN104915649A (en) * 2015-06-04 2015-09-16 南京理工大学 Living person detection method applied to face recognition
US20170024608A1 (en) * 2015-07-20 2017-01-26 International Business Machines Corporation Liveness detector for face verification
CN105426723A (en) * 2015-11-20 2016-03-23 北京得意音通技术有限责任公司 Voiceprint identification, face identification and synchronous in-vivo detection-based identity authentication method and system
CN105243740A (en) * 2015-11-25 2016-01-13 四川易辨信息技术有限公司 Card safety identity authentication system and implementation method based on biometric feature identification technology

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001240A (en) * 2020-07-15 2020-11-27 浙江大华技术股份有限公司 Living body detection method, living body detection device, computer equipment and storage medium
WO2021151317A1 (en) * 2020-08-06 2021-08-05 平安科技(深圳)有限公司 Living-body detection method, apparatus, electronic device, and storage medium
CN112153438A (en) * 2020-09-22 2020-12-29 康佳集团股份有限公司 Intelligent television control method based on ultrasonic waves, intelligent television and storage medium
CN112153438B (en) * 2020-09-22 2022-05-31 康佳集团股份有限公司 Intelligent television control method based on ultrasonic waves, intelligent television and storage medium
CN114582078A (en) * 2020-12-01 2022-06-03 比亚迪股份有限公司 Self-service deposit and withdrawal method and self-service deposit and withdrawal system
CN114582078B (en) * 2020-12-01 2024-04-16 比亚迪股份有限公司 Self-service deposit and withdrawal method and self-service deposit and withdrawal system
CN112766162A (en) * 2021-01-20 2021-05-07 北京市商汤科技开发有限公司 Living body detection method, living body detection device, electronic apparatus, and computer-readable storage medium
CN112766162B (en) * 2021-01-20 2023-12-22 北京市商汤科技开发有限公司 Living body detection method, living body detection device, electronic equipment and computer readable storage medium
WO2023109551A1 (en) * 2021-12-15 2023-06-22 腾讯科技(深圳)有限公司 Living body detection method and apparatus, and computer device

Also Published As

Publication number Publication date
CN111368811B (en) 2020-09-18

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (country: HK; legal event code: DE; document number: 40026328)