CN115546908A - Living body detection method, device and equipment


Info

Publication number
CN115546908A
Authority
CN
China
Prior art keywords
image
detected
feature
living body
vector
Prior art date
Legal status
Pending
Application number
CN202211192532.0A
Other languages
Chinese (zh)
Inventor
武文琦 (Wu Wenqi)
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211192532.0A priority Critical patent/CN115546908A/en
Publication of CN115546908A publication Critical patent/CN115546908A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40 Spoof detection, e.g. liveness detection
    • G06V 40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The embodiments of this specification disclose a living body detection method, device and equipment. The scheme may include the following steps: obtaining semantic feature information and visual feature information of an image to be detected, and performing feature fusion processing on the two to obtain a fusion feature vector of the image to be detected; then processing the fusion feature vector with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected.

Description

Living body detection method, device and equipment
Technical Field
The present application relates to the technical field of deep learning, and in particular to a living body detection method, device and equipment.
Background
With the development of computer technology and optical imaging technology, more and more scenarios use image processing technology to identify a user's identity. In an identity authentication scenario, for example, an attacker who wants to impersonate a legitimate user may hold a photo, a recorded video, a wax figure, or the like of that user in front of the camera that collects the user image. Living body detection therefore needs to be performed on the collected user image to reduce the risk in the identity recognition process. At present, a convolutional neural network model is usually trained on a large number of training samples and then used for living body detection. This approach relies on a reasonable distribution of the training samples, which affects the stability and accuracy of the living body detection results.
Therefore, how to improve the stability and accuracy of the generated living body detection results has become a technical problem that urgently needs to be solved.
Disclosure of Invention
The living body detection method, device and equipment provided by the embodiments of this specification can improve the stability and accuracy of the generated living body detection results.
To solve the above technical problem, the embodiments of this specification are implemented as follows:
the living body detection method provided by the embodiments of this specification includes:
obtaining semantic feature information and visual feature information of an image to be detected;
performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
processing the fusion feature vector with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected.
The living body detection device provided by an embodiment of this specification includes:
a first acquisition module, configured to acquire semantic feature information and visual feature information of an image to be detected;
a fusion module, configured to perform feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
a processing module, configured to process the fusion feature vector with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected.
The living body detection equipment provided by the embodiments of this specification includes:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
obtain semantic feature information and visual feature information of an image to be detected;
perform feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
process the fusion feature vector with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected.
At least one embodiment provided in this specification can achieve the following beneficial effects:
feature fusion processing is performed on the semantic feature information and visual feature information of an image to be detected to obtain a fusion feature vector of the image, and the fusion feature vector is processed with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected. On one hand, this scheme combines multi-modal image information for living body recognition, improving the stability and accuracy of the living body detection result; on the other hand, the attention mechanism lets the living body detection model focus on the regions where living bodies and attacks differ strongly, so the model achieves better generalization and universality, depends less on how accurately the model training data are selected, and further improves the stability and accuracy of the living body detection result.
Drawings
To more clearly illustrate the embodiments of this specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a living body detection method provided in an embodiment of this specification;
Fig. 2 is a schematic flowchart of a living body detection method provided in an embodiment of this specification;
Fig. 3 is a schematic swim-lane flowchart corresponding to the living body detection method of Fig. 2, provided in an embodiment of this specification;
Fig. 4 is a schematic structural diagram of a living body detection device corresponding to Fig. 2, provided in an embodiment of this specification;
Fig. 5 is a schematic structural diagram of living body detection equipment corresponding to Fig. 2, provided in an embodiment of this specification.
Detailed Description
To make the objects, technical solutions and advantages of one or more embodiments of this specification clearer, the technical solutions of one or more embodiments of this specification are described clearly and completely below with reference to specific embodiments and the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the scope of protection of one or more embodiments of this specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
In the prior art, living body detection on a collected user image is mainly implemented with a trained convolutional neural network model. However, training such a model usually requires a large number of training samples, and the prediction accuracy of the trained model depends heavily on a reasonable distribution of those samples, which affects the stability and accuracy of the generated living body detection results.
To overcome these defects in the prior art, this scheme provides the following embodiments:
fig. 1 is a schematic view of an application scenario of a method for detecting a living body according to an embodiment of the present disclosure.
As shown in fig. 1, a preset language model 101, a preset convolutional neural network model 102, and a living body detection model 103 may be deployed on the living body detection equipment 100. The preset language model 101 extracts semantic feature information of an image to be detected, and the preset convolutional neural network model 102 extracts visual feature information of the image. By position-coding and feature-fusing the semantic feature information and the visual feature information, a fusion feature vector of the image to be detected is obtained.
Subsequently, an encoder built on the self-attention mechanism in the living body detection model 103 encodes the fusion feature vector into a coding vector, which is input into the feature processing sub-model for channel attention processing and spatial attention processing. Finally, the processed coding vector is input into a decoder built on the self-attention mechanism in the living body detection model 103, and the decoder generates a living body detection result for the image to be detected based on the coding vector.
It should be noted that the number of neuron layers, the connection structure, and the number of neurons per layer of each model shown in fig. 1 are only examples and do not limit the models.
With the scheme in fig. 1, on one hand, multi-modal image information is combined for living body recognition to improve the stability and accuracy of the living body detection result; on the other hand, the attention mechanism lets the living body detection model focus on the regions where living bodies and attacks differ strongly, so that the model achieves better generalization and universality, depends less on how accurately the training data are selected, and further improves the stability and accuracy of the living body detection result.
Next, a living body detection method provided in an embodiment of this specification is described in detail with reference to the accompanying drawings:
Fig. 2 is a schematic flowchart of a living body detection method provided in an embodiment of this specification. From a program perspective, the execution subject of the flow may be the living body detection equipment, or an application program running on it. As shown in fig. 2, the flow may include the following steps:
step 202: and obtaining semantic feature information and visual feature information of the image to be detected.
In the embodiments of this specification, the image to be detected may be a user image collected for a user undergoing living body recognition. It generally contains the user's face information, and may also contain the user's limb movement information, surrounding environment information, and the like.
The semantic feature information may be feature information reflecting image description information of the image to be detected. For example, when the image to be detected is collected from a user holding a face photo in hand, the image description information may reflect that "there are fingers at the edge of the photo" or that "the background around the face is inconsistent with the on-site background". The content of the image description information corresponding to the semantic feature information can be set as required and is not specifically limited here.
The visual feature information may be image feature information extracted from the image to be detected to reflect the visual information it contains, such as the facial features of the five sense organs, the hair style, and whether a mask, glasses, or a hat is worn. The specific type of feature information reflected by the visual feature information is not limited here and can be set as needed.
The extraction order of the semantic feature information and the visual feature information is not limited: the semantic feature information may be extracted first, the visual feature information may be extracted first, or the two may be extracted simultaneously.
Step 204: performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected.
In the embodiments of this specification, feature fusion refers to combining features from different layers or branches and is a common operation in modern network architectures; for general multi-scale fusion in image processing, the features may be directly added or concatenated together.
The semantic feature information extracted from the image to be detected and the visual feature information extracted from it are fused according to a preset rule. The fusion feature vector obtained through the feature fusion processing can simultaneously reflect the description information of the image to be detected and the visual image information it carries.
Step 206: processing the fusion feature vector with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected.
In the embodiments of this specification, the attention mechanism derives from studies of human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on part of the available information while ignoring the rest; this is commonly called the attention mechanism. It covers two aspects: deciding which part of the input needs attention, and allocating the limited information processing resources to that important part.
A living body detection model is a detection model that judges the real physiological characteristics of an object in authentication scenarios to verify whether the user is a real living body. Typical application scenarios include face unlocking on mobile phones, face-scanning payment, and remote identity verification.
An attention-based classification process may be performed on the fusion feature vector with the living body detection model built based on the attention mechanism, to obtain a living body classification result for the image to be detected. If the living body probability value of the classification result is greater than a preset threshold, a detection result is generated indicating that the user undergoing living body detection in the image is a living body, i.e. that the user passed the detection through compliant operation behaviour. Conversely, if the non-living-body probability value is greater than the preset threshold, a detection result is generated indicating that the user is a non-living body, i.e. that the user attempted to cheat through fraudulent operation behaviour.
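For illustration, the threshold comparison described above can be sketched as follows. This is a minimal sketch assuming a two-class output; the threshold value and all function and variable names are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch of the threshold rule described above. The threshold
# value and all names are illustrative assumptions, not the patent's.
LIVE_THRESHOLD = 0.5  # hypothetical preset threshold

def liveness_decision(live_prob: float, threshold: float = LIVE_THRESHOLD) -> str:
    """Map the model's live-class probability to a detection result."""
    if live_prob > threshold:
        return "living body"      # compliant operation behaviour
    if (1.0 - live_prob) > threshold:
        return "non-living body"  # fraudulent operation behaviour
    return "undetermined"         # neither class clears the threshold

print(liveness_decision(0.85))    # -> "living body"
```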
In the embodiments of this specification, two kinds of information, semantic feature information and visual feature information, are extracted from the image to be detected, so that more information about the image is fed into the living body detection model, improving the accuracy of its output. Moreover, the living body detection model is built on an attention mechanism, which lets it focus on the regions where living bodies and attacks differ strongly; this gives the model better generalization and universality, reduces its dependence on training samples, and further improves the stability and accuracy of the living body detection result.
Based on the method in fig. 2, some specific embodiments of the method are also provided in the examples of this specification, which are described below.
In the embodiments of this specification, a language model may be used to extract the semantic feature information from the image to be detected. Based on this, obtaining the semantic feature information of the image to be detected may specifically include:
performing feature extraction processing on the image to be detected with a preset language model to obtain a semantic feature vector of the image to be detected, where the preset language model is a deep learning model for generating image description information for an input image.
The preset language model may be a language model that the user builds according to requirements; for example, it may be implemented based on a Generative Pre-trained Transformer 3 (GPT-3) style model. Feature extraction processing on the image to be detected with the built preset language model yields the semantic feature vector of the image. Specifically, the semantic feature vector may be extracted from a convolutional layer of the preset language model, in which case it is a two-dimensional feature vector, or from a fully connected layer, in which case it is a one-dimensional feature vector. If the semantic feature vector extracted from the preset language model is two-dimensional, it needs to be converted into a one-dimensional feature vector for the position coding processing.
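The following sketch illustrates this branch under stated assumptions: `LanguageModel` is a hypothetical stand-in for the preset language model (the patent names a GPT-3-style model but gives no architecture), built with one convolutional layer and one fully connected layer so that both extraction points mentioned above exist; all sizes are illustrative:

```python
import torch

# Hypothetical stand-in for the "preset language model"; layer choices and
# sizes are assumptions made so that both a convolutional-layer (2-D) and a
# fully-connected-layer (1-D per token) extraction point exist.
class LanguageModel(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.fc = torch.nn.Linear(dim, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        fmap = self.conv(image)                   # 2-D feature map (B, C, H, W)
        tokens = fmap.flatten(2).transpose(1, 2)  # converted to 1-D tokens (B, N, C)
        return self.fc(tokens)                    # semantic feature vectors

semantic = LanguageModel()(torch.randn(1, 3, 224, 224))  # shape (1, 196, 256)
```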
The preset language model is a deep learning model that needs to be trained: the user specifies the output information of the model, and selects suitable training samples and associated labels according to that specified output, so as to obtain a language model that outputs the information the user requires.
Since the preset language model is built by the user as required, and its training samples are selected according to the user's requirements on the output information, the user can extract the required semantic feature vectors from the image to be detected, improving user satisfaction.
The semantic feature vector extracted by the preset language model can reflect multi-dimensional image description information of the image to be detected, and normally this information reflects different content depending on the arrangement order of the dimensions. For example, when a person wearing a mask undergoes living body detection, the image description corresponding to the extracted semantic feature vector may be "a mask is worn on the mouth". If the positions of the dimension representing the mouth and the dimension representing the mask are not constrained, the described text could become "a mouth is worn on the mask"; on receiving such a vector, the living body detection model could conclude that it detected a mask printed with a mouth pattern rather than a live face wearing a mask, producing a wrong detection output. The position of each semantic feature vector extracted from the image to be detected therefore needs to be constrained.
Based on this, to guarantee the accuracy of the semantic feature information, obtaining the semantic feature information of the image to be detected may further include:
performing position coding processing on the semantic feature vector of the image to be detected to obtain a position-coded target semantic feature vector.
In the embodiments of this specification, mainstream position coding methods fall into two main categories: absolute position coding and relative position coding. Absolute position coding informs the network architecture of the position of each element in the input sequence, which is similar to attaching a "position tag" to each element to indicate its absolute position. Relative position coding informs the network architecture of the distance between every two elements.
Position coding of the semantic feature vector of the image to be detected may configure a position tag for each piece of dimensional information in the vector, to obtain a position coding vector of the same dimension as the semantic feature vector, and then fuse the two to complete the position coding operation. In practical applications, an ADD operation can be performed on the semantic feature vector and the position coding vector, so that the data dimension of the resulting target semantic feature vector is unchanged. Note that, in general, the target semantic feature vectors need to be sorted in the order reflected by their corresponding position coding vectors, so that the sorted vectors accurately reflect the image description text information of the image to be detected.
Alternatively, the position coding processing may encode the distance between every two pieces of dimensional information in the semantic feature vector to obtain a position coding vector of the same dimension, again perform an ADD operation on the two, and again leave the data dimension of the target semantic feature vector unchanged. As above, the target semantic feature vectors generally need to be sorted in the order reflected by their position coding vectors, so that they accurately reflect the image description text information of the image to be detected.
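A minimal sketch of the absolute variant with the ADD operation follows; sinusoidal codes are one conventional choice of "position tag" and are an assumption here, since the patent does not fix the coding function:

```python
import torch

# Sinusoidal absolute position codes (an assumed, conventional choice).
# The code tensor matches the feature sequence's shape, so the ADD
# operation leaves the data dimension unchanged, as described above.
def position_codes(num_positions: int, dim: int) -> torch.Tensor:
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, idx / dim)
    codes = torch.zeros(num_positions, dim)
    codes[:, 0::2] = torch.sin(angle)  # even dimensions
    codes[:, 1::2] = torch.cos(angle)  # odd dimensions
    return codes

semantic = torch.randn(196, 256)              # semantic feature vectors
target = semantic + position_codes(196, 256)  # ADD: shape stays (196, 256)
```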
By position-coding each semantic feature vector of the image to be detected, the vectors can be ordered so that they reflect accurate text description information, improving the accuracy of the input to the living body detection model.
Extraction and position coding of the visual feature information of the image to be detected are also required, to facilitate the subsequent feature fusion processing. Based on this, obtaining the visual feature information of the image to be detected may specifically include:
performing feature extraction processing on the image to be detected with a preset convolutional neural network model to obtain a visual feature vector of the image to be detected, where the preset convolutional neural network model is a model that performs processing operations, including convolution processing, on an input image;
performing position coding processing on the visual feature vector of the image to be detected to obtain a position-coded target visual feature vector.
In the embodiments of this specification, the extraction of the visual feature information may be completed with a preset convolutional neural network model. This model can be built with network structures such as convolutional layers, pooling layers, fully connected layers and an output layer according to the user's requirements, each layer containing one or more neurons. Feature extraction processing with the built model yields the visual feature vector of the image to be detected. According to user requirements, the visual feature vector may be extracted from a convolutional layer of the model, in which case it is a two-dimensional feature vector, or from the fully connected layer, in which case it is one-dimensional. If the extracted visual feature vector is two-dimensional, it needs to be converted into a one-dimensional feature vector for the position coding processing.
The preset convolutional neural network model extracts the visual feature information by processing the pixel values of different parts of the image to be detected and extracting features from them. Since the convolutional layers can extract pixel values of different parts of the input image without learning, the model does not have to be trained in advance; of course, it may also be model-trained in advance so that the trained model extracts the visual feature vectors, which is not specifically limited.
During extraction, the preset convolutional neural network model extracts different feature vectors for different parts of the image to be detected; the extracted visual feature vector is thus a multi-dimensional vector representing pixel information of different parts of the image, and in general a different arrangement order of the dimensional information yields different overall content. For example, when the image to be detected is a human face, the eyes, nose and mouth all have fixed positions; if the positions of the facial features are not constrained, the positions of the eyes and the mouth might be swapped. The visual feature vector extracted from the image therefore needs position coding processing to obtain the target visual feature vector.
Position coding of the visual feature vector may configure a position tag for each piece of dimensional information in the vector, to obtain a position coding vector of the same dimension, and perform feature fusion processing (for example, an ADD operation) on the two to obtain the target visual feature vector, whose data dimension is unchanged. In practice, the target visual feature vectors are generally sorted in the order reflected by their corresponding position coding vectors, so that the sorted vectors accurately reflect the visual information carried by the image to be detected.
Alternatively, the position coding processing may encode the distance between every two pieces of dimensional information in the visual feature vector to obtain a position coding vector of the same dimension, perform an ADD operation on the two, and leave the data dimension of the target visual feature vector unchanged. As above, the target visual feature vectors generally need to be sorted in the order reflected by their position coding vectors, so that they accurately reflect the visual information carried by the image to be detected.
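A sketch of the visual branch under the same caveats: the layer stack of the preset convolutional neural network is not specified in the patent, so the structure below is an illustrative assumption; the 2-D feature map is flattened to a 1-D sequence before the position-coding ADD, mirroring the semantic branch:

```python
import torch

# Illustrative stand-in for the "preset convolutional neural network";
# the layer stack and sizes are assumptions, not taken from the patent.
cnn = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2),                     # pooling layer
    torch.nn.Conv2d(64, 256, kernel_size=3, padding=1),
    torch.nn.AdaptiveAvgPool2d((14, 14)),
)

image = torch.randn(1, 3, 224, 224)
fmap = cnn(image)                              # 2-D feature map (1, 256, 14, 14)
visual = fmap.flatten(2).transpose(1, 2)       # 1-D sequence (1, 196, 256)
# `visual` would then receive the same position-coding ADD as the
# semantic branch (see the position_codes sketch above).
```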
The position-coded visual feature vector can accurately reflect the visual information of the image, so using it for living body detection improves the accuracy of the model's input information and hence the accuracy of the living body detection result it outputs.
Generally, the semantic feature information and the visual feature information extracted from the image to be detected need to be fused so that the living body detection model performs living body recognition on the fused feature information. Based on this, step 204, performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected, may specifically include:
performing feature splicing processing on the target semantic feature vector and the target visual feature vector to obtain the fusion feature vector of the image to be detected,
where the dimension of the fusion feature vector is the sum of the dimension of the target semantic feature vector and the dimension of the target visual feature vector; or,
the dimension of the fusion feature vector, the dimension of the target semantic feature vector and the dimension of the target visual feature vector are all equal.
In the embodiments of this specification, the semantic feature vector obtained from the preset language model and the visual feature vector obtained from the preset convolutional neural network model generally have the same feature dimensions; for example, if the semantic feature vector contains six features of different dimensions, the visual feature vector should also contain six, to keep the position coding of the two vectors consistent. When building the preset language model and the preset convolutional neural network model, the user can therefore plan the dimensions of each layer's output data in advance, so that the required semantic and visual feature vectors can be extracted from the corresponding network structures.
Feature splicing of the semantic feature vector and the visual feature vector of the image to be detected may be done in two ways.
Mode 1: the semantic feature vector is placed before or after the visual feature vector, so that the dimension of the resulting fusion feature vector is the sum of the two input dimensions. For example, arranging an N-dimensional semantic feature vector behind or in front of an M-dimensional visual feature vector yields a fusion feature vector of dimension N + M; that is, the dimension of the fusion feature vector obtained through feature splicing is the sum of the dimensions of the semantic feature vector and the visual feature vector. In practical applications, this splicing can be handled with functions such as dstack and stack. It expands the feature dimension and increases the amount of information carried by the fusion feature vector.
Mode 2: the semantic feature vector and the visual feature vector are added dimension by dimension to obtain a fusion feature vector of the same dimension as the inputs. For example, the i-th dimension of the semantic feature vector and the i-th dimension of the visual feature vector may be added to obtain the i-th dimension of the fusion feature vector, where 1 ≤ i ≤ the number of dimensions of the two vectors. In practical applications, this can be handled with an add function, which is convenient and fast.
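Both modes can be sketched in a few lines; `torch.cat` and elementwise `+` below play the roles of the dstack/stack-style splicing and the add function mentioned above, with illustrative shapes:

```python
import torch

semantic = torch.randn(1, 196, 256)  # target semantic feature vectors
visual = torch.randn(1, 196, 256)    # target visual feature vectors

# Mode 1: splicing; the fused dimension is the sum of the input dimensions.
fused_concat = torch.cat([semantic, visual], dim=1)  # (1, 392, 256)

# Mode 2: element-wise addition; the fused dimension equals each input's.
fused_add = semantic + visual                        # (1, 196, 256)
```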
After the target semantic feature vector and the target visual feature vector are spliced into the fusion feature vector, a corresponding living body detection model needs to be built as required to detect it. The self-attention mechanism makes a model attend more to features that are strongly correlated with the feature being processed, which improves the running stability and accuracy of the network model; the living body detection model can therefore be built on the self-attention mechanism.
Specifically, the living body detection model built based on the attention mechanism may include an encoder built on the self-attention mechanism and a decoder built on the self-attention mechanism, so that through them the model pays more attention to the differential feature regions between living bodies and attacks.
Correspondingly, step 206, processing the fusion feature vector with the living body detection model built based on the attention mechanism to obtain a living body detection result for the image to be detected, may specifically include:
inputting the fusion feature vector into the encoder to obtain a coding vector output by the encoder;
generating, with the decoder, a living body detection result for the image to be detected based on the coding vector.
In the embodiments of this specification, the encoder and decoder built on the self-attention mechanism can form an autoencoder, with the decoder's output layer used for a classification task; the living body detection model built here can therefore be a classification model based on the autoencoder principle and the self-attention principle, improving its running stability and accuracy.
Specifically, after receiving the fusion feature vector, the encoder encodes it into a low-dimensional coding vector, enhancing the differential feature regions and weakening the similar ones. The decoder then performs decoding processing based on the coding vector, so that the living body detection model attends more to the differential feature regions between living bodies and attacks, improving the accuracy and stability of its output.
In general, when a living body detection model built on the self-attention mechanism detects an image to be detected, other attention mechanisms also need to be considered when building the model, in order to constrain the self-attention regions to areas that humans recognize as decisive, such as a mobile phone frame or the texture of a paper background.
Based on this, the living body detection model built based on the attention mechanism may further include a feature processing sub-model built on a channel attention mechanism and a spatial attention mechanism.
Correspondingly, generating, with the decoder, a living body detection result for the image to be detected based on the coding vector may specifically include:
constructing a two-dimensional feature matrix from the coding vector to obtain a feature map to be processed;
inputting the feature map to be processed into the feature processing sub-model to obtain a target feature map output by the feature processing sub-model;
segmenting the target feature map to obtain a plurality of target feature sub-maps;
splicing the feature vectors extracted from the target feature sub-maps to obtain a target feature vector;
inputting the target feature vector into the decoder to obtain the living body detection result for the image to be detected output by the decoder.
In the embodiments of this specification, the feature processing sub-model built on the channel attention mechanism and the spatial attention mechanism lets the living body detection model attend more to the sensitive regions of the image to be detected. In practical applications, it can be implemented with the Convolutional Block Attention Module (CBAM) technique and mainly consists of convolutional layers.
Specifically, the encoder built on the self-attention mechanism encodes the fusion feature vector into a one-dimensional vector, whereas the input of a convolutional layer is usually two-dimensional, so a two-dimensional feature matrix needs to be constructed from the coding vector output by the encoder to convert the one-dimensional coding vector into a two-dimensional feature map to be processed. This feature map is then input into the feature processing sub-model built on the channel attention mechanism and the spatial attention mechanism, which fuses the features in the channel and spatial dimensions to produce the target feature map, still a two-dimensional feature vector. Before the target feature vector is input into the decoder, it therefore needs to be converted into a one-dimensional target feature vector, so that the decoder can perform decoding processing and living body classification processing on it and output the living body detection result of the image to be detected.
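The two CBAM sub-modules can be sketched as below. The reduction ratio and kernel size are conventional CBAM defaults and are assumptions here, not taken from the patent; the processing order (channel first, then spatial) follows the description above:

```python
import torch
from torch import nn

# CBAM-style channel attention: re-weights channels using pooled statistics.
class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):  # assumed ratio
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale

# CBAM-style spatial attention: re-weights positions of the feature map.
class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):  # assumed kernel size
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```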
In practical applications, the living body detection model can be built on a Transformer. Based on this, the encoder is an encoder in a Transformer model and the decoder is a decoder in the Transformer model, and the feature processing sub-model may specifically include the Channel Attention Module (CAM) and the Spatial Attention Module (SAM).
Specifically, the output of the encoder may be connected to the input of the channel attention module, the output of the channel attention module may be connected to the input of the spatial attention module, and the output of the spatial attention module may be connected to the input of the decoder.
In the embodiments of this specification, the Transformer model uses the self-attention mechanism to do away entirely with recurrent networks and convolutional neural networks; it is also the most basic technical support of the BERT model, and offers good running stability and accuracy. In general, the basic framework of the Transformer model comprises an input part, an encoding part, a decoding part and an output part.
Since the encoder of the Transformer model, the channel attention module, the spatial attention module and the decoder of the Transformer model need to be connected in sequence, and the encoder outputs a one-dimensional coding vector, that vector is first converted into a two-dimensional vector (the feature map to be processed mentioned above) and input into the channel attention module, which constrains it based on the channel attention mechanism. The constrained feature vector is then input into the spatial attention module, which constrains it again based on the spatial attention mechanism. The two-dimensional vector obtained after the spatial attention module (the target feature map) is then flattened into a one-dimensional vector (the target feature vector) for decoding and classification by the decoder of the Transformer model, and the output part of the Transformer model outputs the detection result of the image to be detected.
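The wiring just described can be sketched end to end. Everything below is an assumed minimal configuration (layer counts, heads, a single learned decoder query, a linear classification head are not specified by the patent); ChannelAttention and SpatialAttention are the modules sketched above:

```python
import torch
from torch import nn

# Assumed end-to-end wiring: encoder -> 2-D reshape -> CAM -> SAM ->
# flatten -> decoder -> 2-class head. Sizes and layer counts are illustrative.
class LivenessModel(nn.Module):
    def __init__(self, dim: int = 256, side: int = 14):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.cam = ChannelAttention(dim)          # sketched above
        self.sam = SpatialAttention()             # sketched above
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # assumed decoder input
        self.head = nn.Linear(dim, 2)             # living body vs. attack
        self.side = side

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        b, n, c = fused.shape                     # expects n == side * side
        enc = self.encoder(fused)                 # 1-D coding vectors
        fmap = enc.transpose(1, 2).reshape(b, c, self.side, self.side)
        fmap = self.sam(self.cam(fmap))           # channel, then spatial attention
        tokens = fmap.flatten(2).transpose(1, 2)  # flattened target feature vector
        out = self.decoder(self.query.expand(b, -1, -1), tokens)
        return self.head(out[:, 0])               # 2-class logits

logits = LivenessModel()(torch.randn(1, 196, 256))  # shape (1, 2)
```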
In summary, in the embodiments of this specification, the living body detection model may be built on multiple attention mechanisms, such as the self-attention mechanism of the Transformer model, the channel attention mechanism of the channel attention module, and the spatial attention mechanism of the spatial attention module, which helps guarantee the stability and accuracy of the generated living body detection result and reduces the dependence on how accurately the training samples are distributed.
The living body detection model built on the self-attention, channel attention and spatial attention mechanisms may be a binary classification model. Its output for an image to be detected may take the form "the probability that the image to be detected is a living body image is 85% and the probability that it is a non-living-body image is 15%", or the form "the image to be detected is a living body image" or "the image to be detected is a non-living-body image".
Building a binary living body detection model on the attention mechanism lets the model directly output whether the image to be detected is a living body image, improving its working efficiency.
To improve the accuracy of the living body detection model, after it is built based on the attention mechanism it needs to be trained. Based on this, before step 206, processing the fusion feature vector with the living body detection model built based on the attention mechanism to obtain a living body detection result for the image to be detected, the method may further include:
acquiring a training sample set, where the training samples in the set are fusion feature vector samples obtained by performing feature fusion processing on visual feature vector samples and semantic feature vector samples of sample images, and each training sample carries classification label data indicating whether a specified object in the sample image is a living body;
training an initial living body detection model built based on the attention mechanism with the training sample set to obtain the trained living body detection model built based on the attention mechanism.
A training sample set generally contains a plurality of training samples; their number is determined by the actual situation and is not limited here, as long as the detection accuracy of the living body detection model reaches a preset result. Each training sample is a fusion feature vector sample obtained by feature-fusing a visual feature vector sample and a semantic feature vector sample of a sample image, with classification label data indicating whether the specified object in the sample image is a living body. The initial living body detection model is trained with the training sample set, and when the accuracy of its detection results meets the preset requirement, the final living body detection model required by the user is obtained.
The sample images may include living body image samples and attack image samples (i.e. non-living-body image samples). Attack image samples are mainly images collected while the user employs an attack means such as a photo, a video, face swapping, a mask, an occluder, a 3D animation, or screen re-shooting; living body image samples are images collected from users employing no attack means. Training samples generated from living body image samples carry classification label data indicating that the specified object in the sample image is a living body, and training samples generated from attack image samples carry classification label data indicating that it is a non-living body.
It should be noted that the visual feature vector samples and semantic feature vector samples of the sample images are feature vectors obtained after position coding; that is, they may be generated on the same principles as the target visual feature vector and the target semantic feature vector respectively, which is not repeated here.
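A hedged training sketch follows, using the LivenessModel sketched earlier; the random tensors, label encoding, optimiser settings and epoch count are illustrative stand-ins for the training sample set described above:

```python
import torch
from torch import nn

# Illustrative training loop; random tensors stand in for the fusion
# feature vector samples and their living body / attack classification labels.
model = LivenessModel()                        # from the wiring sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

fused_samples = torch.randn(32, 196, 256)      # fusion feature vector samples
labels = torch.randint(0, 2, (32,))            # 1 = living body, 0 = attack

for epoch in range(10):                        # assumed epoch count
    optimizer.zero_grad()
    loss = criterion(model(fused_samples), labels)
    loss.backward()
    optimizer.step()
```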
In the embodiments of this specification, an initial living body detection model built based on the attention mechanism is trained in advance, and the trained model is used to generate the living body detection result of the image to be detected, improving the stability and accuracy of that result.
Fig. 3 is a schematic swim-lane flowchart corresponding to the living body detection method in fig. 2 provided in an embodiment of this specification. As shown in fig. 3, the living body detection flow may involve execution subjects such as the preset language model, the preset convolutional neural network model and the living body detection model.
In the feature extraction stage, the preset language model performs feature extraction processing on the image to be detected to obtain its semantic feature vector, which is position-coded into the target semantic feature vector. At the same time, the preset convolutional neural network model performs feature extraction processing on the image to obtain its visual feature vector, which is position-coded into the target visual feature vector. The target semantic feature vector and the target visual feature vector are then spliced and fused into the fusion feature vector.
In the detection stage, the fusion feature vector is input into the encoder of the living body detection model to obtain the coding vector output by the encoder; the feature processing sub-model implemented with the CBAM technique applies channel attention and spatial attention constraints to the coding vector to obtain the target feature vector; the target feature vector is input into the decoder of the living body detection model, which decodes and classifies it to obtain the living body detection result of the image to be detected.
Based on the same idea, the embodiments of this specification further provide a device corresponding to the above method. Fig. 4 is a schematic structural diagram of a living body detection device corresponding to fig. 2 provided in an embodiment of this specification. As shown in fig. 4, the device may include:
a first acquisition module, configured to acquire semantic feature information and visual feature information of an image to be detected;
a fusion module, configured to perform feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
a processing module, configured to process the fusion feature vector with a living body detection model built based on an attention mechanism to obtain a living body detection result for the image to be detected.
The examples of this specification also provide some specific embodiments of the apparatus based on the apparatus of fig. 4, which is described below.
Optionally, the first obtaining module may specifically include:
the first extraction unit is used for performing feature extraction processing on the image to be detected by using a preset language model to obtain a semantic feature vector of the image to be detected; wherein the preset language model is a deep learning model for generating image description information for an input image.
Optionally, the first obtaining module may further include:
and the first coding unit is used for carrying out position coding processing on the semantic feature vector of the image to be detected to obtain a target semantic feature vector after position coding.
Optionally, the first obtaining module may specifically include:
the second extraction unit is used for performing feature extraction processing on the image to be detected by using a preset convolutional neural network model to obtain a visual feature vector of the image to be detected; wherein the preset convolutional neural network model is a model for performing a processing operation including convolution processing with respect to an input image.
And the second coding unit is used for carrying out position coding processing on the visual feature vector of the image to be detected to obtain a position-coded target visual feature vector; one plausible realization of the position coding step is sketched below.
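The specification does not fix the position coding scheme; one common choice is the fixed sinusoidal encoding from the Transformer literature, sketched here under the assumption of an even feature dimension.

```python
import torch

def sinusoidal_position_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal encoding for a (seq_len, dim) feature sequence; dim must be even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)                # even indices
    angle = pos / torch.pow(10000.0, i / dim)                       # (seq_len, dim // 2)
    enc = torch.zeros(seq_len, dim)
    enc[:, 0::2] = torch.sin(angle)
    enc[:, 1::2] = torch.cos(angle)
    return enc

# Usage: position-code a feature sequence by addition.
# encoded = features + sinusoidal_position_encoding(features.size(0), features.size(1))
```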
Optionally, the fusion module may specifically include:
and the splicing unit is used for carrying out feature splicing processing on the target semantic feature vector and the target visual feature vector to obtain a fusion feature vector of the image to be detected.
Wherein the dimension of the fusion feature vector is the sum of the dimensions of the target semantic feature vector and the target visual feature vector; alternatively, the dimensions of the fusion feature vector, the target semantic feature vector and the target visual feature vector are all equal. Both options are illustrated in the sketch below.
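Both conventions are straightforward to realize. In this hedged sketch, concatenation yields the summed dimension, while a learned linear projection (an assumption; the specification does not fix the equal-dimension operator) keeps all three dimensions equal.

```python
import torch
import torch.nn as nn

d = 256                                   # assumed common feature dimension
semantic = torch.randn(1, d)              # target semantic feature vector
visual = torch.randn(1, d)                # target visual feature vector

# Option 1: dimensions add up (plain concatenation).
fused_cat = torch.cat([semantic, visual], dim=-1)   # (1, 2 * d)

# Option 2: all three dimensions equal, e.g. by projecting back to d.
proj = nn.Linear(2 * d, d)
fused_eq = proj(fused_cat)                          # (1, d)
```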
Optionally, the living body detection model built based on the attention mechanism may include: an encoder built based on the self-attention mechanism, and a decoder built based on the self-attention mechanism.
Optionally, the processing module may specifically include:
and the input unit is used for inputting the fusion characteristic vector into the encoder to obtain an encoding vector output by the encoder.
And the generation unit is used for generating a living body detection result aiming at the image to be detected based on the coding vector by utilizing the decoder.
Optionally, the living body detection model built based on the attention mechanism may further include: a feature processing sub-model built based on a channel attention mechanism and a spatial attention mechanism.
Optionally, the generating unit may specifically include the following subunits (a sketch of the whole chain follows them):
and the construction subunit is used for constructing a two-dimensional feature matrix according to the coding vector to obtain a feature map to be processed.
And the first input subunit is used for inputting the characteristic diagram to be processed into the characteristic processing submodel to obtain a target characteristic diagram output by the characteristic processing submodel.
And the segmentation subunit is used for segmenting the target feature graph to obtain a plurality of target feature subgraphs.
And the splicing subunit is used for splicing the feature vectors extracted from the target feature subgraphs to obtain the target feature vectors.
And the second input subunit is used for inputting the target characteristic vector into the decoder to obtain the living body detection result output by the decoder and aiming at the image to be detected.
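Putting these subunits together, a hedged sketch of the decode-side chain might look as follows; the map height and width, the patch size, and the treatment of the decoder as a single callable are all assumptions for illustration.

```python
import torch

def decode_stage(code_vec, feature_submodel, decoder, h=16, w=16, patch=4):
    """Construct a 2-D map, apply the feature processing sub-model, segment, splice, decode."""
    # Construction subunit: build a two-dimensional feature matrix from the encoding vector.
    c = code_vec.numel() // (h * w)
    fmap = code_vec.reshape(1, c, h, w)                 # feature map to be processed

    # First input subunit: channel- and spatial-attention processing (e.g. CBAM).
    target_map = feature_submodel(fmap)                 # target feature map

    # Segmentation subunit: cut the target feature map into patch subgraphs.
    patches = target_map.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.reshape(1, c, -1, patch, patch)   # (1, C, n_patches, p, p)

    # Splicing subunit: flatten and concatenate the per-patch features.
    target_vec = patches.flatten(start_dim=1)           # target feature vector

    # Second input subunit: the decoder outputs the living body detection result.
    return decoder(target_vec)
```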
Optionally, the encoder is an encoder in a Transformer model, and the decoder is a decoder in the Transformer model; the feature processing submodel includes: a channel attention module and a spatial attention module.
The output end of the encoder is connected with the input end of the channel attention module, the output end of the channel attention module is connected with the input end of the spatial attention module, and the output end of the spatial attention module is connected with the input end of the decoder.
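In wiring terms, the connection order just described can be sketched as follows, reusing the ChannelAttention and SpatialAttention classes from the earlier CBAM sketch. The Transformer layer sizes are assumptions, and the decoder's decode-and-classify role is collapsed into a pooled linear head purely for brevity.

```python
import torch.nn as nn

class LivenessDetector(nn.Module):
    """Encoder -> channel attention -> spatial attention -> decoder/classifier,
    following the connection order described above (all sizes illustrative)."""
    def __init__(self, dim=256, heads=8, layers=2, h=16, w=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.channel_attn = ChannelAttention(dim)   # from the CBAM sketch above
        self.spatial_attn = SpatialAttention()      # from the CBAM sketch above
        self.h, self.w = h, w
        self.classifier = nn.Linear(dim, 2)         # stand-in for the decoder stage

    def forward(self, fused):                       # fused: (B, h*w, dim)
        code = self.encoder(fused)                  # encoding vectors
        fmap = code.transpose(1, 2).reshape(-1, code.size(2), self.h, self.w)
        fmap = self.spatial_attn(self.channel_attn(fmap))  # channel, then spatial
        pooled = fmap.mean(dim=(2, 3))              # (B, dim)
        return self.classifier(pooled)              # two-class liveness logits
```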
Optionally, the living body detection model built based on the attention mechanism is a classification model; the in-vivo detection result output by the decoder aiming at the image to be detected is a classification result; and the classification result is used for indicating whether the target object in the image to be detected is a living body.
Optionally, the apparatus in fig. 4 may further include:
and the second acquisition module is used for acquiring a training sample set, wherein the training samples in the training sample set are fusion feature vector samples obtained by performing feature fusion processing on visual feature vector samples and semantic feature vector samples of sample images. The training sample carries classification label data indicating whether a specified object in the sample image is a living body.
And the training module is used for training the initial living body detection model built based on the attention mechanism by using the training sample set, to obtain the trained living body detection model built based on the attention mechanism; a sketch of this training loop follows.
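A hedged sketch of the training step these two modules describe, assuming a cross-entropy loss and the Adam optimizer (neither is fixed by the specification):

```python
import torch
import torch.nn as nn

def train_liveness_model(model, samples, labels, epochs=10, lr=1e-4):
    """samples: (N, seq, dim) fused feature-vector samples;
    labels: (N,) long tensor of 0/1 liveness classification labels."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(samples)            # (N, 2) classification logits
        loss = loss_fn(logits, labels)     # compare against the label data
        loss.backward()
        optimizer.step()
    return model
```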
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method.
Fig. 5 is a schematic structural diagram of a living body detection apparatus corresponding to fig. 2 provided in an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 may include:
at least one processor 510; and
a memory 530 communicatively coupled to the at least one processor; wherein
the memory 530 stores instructions 520 executable by the at least one processor 510 to cause the at least one processor 510 to:
and obtaining semantic feature information and visual feature information of the image to be detected.
And performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected.
And processing the fusion characteristic vector by using a living body detection model built based on an attention mechanism to obtain a living body detection result aiming at the image to be detected.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, for the apparatus shown in fig. 5, since it is substantially similar to the method embodiment, the description is relatively brief; for relevant points, reference may be made to the corresponding description of the method embodiment.
In the 1990s, an improvement in a technology could clearly be distinguished as either an improvement in hardware (e.g., an improvement in a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement in a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements in hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized by hardware entity modules. For example, a Programmable Logic Device (PLD) (e.g., a Field Programmable Gate Array (FPGA)) is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate a dedicated integrated-circuit chip. Moreover, instead of manually making integrated-circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to a software compiler used in program development; the source code before compiling is likewise written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can readily be obtained merely by lightly programming the method flow into an integrated circuit in one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application-Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more pieces of software and/or hardware in the practice of the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (21)

1. A living body detection method, comprising:
obtaining semantic feature information and visual feature information of an image to be detected;
performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
and processing the fusion characteristic vector by using a living body detection model built based on an attention mechanism to obtain a living body detection result aiming at the image to be detected.
2. The method according to claim 1, wherein the obtaining semantic feature information of the image to be detected specifically comprises:
performing feature extraction processing on the image to be detected by using a preset language model to obtain a semantic feature vector of the image to be detected; wherein the preset language model is a deep learning model for generating image description information for an input image.
3. The method according to claim 2, wherein the obtaining semantic feature information of the image to be detected further comprises:
and carrying out position coding processing on the semantic feature vector of the image to be detected to obtain a position-coded target semantic feature vector.
4. The method according to claim 3, wherein the acquiring of the visual characteristic information of the image to be detected specifically comprises:
performing feature extraction processing on the image to be detected by using a preset convolutional neural network model to obtain a visual feature vector of the image to be detected; wherein the preset convolutional neural network model is a model for performing a processing operation including convolutional processing with respect to an input image;
and carrying out position coding processing on the visual characteristic vector of the image to be detected to obtain a target visual characteristic vector after position coding.
5. The method according to claim 4, wherein the performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected specifically comprises:
performing feature splicing processing on the target semantic feature vector and the target visual feature vector to obtain a fusion feature vector of the image to be detected;
wherein the dimension of the fusion feature vector is the sum of the dimension of the target semantic feature vector and the dimension of the target visual feature vector; alternatively,
and the dimensionality of the fusion feature vector, the dimensionality of the target semantic feature vector and the dimensionality of the target visual feature vector are equal.
6. The method of any one of claims 1-5, wherein the living body detection model built based on the attention mechanism comprises: an encoder built based on the self-attention mechanism, and a decoder built based on the self-attention mechanism;
the method for detecting the living body of the image to be detected comprises the following steps of processing the fusion characteristic vector by using a living body detection model built based on an attention mechanism to obtain a living body detection result aiming at the image to be detected, and specifically comprises the following steps:
inputting the fusion feature vector into the encoder to obtain an encoding vector output by the encoder;
and generating a living body detection result aiming at the image to be detected based on the coding vector by utilizing the decoder.
7. The method of claim 6, wherein the living body detection model built based on the attention mechanism further comprises: a feature processing sub-model built based on a channel attention mechanism and a spatial attention mechanism;
the generating, by the decoder, a living body detection result for the image to be detected based on the encoding vector specifically includes:
constructing a two-dimensional feature matrix according to the coding vector to obtain a feature map to be processed;
inputting the characteristic graph to be processed into the characteristic processing submodel to obtain a target characteristic graph output by the characteristic processing submodel;
segmenting the target feature graph to obtain a plurality of target feature sub-graphs;
splicing the feature vectors extracted from the target feature subgraphs to obtain target feature vectors;
and inputting the target characteristic vector into the decoder to obtain a living body detection result which is output by the decoder and aims at the image to be detected.
8. The method of claim 7, wherein the encoder is an encoder in a Transformer model and the decoder is a decoder in the Transformer model; the feature processing submodel comprises: a channel attention module and a spatial attention module;
the output end of the encoder is connected with the input end of the channel attention module, the output end of the channel attention module is connected with the input end of the space attention module, and the output end of the space attention module is connected with the input end of the decoder.
9. The method of claim 8, wherein the living body detection model built based on the attention mechanism is a classification model; the in-vivo detection result output by the decoder aiming at the image to be detected is a classification result; the classification result is used for indicating whether the target object in the image to be detected is a living body.
10. The method according to claim 9, wherein before the processing the fusion feature vector by using the living body detection model built based on the attention mechanism to obtain the living body detection result for the image to be detected, the method further comprises:
acquiring a training sample set, wherein training samples in the training sample set are fusion feature vector samples obtained by performing feature fusion processing on visual feature vector samples and semantic feature vector samples of sample images; the training sample carries classification label data used for indicating whether a specified object in the sample image is a living body;
and training the initial living body detection model built based on the attention mechanism by using the training sample set to obtain the trained living body detection model built based on the attention mechanism.
11. A living body detection device, comprising:
the first acquisition module is used for acquiring semantic feature information and visual feature information of an image to be detected;
the fusion module is used for carrying out feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
and the processing module is used for processing the fusion characteristic vector by using a living body detection model built based on an attention mechanism to obtain a living body detection result aiming at the image to be detected.
12. The apparatus according to claim 11, wherein the first obtaining module specifically includes:
the first extraction unit is used for performing feature extraction processing on the image to be detected by using a preset language model to obtain a semantic feature vector of the image to be detected; wherein the preset language model is a deep learning model for generating image description information for an input image.
13. The apparatus of claim 12, wherein the first obtaining module further comprises:
and the first coding unit is used for carrying out position coding processing on the semantic feature vector of the image to be detected to obtain a target semantic feature vector after position coding.
14. The apparatus of claim 13, wherein the first obtaining module further comprises:
the second extraction unit is used for performing feature extraction processing on the image to be detected by using a preset convolutional neural network model to obtain a visual feature vector of the image to be detected; wherein the preset convolutional neural network model is a model for performing a processing operation including convolutional processing with respect to an input image;
and the second coding unit is used for carrying out position coding processing on the visual characteristic vector of the image to be detected to obtain a target visual characteristic vector after position coding.
15. The apparatus according to claim 14, wherein the fusion module specifically comprises:
the splicing unit is used for carrying out feature splicing processing on the target semantic feature vector and the target visual feature vector to obtain a fusion feature vector of the image to be detected;
wherein the dimension of the fusion feature vector is the sum of the dimension of the target semantic feature vector and the dimension of the target visual feature vector; alternatively,
the dimension of the fusion feature vector, the dimension of the target semantic feature vector and the dimension of the target visual feature vector are all equal.
16. The apparatus of any one of claims 11-15, wherein the living body detection model built based on the attention mechanism comprises: an encoder built based on the self-attention mechanism, and a decoder built based on the self-attention mechanism;
the processing module specifically comprises:
the input unit is used for inputting the fusion characteristic vector into the encoder to obtain a coding vector output by the encoder;
and the generation unit is used for generating a living body detection result aiming at the image to be detected based on the coding vector by utilizing the decoder.
17. The apparatus of claim 16, wherein the living body detection model built based on the attention mechanism further comprises: a feature processing sub-model built based on a channel attention mechanism and a spatial attention mechanism;
the generating unit specifically includes:
the construction subunit is used for constructing a two-dimensional feature matrix according to the coding vector to obtain a feature map to be processed;
the first input subunit is used for inputting the feature map to be processed into the feature processing submodel to obtain a target feature map output by the feature processing submodel;
the segmentation subunit is used for segmenting the target feature graph to obtain a plurality of target feature subgraphs;
the splicing subunit is used for splicing the feature vectors extracted from the target feature subgraphs to obtain target feature vectors;
and the second input subunit is used for inputting the target characteristic vector into the decoder to obtain the living body detection result output by the decoder and aiming at the image to be detected.
18. The apparatus of claim 17, wherein the encoder is an encoder in a Transformer model and the decoder is a decoder in the Transformer model; the feature processing submodel includes: a channel attention module and a spatial attention module;
the output end of the encoder is connected with the input end of the channel attention module, the output end of the channel attention module is connected with the input end of the space attention module, and the output end of the space attention module is connected with the input end of the decoder.
19. The device of claim 18, wherein the living body detection model built based on the attention mechanism is a classification model; the in-vivo detection result output by the decoder aiming at the image to be detected is a classification result; and the classification result is used for indicating whether the target object in the image to be detected is a living body.
20. The apparatus of claim 19, further comprising:
the second acquisition module is used for acquiring a training sample set, wherein training samples in the training sample set are fusion feature vector samples obtained by performing feature fusion processing on visual feature vector samples and semantic feature vector samples of sample images; the training sample carries classification label data used for indicating whether a specified object in the sample image is a living body;
and the training module is used for training the initial living body detection model built based on the attention mechanism by using the training sample set to obtain the trained living body detection model built based on the attention mechanism.
21. A living body detection apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to:
obtaining semantic feature information and visual feature information of an image to be detected;
performing feature fusion processing on the visual feature information and the semantic feature information to obtain a fusion feature vector of the image to be detected;
and processing the fusion characteristic vector by using a living body detection model built based on an attention mechanism to obtain a living body detection result aiming at the image to be detected.
CN202211192532.0A 2022-09-28 2022-09-28 Living body detection method, device and equipment Pending CN115546908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211192532.0A CN115546908A (en) 2022-09-28 2022-09-28 Living body detection method, device and equipment


Publications (1)

Publication Number Publication Date
CN115546908A true CN115546908A (en) 2022-12-30

Family

ID=84732187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211192532.0A Pending CN115546908A (en) 2022-09-28 2022-09-28 Living body detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN115546908A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117572379A (en) * 2024-01-17 2024-02-20 厦门中为科学仪器有限公司 Radar signal processing method based on CNN-CBAM shrinkage two-class network
CN117572379B (en) * 2024-01-17 2024-04-12 厦门中为科学仪器有限公司 Radar signal processing method based on CNN-CBAM shrinkage two-class network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination