CN115705758A - Living body identification method, living body identification device, electronic device, and storage medium - Google Patents

Living body identification method, living body identification device, electronic device, and storage medium

Info

Publication number
CN115705758A
CN115705758A (Application CN202110926412.8A)
Authority
CN
China
Prior art keywords
image
feature
classification
output
living body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110926412.8A
Other languages
Chinese (zh)
Inventor
孙婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd, Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN202110926412.8A
Publication of CN115705758A
Legal status: Pending (current)

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a living body identification method, a living body identification device, an electronic device, and a storage medium. The method acquires a preset number of consecutive multi-frame images during the living body detection process and respectively intercepts an eye region image from each frame of image; calculates the eye aspect ratio corresponding to the eye region image and adjusts the eye region image to a preset size; inputs the adjusted eye region images into a pre-trained feature extraction network to obtain the corresponding image features; and inputs the image features and the corresponding eye aspect ratios into a pre-trained spatial Transformer network and a time sequence Transformer network to classify and identify whether they represent a living body. In this scheme, eye features are extracted from consecutive multi-frame images, the frames undergo feature classification through the spatial Transformer network, an overall feature is obtained through the time sequence Transformer network from the feature classification and the eye aspect ratios, living body classification based on the overall feature is finally completed, and fast living body classification and identification based on the time sequence information of the video stream is realized.

Description

Living body recognition method, living body recognition device, electronic device, and storage medium
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to a living body identification method, a living body identification device, electronic equipment and a storage medium.
Background
With the rapid development of face recognition technology, more and more face recognition applications appear on the market; meanwhile, attacks on face recognition using videos and the like are increasingly frequent, so the security of face recognition has attracted great attention. Against these attacks, the living body recognition problem in the face recognition process, namely how to distinguish a picture of a legitimate user from the real user, is crucial to the security of a face recognition system. At present, living body recognition mainly falls into two types: one requires the user's cooperation to make a corresponding expression or posture change, and the other does not require the user's cooperation. In cooperative living body recognition, detecting whether the user blinks is a common method, so how to improve the accuracy of blink recognition is a hot research issue.
Current implementation schemes of blink detection fall into three categories. The first judges the eye state (open or closed) of the video stream frame by frame and decides by logic whether a blink occurs in the whole video stream; the second extracts different features from the video stream and feeds them into a CNN (Convolutional Neural Network) for judgment; the third predicts over the video stream with sequence models such as LSTM (Long Short-Term Memory) networks. Each of these schemes offers only some of the desired advantages and cannot quickly complete the result judgment based on the time sequence information of the video stream.
Disclosure of Invention
The invention provides a living body identification method, a living body identification device, an electronic device and a storage medium, which are used to solve the technical problem that, in living body detection during the existing face recognition process, the result judgment cannot be quickly completed based on the time sequence information of the video stream.
In a first aspect, an embodiment of the present invention provides a living body identification method, including:
acquiring a preset number of consecutive multi-frame images in the living body detection process, and respectively intercepting eye region images from each frame of image;
calculating the eye aspect ratio corresponding to the eye region image, and adjusting the eye region image to a preset size;
inputting the adjusted eye region image into a pre-trained feature extraction network to obtain corresponding image features;
and inputting the image features and the corresponding eye aspect ratio into a pre-trained space Transformer network and a time sequence Transformer network, and carrying out classification identification on whether the image features are living bodies.
In a second aspect, an embodiment of the present invention further provides a living body identification apparatus, including:
the image acquisition unit is used for acquiring a preset number of consecutive multi-frame images in the living body detection process and respectively intercepting eye region images from each frame of image;
the image processing unit is used for calculating the eye aspect ratio corresponding to the eye region image and adjusting the eye region image to a preset size;
the feature extraction unit is used for inputting the adjusted eye region image into a pre-trained feature extraction network to obtain corresponding image features;
and the classification and identification unit is used for inputting the image characteristics and the corresponding eye aspect ratio into a pre-trained space Transformer network and a time sequence Transformer network and performing classification and identification on whether the image characteristics are living bodies.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the electronic device is caused to implement the living body identification method according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the living body identification method according to the first aspect.
In the method, a preset number of consecutive multi-frame images are obtained during the living body detection process, and eye region images are respectively intercepted from each frame of image; the eye aspect ratio corresponding to each eye region image is calculated, and the eye region image is adjusted to a preset size; the adjusted eye region images are input into a pre-trained feature extraction network to obtain the corresponding image features; and the image features and the corresponding eye aspect ratios are input into a pre-trained spatial Transformer network and a time sequence Transformer network to classify and identify whether they represent a living body. In this scheme, eye features are extracted from consecutive multi-frame images, the frames undergo feature classification through the spatial Transformer network, an overall feature is obtained through the time sequence Transformer network from the per-frame feature classification and the eye aspect ratios that carry the time sequence relation between the images, living body classification based on the overall feature is finally completed, and fast living body classification and identification based on the time sequence information of the video stream is realized.
Drawings
Fig. 1 is a flowchart of a method for identifying a living body according to an embodiment of the present invention;
FIG. 2 is a schematic illustration of eye key points for open eyes;
FIG. 3 is a schematic diagram of eye key points of eye closure;
FIG. 4 is a schematic view of a captured eye region image;
FIG. 5 is a schematic structural diagram of a spatial Transformer network;
FIG. 6 is a schematic diagram of a data processing process for feature extraction and living body recognition of an eye region image;
fig. 7 is a schematic structural diagram of a living body identification apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration and not limitation. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
It should be noted that, for the sake of brevity, this description does not exhaust all alternative embodiments, and it should be understood by those skilled in the art after reading this description that any combination of features may constitute an alternative embodiment as long as the features are not mutually inconsistent.
The following examples are described in detail.
Fig. 1 is a flowchart of a living body identification method provided in an embodiment of the present invention, where the living body identification method is used in an electronic device. As shown in the figure, the living body identification method includes:
step S110: acquiring continuous preset number of multi-frame images in the living body detection process, and respectively intercepting eye region images from each frame of image.
A person's blink is a relatively fast process, generally lasting 100 ms to 400 ms. Mapped onto image acquisition during face recognition, the number of image frames corresponding to 100 ms to 400 ms is determined by the frame rate of the video stream acquired by the camera and is generally 7 to 17 frames; that is, 7 to 17 frames of images are needed to record one blink. In this embodiment, 13 frames are taken as an example for the detailed description. Performing living body recognition on a preset number of consecutive frames means that, during face image acquisition, every 13 consecutive frames, namely frames 0 to 12, frames 1 to 13, frames 2 to 14, and so on, are taken as one group of multi-frame images, and the recognition process of this scheme is executed on each group until recognition finishes, namely until face recognition is confirmed to have succeeded or failed.
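For illustration only, the frame grouping described above can be sketched as follows; this minimal Python sketch assumes the 13-frame window used as the example in this embodiment and leaves the frame source abstract:

```python
from collections import deque

WINDOW = 13  # number of consecutive frames per group (example value used in this embodiment)

def frame_groups(frame_stream):
    """Yield overlapping groups of WINDOW consecutive frames: frames 0-12,
    then frames 1-13, then frames 2-14, and so on, until recognition finishes."""
    buffer = deque(maxlen=WINDOW)
    for frame in frame_stream:
        buffer.append(frame)
        if len(buffer) == WINDOW:
            yield list(buffer)
```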
This method is mainly used to confirm whether the face is a real face. Confirming whether the face belongs to the corresponding user requires a complete face image, whereas in this scheme determining whether the face is a real face requires only an image of the eye region. Therefore, face detection, 68-point facial keypoint detection and face alignment are performed on each frame of image, and the keypoint information of the eyes is then obtained. Specifically, each eye has 6 keypoints: Fig. 2 shows the 6 keypoints of one eye in the open state (P1 and P4 at the two corners of the eye, P2 and P3 on the upper edge of the eye, and P5 and P6 on the lower edge of the eye), and Fig. 3 shows the 6 keypoints of each eye in the closed state.
The eye region image is intercepted according to the positions of the keypoints: for the 12 keypoints corresponding to the two eyes, a set number of pixels (for example, 8 or 10 pixels) is extended outwards from the maximum y coordinate, the minimum y coordinate, the maximum x coordinate and the minimum x coordinate of the 12 keypoints, and the image within the range determined by these four coordinate values is then intercepted to obtain the eye region image; an intercepted eye region image is shown in Fig. 4.
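A minimal sketch of this interception step, assuming the 12 eye keypoints of both eyes have already been obtained from the 68-point facial landmarks and using the 8-pixel margin mentioned above as the default:

```python
import numpy as np

def crop_eye_region(image, eye_points, margin=8):
    """Intercept the eye region: expand the bounding box of the 12 eye keypoints
    (both eyes) outwards by `margin` pixels and cut that range out of the frame.
    Resizing to the preset size then follows in step S120."""
    pts = np.asarray(eye_points, dtype=np.int32)        # shape (12, 2), (x, y) pairs
    x_min, y_min = pts.min(axis=0) - margin
    x_max, y_max = pts.max(axis=0) + margin
    h, w = image.shape[:2]
    x_min, y_min = max(int(x_min), 0), max(int(y_min), 0)          # clamp to image bounds
    x_max, y_max = min(int(x_max), w - 1), min(int(y_max), h - 1)
    return image[y_min:y_max + 1, x_min:x_max + 1]
```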
Step S120: and calculating the eye aspect ratio corresponding to the eye region image, and adjusting the eye region image to a preset size.
During a blink, the distance between the upper and lower keypoints of the eye changes continuously: it is smallest when the eye is closed and largest when the eye is open, and a blink proceeds from open to closed to open. Taking Fig. 2 as an example, the eye aspect ratio (EAR) is defined as follows:
$$\mathrm{EAR} = \frac{\lVert P_2 - P_6 \rVert + \lVert P_3 - P_5 \rVert}{2\,\lVert P_1 - P_4 \rVert}$$
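For reference, the eye aspect ratio of a single eye can be computed directly from its six keypoints. The sketch below assumes the standard EAR definition reconstructed above:

```python
import numpy as np

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR for one eye, following the keypoint layout of Fig. 2: p1/p4 are the eye
    corners, p2/p3 lie on the upper edge and p5/p6 on the lower edge. The value is
    largest when the eye is open and approaches zero as the eye closes."""
    p1, p2, p3, p4, p5, p6 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4, p5, p6))
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)
```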
in the process of obtaining the eye region image, if the eye region image and the corresponding EAR data in the continuous 13 frames of images are confirmed to be obtained, the behavior of blinking is carried out according to the eye region image and the EAR data, if the number of the frames is not 13, the eye region image in the next frame of image is continuously obtained until the region interception of the 13 frames of images and the calculation of the eye aspect ratio are completed.
In the subsequent identification process, feature extraction and judgment can only be performed on images of the same size. For face images, however, the proportion of the eye region differs between people, and the distance from the subject to the lens during image acquisition may also differ, so the original sizes of the intercepted eye region images vary. For this reason, the images are processed to the same preset size for subsequent identification.
Step S130: and inputting the adjusted eye region image into a pre-trained feature extraction network to obtain a corresponding image feature.
The process in which the classification network classifies the eye region image is essentially a process of classifying the image features corresponding to the eye region image. Feature extraction from the eye region image may be implemented by various existing feature extraction networks, such as convolutional neural networks or deep neural networks. In a specific feature extraction process, the feature extraction network obtains a multi-dimensional feature map, which is then expanded to obtain the desired image features. Specifically, step S130 may be implemented by steps S131 and S132.
Step S131: inputting the adjusted eye region images into a pre-trained feature extraction network, and correspondingly outputting a feature map of each eye region image;
step S132: and respectively expanding the characteristic graph of each eye region image to correspondingly obtain the image characteristics with preset dimensionality.
In a specific feature extraction architecture, the eye region images corresponding to the 13 frames may be input into the pre-trained feature extraction network as a whole to obtain the feature map of the eye region image of each frame; the feature maps are then respectively expanded to obtain image features of the preset dimensionality, and finally the image features corresponding to the 13 frames are taken as a whole for the subsequent judgment. Alternatively, the eye region images corresponding to the 13 frames may be input into the pre-trained feature extraction network frame by frame, successively obtaining the feature map and the corresponding image features of each frame's eye region image, with the image features of the 13 consecutive frames finally taken as a whole for the subsequent judgment; this reduces repeated feature extraction on a single frame of image. Of course, the feature extraction network may also directly output image features of the preset dimensionality without a separate expansion operation on the feature map.
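As one concrete possibility, the sizes quoted later in the description of Fig. 6 (a 32 × 128 input, a 2 × 8 × 256 feature map, 16 × 256 image features) can be reproduced by truncating a standard ResNet18 after its third stage. The truncation and the use of torchvision are assumptions of this sketch, since the text only names ResNet18 and the tensor sizes:

```python
import torch.nn as nn
from torchvision.models import resnet18

class EyeFeatureExtractor(nn.Module):
    """Feature extraction sketch: a ResNet18 truncated after its third stage, so a
    3 x 32 x 128 eye region image yields a 256 x 2 x 8 feature map, which is then
    expanded into 16 image features of 256 dimensions each (sizes taken from the
    Fig. 6 example below; the truncation itself is an assumption)."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.Sequential(backbone.layer1, backbone.layer2, backbone.layer3)

    def forward(self, x):                              # x: (batch, 3, 32, 128)
        fmap = self.stages(self.stem(x))               # (batch, 256, 2, 8)
        b, c, h, w = fmap.shape
        return fmap.view(b, c, h * w).transpose(1, 2)  # (batch, 16, 256) image features
```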
Step S140: and inputting the image features and the corresponding eye aspect ratio into a pre-trained space Transformer network and a time sequence Transformer network, and carrying out classification and identification on whether the image features are living bodies.
In this scheme, in addition to feature recognition on the eye region image within a single frame, a comprehensive judgment over the feature recognition results of all the images is also carried out; the former is realized by the spatial Transformer network and the latter by the time sequence Transformer network. During feature classification and identification, each feature must be kept at its corresponding position for the judgment to be accurate; specifically, positions may be marked and features recorded initially, or positions may be marked and features recorded during the feature classification and identification itself. If the corresponding positions are marked and features recorded during the feature classification process, step S140 may specifically be implemented by steps S141 to S145:
step S141: adding a first classification dimension to the image features, and adding a first position code to obtain a first combined feature, wherein the first position code records position information corresponding to the image features;
step S142: inputting the first combined feature into a pre-trained space Transformer network to obtain a first output feature consisting of outputs corresponding to the first classification dimension, wherein the output of each dimension of the image feature is correspondingly recorded in the first classification dimension;
step S143: adding a second classification dimension, an eye aspect ratio corresponding to the eye region image and a second position code to the first output feature to obtain a second combined feature, wherein the second position code records position information corresponding to the first output feature;
step S144: inputting the second combined feature into a pre-trained time sequence Transformer network to obtain a second output feature consisting of outputs corresponding to the second classification dimension, wherein the outputs corresponding to the second classification dimension are fused with information of other dimension features in the second combined feature;
step S145: and inputting the second output characteristic into a full-connection layer with the category number of 2, and outputting the identification result of whether the second output characteristic is a living body.
The Transformer structure is widely applied in the field of natural language processing and comprises an encoder and a decoder; in this scheme, only the encoder part is used to realize the judgment of whether a living body is present. As shown in Fig. 5, the spatial Transformer network comprises a plurality of sequentially connected encoder layers, each encoder layer comprising a forward neural network module and a plurality of self-attention modules. A Transformer layer consists of a self-attention mechanism and a forward neural network; a trainable Transformer-based neural network can be built by stacking such layers, and the number of stacked layers is the depth of the network. The Transformer structure addresses two problems: first, the relation between any two positions in a frame sequence is computed with an attention mechanism; second, it has good parallelism and fits the existing GPU (Graphics Processing Unit) architecture. Based on this, step S142 includes:
step S1421: taking the first combined characteristic input into the spatial Transformer network as an input characteristic of a first encoder layer, and taking the output of a forward neural network module in a previous encoder layer as an input characteristic of a subsequent encoder layer;
step S1422: the input features corresponding to each encoder layer are subjected to self-attention calculation with preset number respectively, the output of each self-attention calculation is spliced, and the output weight matrix is restored to the original size of the first combined features and then output to the corresponding forward neural network module;
step S1423: and extracting the output corresponding to the first classification dimension from the output of the last neural network module to form a first output characteristic.
In a specific implementation process, the first position code is a parameter matrix of a preset size initialized based on a Gaussian distribution; the first position code is trained and updated in the spatial Transformer network along with the first combined feature. The second position code is a trigonometric-function-based position code; the second position code remains unchanged in the time sequence Transformer network.
The detailed image processing procedure of steps S130 and S140 may refer to Fig. 6. This scheme uses a spatial Transformer network and a time sequence Transformer network: the spatial Transformer network mainly acquires the features within each eye region image and has 12 layers in total; the time sequence Transformer network acquires the features between the eye region images and has 6 layers in total, and only the encoder part is used. As shown in Fig. 6, feature extraction is performed on the 13 eye region images by a CNN (Convolutional Neural Network) to obtain the corresponding feature maps. ResNet18 is chosen as the CNN here; the input picture size is 32 × 128 and the output feature map size is 2 × 8 × 256. The obtained 2 × 8 × 256 feature map is expanded into image features of size 16 × 256, so that each 256-dimensional image feature corresponds to a small block of the original image. Because each small block produces one output after the spatial Transformer network calculation, while each eye region image needs only one output in the subsequent calculation, there are two possible modes: the first is to add an extra dimension, called a class token, to the input and take only the feature of that dimension in the subsequent calculation; the second is to average the multiple outputs of the same frame. The first mode is chosen here: a one-dimensional class token (not shown in Fig. 6) is added to the image features, so the image feature size becomes 17 × 256.
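The class-token mode chosen above might be realized as follows; treating the token as a trainable parameter is an assumption, since the description only states that an extra dimension is added and that only its output is used in the subsequent calculation:

```python
import torch
import torch.nn as nn

class AddClassToken(nn.Module):
    """Prepend a class token to the 16 x 256 image features of one frame so that the
    token sequence becomes 17 x 256; only the output at the class-token position is
    kept after the spatial Transformer calculation."""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # trainable token (assumption)

    def forward(self, tokens):                            # tokens: (batch, 16, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)     # one class token per sample
        return torch.cat([cls, tokens], dim=1)            # (batch, 17, dim)
```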
The image features themselves contain position information, but this position information is discarded during the Transformer calculation (changing the order of the features would not change the result), so position codes must be added to the features after expansion in order to retain the position information. A position code of size 17 × 256 is therefore added. There are two ways to add position coding: in the first, a parameter matrix of a specific size is initialized from a Gaussian distribution and trained and updated along with the network; in the second, the position code is obtained from trigonometric functions and is not updated during training. The trigonometric encoding formulas are as follows:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
where pos represents the position of the image, i represents the i-th dimension of the image feature, and d_model represents the dimension size of the image feature. The first coding mode is selected for the spatial Transformer network, and the second coding mode is selected for the time sequence Transformer network.
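Both coding modes can be sketched as follows; the Gaussian initialization scale is an assumed value, and the trigonometric variant implements the two formulas above:

```python
import math
import torch
import torch.nn as nn

def learned_position_code(num_tokens=17, dim=256, std=0.02):
    """First mode: a parameter matrix initialized from a Gaussian distribution and
    trained together with the network (used for the spatial Transformer network)."""
    return nn.Parameter(torch.randn(num_tokens, dim) * std)

def sinusoidal_position_code(num_tokens, dim):
    """Second mode: fixed trigonometric encoding following the formulas above; it is
    not updated during training (used for the time sequence Transformer network)."""
    pe = torch.zeros(num_tokens, dim)
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div[: dim // 2])   # slice keeps odd feature dimensions valid
    return pe
```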
The following calculation is performed for each encoder layer of the spatial Transformer network:
through the self-attention calculation for the encoder layers, a multi-head attention implementation of h =8 is adopted, which is equivalent to that each encoder layer is integrated by h different self-attention modules. And respectively carrying out h different self-attention calculations on the same input to obtain respective outputs, splicing the outputs together, and recovering the outputs to the original size through an output weight matrix. Through the calculation of the forward neural network module of the encoder layer, each eye region image correspondingly has 1x256 feature output, and by adding the EAR value calculated by each frame, 13 frames have 13x (256 + 1) features. The output of the last layer of the spatial Transformer network corresponding to the first classification dimension is used as the input of the timing Transformer network.
Similar to the spatial Transformer network, a class token for the final classification and a position code need to be added to the data input to the time sequence Transformer network, except that a non-learned position code, i.e. the second coding scheme described above, is selected here. Each layer of the time sequence Transformer network computes h = 8 multi-head self-attention and a forward neural network module; the output at the class token position of the last layer has size 1 × 257, and this output fuses the information of the other dimension features input to the time sequence Transformer network, that is, the overall feature information of the 13 frames of images. Based on the overall feature information fused in the class token, the final classification result is obtained by passing the output of the time sequence Transformer network (namely the output at the dimension corresponding to the class token) through a fully connected layer with 2 classes (living body and non-living body). The classification of the fully connected layer is obtained by training on samples; the specific training process is already implemented and is not described in detail here.
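Putting the time sequence stage together, a sketch of the per-frame feature and EAR fusion, class token, fixed position code, 6-layer encoder and 2-class fully connected layer might look as follows. It reuses the sinusoidal_position_code helper from the earlier sketch, and it projects the 257-dimensional per-frame vectors to 256 dimensions because PyTorch's multi-head attention requires the embedding dimension to be divisible by the head count; that projection is an assumption added for the sketch rather than part of the description above:

```python
import torch
import torch.nn as nn

class TemporalLivenessHead(nn.Module):
    """Time sequence stage sketch: each frame's spatial-Transformer class-token output
    (256-d) is concatenated with that frame's EAR value, projected back to 256-d (an
    assumption; see lead-in), combined with a class token and the fixed trigonometric
    position code, passed through a 6-layer Transformer encoder, and the class-token
    output is classified by a fully connected layer with 2 classes."""
    def __init__(self, frame_dim=256, dim=256, heads=8, layers=6, num_frames=13):
        super().__init__()
        self.proj = nn.Linear(frame_dim + 1, dim)        # fuse frame feature with its EAR value
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # class token for the whole sequence
        self.register_buffer("pos", sinusoidal_position_code(num_frames + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.fc = nn.Linear(dim, 2)                      # 2 classes: living body / non-living body

    def forward(self, frame_feats, ears):
        # frame_feats: (batch, 13, 256) per-frame outputs of the spatial Transformer
        # ears:        (batch, 13) eye aspect ratio of each frame
        x = self.proj(torch.cat([frame_feats, ears.unsqueeze(-1)], dim=-1))
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos        # (batch, 14, 256) with position code
        return self.fc(self.encoder(x)[:, 0])            # logits from the class-token output
```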
It should be noted that, in the description of this solution based on Fig. 6, each parameter is only an exemplary illustration and does not represent a unique limitation; for example, the spatial Transformer network may instead have 8 or 10 layers. Other similar parameters are not enumerated one by one here.
In summary, the method obtains a preset number of consecutive multi-frame images during the living body detection process and respectively intercepts eye region images from each frame of image; calculates the eye aspect ratio corresponding to each eye region image and adjusts the eye region image to a preset size; inputs the adjusted eye region images into a pre-trained feature extraction network to obtain the corresponding image features; and inputs the image features and the corresponding eye aspect ratios into a pre-trained spatial Transformer network and a time sequence Transformer network to classify and identify whether they represent a living body. In this scheme, eye features are extracted from consecutive multi-frame images, the frames undergo feature classification through the spatial Transformer network, an overall feature is obtained through the time sequence Transformer network from the per-frame feature classification and the eye aspect ratios that carry the time sequence relation between the images, living body classification based on the overall feature is finally completed, and fast living body classification and identification based on the time sequence information of the video stream is realized.
Fig. 7 is a schematic structural diagram of a living body identification apparatus according to an embodiment of the present invention. The living body identification apparatus is used in an electronic device; referring to Fig. 7, the living body identification apparatus includes an image acquisition unit 210, an image processing unit 220, a feature extraction unit 230, and a classification recognition unit 240.
The image acquisition unit 210 is configured to acquire a preset number of consecutive multi-frame images during the living body detection process and respectively intercept eye region images from each frame of image; the image processing unit 220 is configured to calculate the eye aspect ratio corresponding to the eye region image and adjust the eye region image to a preset size; the feature extraction unit 230 is configured to input the adjusted eye region image into a pre-trained feature extraction network to obtain the corresponding image features; and the classification recognition unit 240 is configured to input the image features and the corresponding eye aspect ratios into a pre-trained spatial Transformer network and a time sequence Transformer network and perform classification and identification on whether the image features represent a living body.
On the basis of the above embodiment, the classification identifying unit 240 includes:
the feature processing module is used for adding a first classification dimension to the image features and adding a first position code to obtain a first combined feature, wherein the first position code records position information corresponding to the image features;
a spatial processing module, configured to input the first combined feature into a pre-trained spatial transform network to obtain a first output feature composed of outputs corresponding to the first classification dimension, where an output of each dimension of the image feature is recorded in the first classification dimension correspondingly;
a feature combining module, configured to add a second classification dimension, an eye aspect ratio corresponding to the eye region image, and a second position code to the first output feature to obtain a second combined feature, where the second position code records position information corresponding to the first output feature;
the time processing module is used for inputting the second combined feature into a pre-trained time sequence Transformer network to obtain a second output feature consisting of outputs corresponding to the second classification dimension, and the outputs corresponding to the second classification dimension are fused with information of other dimension features in the second combined feature;
and the characteristic classification module is used for inputting the second output characteristic into the full-connection layer with the category number of 2 and outputting the identification result of whether the second output characteristic is a living body.
On the basis of the above embodiment, the spatial Transformer network comprises a plurality of layers of encoder layers which are sequentially connected, wherein each layer of encoder layer comprises a forward neural network module and a plurality of self-attention modules;
the spatial processing module comprises:
the encoder layer input submodule is used for taking the first combined characteristic input into the spatial Transformer network as the input characteristic of a first encoder layer and taking the output of a forward neural network module in a previous encoder layer as the input characteristic of a subsequent encoder layer;
the encoder layer processing submodule is used for performing self-attention calculation with preset number on the input features corresponding to each encoder layer, splicing the output of each self-attention calculation, recovering the original size of the first combined features through an output weight matrix, and outputting the original size to the corresponding forward neural network module;
and the first feature output submodule is used for extracting the output of the last neural network module, and the output corresponding to the first classification dimension forms a first output feature.
On the basis of the above embodiment, the first position code is a parameter matrix of a preset size initialized based on gaussian distribution; the first position code trains an update in the spatial Transformer network following the first combined feature.
On the basis of the above embodiment, the second position code is a position code based on a trigonometric function; the second position code remains unchanged in the time-sequential Transformer network.
On the basis of the above embodiment, the feature extraction unit 230 includes:
the feature map generation module is used for inputting the adjusted eye region images into a pre-trained feature extraction network and correspondingly outputting a feature map of each eye region image;
and the characteristic diagram unfolding module is used for respectively unfolding the characteristic diagram of each eye region image to correspondingly obtain the image characteristics with preset dimensionality.
The living body identification apparatus provided by the embodiment of the present invention is included in an electronic device, can be used to execute any living body identification method provided by the above embodiments, and has the corresponding functions and beneficial effects.
It should be noted that, in the embodiment of the living body identification device, the included units and modules are merely divided according to the functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the electronic device includes a processor 310, a memory 320, an input device 330, an output device 340, and a communication device 350; the number of the processors 310 in the electronic device may be one or more, and one processor 310 is taken as an example in fig. 8; the processor 310, the memory 320, the input device 330, the output device 340 and the communication device 350 in the electronic apparatus may be connected by a bus or other means, and fig. 8 illustrates an example of connection by a bus.
The memory 320, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the living body identification method in the embodiment of the present invention (for example, the image obtaining unit 210, the image processing unit 220, the feature extraction unit 230, and the classification recognition unit 240 in the living body identification apparatus). The processor 310 executes various functional applications of the electronic device and data processing by executing software programs, instructions, and modules stored in the memory 320, that is, implements the living body identification method described above.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display device such as a display screen.
The electronic equipment comprises the living body identification device, can be used for executing any living body identification method, and has corresponding functions and beneficial effects.
Embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform relevant operations in the living body identification method provided in any embodiment of the present application, and have corresponding functions and beneficial effects.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product.
Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include forms of volatile memory in a computer readable medium, random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising the element.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A living body identification method, comprising:
acquiring a preset number of consecutive multi-frame images in the living body detection process, and respectively intercepting eye region images from each frame of image;
calculating the eye aspect ratio corresponding to the eye region image, and adjusting the eye region image to a preset size;
inputting the adjusted eye region image into a pre-trained feature extraction network to obtain corresponding image features;
and inputting the image features and the corresponding eye aspect ratio into a pre-trained space Transformer network and a time sequence Transformer network, and carrying out classification identification on whether the image features are living bodies.
2. The method of claim 1, wherein the inputting the image features and corresponding eye aspect ratios into a pre-trained spatial Transformer network and a temporal Transformer network to perform classification and identification on whether the image features are living bodies comprises:
adding a first classification dimension to the image features, and adding a first position code to obtain a first combined feature, wherein the first position code records position information corresponding to the image features;
inputting the first combined feature into a pre-trained space Transformer network to obtain a first output feature consisting of outputs corresponding to the first classification dimension, wherein the output of each dimension of the image feature is correspondingly recorded in the first classification dimension;
adding a second classification dimension, an eye aspect ratio corresponding to the eye region image and a second position code to the first output feature to obtain a second combined feature, wherein the second position code records position information corresponding to the first output feature;
inputting the second combined feature into a pre-trained time sequence Transformer network to obtain a second output feature consisting of outputs corresponding to the second classification dimension, wherein the outputs corresponding to the second classification dimension are fused with information of other dimension features in the second combined feature;
and inputting the second output characteristic into a full-connection layer with the category number of 2, and outputting the identification result of whether the second output characteristic is a living body.
3. The method of claim 2, wherein the spatial Transformer network comprises sequentially connected layers of encoders, each layer of the encoders comprising a forward neural network module and a plurality of self-attention modules;
inputting the first combined feature into a pre-trained spatial Transformer network to obtain a first output feature composed of outputs corresponding to the first classification dimension, including:
taking the first combined features input into the spatial Transformer network as input features of a first encoder layer, and taking the output of a forward neural network module in a previous encoder layer as input features of a subsequent encoder layer;
the input features corresponding to each encoder layer are subjected to self-attention calculation with preset number respectively, the output of each self-attention calculation is spliced, and the output weight matrix is restored to the original size of the first combined features and then output to the corresponding forward neural network module;
and extracting the output corresponding to the first classification dimension from the output of the last neural network module to form a first output characteristic.
4. The method of claim 2, wherein the first position code is a parameter matrix of a preset size initialized based on a gaussian distribution; the first position code trains an update in the spatial Transformer network following the first combined feature.
5. The method of claim 2, wherein the second position encoding is a trigonometric function based position encoding; the second position code remains unchanged in the time-sequential Transformer network.
6. The method of claim 1, wherein inputting the adjusted eye region image into a pre-trained feature extraction network to obtain corresponding image features comprises:
inputting the adjusted eye region images into a pre-trained feature extraction network, and correspondingly outputting a feature map of each eye region image;
and respectively expanding the characteristic graph of each eye region image to correspondingly obtain the image characteristics with preset dimensionality.
7. A living body identification device, comprising:
the image acquisition unit is used for acquiring a preset number of consecutive multi-frame images in the living body detection process and respectively intercepting eye region images from each frame of image;
the image processing unit is used for calculating the eye aspect ratio corresponding to the eye region image and adjusting the eye region image to a preset size;
the feature extraction unit is used for inputting the adjusted eye region image into a pre-trained feature extraction network to obtain corresponding image features;
and the classification and identification unit is used for inputting the image characteristics and the corresponding eye aspect ratio into a pre-trained space Transformer network and a time sequence Transformer network and performing classification and identification on whether the image characteristics are living bodies.
8. The apparatus of claim 7, wherein the classification identifying unit comprises:
the characteristic processing module is used for adding a first classification dimension to the image characteristic and adding a first position code to obtain a first combined characteristic, wherein the first position code records position information corresponding to the image characteristic;
the spatial processing module is used for inputting the first combined feature into a pre-trained spatial transform network to obtain a first output feature consisting of outputs corresponding to the first classification dimension, and the output of each dimension of the image feature is correspondingly recorded in the first classification dimension;
the feature compounding module is used for adding a second classification dimension, an eye aspect ratio corresponding to the eye region image and a second position code to the first output feature to obtain a second combined feature, wherein the second position code records position information corresponding to the first output feature;
a time processing module, configured to input the second combination feature into a pre-trained time sequence Transformer network to obtain a second output feature formed by outputs corresponding to the second classification dimension, where the outputs corresponding to the second classification dimension are fused with information of other dimension features in the second combination feature;
and the characteristic classification module is used for inputting the second output characteristic into the full-connection layer with the category number of 2 and outputting the identification result of whether the second output characteristic is a living body.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the electronic device to implement the living body identification method of any one of claims 1-6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the living body identification method according to any one of claims 1 to 6.
CN202110926412.8A 2021-08-12 2021-08-12 Living body identification method, living body identification device, electronic device, and storage medium Pending CN115705758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110926412.8A CN115705758A (en) 2021-08-12 2021-08-12 Living body identification method, living body identification device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110926412.8A CN115705758A (en) 2021-08-12 2021-08-12 Living body identification method, living body identification device, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN115705758A true CN115705758A (en) 2023-02-17

Family

ID=85180956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110926412.8A Pending CN115705758A (en) 2021-08-12 2021-08-12 Living body identification method, living body identification device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN115705758A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937990A (en) * 2023-02-27 2023-04-07 珠海金智维信息科技有限公司 Multi-person interactive action detection system and method
CN115937990B (en) * 2023-02-27 2023-06-23 珠海金智维信息科技有限公司 Multi-person interaction detection system and method

Similar Documents

Publication Publication Date Title
Huh et al. Fighting fake news: Image splice detection via learned self-consistency
CN108460415B (en) Language identification method
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
US10984225B1 (en) Masked face recognition
CN110516514B (en) Modeling method and device of target detection model
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
KR20190125029A (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
Li et al. Image manipulation localization using attentional cross-domain CNN features
CN111985537A (en) Target image identification method, terminal, system and storage medium
CN111461211A (en) Feature extraction method for lightweight target detection and corresponding detection method
CN115705758A (en) Living body identification method, living body identification device, electronic device, and storage medium
CN114565955A (en) Face attribute recognition model training and community personnel monitoring method, device and equipment
CN116012626B (en) Material matching method, device, equipment and storage medium for building elevation image
CN112614111A (en) Video tampering operation detection method and device based on reinforcement learning
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN111259701B (en) Pedestrian re-identification method and device and electronic equipment
Patel et al. Deepfake video detection using neural networks
CN115620083A (en) Model training method, face image quality evaluation method, device and medium
CN114998962A (en) Living body detection and model training method and device
CN114549857A (en) Image information identification method and device, computer equipment and storage medium
CN114743148A (en) Multi-scale feature fusion tampering video detection method, system, medium, and device
Sreelekshmi et al. Deep forgery detect: enhancing social media security through deep learning-based forgery detection
CN114360015A (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN112907488A (en) Image restoration method, device, equipment and storage medium
Daryani et al. IRL-Net: Inpainted Region Localization Network via Spatial Attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination