CN114612979A - Living body detection method and device, electronic equipment and storage medium - Google Patents

Living body detection method and device, electronic equipment and storage medium

Info

Publication number
CN114612979A
Authority
CN
China
Prior art keywords
image
channel
information
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210234381.4A
Other languages
Chinese (zh)
Inventor
李茜萌
陆进
朱禹萌
刘玉宇
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210234381.4A
Publication of CN114612979A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

A living body detection method and apparatus, an electronic device and a storage medium are provided. The method includes the following steps: acquiring a plurality of video image frames of a target object, and generating a plurality of image pairs from the plurality of video image frames; performing spatial feature extraction on the image pairs to obtain spatial feature information, which includes the spatial features corresponding to a plurality of first image channels; performing time sequence feature extraction on the image pairs to obtain time sequence feature information, which includes the time sequence features of a plurality of second image channels; generating the channel attention of each first image channel and each second image channel according to the spatial feature information and the time sequence feature information; determining, from the plurality of first image channels and the plurality of second image channels, the target image channels whose channel attention satisfies a preset attention condition; and obtaining the target feature information corresponding to the target image channels for living body detection. Important temporal and spatial feature information is thus retained along the image channel dimension, which helps improve the accuracy of living body detection.

Description

Living body detection method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for detecting a living body, electronic equipment and a storage medium.
Background
In face recognition, living body detection is an important anti-fraud means for recognizing whether a target object is a real person. Existing living body detection methods mainly perform image feature analysis on a single captured static frame to obtain a living body identification result. However, with advances in screen display and paper imaging technologies, this approach has significant limitations, and the accuracy of living body detection is therefore low.
Disclosure of Invention
The application provides a living body detection method and device, electronic equipment and a storage medium, and mainly aims to improve the accuracy of living body detection.
In order to achieve the above object, an embodiment of the present application provides a method for detecting a living body, the method including the following steps:
acquiring a plurality of video image frames of a target object, and generating a plurality of image pairs according to the plurality of video image frames;
extracting spatial features of the image pair to obtain spatial feature information, wherein the spatial feature information comprises a plurality of first image channels and spatial features corresponding to the first image channels;
performing time sequence feature extraction on the image pair to obtain time sequence feature information, wherein the time sequence feature information comprises a plurality of second image channels and time sequence features corresponding to the second image channels;
generating channel attention of the first image channel and channel attention of the second image channel according to the spatial feature information and the time sequence feature information;
determining a target image channel with channel attention satisfying a preset attention condition from the plurality of first image channels and the plurality of second image channels;
acquiring target characteristic information corresponding to the target image channel according to the spatial characteristic information and the time sequence characteristic information;
and performing living body detection according to the target characteristic information to obtain a living body detection result.
In some embodiments, the performing temporal feature extraction on the image pair to obtain temporal feature information includes: inputting the image pair into a first preset model for time sequence feature extraction to obtain time sequence feature information;
wherein the training step of the first preset model comprises:
acquiring a certain number of image pair samples and target labeling data of the image pair samples; training a first preset model with the image pair samples to obtain training information; generating optical flow characteristic information according to the training information; and adjusting parameters of the first preset model according to the optical flow characteristic information, the training information and the target labeling data until the first preset model meets the training end condition.
In some embodiments, the training a first preset model with the image pair samples to obtain training information includes:
inputting the image pair samples into a first preset model for N times of feature extraction to obtain training information, wherein the training information comprises first extraction information corresponding to the (N-m)-th feature extraction and second extraction information corresponding to the N-th feature extraction, N and m are positive integers, and m ∈ [1, N-1];
generating optical flow feature information according to the training information, including:
and generating optical flow characteristic information according to the first extraction information.
In some embodiments, the target annotation data comprises time sequence feature annotation data and optical flow annotation data of the image pair samples; the adjusting the parameters of the first preset model according to the optical flow characteristic information, the training information and the target labeling data until the first preset model meets the training end condition includes:
calculating a first loss value according to the optical flow characteristic information and the optical flow marking data; calculating a second loss value according to the second extraction information and the time sequence feature marking data; verifying whether the first preset model meets a training end condition or not according to the first loss value and the second loss value; if the first preset model meets the training ending condition, ending the training; if the first preset model does not meet the training end condition, adjusting parameters of the first preset model according to the first loss value and the second loss value, increasing the number of samples and re-executing the training step.
In some embodiments, the determining, from the plurality of first image channels and the plurality of second image channels, a target image channel whose channel attention satisfies a preset attention condition includes:
for each first image channel, taking a second image channel corresponding to the first image channel from the plurality of second image channels as a comparison channel; if the channel attention of the first image channel is greater than the channel attention of the contrast channel, taking the first image channel as a target image channel; and if the channel attention of the first image channel is smaller than the channel attention of the contrast channel, taking the contrast channel as a target image channel.
In some embodiments, the determining, from the plurality of first image channels and the plurality of second image channels, a target image channel for which a channel attention satisfies a preset attention condition includes:
acquiring a preset channel number K, wherein K is a positive integer; and determining K target image channels with the maximum channel attention values from the plurality of first image channels and the plurality of second image channels according to the preset channel number K.
In some implementations, the generating a plurality of image pairs from the plurality of video image frames includes:
acquiring attribute information of each video image frame aiming at the attribute type according to a preset attribute type; according to the attribute information of each video image frame, screening a plurality of first image frames from the plurality of video image frames according to a preset rule; acquiring a second image frame meeting a preset pairing condition with the first image frame from the plurality of video image frames, and taking the first image frame and the second image frame as an image pair, wherein the preset pairing condition comprises: and the attribute information of the first image frame and the attribute information of the second image frame accord with a preset difference condition.
In order to achieve the above object, an embodiment of the present application further provides a living body detection apparatus, including:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of video image frames of a target object and generating a plurality of image pairs according to the video image frames;
the first extraction module is used for extracting spatial features of the image pair to obtain spatial feature information, wherein the spatial feature information comprises a plurality of first image channels and spatial features corresponding to the first image channels;
the second extraction module is used for performing time sequence feature extraction on the image pair to obtain time sequence feature information, wherein the time sequence feature information comprises a plurality of second image channels and time sequence features corresponding to the second image channels;
a generating module, configured to generate a channel attention of the first image channel and a channel attention of the second image channel according to the spatial feature information and the time sequence feature information;
a determining module, configured to determine, from the plurality of first image channels and the plurality of second image channels, a target image channel whose channel attention satisfies a preset attention condition;
the second acquisition module is used for acquiring target characteristic information corresponding to the target image channel according to the spatial characteristic information and the time sequence characteristic information;
and the detection module is used for carrying out living body detection according to the target characteristic information to obtain a living body detection result.
In order to achieve the above object, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a program, and the program implements the steps of the foregoing method when executed by the processor.
To achieve the above object, an embodiment of the present application further provides a computer-readable storage medium, the storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the aforementioned method.
With the living body detection method and apparatus, the electronic device and the storage medium, a plurality of video image frames of the target object can be acquired, and these frames contain the dynamic information of the video. A plurality of image pairs are generated from the video image frames, and spatial features and time sequence features are extracted from the image pairs respectively, so that rich feature information can be obtained in both the temporal and spatial dimensions, namely a plurality of first image channels with their corresponding spatial features and a plurality of second image channels with their corresponding time sequence features. The channel attention of the first image channels and of the second image channels is then generated according to the spatial feature information and the time sequence feature information, and the target image channels whose channel attention satisfies a preset attention condition are determined from the plurality of first image channels and the plurality of second image channels, so that the more critical image channels are retained and the channels with smaller contributions are discarded; the target feature information corresponding to the target image channels is then used for living body detection. The application thus effectively exploits the inconsistency between living and non-living bodies during dynamic movement, overcomes the detection limitation of single-frame static images, retains important temporal and spatial feature information along the image channel dimension, and improves the accuracy of subsequent living body detection.
Drawings
Fig. 1 is a block diagram of an electronic device to which an embodiment of the present application is applied;
FIG. 2 is a schematic flow chart of a method for detecting a living organism according to an embodiment of the present disclosure;
FIG. 3 is a detailed flowchart of step S210 in the embodiment shown in FIG. 2;
FIG. 4 is a schematic diagram of a training process of a first predetermined model in the embodiment of the present application;
FIG. 5 is a schematic structural diagram of a first pre-set model and an optical flow supervision module in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a target model in an embodiment of the present application;
fig. 7 is a block diagram of a living body detection apparatus to which an embodiment of the present application is applied.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted only for convenience of description of the present application and have no special meaning by themselves. Thus, "module", "component" and "unit" may be used interchangeably.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. Artificial intelligence software technology mainly includes directions such as computer vision (e.g., face recognition), robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
In face recognition, living body detection is an important anti-fraud means for recognizing whether a target object is a real person. Existing living body detection methods mainly perform image feature analysis on a single captured static frame to obtain a living body identification result. However, with advances in screen display and paper imaging technologies, this approach has significant limitations, and the accuracy of living body detection is therefore low.
In order to solve the above problems, the present application provides a method for detecting a living body, which is applied to an electronic device. Referring to fig. 1, fig. 1 is a block diagram of an electronic device to which an embodiment of the present application is applied.
In the embodiment of the present application, the electronic device may be a device with computing capability, such as a server, a smartphone, a tablet computer, a laptop computer, or a desktop computer.
The electronic device includes: memory 11, processor 12, network interface 13, and data bus 14.
The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device, such as a hard disk of the electronic device. In other embodiments, the readable storage medium may be an external memory of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device.
In the present embodiment, the readable storage medium of the memory 11 is generally used for storing a living body detection program installed in the electronic device, a plurality of sample sets, a pre-trained model, and the like. The memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used for executing program code stored in the memory 11 or processing data, for example executing a living body detection program.
The network interface 13 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the electronic device and other electronic devices.
The data bus 14 is used to enable connection communication between these components.
Optionally, the electronic device may further include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a loudspeaker or a headset. Optionally, the user interface may further include a standard wired interface or a wireless interface.
Optionally, the electronic device may further include a display, which may also be referred to as a display screen or a display unit. In some embodiments, the display device may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display is used for displaying information processed in the electronic device and for displaying a visualized user interface.
Optionally, the electronic device further comprises a touch sensor. The area provided by the touch sensor for the user to perform touch operation is referred to as a touch area. Further, the touch sensor here may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor, or may be a plurality of sensors arranged in an array, for example.
In addition, the area of the display of the electronic device may be the same as or different from the area of the touch sensor. Optionally, the display is stacked with the touch sensor to form a touch display screen. The device detects touch operation triggered by a user based on the touch display screen.
The following describes a method for detecting a living body disclosed in the embodiments of the present application.
As shown in fig. 2, fig. 2 is a schematic flow chart of a method for detecting a living body according to an embodiment of the present disclosure. Based on the electronic apparatus shown in fig. 1, the processor 12 implements steps S200 to S270 as follows when executing the program stored in the memory 11.
Step S200: a plurality of video image frames of a target object is acquired.
In the embodiment of the application, video data shot for a target object can be acquired, and then a plurality of video image frames are extracted from the video data according to a video time sequence. The video data acquisition mode includes, but is not limited to, any one of the following: shooting a video of a target object by utilizing a shooting device of the electronic equipment; calling pre-stored video data from a designated database or other storage modules; receiving video data uploaded to the electronic device through a user interface; and receiving video data sent by other equipment (such as an entrance guard photographic device or a road monitoring device) to the electronic equipment. It is understood that each video image frame contains a target object, which is any object with specific properties (including shape, gray scale, texture, etc.). For example, in the case of face recognition, the target object may be a human face or other non-living object, including but not limited to a photograph, an electronic screen, a work card, and the like.
Step S210: a plurality of image pairs is generated from the plurality of video image frames.
In an embodiment of the present application, a pair of images may include two video image frames. As an alternative implementation, as shown in fig. 3, fig. 3 is a schematic specific flowchart of step S210 in the embodiment shown in fig. 2. Step S210 may include steps S211 to S213.
Step S211: and acquiring attribute information of each video image frame aiming at the attribute type according to the preset attribute type.
The preset attribute type may be specified and adjusted by human, and includes, but is not limited to, at least one of image attribute information such as pixels, capturing timing, and components of the video image frame. Illustratively, in step S211, if the attribute type is a shooting timing, timing information of the video image frame is acquired; and if the attribute type is a pixel, acquiring pixel information of the video image frame.
Step S212: and screening a plurality of first image frames from the plurality of video image frames according to the attribute information of each video image frame and a preset rule.
Specifically, the preset rule includes, but is not limited to, the following two implementation manners. In one implementation, an initial time and an end time are determined according to the timing information of each video image frame, and first image frames are extracted from the plurality of video image frames at a preset first time interval starting from the initial time. The first time interval may be specified manually, for example 1 second, or may be determined from the initial time and the end time, for example first time interval = (end time - initial time) ÷ S, where S is a specified number of image frames; this is not particularly limited. In another implementation, S first image frames may be randomly selected from the plurality of video image frames. By selecting a certain number of video image frames for feature extraction, the load of actual deployment can be reduced.
Step S213: and acquiring a second image frame meeting a preset matching condition with the first image frame from the plurality of video image frames, and taking the first image frame and the second image frame as an image pair.
Wherein the preset pairing conditions at least comprise: and the attribute information of the first image frame and the attribute information of the second image frame accord with a preset difference condition. The preset rules and the preset difference conditions can be related to the attribute types and are manually adjusted according to actual requirements. As can be seen, based on steps S211 to S213, an image pair meeting the dynamic difference condition is constructed according to the image attributes of the video image frames themselves and the attribute pairing relationship between different video image frames, so as to facilitate accurate analysis of the dynamic characteristics of living bodies and non-living bodies.
In one implementation, if the attribute type is a shooting timing, the preset pairing condition may be: the time interval between the first image frame and the second image frame is greater than or equal to a preset second time interval, wherein the time interval is calculated according to the time sequence information of the first image frame and the time sequence information of the second image frame, and the second time interval is specified artificially, for example, the second time interval is 0.2 second or 0.4 second, and the like, and is not particularly limited.
In another implementation, if the attribute type is a pixel, the preset pairing condition may be: and the pixel difference between the first image frame and the second image frame is greater than or equal to a specified pixel threshold value, wherein the pixel difference is calculated according to the pixel information of the first image frame and the pixel information of the second image frame.
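For illustration only (this sketch is not part of the original disclosure), the pairing logic of steps S211 to S213 for the shooting-timing attribute type might look as follows; the function name, the frame representation, and the interval values are assumptions introduced here.

```python
# Illustrative sketch of steps S211-S213 for the "shooting timing" attribute type.
# All names and the interval values below are assumptions, not part of the application.
from typing import Dict, List, Tuple

def build_image_pairs(frames: List[Dict],
                      first_interval: float = 1.0,
                      second_interval: float = 0.4) -> List[Tuple[Dict, Dict]]:
    """frames: [{'image': ..., 'timestamp': float}, ...] sorted by timestamp."""
    if not frames:
        return []
    start = frames[0]['timestamp']
    # Step S212: screen first image frames at the preset first time interval.
    first_frames, next_t = [], start
    for f in frames:
        if f['timestamp'] >= next_t:
            first_frames.append(f)
            next_t = f['timestamp'] + first_interval
    # Step S213: pair each first frame with a frame whose time gap meets
    # the preset difference condition (>= second time interval).
    pairs = []
    for f1 in first_frames:
        for f2 in frames:
            if abs(f2['timestamp'] - f1['timestamp']) >= second_interval:
                pairs.append((f1, f2))
                break
    return pairs
```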
Step S220: and extracting the spatial features of the image pair to obtain spatial feature information.
In this embodiment of the present application, the spatial feature information includes a plurality of first image channels and spatial features corresponding to the first image channels. It will be appreciated that each image channel corresponds to a different feature type (e.g., color, background, detail, etc.), so the spatial features of each image channel represent spatial feature data extracted from the image pair according to the feature type to which the image channel corresponds. In practical applications, the type and number of the first image channels may be adjusted according to actual needs.
In some optional embodiments, a second preset model may be constructed and trained in advance; the second preset model includes, but is not limited to, convolutional neural network models such as a ResNet model and a VGG model. In step S220, the first image frame or the second image frame in the image pair may be input into the second preset model to obtain the spatial feature information. Alternatively, the first image frame and the second image frame in the image pair are superimposed and then input into the second preset model.
Specifically, the second preset model may adopt a network structure of ResNet34, and a convolution module stacked with a residual structure is used to perform feature extraction and size scaling on the image frames input to the second preset model, so as to obtain a spatial feature map (i.e., spatial feature information) with a dimension of W × H × C, where W is a width of the spatial feature map, H is a height of the spatial feature map, C represents the number of first image channels, and a specific value of C is related to the number of output channels of the second preset model.
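As a non-authoritative sketch of how such a second preset model could be realized, the following uses a torchvision ResNet34 backbone; stacking the two frames of an image pair along the channel axis, the 6-channel stem, and the layer slicing are assumptions rather than details taken from the application.

```python
# Sketch of a spatial branch (second preset model) built on a ResNet34 backbone.
# The 6-channel stem for two stacked RGB frames and the layer slicing are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class SpatialBranch(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        # Accept the two frames of an image pair stacked on the channel axis (3 + 3 = 6).
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep the convolutional stages only; drop the average pooling and classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        # pair: (B, 6, 224, 224) -> spatial feature map of dimension W x H x C = 7 x 7 x 512.
        return self.features(pair)
```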
Step S230: and performing time sequence feature extraction on the image pair to obtain time sequence feature information.
In this embodiment of the present application, the time sequence feature information includes a plurality of second image channels and the time sequence features corresponding to the second image channels. In practical applications, the second image channels and the first image channels may have a one-to-one correspondence, which facilitates alignment of the spatial features and the time sequence features. In some optional embodiments, a first preset model may be pre-constructed and trained, and in step S230, the image pair is input into the first preset model to perform time sequence feature extraction, so as to obtain the time sequence feature information.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a training process of a first default model according to an embodiment of the present application. As shown in fig. 4, the training step of the first preset model at least includes, but is not limited to, steps S400 to S430:
step S400: a number of image pair samples and target annotation data for the image pair samples are obtained.
Step S410: Training a first preset model with the image pair samples to obtain training information.
Step S420: and generating optical flow characteristic information according to the training information.
The optical flow feature information represents pixel motion information between two image frames in the image pair sample, such as pixel moving direction and moving speed. Algorithms for generating optical flow feature information may include, but are not limited to, the Lucas-Kanade optical flow algorithm, the Horn-Schunck optical flow algorithm, the LK optical flow method based on pyramid hierarchy, the optical flow estimation algorithm based on deep learning, and the like.
Step S430: and adjusting parameters of the first preset model according to the optical flow characteristic information, the training information and the target marking data until the first preset model meets the training end condition.
As can be seen, based on steps S400 to S430, optical flow feature information is generated from the training information during the training of the first preset model, so that the optical flow feature information is used to perform optical flow supervision on the training process. This effectively guides the first preset model to learn the overall motion trend between the two frames of an image pair, so that the model can accurately extract the key dynamic features according to the motion trend. In addition, global features in the image frames can be extracted while some local details are also captured, which improves the accuracy of dynamic feature extraction from the image frames.
Further, in some optional embodiments, step S410 specifically includes: inputting the image pair samples into a first preset model to perform feature extraction N times to obtain training information, wherein the training information comprises first extraction information corresponding to the (N-m)-th feature extraction and second extraction information corresponding to the N-th feature extraction, N and m are positive integers, and m ∈ [1, N-1]. Correspondingly, step S420 specifically includes: generating optical flow feature information based on the first extraction information. In this way, optical flow supervision can be attached to a specific node in the time sequence feature extraction process, making the supervision more flexible and diversified.
Optionally, the first preset model includes N connected or stacked convolutional layers, and the convolution parameters of the convolutional layers gradually decrease until the last convolutional layer outputs a time sequence feature map of dimension W × H × C, i.e., the second extraction information corresponding to the N-th feature extraction. Accordingly, the first extraction information corresponding to the (N-m)-th feature extraction may be the time sequence feature map output by the (N-m)-th convolutional layer.
For example, please refer to fig. 5, which is a schematic structural diagram of a first preset model and an optical flow supervision module according to an embodiment of the present application. As shown in fig. 5, the first preset model includes 5 sequentially connected convolutional layers; exemplarily, the convolution parameter W1 × H1 of convolutional layer 1 is 112 × 112, the convolution parameter W2 × H2 of convolutional layer 2 is 56 × 56, the convolution parameter W3 × H3 of convolutional layer 3 is 28 × 28, the convolution parameter W4 × H4 of convolutional layer 4 is 14 × 14, and the convolution parameter W5 × H5 of convolutional layer 5 is 7 × 7, so that size scaling of the time sequence feature map is achieved.
It can be understood that the time sequence feature map and the spatial feature map obtained in step S220 have the same size and the same number of image channels; for example, the spatial feature map and the time sequence feature map are both of size 7 × 7, and the number of image channels is 512 for both.
Alternatively, an optical flow supervision module may be pre-constructed, the optical flow supervision module comprising at least one deconvolution layer and an optical flow prediction layer. The first extraction information is input into the optical flow supervision module, the size of the feature map is enlarged by the deconvolution layer(s), and the optical flow prediction layer outputs the optical flow feature information. For example, as shown in fig. 5, the optical flow supervision module may include 2 connected deconvolution layers and an optical flow prediction layer. If the first extraction information is the time sequence feature map (of size 14 × 14) output by convolutional layer 4 of the first preset model, this feature map is input into the optical flow supervision module, expanded to 28 × 28 by the 1st deconvolution layer and then to 56 × 56 by the 2nd deconvolution layer, and finally the optical flow prediction layer performs optical flow prediction on the 56 × 56 time sequence feature map.
Therefore, when optical flow supervision is introduced after the (N-m)-th convolutional layer of the first preset model, the optical flow supervision result can help guide the feature extraction learning of the 1st to (N-m-1)-th convolutional layers and ensure that the subsequent convolutional layers extract the key dynamic features more accurately; the smaller the value of m, the more convolutional layers are guided by the optical flow supervision, and the larger its influence.
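A minimal sketch (not part of the original disclosure) of a temporal branch with five stride-2 convolutional layers and of a deconvolution-based optical flow supervision head in the spirit of fig. 5 is given below; channel widths, the stacked 2-frame input, and all module names are assumptions, and only the 112→56→28→14→7 size schedule follows the description.

```python
# Sketch of a first preset model (temporal branch) with 5 stride-2 convolutional layers
# and of an optical flow supervision module with 2 deconvolution layers and a flow
# prediction layer, following the size schedule of fig. 5. Channel widths are assumptions.
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TemporalBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            conv_block(6, 64),     # convolutional layer 1 -> 112 x 112
            conv_block(64, 128),   # convolutional layer 2 -> 56 x 56
            conv_block(128, 256),  # convolutional layer 3 -> 28 x 28
            conv_block(256, 512),  # convolutional layer 4 -> 14 x 14 (first extraction information)
            conv_block(512, 512),  # convolutional layer 5 -> 7 x 7  (second extraction information)
        ])

    def forward(self, pair: torch.Tensor):
        feats, x = [], pair        # pair: (B, 6, 224, 224), the two frames stacked on channels
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats               # feats[3]: 14 x 14 map, feats[4]: 7 x 7 time sequence feature map

class FlowSupervision(nn.Module):
    """Enlarges the 14 x 14 map to 56 x 56 with two deconvolution layers, then predicts a 2-channel flow."""
    def __init__(self, cin: int = 512):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(cin, 256, 4, stride=2, padding=1)  # 14 -> 28
        self.deconv2 = nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1)  # 28 -> 56
        self.flow_head = nn.Conv2d(128, 2, kernel_size=3, padding=1)

    def forward(self, feat14: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.deconv1(feat14))
        x = torch.relu(self.deconv2(x))
        return self.flow_head(x)   # (B, 2, 56, 56) predicted optical flow
```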
In one implementation, the target annotation data includes time sequence feature annotation data and optical flow annotation data (such as an optical flow image) of the image pair samples; step S430 may then specifically be: calculating a first loss value according to the optical flow feature information and the optical flow annotation data, and calculating a second loss value according to the second extraction information and the time sequence feature annotation data. The loss functions used for calculating the first loss value include, but are not limited to, the L2 norm, the WARP loss, a smoothness loss function, and the like, and the loss functions used for calculating the second loss value include, but are not limited to, the L1 norm, the MSE loss, a cross-entropy loss function, and the like. Whether the first preset model meets the training end condition is then verified according to the first loss value and the second loss value. If the first preset model meets the training end condition, the training ends; if it does not, the parameters of the first preset model are adjusted according to the first loss value and the second loss value, the number of samples is increased, and the training step is re-executed. In this way, the optical flow supervision loss serves as an auxiliary term of the loss function of the first preset model, achieving a multi-scale auxiliary loss effect.
Optionally, the training end condition includes, but is not limited to, a specified loss threshold. Verifying whether the first preset model meets the training end condition according to the first loss value and the second loss value then includes: calculating a target loss value from the first loss value and the second loss value; if the target loss value is less than or equal to the loss threshold, the first preset model meets the training end condition, and if the target loss value is greater than the loss threshold, it does not. Ways to calculate the target loss value include, but are not limited to: target loss value = first weight × first loss value + second weight × second loss value, where the first weight is a weight set for the first preset model and the second weight is a weight set for the optical flow supervision. In some implementations, the second weight can be related to the value of N-m, for example directly proportional to it.
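A short sketch of the weighted target loss described above follows (not part of the original disclosure); the choice of the MSE loss for the first loss value, the L1 loss for the second loss value, and the weight values are assumptions, since the description only requires a weighted sum of the two loss values.

```python
# Sketch of the target loss: a weighted sum of the optical flow supervision loss
# (first loss value) and the time sequence feature loss (second loss value).
# The specific loss functions and weight values below are assumptions.
import torch
import torch.nn.functional as F

def target_loss(flow_pred: torch.Tensor, flow_label: torch.Tensor,
                second_extraction: torch.Tensor, ts_label: torch.Tensor,
                first_weight: float = 1.0, second_weight: float = 0.5) -> torch.Tensor:
    first_loss = F.mse_loss(flow_pred, flow_label)          # L2-type optical flow loss
    second_loss = F.l1_loss(second_extraction, ts_label)    # L1-type time sequence feature loss
    return first_weight * first_loss + second_weight * second_loss
```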
Step S240: and generating the channel attention of the first image channel and the channel attention of the second image channel according to the spatial characteristic information and the time sequence characteristic information.
In the embodiment of the present application, the channel attention represents the contribution weight of the corresponding image channel to the living body identification. Step S240 may specifically generate channel attention based on an attention mechanism, that is: and adding the spatial characteristic information and the time sequence characteristic information to obtain first characteristic information, and performing pooling operation on the first characteristic information to obtain second characteristic information to realize size scaling of the first characteristic information. And then, carrying out full connection and softmax classification operation on the second characteristic information in sequence to obtain the channel attention of each first image channel and the channel attention of each second image channel.
Optionally, an attention network may also be pre-constructed and trained. The attention network may specifically include a pooling layer, a fully connected layer, and a softmax classification layer connected in sequence; the spatial feature information and the time sequence feature information are input into the attention network to obtain the channel attention of each first image channel and each second image channel. In some implementations, the pooling layer may specifically be a global average pooling layer.
It can be seen that, because the spatial features and the time sequence features do not necessarily contribute equally to living body identification in each image channel dimension, their contribution weights can be generated based on the attention mechanism, which facilitates analysis of the features that living body identification should emphasize.
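The following sketch (not part of the original disclosure) illustrates one possible reading of step S240: add the two feature maps, apply global average pooling, then a fully connected layer and softmax. The 2C-way output layout and the per-channel softmax over the two branches are assumptions.

```python
# Sketch of the attention step S240: add the spatial and time sequence feature maps,
# apply global average pooling, then a fully connected layer and softmax to produce
# one attention value per first image channel and per second image channel.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling layer
        self.fc = nn.Linear(channels, 2 * channels)  # scores for C spatial + C temporal channels

    def forward(self, spatial_feat: torch.Tensor, temporal_feat: torch.Tensor):
        # spatial_feat, temporal_feat: (B, C, H, W) with matching shapes.
        first_info = spatial_feat + temporal_feat            # first feature information
        second_info = self.pool(first_info).flatten(1)       # (B, C) second feature information
        scores = self.fc(second_info)                        # (B, 2C)
        attn = torch.softmax(scores.view(-1, 2, spatial_feat.size(1)), dim=1)
        return attn[:, 0], attn[:, 1]                        # channel attention of first / second image channels
```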
Step S250: and determining a target image channel with the channel attention meeting a preset attention condition from the plurality of first image channels and the plurality of second image channels.
In one implementation, step S250 may specifically be: for each first image channel, taking the second image channel corresponding to that first image channel from the plurality of second image channels as a contrast channel; if the channel attention of the first image channel is greater than that of the contrast channel, taking the first image channel as a target image channel; and if the channel attention of the first image channel is smaller than that of the contrast channel, taking the contrast channel as a target image channel. It can be understood that if the number of first image channels (or second image channels) is C, the finally determined number of target image channels is also C. In this way, for each feature type, the first image channel or second image channel with the higher contribution weight is selected, so that image channels of all feature types are retained while fewer non-critical channels participate in subsequent computation.
In another implementation, step S250 may specifically be: acquiring a preset channel number K, where K is a positive integer, and determining, according to the preset channel number K, the K target image channels with the largest channel attention values from the plurality of first image channels and the plurality of second image channels. In practical applications, the preset channel number K can be adjusted according to actual needs, with K ∈ [1, P], for example K = 512. In this way, the image channels of all feature types need not be retained, and channels are kept or discarded directly according to the magnitude of their channel attention.
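Both selection strategies of step S250 can be illustrated with the following sketch (not part of the original disclosure); tensor shapes, the tie-handling in the pairwise comparison, and the helper names are assumptions.

```python
# Sketch of the two selection strategies of step S250. Shapes and names are assumptions:
# attn_* are (B, C) attention values, *_feat are (B, C, H, W) feature maps.
import torch

def select_pairwise(attn_spatial, attn_temporal, spatial_feat, temporal_feat):
    # Per channel, keep the feature of the branch with the larger channel attention.
    mask = (attn_spatial >= attn_temporal).float()[..., None, None]    # (B, C, 1, 1)
    return mask * spatial_feat + (1.0 - mask) * temporal_feat           # (B, C, H, W)

def select_topk(attn_spatial, attn_temporal, spatial_feat, temporal_feat, k=512):
    # Concatenate both branches on the channel axis and keep the K most attended channels.
    attn_all = torch.cat([attn_spatial, attn_temporal], dim=1)   # (B, 2C)
    feat_all = torch.cat([spatial_feat, temporal_feat], dim=1)   # (B, 2C, H, W)
    idx = attn_all.topk(k, dim=1).indices                        # (B, K)
    idx = idx[..., None, None].expand(-1, -1, feat_all.size(2), feat_all.size(3))
    return torch.gather(feat_all, 1, idx)                        # (B, K, H, W) target feature information
```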
Step S260: and acquiring target characteristic information corresponding to the target image channel according to the spatial characteristic information and the time sequence characteristic information.
Step S270: and performing living body detection according to the target characteristic information to obtain a living body detection result.
In the embodiment of the present application, the living body detection result is used to indicate whether or not the target object is a living body. In an alternative embodiment, the target feature information may be input into a pre-trained living body detection network to obtain a probability value that the target object in the image pair is identified as a living body or a non-living body, where the probability value ∈ [0, 1]. The living body detection network may include a global average pooling layer, a fully connected layer, and a softmax classification layer.
Further, a biopsy result may be generated based on the probability value corresponding to each image pair. In one implementation, the number of image pairs with probability values exceeding a probability threshold may be counted according to all the image pairs, and if a ratio between the number and the number of all the image pairs exceeds a preset ratio, the target object is determined to be a living body, otherwise, the target object is determined to be a non-living body.
In another implementation, a total probability value corresponding to all the image pairs may be calculated, and if the total probability value exceeds a probability threshold, the target object is determined to be a living body, and if the total probability value does not exceed the probability threshold, the target object is determined to be a non-living body. Optionally, the manner of calculating the total probability value includes, but is not limited to: and performing weighted calculation on the probability value corresponding to each image pair by using a specified weight (for example, 1/the number of all the image pairs), and then performing summation calculation on the weighted result of each image pair to obtain a total probability value.
The probability threshold and the preset ratio are both set manually, for example, the values of the probability threshold and the preset ratio are both 50% or 70%, and the like, and are not limited.
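A small sketch (not part of the original disclosure) of how the per-pair probability values might be aggregated into a final decision, covering both strategies above; the probability threshold and preset ratio values are assumptions.

```python
# Sketch of aggregating the per-pair probability values into a final living body decision.
from typing import List

def decide_by_ratio(pair_probs: List[float], prob_threshold: float = 0.5,
                    preset_ratio: float = 0.5) -> bool:
    # Count the image pairs whose probability exceeds the threshold.
    live_pairs = sum(1 for p in pair_probs if p > prob_threshold)
    return live_pairs / len(pair_probs) > preset_ratio

def decide_by_total(pair_probs: List[float], prob_threshold: float = 0.5) -> bool:
    # Weight each pair by 1 / number-of-pairs and sum, i.e. take the mean probability.
    total = sum(pair_probs) / len(pair_probs)
    return total > prob_threshold
```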
In another alternative embodiment, as shown in fig. 6, the first preset model may be taken as a first extraction branch and the second preset model as a second extraction branch, so as to construct a target model based on the first extraction branch, the second extraction branch, the attention network, and the living body detection network. The target model is then trained using the image pair samples as a training set and the labels of the image pair samples (i.e., whether the object contained in an image pair sample is a living body or a non-living body) as a verification set, and an optical flow supervision module can be added in the training stage of the target model to perform optical flow supervision learning on the first extraction branch. In practical applications, after step S210, the image pairs are directly input into the target model, so that the probability value corresponding to each image pair can be obtained.
It can thus be seen that, with this method, the image channels with smaller contributions are discarded while the more critical image channels and their corresponding feature information are retained for living body detection. The inconsistency between living and non-living bodies during dynamic movement is effectively exploited, the detection limitation of single-frame static images is overcome, important temporal and spatial feature information is retained along the image channel dimension, and the accuracy of subsequent living body detection is improved.
The embodiment of the application also provides a living body detection device. Referring to fig. 7, fig. 7 is a block diagram of a living body detecting apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the living body detecting apparatus 700 includes a first acquiring module 710, a first extracting module 720, a second extracting module 730, a generating module 740, a determining module 750, a second acquiring module 760, and a detecting module 770, wherein:
the first acquiring module 710 is configured to acquire a plurality of video image frames of the target object and generate a plurality of image pairs according to the plurality of video image frames.
The first extraction module 720 is configured to perform spatial feature extraction on the image pair to obtain spatial feature information, where the spatial feature information includes a plurality of first image channels and spatial features corresponding to the first image channels.
The second extraction module 730 is configured to perform timing characteristic extraction on the image pair to obtain timing characteristic information, where the timing characteristic information includes a plurality of second image channels and timing characteristics corresponding to the second image channels.
The generating module 740 is configured to generate a channel attention of the first image channel and a channel attention of the second image channel according to the spatial feature information and the timing feature information.
A determining module 750, configured to determine, from the plurality of first image channels and the plurality of second image channels, a target image channel whose channel attention satisfies a preset attention condition.
The second obtaining module 760 is configured to obtain target feature information corresponding to the target image channel according to the spatial feature information and the timing feature information.
And the detection module 770 is used for performing living body detection according to the target characteristic information to obtain a living body detection result.
In some optional embodiments, the second extraction module is specifically configured to input the image pair into the first preset model to perform time sequence feature extraction, so as to obtain the time sequence feature information. The living body detection apparatus further comprises a training module, configured to: acquire a certain number of image pair samples and target labeling data of the image pair samples; train a first preset model with the image pair samples to obtain training information; generate optical flow feature information according to the training information; and adjust parameters of the first preset model according to the optical flow feature information, the training information and the target labeling data until the first preset model meets the training end condition.
Further, in some optional embodiments, the training module is further configured to input the image pair samples into a first preset model for N times of feature extraction to obtain training information, wherein the training information includes first extraction information corresponding to the (N-m)-th feature extraction and second extraction information corresponding to the N-th feature extraction, N and m are positive integers, and m ∈ [1, N-1]; and generate optical flow feature information according to the first extraction information.
Still further, in some alternative embodiments, the target annotation data comprises time sequence feature annotation data and optical flow annotation data of the image pair samples. The training module is further configured to: calculate a first loss value according to the optical flow feature information and the optical flow annotation data; calculate a second loss value according to the second extraction information and the time sequence feature annotation data; verify, according to the first loss value and the second loss value, whether the first preset model meets the training end condition; end the training if the first preset model meets the training end condition; and, if the first preset model does not meet the training end condition, adjust the parameters of the first preset model according to the first loss value and the second loss value, increase the number of samples, and re-execute the training step.
In some optional embodiments, the determining module is further configured to, for each first image channel, take a second image channel corresponding to the first image channel from the plurality of second image channels as a comparison channel; if the channel attention of the first image channel is greater than the channel attention of the contrast channel, taking the first image channel as a target image channel; and if the channel attention of the first image channel is smaller than the channel attention of the contrast channel, taking the contrast channel as a target image channel.
In other optional embodiments, the determining module is further configured to obtain a preset number of channels K; and determining K target image channels with the maximum channel attention values from the plurality of first image channels and the plurality of second image channels according to a preset channel number K, wherein K is a positive integer.
In some optional embodiments, the first obtaining module is further configured to obtain attribute information of each video image frame for an attribute type according to a preset attribute type; screening a plurality of first image frames from the plurality of video image frames according to the attribute information of each video image frame and a preset rule; acquiring a second image frame meeting a preset matching condition with the first image frame from the plurality of video image frames, and taking the first image frame and the second image frame as an image pair, wherein the preset matching condition comprises the following steps: and the attribute information of the first image frame and the attribute information of the second image frame accord with a preset difference condition.
It should be noted that, for the specific implementation process of this embodiment, reference may be made to the specific implementation process of the foregoing method embodiment, and details are not described again.
An embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a program, and the program is executed by the processor to implement the above-mentioned biopsy method.
Embodiments of the present application also provide a storage medium for a computer-readable storage, the storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the above-mentioned in-vivo detection method.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not intended to limit the scope of the claims of the application accordingly. Any modifications, equivalents and improvements which may occur to those skilled in the art without departing from the scope and spirit of the present application are intended to be within the scope of the claims of the present application.

Claims (10)

1. A living body detection method, the method comprising:
acquiring a plurality of video image frames of a target object, and generating a plurality of image pairs according to the plurality of video image frames;
extracting spatial features of the image pair to obtain spatial feature information, wherein the spatial feature information comprises a plurality of first image channels and spatial features corresponding to the first image channels;
performing time sequence feature extraction on the image pair to obtain time sequence feature information, wherein the time sequence feature information comprises a plurality of second image channels and time sequence features corresponding to the second image channels;
generating channel attention of the first image channel and channel attention of the second image channel according to the spatial feature information and the time sequence feature information;
determining a target image channel with channel attention satisfying a preset attention condition from the plurality of first image channels and the plurality of second image channels;
acquiring target characteristic information corresponding to the target image channel according to the spatial characteristic information and the time sequence characteristic information;
and performing living body detection according to the target characteristic information to obtain a living body detection result.
2. The method of claim 1, wherein the performing temporal feature extraction on the image pair to obtain temporal feature information comprises:
inputting the image pair into a first preset model for temporal feature extraction to obtain the temporal feature information;
wherein the training step of the first preset model comprises:
acquiring a certain number of image pair samples and target annotation data of the image pair samples;
training the first preset model by using the image pair samples to obtain training information;
generating optical flow feature information according to the training information;
and adjusting parameters of the first preset model according to the optical flow feature information, the training information and the target annotation data until the first preset model meets a training end condition.
3. The method of claim 2, wherein training the first preset model by using the image pair samples to obtain training information comprises:
inputting the image pair samples into the first preset model for N feature extractions to obtain the training information, wherein the training information comprises first extraction information corresponding to the (N-m)-th feature extraction and second extraction information corresponding to the N-th feature extraction, N and m are positive integers, and m ∈ [1, N-1];
generating optical flow feature information according to the training information, including:
generating the optical flow feature information according to the first extraction information.
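As a sketch of claims 2 and 3, a backbone with N stacked feature-extraction blocks could expose both the (N-m)-th block's output (the first extraction information, from which optical flow features are regressed) and the N-th block's output (the second extraction information). The block count, channel width, and flow head below are illustrative assumptions, not the patented architecture.

```python
import torch.nn as nn

class TemporalBackbone(nn.Module):
    """N stacked feature-extraction blocks; returns the (N - m)-th block's
    output (first extraction information, fed to an optical-flow head) and
    the N-th block's output (second extraction information)."""
    def __init__(self, n_blocks=4, m=1, channels=32):
        super().__init__()
        blocks, c_in = [], 6  # image pair stacked along the channel axis
        for _ in range(n_blocks):
            blocks.append(nn.Sequential(
                nn.Conv2d(c_in, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU()))
            c_in = channels
        self.blocks = nn.ModuleList(blocks)
        self.m = m
        # Hypothetical optical-flow head: a 2-channel (dx, dy) flow map.
        self.flow_head = nn.Conv2d(channels, 2, 3, padding=1)

    def forward(self, image_pair):
        x, first_info = image_pair, None
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if i == len(self.blocks) - self.m:   # the (N - m)-th feature extraction
                first_info = x
        flow = self.flow_head(first_info)        # optical flow feature information
        return first_info, x, flow               # x is the N-th extraction
```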
4. The method of claim 3, wherein the target annotation data comprises temporal feature annotation data and optical flow annotation data of the image pair samples; and the adjusting parameters of the first preset model according to the optical flow feature information, the training information and the target annotation data until the first preset model meets the training end condition comprises:
calculating a first loss value according to the optical flow feature information and the optical flow annotation data;
calculating a second loss value according to the second extraction information and the temporal feature annotation data;
verifying, according to the first loss value and the second loss value, whether the first preset model meets the training end condition;
and if the first preset model meets the training end condition, ending the training; if the first preset model does not meet the training end condition, adjusting the parameters of the first preset model according to the first loss value and the second loss value, increasing the number of samples, and re-executing the training step.
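A minimal sketch of the two-loss training step of claim 4, assuming the TemporalBackbone sketch above, an L1 loss against the optical flow annotation data, a cross-entropy loss against the temporal feature annotation data (reduced here to a live/spoof label per image pair via a hypothetical pooled readout), and a loss threshold as the training end condition. The loss choices, weighting, and threshold are placeholders, since the claim does not fix them.

```python
import torch.nn.functional as F

def training_step(model, optimizer, pair_batch, flow_labels, temporal_labels,
                  flow_weight=1.0, loss_threshold=0.05):
    """One parameter update for the first preset model.
    pair_batch: (B, 6, H, W); flow_labels: (B, 2, H, W); temporal_labels: (B,) long.
    Returns both loss values and whether the training end condition is met."""
    _, second_info, pred_flow = model(pair_batch)
    # First loss value: optical flow feature information vs. optical flow annotation data.
    flow_loss = F.l1_loss(pred_flow, flow_labels)
    # Second loss value: the N-th extraction vs. temporal feature annotation data,
    # reduced to a 2-class logit by pooling and slicing (illustrative stand-in
    # for a real classification head).
    logits = second_info.mean(dim=(2, 3))[:, :2]
    cls_loss = F.cross_entropy(logits, temporal_labels)

    loss = flow_weight * flow_loss + cls_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    done = flow_loss.item() < loss_threshold and cls_loss.item() < loss_threshold
    return flow_loss.item(), cls_loss.item(), done
```

On a non-converged step, the caller would enlarge the sample set and run the training step again, mirroring the claim's "increasing the number of samples and re-executing the training step".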
5. The method according to any one of claims 1 to 4, wherein the determining, from the plurality of first image channels and the plurality of second image channels, a target image channel whose channel attention satisfies a preset attention condition comprises:
for each first image channel, taking a second image channel corresponding to the first image channel from the plurality of second image channels as a comparison channel;
if the channel attention of the first image channel is greater than the channel attention of the comparison channel, taking the first image channel as a target image channel;
and if the channel attention of the first image channel is smaller than the channel attention of the comparison channel, taking the comparison channel as a target image channel.
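A minimal sketch of the per-channel comparison in claim 5: each first image channel is compared with its corresponding second image channel, and the feature map of whichever has the larger channel attention is kept. The claim does not specify the tie case; this sketch keeps the spatial channel.

```python
import torch

def select_by_pairwise_attention(spat_feats, temp_feats, spat_attn, temp_attn):
    """spat_feats, temp_feats: (B, C, H, W); spat_attn, temp_attn: (B, C).
    Returns (B, C, H, W) where each channel is taken from whichever branch
    has the larger channel attention at that channel index."""
    keep_spatial = (spat_attn >= temp_attn).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
    return torch.where(keep_spatial, spat_feats, temp_feats)
```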
6. The method according to any one of claims 1 to 4, wherein the determining, from the plurality of first image channels and the plurality of second image channels, a target image channel whose channel attention satisfies a preset attention condition comprises:
acquiring a preset channel number K, wherein K is a positive integer;
and determining K target image channels with the maximum channel attention values from the plurality of first image channels and the plurality of second image channels according to the preset channel number K.
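The top-K variant of claim 6 can be sketched as a standalone selection over the concatenated channel sets; here gather returns only the K target channels, an alternative to the zero-masking used in the sketch after claim 1. The argument shapes are illustrative assumptions.

```python
import torch

def select_top_k_channels(spat_feats, temp_feats, spat_attn, temp_attn, k):
    """Keep the K channels with the largest channel attention across both branches."""
    feats = torch.cat([spat_feats, temp_feats], dim=1)    # (B, C1 + C2, H, W)
    attn = torch.cat([spat_attn, temp_attn], dim=1)       # (B, C1 + C2)
    top = torch.topk(attn, k, dim=1).indices              # indices of the target channels
    idx = top.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, *feats.shape[2:])
    return torch.gather(feats, 1, idx)                    # (B, K, H, W) target feature maps
```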
7. The method of any of claims 1 to 4, wherein said generating a plurality of image pairs from said plurality of video image frames comprises:
acquiring, according to a preset attribute type, attribute information of each video image frame for the attribute type;
screening a plurality of first image frames from the plurality of video image frames according to a preset rule and the attribute information of each video image frame;
and acquiring, from the plurality of video image frames, a second image frame that meets a preset pairing condition with the first image frame, and taking the first image frame and the second image frame as an image pair, wherein the preset pairing condition comprises: the attribute information of the first image frame and the attribute information of the second image frame meeting a preset difference condition.
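A minimal sketch of claim 7's pair generation, assuming mean brightness as the preset attribute type, "the n brightest frames" as the preset screening rule, and "attribute difference of at least min_diff" as the preset difference condition; all three presets are placeholders, since the claim leaves them open.

```python
import numpy as np

def build_image_pairs(frames, n_first=4, min_diff=20.0):
    """frames: list of HxWx3 uint8 arrays (video image frames).
    Attribute: mean brightness. First image frames: the n_first brightest frames.
    Pairing condition: brightness difference with the first image >= min_diff."""
    brightness = [float(np.mean(f)) for f in frames]               # attribute information
    order = sorted(range(len(frames)), key=lambda i: brightness[i], reverse=True)
    first_idxs = order[:n_first]                                   # preset screening rule
    pairs = []
    for i in first_idxs:
        # Second image frame: any other frame whose attribute differs enough.
        candidates = [j for j in range(len(frames))
                      if j != i and abs(brightness[i] - brightness[j]) >= min_diff]
        if candidates:
            j = max(candidates, key=lambda j: abs(brightness[i] - brightness[j]))
            pairs.append((frames[i], frames[j]))
    return pairs
```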
8. A living body detection apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a plurality of video image frames of a target object and generate a plurality of image pairs according to the plurality of video image frames;
a first extraction module, configured to extract spatial features of the image pair to obtain spatial feature information, wherein the spatial feature information comprises a plurality of first image channels and spatial features corresponding to the first image channels;
a second extraction module, configured to extract temporal features of the image pair to obtain temporal feature information, wherein the temporal feature information comprises a plurality of second image channels and temporal features corresponding to the second image channels;
a generating module, configured to generate a channel attention of each first image channel and a channel attention of each second image channel according to the spatial feature information and the temporal feature information;
a determining module, configured to determine, from the plurality of first image channels and the plurality of second image channels, a target image channel whose channel attention satisfies a preset attention condition;
a second acquisition module, configured to acquire target feature information corresponding to the target image channel according to the spatial feature information and the temporal feature information;
and a detection module, configured to perform living body detection according to the target feature information to obtain a living body detection result.
9. An electronic device, comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the living body detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the steps of the living body detection method according to any one of claims 1 to 7.
CN202210234381.4A 2022-03-09 2022-03-09 Living body detection method and device, electronic equipment and storage medium Pending CN114612979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210234381.4A CN114612979A (en) 2022-03-09 2022-03-09 Living body detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210234381.4A CN114612979A (en) 2022-03-09 2022-03-09 Living body detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114612979A true CN114612979A (en) 2022-06-10

Family

ID=81862692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210234381.4A Pending CN114612979A (en) 2022-03-09 2022-03-09 Living body detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114612979A (en)



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650646A (en) * 2016-12-09 2017-05-10 南京合荣欣业金融软件有限公司 Action recognition based living body face recognition method and system
CN107895155A (en) * 2017-11-29 2018-04-10 五八有限公司 A kind of face identification method and device
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN108805047A (en) * 2018-05-25 2018-11-13 北京旷视科技有限公司 A kind of biopsy method, device, electronic equipment and computer-readable medium
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN110188754A (en) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 Image partition method and device, model training method and device
CN110516571A (en) * 2019-08-16 2019-11-29 东南大学 Inter-library micro- expression recognition method and device based on light stream attention neural network
CN111178183A (en) * 2019-12-16 2020-05-19 深圳市华尊科技股份有限公司 Face detection method and related device
WO2021135254A1 (en) * 2019-12-31 2021-07-08 深圳云天励飞技术股份有限公司 License plate number recognition method and apparatus, electronic device, and storage medium
CN114038025A (en) * 2020-07-20 2022-02-11 阿里巴巴集团控股有限公司 Data processing method, service providing method, device, equipment and storage medium
CN112734696A (en) * 2020-12-24 2021-04-30 华南理工大学 Face changing video tampering detection method and system based on multi-domain feature fusion
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN113936235A (en) * 2021-09-14 2022-01-14 杭州电子科技大学 Video saliency target detection method based on quality evaluation
CN113869219A (en) * 2021-09-29 2021-12-31 平安银行股份有限公司 Face living body detection method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861891B (en) * 2022-12-16 2023-09-29 北京多维视通技术有限公司 Video target detection method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US10726244B2 (en) Method and apparatus detecting a target
CN108710847B (en) Scene recognition method and device and electronic equipment
US8792722B2 (en) Hand gesture detection
US8750573B2 (en) Hand gesture detection
WO2019109526A1 (en) Method and device for age recognition of face image, storage medium
WO2020061489A1 (en) Training neural networks for vehicle re-identification
US20180157892A1 (en) Eye detection method and apparatus
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN110569731A (en) face recognition method and device and electronic equipment
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN108229375B (en) Method and device for detecting face image
CN110738103A (en) Living body detection method, living body detection device, computer equipment and storage medium
CN116311214B (en) License plate recognition method and device
CN111914908A (en) Image recognition model training method, image recognition method and related equipment
CN110197721B (en) Tendon condition assessment method, device and storage medium based on deep learning
CN111814690A (en) Target re-identification method and device and computer readable storage medium
CN110659641B (en) Text recognition method and device and electronic equipment
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
CN114612979A (en) Living body detection method and device, electronic equipment and storage medium
CN114495241A (en) Image identification method and device, electronic equipment and storage medium
Muhammad et al. Library attendance system using yolov5 faces recognition
CN113642639A (en) Living body detection method, living body detection device, living body detection apparatus, and storage medium
JP6202938B2 (en) Image recognition apparatus and image recognition method
CN107563257B (en) Video understanding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination