CN114283463A - Image processing method, image processing device, electronic equipment and storage medium - Google Patents



Publication number: CN114283463A
Authority: CN (China)
Prior art keywords: image, information, images, features, determining
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202111396832.6A
Other languages: Chinese (zh)
Inventors: 王强昌, 王卓, 谭资昌, 郭国栋
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111396832.6A
Publication of CN114283463A


Abstract

The present disclosure provides an image processing method, an image processing apparatus, an electronic device and a storage medium, which relate to the technical field of artificial intelligence, in particular to the technical fields of deep learning and computer vision, and can be applied to scenarios such as face image processing. The method includes the following steps: acquiring multiple frame images of a video, determining a plurality of pieces of spatial correlation information respectively corresponding to the frame images, determining a plurality of pieces of temporal information corresponding to the frame images, generating a plurality of predicted depth images respectively corresponding to the frame images according to the pieces of temporal information, and performing liveness detection according to the pieces of spatial correlation information and the predicted depth images. In this way, the accuracy and effect of image-based liveness detection can be effectively improved, and the generalization of the image processing method is effectively improved, which facilitates the deployment and application of the image processing method in liveness detection scenarios.

Description

Image processing method, image processing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning and computer vision, which can be applied in face image processing scenarios, and more particularly to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technologies.
In the related art, image-based liveness detection methods are poor at distinguishing a living body from a non-living body, resulting in a poor liveness detection effect, which hinders the deployment and application of such liveness detection methods.
Disclosure of Invention
The disclosure provides an image processing method, an image processing apparatus, an electronic device, a storage medium, and a computer program product.
According to a first aspect of the present disclosure, there is provided an image processing method including: acquiring multiple frame images of a video; determining a plurality of pieces of spatial correlation information respectively corresponding to the frame images; determining a plurality of pieces of temporal information corresponding to the frame images; generating, according to the pieces of temporal information, a plurality of predicted depth images respectively corresponding to the frame images; and performing liveness detection based on the pieces of spatial correlation information and the predicted depth images.
According to a second aspect of the present disclosure, there is provided an image processing apparatus including: an acquisition module configured to acquire multiple frame images of a video; a first determining module configured to determine a plurality of pieces of spatial correlation information respectively corresponding to the frame images; a second determining module configured to determine a plurality of pieces of temporal information corresponding to the frame images; a generating module configured to generate, according to the pieces of temporal information, a plurality of predicted depth images respectively corresponding to the frame images; and a detection module configured to perform liveness detection based on the pieces of spatial correlation information and the predicted depth images.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image processing method according to the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the image processing method according to the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the image processing method according to the first aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram of differences in spatial correlation information between live and non-live images according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 4 is a schematic flow diagram of Transformer-network-based feature extraction according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a structure of an image processing model according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of dynamic window drift of a PTA module according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a TDA module according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 9 is a schematic illustration of a liveness detection score for an abnormal frame image, in accordance with an implementation of the present disclosure;
FIG. 10 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 11 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 12 shows a schematic block diagram of an example electronic device for implementing the image processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure.
It should be noted that the execution subject of the image processing method of this embodiment is an image processing apparatus. The apparatus may be implemented by software and/or hardware and may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.
The embodiment of the disclosure relates to the technical field of artificial intelligence, in particular to the technical field of computer vision, deep learning and the like, and can be applied to scenes such as face image processing and the like.
Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence.
Deep learning learns the intrinsic rules and representation levels of sample data, and the information obtained in the learning process helps to interpret data such as text, images and sounds. The ultimate goal of deep learning is to enable machines to have human-like analytical and learning abilities and to recognize data such as text, images and sounds.
Computer vision uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs image processing, so that the processed image becomes more suitable for human observation or for transmission to an instrument for detection.
Face image processing refers to processing an input face image or video stream by computer technology, extracting the face image information contained in the image, automatically detecting and tracking the face in the image, and further processing the detected face image with a series of related technologies.
When the image processing method provided by the present disclosure is applied to scenarios such as face image processing, the accuracy and effect of liveness detection in such scenarios can be effectively improved, the generalization of the face image processing method is effectively improved, and the application effect of liveness detection based on face images is effectively improved.
The face image data includes a video stream of face images. The face image information is obtained in compliance with relevant laws and regulations; for example, the data may come from a public data set or may be obtained from an authorized organization after authorization by that organization. The acquisition process of the data complies with relevant laws and regulations and does not violate public order and good customs.
It should be noted that the above face data, including the video stream of face images and the face image information, is not data acquired for a specific user and cannot reflect the personal information of a specific user.
In the embodiments of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good customs.
As shown in fig. 1, the image processing method includes:
S101: Acquiring multi-frame images of a video.
The video may include a plurality of video frames, which may be referred to as multi-frame images; these frame images constitute a segment of the video or the complete video.
In the embodiment of the present disclosure, when acquiring the multi-frame images of the video, a device with a shooting function, such as a mobile phone or a camera, may be used to capture a video stream; the captured video stream is then parsed to obtain the multi-frame images of the video, and the image processing method described in the embodiments of the present disclosure is executed for these frame images, as described in detail in the following embodiments.
It should be noted that the multi-frame images acquired in the embodiments of the present disclosure are not images acquired for a specific user and do not reflect the personal information of a specific user; the images are acquired after authorization by the related users, and the acquisition process of the images complies with relevant laws and regulations and does not violate public order and good customs.
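As an illustration of this step, the following is a minimal sketch assuming OpenCV is used to capture and parse the video stream; the function name, the file path and the frame-sampling strategy are illustrative assumptions rather than part of the disclosure.

```python
# Minimal sketch of S101, assuming OpenCV; sampling up to max_frames
# consecutive frames is an illustrative choice.
import cv2


def extract_frames(video_path: str, max_frames: int = 8):
    """Parse a video stream into a list of BGR frame images."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()   # read the next decoded frame
        if not ok:               # stop at the end of the stream
            break
        frames.append(frame)
    cap.release()
    return frames


# Usage (hypothetical path): frames = extract_frames("face_video.mp4")
```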
S102: a plurality of pieces of spatial correlation information respectively corresponding to the plurality of frame images are determined.
A frame image may have a plurality of pieces of information related to the spatial dimension, which may be referred to as spatial correlation information. The spatial correlation information may be, for example, a spatial feature (for example, depth information and position information of each pixel of the image in the spatial dimension, or spatial information of the scene object represented by the image within the scene), which is not limited herein.
In some embodiments, the plurality of pieces of spatial correlation information respectively corresponding to the frame images may be determined in combination with a convolutional neural network; for example, the frame images may be respectively input into the convolutional neural network to obtain a plurality of feature maps respectively corresponding to the frame images, and the feature maps may be used as the pieces of spatial correlation information respectively corresponding to the frame images.
In other embodiments, a feature extraction algorithm may be used to determine a plurality of spatial features respectively corresponding to the frame images, and these spatial features may be used as the pieces of spatial correlation information; alternatively, the pieces of spatial correlation information respectively corresponding to the frame images may be determined in any other possible manner, which is not limited herein.
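A minimal sketch of the convolutional-network variant described above, assuming PyTorch and torchvision; the choice of a ResNet-18 backbone and the use of its last feature map as the spatial correlation information are illustrative assumptions.

```python
import torch
import torchvision

# Truncate a ResNet-18 before its pooling/classification head so that each
# frame yields a spatial feature map (B, 512, 7, 7 for 224x224 inputs).
backbone = torch.nn.Sequential(*list(torchvision.models.resnet18().children())[:-2])
backbone.eval()

frames = torch.randn(8, 3, 224, 224)       # 8 frames of a video (placeholder data)
with torch.no_grad():
    spatial_info = backbone(frames)        # one feature map per frame
print(spatial_info.shape)                  # torch.Size([8, 512, 7, 7])
```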
S103: A plurality of pieces of temporal information corresponding to the plurality of frame images are determined.
A frame image may have a plurality of pieces of information related to the time dimension, which may be referred to as temporal information. The temporal information may be, for example, a temporal feature (for example, the frame time of each frame image within the image sequence formed by the multi-frame images, the acquisition time corresponding to each frame image, or the time difference between the acquisition time of each frame image and an annotated reference time), which is not limited herein.
In some embodiments, a recurrent neural network may be used to determine the plurality of pieces of temporal information corresponding to the frame images; for example, the frame images may be respectively input into the recurrent neural network, which performs a temporal operation on them to extract their temporal features, and the extracted temporal features may be used as the pieces of temporal information corresponding to the frame images, which is not limited herein.
In other embodiments, the pieces of temporal information corresponding to the frame images may also be determined in combination with a pre-trained temporal information extraction model; for example, the frame images may be respectively input into the pre-trained model to obtain the pieces of temporal information it outputs, or any other possible manner may be adopted, which is not limited herein.
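The recurrent-network variant can be sketched as follows, assuming PyTorch; the GRU, its hidden size and the use of per-frame feature vectors as input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# frame_features stands in for per-frame feature vectors (e.g. pooled CNN
# features); the GRU produces one temporal feature per frame.
frame_features = torch.randn(1, 8, 512)     # (batch, num_frames, feature_dim)
gru = nn.GRU(input_size=512, hidden_size=256, batch_first=True)
temporal_info, _ = gru(frame_features)      # (1, 8, 256): one temporal feature per frame
```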
S104: Generating, according to the plurality of pieces of temporal information, a plurality of predicted depth images respectively corresponding to the frame images.
After determining the plurality of pieces of temporal information corresponding to the frame images, the embodiments of the present disclosure may generate, according to the pieces of temporal information, a plurality of depth images respectively corresponding to the frame images; these depth images may be referred to as predicted depth images.
In some embodiments, depth recognition may be performed on the frame images respectively according to the pieces of temporal information, and a depth image may be formed from the recognized depths; such a depth image may be referred to as a predicted depth image.
In other embodiments, the predicted depth images may be generated in combination with a convolutional neural network; for example, the pieces of temporal information and the frame images may be used as inputs of the convolutional neural network to obtain the plurality of predicted depth images it outputs, or the step of generating the predicted depth images according to the pieces of temporal information may be executed in any other possible manner, which is not limited herein.
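A minimal sketch of turning per-frame temporal features into coarse predicted depth maps, assuming PyTorch; the decoder architecture and output resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Decode each per-frame temporal feature (256-d, as in the previous sketch)
# into a coarse 32x32 depth map; the decoder here is deliberately simple.
depth_head = nn.Sequential(
    nn.Linear(256, 32 * 32),
    nn.Sigmoid(),            # depth values normalized to [0, 1]
)

temporal_info = torch.randn(8, 256)                       # one temporal feature per frame
predicted_depth = depth_head(temporal_info).view(8, 1, 32, 32)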
S105: Performing liveness detection according to the plurality of pieces of spatial correlation information and the plurality of predicted depth images.
Liveness detection is used to distinguish a living body from a non-living body, and is a detection method for determining the real physiological characteristics of an object in certain identity verification scenarios.
In some embodiments, the pieces of spatial correlation information and the predicted depth images may be input into a pre-trained liveness detection model to obtain the liveness detection result output by the model, which is not limited herein.
In the embodiment of the present disclosure, as shown in fig. 2, which is a schematic diagram of differences in spatial correlation information between live and non-live images according to an embodiment of the present disclosure, live and non-live images differ in spatial correlation information. For example, a non-living body is often presented on a single uniform medium: a non-living body presented by printing appears on paper, one presented by video replay appears on a screen, and a three-dimensional head-cover non-living body appears on a silicone head cover. Compared with the skin of a living body, these uniform presentation media exhibit stronger spatial correlation. In addition, non-living bodies presented in two-dimensional form account for a large proportion of non-living bodies and, compared with a living body presented in three-dimensional form, carry no additional depth information.
It should be noted that the face images referred to in fig. 2 are not face images acquired for a specific user and cannot reflect the personal information of a specific user; the face images are acquired after authorization by the related users, and the acquisition process of the face images complies with relevant laws and regulations and does not violate public order and good customs.
Thus, in some embodiments, liveness detection may be performed according to the pieces of spatial correlation information and the predicted depth images as follows: the spatial correlation information of a live image and the predicted depth image corresponding to the live image are determined; the pieces of spatial correlation information and the predicted depth images corresponding to the frame images are compared with the spatial correlation information and the predicted depth image of the live image; and the corresponding liveness detection result is obtained according to whether set conditions between them are satisfied (the set conditions may be configured adaptively according to the actual liveness detection scenario, which is not limited herein).
Alternatively, liveness detection may be performed according to the pieces of spatial correlation information and the predicted depth images in any other possible manner, which is not limited herein.
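For illustration only, the following sketch shows one simplistic way of combining the two kinds of evidence; the decision rule, the threshold and the function name are assumptions and are not the detection model of the disclosure.

```python
import torch


def liveness_score(spatial_info: torch.Tensor,
                   predicted_depth: torch.Tensor,
                   depth_threshold: float = 0.1) -> bool:
    """Toy decision rule: a nearly flat (all-zero) predicted depth map across
    the frames suggests a 2-D spoof, while a clearly non-flat depth map
    suggests a live face. A real system would feed both inputs to a trained
    classifier instead of using this hand-set threshold."""
    mean_depth = predicted_depth.mean().item()
    # spatial_info could additionally be scored by a trained classifier head;
    # here it is only checked for being non-empty to keep the sketch short.
    return spatial_info.numel() > 0 and mean_depth > depth_threshold
```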
In this embodiment, the multi-frame images of a video are acquired, a plurality of pieces of spatial correlation information respectively corresponding to the frame images are determined, a plurality of pieces of temporal information corresponding to the frame images are determined, a plurality of predicted depth images respectively corresponding to the frame images are generated according to the pieces of temporal information, and liveness detection is performed according to the pieces of spatial correlation information and the predicted depth images. In this way, the accuracy and effect of liveness detection can be effectively improved, and the generalization of the image processing method is effectively improved, which facilitates the deployment and application of the image processing method in liveness detection scenarios.
Fig. 3 is a schematic diagram according to a second embodiment of the present disclosure.
As shown in fig. 3, the image processing method includes:
S301: Acquiring multi-frame images of a video.
For the description of S301, reference may be made to the above embodiments, which are not described herein again.
S302: a plurality of correlation features respectively corresponding to the plurality of frame images are determined.
Features describing a plurality of correlations respectively corresponding to the frame images may be referred to as correlation features; the correlation between images may be, for example, the similarity of the images or the vector cosine of the images, which is not limited herein.
In some embodiments, the frame images may be analyzed to determine image representation vectors respectively corresponding to the frame images, the vector cosines of these image representation vectors may then be determined and used as the correlation features respectively corresponding to the frame images; alternatively, the similarity between the frame images may be determined and used as the correlation features, or the correlation features may be determined in any other possible manner, which is not limited herein.
Alternatively, in some embodiments, determining the correlation features may include: segmenting an image to obtain a plurality of image segments, determining a plurality of pieces of encoded information respectively corresponding to the image segments, determining a plurality of correlation degrees respectively corresponding to the pieces of encoded information, and using the correlation degrees as the correlation features of the corresponding frame images, where a correlation degree describes the correlation between the corresponding encoded information and the other encoded information. Since the image is segmented and the pieces of encoded information respectively corresponding to the image segments are determined, the contextual semantic information of the image can be fully characterized; furthermore, encoding the image segments serializes them, so that when the correlation features are determined based on the encoded information, the influence of image noise on the correlation features can be avoided, and the determination effect and efficiency of the correlation features are effectively improved.
In the embodiment of the present disclosure, a frame image may be segmented to obtain a plurality of sub-images, which may be referred to as image segments.
For example, assuming that a frame image has a resolution of 224x224 pixels, the segmentation may be performed according to pixels; for example, the image may be segmented into 196 image segments of size 16x16, which is not limited herein.
In the embodiment of the present disclosure, after the plurality of image segments are obtained, they may be respectively encoded, and the encoded image segment information may be referred to as encoded information.
Among the pieces of encoded information, the encoded information whose correlation is currently to be determined may be referred to as the corresponding encoded information; accordingly, the encoded information other than the corresponding encoded information may be referred to as the other encoded information.
In some embodiments, the image segments may be input into a mapping network, encoded by the mapping network, and the pieces of encoded information respectively corresponding to the image segments are output; in this way the image segments are encoded into a sequence.
For example, after a frame image with a resolution of 224x224 is divided into 196 image segments of size 16x16, the 196 image segments may be input into the mapping network, and each segment may be mapped into a 384-dimensional vector; these vectors serve as the pieces of encoded information corresponding to the image segments, which is not limited herein.
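A minimal sketch of this patch-splitting and mapping step, assuming PyTorch; using a single linear layer to stand in for the mapping network is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Split a 224x224 frame into 196 patches of 16x16 and map each patch to a
# 384-dimensional vector, mirroring the mapping network described above.
frame = torch.randn(1, 3, 224, 224)
patches = frame.unfold(2, 16, 16).unfold(3, 16, 16)           # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * 16 * 16)

mapping_network = nn.Linear(3 * 16 * 16, 384)                 # patch-embedding layer
encoded = mapping_network(patches)                            # (1, 196, 384)
```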
The pieces of encoded information corresponding to the image segments may respectively correspond to a plurality of correlation degrees. A correlation degree may be used to describe the correlation between the corresponding encoded information and the other encoded information; the correlation may be, for example, the similarity between the pieces of encoded information, the Euclidean distance between them, or any other possible measure, which is not limited herein.
In the embodiment of the present disclosure, the correlation degrees corresponding to the pieces of encoded information may be determined by continuously computing, via the self-attention structure of a Transformer network (the self-attention structure is an important component of the Transformer network), the correlation degrees between the corresponding encoded information of different image segments, and the correlation degrees are used as the correlation features of the corresponding frame images.
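A minimal self-attention sketch of how such pairwise correlation degrees can be computed over the encoded image segments, assuming PyTorch; the projection layers and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Scaled dot-product self-attention over the encoded patches: the attention
# matrix holds the correlation degree between every pair of image segments.
encoded = torch.randn(1, 196, 384)                  # 196 patch encodings of one frame
q_proj = torch.nn.Linear(384, 384)
k_proj = torch.nn.Linear(384, 384)

q, k = q_proj(encoded), k_proj(encoded)
correlation = F.softmax(q @ k.transpose(-2, -1) / 384 ** 0.5, dim=-1)  # (1, 196, 196)
```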
S303: Performing feature extraction on the multi-frame images respectively to obtain a plurality of initial image features.
A frame image may have a plurality of image features, and the unprocessed image features among them may be referred to as initial image features.
For example, a Transformer network may be used to perform feature extraction on the frame images respectively to obtain the plurality of initial image features. As shown in fig. 4, which is a schematic flow chart of Transformer-network-based feature extraction according to an embodiment of the present disclosure, the frame images may be respectively input into the Transformer network shown in fig. 4 to obtain the initial image features output by the network.
It should be noted that the face images referred to in fig. 4 are not face images acquired for a specific user and cannot reflect the personal information of a specific user; the face images are acquired after authorization by the related users, and the acquisition process of the face images complies with relevant laws and regulations and does not violate public order and good customs.
In the embodiment of the present disclosure, a frame image may include a plurality of image segments; accordingly, the feature extraction may be performed on the image segments of the frame images. For example, the image segments corresponding to the frame images may be input into the Transformer network shown in fig. 4 to obtain the plurality of initial image features output by the network, which is not limited herein.
S304: Processing the corresponding initial image features respectively according to the plurality of correlation features to obtain the plurality of pieces of spatial correlation information.
In the embodiment of the present disclosure, after feature extraction is performed on the frame images to obtain the initial image features, the corresponding initial image features may be processed respectively according to the correlation features to obtain the pieces of spatial correlation information. The initial image features of the frame images are determined in combination with a Transformer network, and the correlation features respectively corresponding to the frame images are used to optimize the corresponding initial image features, so that the comprehensiveness and reliability of the identified initial image features can be effectively improved. The spatial correlation information obtained by processing the initial image features based on the correlation features is therefore more accurate, has a better characterization effect, and can characterize more complete video semantic information carried in the video, which guarantees the accuracy of the subsequent liveness detection.
For example, feature enhancement processing may be performed on the initial image features according to the correlation features, and the enhanced features may be used as the spatial correlation information; alternatively, a softmax function may be used to fuse the correlation features with the initial image features, and the fused features may be used as the spatial correlation information.
Alternatively, the step of processing the corresponding initial image features according to the correlation features to obtain the pieces of spatial correlation information may be performed in any other possible manner, which is not limited herein.
Optionally, in some embodiments, the initial image features may be processed according to the correlation features by performing weighted fusion of the correlation features with the corresponding initial image features to obtain the pieces of spatial correlation information.
For example, the weighted fusion may be implemented by adding the correlation features to the respectively corresponding initial image features to obtain the pieces of spatial correlation information; alternatively, the weighted fusion may be performed in any other possible manner, which is not limited herein.
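A minimal sketch of such an additive weighted fusion, assuming PyTorch; treating the fusion as a residual re-weighting of the patch features is an illustrative assumption.

```python
import torch

# Weighted fusion as an additive residual: the correlation matrix re-weights
# the initial patch features, and the result is added back to the originals.
initial_features = torch.randn(1, 196, 384)                     # initial image features
correlation = torch.softmax(torch.randn(1, 196, 196), dim=-1)   # correlation features

spatial_correlation_info = initial_features + correlation @ initial_features
```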
S305: length information corresponding to the video is determined.
Information describing the length of the video may be referred to as length information; the length information may be, for example, the number of video frames included in the video, which is not limited herein.
That is, in the embodiment of the present disclosure, the number of video frames included in the video may be determined and used as the length information corresponding to the video; alternatively, the number of video frames containing a face may be determined and used as the length information, which is not limited herein.
S306: a plurality of window sizes are generated based on the length information.
In the embodiment of the present disclosure, this step may be explained with reference to fig. 5, which is a schematic structural diagram of an image processing model according to an embodiment of the present disclosure. As shown in fig. 5, the image segments of the frame at time t-Δt, of the frame at time t, and of the frame at time t+Δt are input into a spatial Transformer network to obtain the global classification information and the local classification information of each image. The temporally correlated part of the local classification information may be enhanced by a Temporal Difference Attention (TDA) module, and a plurality of attention windows of the multi-window temporal Transformer part, i.e. the Pyramid Temporal Aggregation (PTA) module, are then used to capture multi-range temporal information. That is, on the basis of the original window of multi-head self-attention (MSA) in the PTA module, a plurality of additional windows (for example, w1-MSA and w2-MSA, where w is the window size of the MSA) are added; the average (Avg) of the temporal information captured by the multiple windows is determined and added, as features, to the multi-range temporal information; the result is then processed by Layer Normalization (LN) and a Multi-Layer Perceptron (MLP). In this way, the extraction of temporal information by the PTA module is effectively improved.
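The multi-window aggregation idea can be sketched as follows, assuming PyTorch; sharing one attention module across window sizes and the particular dimensions are simplifying assumptions, not the exact TTN structure.

```python
import torch
import torch.nn as nn

# Sketch of pyramid temporal aggregation: self-attention is applied over
# temporal windows of several sizes, the outputs are averaged across window
# sizes, added back as a residual, and passed through LayerNorm and an MLP.
# N_f must be divisible by every window size in this simplified version.

def windowed_msa(x: torch.Tensor, msa: nn.MultiheadAttention, w: int) -> torch.Tensor:
    b, n, c = x.shape                       # (batch, frames, channels)
    xw = x.reshape(b * n // w, w, c)        # group the frames into windows of size w
    out, _ = msa(xw, xw, xw)                # self-attention inside each window
    return out.reshape(b, n, c)

n_f, dim = 8, 384
x = torch.randn(2, n_f, dim)                # global classification features per frame
msa = nn.MultiheadAttention(dim, num_heads=6, batch_first=True)
norm = nn.LayerNorm(dim)
mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

window_sizes = [n_f, 4, 2]                  # pyramid of temporal window sizes
avg = torch.stack([windowed_msa(x, msa, w) for w in window_sizes]).mean(dim=0)
out = x + avg                               # feature addition of the averaged windows
out = out + mlp(norm(out))                  # LayerNorm + MLP
```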
It should be noted that the face images referred to in fig. 5 are not face images acquired for a specific user and cannot reflect the personal information of a specific user; the face images are acquired after authorization by the related users, and the acquisition process of the face images complies with relevant laws and regulations and does not violate public order and good customs.
It should be noted that the image processing model according to the embodiment of the present disclosure may be, for example, a Temporal Transformer Network (TTN) model, or the image processing model may be configured as any other possible artificial intelligence model for performing the image processing task, which is not limited herein.
In the embodiment of the present disclosure, the multiple windows of the PTA module may respectively have different sizes, which may be referred to as window sizes; a window size is the size parameter used by the corresponding attention extraction model to divide the input features.
That is, in the embodiment of the present disclosure, each window of the PTA module may have a corresponding attention extraction model, and the attention extraction models may be used to extract a plurality of pieces of attention information corresponding to the window sizes.
In the embodiment of the present disclosure, videos with different length information may have different window sizes; thus, a plurality of window sizes corresponding to the length information may be generated according to the length information of the video.
For example, assuming that the length information of the video is N_f, the window sizes of the multiple windows may be determined respectively according to the length information of the video; that is, the window size of the MSA may be determined to be N_f, the window size of w1-MSA to be 2, and the window size of w2-MSA to be an intermediate value between these two.
S307: Inputting the global classification features respectively into the plurality of attention extraction models to obtain a plurality of pieces of attention information corresponding to the window sizes and respectively output by the attention extraction models.
In the embodiment of the present disclosure, after the window sizes are generated according to the length information, the global classification features may be respectively input into the attention extraction models to obtain the pieces of attention information corresponding to the window sizes and respectively output by the attention extraction models, and the temporal information corresponding to the image segments is then extracted according to the attention information, where the temporal information describes the temporal difference between the corresponding image segment and the image segment of the adjacent frame.
In the embodiment of the present disclosure, a plurality of PTA modules of the Transformer network are connected in series, and each window of the PTA modules may perform a dynamic window shift between layers. As shown in fig. 6, which is a schematic diagram of the dynamic window shift of the PTA module according to an embodiment of the present disclosure, the shift length l of each window shift is half of the current window size, i.e. l = w/2 (where w is the window size of the MSA); in this way, information exchange between adjacent windows of the PTA modules can be facilitated.
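The window sizes and shift lengths can be derived from the video length along the following lines; the intermediate window value in this sketch is an assumption, since only the MSA and w1-MSA sizes are stated explicitly above.

```python
# Sketch of deriving PTA window sizes and inter-layer shifts from the video
# length N_f; the concrete intermediate value (n_f // 2) is an assumption.
def pta_windows(n_f: int):
    window_sizes = [n_f, max(2, n_f // 2), 2]      # MSA, w2-MSA, w1-MSA windows
    shifts = [w // 2 for w in window_sizes]        # shift length l = w / 2
    return window_sizes, shifts


sizes, shifts = pta_windows(8)    # e.g. ([8, 4, 2], [4, 2, 1]) for an 8-frame video
```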
S308: Extracting, according to the plurality of pieces of attention information, a plurality of pieces of temporal information respectively corresponding to the image segments, where the temporal information describes the temporal difference between the corresponding image segment and the adjacent-frame image segment.
Among the image segments, the image segment whose temporal information is currently to be determined may be referred to as the corresponding image segment, and an image segment adjacent to the corresponding image segment may be referred to as an adjacent-frame image segment.
The temporal difference condition may be used to describe the temporal difference between the corresponding image segment and the adjacent-frame image segment; the temporal difference may be, for example, a difference between temporal features, which is not limited herein.
In the embodiment of the present disclosure, after the pieces of attention information corresponding to the window sizes and respectively output by the attention extraction models are obtained, the temporal feature difference between the corresponding image segment and the adjacent-frame image segment may be determined, and these temporal feature differences may be used as the pieces of temporal information corresponding to the image segments, which is not limited herein.
S309: Enhancing the encoded motion features in the corresponding local classification features according to the temporal information to obtain target classification features.
In the embodiment of the present disclosure, after the pieces of temporal information respectively corresponding to the image segments are extracted according to the pieces of attention information, the encoded motion features in the corresponding local classification features may be enhanced according to the temporal information to obtain enhanced classification features, which may be referred to as target classification features.
In the embodiment of the present disclosure, a Temporal Difference Attention (TDA) module may be used to enhance the encoded motion features in the corresponding local classification features according to the temporal information. As shown in fig. 7, which is a schematic structural diagram of the TDA module according to an embodiment of the present disclosure, the input of the TDA module may be the temporal information of adjacent-frame image segments, for example the temporal information of the image segments of the frame at time t-Δt, of the frame at time t, and of the frame at time t+Δt, where the temporal information may be presented in matrix form (the matrix size may be denoted L×C). A reshape operation is then used to process the temporal information corresponding to the frames at the different times, transforming the matrices into matrices of set dimensions (for example via 1×1×1 and 3×3×3 convolutions, where K1 denotes a 1×1×1 convolution and K3 denotes a 3×3×3 convolution). Difference processing may then be performed between the reshaped temporal information of the image segments at the different times, and the resulting differences are concatenated, average-pooled and processed by a softmax function, so that channel-wise enhancement is applied to the encoded motion features in the corresponding local classification features according to the temporal information of the image segments at times t-Δt, t and t+Δt, thereby obtaining the target classification features.
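The temporal-difference idea behind the TDA module can be sketched as follows, assuming PyTorch; the shapes and the simple channel-attention formulation are illustrative assumptions rather than the exact module of fig. 7.

```python
import torch
import torch.nn.functional as F

# Sketch: features of adjacent frames are subtracted, the differences are
# concatenated, average-pooled and passed through softmax to obtain channel
# attention that re-weights the current frame's local classification features.
def temporal_difference_attention(prev, cur, nxt):
    """prev, cur, nxt: (B, C, L) temporal information of frames t-dt, t, t+dt."""
    diff = torch.cat([cur - prev, nxt - cur], dim=2)     # frame-to-frame differences
    pooled = F.adaptive_avg_pool1d(diff, 1).squeeze(-1)  # (B, C) average pooling
    attention = torch.softmax(pooled, dim=1)             # channel-wise weights
    return cur * attention.unsqueeze(-1)                 # enhanced motion features


b, c, l = 2, 384, 196
enhanced = temporal_difference_attention(torch.randn(b, c, l),
                                         torch.randn(b, c, l),
                                         torch.randn(b, c, l))
```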
S310: Predicting a plurality of predicted depth images respectively corresponding to the plurality of frame images according to the plurality of target classification features.
In the embodiment of the present disclosure, after the encoded motion features in the corresponding local classification features are enhanced according to the temporal information to obtain the target classification features, prediction processing may be performed on the frame images according to the target classification features, and the depth images obtained by the prediction processing are used as the predicted depth images.
For example, in order to generate more accurate depth maps, depth-map supervision may be used. Its basic idea is as follows: during training, the depth map of a live image is generated by a three-dimensional modeling algorithm, while the depth map of a non-live image is set to all-zero pixel values; these depth maps are used as the supervision depth maps (pseudo labels) during training, and the predicted depth maps corresponding to the frame images are predicted according to the target classification features. In this way, the network pays more attention to the depth information of the frame images during forward propagation, the purpose of distinguishing a living body from an attack by means of depth information is achieved, and the generation effect of the predicted depth maps is effectively improved.
Optionally, in some embodiments, predicting the predicted depth images according to the target classification features may include: generating a plurality of initial depth images respectively corresponding to the frame images according to the target classification features, determining a plurality of pieces of depth change information corresponding to the initial depth images, determining, according to the depth change condition, the loss value corresponding to the corresponding initial depth image, and using the initial depth image as the predicted depth image when the loss value satisfies a set condition. Since the depth change information of the frame images in the video is measured based on a loss function, the depth change information of the frame images can be grasped accurately and comprehensively, so that more accurate depth-map prediction can be achieved when the predicted depth images corresponding to the frame images are determined based on their depth change information, and the prediction effect of the depth maps is effectively improved.
An unprocessed depth image corresponding to a frame image may be referred to as an initial depth image; correspondingly, the depth image of an adjacent frame image may be referred to as an adjacent depth image.
The depth change information describes the depth change between the corresponding initial depth image and the adjacent initial depth image; the depth change may be, for example, a change of depth values, which is not limited herein.
That is to say, in the embodiment of the present disclosure, based on the characteristics of the video task, the change of the depth map can be better learned in combination with a time-series depth change loss function. For a non-live video, the depth values do not change, whereas for a live video the depth values corresponding to the frame images change as the regions of the live face move; such a change of depth values is a change of relative values over time. Therefore, the step of determining the loss value corresponding to the corresponding initial depth image according to the depth change condition can be implemented in combination with a time-series depth change loss function L_TDL, which penalizes the discrepancy between the frame-to-frame change of the predicted depth maps and that of the pseudo labels, for example:

L_TDL = (1 / (N_f - 1)) · Σ_{t=2..N_f} ‖ (D_t^pred - D_{t-1}^pred) - (D_t^pseudo - D_{t-1}^pseudo) ‖

where N_f is the number of image frames of the video, D_t^pseudo and D_{t-1}^pseudo are the pseudo labels corresponding to the t-th and (t-1)-th frame images of the video, D_t^pred and D_{t-1}^pred are the predicted depth maps corresponding to the t-th and (t-1)-th frame images of the video, and L_TDL is the time-series depth change loss function.
After the time-series depth change loss function L_TDL is determined, a binary classification loss function L_Binary and a depth map loss function L_Depth (the mean squared error between the predicted depth maps D^pred and the pseudo labels D^pseudo) may also be introduced to obtain the final loss function:

L_overall = α · L_Binary + (1 - α) · L_Depth

where α is a preset hyperparameter used to adjust the proportion of the different loss functions in the total loss, and L_overall is the final loss function.
In the embodiment of the present disclosure, after the final loss function L_overall is determined, the loss value corresponding to the corresponding initial depth image may be determined in combination with L_overall, and when the loss value satisfies a set condition (the set condition may be configured adaptively according to the business requirements of the actual image processing scenario, which is not limited herein), the initial depth image is used as the predicted depth image.
After determining the final loss function L_overall, the embodiment of the present disclosure may train the image processing model (the TTN model) in combination with L_overall until the model converges, and the subsequent image processing method may then be executed based on the trained image processing model (the TTN model), as described in detail in the following embodiments.
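A minimal sketch of these training losses, assuming PyTorch; the norm used inside the time-series depth change loss and the cross-entropy form of the binary loss are assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

# Sketch of the losses: a temporal depth change term, a binary classification
# term and a depth map (MSE) term combined with the hyperparameter alpha.
def temporal_depth_loss(pred, pseudo):
    """pred, pseudo: (N_f, H, W) predicted depth maps and pseudo-label depth maps."""
    pred_diff = pred[1:] - pred[:-1]          # frame-to-frame change of predictions
    pseudo_diff = pseudo[1:] - pseudo[:-1]    # frame-to-frame change of pseudo labels
    return (pred_diff - pseudo_diff).abs().mean()


def overall_loss(logits, labels, pred, pseudo, alpha=0.5):
    l_binary = F.cross_entropy(logits, labels)   # live / non-live classification loss
    l_depth = F.mse_loss(pred, pseudo)           # depth map loss
    return alpha * l_binary + (1.0 - alpha) * l_depth


n_f, h, w = 8, 32, 32
tdl = temporal_depth_loss(torch.rand(n_f, h, w), torch.rand(n_f, h, w))
loss = overall_loss(torch.randn(1, 2), torch.tensor([1]),
                    torch.rand(n_f, h, w), torch.rand(n_f, h, w))
```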
S311: Performing liveness detection according to the plurality of pieces of spatial correlation information and the plurality of predicted depth images.
For the description of S311, reference may be made to the foregoing embodiments, which are not described herein again.
In this embodiment, the multi-frame images of a video are acquired, a plurality of correlation features respectively corresponding to the frame images are determined, feature extraction is performed on the frame images respectively to obtain a plurality of initial image features, and the corresponding initial image features are processed respectively according to the correlation features to obtain a plurality of pieces of spatial correlation information. The spatial correlation information obtained by processing the initial image features based on the correlation features is thus more accurate, has a better characterization effect, can characterize more complete video semantic information carried in the video, and guarantees the accuracy of the subsequent liveness detection. The length information corresponding to the video is determined, a plurality of window sizes are generated according to the length information, and the global classification features are respectively input into the plurality of attention extraction models to obtain a plurality of pieces of attention information corresponding to the window sizes and respectively output by the attention extraction models; this effectively improves the robustness of the Transformer network as well as the accuracy and efficiency of temporal information extraction, which helps to improve the response efficiency of the whole liveness detection. Then, a plurality of pieces of temporal information respectively corresponding to the image segments are extracted according to the pieces of attention information, the encoded motion features in the corresponding local classification features are enhanced according to the temporal information to obtain the target classification features, a plurality of predicted depth images respectively corresponding to the frame images are predicted according to the target classification features, and liveness detection is then performed according to the pieces of spatial correlation information and the predicted depth images, thereby effectively improving the accuracy and effect of liveness detection.
Fig. 8 is a schematic diagram according to a third embodiment of the present disclosure.
As shown in fig. 8, the image processing method includes:
S801: Acquiring multi-frame images of a video.
S802: a plurality of correlation features respectively corresponding to the plurality of frame images are determined.
For the description of S801 to S802, reference may be made to the above embodiments, which are not described herein again.
S803: a classification label is determined.
Label information used for classification may be referred to as a classification label (classification token). The classification token is a network parameter of the Transformer network; accordingly, determining the classification label may include initializing the Transformer network to determine this network parameter and using it as the classification token. The classification label (classification token) may be expressed, for example, as a 384-dimensional vector, or in any other possible form, which is not limited herein.
S804: a plurality of segment classification features corresponding to a plurality of image segments of the image are extracted according to the classification labels.
The feature information used for classification may be referred to as a classification feature, and accordingly, the plurality of image segments may have corresponding classification features, which may be referred to as segment classification features, and the segment classification features may specifically be, for example, feature representation vectors corresponding to the image segments, which is not limited to this.
In some embodiments, the segment classification features corresponding to the image segments of the image are extracted according to the classification labels, where the classification feature information related to the image segments is projected to a vector space according to the classification labels to obtain feature representation vectors, and the feature representation vectors are used as the segment classification features corresponding to the image segments of the multi-frame image, or any other possible manner may be adopted to extract the segment classification features corresponding to the image segments of the image according to the classification labels, such as a manner of a multi-layer perceptron, a manner of a recurrent neural network, and the like, which is not limited thereto.
Alternatively, in some embodiments, extracting the plurality of segment classification features respectively corresponding to the plurality of image segments of the image according to the classification label may be performed by extracting a plurality of local classification features of the plurality of image segments respectively corresponding to the classification label, extracting a plurality of global classification features of the plurality of image segments respectively corresponding to the classification label, and using the plurality of local classification features and the plurality of global classification features together as the plurality of segment classification features. The global classification feature is a local classification feature corresponding to another image segment related to the corresponding image segment, and the other image segment belongs to the plurality of image segments. Since the plurality of local classification features and the plurality of global classification features respectively corresponding to the classification label are used together as the plurality of segment classification features, the segment classification features can characterize the overall image from a plurality of feature dimensions and can correspondingly characterize the differences of local classification features among different image segments, thereby effectively improving the feature characterization effect of the segment classification features.
A feature carrying local information that describes the corresponding image segment, formed by self-attention superposition after the encoding process, may be referred to as a local classification feature.
The global classification feature is a local classification feature corresponding to another image segment related to the corresponding image segment; that is, the global classification feature may be a feature usable for classification that the classification label captures across the plurality of image segments of the image.
That is to say, in the embodiment of the present disclosure, the multiple image segments may be analyzed to extract multiple local classification features corresponding to the classification tags of the multiple image segments, and extract multiple global classification features corresponding to the classification tags of the multiple image segments, and use the multiple local classification features and the multiple global classification features as the multiple segment classification features, and then, a subsequent image processing method may be performed based on the multiple segment classification features, which may be specifically referred to in the subsequent embodiments.
S805: and taking the plurality of segment classification features as initial image features of the corresponding image together.
According to the method and the device of the present disclosure, after the segment classification features respectively corresponding to the image segments of the image are extracted according to the classification label, the plurality of segment classification features can be jointly used as the initial image features of the corresponding image. Since the segment classification features are extracted according to the classification label, information usable for classification can be captured from the image segments more comprehensively, which effectively improves the characterization effect of the initial image features and effectively guarantees the smooth execution of the subsequent image processing method.
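A minimal sketch of how the local and global segment classification features described above might be obtained and jointly used as the initial image feature of a frame, assuming the segments are already embedded as vectors. The self-attention layer, the attention-weighted gathering used for the global features, and the final concatenation are illustrative assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class SegmentClassifierFeatures(nn.Module):
    """Hypothetical sketch: per-segment (local) features plus features
    gathered from the other, related segments (global), used jointly as
    the initial image feature of one frame."""

    def __init__(self, embed_dim: int = 384, num_heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, segment_embeddings: torch.Tensor) -> torch.Tensor:
        # segment_embeddings: (batch, num_segments, embed_dim)
        # Local classification features: self-attention over the segments,
        # each output position describing its own segment.
        local_feats, attn_weights = self.attn(
            segment_embeddings, segment_embeddings, segment_embeddings
        )
        # Global classification features (assumption): for each segment,
        # aggregate the local features of the related segments, weighted
        # by the attention they received.
        global_feats = torch.einsum("bij,bjd->bid", attn_weights, local_feats)
        # Use local and global features jointly as the initial image feature.
        return torch.cat([local_feats, global_feats], dim=-1)
```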
S806: and respectively processing the corresponding plurality of initial image features according to the plurality of correlation features to obtain a plurality of spatial correlation information.
S807: a plurality of pieces of timing information corresponding to the plurality of frame images are determined.
S808: and respectively generating a plurality of predicted depth images corresponding to the corresponding multi-frame images according to the plurality of time sequence information.
For the description of S806-S808, reference may be made to the above embodiments, which are not described herein again.
S809: and performing living body detection according to the plurality of spatial correlation information and the plurality of predicted depth images.
Alternatively, in some embodiments, performing the live body detection according to the plurality of pieces of spatial correlation information and the plurality of predicted depth images may be performed by determining a plurality of pieces of description information respectively corresponding to the predicted living body positions in the corresponding plurality of images, determining a plurality of pieces of depth description information respectively corresponding to the plurality of predicted depth images, determining, according to the plurality of pieces of description information and the plurality of pieces of depth description information, a plurality of live body detection scores respectively corresponding to the corresponding plurality of images, and performing the live body detection according to the plurality of live body detection scores. Since the live body detection scores are determined according to the plurality of pieces of description information and the plurality of pieces of depth description information, the accuracy of the live body detection scores can be effectively improved. In addition, performing the live body detection in combination with the live body detection scores improves the operability of the live body detection and avoids introducing other subjective detection factors into the live body detection process, thereby effectively improving the live body detection effect.
The plurality of description information may be used to describe the predicted living body position in the corresponding plurality of images, and the plurality of description information may be, for example, a logarithmic value of the predicted living body position, which is not limited thereto.
The information for describing the plurality of predicted depth images may be referred to as depth description information, and the depth description information may specifically be, for example, an average depth value of the predicted depth images, which is not limited thereto.
The living body detection score may be used to determine whether the detected image is a living body, for example, when the living body detection score approaches 0, the detected image may be determined as a non-living body image, and when the living body detection score approaches 1, the detected image may be determined as a living body image, which is not limited thereto.
In the embodiment of the present disclosure, the live body detection performed according to the plurality of pieces of spatial correlation information and the plurality of predicted depth images may be implemented by determining, in combination with a score calculation formula, a plurality of live body detection scores respectively corresponding to the corresponding plurality of images, and then determining the corresponding live body detection result according to the live body detection scores. In the score calculation formula, Score is the live body detection score, N_f is the number of image frames of the video, alpha is a preset hyper-parameter, and the formula further involves the average depth value of the predicted depth image at time t and b_l, the plurality of pieces of description information respectively corresponding to the positions of the living bodies.
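The score calculation formula itself appears in the source only as an image, so its exact form is not recoverable. The sketch below is therefore purely hypothetical: it only illustrates how the quantities named above (the per-frame average depth value, the per-frame description information of the living body position, the hyper-parameter alpha, and the frame count N_f) could be averaged into a single score that approaches 1 for live images and 0 for non-live images, and then thresholded.

```python
from typing import Sequence

def liveness_decision(
    mean_depths: Sequence[float],      # average depth value of each predicted depth image
    position_scores: Sequence[float],  # description information of the predicted living body position per frame
    alpha: float = 0.5,                # preset hyper-parameter (assumed role: blending weight)
    threshold: float = 0.5,
) -> bool:
    """Hypothetical score: average over the N_f frames of a blend of the
    depth term and the position term; near 1 -> live, near 0 -> non-live."""
    n_f = len(mean_depths)
    if n_f == 0 or n_f != len(position_scores):
        raise ValueError("need one depth value and one position score per frame")
    score = sum(
        alpha * d + (1.0 - alpha) * b
        for d, b in zip(mean_depths, position_scores)
    ) / n_f
    return score >= threshold
```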
In the embodiment of the present disclosure, when there is an abnormal frame in the multi-frame images of the detected video (for example, an all-zero frame or a noise frame, without limitation), the live detection score varies accordingly, as shown in fig. 9. Fig. 9 is a schematic diagram of the live detection scores of abnormal frame images implemented according to the present disclosure, where all-0 frames exist in (a) and (b), and noise frame images exist in (c) and (d); w/o PTA denotes the weight assigned by the PTA module to the abnormal frame images (for example, all-0 frames and noise frames), and w/ PTA denotes the weight assigned by the PTA module to the images other than the abnormal frames. The histogram shows the weight assigned by a plurality of PTA modules of the Transformer network to each frame image, and the table shows the live detection scores respectively corresponding to the multi-frame images; the detection scores in the table can then be combined to perform the living body detection.
It should be noted that the face images referred to in fig. 9 are not face images acquired for any specific user and cannot reflect the personal information of any specific user. The face images were acquired with the authorization of the related users, and the acquisition process complies with the relevant laws and regulations and does not violate public order and good customs.
In this embodiment, multi-frame images of a video are acquired, a plurality of correlation features respectively corresponding to the multi-frame images are determined, a classification label is determined, a plurality of segment classification features respectively corresponding to a plurality of image segments of the image are extracted according to the classification label, the plurality of segment classification features are jointly used as the initial image features of the corresponding image, the corresponding plurality of initial image features are processed according to the plurality of correlation features respectively to obtain a plurality of pieces of spatial correlation information, a plurality of pieces of time sequence information corresponding to the multi-frame images are determined, a plurality of predicted depth images respectively corresponding to the multi-frame images are generated according to the plurality of pieces of time sequence information, and live body detection is performed according to the plurality of pieces of spatial correlation information and the plurality of predicted depth images. Performing the live body detection in combination with the live body detection scores improves the operability of the live body detection and avoids introducing other subjective detection factors into the live body detection process, thereby effectively improving the live body detection effect.
Fig. 10 is a schematic diagram according to a fourth embodiment of the present disclosure.
As shown in fig. 10, the image processing apparatus 10 includes:
the acquiring module 101 is configured to acquire a multi-frame image of a video;
a first determining module 102, configured to determine a plurality of pieces of spatial correlation information respectively corresponding to a plurality of frames of images;
a second determining module 103, configured to determine a plurality of pieces of timing information corresponding to the plurality of frames of images;
a generating module 104, configured to generate, according to the plurality of pieces of timing information, a plurality of predicted depth images corresponding to the plurality of frames of images, respectively; and
a detection module 105 for performing a living body detection based on the plurality of spatial correlation information and the plurality of predicted depth images.
In some embodiments of the present disclosure, as shown in fig. 11 (a schematic diagram according to a fifth embodiment of the present disclosure), the image processing apparatus 11 includes: an acquisition module 111, a first determination module 112, a second determination module 113, a generation module 114 and a detection module 115, wherein the first determination module 112 includes:
a determining submodule 1121 configured to determine a plurality of correlation features respectively corresponding to the plurality of frames of images;
the extracting submodule 1122 is configured to perform feature extraction on multiple frames of images respectively to obtain multiple initial image features; and
the first processing sub-module 1123 is configured to process the corresponding plurality of initial image features according to the plurality of correlation features, respectively, to obtain a plurality of pieces of spatial correlation information.
In some embodiments of the present disclosure, the determining sub-module 1121 is specifically configured to:
carrying out segmentation processing on the image to obtain a plurality of image segments;
determining a plurality of encoding information respectively corresponding to the plurality of image segments; and
and determining a plurality of correlation degrees corresponding to the plurality of coded information respectively, and taking the plurality of correlation degrees as a plurality of correlation characteristics of the corresponding multi-frame image, wherein the correlation degrees describe the correlation between the corresponding coded information and other coded information.
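A minimal sketch of what such a determining sub-module could compute for one frame: the image is split into segments, each segment is encoded, and pairwise similarities between the encodings serve as correlation degrees. The non-overlapping patch grid and the dot-product similarity are assumptions for illustration; a learned encoder would normally replace the simple flattening used here.

```python
import torch
import torch.nn.functional as F

def correlation_features(frame: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Hypothetical sketch: frame (C, H, W) -> correlation degrees between
    the encodings of its image segments, shape (num_segments, num_segments)."""
    c, h, w = frame.shape
    # Segmentation processing: split the frame into non-overlapping patches.
    segments = (
        frame.unfold(1, patch, patch)           # (C, H/p, W, p)
             .unfold(2, patch, patch)           # (C, H/p, W/p, p, p)
             .reshape(c, -1, patch * patch)
             .permute(1, 0, 2)
             .reshape(-1, c * patch * patch)    # (num_segments, C*p*p)
    )
    # Encoding information: here simply a normalized flattened patch.
    encodings = F.normalize(segments, dim=-1)
    # Correlation degree: similarity between each encoding and the others.
    return encodings @ encodings.T
```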
In some embodiments of the present disclosure, the extraction sub-module 1122 includes:
a determining unit 11221 for determining a classification label;
an extracting unit 11222 configured to extract, according to the classification label, a plurality of segment classification features corresponding to a plurality of image segments of the image, respectively; and
a processing unit 11223, configured to use the plurality of segment classification features together as an initial image feature of the corresponding image.
In some embodiments of the present disclosure, the extracting unit 11222 is specifically configured to:
extracting a plurality of local classification features of the image segments corresponding to the classification labels respectively;
extracting a plurality of global classification features corresponding to the classification labels of the plurality of image segments respectively, wherein the global classification features are local classification features corresponding to other image segments related to the corresponding image segments, and the other image segments belong to the plurality of image segments;
and respectively taking the plurality of local classification features and the plurality of global classification features as a plurality of segment classification features.
In some embodiments of the present disclosure, the first processing sub-module 1123 is specifically configured to:
and performing weighted fusion processing on the plurality of correlation characteristics and the corresponding plurality of initial image characteristics respectively to obtain a plurality of spatial correlation information.
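A minimal sketch of the weighted fusion this sub-module performs, assuming the correlation features of a frame act as weights over that frame's initial image features; the softmax normalization and the tensor shapes are illustrative assumptions.

```python
import torch

def weighted_fusion(correlation: torch.Tensor, initial_feat: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of weighted fusion for one frame.
    correlation:  (num_segments, num_segments) correlation degrees
    initial_feat: (num_segments, embed_dim)    initial image features
    returns:      (num_segments, embed_dim)    spatial correlation information
    """
    # Turn the correlation degrees into normalized fusion weights.
    weights = torch.softmax(correlation, dim=-1)
    # Each segment's output mixes the initial features of all segments
    # according to how strongly they are correlated with it.
    return weights @ initial_feat
```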
In some embodiments of the present disclosure, the second determining module 113 is specifically configured to:
determining length information corresponding to the video;
generating a plurality of window sizes according to the length information, wherein the window sizes are different, and the window sizes are size parameters for performing feature division on the input features by the corresponding attention extraction model;
inputting the global classification features into a plurality of attention extraction models respectively to obtain a plurality of attention information which is output by the plurality of attention extraction models respectively and corresponds to the sizes of the plurality of windows;
and extracting a plurality of time sequence information respectively corresponding to the image segments according to the plurality of attention information, wherein the time sequence information describes the time sequence difference situation between the corresponding image segment and the adjacent frame image segment.
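A minimal sketch of the multi-window timing extraction described above: several attention branches, each dividing the global classification features with a different window size derived from the video length, produce attention information from which per-segment timing information is taken. The way the window sizes are generated from the length information and the windowed attention masks are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiWindowTiming(nn.Module):
    """Hypothetical sketch: several attention branches, each restricted to a
    different window size, yield attention information from which per-segment
    timing information is derived."""

    def __init__(self, embed_dim: int = 384, num_heads: int = 6, num_windows: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            for _ in range(num_windows)
        )

    @staticmethod
    def window_sizes(video_length: int, num_windows: int) -> list:
        # Assumption: distinct window sizes generated as fractions of the video length.
        return [max(1, video_length // (2 ** i)) for i in range(num_windows)]

    def forward(self, global_feats: torch.Tensor, video_length: int) -> torch.Tensor:
        # global_feats: (batch, num_frames, embed_dim) global classification features
        num_frames = global_feats.size(1)
        outputs = []
        sizes = self.window_sizes(video_length, len(self.branches))
        for attn, win in zip(self.branches, sizes):
            # Feature division by window size: allow attention only inside each window.
            mask = torch.ones(num_frames, num_frames, dtype=torch.bool,
                              device=global_feats.device)
            for start in range(0, num_frames, win):
                mask[start:start + win, start:start + win] = False
            out, _ = attn(global_feats, global_feats, global_feats, attn_mask=mask)
            outputs.append(out)
        # Attention information of all windows combined; timing information taken
        # as the difference between adjacent frame segments.
        combined = torch.stack(outputs).mean(dim=0)
        return combined[:, 1:] - combined[:, :-1]
```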
In some embodiments of the present disclosure, the generating module 114 includes:
the second processing submodule 1141 is configured to perform enhancement processing on the coded motion features in the corresponding local classification features according to the time sequence information, so as to obtain target classification features;
the prediction sub-module 1142 is configured to predict, according to the plurality of target classification features, a plurality of predicted depth images respectively corresponding to the plurality of frames of images.
In some embodiments of the disclosure, the prediction sub-module 1142 is specifically configured to:
generating a plurality of initial depth images respectively corresponding to the multi-frame images according to the plurality of target classification features;
determining a plurality of depth change information corresponding to a plurality of initial depth images, wherein the depth change information describes a depth change situation between a corresponding initial depth image and an adjacent initial depth image;
determining a loss value corresponding to the corresponding initial depth image according to the depth change condition;
and if the loss value meets the set condition, taking the initial depth image as a predicted depth image.
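A minimal sketch of the selection step this sub-module describes: an initial depth image is kept as a predicted depth image only if a loss computed from its depth change relative to the adjacent initial depth image satisfies a set condition. The mean absolute change used as the loss and the fixed threshold are assumptions for illustration.

```python
import torch

def select_predicted_depths(initial_depths: list, max_loss: float = 0.1) -> list:
    """Hypothetical sketch: initial_depths holds one depth map (H, W) per frame.
    Returns the maps whose depth-change loss w.r.t. the adjacent map
    satisfies the set condition (here: below a threshold)."""
    predicted = []
    for i, depth in enumerate(initial_depths):
        neighbor = initial_depths[i + 1] if i + 1 < len(initial_depths) else initial_depths[i - 1]
        # Depth change information: difference to the adjacent initial depth image.
        change = depth - neighbor
        # Loss value determined from the depth change condition (assumed: mean abs change).
        loss = change.abs().mean().item()
        if loss <= max_loss:  # set condition satisfied
            predicted.append(depth)
    return predicted
```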
In some embodiments of the present disclosure, the detection module is specifically configured to:
determining a plurality of pieces of description information respectively corresponding to the predicted living body positions in the respective plurality of images, based on the plurality of pieces of spatial correlation information;
determining a plurality of depth description information respectively corresponding to the plurality of predicted depth images;
determining a plurality of live body detection scores respectively corresponding to the respective plurality of images, based on the plurality of descriptive information and the plurality of depth descriptive information;
and performing the living body detection according to the plurality of living body detection scores.
It is understood that the image processing apparatus 11 in fig. 11 of the present embodiment and the image processing apparatus 10 in the foregoing embodiment, the obtaining module 111 and the obtaining module 101 in the foregoing embodiment, the first determining module 112 and the first determining module 102 in the foregoing embodiment, the second determining module 113 and the second determining module 103 in the foregoing embodiment, the generating module 114 and the generating module 104 in the foregoing embodiment, and the detecting module 115 and the detecting module 105 in the foregoing embodiment may have the same functions and structures.
It should be noted that the foregoing explanation of the image processing method is also applicable to the image processing apparatus of the present embodiment, and is not repeated herein.
In this embodiment, a plurality of spatial correlation information corresponding to a plurality of frame images are determined by acquiring the plurality of frame images of a video, a plurality of timing information corresponding to the plurality of frame images are determined, a plurality of predicted depth images corresponding to the corresponding plurality of frame images are generated according to the plurality of timing information, and in-vivo detection is performed according to the plurality of spatial correlation information and the plurality of predicted depth images. Therefore, the accuracy and the effect of the in-vivo detection can be effectively improved, and the generalization of the image processing method is effectively improved, so that the deployment and the application of the image processing method in the in-vivo detection scene are facilitated.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device for implementing the image processing method of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the apparatus 1200 includes a computing unit 1201 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1201 executes the respective methods and processes described above, such as the image processing method. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the image processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. An image processing method comprising:
acquiring a multi-frame image of a video;
determining a plurality of pieces of spatial correlation information respectively corresponding to the plurality of frames of images;
determining a plurality of time sequence information corresponding to the multi-frame images;
respectively generating a plurality of predicted depth images corresponding to the plurality of frames of images according to the plurality of time sequence information; and
and performing living body detection according to the plurality of spatial correlation information and the plurality of predicted depth images.
2. The method of claim 1, wherein the determining a plurality of pieces of spatial correlation information respectively corresponding to the plurality of frames of images comprises:
determining a plurality of correlation features respectively corresponding to the plurality of frames of images;
respectively extracting the features of the multi-frame images to obtain a plurality of initial image features; and
and respectively processing the corresponding initial image characteristics according to the correlation characteristics to obtain the spatial correlation information.
3. The method of claim 2, wherein the determining a plurality of correlation features corresponding to the plurality of frames of images, respectively, comprises:
carrying out segmentation processing on the image to obtain a plurality of image segments;
determining a plurality of encoding information respectively corresponding to the plurality of image segments; and
and determining a plurality of correlation degrees corresponding to the plurality of coded information respectively, and taking the plurality of correlation degrees as a plurality of correlation characteristics of the corresponding multi-frame image, wherein the correlation degrees describe the correlation condition between the corresponding coded information and other coded information.
4. The method according to claim 3, wherein the performing feature extraction on the plurality of frames of images respectively to obtain a plurality of initial image features comprises:
determining a classification label;
extracting a plurality of segment classification features respectively corresponding to a plurality of image segments of the image according to the classification labels; and
the plurality of segment classification features are used together as initial image features of the corresponding images.
5. The method of claim 4, wherein said extracting a plurality of segment classification features corresponding to a plurality of said image segments of said image, respectively, according to said classification label comprises:
extracting a plurality of local classification features of the plurality of image segments corresponding to the classification labels respectively;
extracting a plurality of global classification features corresponding to the classification labels respectively from the plurality of image segments, wherein the global classification features are local classification features corresponding to other image segments related to the corresponding image segments, and the other image segments belong to the plurality of image segments;
and respectively taking the plurality of local classification features and the plurality of global classification features as the plurality of segment classification features.
6. The method of claim 5, wherein said processing the respective plurality of initial image features according to the plurality of correlation features to obtain the plurality of spatially correlated information comprises:
and performing weighted fusion processing on the plurality of correlation characteristics and the plurality of corresponding initial image characteristics respectively to obtain the plurality of spatial correlation information.
7. The method of claim 5, wherein the determining a plurality of timing information corresponding to the plurality of frame images comprises:
determining length information corresponding to the video;
generating a plurality of window sizes according to the length information, wherein the window sizes are different, and the window sizes are size parameters for performing feature division on the input features by the corresponding attention extraction model;
inputting the global classification features into a plurality of attention extraction models respectively to obtain a plurality of attention information which is output by the attention extraction models and corresponds to the window sizes;
and extracting a plurality of time sequence information respectively corresponding to the plurality of image segments according to the plurality of attention information, wherein the time sequence information describes the time sequence difference situation between the corresponding image segment and the adjacent frame image segment.
8. The method of claim 7, wherein said generating a plurality of predicted depth images corresponding to respective ones of the plurality of frame images, respectively, based on the plurality of timing information comprises:
enhancing the coding motion characteristics in the corresponding local classification characteristics according to the time sequence information to obtain target classification characteristics;
and predicting a plurality of predicted depth images respectively corresponding to the plurality of frames of images according to the plurality of target classification features.
9. The method of claim 8, wherein the predicting a plurality of predicted depth images corresponding to the plurality of frames of images, respectively, according to the plurality of target classification features comprises:
generating a plurality of initial depth images respectively corresponding to the multi-frame images according to the target classification features;
determining a plurality of depth change information corresponding to the plurality of initial depth images, wherein the depth change information describes a depth change condition between the corresponding initial depth image and the adjacent initial depth image;
determining a loss value corresponding to the initial depth image according to the depth change condition;
and if the loss value meets a set condition, taking the initial depth image as the predicted depth image.
10. The method of any of claims 1-9, wherein the performing a liveness check based on the plurality of spatially correlated information and the plurality of predicted depth images comprises:
determining a plurality of pieces of description information respectively corresponding to the predicted living body positions in the respective plurality of images, based on the plurality of pieces of spatially-related information;
determining a plurality of depth description information respectively corresponding to the plurality of predicted depth images;
determining a plurality of live body detection scores respectively corresponding to the plurality of images according to the plurality of description information and the plurality of depth description information;
performing the liveness detection according to the plurality of liveness detection scores.
11. An image processing apparatus comprising:
the acquisition module is used for acquiring multi-frame images of the video;
a first determining module, configured to determine a plurality of pieces of spatial correlation information respectively corresponding to the plurality of frames of images;
a second determining module, configured to determine a plurality of timing information corresponding to the plurality of frame images;
the generating module is used for respectively generating a plurality of predicted depth images corresponding to the plurality of frames of images according to the plurality of time sequence information; and
and the detection module is used for carrying out living body detection according to the plurality of pieces of spatial correlation information and the plurality of predicted depth images.
12. The apparatus of claim 11, wherein the first determining means comprises:
a determination submodule configured to determine a plurality of correlation features respectively corresponding to the plurality of frames of images;
the extraction submodule is used for respectively extracting the features of the multi-frame images to obtain a plurality of initial image features; and
and the first processing submodule is used for respectively processing the corresponding initial image characteristics according to the correlation characteristics to obtain the spatial correlation information.
13. The apparatus according to claim 12, wherein the determination submodule is specifically configured to:
carrying out segmentation processing on the image to obtain a plurality of image segments;
determining a plurality of encoding information respectively corresponding to the plurality of image segments; and
and determining a plurality of correlation degrees corresponding to the plurality of coded information respectively, and taking the plurality of correlation degrees as a plurality of correlation characteristics of the corresponding multi-frame image, wherein the correlation degrees describe the correlation condition between the corresponding coded information and other coded information.
14. The apparatus of claim 13, wherein the extraction submodule comprises:
a determination unit for determining a classification label;
an extracting unit configured to extract a plurality of segment classification features corresponding to the plurality of image segments of the image, respectively, according to the classification label; and
and the processing unit is used for taking the plurality of segment classification features as initial image features of the corresponding images together.
15. The apparatus according to claim 14, wherein the extraction unit is specifically configured to:
extracting a plurality of local classification features of the plurality of image segments corresponding to the classification labels respectively;
extracting a plurality of global classification features corresponding to the classification labels respectively from the plurality of image segments, wherein the global classification features are local classification features corresponding to other image segments related to the corresponding image segments, and the other image segments belong to the plurality of image segments;
and respectively taking the plurality of local classification features and the plurality of global classification features as the plurality of segment classification features.
16. The apparatus of claim 15, wherein the first processing submodule is specifically configured to:
and performing weighted fusion processing on the plurality of correlation characteristics and the plurality of corresponding initial image characteristics respectively to obtain the plurality of spatial correlation information.
17. The apparatus of claim 15, wherein the second determining module is specifically configured to:
determining length information corresponding to the video;
generating a plurality of window sizes according to the length information, wherein the window sizes are different, and the window sizes are size parameters for performing feature division on the input features by the corresponding attention extraction model;
inputting the global classification features into a plurality of attention extraction models respectively to obtain a plurality of attention information which is output by the attention extraction models and corresponds to the window sizes;
and extracting a plurality of time sequence information respectively corresponding to the plurality of image segments according to the plurality of attention information, wherein the time sequence information describes the time sequence difference situation between the corresponding image segment and the adjacent frame image segment.
18. The apparatus of claim 17, wherein the generating means comprises:
the second processing submodule is used for enhancing the coding motion characteristics in the corresponding local classification characteristics according to the time sequence information to obtain target classification characteristics;
and the prediction sub-module is used for predicting a plurality of predicted depth images respectively corresponding to the multi-frame images according to the target classification characteristics.
19. The apparatus of claim 18, wherein the prediction sub-module is specifically configured to:
generating a plurality of initial depth images respectively corresponding to the multi-frame images according to the target classification features;
determining a plurality of depth change information corresponding to the plurality of initial depth images, wherein the depth change information describes a depth change condition between the corresponding initial depth image and the adjacent initial depth image;
determining a loss value corresponding to the initial depth image according to the depth change condition;
and if the loss value meets a set condition, taking the initial depth image as the predicted depth image.
20. The apparatus according to any one of claims 11-19, wherein the detection module is specifically configured to:
determining a plurality of pieces of description information respectively corresponding to the predicted living body positions in the respective plurality of images, based on the plurality of pieces of spatially-related information;
determining a plurality of depth description information respectively corresponding to the plurality of predicted depth images;
determining a plurality of live body detection scores respectively corresponding to the plurality of images according to the plurality of description information and the plurality of depth description information;
performing the liveness detection according to the plurality of liveness detection scores.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-10.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination