CN116229585A - Image living body detection method and device, storage medium and electronic equipment - Google Patents

Image living body detection method and device, storage medium and electronic equipment

Info

Publication number
CN116229585A
Authority
CN
China
Prior art keywords
feature
modal
fusion
features
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211660755.5A
Other languages
Chinese (zh)
Inventor
武文琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211660755.5A priority Critical patent/CN116229585A/en
Publication of CN116229585A publication Critical patent/CN116229585A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40 Spoof detection, e.g. liveness detection
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/58 Extraction of image or video features relating to hyperspectral data
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G06V 10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/803 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses an image living body detection method and device, a storage medium, and an electronic device. The method includes: determining object modal features corresponding to each type of object image; performing an attention operation on the object modal features to obtain first object modal features, and performing feature fusion based on the first object modal features to obtain a first object fusion feature; performing modal correlation fusion based on basic modal features and reference modal features among the object modal features to obtain a second object fusion feature; and performing screening fusion based on the first object fusion feature and the second object fusion feature to obtain a third object fusion feature, so that image living body detection of the target object is performed based on the third object fusion feature.

Description

Image living body detection method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image living body detection method, an image living body detection device, a storage medium, and an electronic device.
Background
With the continuous development of object recognition systems in recent years, image living body detection has become an indispensable link in such systems. Image living body detection must verify whether the image to be detected was captured from a real, live subject, and must effectively resist common attack means such as printed photos, face swapping, masks, occlusion and screen re-shooting, so as to intercept non-living attack images such as mobile phone attacks, paper attacks and head-model attacks.
Disclosure of Invention
The specification provides an image living body detection method and device, a storage medium, and an electronic device. The technical solution is as follows:
in a first aspect, the present specification provides an image live detection method, the method comprising:
acquiring at least two types of object images aiming at a target object, and determining object modal characteristics corresponding to the object images of various types;
performing feature attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature, and performing feature fusion on the basis of each first object modal feature to obtain a first object fusion feature;
acquiring at least one basic modal feature and at least one reference modal feature in the object modal features, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature;
and screening and fusing the first object fusion characteristic and the second object fusion characteristic to obtain a third object fusion characteristic, and performing image living body detection processing on the target object based on the third object fusion characteristic.
In a second aspect, the present specification provides an image living body detection apparatus, the apparatus comprising:
The image acquisition module is used for acquiring at least two types of object images aiming at a target object and determining object modal characteristics corresponding to the object images of various types;
the operation processing module is used for carrying out characteristic attention operation processing on the object modal characteristics to obtain first object modal characteristics corresponding to the object modal characteristics, and carrying out characteristic fusion on the basis of the first object modal characteristics to obtain first object fusion characteristics;
the operation processing module is used for acquiring at least one basic modal feature and at least one reference modal feature in the modal features of each object, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature;
and the living body detection module is used for carrying out screening fusion processing on the basis of the first object fusion characteristics and the second object fusion characteristics to obtain third object fusion characteristics, and carrying out image living body detection processing on the target object on the basis of the third object fusion characteristics.
In a third aspect, the present description provides a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of one or more embodiments of the present description.
In a fourth aspect, the present description provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of one or more embodiments of the present description.
In a fifth aspect, the present description provides a computer program product storing at least one instruction adapted to be loaded by a processor and to perform the method steps of one or more embodiments of the present description.
The technical scheme provided by some embodiments of the present specification has the following beneficial effects:
in one or more embodiments of the present specification, the electronic device determines object modal features corresponding to object images of different image modalities of a target object and performs attention operation processing on the object modal features to obtain first object modal features. It then performs feature fusion based on each first object modal feature to obtain a first object fusion feature, performs modal correlation fusion based on basic modal features and reference modal features among the object modal features to obtain a second object fusion feature, and finally performs screening fusion based on the first object fusion feature and the second object fusion feature to obtain a third object fusion feature of high quality and deep granularity, on which image living body detection of the target object is performed.
Drawings
In order to describe the technical solutions in the present specification or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic view of an image live detection system provided herein;
FIG. 2 is a schematic flow chart of an image live detection method provided in the present specification;
FIG. 3 is a flow chart of another image live detection method provided in the present specification;
FIG. 4 is a schematic diagram of a first object fusion feature generation related to the image live detection method provided in the present specification;
FIG. 5 is a schematic flow chart of a correlation fusion method of the image living body detection method provided in the present specification;
FIG. 6 is a schematic diagram of one type of cross-attention related process provided herein;
FIG. 7 is a schematic illustration of an image live detection provided herein;
fig. 8 is a schematic structural view of an image living body detection apparatus provided in the present specification;
FIG. 9 is a schematic diagram of an operation processing module provided in the present specification;
fig. 10 is a schematic structural view of an electronic device provided in the present specification;
FIG. 11 is a schematic diagram of the architecture of the operating system and user space provided herein;
FIG. 12 is an architecture diagram of the android operating system of FIG. 11;
FIG. 13 is an architecture diagram of the IOS operating system of FIG. 11.
Detailed Description
The embodiments are described below clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without inventive effort fall within the scope of protection of the present disclosure.
In the description of the present application, it should be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It should also be understood that, unless otherwise specifically defined, the terms "comprise" and "have", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article or apparatus that comprises a list of steps or elements is not limited to those listed steps or elements, but may include other steps or elements that are not listed or are inherent to such a process, method, article or apparatus. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art in the specific context. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B both exist, or that B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
In the related art, image living body detection needs to effectively intercept non-living attack image data, including mobile phone attacks, paper attacks, head-model attacks and the like. In practice, image living body detection often faces large environmental differences, so the living body detection results of related-art approaches are inaccurate in actual applications and their generalization is poor. Image living body detection in the related art therefore has certain limitations, one or more of which can be alleviated or even resolved by executing the image living body detection method in one or more embodiments of this specification.
The present application is described in detail with reference to specific examples.
Please refer to fig. 1, which is a schematic diagram of a scene of an image living body detection system provided in the present specification. As shown in fig. 1, the image biopsy system may include at least a client cluster and a service platform 100.
The client cluster may include at least one client, as shown in fig. 1, specifically including a client 1 corresponding to a user 1, a client 2 corresponding to a user 2, …, and a client n corresponding to a user n, where n is an integer greater than 0.
Each client in the client cluster may be a communication-enabled electronic device including, but not limited to: wearable devices, handheld devices, personal computers, tablet computers, vehicle-mounted devices, smart phones, computing devices, or other processing devices connected to a wireless modem, etc. Electronic devices in different networks may be called different names, for example: a user equipment, an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent or user equipment, a cellular telephone, a cordless telephone, a personal digital assistant (personal digital assistant, PDA), an electronic device in a 5G network or future evolution network, and the like.
The service platform 100 may be a separate server device, such as a rack-mounted, blade, tower or cabinet server, or a workstation, mainframe or other hardware device with strong computing capability; it may also be a server cluster composed of multiple servers. The servers in such a cluster may be arranged symmetrically, with each server being functionally equivalent in the transaction link and able to provide services independently, where providing services independently means that no assistance from another server is needed.
In one or more embodiments of the present disclosure, the service platform 100 may establish a communication connection with at least one client in the client cluster, and perform interaction of data in an image living detection process based on the communication connection, such as online transaction data interaction, such as data interaction of at least two types of object images, and illustratively, the client may collect at least two types of object images of a target object and send the images to the service platform 100, and the service platform 100 performs the image living detection method related to the present disclosure to obtain a living detection category and feed the living detection category back to the client; as another example, the service platform 100 may issue a relevant image biopsy model for image biopsy to a plurality of clients to instruct the clients to execute the image biopsy method related in the present specification to perform image biopsy, so as to obtain a biopsy class; as another example, the service platform 100 may obtain training sample data, such as a biopsy sample image, from a client for relevant image biopsy model training.
Further, the execution subject may perform the method by controlling a relevant image living body detection model, that is: determining object modal features corresponding to each type of object image; performing feature attention operation processing on each object modal feature to obtain first object modal features corresponding to each object modal feature, and performing feature fusion based on each first object modal feature to obtain a first object fusion feature; acquiring at least one basic modal feature and at least one reference modal feature among the object modal features, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature; and performing screening fusion processing based on the first object fusion feature and the second object fusion feature to obtain a third object fusion feature, and performing image living body detection processing on the target object based on the third object fusion feature.
It should be noted that, the service platform 100 establishes a communication connection with at least one client in the client cluster through a network for interactive communication, where the network may be a wireless network, or may be a wired network, where the wireless network includes, but is not limited to, a cellular network, a wireless local area network, an infrared network, or a bluetooth network, and the wired network includes, but is not limited to, an ethernet network, a universal serial bus (universal serial bus, USB), or a controller area network. In one or more embodiments of the specification, techniques and/or formats including HyperText Mark-up Language (HTML), extensible markup Language (Extensible Markup Language, XML), and the like are used to represent data exchanged over a network (e.g., target compression packages). All or some of the links may also be encrypted using conventional encryption techniques such as secure socket layer (Secure Socket Layer, SSL), transport layer security (Transport Layer Security, TLS), virtual private network (Virtual Private Network, VPN), internet protocol security (Internet Protocol Security, IPsec), and the like. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
The embodiments of the image living body detection system provided in the present specification and the image living body detection method in one or more embodiments belong to the same concept, and an execution subject corresponding to the image living body detection method related in one or more embodiments in the present specification may be the service platform 100 described above; the execution subject corresponding to the image living body detection method related to one or more embodiments in the specification may also be a client, and specifically determined based on an actual application environment. The implementation process of the embodiment of the image living body detection system may be described in detail in the following method embodiments, which are not described herein.
Based on the scenario shown in fig. 1, the image living body detection method provided in one or more embodiments of the present disclosure is described in detail below.
Referring to fig. 2, a flow diagram of an image living body detection method is provided for one or more embodiments of the present description. The method may be implemented by a computer program and may run on an image living body detection apparatus based on the von Neumann architecture. The computer program may be integrated into an application or may run as a standalone tool-class application. The image living body detection apparatus may be an electronic device.
Specifically, the image living body detection method comprises the following steps:
s102: acquiring at least two types of object images aiming at a target object, and determining object modal characteristics corresponding to the object images of various types;
the object image can be understood as image data acquired for a target object (such as a user, an animal, etc.) in an image living body detection scene;
the image mode types of the object images in the at least two classes of object images are different;
further, the image living body detection scene may be any scene in practical applications that requires image living body detection of an object, such as finance, insurance or security. For example, when a user performs user object verification, an object image usually needs to be uploaded on an application program or a website to verify whether the operation is performed by the user in person;
in practical applications, object images can have different image modality types; an object image may carry object information in one or a combination of modality types such as video, color picture (RGB), short video, animation, depth image (depth) and infrared image (IR).
In one or more embodiments of the present specification, the image modality type of the subject image for image live detection is at least two.
Schematically, with the popularization and rapid development of electronic devices, an electronic device can support the acquisition of object images in multiple image modalities; for example, a multi-modal acquisition component of the execution subject can respectively acquire a color-modality object image, an infrared-modality object image and a depth-modality object image;
the object modal features are obtained after image feature extraction is performed on the object image of the corresponding image modality, and may include a combination of one or more modality features based on the different image modalities, such as color features, shape features, depth features, texture features and spatial relationship features;
it can be appreciated that the extracted object modality features are different for object images of different image modalities;
in one or more embodiments of the present disclosure, a feature conversion network may be used to convert each object image into a corresponding object modal feature. The object modal feature maps the object image into a high-dimensional vector space to characterize the image features in the form of a modal feature map, and is denoted by the character F in feature engineering;
for example, for an infrared-modality object image of the infrared image (IR) type, its corresponding infrared object modal feature F_IR may be determined;
for example, for a depth-modality object image of the depth image (depth) type, its corresponding depth object modal feature F_Depth may be determined.
Optionally, the feature conversion network may be implemented using one or more of the classical feature processing networks, such as a VGG feature processing network, an AlexNet feature processing network, an OverFeat feature processing network or a ResNet feature processing network. For example, in some embodiments, feature processing is performed on the input image based on an open-source ResNet feature processing network to obtain the object modal features;
in one or more embodiments of the present disclosure, image living body detection is performed using at least two types of object images with different image modalities, so the imaging characteristics of each image modality can be fully exploited. Through the subsequent processing, the separability brought by extracting and fusing multiple modalities is put to use, which significantly improves indicators such as the fine granularity and image quality of the modal image features and facilitates the classification and identification performed by image living body detection.
Optionally, determining the object modal features corresponding to each type of object image may be implemented based on a pre-trained image living body detection model: after the at least two types of object images for the target object are acquired, they are input into the image living body detection model, and the image living body detection method described in this specification is executed by that model.
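As a rough illustration of the feature conversion step above, the following PyTorch sketch shows one way such a feature conversion network could be organized; the layer sizes, channel counts and class names are assumptions made for illustration and are not taken from the patent, which only mentions VGG/AlexNet/OverFeat/ResNet-style backbones.

```python
import torch
import torch.nn as nn

class ModalFeatureExtractor(nn.Module):
    """Maps one object image (e.g. RGB, IR or depth) to an object modal feature map F
    of shape (B, C, H, W). A minimal stand-in for the ResNet-style feature conversion
    network mentioned above; layer sizes are illustrative."""
    def __init__(self, in_channels: int, feat_channels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(feat_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)

# One extractor per image modality; RGB has 3 input channels, IR and depth 1 each.
extract_rgb = ModalFeatureExtractor(in_channels=3)
extract_ir = ModalFeatureExtractor(in_channels=1)
extract_depth = ModalFeatureExtractor(in_channels=1)
```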
S104: performing feature attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature, and performing feature fusion on the basis of each first object modal feature to obtain a first object fusion feature;
the feature attention operation processing performs adaptive attention focusing on the object modal features of the corresponding image modality, so as to obtain first object modal features that better fit the characteristics of that modality. Compared with the original object modal features before processing, the first object modal features have richer modal characteristics, the features on the corresponding feature-map channels interact fully, the fine granularity of the modal features is deeper, and the feature quality is higher. Further, the object modal features of different object modalities each obtain a corresponding first object modal feature through the feature attention operation processing;
illustratively, the feature attention processing may be implemented based on a combination of one or more of a channel attention operation, a spatial attention operation, a squeeze-excitation attention operation, and the like;
alternatively, the channel attention operation processing may be performed on each of the object modal features to obtain a first object modal feature after the object modal feature is improved, the spatial attention operation processing may be performed on each of the object modal features to obtain a first object modal feature after the object modal feature is improved, and so on.
It can be understood that, assuming there are i object modal features, the i first object modal features reflect the modal characteristics of the same target object after adaptive attention focusing. After the plurality of first object modal features is obtained, feature fusion is performed on these first object modal features of different image modality dimensions to obtain the first object fusion feature;
the feature fusion processing mode may be implemented by a feature stitching operation, by a feature sampling operation, by a feature fitting operation, or the like.
In some embodiments, feature fusion of the plurality of first object mode features of different image mode dimensions may be implemented based on a feature fusion network, and feature fusion of the plurality of first object mode features may be implemented based on the feature fusion network, so as to obtain a first object fusion feature;
alternatively, the feature fusion network may be part of a trained image live detection model;
s106: acquiring at least one basic modal feature and at least one reference modal feature in the object modal features, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature;
The basic modal feature can be understood as a basic image feature (a main image feature) that serves as the modal-feature reference during fusion; in the subsequent fusion, the reference modal features assist the feature fusion with the basic modal feature as the reference;
the reference modal feature can be understood as an auxiliary reference image feature (an auxiliary image feature) determined relative to the basic modal feature;
in a possible implementation manner, the acquiring at least one basic modal feature and at least one reference modal feature in the object modal features may be: selecting at least one basic modal feature and at least one reference modal feature from the modal features of each object;
schematically, priorities can be set for the different image modalities; a target number of object modal features is selected as the basic modal features according to the priority ranking of the image modalities corresponding to the object modal features, and the features other than the basic modal features are used as reference modal features.
Illustratively, a target image modality expected to perform better can be determined by combining device environment parameters (such as brightness parameters, environment position parameters and electromagnetic interference parameters); the object modal feature corresponding to the target image modality is taken as the basic modal feature, and the features other than the basic modal feature are taken as reference modal features.
This can be done by presetting a basic-modality mapping relation between a number of reference device environment parameters and corresponding target image modalities, and then querying, based on that mapping relation, the target image modality corresponding to the device environment parameters acquired at image-capture time.
In a possible implementation manner, the acquiring at least one basic modal feature and at least one reference modal feature in the object modal features may be: and acquiring at least one preset basic modal characteristic and at least one reference modal characteristic from the object modal characteristics.
Schematically, that is, a preset mode is adopted: the object modal feature corresponding to the designated basic image modality is taken as the basic modal feature, and the features other than the basic modal feature are taken as reference modal features;
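The two selection schemes above could be combined in a small helper like the following sketch; the modality names, the priority ordering and the single-base default are illustrative assumptions, since the specification only states that a priority-based selection or a preset designation of the basic modality is possible.

```python
# Hypothetical helper combining the two selection schemes described above (S106).
MODALITY_PRIORITY = {"rgb": 0, "ir": 1, "depth": 2}   # lower value = higher priority (assumed)

def split_base_and_reference(modal_features: dict, num_base: int = 1, preset_base=None):
    """modal_features maps a modality name to its object modal feature tensor."""
    if preset_base is not None:
        # preset scheme: the designated modality is the base, the rest are references
        base = {preset_base: modal_features[preset_base]}
        reference = {m: f for m, f in modal_features.items() if m != preset_base}
        return base, reference
    # priority scheme: the num_base highest-priority modalities become base features
    ordered = sorted(modal_features, key=lambda m: MODALITY_PRIORITY.get(m, 99))
    base = {m: modal_features[m] for m in ordered[:num_base]}
    reference = {m: modal_features[m] for m in ordered[num_base:]}
    return base, reference
```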
specifically, performing modal correlation fusion processing based on the basic modal characteristics and the reference modal characteristics to obtain a second object fusion characteristic corresponding to each object modal characteristic;
the modal correlation fusion processing takes the basic modal feature as a reference, measures the degree of correlation between the channel features in the reference modal features and the basic modal feature, and fuses the attention-region features with an attention operation based on that degree of correlation, so as to obtain the second object fusion feature;
S108: and screening and fusing the first object fusion characteristic and the second object fusion characteristic to obtain a third object fusion characteristic, and performing image living body detection processing on the target object based on the third object fusion characteristic.
Illustratively, a feature importance screening network can be constructed and trained in advance based on a machine learning model, and importance feature screening and fusion are carried out on the first object fusion feature and the second object fusion feature through the feature importance screening network to obtain a final fusion feature, namely a third object fusion feature;
optionally, the importance feature screening may be feature enhancement and fusion performed by the feature importance screening network using at least one or more of convolution operation, pooling operation, normalization operation, and the like.
It can be understood that the third object fusion feature fuses the image characteristics of a plurality of image modes, and the feature quality of the third object fusion feature is higher, the feature granularity is better, so that the third object fusion feature for image living body detection classification has higher separability, thereby being convenient for distinguishing living body categories from attack categories, and realizing better living body attack detection effect.
Further, performing image living body detection processing through the third object fusion characteristic, and outputting a living body detection category aiming at the target object;
wherein the living body detection category includes one of a living body image category and an attack image category;
optionally, a living body detection classifier can be constructed and trained in advance based on a machine learning model, image living body detection processing is carried out by the living body detection classifier based on the third object fusion characteristic, and the living body detection category aiming at the target object is output;
in one or more embodiments of the present specification, the electronic device determines object modal features corresponding to object images of different image modalities of a target object and performs attention operation processing on the object modal features to obtain first object modal features. It then performs feature fusion based on each first object modal feature to obtain a first object fusion feature, performs modal correlation fusion based on basic modal features and reference modal features among the object modal features to obtain a second object fusion feature, and finally performs screening fusion based on the first object fusion feature and the second object fusion feature to obtain a third object fusion feature of high quality and deep granularity, on which image living body detection of the target object is performed.
Referring to fig. 3, fig. 3 is a flow chart illustrating another image living body detection method according to one or more embodiments of the present disclosure. Specifically:
s202: acquiring at least two types of object images aiming at a target object, and determining object modal characteristics corresponding to the object images of various types;
reference may be made specifically to the method steps of other embodiments of the present disclosure, and details are not repeated here.
S204: and carrying out channel attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature, carrying out feature stitching processing on each first object modal feature to obtain stitching processing features, and carrying out convolution aggregation processing on the stitching processing features to obtain a first object fusion feature.
The first object modal feature is a modal feature obtained after channel attention operation processing is performed on the basis of a plurality of object modal features.
It can be appreciated that a channel attention mechanism may be applied with each object modal feature as the processing object to obtain the corresponding first object modal feature; the channel attention mechanism typically focuses on the importance of the feature-map channels corresponding to each object modal feature and assigns feature weights based on that importance for the subsequent feature fusion.
In a possible implementation manner, the performing a channel attention operation on each object modal feature to obtain a first object modal feature corresponding to each object modal feature may include the following schemes:
the electronic equipment can determine the channel importance degree of each object modal feature in at least one image feature channel dimension from the image feature channel dimension through the extrusion excitation processing network, so as to perform attention operation processing on the object modal feature in the image feature channel dimension based on the channel importance degree, and obtain a first object modal feature corresponding to each object modal feature.
Illustratively, the channel attention operation may be implemented based on a squeeze-excitation processing network. Such processing is generally divided into two parts, a squeeze part and an excitation part: the squeeze part compresses the global spatial information of the feature map corresponding to an object modal feature, feature learning is then performed along the feature-map channel dimension to form the importance of each channel, that is, the channel importance degree, and the excitation part finally assigns a different weight to each feature-map channel to generate the first object modal feature.
Specifically, after determining a plurality of first object modal features, performing feature stitching processing on each first object modal feature by adopting feature stitching operation to obtain stitching processing features, and performing convolution aggregation processing on the stitching processing features to obtain first object fusion features;
schematically, as shown in fig. 4, which is a schematic diagram of the generation of the first object fusion feature: assume the feature map corresponding to each of the n object modal features (f1, f2, ..., fn) has dimensions H×W×C, where H is the feature map height, W the width and C the number of channels. The squeeze part compresses the global spatial information of the feature map corresponding to an object modal feature, for example with a pooling operation, from H×W×C down to 1×1×C (often characterized as a 1×1×C weight tensor); the channel importance degree of each image-feature-channel dimension is then predicted through the fully connected layers of the excitation part, and the excitation result is applied back to the image feature channels of the original H×W×C feature map in a channel fusion operation, thereby obtaining the first object modal feature. It can be appreciated that the plurality of object modal features yields a plurality of first object modal features (F1, F2, ..., Fn) through the foregoing steps; feature stitching is then performed on the first object modal features (F1, F2, ..., Fn) to obtain the stitching processing feature, and convolution aggregation processing (such as the conv processing in fig. 4) is performed on the stitching processing feature to obtain the first object fusion feature (which can be expressed as F_x1);
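A minimal PyTorch sketch of this squeeze-excitation plus stitch-and-aggregate path is given below; the reduction ratio, the 1×1 aggregation convolution and the module names are assumptions made for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation style channel attention: global pooling squeezes each
    H×W×C feature map to 1×1×C, two FC layers predict per-channel importance, and
    the resulting weights re-scale the original channels (F -> first object modal feature)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel weights in (0, 1)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        w = self.fc(self.pool(f).view(b, c)).view(b, c, 1, 1)
        return f * w                                 # re-weighted first object modal feature

class FirstFusion(nn.Module):
    """Concatenates the first object modal features F1..Fn along the channel axis and
    aggregates them with a 1x1 convolution to obtain the first object fusion feature F_x1."""
    def __init__(self, channels: int, num_modalities: int):
        super().__init__()
        self.se_blocks = nn.ModuleList([SqueezeExcite(channels) for _ in range(num_modalities)])
        self.conv = nn.Conv2d(channels * num_modalities, channels, kernel_size=1)

    def forward(self, modal_feats):
        first_feats = [se(f) for se, f in zip(self.se_blocks, modal_feats)]
        return self.conv(torch.cat(first_feats, dim=1))
```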
S206: acquiring at least one basic modal feature and at least one reference modal feature in the object modal features;
in one possible embodiment, at least one base modality feature and at least one reference modality feature may be selected from among the object modality features;
in one possible embodiment, the at least one preset base modality characteristic and the at least one reference modality characteristic may be obtained from each object modality characteristic.
S208: performing mode correlation fusion processing based on the basic mode characteristics and the reference mode characteristics to obtain second object fusion characteristics;
specifically, the mode correlation information of the basic mode characteristics and the reference mode characteristics can be determined from the mode correlation dimension through a cross attention processing network;
schematically, the mode correlation information is correlation information which takes the basic mode characteristic as a reference and measures the correlation degree of the channel characteristic and the basic mode characteristic in the reference mode characteristic.
Specifically, the basic modal feature and the reference modal feature may be subjected to feature fusion based on the modal correlation information to obtain a second object fusion feature, where the object modal feature includes at least one basic modal feature and at least one reference modal feature.
Schematically, the cross attention processing network may be a trained neural network module constructed in advance based on a machine learning model. It takes the most separable basic modal feature as the base modality, performs a cross attention operation and feature fusion with each of the other reference modal features, and obtains the modal correlation information by computing the degree of correlation between the basic modal feature and the reference modal feature through a cross attention mechanism (CA). In some embodiments, the modal correlation information is characterized in the form of a relation mapping vector; after normalization based on the modal correlation information, the basic modal feature and the reference modal features can be feature-fused to obtain the second object fusion feature.
As shown in fig. 5, fig. 5 is a schematic flow chart of a correlation fusion, and the feature fusion of the basic modal feature and the reference modal feature based on the modal correlation information to obtain a second object fusion feature may be:
a2: determining mode correlation information of basic mode characteristics and reference mode characteristics from mode correlation dimensions through a cross attention processing network, and performing point multiplication processing on the basic mode characteristics and the reference mode characteristics respectively based on the mode correlation information to obtain at least one cross attention mode characteristic;
It can be appreciated that the modality relevance information is characterized in terms of an interaction relationship mapping vector;
as shown in fig. 6, which is a schematic diagram of the cross attention process: in fig. 6 the basic modal feature may be represented as Fa, and the reference modal features, which are typically plural, may be represented as F_b1, F_b2, ..., F_bn, where n is a positive integer. The processing objects of the cross attention processing network are the basic modal feature Fa and the reference modal features F_b1, F_b2, ..., F_bn. The modal correlation information between the basic modal feature Fa and each reference modal feature F_b1, F_b2, ..., F_bn is computed through the cross attention mechanism (the CA part shown in fig. 6) and can be expressed as interaction relation mapping vectors; the normalized interaction relation mapping vectors are then multiplied with the original reference modal features (that is, a dot multiplication operation) to form at least one cross attention modal feature F_t1, F_t2, ..., F_tn;
A4: adding the basic modal characteristics and the cross attention modal characteristics to obtain cross attention fusion characteristics;
illustratively, as shown in fig. 6, the at least one cross attention modal feature F_t1, F_t2, ..., F_tn generated in the cross attention processing stage is added to (superimposed on) the original basic modal feature Fa, so as to obtain cross attention fusion features in which the reference modalities and the basic modality are cross-fused; the fused cross attention fusion features carry both the basic image modality characteristics and the cross attention characteristics;
A6: and carrying out convolution processing on the cross attention fusion characteristic to obtain a second object fusion characteristic.
Illustratively, a plurality of the cross-attention fusion features are subjected to convolution processing (conv processing shown in fig. 6) through a convolution layer to obtain second object fusion features, and the resolution of the feature map of the second object fusion features is consistent with that of the feature map of the original object modal features.
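The specification does not give the exact cross attention equations, so the following PyTorch sketch is only one plausible reading of the correlation fusion described above (channel-to-channel correlation between Fa and each F_bi, normalization, dot multiplication with the reference feature, addition onto Fa, and a final convolution); all layer choices and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Illustrative cross attention fusion: for each reference modal feature F_bi, a
    correlation map against the base feature Fa is computed and normalized, re-weights
    F_bi (dot multiplication), the results are added to Fa, and a convolution yields
    the second object fusion feature at the same resolution as the inputs."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)   # from base feature Fa
        self.key = nn.Conv2d(channels, channels, kernel_size=1)     # from reference feature F_bi
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, base: torch.Tensor, references: list) -> torch.Tensor:
        b, c, h, w = base.shape
        q = self.query(base).flatten(2)                  # (B, C, H*W)
        fused = base
        for ref in references:
            k = self.key(ref).flatten(2)                 # (B, C, H*W)
            # channel-to-channel correlation between base and reference: (B, C, C)
            relation = torch.bmm(q, k.transpose(1, 2)) / (h * w)
            relation = F.softmax(relation, dim=-1)       # normalized relation mapping
            v = ref.flatten(2)                           # (B, C, H*W)
            attended = torch.bmm(relation, v).view(b, c, h, w)  # cross attention modal feature F_ti
            fused = fused + attended                     # superimpose onto the base feature Fa
        return self.out_conv(fused)                      # second object fusion feature
```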
Illustratively, the following description takes the at least two types of object images as a color object image, an infrared object image and a depth object image, the determined basic modal feature as the color modal feature, and the reference modal features as the infrared modal feature and the depth modal feature.
the performing the dot multiplication processing on the basic modal feature and each reference modal feature based on the modal correlation information to obtain at least one cross attention modal feature, and performing the addition processing on the basic modal feature and each cross attention modal feature to obtain a cross attention fusion feature may include the following schemes:
specifically, performing dot multiplication processing on the color modal feature, the infrared modal feature and the depth modal feature based on the modal correlation information to obtain a cross attention color-infrared modal feature and a cross attention color-depth modal feature;
And specifically, adding the color modal characteristics, the cross-attention color-infrared modal characteristics and the cross-attention color-depth modal characteristics to obtain a cross-attention fusion characteristic.
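Continuing the illustrative sketch above, a hypothetical call with the color feature as the base and the infrared and depth features as references might look as follows; the tensor sizes are arbitrary and the class comes from the assumed CrossModalFusion sketch, not from the patent itself.

```python
import torch

# Hypothetical usage with color as the base modality and IR / depth as references.
channels, h, w = 64, 28, 28
f_rgb = torch.randn(1, channels, h, w)     # color (base) modal feature
f_ir = torch.randn(1, channels, h, w)      # infrared reference modal feature
f_depth = torch.randn(1, channels, h, w)   # depth reference modal feature

cross_fusion = CrossModalFusion(channels)            # sketched above
f_x2 = cross_fusion(f_rgb, [f_ir, f_depth])          # second object fusion feature
print(f_x2.shape)  # torch.Size([1, 64, 28, 28]) -- same resolution as the inputs
```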
S210: and screening and fusing the first object fusion characteristic and the second object fusion characteristic to obtain a third object fusion characteristic, and performing image living body detection processing on the target object based on the third object fusion characteristic.
Schematically, as shown in fig. 7, which is a schematic diagram of an image living body detection according to the present specification: in fig. 7, the channel importance degree of each object modal feature in at least one image-feature-channel dimension is determined from the image-feature-channel dimension through the squeeze-excitation processing network, and attention operation processing is performed on the object modal features in the image-feature-channel dimension based on the channel importance degree, thereby obtaining the first object modal features corresponding to the object modal features, which are fused into the first object fusion feature F_x1. The modal correlation information between the basic modal feature and the reference modal features is determined from the modal correlation dimension through the cross attention processing network, dot multiplication processing is performed on the basic modal feature and each reference modal feature based on the modal correlation information to obtain at least one cross attention modal feature, and convolution processing is performed on the cross attention modal features to obtain the second object fusion feature. Then, a feature importance screening network is adopted to perform region screening and fusion based on the first object fusion feature and the second object fusion feature to obtain the final third object fusion feature, after which a living body detection classification network performs image living body detection processing based on the third object fusion feature and outputs the living body detection category for the target object, where the living body detection category includes one of an image living body category and an image attack category;
The first object fusion feature and the second object fusion feature are typically characterized in the form of an object fusion feature map;
in a possible implementation manner, the first object fusion feature and the second object fusion feature may be subjected to region screening processing through a feature importance screening network to obtain a plurality of object region features, and the object region features are fused to obtain a third object fusion feature.
Illustratively, the region screening process may be a screening mechanism based on a feature block, where the screening mechanism based on the feature block may screen a plurality of object region features with good separability from the first object fusion feature and the second object fusion feature, and then fuse the object region features to obtain a third object fusion feature.
Illustratively, the screening mechanism based on the feature block may be constructed and trained by a neural network module based on a machine learning model.
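The block-based screening mechanism is not specified in detail, so the sketch below is only an assumed patch-scoring variant: the two fusion features are concatenated, per-patch importance scores keep the most separable regions, and a small classification head outputs the living/attack logits. The patch size, the number of kept patches and the head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScreeningFusionClassifier(nn.Module):
    """Illustrative patch-based screening and fusion (S210): concatenate F_x1 and the
    second object fusion feature, keep the highest-scoring patches, aggregate them into
    the third object fusion feature, and classify living body vs. attack."""
    def __init__(self, channels: int, num_patches_kept: int = 4, patch: int = 7):
        super().__init__()
        self.patch = patch
        self.keep = num_patches_kept
        self.score = nn.Conv2d(channels * 2, 1, kernel_size=1)      # importance per location
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, 2))           # live vs. attack logits

    def forward(self, f_x1: torch.Tensor, f_x2: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_x1, f_x2], dim=1)
        scores = F.avg_pool2d(self.score(x), self.patch)             # (B, 1, H/p, W/p) patch scores
        b, _, ph, pw = scores.shape
        mask = torch.zeros_like(scores).flatten(1)
        top = scores.flatten(1).topk(min(self.keep, ph * pw), dim=1).indices
        mask.scatter_(1, top, 1.0)                                   # keep the most separable patches
        mask = F.interpolate(mask.view(b, 1, ph, pw), scale_factor=self.patch)
        fused = self.fuse(x) * mask                                  # third object fusion feature
        return self.head(fused)                                      # liveness logits
```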
It should be noted that the machine learning models according to one or more embodiments of the present disclosure include, but are not limited to, a combination of one or more of a convolutional neural network (Convolutional Neural Network, CNN) model, a deep neural network (Deep Neural Network, DNN) model, a recurrent neural network (Recurrent Neural Networks, RNN) model, an embedding model, a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model, a logistic regression (Logistic Regression, LR) model, and the like.
In one or more embodiments of the present disclosure, an initial image living body detection model and at least one network module it contains, such as the squeeze-excitation processing network, the cross attention processing network, the feature importance screening network and the living body detection classification network, may be pre-constructed based on machine learning models. A large amount of sample object image data, that is, at least two types of sample object images for each sample object, is acquired, and the initial image living body detection model is trained with this sample data; the trained image living body detection model is obtained after the initial model satisfies the model end-training condition. The image living body detection method in the model training stage may refer to the image living body detection method in one or more embodiments of the present specification and is not described again here.
In one or more embodiments herein, the model ending training condition may include, for example, a loss function having a value less than or equal to a preset loss function threshold, a number of iterations reaching a preset number of times threshold, and so on. The specific model end training conditions may be determined based on actual conditions and are not specifically limited herein.
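A hedged sketch of such a training stage, using only the end-training conditions named above (a loss threshold and an iteration cap), is shown below; the optimizer, learning rate and data-loader interface are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal training-loop sketch for the initial image living body detection model.
# `model` is assumed to bundle the sub-networks sketched above and `loader` is assumed
# to yield (rgb, ir, depth, label) sample batches; neither interface comes from the patent.
def train(model, loader, max_iters: int = 10_000, loss_threshold: float = 0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()          # living body (1) vs. attack (0) classification
    step = 0
    for epoch in range(1_000):                 # upper bound; end conditions break out early
        for rgb, ir, depth, label in loader:
            logits = model(rgb, ir, depth)     # liveness logits from the third object fusion feature
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            # model end-training conditions from the description above
            if loss.item() <= loss_threshold or step >= max_iters:
                return model
    return model
```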
In one or more embodiments of the present specification, the image living body detection method may be understood as a multi-modal living body attack detection method based on cross attention and hybrid feature fusion. The raw data of image modalities such as the RGB, IR and depth modalities are acquired through a multi-modal acquisition form, and feature extraction is then performed. On the one hand, the original multi-modal features and the fusion feature can be effectively extracted based on the squeeze-excitation processing network and the feature stitching operation. On the other hand, the feature extraction based on cross attention and hybrid feature fusion yields a high-quality, fine-grained deep third object fusion feature with higher separability; by fully using the image characteristics of the different image modalities, the image living body detection achieves a good feature characterization effect, assists the subsequent accurate living body detection classification, achieves a better living body attack detection effect, and improves the generalization capability of the image living body detection.
The image living body detection apparatus provided in the present specification will be described in detail with reference to fig. 8. Note that the image living body detection apparatus shown in fig. 8 is used to perform the methods of the embodiments shown in figs. 1 to 7 of the present application. For convenience of explanation, only the portions relevant to the present specification are shown; for specific technical details that are not disclosed, please refer to the embodiments shown in figs. 1 to 7 of the present application.
Referring to fig. 8, a schematic structural diagram of the image living body detection apparatus of the present specification is shown. The image living body detection apparatus 1 may be implemented as all or a part of a user terminal by software, hardware, or a combination of both. According to some embodiments, the image living body detection apparatus 1 includes an image acquisition module 11, an operation processing module 12, and a living body detection module 13, specifically used for:
the image acquisition module 11 is used for acquiring at least two types of object images aiming at a target object and determining object modal characteristics corresponding to the object images of various types;
the operation processing module 12 is configured to perform feature attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature, and perform feature fusion based on each first object modal feature to obtain a first object fusion feature;
the operation processing module 12 is configured to obtain at least one basic modal feature and at least one reference modal feature in the object modal features, and perform a modal correlation fusion process based on the basic modal feature and the reference modal feature to obtain a second object fusion feature;
a living body detection module 13, configured to perform screening fusion processing based on the first object fusion feature and the second object fusion feature to obtain a third object fusion feature, and perform image living body detection processing on the target object based on the third object fusion feature.
Optionally, the operation processing module 12 is configured to:
performing channel attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature;
optionally, the operation processing module 12 is configured to:
determining, from the image feature channel dimension through a squeeze-excitation processing network, the channel importance degree of each object modal feature in at least one image feature channel dimension, and performing attention operation processing on the object modal feature in the image feature channel dimension based on the channel importance degree, to obtain a first object modal feature corresponding to each object modal feature.
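As a minimal, non-limiting sketch of channel attention in the image feature channel dimension, the following PyTorch module implements a squeeze-and-excitation block: the channel importance degrees are obtained by squeezing the spatial dimensions and exciting the channel dimension, and the object modal feature is re-weighted by those importance degrees. The reduction ratio and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel attention over one object modal feature; a minimal sketch of a
    squeeze-and-excitation block, with the reduction ratio chosen arbitrarily."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                 # global average pool per channel
        self.excite = nn.Sequential(                           # learn per-channel importance
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                      # channel importance degrees in [0, 1]
        )

    def forward(self, modal_feature: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = modal_feature.shape
        importance = self.excite(self.squeeze(modal_feature).view(b, c)).view(b, c, 1, 1)
        # Re-weight the modal feature along the channel dimension to obtain
        # the first object modal feature for this modality.
        return modal_feature * importance

# Example: one branch feature map of shape (batch, channels, height, width).
first_modal_feature = SqueezeExcitation(64)(torch.randn(2, 64, 28, 28))
```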
Optionally, the operation processing module 12 is configured to: and performing feature stitching processing on the modal features of the first objects to obtain stitching processing features, and performing convolution aggregation processing on the stitching processing features to obtain first object fusion features.
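The first fusion branch may be sketched as follows, assuming a 1x1 convolution as the convolution aggregation; the kernel size and channel counts are assumptions and are not prescribed by the present specification.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Stitch the first object modal features along the channel axis and
    aggregate the stitched feature with a convolution (1x1 kernel assumed)."""

    def __init__(self, num_modalities: int, channels: int):
        super().__init__()
        self.aggregate = nn.Conv2d(num_modalities * channels, channels, kernel_size=1)

    def forward(self, first_modal_features):
        stitched = torch.cat(first_modal_features, dim=1)   # feature stitching processing
        return self.aggregate(stitched)                     # first object fusion feature

# Example with RGB, IR and depth branch features of identical spatial size.
rgb_f, ir_f, depth_f = (torch.randn(2, 64, 28, 28) for _ in range(3))
first_fusion_feature = ConcatFusion(3, 64)([rgb_f, ir_f, depth_f])
```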
Optionally, the operation processing module 12 is configured to: selecting at least one basic modal feature and at least one reference modal feature from the object modal features; or,
acquiring at least one preset basic modal feature and at least one preset reference modal feature from the object modal features.
Alternatively, as shown in fig. 9, the operation processing module 12 includes:
a correlation processing unit 121 for determining modality correlation information of the basic modality features and the reference modality features, respectively, from the modality correlation dimension through a cross-attention processing network;
a feature fusion unit 122, configured to perform feature fusion on the basic modal feature and the reference modal feature based on the modality correlation information to obtain a second object fusion feature, where the object modal features include at least one basic modal feature and at least one reference modal feature.
Optionally, the feature fusion unit 122 is configured to:
performing point multiplication processing on the basic modal characteristics and the reference modal characteristics respectively based on the modal correlation information to obtain at least one cross attention modal characteristic;
adding the basic modal characteristics and the cross attention modal characteristics to obtain cross attention fusion characteristics;
and carrying out convolution processing on the cross attention fusion characteristic to obtain a second object fusion characteristic.
Optionally, the at least two types of object images are a color class object image, an infrared class object image and a depth class object image, the basic modal feature is a color modal feature, and the reference modal features are an infrared modal feature and a depth modal feature; the feature fusion unit 122 is configured to:
Performing dot multiplication processing on the color modal feature, the infrared modal feature and the depth modal feature based on the modal correlation information to obtain a cross-attention color-infrared modal feature and a cross-attention color-depth modal feature;
and adding the color modal feature, the cross-attention color-infrared modal feature and the cross-attention color-depth modal feature to obtain a cross-attention fusion feature.
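A non-limiting PyTorch sketch of this cross-attention fusion, with the color modal feature as the base and the infrared and depth modal features as references, is given below. How the modality correlation information is parameterised (here, 1x1 projections followed by a sigmoid-normalised similarity map) is an assumption; the dot multiplication, addition and convolution steps follow the description above.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of the second fusion branch. The correlation between the colour
    (base) feature and each reference feature is computed as a spatial
    similarity map; the exact parameterisation is an assumption."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)   # projects the base feature
        self.key = nn.Conv2d(channels, channels, kernel_size=1)     # projects a reference feature
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def correlation(self, base, reference):
        # Modality correlation information as a spatial attention map in [0, 1].
        sim = (self.query(base) * self.key(reference)).sum(dim=1, keepdim=True)
        return torch.sigmoid(sim)

    def forward(self, color, infrared, depth):
        corr_ir = self.correlation(color, infrared)
        corr_depth = self.correlation(color, depth)
        # Dot (element-wise) multiplication with the correlation maps gives the
        # cross-attention colour-infrared and colour-depth modal features.
        cross_ir = corr_ir * infrared
        cross_depth = corr_depth * depth
        # Addition with the base colour feature gives the cross-attention fusion feature.
        fused = color + cross_ir + cross_depth
        # Convolution of the fusion feature gives the second object fusion feature.
        return self.out_conv(fused)

# Example: colour, infrared and depth branch features of identical shape.
color_f, ir_f, depth_f = (torch.randn(2, 64, 28, 28) for _ in range(3))
second_fusion_feature = CrossModalFusion(64)(color_f, ir_f, depth_f)
```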
Optionally, the living body detection module 13 is configured to:
performing region screening processing on the first object fusion feature and the second object fusion feature through a feature importance screening network to obtain a plurality of object region features;
and fusing the object region features to obtain a third object fusion feature.
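The screening fusion step may be sketched as follows. Here object regions are modelled as non-overlapping spatial patches scored by a learned importance map, the most important regions of each fusion feature are retained, and the retained object region features are fused; the patch size, the number of retained regions and the averaging-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureImportanceScreening(nn.Module):
    """Sketch of region screening over the first and second object fusion
    features, followed by fusion of the screened region features."""

    def __init__(self, channels: int, patch: int = 7, keep: int = 4):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # per-location importance
        self.patch, self.keep = patch, keep
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def screen(self, fusion_feature):
        b, c, h, w = fusion_feature.shape
        # Split the feature map into non-overlapping patches (candidate object region features).
        regions = fusion_feature.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        regions = regions.contiguous().view(b, c, -1, self.patch, self.patch)
        scores = self.score(fusion_feature)
        scores = scores.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        scores = scores.contiguous().view(b, 1, -1, self.patch, self.patch).mean(dim=(-1, -2))
        # Keep the most important regions of this fusion feature.
        idx = scores.squeeze(1).topk(self.keep, dim=-1).indices                 # (b, keep)
        idx = idx.view(b, 1, self.keep, 1, 1).expand(-1, c, -1, self.patch, self.patch)
        return regions.gather(2, idx)                                            # (b, c, keep, p, p)

    def forward(self, first_fusion, second_fusion):
        region_features = torch.cat([self.screen(first_fusion), self.screen(second_fusion)], dim=2)
        # Fuse all screened object region features into the third object fusion feature.
        pooled = region_features.mean(dim=2)
        return self.fuse(pooled)

# Example with two fusion features whose spatial size is divisible by the patch size.
third_fusion_feature = FeatureImportanceScreening(64)(torch.randn(2, 64, 28, 28),
                                                      torch.randn(2, 64, 28, 28))
```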
Optionally, the device 1 is configured to:
and performing image living detection processing based on the third object fusion characteristic through a living detection classification network, and outputting a living detection category for the target object, wherein the living detection category comprises one of an image living category and an image attack category.
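A minimal sketch of such a living body detection classification network is shown below; global average pooling followed by a single linear layer is an assumption, with the two output logits corresponding to the image living category and the image attack category.

```python
import torch
import torch.nn as nn

class LivenessClassifier(nn.Module):
    """Maps the third object fusion feature to one of the two living body
    detection categories; the pooling and single linear layer are assumptions."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),     # collapse the spatial dimensions
            nn.Flatten(),
            nn.Linear(channels, 2),      # logits for [living, attack]
        )

    def forward(self, third_fusion_feature: torch.Tensor) -> str:
        logits = self.head(third_fusion_feature)
        category = logits.argmax(dim=1)
        # Output the living body detection category for the target object.
        return ["image living category", "image attack category"][int(category[0])]

# Example with a batch containing a single target object.
print(LivenessClassifier()(torch.randn(1, 64, 7, 7)))
```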
It should be noted that, when the image living body detection apparatus provided in the above embodiments performs the image living body detection method, the division into the above functional modules is used only for illustration; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the image living body detection apparatus and the image living body detection method provided in the above embodiments belong to the same concept; the detailed implementation process is embodied in the method embodiments and is not repeated herein.
The foregoing description is provided for the purpose of illustration only and does not represent the advantages or disadvantages of the embodiments.
In one or more embodiments of the present disclosure, the electronic device determines the object modal features corresponding to the object images of the different image modalities of the target object, performs attention operation processing on the object modal features to obtain the first object modal features, performs feature fusion based on each first object modal feature to obtain the first object fusion feature, performs modal correlation fusion on the basic modal feature and the reference modal feature among the object modal features to obtain the second object fusion feature, and performs screening fusion based on the first object fusion feature and the second object fusion feature to obtain a high-quality, fine-granularity deep third object fusion feature, based on which image living body detection processing is performed on the target object.
The present disclosure further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are adapted to be loaded by a processor and executed by the processor, where the specific execution process may refer to the specific description of the embodiment shown in fig. 1 to 7, and details are not repeated herein.
The application further provides a computer program product, where at least one instruction is stored, where the at least one instruction is loaded by the processor and executed by the processor, and the specific execution process may refer to the specific description of the embodiment shown in fig. 1 to 7, and details are not repeated herein.
Referring to fig. 10, a block diagram of an electronic device according to an exemplary embodiment of the present application is shown. An electronic device in the present application may include one or more of the following components: processor 110, memory 120, input device 130, output device 140, and bus 150. The processor 110, the memory 120, the input device 130, and the output device 140 may be connected by a bus 150.
Processor 110 may include one or more processing cores. The processor 110 connects various portions of the overall electronic device using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 120 and by invoking the data stored in the memory 120. Alternatively, the processor 110 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), or programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing the display content; and the modem is used to handle wireless communication. It will be appreciated that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include a random access memory (RAM) or a read-only memory (ROM). Optionally, the memory 120 includes a non-transitory computer-readable storage medium. The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system (which may be an Android system, including a system developed in depth based on the Android system, an IOS system developed by Apple Inc., including a system developed in depth based on the IOS system, or another system), instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described herein, and the like. The data storage area may also store data created by the electronic device in use, such as a phone book, audio and video data, chat log data, and the like.
Referring to FIG. 11, the memory 120 may be divided into an operating system space in which the operating system runs and a user space in which native and third party applications run. In order to ensure that different third party application programs can achieve better operation effects, the operating system allocates corresponding system resources for the different third party application programs. However, the requirements of different application scenarios in the same third party application program on system resources are different, for example, under the local resource loading scenario, the third party application program has higher requirement on the disk reading speed; in the animation rendering scene, the third party application program has higher requirements on the GPU performance. The operating system and the third party application program are mutually independent, and the operating system often cannot timely sense the current application scene of the third party application program, so that the operating system cannot perform targeted system resource adaptation according to the specific application scene of the third party application program.
In order to enable the operating system to distinguish specific application scenes of the third-party application program, data communication between the third-party application program and the operating system needs to be communicated, so that the operating system can acquire current scene information of the third-party application program at any time, and targeted system resource adaptation is performed based on the current scene.
Taking the Android system as an example of the operating system, as shown in fig. 12, the programs and data stored in the memory 120 comprise a Linux kernel layer 320, a system runtime library layer 340, an application framework layer 360 and an application layer 380, where the Linux kernel layer 320, the system runtime library layer 340 and the application framework layer 360 belong to the operating system space, and the application layer 380 belongs to the user space. The Linux kernel layer 320 provides the underlying drivers for the various hardware of the electronic device, such as the display driver, audio driver, camera driver, Bluetooth driver, Wi-Fi driver, power management and the like. The system runtime library layer 340 provides the main feature support for the Android system through some C/C++ libraries. For example, the SQLite library provides database support, the OpenGL/ES library provides 3D graphics support, the Webkit library provides browser kernel support, and so on. The system runtime library layer 340 also provides the Android runtime (Android Runtime), which mainly provides some core libraries that allow developers to write Android applications in the Java language. The application framework layer 360 provides various APIs that may be used to build applications; developers can also build their own applications by using these APIs, for example for activity management, window management, view management, notification management, content providers, package management, call management, resource management and location management. At least one application program runs in the application layer 380; these application programs may be native applications of the operating system, such as a contacts program, a short message program, a clock program, a camera application and the like, or third party applications developed by third party developers, such as game applications, instant messaging programs, photo beautification programs and the like.
Taking the IOS system as an example of the operating system, the programs and data stored in the memory 120 are shown in fig. 13. The IOS system includes: a core operating system layer 420 (Core OS layer), a core services layer 440 (Core Services layer), a media layer 460 (Media layer), and a touchable layer 480 (Cocoa Touch layer). The core operating system layer 420 includes the operating system kernel, drivers, and underlying program frameworks that provide functions closer to the hardware for use by the program frameworks in the core services layer 440. The core services layer 440 provides the system services and/or program frameworks required by applications, such as a Foundation framework, an account framework, an advertisement framework, a data storage framework, a network connection framework, a geographic location framework, a motion framework, and the like. The media layer 460 provides audio-visual interfaces for applications, such as graphics and image related interfaces, audio technology related interfaces, video technology related interfaces, and the wireless playback (AirPlay) interface for audio and video transmission. The touchable layer 480 provides various commonly used interface-related frameworks for application development and is responsible for user touch interaction on the electronic device, such as a local notification service, a remote push service, an advertisement framework, a game tool framework, a message user interface (UI) framework, a user interface UIKit framework, a map framework, and so forth.
Among the frameworks illustrated in fig. 13, the frameworks related to most applications include, but are not limited to: the Foundation framework in the core services layer 440 and the UIKit framework in the touchable layer 480. The Foundation framework provides many basic object classes and data types and the most basic system services for all applications, independent of the UI. The classes provided by the UIKit framework form a basic UI class library for creating touch-based user interfaces; iOS applications can provide UIs based on the UIKit framework, so it provides the infrastructure for applications to build user interfaces, draw, handle user interaction events, respond to gestures, and so on.
The manner and principle of implementing data communication between the third party application program and the operating system in the IOS system may refer to the Android system, which is not described herein.
The input device 130 is configured to receive input instructions or data, and includes, but is not limited to, a keyboard, a mouse, a camera, a microphone, or a touch device. The output device 140 is used to output instructions or data, and includes, but is not limited to, a display device, a speaker, and the like. In one example, the input device 130 and the output device 140 may be combined as a touch display screen for receiving a touch operation on or near it by a user using a finger, a touch pen, or any other suitable object, and for displaying the user interface of each application program. The touch display screen is typically provided on the front panel of the electronic device. It may be designed as a full screen, a curved screen, or a contoured screen, or as a combination of a full screen and a curved screen, or a combination of a contoured screen and a curved screen, which is not limited in this specification.
In addition, those skilled in the art will appreciate that the configuration of the electronic device shown in the above-described figures does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than illustrated, may combine certain components, or may have a different arrangement of components. For example, the electronic device may further include components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a wireless fidelity (WiFi) module, a power supply, and a Bluetooth module, which are not described herein.
In this specification, the execution subject of each step may be the electronic device described above. Optionally, the execution subject of each step is an operating system of the electronic device. The operating system may be an android system, an IOS system, or other operating systems, which is not limited in this specification.
The electronic device of the present specification may further have a display device mounted thereon, and the display device may be any device capable of realizing a display function, for example: a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), a plasma display panel (PDP), and the like. A user may use the display device on the electronic device to view displayed text, images, video, and the like. The electronic device may be a smart phone, a tablet computer, a gaming device, an AR (Augmented Reality) device, an automobile, a data storage device, an audio playing device, a video playing device, a notebook or desktop computing device, or a wearable device such as an electronic watch, electronic glasses, an electronic helmet, an electronic bracelet, an electronic necklace or an electronic article of clothing.
In the electronic device shown in fig. 10, the processor 110 may be configured to call an application program stored in the memory 120, and specifically perform the following operations:
acquiring at least two types of object images aiming at a target object, and determining object modal characteristics corresponding to the object images of various types;
performing feature attention processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature, and performing feature fusion on the basis of each first object modal feature to obtain a first object fusion feature;
acquiring at least one basic modal feature and at least one reference modal feature in the object modal features, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature;
and screening and fusing the first object fusion characteristic and the second object fusion characteristic to obtain a third object fusion characteristic, and performing image living body detection processing on the target object based on the third object fusion characteristic.
In one embodiment, the processor 110 performs the feature attention processing on each of the object mode features to obtain a first object mode feature corresponding to each of the object mode features, and further performs the following operations:
And carrying out channel attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature.
In one embodiment, the processor 110 performs the following steps when performing the channel attention operation on each of the object modal features to obtain a first object modal feature corresponding to each of the object modal features:
determining, from the image feature channel dimension through a squeeze-excitation processing network, the channel importance degree of each object modal feature in at least one image feature channel dimension, and performing attention operation processing on the object modal feature in the image feature channel dimension based on the channel importance degree, to obtain a first object modal feature corresponding to each object modal feature.
In one embodiment, the processor 110 performs the following steps when performing the feature fusion based on each of the first object modal features to obtain a first object fusion feature:
and performing feature stitching processing on the modal features of the first objects to obtain stitching processing features, and performing convolution aggregation processing on the stitching processing features to obtain first object fusion features.
In one embodiment, the processor 110, when executing the acquiring at least one base modality feature and at least one reference modality feature of each of the object modality features, performs the steps of:
selecting at least one basic modal feature and at least one reference modal feature from the object modal features; or,
acquiring at least one preset basic modal feature and at least one preset reference modal feature from the object modal features.
In one embodiment, the processor 110 performs the following steps when performing the mode relevance fusion process based on the basic mode feature and the reference mode feature to obtain a second object fusion feature:
determining modality correlation information of the basic modality features and the reference modality features respectively from the modality correlation dimension through a cross attention processing network;
feature fusion is carried out on the basic modal feature and the reference modal feature based on the modality correlation information to obtain a second object fusion feature, wherein the object modal features include at least one basic modal feature and at least one reference modal feature.
In one embodiment, the processor 110 performs the following steps when performing the feature fusion of the basic modality feature and the reference modality feature based on the modality correlation information to obtain a second object fusion feature:
Performing point multiplication processing on the basic modal characteristics and the reference modal characteristics respectively based on the modal correlation information to obtain at least one cross attention modal characteristic;
adding the basic modal characteristics and the cross attention modal characteristics to obtain cross attention fusion characteristics;
and carrying out convolution processing on the cross attention fusion characteristic to obtain a second object fusion characteristic.
In one embodiment, the at least two types of object images are a color class object image, an infrared class object image and a depth class object image, the basic modal feature is a color modal feature, and the reference modal features are an infrared modal feature and a depth modal feature; when performing the dot multiplication processing on the basic modal feature and each reference modal feature based on the modality correlation information to obtain at least one cross-attention modal feature, and the addition processing of the basic modal feature and each cross-attention modal feature to obtain a cross-attention fusion feature, the processor 110 performs the following steps:
performing dot multiplication processing on the color modal feature, the infrared modal feature and the depth modal feature based on the modal correlation information to obtain a cross-attention color-infrared modal feature and a cross-attention color-depth modal feature;
And adding the color modal feature, the cross-attention color-infrared modal feature and the cross-attention color-depth modal feature to obtain a cross-attention fusion feature.
In one embodiment, the processor 110 performs the following steps when performing the filtering fusion process based on the first object fusion feature and the second object fusion feature to obtain a third object fusion feature:
performing region screening processing on the first object fusion feature and the second object fusion feature through a feature importance screening network to obtain a plurality of object region features;
and fusing the object region features to obtain a third object fusion feature.
In one embodiment, the processor 110 performs the following steps when performing the image live detection processing on the target object based on the third object fusion feature:
and performing image living detection processing based on the third object fusion characteristic through a living detection classification network, and outputting a living detection category for the target object, wherein the living detection category comprises one of an image living category and an image attack category.
In one or more embodiments of the present disclosure, the electronic device determines the object modal features corresponding to the object images of the different image modalities of the target object, performs attention operation processing on the object modal features to obtain the first object modal features, performs feature fusion based on each first object modal feature to obtain the first object fusion feature, performs modal correlation fusion on the basic modal feature and the reference modal feature among the object modal features to obtain the second object fusion feature, and performs screening fusion based on the first object fusion feature and the second object fusion feature to obtain a high-quality, fine-granularity deep third object fusion feature, based on which image living body detection processing is performed on the target object.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
The foregoing disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent changes made according to the claims of the present application shall still fall within the scope of the present application.

Claims (14)

1. An image live detection method, the method comprising:
acquiring at least two types of object images aiming at a target object, and determining object modal characteristics corresponding to the object images of various types;
performing feature attention processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature, and performing feature fusion on the basis of each first object modal feature to obtain a first object fusion feature;
acquiring at least one basic modal feature and at least one reference modal feature in the object modal features, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature;
and screening and fusing the first object fusion characteristic and the second object fusion characteristic to obtain a third object fusion characteristic, and performing image living body detection processing on the target object based on the third object fusion characteristic.
2. The method according to claim 1, wherein the performing feature attention processing on each of the object modal features to obtain a first object modal feature corresponding to each of the object modal features includes:
and carrying out channel attention operation processing on each object modal feature to obtain a first object modal feature corresponding to each object modal feature.
3. The method of claim 2, wherein performing the channel attention operation on each of the object modal features to obtain a first object modal feature corresponding to each of the object modal features, comprises:
determining, from the image feature channel dimension through a squeeze-excitation processing network, the channel importance degree of each object modal feature in at least one image feature channel dimension, and performing attention operation processing on the object modal feature in the image feature channel dimension based on the channel importance degree, to obtain a first object modal feature corresponding to each object modal feature.
4. The method of claim 1, wherein the feature fusion based on each of the first object modality features to obtain a first object fusion feature comprises:
and performing feature stitching processing on the modal features of the first objects to obtain stitching processing features, and performing convolution aggregation processing on the stitching processing features to obtain first object fusion features.
5. The method of claim 1, the acquiring at least one base modality feature and at least one reference modality feature of each of the object modality features, comprising:
acquiring at least one preset basic modal feature and at least one preset reference modal feature from each object modal feature; or,
at least one base modality feature and at least one reference modality feature are selected from each of the object modality features.
6. The method of claim 1, wherein the performing the mode relevance fusion process based on the basic mode feature and the reference mode feature to obtain a second object fusion feature comprises:
determining modality correlation information of the basic modality features and the reference modality features respectively from the modality correlation dimension through a cross attention processing network;
and carrying out feature fusion on the basic modal feature and the reference modal feature based on the modal correlation information to obtain a second object fusion feature, wherein the object modal feature comprises at least one basic modal feature and at least one reference modal feature.
7. The method of claim 6, wherein the feature fusing the base modality feature and the reference modality feature based on the modality correlation information to obtain a second object fusion feature, comprises:
Performing point multiplication processing on the basic modal characteristics and the reference modal characteristics respectively based on the modal correlation information to obtain at least one cross attention modal characteristic;
adding the basic modal characteristics and the cross attention modal characteristics to obtain cross attention fusion characteristics;
and carrying out convolution processing on the cross attention fusion characteristic to obtain a second object fusion characteristic.
8. The method of claim 7, wherein the at least two classes of object images are color class object images, infrared class object images, depth class object images, the base modality features are color modality features, the reference modality features are infrared modality features and depth modality features,
the performing point multiplication processing on the basic modal feature and each reference modal feature based on the modal correlation information to obtain at least one cross attention modal feature, and performing addition processing on the basic modal feature and each cross attention modal feature to obtain a cross attention fusion feature, including:
performing dot multiplication processing on the color modal feature, the infrared modal feature and the depth modal feature based on the modal correlation information to obtain a cross-attention color-infrared modal feature and a cross-attention color-depth modal feature;
And adding the color modal feature, the cross-attention color-infrared modal feature and the cross-attention color-depth modal feature to obtain a cross-attention fusion feature.
9. The method of claim 1, wherein the filtering fusion process based on the first object fusion feature and the second object fusion feature results in a third object fusion feature, comprising:
performing region screening processing on the first object fusion feature and the second object fusion feature through a feature importance screening network to obtain a plurality of object region features;
and fusing the object region features to obtain a third object fusion feature.
10. The method of claim 1, the performing image live detection processing on the target object based on the third object fusion feature, comprising:
and performing image living detection processing based on the third object fusion characteristic through a living detection classification network, and outputting a living detection category for the target object, wherein the living detection category comprises one of an image living category and an image attack category.
11. An image living body detection apparatus, the apparatus comprising:
The image acquisition module is used for acquiring at least two types of object images aiming at a target object and determining object modal characteristics corresponding to the object images of various types;
the operation processing module is used for carrying out characteristic attention operation processing on the object modal characteristics to obtain first object modal characteristics corresponding to the object modal characteristics, and carrying out characteristic fusion on the basis of the first object modal characteristics to obtain first object fusion characteristics;
the operation processing module is used for acquiring at least one basic modal feature and at least one reference modal feature in the modal features of each object, and performing modal correlation fusion processing based on the basic modal features and the reference modal features to obtain a second object fusion feature;
and the living body detection module is used for carrying out screening fusion processing on the basis of the first object fusion characteristics and the second object fusion characteristics to obtain third object fusion characteristics, and carrying out image living body detection processing on the target object on the basis of the third object fusion characteristics.
12. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of claims 1 to 10.
13. A computer program product storing at least one instruction adapted to be loaded by a processor and to perform the method steps of any of claims 1 to 10.
14. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-10.
CN202211660755.5A 2022-12-23 2022-12-23 Image living body detection method and device, storage medium and electronic equipment Pending CN116229585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211660755.5A CN116229585A (en) 2022-12-23 2022-12-23 Image living body detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211660755.5A CN116229585A (en) 2022-12-23 2022-12-23 Image living body detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116229585A true CN116229585A (en) 2023-06-06

Family

ID=86588174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211660755.5A Pending CN116229585A (en) 2022-12-23 2022-12-23 Image living body detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116229585A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination