CN114550313A - Image processing method, neural network, and training method, device, and medium thereof - Google Patents


Info

Publication number
CN114550313A
Authority
CN
China
Prior art keywords
modality
features
modalities
network
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210152340.0A
Other languages
Chinese (zh)
Other versions
CN114550313B (en)
Inventor
谭资昌 (Zichang Tan)
刘阿建 (Ajian Liu)
郭国栋 (Guodong Guo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210152340.0A
Publication of CN114550313A
Application granted
Publication of CN114550313B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image processing method, a neural network, and a training method, device, and medium thereof, and relates to the field of artificial intelligence, in particular to computer vision, image processing, and deep learning technologies. The neural network includes a plurality of branch networks corresponding to a plurality of modalities. Each branch network includes: an input sub-network configured to extract first features from an input image of the corresponding modality; a first interaction sub-network configured to determine a first attention score for each of the plurality of modalities, adjust the first attention score of the corresponding modality based on the first attention scores of all of the modalities, and process the first features of the corresponding modality based on the adjusted first attention score to obtain second features; and an output sub-network configured to derive a first result based on the second features of the corresponding modality. The neural network further includes a synthetic output sub-network configured to derive a second result based on the second features of all of the modalities.

Description

Image processing method, neural network, and training method, device, and medium thereof
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a computer vision technique, an image processing technique, and a deep learning technique, and more particularly, to a neural network, a training method of the neural network, a method of image processing using the neural network, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
With the progress of face recognition technology, face recognition systems have been widely deployed, but ensuring that such systems can resist attacks from various fake faces, and thus remain reliable, is a major challenge. Face anti-spoofing has therefore received extensive attention from both academia and industry. Face anti-spoofing aims to judge whether an input face image or video shows a real, live face and to reject forged or synthesized faces, thereby preventing attackers from defeating a face recognition system with fake faces such as photos, videos, or masks.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a neural network, a training method of the neural network, a method of image processing using the neural network, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a neural network including a plurality of branch networks corresponding to a plurality of modalities, wherein each of the plurality of branch networks includes: an input sub-network configured to extract a plurality of first features of a corresponding modality in an input image of the corresponding modality, wherein the plurality of first features of the corresponding modality correspond to a plurality of first features of any other of the plurality of modalities; a first interaction subnetwork configured to: for each of a plurality of modalities, determining a plurality of first attention scores for the modality, the plurality of first attention scores corresponding to a plurality of first features of the modality; adjusting a plurality of first attention scores of corresponding modalities based on a plurality of first attention scores of each of a plurality of modalities; processing a plurality of first features of the corresponding modality based on the adjusted plurality of first attention scores of the corresponding modality to obtain a plurality of second features of the corresponding modality; and an output subnetwork configured to obtain a first result based on a plurality of second features of the corresponding modality, wherein the neural network further comprises: a synthetic output sub-network configured to obtain a second result based on a plurality of second characteristics of each of the plurality of modalities.
According to another aspect of the present disclosure, there is provided a training method for a neural network, the neural network being the neural network described above and including a plurality of branch networks corresponding to a plurality of modalities, the method including: acquiring a plurality of sample images and a real label, wherein the plurality of sample images are images of a sample object in the plurality of modalities; inputting the plurality of sample images respectively into the input sub-network of the branch network of the corresponding modality among the plurality of branch networks; acquiring the first prediction labels output by the output sub-networks of the plurality of branch networks; acquiring a second prediction label output by the synthetic output sub-network of the neural network; for each of the plurality of modalities, calculating a first loss value corresponding to the modality based on the first prediction label corresponding to the modality and the real label; calculating a second loss value based on the second prediction label and the real label; and adjusting parameters of the neural network based on at least one of the plurality of first loss values corresponding to the plurality of modalities and the second loss value.
According to another aspect of the present disclosure, there is provided a method of image processing using a neural network, the neural network being the neural network described above or a neural network obtained using the training method described above, and including a plurality of branch networks corresponding to a plurality of modalities, the method including: acquiring at least one image to be processed, wherein the at least one image to be processed is an image of a target object in at least one of the plurality of modalities; inputting the at least one image to be processed respectively into the input sub-network of the branch network of the corresponding modality among the plurality of branch networks; in response to determining that the plurality of modalities include other modalities besides the at least one modality, inputting a target image to be processed among the at least one image to be processed into the input sub-networks of the branch networks of the other modalities; and in response to determining that the at least one modality is a single modality, acquiring a first image processing result output by the output sub-network of the branch network corresponding to that modality.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above method.
According to another aspect of the disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the above method when executed by a processor.
According to one or more embodiments of the disclosure, a branch network is arranged for a plurality of modalities to extract features of each modality, attention scores of the features of each modality are calculated, and then the attention score of each modality is adjusted based on the attention scores, so that information interaction among different modalities is realized, and multi-modality information is fully utilized to improve the image processing capability of a neural network. In addition, the corresponding output sub-networks are arranged for each mode, so that the neural network trained by utilizing the multi-mode information can process single-mode images, the convenience of model deployment is greatly improved, and the neural network has better performance compared with the neural network trained by utilizing the single-mode information.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a block diagram of a neural network, according to an example embodiment of the present disclosure;
FIG. 3A shows a block diagram of a neural network, according to an exemplary embodiment of the present disclosure;
FIG. 3B shows a schematic diagram of deriving a plurality of second features using a first interaction subnetwork in accordance with an illustrative embodiment of the present disclosure;
FIG. 3C shows a schematic diagram of a plurality of third features being derived using a second interaction subnetwork in accordance with an illustrative embodiment of the present disclosure;
FIG. 4 shows a flow chart of a method of training a neural network according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a method of image processing using a neural network according to an example embodiment of the present disclosure; and
FIG. 6 sets forth a block diagram of exemplary electronic devices that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, existing multi-modal face anti-spoofing methods either predict authenticity for each modality independently and then combine the per-modality predictions, or fuse the features of all modalities into a multi-modal feature and predict from it. The former prevents the model from learning cross-modality information during training, while the latter is limited at the model deployment stage, since all modalities must be available.
In order to solve the above problems, the present disclosure sets a branch network for a plurality of modalities to extract features of each modality, calculates an attention score of the features of each modality, and adjusts the attention score of each modality based on the attention scores, thereby implementing information interaction between different modalities and fully utilizing multi-modality information to improve image processing capability of a neural network. In addition, the corresponding output sub-networks are arranged for each mode, so that the neural network trained by utilizing the multi-mode information can process single-mode images, the convenience of model deployment is greatly improved, and the neural network has better performance compared with the neural network trained by utilizing the single-mode information.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable a training method or an image processing method of a neural network to be performed.
In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software-as-a-service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
A user may use the client devices 101, 102, 103, 104, 105, and/or 106 for image processing. A client device may provide an interface that enables the user to interact with it; for example, the user may capture multi-modal images using various input devices of the client and perform the image processing methods using the client. The client device may also output information to the user via the interface; for example, the client may output the results of the image processing to the user. Although FIG. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various Mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 can include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The data store 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The data store 130 may be of different types. In certain embodiments, the data store used by the server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to an aspect of the present disclosure, a neural network is provided. As shown in FIG. 2, the neural network 200 includes a plurality of branch networks 202, 204 corresponding to a plurality of modalities, wherein each of the plurality of branch networks includes: an input sub-network 210, 212 configured to extract a plurality of first features of the corresponding modality from the input image 206, 208 of the corresponding modality, wherein the plurality of first features of the corresponding modality correspond to the plurality of first features of any other of the plurality of modalities; a first interaction sub-network 214, 216 configured to: for each of the plurality of modalities, determine a plurality of first attention scores for the modality, the plurality of first attention scores corresponding to the plurality of first features; adjust the plurality of first attention scores of the corresponding modality based on the plurality of first attention scores of each of the plurality of modalities; and process the plurality of first features of the corresponding modality based on the adjusted plurality of first attention scores to obtain a plurality of second features of the corresponding modality; and an output sub-network 218, 220 configured to derive a first result 222, 224 based on the plurality of second features of the corresponding modality. The neural network 200 further includes: a synthetic output sub-network 226 configured to derive a second result 228 based on the plurality of second features of each of the plurality of modalities. Only two branch networks 202 and 204 are shown in FIG. 2; this is not intended to limit the scope of the present disclosure, and it is understood that the neural network 200 may include more branch networks, which is not limited herein.
Therefore, the branch network is arranged for the plurality of modes to extract the characteristics of each mode, the attention scores of the characteristics of each mode are calculated, the attention score of each mode is adjusted based on the attention scores, information interaction among different modes is achieved, and multi-mode information is fully utilized to improve the image processing capacity of the neural network. In addition, the corresponding output sub-networks are arranged for each mode, so that the neural network trained by utilizing the multi-mode information can process single-mode images, the convenience of model deployment is greatly improved, and the neural network has better performance compared with the neural network trained by utilizing the single-mode information.
The neural network of the present disclosure can be used to perform tasks such as face recognition and face anti-spoofing. The neural network is trained with multi-modal real and fake face training samples, so that it can perform these tasks on single-modality or multi-modality face image data, as will be described below.
According to some embodiments, the plurality of modalities may include at least one of a color image modality, a depth image modality, and an infrared image modality. The input image may be a face image. The accuracy of tasks such as face recognition, anti-counterfeiting and the like is improved by using the information of a plurality of modes. The present disclosure will take a color image modality and a depth image modality as examples to illustrate the processing of images and image features, and how to interact information between different modalities.
According to some embodiments, the input sub-network is configured to receive an input image of the corresponding modality and output image features, i.e., a plurality of first features. In some embodiments, the input image needs to meet certain requirements; for example, the image size should be 224 × 224. Therefore, before an image is input to the neural network, a series of preprocessing steps can be performed so that the input image meets the corresponding requirements.
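As an illustrative sketch only (this disclosure does not specify a particular preprocessing pipeline; the resize target is the 224 × 224 size mentioned above, and everything else is an assumption):

```python
# A minimal preprocessing sketch, assuming torchvision is available:
# resize each modality's image to the 224 x 224 input size mentioned above.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # enforce the 224 x 224 requirement
    transforms.ToTensor(),           # HWC uint8 -> CHW float in [0, 1]
])

# Usage: apply the same preprocessing to each modality before feeding the network.
# rgb_tensor = preprocess(rgb_pil_image)      # shape (3, 224, 224)
# depth_tensor = preprocess(depth_pil_image)
```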
According to some embodiments, the plurality of first features includes a plurality of first partial features. The input subnetwork is further configured to: determining a plurality of image regions in the input image of the corresponding modality, wherein the plurality of image regions in the input image of the corresponding modality correspond to the plurality of image regions of the image of any other modality of the plurality of modalities; and extracting image features from the plurality of regions respectively to obtain a plurality of first local features.
According to some embodiments, the plurality of first features includes a first global feature and a plurality of first local features.
In one embodiment, for an input color image $I_r$ and depth image $I_d$, image-block division may be performed first (e.g., a 224 × 224 image is divided into 14 × 14 = 196 image blocks of size 16 × 16). Subsequently, a linear mapping unit can be used to map the image blocks into vectors $x_{pat} \in \mathbb{R}^{n \times D}$, i.e., $n$ first local features (e.g., $n$ = 196). Subsequently, following the self-attention-based Transformer image processing model Vision Transformer (ViT), a class token $x_{cls} \in \mathbb{R}^{1 \times D}$ (i.e., the first global feature) can be added to complete the image processing task (e.g., classification). Furthermore, a position encoding $x_{pos} \in \mathbb{R}^{(n+1) \times D}$ may be added to the first local features and the first global feature. Finally, the three are combined as the output of the input sub-network: $z_0 = [x_{cls} \,\|\, x_{pat}] + x_{pos}$, $z_0 \in \mathbb{R}^{N \times D}$, $N = n + 1$, i.e., the plurality of first features, where $\|$ denotes feature concatenation.
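For illustration, the input sub-network described above could be sketched in PyTorch roughly as follows; this is a non-authoritative sketch, and the embedding dimension D = 768 and other details are assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Input sub-network sketch: split a 224x224 image into 14x14 = 196 patches
    of size 16x16, linearly map each patch to a D-dim vector, prepend a class
    token (first global feature), and add position encodings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        n = (img_size // patch_size) ** 2                      # n = 196 patches
        # a stride-16 conv is a common way to realize the linear patch mapping
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.x_cls = nn.Parameter(torch.zeros(1, 1, dim))      # first global feature
        self.x_pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # position encoding

    def forward(self, img):                                    # img: (B, C, 224, 224)
        x_pat = self.proj(img).flatten(2).transpose(1, 2)      # (B, 196, D) local features
        x_cls = self.x_cls.expand(img.shape[0], -1, -1)        # (B, 1, D)
        z0 = torch.cat([x_cls, x_pat], dim=1) + self.x_pos     # z0 = [x_cls || x_pat] + x_pos
        return z0                                              # (B, N, D), N = n + 1
```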
In some embodiments, since the color image and the depth image employ the same image block division, a plurality of image areas in the color image may correspond to a plurality of image areas in the depth image. Accordingly, the plurality of first features of the color image may correspond to the plurality of first features of the depth image.
According to some embodiments, as shown in FIG. 3A, in the neural network 300, at least one feature enhancement sub-network 330, 332 may be connected after the input sub-network to enhance the image features of the respective modalities. The other sub-networks in the neural network 300 are similar to the corresponding sub-networks in the neural network 200 and are not repeated here.
In one exemplary embodiment, a self-attention-based Transformer block may be used as the feature enhancement sub-network. Compared with a convolutional neural network, a self-attention-based Transformer, such as the image processing model ViT, has the following advantages. (1) ViT models long-term dependencies in a sequence. The dependence between local attack traces and real face regions runs through the whole network and provides a long-range cue. (2) ViT can capture multiple dependencies in parallel. Besides local attack traces, a face attack sample contains various global attack traces, such as recapture sensor noise, color distortion, and moiré patterns. ViT's multi-head self-attention mechanism can capture multiple global attack traces in parallel from a global view. (3) ViT can model multi-modal sequences. Because the domain differences among multi-modal samples are significant, convolutional neural networks typically use independent branches to learn modality-specific features in shallow layers and then fuse high-level semantic features. ViT, in contrast, can encode samples of different modalities into the same semantic space and naturally model all modality sequences without special handling of the domain differences between them. Thus, the present disclosure employs ViT to learn liveness features, which offers a number of advantages over using convolutional neural networks.
In some embodiments, the feature enhancement sub-network may include a first layer normalization layer, a multi-head self-attention layer, a second layer normalization layer, and a multi-layer perceptron, connected in sequence. The feature enhancement sub-network may also include a skip (crossing) connection from before the first layer normalization layer to after the multi-head self-attention layer, and a skip connection from before the second layer normalization layer to after the multi-layer perceptron, to further enhance the expressiveness of the features. The feature enhancement sub-network may enhance the plurality of first features of the corresponding modality to obtain an enhanced first global feature and a plurality of enhanced first local features.
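A minimal sketch of such a feature enhancement sub-network, assuming standard pre-norm Transformer conventions (hidden sizes are illustrative, not taken from this disclosure):

```python
import torch.nn as nn

class FeatureEnhanceBlock(nn.Module):
    """Feature-enhancement sub-network sketch: LayerNorm -> multi-head
    self-attention -> LayerNorm -> MLP, each wrapped by the skip (crossing)
    connections described above."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                         # first normalization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                         # second normalization layer
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]      # skip over attention
        z = z + self.mlp(self.norm2(z))                        # skip over the MLP
        return z
```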
After the feature enhancement sub-network, a first interaction sub-network can be used for multimodal information interaction. The first interaction subnetwork aims at mining the most efficient image local features through information interaction between different modalities.
According to some embodiments, the first features of the respective modalities may be processed using a third normalization layer before being input into the first interaction subnetwork.
According to some embodiments, determining the plurality of first attention scores for the modality may comprise: a plurality of first attention scores for the modality is determined based on a product of the first global feature of the modality and each of the plurality of first local features of the modality, the plurality of first attention scores corresponding to the plurality of first local features. Thus, by using the global feature to multiply each local feature separately, an attention score indicating the degree of importance of each feature can be obtained.
According to some embodiments, the first interaction sub-network may be further configured to: for each of a plurality of modalities, mapping a first global feature of the modality to a first query feature using a first query parameter; and mapping the plurality of first local features of the modality into a plurality of first key features and first value features using the first key parameter and the first value parameter, respectively. Thus, by mapping the features into query features, key features, and value features, a more accurate attention score and a more efficient feature can be obtained.
In some embodiments, the first query parameter, first key parameter, and first value parameter $W_q$, $W_k$, and $W_v$ may be used to extract the first query feature $q_{cls}$ of the first global feature and the plurality of first key features $k_{pat}$ and first value features $v_{pat}$ of the plurality of first local features, namely: $[q_{cls}, k_{pat}, v_{pat}] = [z_{cls} W_q, z_{pat} W_k, z_{pat} W_v]$.
According to some embodiments, determining the plurality of first attention scores for the modality includes: determining the plurality of first attention scores for the modality based on the product of the first query feature of the modality and each of the plurality of first key features of the modality.
In some embodiments, $q_{cls}$ and $k_{pat}$ are used to calculate the first attention scores (which may also be called an attention map) $map_{cls}$ of the first features in the modality, namely:

$$map_{cls} = \frac{q_{cls} \cdot k_{pat}^{T}}{\sqrt{D/h}}$$

where $D$ represents the feature dimension and $h$ represents the number of heads over which this attention is used in the multi-head mutual attention mechanism. In some embodiments, each of the branch networks may include a plurality of first interaction sub-networks, i.e., a plurality of heads, as will be described below.
According to some embodiments, adjusting the first attention scores of the corresponding modality may include: for each of the plurality of first attention scores of the corresponding modality, comparing the first attention score that each of the plurality of modalities has at the corresponding position with a preset threshold; and adjusting the first attention score of the corresponding modality based on the comparison result. In this way, the first attention scores of the corresponding modality are associated with the first attention scores of the other modalities, so that the branch network of the corresponding modality can focus on regions or features that receive high first attention scores in other modalities, realizing interaction of multi-modal information and improving the image processing capability of the neural network.
According to some embodiments, adjusting the first attention score of the corresponding modality based on the comparison result may include performing at least one of the following steps: increasing the first attention score in response to determining that at least one of the first attention scores of the plurality of modalities at the corresponding position is greater than the preset threshold; and decreasing the first attention score in response to determining that none of the first attention scores of the plurality of modalities at the corresponding position is greater than the preset threshold.
In some embodiments, the first attention score greater than the preset threshold may be maintained and other first attention scores may be reduced, the first attention score greater than the preset threshold may be increased and other first attention scores may be maintained, and the first attention score greater than the preset threshold may be increased and other first attention scores may be reduced, which is not limited herein.
In one exemplary embodiment, a mask matrix $M$ may be generated based on a threshold function $\Gamma_\lambda$ applied to $map_{cls}$ (i.e., the plurality of first attention scores): the mask output is 1 for each score whose value is greater than the threshold coefficient $\lambda$, and 0 otherwise, i.e.

$$M = \Gamma_\lambda(map_{cls}), \quad \Gamma_\lambda(a) = \begin{cases} 1, & a > \lambda \\ 0, & \text{otherwise.} \end{cases}$$

For the color image modality and the depth image modality, the respective branches generate mask matrices $M_r$ and $M_d$. Different modalities view the input image differently, and each modality has its own advantages, so the modalities are complementary to some extent. Therefore, in order to let different modalities interact, and to force each branch to learn regions that its own modality is not interested in but other modalities may be, the masks are added: $M = M_r + M_d$. It is understood that the threshold coefficient $\lambda$ may be set as required and is not limited herein. Subsequently, the resulting mask (i.e., the comparison result) may be used to adjust the first attention scores of the corresponding modality. For example, a softmax function and a selection function $\Gamma'_M$ may be applied to $map_{cls}$, so that the adjusted plurality of first attention scores is $\mathrm{softmax}[\Gamma'_M(map_{cls})]$, where $\Gamma'_M$ is

$$\Gamma'_M(A)_{a,b} = \begin{cases} A_{a,b}, & M_{a,b} > 0 \\ -\infty, & \text{otherwise,} \end{cases}$$

where $A_{a,b}$ denotes the value at position $(a, b)$ of $A$.
According to some embodiments, processing the plurality of first local features of the corresponding modality based on the adjusted respective first attention scores of the plurality of first local features of the corresponding modality may include: and obtaining a plurality of second characteristics of the corresponding modality based on the product of the adjusted plurality of first attention scores of the corresponding modality and the corresponding first value characteristics in the plurality of first value characteristics of the corresponding modality.
In one exemplary embodiment, the plurality of second features may be:
MA(z)=softmax[Γ′M(mapcls)]·vpat
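Putting the pieces of this first interaction sub-network together, a single-head sketch might look as follows; the class name, threshold coefficient, and dimensions are assumptions, not the patented implementation:

```python
import math
import torch
import torch.nn as nn

class ModalSharedMutualAttention(nn.Module):
    """First-interaction sub-network sketch (one head). The same W_q, W_k, W_v
    are shared by all modalities, as described above; dim and the threshold
    coefficient lambda are assumptions."""
    def __init__(self, dim=768, heads=12, lam=0.1):
        super().__init__()
        self.wq = nn.Linear(dim, dim // heads, bias=False)     # shared W_q
        self.wk = nn.Linear(dim, dim // heads, bias=False)     # shared W_k
        self.wv = nn.Linear(dim, dim // heads, bias=False)     # shared W_v
        self.scale = math.sqrt(dim / heads)                    # sqrt(D / h)
        self.lam = lam                                         # threshold coefficient

    def forward(self, z_by_modality):                          # list of (B, N, D) tensors
        scores, values = [], []
        for z in z_by_modality:
            q = self.wq(z[:, :1])                              # q_cls: (B, 1, d)
            k = self.wk(z[:, 1:])                              # k_pat: (B, n, d)
            values.append(self.wv(z[:, 1:]))                   # v_pat: (B, n, d)
            scores.append(q @ k.transpose(-2, -1) / self.scale)  # map_cls: (B, 1, n)
        # mask M = sum of per-modality threshold masks (Gamma_lambda)
        mask = sum((s > self.lam).float() for s in scores)
        out = []
        for s, v in zip(scores, values):
            s = s.masked_fill(mask == 0, float("-inf"))        # selection Gamma'_M
            out.append(torch.softmax(s, dim=-1) @ v)           # MA(z) per modality
        return out                                             # second features per modality
```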
According to some embodiments, the first interaction sub-network may map the first global feature and the first local features of each of the plurality of modalities using the same set of first query, first key, and first value parameters. The corresponding first interaction sub-networks in the respective branch networks of the plurality of modalities may use the same set of first query, first key, and first value parameters. That is, the $W_q$, $W_k$, and $W_v$ used when each modality is processed inside a first interaction sub-network are all the same, and corresponding first interaction sub-networks in different branch networks use the same set of $W_q$, $W_k$, and $W_v$. The term "corresponding" here refers in particular to the correspondence between the multiple first interaction sub-networks when each branch network includes several of them. For example, a first interaction sub-network of the color image modality and the corresponding first interaction sub-network of the depth image modality may use the same set of $W_q$, $W_k$, and $W_v$. Furthermore, the plurality of first interaction sub-networks within each branch network may use different sets of $W_q$, $W_k$, and $W_v$ to enrich the feature information extracted for the modality.
In this way, the number of model parameters can be greatly reduced without reducing the accuracy of the neural network's output, and the training speed of the model is improved.
In some embodiments, as shown in FIG. 3B, a plurality of second features $MA(z)$ may be obtained by processing the plurality of first features of each of the plurality of modalities through the above-described method.
According to some embodiments, each of the plurality of branch networks may comprise a first number of first interactive sub-networks, and the branch network further comprises: the first fusing sub-network is configured to fuse the plurality of second features respectively output by the first number of first interaction sub-networks to obtain a plurality of fused second features, wherein the plurality of fused second features comprise a second global feature and a plurality of second local features. Thus, by using multiple first interaction sub-networks (i.e., multiple heads of attention), the richness of feature information extracted at the modality may be enhanced.
In some embodiments, the plurality of second features respectively output by the first number of first interaction sub-networks may be concatenated, and the concatenated features mapped to obtain the fused plurality of second features. It is understood that these second features may be fused in other ways, which is not limited herein. In one exemplary embodiment, the first number may be 12, i.e., 12 attention heads are included.
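A rough sketch of such a fusion sub-network (concatenate, then linearly map), with assumed sizes:

```python
import torch
import torch.nn as nn

class HeadFusion(nn.Module):
    """First-fusion sub-network sketch: concatenate the outputs of the
    first number (e.g. 12) of first-interaction heads, then map back to D.
    Names and sizes are assumptions."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.proj = nn.Linear(heads * (dim // heads), dim)     # maps concatenation to D

    def forward(self, head_outputs):                           # list of (B, T, d) tensors
        fused = torch.cat(head_outputs, dim=-1)                # (B, T, heads * d)
        return self.proj(fused)                                # fused second features
```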
According to some embodiments, the output sub-network may be further configured to derive the first result based on a second global feature of the corresponding modality. The synthetic output sub-network may be further configured to obtain a second result based on a second global characteristic of each of the plurality of modalities. Thus, global features can be used for prediction.
In some embodiments, the output sub-network may for example comprise a plurality of fully connected layers and one classifier to predict the output classification result when the image to be processed comprises only input images of the corresponding modality.
In some embodiments, the synthetic output sub-network may include, for example, a plurality of fully connected layers and a classifier to predict the output classification result when the image to be processed includes input images of a plurality of modalities.
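For illustration, the per-modality output sub-network and the synthetic output sub-network might be sketched as follows, assuming a binary live/spoof classification and illustrative layer sizes (none of which are specified by this disclosure):

```python
import torch.nn as nn

# Output-head sketch: both output sub-networks are assumed to be stacked
# fully connected layers plus a classifier, as described above.
def make_head(in_dim=768, hidden=256, num_classes=2):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes),                        # e.g. live vs. spoof
    )

rgb_head = make_head()                   # first result, color branch
depth_head = make_head()                 # first result, depth branch
joint_head = make_head(in_dim=2 * 768)   # second result, concatenated global features
```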
According to some embodiments, as shown in fig. 3A, each of the plurality of branch networks further comprises: a second interaction subnetwork 334, 336 configured to: determining a target modality different from a corresponding modality among a plurality of modalities; determining a plurality of second attention scores of the target modality based on the second global feature of the corresponding modality and the plurality of first local features of the target modality, the plurality of second attention scores corresponding to the plurality of first local features; and processing the plurality of first local features of the target modality based on the plurality of second attention scores to obtain a plurality of third features of the corresponding modality. The second interaction subnetwork can fuse information of multiple modalities to obtain more efficient attention features.
According to some embodiments, the output sub-network may be further configured to obtain a third result based on a plurality of third characteristics of the corresponding modality.
According to some embodiments, the target modality is randomly determined among the plurality of modalities. In some embodiments, multiple second interaction sub-networks (i.e., multiple heads of attention) may be included in the branched network to ensure that each of the other modalities may interact with the corresponding modality for information using the second interaction sub-networks.
According to some embodiments, the second interaction sub-network may be further configured to: mapping a second global feature of the corresponding modality into a second query feature by using a second query parameter; and mapping the plurality of first local features of the target modality into a plurality of second key features and second value features using the second key parameters and second value parameters, respectively.
In some embodiments, determining the plurality of second attention scores for the target modality comprises: a plurality of second attention scores is determined based on a product of the second query feature of the corresponding modality and each of a plurality of second key features of the target modality.
Similarly, the second query parameter, second key parameter, and second value parameter $W'_q$, $W'_k$, and $W'_v$ can be used to extract the second query feature $q'_{cls}$ from the second global feature $z_{cls}$ of the corresponding modality, and the second key features $k'_{pat}$ and second value features $v'_{pat}$ from the first local features $\hat{z}_{pat}$ of the target modality, namely:

$$[q'_{cls}, k'_{pat}, v'_{pat}] = [z_{cls} W'_q, \hat{z}_{pat} W'_k, \hat{z}_{pat} W'_v].$$

Subsequently, the plurality of second attention scores can be obtained using a softmax function, namely:

$$\mathrm{softmax}\left(\frac{q'_{cls} \cdot k'^{T}_{pat}}{\sqrt{D/h}}\right).$$
in some embodiments, processing the plurality of first local features of the target modality based on the plurality of second attention scores comprises: and obtaining a plurality of third features of the corresponding modality based on the products of the plurality of second attention scores and the corresponding second value features in the plurality of second value features respectively.
In an exemplary embodiment, the product of the plurality of second attention scores and the plurality of second value features may be used directly as the plurality of third features, i.e.:

$$FA(z) = \mathrm{softmax}\left(\frac{q'_{cls} \cdot k'^{T}_{pat}}{\sqrt{D/h}}\right) \cdot v'_{pat}.$$
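A single-head sketch of this second interaction sub-network, under the same notational assumptions as the earlier sketches:

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Second-interaction sub-network sketch (one head): the corresponding
    modality's second global feature queries the target modality's first
    local features. Sizes are assumptions."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.wq = nn.Linear(dim, dim // heads, bias=False)     # second query parameter
        self.wk = nn.Linear(dim, dim // heads, bias=False)     # second key parameter
        self.wv = nn.Linear(dim, dim // heads, bias=False)     # second value parameter
        self.scale = math.sqrt(dim / heads)

    def forward(self, z_cls_own, z_pat_target):
        # z_cls_own: (B, 1, D) second global feature of this modality
        # z_pat_target: (B, n, D) first local features of the target modality
        q = self.wq(z_cls_own)
        k = self.wk(z_pat_target)
        v = self.wv(z_pat_target)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v                                        # FA(z): third features
```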
According to some embodiments, the corresponding second interaction subnetwork in the respective branched network of the plurality of modalities uses the same set of second query parameters, second key parameters, and second value parameters.
In some embodiments, as shown in FIG. 3C, a plurality of third features $FA(z)$ may be obtained through the above method using the second global feature of the current modality and the plurality of first local features of the target modality.
According to some embodiments, each of the plurality of branch networks comprises a second number of second interaction sub-networks, and the branch network further comprises: a second fusion sub-network configured to fuse the plurality of third features respectively output by the second number of second interaction sub-networks to obtain a fused plurality of third features, wherein the fused plurality of third features include a third global feature and a plurality of third local features. Thus, the attention features output by multiple heads may be fused to further enhance the features.
In some embodiments, the second query, key, and value parameters are different from the first query, key, and value parameters, and the second number (i.e., the number of heads $h$) may be the same as or different from the first number, which is not limited herein. It can be understood that the manner of fusing the plurality of third features output by the second number of second interaction sub-networks is similar to the manner of fusing the plurality of second features output by the first number of first interaction sub-networks, and the details are not repeated here.
According to some embodiments, the output sub-network is further configured to obtain the third result based on the third global feature of the corresponding modality. The synthetic output sub-network is further configured to obtain a third result based on the third global features of each of the plurality of modalities.
According to some embodiments, as shown in FIG. 3A, each of the plurality of branch networks further comprises at least one of: a crossing connection 338, 340 that fuses the plurality of first features of the corresponding modality into the plurality of second features of the corresponding modality; and a crossing connection 342, 344 that fuses the plurality of second features of the corresponding modality into the plurality of third features of the corresponding modality. Thus, the second and third features may be further enriched by the crossing connections.
In one exemplary embodiment, three groups in series may be provided between the input sub-network and the output sub-network of each branch network, each group in turn comprising three feature enhancement sub-networks in series, 12 first interaction sub-networks in parallel, a first fusion sub-network, 12 second interaction sub-networks in parallel, and a second fusion sub-network. Each first interaction sub-network receives the plurality of first features output by the last feature enhancement sub-network in the same group together with the plurality of first features output by the last feature enhancement sub-network in the corresponding group of every other modality, and each second interaction sub-network receives the plurality of first features output by the last feature enhancement sub-network in the corresponding group of the target modality together with the fused plurality of second features output by the first fusion sub-network in the same group. In this way, the feature enhancement sub-networks can fully learn the features of each modality, the first and second interaction sub-networks can fully interact and fuse information between the modalities, and the training speed can still be guaranteed. It is understood that the number and ordering of the sub-networks may be adjusted according to accuracy and performance requirements, which is not limited herein; a structural sketch of one such group is given below.
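The sketch reuses the illustrative classes defined above; the cross-modality wiring is left to the enclosing network, and all names and sizes are assumptions rather than the patented implementation:

```python
import torch.nn as nn

class BranchGroup(nn.Module):
    """One of the three serial groups sketched above: three feature-enhancement
    blocks, the 12 parallel first-interaction heads with their fusion, and the
    12 parallel second-interaction heads with their fusion. Only the structure
    is shown; the forward pass depends on how the enclosing network routes
    features between modalities."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.enhance = nn.Sequential(
            *[FeatureEnhanceBlock(dim, heads) for _ in range(3)])
        self.first_heads = nn.ModuleList(
            [ModalSharedMutualAttention(dim, heads) for _ in range(heads)])
        self.first_fuse = HeadFusion(dim, heads)
        self.second_heads = nn.ModuleList(
            [CrossModalAttention(dim, heads) for _ in range(heads)])
        self.second_fuse = HeadFusion(dim, heads)
```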
In summary, by using the neural network, especially the first interactive subnetwork and the second interactive subnetwork, information of multiple modalities can be interacted and fused, so that the neural network has better image processing capability, and meanwhile, the neural network can be trained by using multi-modality information and predicted by using single-modality data or preset data of part of modalities in the multi-modalities, so that convenience in model deployment is improved.
According to another aspect of the present disclosure, a method of training a neural network is provided. The neural network is the neural network described above, including a plurality of branch networks corresponding to a plurality of modalities. As shown in FIG. 4, the training method includes: step S401, acquiring a plurality of sample images and a real label, wherein the plurality of sample images are images of a sample object in the plurality of modalities; step S402, inputting the plurality of sample images respectively into the input sub-network of the branch network of the corresponding modality among the plurality of branch networks; step S403, acquiring the first prediction labels output by the output sub-networks of the plurality of branch networks; step S404, acquiring a second prediction label output by the synthetic output sub-network of the neural network; step S405, for each of the plurality of modalities, calculating a first loss value corresponding to the modality based on the first prediction label corresponding to the modality and the real label; step S406, calculating a second loss value based on the second prediction label and the real label; and step S407, adjusting parameters of the neural network based on at least one of the plurality of first loss values corresponding to the plurality of modalities and the second loss value.
In this way, the neural network is trained with multi-modal data, so that the trained network improves its understanding of multi-modal information through information interaction among the modalities, improving the accuracy of its output. Due to the particular structure of the neural network, even data covering only some of the modalities can still be processed at prediction time, which significantly improves the convenience of deploying the neural network.
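A hedged sketch of one training step following steps S401 to S407; the model interface and the equal weighting of the loss terms are assumptions:

```python
import torch.nn.functional as F

def train_step(model, optimizer, images_by_modality, labels):
    # model is assumed to return per-modality first prediction labels and the
    # synthetic second prediction label (steps S402-S404).
    first_preds, second_pred = model(images_by_modality)
    first_losses = [F.cross_entropy(p, labels) for p in first_preds]   # step S405
    second_loss = F.cross_entropy(second_pred, labels)                 # step S406
    loss = sum(first_losses) + second_loss                             # step S407
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```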
According to another aspect of the present disclosure, a method of image processing using a neural network is provided. The neural network is the neural network described above or a neural network trained using the training method described above, and includes a plurality of branch networks corresponding to a plurality of modalities. As shown in FIG. 5, the method includes: step S501, acquiring at least one image to be processed, the at least one image to be processed being an image of a target object in at least one of the plurality of modalities; step S502, inputting the at least one image to be processed respectively into the input sub-network of the branch network of the corresponding modality among the plurality of branch networks; step S503, in response to determining that the plurality of modalities include other modalities besides the at least one modality, inputting a target image to be processed among the at least one image to be processed into the input sub-networks of the branch networks of the other modalities; and step S504, in response to determining that the at least one modality is a single modality, acquiring a first image processing result output by the output sub-network of the branch network corresponding to that modality. This enables the neural network to process single-modality data. Because the neural network is trained with multi-modal data, it can output more accurate results and performs better than a neural network built solely for single-modality data.
According to some embodiments, the method may further include: in response to determining that the at least one modality is a plurality of modalities, acquiring a second image processing result output by the synthetic output sub-network of the neural network. Thus, when the data input to the neural network covers several modalities, the network can process it and still produce an accurate result, which greatly improves the convenience of model deployment.
According to some embodiments, the target image to be processed may, for example, be selected at random from the at least one image to be processed. It is to be understood that the target image may also be selected from the at least one image to be processed in other ways, for example by specifying a particular modality to use, and this is not limited herein.
According to some embodiments, when the images input to the neural network form a complete set covering all of the modalities, the result output by the synthetic output sub-network may be used as the final result.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 5, a block diagram of an electronic device 500, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including an input unit 506, an output unit 507, the storage unit 508, and a communication unit 509. The input unit 506 may be any type of device capable of inputting information to the device 500; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 507 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 508 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning network algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 501 performs the methods and processes described above, such as the method of training a neural network and the method of image processing using a neural network. For example, in some embodiments, these methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured in any other suitable way (e.g., by means of firmware) to perform the method of training a neural network and the method of image processing using a neural network.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package that executes partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that remedies the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and devices are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. It should be noted that, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (21)

1. A neural network comprising a plurality of branch networks corresponding to a plurality of modalities, wherein each branch network of the plurality of branch networks comprises:
an input sub-network configured to extract a plurality of first features of a corresponding modality in an input image of the corresponding modality, wherein the plurality of first features of the corresponding modality correspond to a plurality of first features of any other modality of the plurality of modalities;
a first interaction subnetwork configured to:
for each of the plurality of modalities, determining a plurality of first attention scores for the modality, the plurality of first attention scores corresponding to a plurality of first features of the modality;
adjusting a plurality of first attention scores for the corresponding modality based on a plurality of first attention scores for each of the plurality of modalities; and
processing a plurality of first features of the corresponding modality based on the adjusted plurality of first attention scores of the corresponding modality to obtain a plurality of second features of the corresponding modality; and
an output sub-network configured to derive a first result based on the plurality of second features of the corresponding modality,
wherein the neural network further comprises:
a synthetic output sub-network configured to derive a second result based on the plurality of second features of each of the plurality of modalities.
2. The neural network of claim 1, wherein adjusting the plurality of first attention scores of the corresponding modality comprises:
for each first attention score of the corresponding modality, comparing the first attention score of each of the plurality of modalities that corresponds to that first attention score with a preset threshold; and
adjusting the first attention score of the corresponding modality based on the comparison results.
3. The neural network of claim 2, wherein adjusting the first attention score of the corresponding modality based on the comparison results comprises performing at least one of:
increasing the first attention score in response to determining that at least one of the first attention scores of the plurality of modalities corresponding to that first attention score is greater than the preset threshold; and
decreasing the first attention score in response to determining that none of the first attention scores of the plurality of modalities corresponding to that first attention score is greater than the preset threshold.
4. The neural network of claim 1, wherein the plurality of first features includes a first global feature and a plurality of first local features, and wherein determining the plurality of first attention scores for the modality comprises:
determining the plurality of first attention scores for the modality based on a product of the first global feature of the modality and each of the plurality of first local features of the modality, the plurality of first attention scores corresponding to the plurality of first local features.
5. The neural network of claim 4, wherein the first interaction subnetwork is further configured to:
for each of the plurality of modalities, mapping the first global feature of the modality to a first query feature using a first query parameter; and
mapping the plurality of first local features of the modality into a plurality of first key features and a plurality of first value features using a first key parameter and a first value parameter, respectively,
wherein determining the plurality of first attention scores for the modality comprises:
determining the plurality of first attention scores for the modality based on a product of the first query feature of the modality and each of the plurality of first key features of the modality,
wherein processing the plurality of first features of the corresponding modality based on the adjusted plurality of first attention scores of the corresponding modality comprises:
obtaining the plurality of second features of the corresponding modality based on products of the adjusted plurality of first attention scores of the corresponding modality and the corresponding first value features among the plurality of first value features of the corresponding modality, respectively.
6. The neural network of claim 5, wherein the first interaction subnetwork maps the first global feature and the plurality of first local features of each of the plurality of modalities using the same set of first query, first key, and first value parameters, and wherein the corresponding first interaction subnetworks in the branch networks of the plurality of modalities use the same set of first query, first key, and first value parameters.
7. The neural network of any one of claims 4-6, wherein each of the plurality of branch networks includes a first number of first interaction subnetworks and further comprises:
a first fusion subnetwork configured to fuse the pluralities of second features respectively output by the first number of first interaction subnetworks to obtain a fused plurality of second features, wherein the fused plurality of second features includes a second global feature and a plurality of second local features,
wherein the output sub-network is further configured to derive the first result based on a second global feature of the corresponding modality,
and wherein the synthetic output sub-network is further configured to derive the second result based on the second global feature of each of the plurality of modalities.
8. The neural network of claim 7,
wherein each of the plurality of branch networks further comprises:
a second interaction subnetwork configured to:
determining a target modality different from the corresponding modality among the plurality of modalities;
determining a plurality of second attention scores for the target modality based on a second global feature of the corresponding modality and a plurality of first local features of the target modality, the plurality of second attention scores corresponding to the plurality of first local features; and
processing a plurality of first local features of the target modality based on the plurality of second attention scores to obtain a plurality of third features of the corresponding modality,
wherein the output sub-network is further configured to derive a third result based on a plurality of third features of the corresponding modality,
and wherein the synthetic output sub-network is further configured to derive a fourth result based on the plurality of third features of each of the plurality of modalities.
9. The neural network of claim 8, wherein the second interaction subnetwork is further configured to:
mapping a second global feature of the corresponding modality into a second query feature by using a second query parameter; and
mapping the plurality of first local features of the target modality into a plurality of second key features and a plurality of second value features using a second key parameter and a second value parameter, respectively,
wherein determining the plurality of second attention scores for the target modality comprises:
determining the plurality of second attention scores based on a product of the second query feature of the corresponding modality and each of the plurality of second key features of the target modality,
wherein processing the plurality of first local features of the target modality based on the plurality of second attention scores comprises:
obtaining the plurality of third features of the corresponding modality based on products of the plurality of second attention scores and the corresponding second value features among the plurality of second value features, respectively.
10. The neural network of claim 9, wherein the corresponding second interaction subnetworks in the branch networks of the plurality of modalities use the same set of second query, second key, and second value parameters.
11. The neural network of claim 8 or 9, wherein each of the plurality of branch networks includes a second number of second interaction subnetworks and further comprises:
a second fusion subnetwork configured to fuse the pluralities of third features respectively output by the second number of second interaction subnetworks to obtain a fused plurality of third features, wherein the fused plurality of third features includes a third global feature and a plurality of third local features,
wherein the output sub-network is further configured to derive the third result based on a third global feature of the corresponding modality,
and wherein the synthetic output sub-network is further configured to derive the fourth result based on the third global feature of each of the plurality of modalities.
12. The neural network of claim 11, wherein each of the plurality of branch networks further comprises at least one of:
a skip connection that fuses the plurality of first features of the corresponding modality into the plurality of second features of the corresponding modality; and
a skip connection that fuses the plurality of second features of the corresponding modality into the plurality of third features of the corresponding modality.
13. The neural network of claim 8, wherein the target modality is randomly determined among the plurality of modalities.
14. The neural network of claim 1, wherein the plurality of first features includes a plurality of first local features, and wherein the input sub-network is further configured to:
determining a plurality of image regions in the input image of the corresponding modality, wherein the plurality of image regions in the input image of the corresponding modality correspond to a plurality of image regions in an image of any other modality of the plurality of modalities; and
extracting image features from the plurality of image regions, respectively, to obtain the plurality of first local features.
15. The neural network of claim 1, wherein the plurality of modalities includes at least one of a color image modality, a depth image modality, and an infrared image modality, the input image being a face image.
16. A method of training a neural network, the neural network being the neural network according to any one of claims 1-15 and comprising a plurality of branch networks corresponding to a plurality of modalities, the method comprising:
acquiring a plurality of sample images and a real label, wherein the plurality of sample images are images of a sample object in the plurality of modalities;
inputting the plurality of sample images into the input sub-networks of the branch networks of the corresponding modalities among the plurality of branch networks, respectively;
acquiring first prediction labels output by the output sub-networks of the plurality of branch networks;
acquiring a second prediction label output by the synthetic output sub-network of the neural network;
for each of the plurality of modalities, calculating a first loss value corresponding to the modality based on the first prediction label corresponding to the modality and the real label;
calculating a second loss value based on the second prediction label and the real label; and
adjusting parameters of the neural network based on at least one of the plurality of first loss values corresponding to the plurality of modalities and the second loss value.
17. A method of image processing using a neural network, the neural network being the neural network according to any one of claims 1-15, or a neural network obtained using the training method according to claim 16, and comprising a plurality of branch networks corresponding to a plurality of modalities, the method comprising:
acquiring at least one image to be processed, wherein the at least one image to be processed is an image of a target object in at least one of the plurality of modalities;
inputting the at least one image to be processed into the input sub-networks of the branch networks of the corresponding modalities among the plurality of branch networks, respectively;
in response to determining that the plurality of modalities include other modalities than the at least one modality, inputting a target image to be processed among the at least one image to be processed into the input sub-networks of the branch networks of the other modalities among the plurality of branch networks; and
in response to determining that the at least one modality is a single modality, acquiring a first image processing result output by the output sub-network of the branch network corresponding to that modality among the plurality of branch networks.
18. The method of claim 17, further comprising:
in response to determining that the at least one modality is a plurality of modalities, acquiring a second image processing result output by the synthetic output sub-network of the neural network.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 16-18.
20. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 16-18.
21. A computer program product comprising a computer program, wherein the computer program implements the method of any one of claims 16-18 when executed by a processor.
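For illustration only, the second interaction subnetwork recited in claims 8, 9, and 13 could be sketched as follows; every name, shape, and the projection layout are assumptions for illustration rather than the claimed implementation:

```python
import random
import torch
import torch.nn as nn

class SecondInteraction(nn.Module):
    """Sketch of the second interaction subnetwork (cf. claims 8, 9, 13).

    Cross-modal attention: the corresponding modality's second global
    feature queries the first local features of a randomly chosen target
    modality. Dimensions and names are illustrative assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)  # second query parameter
        self.w_k = nn.Linear(dim, dim)  # second key parameter
        self.w_v = nn.Linear(dim, dim)  # second value parameter

    def forward(self, global_feat, local_feats_by_modality, own_modality):
        # Claim 13: the target modality is chosen at random among the
        # modalities different from the corresponding modality.
        others = [m for m in local_feats_by_modality if m != own_modality]
        target = random.choice(others)
        locals_ = local_feats_by_modality[target]      # (N, D)

        q = self.w_q(global_feat)                      # (D,)
        k = self.w_k(locals_)                          # (N, D)
        v = self.w_v(locals_)                          # (N, D)

        # Claim 9: second attention scores from the product of the query
        # with each key feature; third features from score-weighted values.
        scores = (k @ q).softmax(dim=-1)               # (N,)
        return scores.unsqueeze(-1) * v                # (N, D) third features
```

Under claim 10, the branch networks would share a single instance of this module (one set of second query, key, and value parameters) rather than holding separate copies.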
CN202210152340.0A 2022-02-18 2022-02-18 Image processing method, neural network, training method, training device and training medium thereof Active CN114550313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152340.0A CN114550313B (en) 2022-02-18 2022-02-18 Image processing method, neural network, training method, training device and training medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210152340.0A CN114550313B (en) 2022-02-18 2022-02-18 Image processing method, neural network, training method, training device and training medium thereof

Publications (2)

Publication Number Publication Date
CN114550313A true CN114550313A (en) 2022-05-27
CN114550313B CN114550313B (en) 2024-08-06

Family

ID=81674856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210152340.0A Active CN114550313B (en) 2022-02-18 2022-02-18 Image processing method, neural network, training method, training device and training medium thereof

Country Status (1)

Country Link
CN (1) CN114550313B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082430A (en) * 2022-07-20 2022-09-20 中国科学院自动化研究所 Image analysis method and device and electronic equipment
CN115409855A (en) * 2022-09-20 2022-11-29 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10248664B1 (en) * 2018-07-02 2019-04-02 Inception Institute Of Artificial Intelligence Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
EP3611665A1 (en) * 2018-08-17 2020-02-19 Siemens Aktiengesellschaft Mapping images to the synthetic domain
US20200089755A1 (en) * 2017-05-19 2020-03-19 Google Llc Multi-task multi-modal machine learning system
CN111640119A (en) * 2020-04-09 2020-09-08 北京邮电大学 Image processing method, processing device, electronic equipment and storage medium
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112381164A (en) * 2020-11-20 2021-02-19 北京航空航天大学杭州创新研究院 Ultrasound image classification method and device based on multi-branch attention mechanism
CN112561060A (en) * 2020-12-15 2021-03-26 北京百度网讯科技有限公司 Neural network training method and device, image recognition method and device and equipment
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113807440A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Method, apparatus, and medium for processing multimodal data using neural networks

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200089755A1 (en) * 2017-05-19 2020-03-19 Google Llc Multi-task multi-modal machine learning system
US10248664B1 (en) * 2018-07-02 2019-04-02 Inception Institute Of Artificial Intelligence Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
EP3611665A1 (en) * 2018-08-17 2020-02-19 Siemens Aktiengesellschaft Mapping images to the synthetic domain
CN111640119A (en) * 2020-04-09 2020-09-08 北京邮电大学 Image processing method, processing device, electronic equipment and storage medium
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112381164A (en) * 2020-11-20 2021-02-19 北京航空航天大学杭州创新研究院 Ultrasound image classification method and device based on multi-branch attention mechanism
CN112561060A (en) * 2020-12-15 2021-03-26 北京百度网讯科技有限公司 Neural network training method and device, image recognition method and device and equipment
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113807440A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Method, apparatus, and medium for processing multimodal data using neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓一姣; 张凤荔; 陈学勤; 艾擎; 余苏喆: "A Collaborative Attention Network Model for Cross-Modal Retrieval" (面向跨模态检索的协同注意力网络模型), Computer Science (计算机科学), no. 04, 31 December 2020 (2020-12-31), pages 60-65 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082430A (en) * 2022-07-20 2022-09-20 中国科学院自动化研究所 Image analysis method and device and electronic equipment
CN115082430B (en) * 2022-07-20 2022-12-06 中国科学院自动化研究所 Image analysis method and device and electronic equipment
CN115409855A (en) * 2022-09-20 2022-11-29 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115409855B (en) * 2022-09-20 2023-07-07 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114550313B (en) 2024-08-06

Similar Documents

Publication Publication Date Title
CN114155543B (en) Neural network training method, document image understanding method, device and equipment
CN113807440B (en) Method, apparatus, and medium for processing multimodal data using neural networks
CN114612749B (en) Neural network model training method and device, electronic device and medium
CN115438214B (en) Method and device for processing text image and training method of neural network
CN114743196B (en) Text recognition method and device and neural network training method
CN115082740B (en) Target detection model training method, target detection device and electronic equipment
CN113656587B (en) Text classification method, device, electronic equipment and storage medium
CN114550313B (en) Image processing method, neural network, training method, training device and training medium thereof
CN114972958B (en) Key point detection method, neural network training method, device and equipment
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN114445667A (en) Image detection method and method for training image detection model
CN115422389A (en) Method for processing text image, neural network and training method thereof
CN114723949A (en) Three-dimensional scene segmentation method and method for training segmentation model
CN116597454B (en) Image processing method, training method and device of image processing model
CN110738261B (en) Image classification and model training method and device, electronic equipment and storage medium
CN115809325B (en) Document processing model training method, document processing method, device and equipment
CN115862031A (en) Text processing method, neural network training method, device and equipment
CN115578501A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114998963A (en) Image detection method and method for training image detection model
CN114693977A (en) Image processing method, model training method, device, equipment and medium
CN114429678A (en) Model training method and device, electronic device and medium
CN115331077B (en) Training method of feature extraction model, target classification method, device and equipment
CN113793290B (en) Parallax determining method, device, equipment and medium
CN115879468B (en) Text element extraction method, device and equipment based on natural language understanding
CN115100431A (en) Target detection method, neural network, and training method, device, and medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant