CN116597454A - Image processing method, training method and device of image processing model

Info

Publication number: CN116597454A
Application number: CN202310597526.1A
Authority: CN
Inventors: 钦夏孟; 李煜林; 谢群义; 姚锟; 韩钧宇
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2023-05-24
Filing date: 2023-05-24
Publication date: 2023-08-15
Anticipated expiration: 2043-05-24
Also published as: CN116597454B

Abstract

The disclosure provides an image processing method, and a training method and apparatus for an image processing model. It relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and is applicable to scenarios such as OCR and intelligent government affairs. The image processing method includes: performing text recognition on a target image to obtain a plurality of text regions and their respective text content; extracting a plurality of first visual features characterizing visual modality information of the plurality of text regions, and a plurality of first text features characterizing text modality information of the text content included in each text region; fusing the plurality of first text features based on the plurality of first visual features to obtain a plurality of first text fusion features; fusing the plurality of first visual features based on the plurality of first text features to obtain a plurality of first visual fusion features; and enhancing the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain an image processing result.

Description

Image processing method, training method and device of image processing model

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and is applicable to scenarios such as OCR and intelligent government affairs. More specifically, it relates to an image processing method, a training method of an image processing model, an image processing apparatus, a training apparatus of an image processing model, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Artificial intelligence is the discipline that studies how to make a computer mimic certain mental processes and intelligent behaviors of a person (e.g., learning, reasoning, thinking, planning), and it involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

Multimodal large models are currently popular: multimodal pre-training tasks are designed over large amounts of data, and model performance has improved rapidly as a result.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides an image processing method, an image processing model training method, an image processing apparatus, an image processing model training apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

According to an aspect of the present disclosure, there is provided an image processing method including: performing text recognition on the target image to obtain a plurality of text areas and text contents included in each text area in the plurality of text areas; extracting a plurality of first visual features characterizing visual modality information for a plurality of text regions; extracting a plurality of first text features, wherein the plurality of first text features characterize text modal information of text contents included in the plurality of text regions; based on the plurality of first visual features, respectively fusing at least one group of adjacent first text features in the plurality of first text features to obtain a plurality of first text fusion features, wherein the number of the plurality of first text fusion features is smaller than that of the plurality of first text features; based on the plurality of first text features, respectively fusing at least one group of adjacent first visual features in the plurality of first visual features to obtain a plurality of first visual fusion features, wherein the number of the plurality of first visual fusion features is less than that of the plurality of first visual features; enhancing the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain a plurality of second visual features and a plurality of second text features; and obtaining an image processing result based on the plurality of second visual features and the plurality of second text features.

According to an aspect of the present disclosure, there is provided a training method of an image processing model, the method including: determining a real image processing result of a sample image, a plurality of sample text regions in the sample image, and sample text content included in each of the plurality of sample text regions; extracting a plurality of first sample visual features characterizing visual modality information of the plurality of sample text regions; extracting a plurality of first sample text features, wherein the plurality of first sample text features characterize text modality information of the sample text content included in the plurality of sample text regions; based on the plurality of first sample visual features, fusing at least one group of adjacent first sample text features of the plurality of first sample text features, respectively, using a neural network model to obtain a plurality of first sample text fusion features, the number of the plurality of first sample text fusion features being less than the number of the plurality of first sample text features; based on the plurality of first sample text features, fusing at least one group of adjacent first sample visual features of the plurality of first sample visual features, respectively, using the neural network model to obtain a plurality of first sample visual fusion features, the number of the plurality of first sample visual fusion features being less than the number of the plurality of first sample visual features; enhancing, using the neural network model, the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on an attention mechanism to obtain a plurality of second sample visual features and a plurality of second sample text features; obtaining a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features; and adjusting parameters of the neural network model based on the real image processing result and the predicted image processing result to obtain the image processing model.

According to an aspect of the present disclosure, there is provided an image processing apparatus including: a text recognition unit configured to perform text recognition on a target image to obtain a plurality of text regions and text content included in each of the plurality of text regions; a first extraction unit configured to extract a plurality of first visual features characterizing visual modality information of the plurality of text regions; a second extraction unit configured to extract a plurality of first text features, the plurality of first text features characterizing text modality information of the text content included in the plurality of text regions; a first fusion unit configured to fuse at least one group of adjacent first text features of the plurality of first text features, respectively, based on the plurality of first visual features to obtain a plurality of first text fusion features, the number of the plurality of first text fusion features being less than the number of the plurality of first text features; a second fusion unit configured to fuse at least one group of adjacent first visual features of the plurality of first visual features, respectively, based on the plurality of first text features to obtain a plurality of first visual fusion features, the number of the plurality of first visual fusion features being less than the number of the plurality of first visual features; a first enhancement unit configured to enhance the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain a plurality of second visual features and a plurality of second text features; and a first processing unit configured to obtain an image processing result based on the plurality of second visual features and the plurality of second text features.

According to an aspect of the present disclosure, there is provided a training apparatus of an image processing model, the apparatus including: a determination unit configured to determine a real image processing result of a sample image, a plurality of sample text regions in the sample image, and sample text content included in each of the plurality of sample text regions; a third extraction unit configured to extract a plurality of first sample visual features characterizing visual modality information of the plurality of sample text regions; a fourth extraction unit configured to extract a plurality of first sample text features, the plurality of first sample text features characterizing text modality information of the sample text content included in the plurality of sample text regions; a third fusion unit configured to fuse at least one group of adjacent first sample text features of the plurality of first sample text features, respectively, using a neural network model based on the plurality of first sample visual features to obtain a plurality of first sample text fusion features, the number of the plurality of first sample text fusion features being less than the number of the plurality of first sample text features; a fourth fusion unit configured to fuse at least one group of adjacent first sample visual features of the plurality of first sample visual features, respectively, using the neural network model based on the plurality of first sample text features to obtain a plurality of first sample visual fusion features, the number of the plurality of first sample visual fusion features being less than the number of the plurality of first sample visual features; a second enhancement unit configured to enhance, using the neural network model, the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on an attention mechanism to obtain a plurality of second sample visual features and a plurality of second sample text features; a second processing unit configured to obtain a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features; and a parameter adjusting unit configured to adjust parameters of the neural network model based on the real image processing result and the predicted image processing result to obtain the image processing model.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method.

According to one or more embodiments of the present disclosure, the features of the visual modality are used as guidance to reduce the number of features of the text modality, and the features of the text modality are used as guidance to reduce the number of features of the visual modality; the fused text features and visual features are then further enhanced using an attention mechanism, so that sufficient interaction between the visual modality and the text modality is achieved. In addition, this reduces the number of features that the attention mechanism must process, which reduces the time consumed by the attention mechanism and improves image processing efficiency.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;

FIG. 2 illustrates a flowchart of an image processing method according to an exemplary embodiment of the present disclosure;

FIG. 3 illustrates a flowchart of extracting a plurality of first text features characterizing text modality information in accordance with an exemplary embodiment of the present disclosure;

FIG. 4 illustrates a flowchart of fusing adjacent first text features based on first visual features to obtain first text fusion features according to an exemplary embodiment of the present disclosure;

FIG. 5 illustrates a schematic diagram of a neural network model, according to an exemplary embodiment of the present disclosure;

FIG. 6 illustrates a flowchart of deriving image processing results based on a plurality of second visual features and a plurality of second text features, according to an exemplary embodiment of the present disclosure;

FIG. 7 illustrates a flowchart of deriving image processing results based on a plurality of second visual features and a plurality of second text features, according to an exemplary embodiment of the present disclosure;

FIG. 8 illustrates a schematic diagram of a neural network model, according to an exemplary embodiment of the present disclosure;

FIG. 9 illustrates a flowchart of a training method of an image processing model according to an exemplary embodiment of the present disclosure;

FIG. 10 illustrates a block diagram of an image processing apparatus according to an exemplary embodiment of the present disclosure;

FIG. 11 illustrates a block diagram of a training apparatus of an image processing model according to an exemplary embodiment of the present disclosure; and

FIG. 12 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

In the related art, the attention mechanism in multimodal large models is time-consuming, and the interaction between modalities is poor.

To solve the above problems, the present disclosure uses the features of the visual modality as guidance to reduce the number of features of the text modality, and uses the features of the text modality as guidance to reduce the number of features of the visual modality; the fused text features and visual features are then enhanced using an attention mechanism, achieving sufficient interaction between the visual modality and the text modality. In addition, this reduces the number of features that the attention mechanism must process, which reduces the time consumed by the attention mechanism and improves image processing efficiency.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the image processing method or the training method of the image processing model.

In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, such as provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) network.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

The user may use client devices 101, 102, 103, 104, 105, and/or 106 for human-machine interaction. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smartphones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.

In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

According to an aspect of the present disclosure, there is provided an image processing method. As shown in fig. 2, the image processing method includes: step S201, performing text recognition on a target image to obtain a plurality of text regions and text content included in each of the plurality of text regions; step S202, extracting a plurality of first visual features characterizing visual modality information of the plurality of text regions; step S203, extracting a plurality of first text features, wherein the plurality of first text features characterize text modality information of the text content included in the plurality of text regions; step S204, based on the plurality of first visual features, respectively fusing at least one group of adjacent first text features of the plurality of first text features to obtain a plurality of first text fusion features, wherein the number of the plurality of first text fusion features is less than the number of the plurality of first text features; step S205, based on the plurality of first text features, respectively fusing at least one group of adjacent first visual features of the plurality of first visual features to obtain a plurality of first visual fusion features, wherein the number of the plurality of first visual fusion features is less than the number of the plurality of first visual features; step S206, enhancing the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain a plurality of second visual features and a plurality of second text features; and step S207, obtaining an image processing result based on the plurality of second visual features and the plurality of second text features.

Thus, the features of the visual modality are used as guidance to reduce the number of features of the text modality, and the features of the text modality are used as guidance to reduce the number of features of the visual modality; the fused text features and visual features are then enhanced using an attention mechanism, achieving sufficient interaction between the visual modality and the text modality. In addition, this reduces the number of features that the attention mechanism must process, which reduces the time consumed by the attention mechanism and improves image processing efficiency.
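The effect of steps S204 to S206 on the attention workload can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the group size of two, the use of the mean of the guiding modality as a guide vector, and the softmax weighting are not taken from the disclosure, which does not fix a particular fusion rule at this point. The pair count further assumes a standard self-attention layer in which every feature is scored against every other feature.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def guided_fuse_adjacent(features, guide_features, group_size=2):
    """Fuse each run of adjacent features into one feature.

    Hypothetical rule: weights inside a group are a softmax over the dot
    product between each feature and the mean of the guiding modality.
    """
    guide = mean_vector(guide_features)
    fused = []
    for start in range(0, len(features), group_size):
        group = features[start:start + group_size]
        scores = [sum(a * b for a, b in zip(vec, guide)) for vec in group]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(group[0])
        fused.append(
            [sum(w * vec[i] for w, vec in zip(weights, group)) for i in range(dim)]
        )
    return fused


def attention_pair_count(num_visual, num_text):
    """Pairwise scores computed by full self-attention over both modalities."""
    n = num_visual + num_text
    return n * n


# Toy first text features and first visual features (placeholder values).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
visual = [[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

fused_text = guided_fuse_adjacent(text, visual)    # text fused under visual guidance
fused_visual = guided_fuse_adjacent(visual, text)  # visual fused under text guidance

pairs_before = attention_pair_count(len(visual), len(text))             # 8 * 8 = 64
pairs_after = attention_pair_count(len(fused_visual), len(fused_text))  # 4 * 4 = 16
```

Under these assumptions, halving the number of features in each modality cuts the number of pairwise attention scores to a quarter, which is the kind of saving the paragraph above refers to.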

In some embodiments, the target image may be an image including text, such as a document image or the like. In step S201, various types of optical character recognition (Optical Character Recognition, OCR) methods may be used to perform text recognition on the target image to obtain a plurality of text regions containing text in the target image, and text content included in each text region may be obtained. In some embodiments, the text region may be represented using coordinates of the corner points, and the text content may include, for example, letters, symbols, characters, words, sentences, paragraphs, and the like.
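One possible in-memory shape for the output of step S201 is sketched below. The `TextRegion` class, its field names, and the sample values are hypothetical; the description only states that a region may be represented by the coordinates of its corner points and that it carries text content.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextRegion:
    """A recognized text region (hypothetical structure)."""
    corners: List[Point]  # corner points of the region in image coordinates
    content: str          # text content recognized inside the region

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the corners."""
        xs = [x for x, _ in self.corners]
        ys = [y for _, y in self.corners]
        return (min(xs), min(ys), max(xs), max(ys))


# Placeholder OCR output for a document image; the values are made up.
regions = [
    TextRegion([(10, 10), (110, 10), (110, 30), (10, 30)], "Invoice No."),
    TextRegion([(120, 12), (200, 10), (200, 30), (120, 32)], "2023-0001"),
]

boxes = [region.bounding_box() for region in regions]
```

Keeping the corner points rather than only a bounding box preserves rotated or skewed regions, while the box is convenient for cropping features in the next step.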

In some embodiments, at step S202, a first visual feature characterizing the visual modality information of each of the plurality of text regions may be extracted using a visual feature extraction model or otherwise. The visual modality information may be, for example, image features, and the visual feature extraction model may be, for example, a convolutional neural network model. The pooled region of interest (RoI) visual features for each text region may be extracted using a region of interest alignment (RoIAlign) or region of interest pooling (RoIPooling) approach. Further, a position vector may be added to the first visual feature to indicate the correspondence between different first visual features and different text regions, and may indicate a particular order (e.g., reading order).
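The idea behind step S202 can be sketched with a deliberately simplified pooling routine. This is not RoIAlign or RoIPooling as implemented in any library: it averages feature-map cells inside integer, end-exclusive box bounds and performs no bilinear sampling. The feature map, boxes, and position vectors are placeholder values, and all names are hypothetical.

```python
def roi_average_pool(feature_map, box):
    """Average the feature vectors of all cells inside a box.

    feature_map: H x W x C nested lists (e.g. a CNN output).
    box: (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    channels = len(cells[0])
    return [sum(cell[i] for cell in cells) / len(cells) for i in range(channels)]


def add_position(feature, position_vector):
    """Add a position vector so the feature carries region identity and order."""
    return [f + p for f, p in zip(feature, position_vector)]


# Toy 3 x 4 feature map with 2 channels.
feature_map = [[[float(x + y), float(x * y)] for x in range(4)] for y in range(3)]

# One box per text region, listed in reading order (placeholder values).
region_boxes = [(0, 0, 2, 2), (2, 1, 4, 3)]
position_vectors = [[0.0, 0.0], [0.1, 0.1]]

first_visual_features = [
    add_position(roi_average_pool(feature_map, box), pos)
    for box, pos in zip(region_boxes, position_vectors)
]
```

Each text region yields exactly one pooled vector here, which matches the one-to-one correspondence between first visual features and text regions mentioned later in the description.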

In some embodiments, at step S203, a plurality of first text features characterizing text modality information of text content included in each of the plurality of text regions may be extracted using a text feature extraction model or otherwise. The text feature extraction model may be, for example, a word embedding model, or may be another neural network model for natural language processing. It will be appreciated that the number of first text features need not be the same as the number of text regions, e.g. a text region may correspond to a plurality of first text features.

According to some embodiments, as shown in fig. 3, extracting the plurality of first text features characterizing the text modality information of the text content included in each of the plurality of text regions in step S203 may include: step S301, segmenting the text content included in each text region of the plurality of text regions to obtain at least one word segment corresponding to the text region; step S302, performing word embedding on the at least one word segment corresponding to each text region of the plurality of text regions to obtain at least one word embedding feature corresponding to the text region; and step S303, obtaining the plurality of first text features based on the at least one word embedding feature corresponding to each text region of the plurality of text regions.

Thus, by segmenting the text content in each text region and word embedding each segmented word, it is possible to obtain the first text feature containing rich information corresponding to the text content of the text region based on the word embedding feature corresponding to the segmented word.

In some embodiments, text content may be segmented with individual words, characters, words, or other granularity. The text feature extraction model mentioned above may be, for example, a word embedding layer. A position vector may be added to the word embedding feature or the first text feature to indicate correspondence between different word embedding features or the first visual feature and different text regions, and may indicate a particular order (e.g., reading order).

In some embodiments, in step S303, each word embedding feature corresponding to each text region may be used as a first text feature, thereby obtaining a large number of first text features. In this way, the information contained in the first text features characterizing the text modality information can be enriched as much as possible, while in the subsequent step S204 the number of first text features can still be reduced by means of fusion. After processing in the above manner, the number of first text features is typically greater than the number of first visual features (the text content of each text region typically includes more than one word segment).

According to some embodiments, step S303, deriving the plurality of first text features based on the at least one word embedding feature corresponding to each of the plurality of text regions may include: at least one word embedding feature corresponding to each text region in the plurality of text regions is fused to obtain a first text feature corresponding to the text region. The plurality of first text features may be in one-to-one correspondence with the plurality of text regions, and the plurality of first visual features may be in one-to-one correspondence with the plurality of text regions.

Because there is a correspondence between text regions, text content, and word embedding features (i.e., each text region has corresponding text content that includes a number of words and thus corresponds to word embedding features derived based on the words), at least one word embedding feature corresponding to each text region may be fused based on such correspondence to derive a first text feature that characterizes text modality information for the entirety of the text content included in the text region, thereby reducing the number of the first text feature according to an inherent logical relationship between the text region, the text content, and the word embedding feature.

In some embodiments, the fusion of word embedding features may be achieved by direct summation, weighted averaging, concatenation, processing using a multi-layer perceptron, or the like, or any combination thereof, without limitation. In an exemplary embodiment, the word embedding features corresponding to each text region may be fused by averaging to obtain the first text feature corresponding to the text region. After processing in the above manner, the number of first text features is the same as the number of first visual features.
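As a minimal sketch of the averaging option in the exemplary embodiment, the word embedding features of each text region can be reduced to a single first text feature as shown below. The function name `fuse_by_region` and the `(region_index, feature)` input layout are illustrative assumptions; each region is assumed to have at least one word embedding feature.

```python
def fuse_by_region(word_features, num_regions):
    """Average the word embedding features of each region (one option for S303)."""
    sums = [None] * num_regions
    counts = [0] * num_regions
    for region_index, vec in word_features:
        if sums[region_index] is None:
            sums[region_index] = list(vec)
        else:
            sums[region_index] = [a + b for a, b in zip(sums[region_index], vec)]
        counts[region_index] += 1
    return [[v / counts[i] for v in sums[i]] for i in range(num_regions)]


word_features = [(0, [1.0, 2.0]), (0, [3.0, 4.0]), (1, [5.0, 6.0])]
first_text_features = fuse_by_region(word_features, num_regions=2)
print(first_text_features)  # [[2.0, 3.0], [5.0, 6.0]]
```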

In some embodiments, at step S204, at least one set of adjacent first text features may be determined from the plurality of first text features, and each set of adjacent first text features may be fused based on the plurality of first visual features, resulting in a plurality of first text fusion features that are fewer in number than the first text features. It will be appreciated that the plurality of first text features may have a corresponding ordering. In an exemplary embodiment, the text regions corresponding to the first text features may be ranked according to their respective locations in the target image to obtain a plurality of first text features in a list, and each of the at least one set of adjacent first text features may include, for example, a consecutive plurality of first text features in the list. It will be appreciated that the ordering relationship of the plurality of first text features may also be determined in other ways, and is not limited herein.

In some embodiments, the at least one set of adjacent first text features may be determined by predetermined parameters, for example, each set including a first predetermined number of features, adjacent sets sharing a second predetermined number of features, or adjacent sets being separated by a third predetermined number of features, etc., without limitation. In some embodiments, the above-described process may also be implemented using a neural network model, i.e., the plurality of first text features and the plurality of first visual features are input into the neural network model to obtain a plurality of first text fusion features output by the neural network model. It is understood that the neural network model may be, for example, trained using the training method of the image processing model provided by the present disclosure. It should be noted that the number of features after fusion should be smaller than the number of features before fusion.
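One possible reading of the predetermined parameters is a window size (the number of features in each set) and a stride (the offset between the starts of adjacent sets). The sketch below enumerates the resulting sets of adjacent feature indices; the name `adjacent_groups` and the choice to drop trailing features that do not fill a window are assumptions of this sketch.

```python
def adjacent_groups(num_features, window, stride):
    """Return index lists of adjacent features; sets overlap when stride < window."""
    groups = []
    start = 0
    while start + window <= num_features:
        groups.append(list(range(start, start + window)))
        start += stride
    return groups


print(adjacent_groups(6, window=2, stride=2))  # [[0, 1], [2, 3], [4, 5]]
print(adjacent_groups(6, window=3, stride=2))  # [[0, 1, 2], [2, 3, 4]]
```

In the second call adjacent sets share one feature (window minus stride), and the last feature is left out because it cannot fill a complete window.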

According to some embodiments, as shown in fig. 4, the fusing at least one set of adjacent first text features in the plurality of first text features based on the plurality of first visual features to obtain the plurality of first text fusion features may include: step S401, generating a plurality of first text fusion weights corresponding to the plurality of first text features one by one based on the plurality of first visual features; and step S402, fusing the plurality of first text features based on the plurality of first text fusion weights to obtain a plurality of first text fusion features.

Therefore, by generating, based on the first visual features, first text fusion weights in one-to-one correspondence with the first text features, the fusion of the first text features is guided by the first visual features characterizing the visual modality information.

In some embodiments, the plurality of first text fusion weights may be generated using the neural network model described above. In step S402, each set of neighboring first text features may be fused based on the first text fusion weights to obtain corresponding first text fusion features.

According to some embodiments, step S402, fusing the plurality of first text features based on the plurality of first text fusion weights to obtain the plurality of first text fusion features may include: and carrying out text weighted average pooling on the plurality of first text features based on the plurality of first text fusion weights to obtain a plurality of first text fusion features.

Therefore, by using a weighted average pooling mode, adjacent first text features in the plurality of first text features can be fused rapidly, so that a plurality of first text fusion features can be obtained.
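A minimal sketch of steps S401 and S402 is given below under several assumptions that the disclosure does not specify: the first text features correspond one-to-one with the first visual features, the weight generator is a single linear projection followed by an exponential (so that weights are positive), and weights are normalized within each pooling window. The names `linear_weights` and `weighted_average_pool` are hypothetical.

```python
import math


def linear_weights(visual_features, w, b):
    """Hypothetical linear unit: map each visual feature to one positive scalar weight."""
    return [math.exp(sum(wi * vi for wi, vi in zip(w, v)) + b) for v in visual_features]


def weighted_average_pool(features, weights, window, stride):
    """Fuse each window of adjacent features into one weighted-average feature."""
    fused = []
    for start in range(0, len(features) - window + 1, stride):
        idx = range(start, start + window)
        total = sum(weights[i] for i in idx)
        dim = len(features[start])
        fused.append([
            sum(weights[i] * features[i][d] for i in idx) / total
            for d in range(dim)
        ])
    return fused


# Toy data: 4 text features and 4 visual features, one per text region.
text_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
visual_features = [[0.0], [0.0], [0.0], [math.log(3.0)]]
weights = linear_weights(visual_features, w=[1.0], b=0.0)  # approx. [1, 1, 1, 3]
text_fusion = weighted_average_pool(text_features, weights, window=2, stride=2)
print(text_fusion)  # approx. [[2.0, 0.0], [0.0, 3.5]]
```

Four first text features become two first text fusion features, and the larger weight derived from the fourth visual feature pulls the second fused feature toward the fourth text feature.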

In some embodiments, a first hyper-parameter corresponding to the text weighted average pooling may include a pooling window size and a step size. The pooling window size describes the number of adjacent first text features in each group; the step size describes the distance between adjacent groups and, together with the window size, determines the number of first text fusion features finally obtained. The first hyper-parameter may be set manually during the training phase.
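For illustration, if the pooling is applied without padding to N first text features with a pooling window size k and a step size s (with N not smaller than k), the number of first text fusion features follows the usual pooling relation below. The symbols N, k, s and the no-padding assumption are introduced here for the example and do not appear in the disclosure.

```latex
N_{\text{out}} = \left\lfloor \frac{N - k}{s} \right\rfloor + 1,
\qquad \text{e.g. } N = 8,\; k = 2,\; s = 2 \;\Rightarrow\; N_{\text{out}} = 4 .
```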

According to some embodiments, a first hyper-parameter corresponding to a weighted average pooling of the plurality of first text features may be determined based on the plurality of first visual features, and the first hyper-parameter may include a pooling window size and a step size corresponding to a weighted average pooling of the plurality of first text features.

Therefore, the dynamic adjustment control of the text feature fusion process can be realized by determining the first hyper-parameters of the text weighted average pooling based on a plurality of first visual features, and the further interaction of the information among the modalities is realized by determining the first hyper-parameters by using the first visual features representing the visual modality information.

In some embodiments, the operation of step S205 may be similar to the operation of step S204, except that the visual features of the visual modality and the text features of the text modality are interchanged. The adjacency and/or ordering relationships between the plurality of first visual features may refer to the above-mentioned adjacency and/or ordering relationships between the plurality of first text features, which are not described in detail herein.

According to some embodiments, the plurality of first visual fusion features may be derived based on the plurality of first text fusion features. That is, the visual modality information may first be used to guide the fusion of the first text features, and the resulting first text fusion features may then be used to guide the fusion of the first visual features. Because the number of first text features is generally larger than the number of first visual features (particularly when the at least one word embedding feature corresponding to each text region is not fused into a single first text feature for that region), guiding the fusion of the first visual features with the less numerous first text fusion features, rather than with the unfused first text features, can reduce the amount of computation, thereby improving the performance of the neural network model and the speed of the image processing flow.

According to some embodiments, step S205, based on the plurality of first text features, fusing at least one set of adjacent first visual features of the plurality of first visual features, respectively, to obtain a plurality of first visual fused features may include: generating a plurality of first visual fusion weights corresponding to the first visual features one by one based on the first text fusion features; and based on the plurality of first visual fusion weights, performing visual weighted average pooling on the plurality of first visual features to obtain a plurality of first visual fusion features.

Therefore, the method realizes that the first visual features of the visual mode are guided to be fused by using the first text features of the text mode.

In some embodiments, a second hyper-parameter corresponding to weighted average pooling of the plurality of first visual features may be determined based on the plurality of first text fusion features, and the second hyper-parameter may include a pooling window size and a step size corresponding to weighted average pooling of the plurality of first visual features.

Thus, by determining the second hyper-parameter of the visual weighted average pooling based on the plurality of first text fusion features, dynamic adjustment control of the visual feature fusion process can be achieved, and by determining the second hyper-parameter using features characterizing the text modality information, further interaction of information between the modalities is achieved.

It will be appreciated that in some embodiments, step S205 may be performed first and then step S204 may be performed, that is, the first visual features may be fused first and then the first text features may be fused (based on the first visual features or the first visual fusion features), which is not limited herein.

In some embodiments, the resulting plurality of first text fusion features and first visual fusion features may be enhanced based on a self-attention mechanism to obtain a corresponding plurality of second text features and second visual features at step S206.

In some embodiments, all of the plurality of first text fusion features and the plurality of first visual fusion features may be enhanced directly based on the self-attention mechanism, i.e., mapping all of the features described above to corresponding query features Q, key features K, and value features V, and calculating corresponding self-attention results.

According to some embodiments, step S206, enhancing the plurality of first visual fusion features and the plurality of first text fusion features based on the attention mechanism to obtain the plurality of second visual features and the plurality of second text features may include: enhancing a plurality of visual query features corresponding to the plurality of first visual fusion features based on the plurality of text key features and the plurality of text value features corresponding to the plurality of first text fusion features to obtain a plurality of second visual features; and reinforcing the plurality of text query features corresponding to the plurality of first text fusion features based on the plurality of visual key features and the plurality of visual value features corresponding to the plurality of first visual fusion features to obtain a plurality of second text features.

Therefore, by utilizing the cross attention mechanism, the first text fusion feature is enhanced based on the first visual fusion feature, and the first visual fusion feature is enhanced based on the first text fusion feature, so that full interaction among multiple modes is realized.
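The cross-attention enhancement described above can be sketched with plain scaled dot-product attention as follows. For brevity the query, key and value features are taken to be the fusion features themselves (identity projections), whereas an actual model would typically learn separate projections; this simplification, the scaling factor, and the function names are assumptions of the sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        probs = softmax(scores)
        outputs.append([
            sum(p * v[d] for p, v in zip(probs, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Identity projections are assumed; a real model would learn Q/K/V projections.
visual_fusion = [[1.0, 0.0], [0.0, 1.0]]
text_fusion = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

second_visual = attention(visual_fusion, text_fusion, text_fusion)  # visual queries, text K/V
second_text = attention(text_fusion, visual_fusion, visual_fusion)  # text queries, visual K/V
print(len(second_visual), len(second_text))  # 2 3
```

Calling `attention` with the same list for queries, keys and values gives the self-attention variant. Because the attention cost grows with the product of the number of queries and the number of keys, reducing the feature counts before this step directly reduces the work done here.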

In some embodiments, at step S206, the cross-attention and self-attention mechanisms may be performed sequentially (including performing the two in any order) to achieve sufficient interaction between the different features.

In some embodiments, step S204 and step S205 may utilize a neural network model 500 as shown in fig. 5. The neural network model 500 may be a fusion module in the image processing model 800 in fig. 8. As shown in fig. 5, the neural network model 500 includes: a first linear unit 508, configured to generate a plurality of first text fusion weights corresponding to the plurality of first text features 504 one-to-one based on the plurality of first visual features 502; a first fusion unit 506, configured to fuse at least one set of adjacent first text features 504 in the plurality of first text features 504, respectively, based on the plurality of first visual features 502, so as to obtain a plurality of first text fusion features 510; a second linear unit 512, configured to generate a plurality of first visual fusion weights corresponding to the plurality of first visual features 502 one-to-one based on the plurality of first text fusion features 510; a second fusion unit 514, configured to perform visual weighted average pooling on the plurality of first visual features 502 based on the plurality of first visual fusion weights, so as to obtain a plurality of first visual fusion features 516; and a first enhancing unit 518, configured to enhance the first visual fusion features 516 and the first text fusion features 510 based on the attention mechanism to obtain second visual features 522 and second text features 520.

After obtaining the second text features and the second visual features, which are reduced in number and enhanced by the attention mechanism, these features may be upsampled to obtain target visual expansion features equal in number to the first visual features and target text expansion features equal in number to the first text features, for use in a downstream image processing task.

According to some embodiments, as shown in fig. 6, step S207, obtaining the image processing result based on the plurality of second visual features and the plurality of second text features may include: step S601, up-sampling the plurality of second visual features to obtain a plurality of target visual expansion features, where the number of the plurality of target visual expansion features is the same as the number of the plurality of first visual features; step S602, up-sampling the plurality of second text features to obtain a plurality of target text expansion features, wherein the number of the plurality of target text expansion features is the same as the number of the plurality of first text features; and step S603, obtaining an image processing result based on the plurality of target visual expansion features and the plurality of target text expansion features.

Thus, by upsampling the second text features and the second visual features, target text expansion features and target visual expansion features equal in number to the first text features and the first visual features, respectively, can be obtained, making the method better suited to various downstream image processing tasks.

In some embodiments, the plurality of second visual features and the plurality of second text features may be upsampled using any of various upsampling approaches to obtain target text expansion features and target visual expansion features equal in number to the first text features and the first visual features, respectively.

According to some embodiments, step S601, upsampling the plurality of second visual features to obtain a plurality of target visual expansion features may include: copying at least a portion of the plurality of second visual features and concatenating the copies with the plurality of second visual features to obtain the plurality of target visual expansion features. Step S602, upsampling the plurality of second text features to obtain a plurality of target text expansion features may include: copying at least a portion of the plurality of second text features and concatenating the copies with the plurality of second text features to obtain the plurality of target text expansion features.

Therefore, the up-sampling can be simply, conveniently and rapidly realized through the mode, and the image processing efficiency is improved. It will be appreciated that the second text feature and the second visual feature may be upsampled in other ways, not limited herein.

In some embodiments, any of the second visual features may be copied and concatenated with the plurality of second visual features until the number of resulting features (i.e., the plurality of target visual expansion features) is the same as the number of first visual features. Alternatively, the plurality of second visual features may be duplicated as a whole one or more times, and the redundant features may be truncated to obtain the plurality of target visual expansion features. It will be appreciated that the plurality of second visual features may also be upsampled by replication in other ways, not limited herein. In addition, the upsampling of the plurality of second text features may refer to the upsampling of the plurality of second visual features, which is not described herein.
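As a minimal sketch of the whole-set duplication and truncation approach, the function below copies the feature list as many times as needed, concatenates the copies, and truncates the result to the target count. The name `upsample_by_tiling` is hypothetical.

```python
def upsample_by_tiling(features, target_count):
    """Copy the whole feature list as often as needed, concatenate, then truncate."""
    if not features:
        raise ValueError("features must be non-empty")
    copies = -(-target_count // len(features))  # ceiling division
    tiled = [list(f) for _ in range(copies) for f in features]
    return tiled[:target_count]


second_visual = [[0.1], [0.2], [0.3]]
target_visual = upsample_by_tiling(second_visual, target_count=5)
print(target_visual)  # [[0.1], [0.2], [0.3], [0.1], [0.2]]
```

The same function could be applied to the second text features with the number of first text features as the target count.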

In some embodiments, the target text expansion features and the target visual expansion features may be directly utilized as input features for downstream tasks to achieve a particular image processing task. The image processing tasks may include, for example, various types of image processing tasks related to text, and are not limited herein.

According to some embodiments, step S603, obtaining an image processing result based on the plurality of target visual expansion features and the plurality of target text expansion features may include (not shown in the figure): step S6031, enhancing the plurality of target visual expansion features and the plurality of target text expansion features based on the attention mechanism to obtain a plurality of target visual features and a plurality of target text features; and step S6032, obtaining an image processing result based on the plurality of target visual features and the plurality of target text features.

It will be appreciated that the operations of step S6031 and step S6032 may refer to the enhancement of the plurality of first text fusion features and the plurality of first visual fusion features described above, and are not described herein.

Therefore, by processing the target visual expansion features and the target text expansion features with the attention mechanism, target text features characterizing the text modality and target visual features characterizing the visual modality can be obtained more effectively.

In some embodiments, the target text feature and the target visual feature may be utilized as input features for downstream tasks to achieve a particular image processing task.

In some embodiments, multiple rounds of adjacent-feature fusion and upsampling may be performed to further reduce the number of features and enable deeper modal interaction.

According to some embodiments, as shown in fig. 7, step S207, obtaining the image processing result based on the plurality of second visual features and the plurality of second text features may include: step S701, based on a plurality of second visual features, respectively fusing at least one group of adjacent second text features in the plurality of second text features to obtain a plurality of second text fusion features, wherein the number of the plurality of second text fusion features is less than that of the plurality of second text features; step S702, based on the plurality of second text features, respectively fusing at least one group of adjacent second visual features in the plurality of second visual features to obtain a plurality of second visual fusion features, wherein the number of the plurality of second visual fusion features is less than that of the plurality of second visual features; step S703, strengthening the plurality of second visual fusion features and the plurality of second text fusion features based on the attention mechanism to obtain a plurality of third visual features and a plurality of third text features; step S704, up-sampling the plurality of third visual features to obtain a plurality of target visual features, where the number of the plurality of target visual features is the same as the number of the plurality of first visual features; step S705, up-sampling the plurality of third text features to obtain a plurality of target text features, wherein the number of the plurality of target text features is the same as the number of the plurality of first text features; and step S706, obtaining an image processing result based on the plurality of target visual features and the plurality of target text features.

It is understood that the operations of step S701 to step S706 may refer to the above steps of feature fusion, enhancement, upsampling, etc., and will not be described herein.

Therefore, the number of the features can be further reduced and deeper modal interaction can be realized by further fusing and strengthening the second text features and the second visual features.

According to some embodiments, step S704, upsampling the plurality of third visual features to obtain a plurality of target visual features may include (not shown in the figure): step S7041, upsampling the plurality of third visual features to obtain a plurality of intermediate visual features, where the number of the plurality of intermediate visual features is the same as the number of the plurality of second visual features; step S7042, fusing the plurality of second visual features with corresponding intermediate visual features in the plurality of intermediate visual features, respectively, to obtain a plurality of target visual fusion features; and step S7043, upsampling the plurality of target visual fusion features to obtain the plurality of target visual features. Step S705, upsampling the plurality of third text features to obtain a plurality of target text features may include (not shown in the figure): step S7051, upsampling the plurality of third text features to obtain a plurality of intermediate text features, where the number of the plurality of intermediate text features is the same as the number of the plurality of second text features; step S7052, fusing the plurality of second text features with corresponding intermediate text features in the plurality of intermediate text features, respectively, to obtain a plurality of target text fusion features; and step S7053, upsampling the plurality of target text fusion features to obtain the plurality of target text features.

Therefore, the second visual feature and the intermediate visual feature obtained after the third visual feature is up-sampled are fused, and the second text feature and the intermediate text feature obtained after the third text feature is up-sampled are fused, so that information contained in the obtained target visual feature and the target text feature can be further enriched, and the final effect of image processing is improved.

In some embodiments, the fusion of the features may be achieved in step S7042 and step S7052 by direct summation, weighted averaging, concatenation, processing using a multi-layer perceptron, or any combination thereof. Such fusion may also be referred to as a jump connection (skip connection).
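The upsample, fuse, upsample sequence of steps S7041 to S7043 can be sketched as follows, using direct summation for the fusion. The upsampling scheme in `repeat_upsample` (repeating each feature so that neighbouring features stay adjacent) is an assumption chosen for this sketch; the disclosure leaves the upsampling approach open.

```python
def repeat_upsample(features, target_count):
    """Upsample by repeating each feature so neighbours stay adjacent (assumed scheme)."""
    n = len(features)
    return [list(features[i * n // target_count]) for i in range(target_count)]


def skip_fuse(skip_features, upsampled):
    """Fuse by direct summation, one of the listed fusion options."""
    return [[a + b for a, b in zip(s, u)] for s, u in zip(skip_features, upsampled)]


second_visual = [[1.0], [2.0], [3.0], [4.0]]  # kept from before the second fusion round
third_visual = [[10.0], [20.0]]               # output of the deeper round

intermediate = repeat_upsample(third_visual, len(second_visual))  # S7041
target_fusion = skip_fuse(second_visual, intermediate)            # S7042
target_visual = repeat_upsample(target_fusion, 8)                 # S7043
print(target_fusion)  # [[11.0], [12.0], [23.0], [24.0]]
```

The text branch (steps S7051 to S7053) would follow the same pattern with the second and third text features.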

In some embodiments, as shown in fig. 8, the neural network model 800 mainly includes fusion modules 806, 808 and expansion modules 810, 812. There may be further fusion modules and expansion modules between the fusion module 808 and the expansion module 810, as will be described below. Jump connections 818, 820 are provided between the fusion module 806 and the corresponding expansion module 812, and between the fusion module 808 and the corresponding expansion module 810. The fusion module 806 (808) may include a fusion unit 818 (822) for fusing features and an attention unit 820 (824) for enhancing features. The expansion module 812 (810) may include an expansion unit 830 (826) for upsampling and an attention unit 832 (828) for enhancing features. After multiple rounds of fusion and upsampling, the final target text feature 814 and target visual feature 816 may be derived based on the input first text feature 802 and first visual feature 804.

In some embodiments, more fusion modules and expansion modules may be included in the neural network model 800 to obtain a smaller number of "deep" features and perform "deeper" feature fusion and modal interaction. In an exemplary embodiment, the neural network model may include three fusion modules and three expansion modules connected in series, with feature fusion performed between corresponding fusion modules and expansion modules by means of jump connections. It will be appreciated that the neural network model may also include more or fewer fusion modules and expansion modules, and is not limited in this regard.
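To illustrate how the number of features may evolve through stacked fusion modules and expansion modules, the sketch below tracks feature counts only. It assumes that every fusion module uses pooling without padding with a window size of 2 and a step size of 2, and that each expansion module restores the count seen by its corresponding fusion module; these values and the function names are hypothetical.

```python
def pooled_count(n, window, stride):
    """Number of features after pooling n features without padding."""
    return (n - window) // stride + 1


def feature_count_schedule(n, depth, window=2, stride=2):
    """Return feature counts after each fusion module and each expansion module."""
    down = [n]
    for _ in range(depth):
        down.append(pooled_count(down[-1], window, stride))
    up = list(reversed(down[:-1]))  # each expansion module mirrors one fusion module
    return down, up


down, up = feature_count_schedule(16, depth=3)
print(down)  # [16, 8, 4, 2]
print(up)    # [4, 8, 16]
```

The matching counts on the way down and the way up are what allow a jump connection to pair each fusion module with its corresponding expansion module.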

According to another aspect of the present disclosure, a training method of an image processing model is provided. As shown in fig. 9, the training method includes: step S901, determining a real image processing result of a sample image, a plurality of sample text regions in the sample image, and sample text content included in each of the plurality of sample text regions; step S902, extracting a plurality of first sample visual features characterizing visual modality information of the plurality of sample text regions; step S903, extracting a plurality of first sample text features, where the plurality of first sample text features characterize text modality information of the sample text content included in each of the plurality of sample text regions; step S904, based on the plurality of first sample visual features, respectively fusing at least one set of adjacent first sample text features in the plurality of first sample text features by using a neural network model to obtain a plurality of first sample text fusion features, wherein the number of the plurality of first sample text fusion features is less than the number of the plurality of first sample text features; step S905, based on the plurality of first sample text features, respectively fusing at least one set of adjacent first sample visual features in the plurality of first sample visual features by using the neural network model to obtain a plurality of first sample visual fusion features, wherein the number of the plurality of first sample visual fusion features is less than the number of the plurality of first sample visual features; step S906, enhancing the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on an attention mechanism by using the neural network model to obtain a plurality of second sample visual features and a plurality of second sample text features; step S907, obtaining a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features; and step S908, based on the real image processing result and the predicted image processing result, adjusting parameters of the neural network model to obtain the image processing model.

It is to be understood that the operations of step S902 to step S907 in fig. 9 are similar to those of step S202 to step S207 in fig. 2, and are not described herein.

Therefore, in this manner, the trained neural network can use the features of the visual modality as guidance to reduce the number of features of the text modality, and can use the features of the text modality as guidance to reduce the number of features of the visual modality; the attention mechanism can then be used to enhance the fused text features and visual features, realizing full interaction between the visual modality and the text modality. In addition, this reduces the number of features that the attention mechanism must process, reduces the time consumed by the attention mechanism, and improves the image processing efficiency of the trained image processing model.

Based on the structure of the neural network model and the corresponding training steps, the model can be directly used for downstream tasks after pre-training, and distillation, miniaturized design and other treatments are not needed.

In some embodiments, in step S901, the real image processing result of the sample image, the plurality of sample text regions in the sample image, and the sample text content included in each of the plurality of sample text regions may be determined by labeling, text recognition, or otherwise. The real image processing result may be determined according to the training task of the neural network model. In one exemplary embodiment, for a masked language model pre-training task, the real image processing result may be the masked image features or text features; for a specific downstream task, the real image processing result may be a result corresponding to the downstream task.
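For the masked pre-training case, one way to construct the real image processing result and compare it with a prediction is sketched below. The zero mask value, the mean squared error loss, and the function names are assumptions made for this illustration; the disclosure does not specify the masking scheme or the loss function.

```python
def mask_features(features, masked_positions, mask_value=0.0):
    """Replace features at the masked positions; keep the originals as ground truth."""
    masked = [list(f) for f in features]
    targets = {}
    for pos in masked_positions:
        targets[pos] = list(features[pos])
        masked[pos] = [mask_value] * len(features[pos])
    return masked, targets


def masked_mse(predicted, targets):
    """Mean squared error computed over the masked positions only."""
    errors = [
        (p - t) ** 2
        for pos, target in targets.items()
        for p, t in zip(predicted[pos], target)
    ]
    return sum(errors) / len(errors)


text_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked_input, targets = mask_features(text_features, masked_positions=[1])
# A model would predict features from masked_input; here the masked input itself
# stands in as a deliberately poor prediction.
loss = masked_mse(masked_input, targets)
print(loss)  # 12.5
```

In training, a loss of this kind would be the quantity used in step S908 to adjust the parameters of the neural network model.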

In some embodiments, the neural network model may also be trained using a pre-training mechanism for content automation coding (Content Autoencoder, CAE).

According to some embodiments, step S903, extracting a plurality of first sample text features characterizing text modality information of sample text content included in each of the plurality of sample text regions may include: segmenting the sample text content included in each sample text region of the plurality of sample text regions to obtain at least one sample word segment corresponding to the sample text region; performing word embedding on the at least one sample word segment corresponding to each sample text region of the plurality of sample text regions to obtain at least one sample word embedding feature corresponding to the sample text region; and obtaining the plurality of first sample text features based on the at least one sample word embedding feature corresponding to each of the plurality of sample text regions.

Thus, by segmenting the sample text content in each sample text region and word embedding each segmented word, the first sample text feature containing rich information corresponding to the sample text content of the sample text region can be obtained based on the sample word embedding feature corresponding to the segmented words.

According to some embodiments, the plurality of first sample text features may be in one-to-one correspondence with the plurality of sample text regions, and the plurality of first sample visual features may be in one-to-one correspondence with the plurality of sample text regions. Obtaining the plurality of first sample text features based on the at least one sample word embedding feature corresponding to each sample text region of the plurality of sample text regions may include: fusing the at least one sample word embedding feature corresponding to each sample text region of the plurality of sample text regions to obtain the first sample text feature corresponding to the sample text region.

In this way, the number of first sample text features is effectively reduced according to the inherent logical relationship among the sample text regions, the sample text content, and the sample word embedding features.

It will be appreciated that, for the adjacency and/or ordering relationship between the plurality of first sample text features and the adjacency and/or ordering relationship between the plurality of first sample visual features, reference may be made to the description of the plurality of first text features and the plurality of first visual features above, which is not repeated herein.

According to some embodiments, step S904, based on the plurality of first sample visual features, fusing at least one set of adjacent first sample text features of the plurality of first sample text features, respectively, using the neural network model, to obtain a plurality of first sample text fusion features, may include: generating, using the neural network model and based on the plurality of first sample visual features, a plurality of first sample text fusion weights in one-to-one correspondence with the plurality of first sample text features; and fusing the plurality of first sample text features based on the plurality of first sample text fusion weights to obtain the plurality of first sample text fusion features.

Thus, by generating, based on the plurality of first sample visual features, a plurality of first sample text fusion weights in one-to-one correspondence with the plurality of first sample text features, the fusion of the first sample text features is guided by the first sample visual features characterizing the visual modality information.

According to some embodiments, fusing the plurality of first sample text features based on the plurality of first sample text fusion weights to obtain the plurality of first sample text fusion features may include: performing text weighted average pooling on the plurality of first sample text features based on the plurality of first sample text fusion weights to obtain the plurality of first sample text fusion features.

Thus, by using weighted average pooling, adjacent first sample text features of the plurality of first sample text features can be fused quickly to obtain the plurality of first sample text fusion features.

According to some embodiments, a first sample hyper-parameter corresponding to the weighted average pooling of the plurality of first sample text features may be determined based on the plurality of first sample visual features, and the first sample hyper-parameter may include a pooling window size and a step size corresponding to the weighted average pooling of the plurality of first sample text features.

Therefore, determining the first sample hyper-parameter of the text weighted average pooling based on the plurality of first sample visual features allows the text feature fusion process to be adjusted dynamically, and, because the first sample hyper-parameter is determined from the first sample visual features characterizing the visual modality information, further interaction of information between the modalities is achieved.
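As a non-limiting illustration, weighted average pooling of adjacent text features under visually derived weights might be sketched as follows. The function names, the sigmoid-of-linear-score weight generator (a stand-in for the neural network model), the parameter vector `params`, and the fixed `window` and `stride` values are assumptions made for this sketch; in the disclosure the window size and step size are themselves determined from the visual features.

```python
import math


def fusion_weights(visual_features, params):
    """One weight in (0, 1) per feature: sigmoid of a linear score."""
    weights = []
    for v in visual_features:
        score = sum(a * b for a, b in zip(v, params))
        weights.append(1.0 / (1.0 + math.exp(-score)))
    return weights


def weighted_average_pool(features, weights, window, stride):
    """Fuse adjacent features; each window yields a weight-normalized mean."""
    dim = len(features[0])
    pooled = []
    for start in range(0, len(features) - window + 1, stride):
        idx = range(start, start + window)
        total = sum(weights[i] for i in idx)
        pooled.append(
            [sum(weights[i] * features[i][d] for i in idx) / total
             for d in range(dim)]
        )
    return pooled


# Four text features and four visual features, one pair per text region.
text_feats = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
visual_feats = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4]]

weights = fusion_weights(visual_feats, params=[1.0, -1.0])
fused = weighted_average_pool(text_feats, weights, window=2, stride=2)
```

With a window of 2 and a step size of 2, four text features are fused into two, so the number of text fusion features is smaller than the number of text features.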

According to some embodiments, the plurality of first sample visual fusion features may be derived based on the plurality of first sample text fusion features.

Because the number of first sample text features is generally greater than the number of first sample text fusion features (particularly in the case where the at least one sample word embedding feature corresponding to each sample text region has not been fused into a single first sample text feature corresponding to the sample text region), guiding the fusion of the first sample visual features with the first sample text fusion features requires less computation than guiding it with the first sample text features, which improves the performance of the trained image processing model and speeds up the image processing flow.

According to some embodiments, step S905, based on the plurality of first sample text features, fusing at least one set of adjacent first sample visual features of the plurality of first sample visual features, respectively, using the neural network model, to obtain a plurality of first sample visual fusion features, may include: generating a plurality of first sample visual fusion weights in one-to-one correspondence with the plurality of first sample visual features based on the plurality of first sample text fusion features; and performing visual weighted average pooling on the plurality of first sample visual features based on the plurality of first sample visual fusion weights to obtain the plurality of first sample visual fusion features. A second sample hyper-parameter corresponding to the weighted average pooling of the plurality of first sample visual features may be determined based on the plurality of first sample text fusion features, and the second sample hyper-parameter may include a pooling window size and a step size corresponding to the weighted average pooling of the plurality of first sample visual features.

In this way, the fusion of the first sample visual features is guided by the first sample text fusion features characterizing the text modality information.

According to some embodiments, step S906 of enhancing the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on the attention mechanism to obtain the plurality of second sample visual features and the plurality of second sample text features may include: enhancing a plurality of sample visual query features corresponding to the plurality of first sample visual fusion features based on the plurality of sample text key features and the plurality of sample text value features corresponding to the plurality of first sample text fusion features to obtain a plurality of second sample visual features; and reinforcing the plurality of sample text query features corresponding to the plurality of first sample text fusion features based on the plurality of sample visual key features and the plurality of sample visual value features corresponding to the plurality of first sample visual fusion features to obtain a plurality of second sample text features.

Therefore, by using a cross-attention mechanism to enhance the first sample visual fusion features based on the first sample text fusion features and to enhance the first sample text fusion features based on the first sample visual fusion features, full interaction between the modalities is achieved.
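As a non-limiting illustration, the bidirectional cross-attention described above might be sketched as follows. The scaled dot-product form, the omission of learned query, key, and value projections (the fusion features are used directly), and the function names are assumptions made for this sketch; the disclosure does not fix a particular attention formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        outputs.append(
            [sum(p * v[d] for p, v in zip(attn, values)) for d in range(dim)]
        )
    return outputs


visual_fused = [[1.0, 0.0], [0.0, 1.0]]
text_fused = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Visual queries are enhanced with text keys and values, and vice versa.
second_visual = cross_attention(visual_fused, text_fused, text_fused)
second_text = cross_attention(text_fused, visual_fused, visual_fused)
```

Each direction preserves the number of query features: two second visual features and three second text features in this toy example.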

According to some embodiments, step S907, obtaining the predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features, may include: upsampling the plurality of second sample visual features to obtain a plurality of sample target visual extension features, the number of the plurality of sample target visual extension features being the same as the number of the plurality of first sample visual features; upsampling the plurality of second sample text features to obtain a plurality of sample target text extension features, the number of the plurality of sample target text extension features being the same as the number of the plurality of first sample text features; and obtaining the predicted image processing result based on the plurality of sample target visual extension features and the plurality of sample target text extension features.

Therefore, by upsampling the second sample text features and the second sample visual features, sample target text extension features and sample target visual extension features can be obtained whose numbers equal those of the first sample text features and the first sample visual features, respectively, so that the method is better suited to a variety of downstream image processing tasks.

According to some embodiments, upsampling the plurality of second sample visual features to obtain the plurality of sample target visual extension features may include: copying at least a portion of the plurality of second sample visual features and splicing the copies with the plurality of second sample visual features to obtain the plurality of sample target visual extension features. Upsampling the plurality of second sample text features to obtain the plurality of sample target text extension features may include: copying at least a portion of the plurality of second sample text features and splicing the copies with the plurality of second sample text features to obtain the plurality of sample target text extension features.

In this way, upsampling can be implemented simply and quickly, which improves the image processing efficiency of the trained image processing model. It will be appreciated that the second sample text features and the second sample visual features may also be upsampled in other ways, which is not limited herein.
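As a non-limiting illustration, upsampling by copying and splicing might be sketched as follows. The function name and the nearest-neighbor index mapping, in which each feature is copied into a contiguous run of output positions, are assumptions made for this sketch; the disclosure states only that at least some features are copied and spliced with the existing features.

```python
def upsample_by_copy(features, target_len):
    """Expand a feature sequence to target_len by repeating entries in order."""
    n = len(features)
    if n == 0 or target_len < n:
        raise ValueError("target_len must be at least len(features) > 0")
    # Output position i takes a copy of input feature floor(i * n / target_len).
    return [list(features[i * n // target_len]) for i in range(target_len)]


second_visual = [[1.0, 1.5], [2.0, 2.5]]
target_visual_extension = upsample_by_copy(second_visual, 4)
```

Here two second visual features are expanded to four, matching an assumed count of four first visual features; every input feature appears at least once in the output, and the original ordering is preserved.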

According to some embodiments, obtaining the predicted image processing result based on the plurality of sample target visual extension features and the plurality of sample target text extension features may include: enhancing the plurality of sample target visual extension features and the plurality of sample target text extension features based on an attention mechanism to obtain a plurality of sample target visual features and a plurality of sample target text features; and obtaining the predicted image processing result based on the plurality of sample target visual features and the plurality of sample target text features.

Therefore, by processing the sample target visual extension features and the sample target text extension features with the attention mechanism, sample target text features that better represent the text modality and sample target visual features that better represent the visual modality can be obtained.

According to some embodiments, step S907, obtaining the predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features may include: based on the plurality of second sample visual features, respectively fusing at least one group of adjacent second sample text features in the plurality of second sample text features to obtain a plurality of second sample text fusion features, wherein the number of the plurality of second sample text fusion features is less than that of the plurality of second sample text features; based on the plurality of second sample text features, respectively fusing at least one group of adjacent second sample visual features in the plurality of second sample visual features to obtain a plurality of second sample visual fusion features, wherein the number of the plurality of second sample visual fusion features is less than that of the plurality of second sample visual features; reinforcing the plurality of second sample visual fusion features and the plurality of second sample text fusion features based on an attention mechanism to obtain a plurality of third sample visual features and a plurality of third sample text features; upsampling the plurality of third sample visual features to obtain a plurality of sample target visual features, the number of the plurality of sample target visual features being the same as the number of the plurality of first sample visual features; upsampling the plurality of third sample text features to obtain a plurality of sample target text features, the number of the plurality of sample target text features being the same as the number of the plurality of first sample text features; and obtaining a predicted image processing result based on the plurality of sample target visual features and the plurality of sample target text features.

Therefore, by further fusing and enhancing the second sample text features and the second sample visual features, the number of features can be further reduced and deeper interaction between the modalities can be achieved.

According to some embodiments, upsampling the plurality of third sample visual features to obtain the plurality of sample target visual features may include: upsampling the plurality of third sample visual features to obtain a plurality of sample intermediate visual features, the number of the plurality of sample intermediate visual features being the same as the number of the plurality of second sample visual features; fusing the plurality of second sample visual features with corresponding sample intermediate visual features of the plurality of sample intermediate visual features, respectively, to obtain a plurality of sample target visual fusion features; and upsampling the plurality of sample target visual fusion features to obtain the plurality of sample target visual features. Upsampling the plurality of third sample text features to obtain the plurality of sample target text features may include: upsampling the plurality of third sample text features to obtain a plurality of sample intermediate text features, the number of the plurality of sample intermediate text features being the same as the number of the plurality of second sample text features; fusing the plurality of second sample text features with corresponding sample intermediate text features of the plurality of sample intermediate text features, respectively, to obtain a plurality of sample target text fusion features; and upsampling the plurality of sample target text fusion features to obtain the plurality of sample target text features.

Therefore, by fusing the second sample visual features with the sample intermediate visual features obtained by upsampling the third sample visual features, and fusing the second sample text features with the sample intermediate text features obtained by upsampling the third sample text features, the information contained in the resulting sample target visual features and sample target text features is further enriched, which improves the final image processing effect of the trained image processing model.
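As a non-limiting illustration, the two-stage upsampling with an intermediate fusion might be sketched as follows for a single modality. The function names, the copy-based `upsample` helper, and the use of an element-wise mean as the fusion of second features with intermediate features are assumptions made for this sketch; the disclosure does not specify the fusion operator.

```python
def upsample(features, target_len):
    """Copy-based upsampling: repeat entries in order to reach target_len."""
    n = len(features)
    return [list(features[i * n // target_len]) for i in range(target_len)]


def fuse(second, intermediate):
    """Element-wise mean of corresponding features (an assumed fusion)."""
    return [
        [(x + y) / 2.0 for x, y in zip(fa, fb)]
        for fa, fb in zip(second, intermediate)
    ]


def decode(third, second, first_count):
    """Third features -> intermediate -> fused with second -> target."""
    intermediate = upsample(third, len(second))   # same count as second
    target_fusion = fuse(second, intermediate)    # skip-style fusion
    return upsample(target_fusion, first_count)   # same count as first


third_visual = [[4.0]]
second_visual = [[0.0], [2.0]]
target_visual = decode(third_visual, second_visual, first_count=4)
```

The same `decode` function could be applied to the text branch by passing the third and second text features and the number of first text features.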

In some embodiments, in step S908, a loss value may be calculated according to a predetermined loss function based on the real image processing result and the predicted image processing result, and the parameters of the neural network model may then be adjusted according to the loss value to obtain the trained image processing model. It will be appreciated that, in implementing the training method of the present disclosure, the loss function and the training manner of the neural network model may be determined as required, which is not limited herein.
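As a non-limiting illustration, computing a loss value and adjusting parameters might be sketched as follows. The linear classification head, the cross-entropy loss, the plain gradient-descent update, and the learning rate are assumptions made for this sketch; the disclosure leaves the loss function and the training manner open.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, feature, label, lr=0.1):
    """One gradient-descent step on cross-entropy; returns (loss, new_weights)."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    # Gradient of the loss w.r.t. row c is (p_c - onehot_c) * feature.
    new_weights = [
        [w - lr * (p - (1.0 if c == label else 0.0)) * x
         for w, x in zip(row, feature)]
        for c, (row, p) in enumerate(zip(weights, probs))
    ]
    return loss, new_weights


weights = [[0.0, 0.0], [0.0, 0.0]]  # hypothetical two-class head
feature = [1.0, 2.0]                # stands in for a sample target feature
loss_before, weights = train_step(weights, feature, label=1)
loss_after, _ = train_step(weights, feature, label=1)
```

In this toy setting the loss computed after one update is lower than the loss computed before it, which is the behavior the parameter adjustment in step S908 aims for.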

According to another aspect of the present disclosure, an image processing apparatus is provided. As shown in fig. 10, the image processing apparatus 1000 includes: a text recognition unit 1010 configured to perform text recognition on the target image to obtain a plurality of text regions and text content included in each of the plurality of text regions; a first extraction unit 1020 configured to extract a plurality of first visual features characterizing visual modality information of the plurality of text regions; a second extraction unit 1030 configured to extract a plurality of first text features characterizing text modality information of the text content included in the plurality of text regions; a first fusing unit 1040 configured to fuse at least one set of adjacent first text features of the plurality of first text features, respectively, based on the plurality of first visual features, to obtain a plurality of first text fusion features, the number of the plurality of first text fusion features being less than the number of the plurality of first text features; a second fusing unit 1050 configured to fuse at least one set of adjacent first visual features of the plurality of first visual features, respectively, based on the plurality of first text features, to obtain a plurality of first visual fusion features, the number of the plurality of first visual fusion features being less than the number of the plurality of first visual features; a first enhancement unit 1060 configured to enhance the plurality of first visual fusion features and the plurality of first text fusion features based on the attention mechanism to obtain a plurality of second visual features and a plurality of second text features; and a first processing unit 1070 configured to obtain an image processing result based on the plurality of second visual features and the plurality of second text features.

It is understood that the operations of the units 1010-1070 in the apparatus 1000 are similar to the operations of the steps S201-S207 in fig. 2, respectively, and are not described herein.

According to another aspect of the present disclosure, a training apparatus for an image processing model is provided. As shown in fig. 11, the training apparatus 1100 includes: a determining unit 1110 configured to determine a real image processing result of the sample image, a plurality of sample text regions in the sample image, and sample text content included in each of the plurality of sample text regions; a third extraction unit 1120 configured to extract a plurality of first sample visual features characterizing visual modality information of the plurality of sample text regions; a fourth extraction unit 1130 configured to extract a plurality of first sample text features, the plurality of first sample text features characterizing text modality information of the sample text content included in the plurality of sample text regions; a third fusing unit 1140 configured to fuse at least one set of adjacent first sample text features of the plurality of first sample text features, respectively, using the neural network model, based on the plurality of first sample visual features, to obtain a plurality of first sample text fusion features, the number of the plurality of first sample text fusion features being less than the number of the plurality of first sample text features; a fourth fusing unit 1150 configured to fuse at least one set of adjacent first sample visual features of the plurality of first sample visual features, respectively, using the neural network model, based on the plurality of first sample text features, to obtain a plurality of first sample visual fusion features, the number of the plurality of first sample visual fusion features being less than the number of the plurality of first sample visual features; a second enhancement unit 1160 configured to enhance the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on the attention mechanism using the neural network model to obtain a plurality of second sample visual features and a plurality of second sample text features; a second processing unit 1170 configured to obtain a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features; and a parameter tuning unit 1180 configured to adjust parameters of the neural network model based on the real image processing result and the predicted image processing result to obtain an image processing model.

It is understood that the operations of the units 1110 to 1180 in the apparatus 1100 are similar to the operations of the steps S901 to S908 in fig. 9, respectively, and are not described herein.

According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.

With reference to fig. 12, a block diagram of an electronic device 1200 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 12, the apparatus 1200 includes a computing unit 1201, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

Various components in device 1200 are connected to I/O interface 1205, including: an input unit 1206, an output unit 1207, a storage unit 1208, and a communication unit 1209. The input unit 1206 may be any type of device capable of inputting information to the device 1200; the input unit 1206 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 1207 may be any type of device capable of presenting information, and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 1208 may include, but is not limited to, magnetic disks and optical disks. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMAX devices, cellular communication devices, and/or the like.

The computing unit 1201 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning network algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as an image processing method and/or a training method of an image processing model. For example, in some embodiments, the image processing method and/or the training method of the image processing model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the image processing method and/or the training method of the image processing model described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the image processing method and/or the training method of the image processing model in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions are achieved, which is not limited herein.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims

1. An image processing method, comprising:

performing text recognition on a target image to obtain a plurality of text regions and text content included in each text region of the plurality of text regions;

extracting a plurality of first visual features characterizing visual modality information of the plurality of text regions;

extracting a plurality of first text features, wherein the plurality of first text features characterize text modal information of text contents included in the plurality of text regions;

based on the plurality of first visual features, respectively fusing at least one group of adjacent first text features in the plurality of first text features to obtain a plurality of first text fusion features, wherein the number of the plurality of first text fusion features is smaller than that of the plurality of first text features;

based on the plurality of first text features, respectively fusing at least one group of adjacent first visual features in the plurality of first visual features to obtain a plurality of first visual fusion features, wherein the number of the plurality of first visual fusion features is smaller than that of the plurality of first visual features;

enhancing the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain a plurality of second visual features and a plurality of second text features; and

obtaining an image processing result based on the plurality of second visual features and the plurality of second text features.

2. The method of claim 1, wherein fusing at least one set of adjacent first text features of the plurality of first text features, based on the plurality of first visual features, respectively, to obtain a plurality of first text fusion features comprises:

generating a plurality of first text fusion weights in one-to-one correspondence with the plurality of first text features based on the plurality of first visual features; and

fusing the plurality of first text features based on the plurality of first text fusion weights to obtain the plurality of first text fusion features.

3. The method of claim 2, wherein fusing the plurality of first text features based on the plurality of first text fusion weights to obtain the plurality of first text fusion features comprises:

performing weighted average pooling on the plurality of first text features based on the plurality of first text fusion weights to obtain the plurality of first text fusion features.

4. The method of claim 3, wherein a first hyper-parameter corresponding to the weighted average pooling of the plurality of first text features is determined based on the plurality of first visual features, the first hyper-parameter comprising a pooling window size and a step size of the weighted average pooling of the plurality of first text features.
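As an illustrative sketch only, the weighted average pooling of claims 3 and 4 might look like the following. The function name `weighted_average_pool`, the toy feature values, and the fixed `window` and `stride` arguments are assumptions made for illustration; the claims state that the window size and step size are determined from the first visual features but do not say how, and the weights are assumed to be positive.

```python
def weighted_average_pool(features, weights, window, stride):
    """Pool groups of adjacent feature vectors into fewer vectors.

    Each output vector is the weighted average of `window` adjacent inputs,
    and consecutive windows start `stride` positions apart.
    """
    if len(features) != len(weights):
        raise ValueError("one fusion weight per feature is required")
    pooled = []
    for start in range(0, len(features) - window + 1, stride):
        group = features[start:start + window]
        group_weights = weights[start:start + window]
        total = sum(group_weights)  # assumed positive in this sketch
        dim = len(group[0])
        pooled.append([
            sum(w * vec[d] for w, vec in zip(group_weights, group)) / total
            for d in range(dim)
        ])
    return pooled


# Hypothetical first text features (one per text region) and fusion weights.
text_features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
fusion_weights = [1.0, 3.0, 1.0, 1.0]
pooled = weighted_average_pool(text_features, fusion_weights, window=2, stride=2)
```

With a window of 2 and a step size of 2, four input features become two fusion features, which matches the requirement that the number of fusion features is smaller than the number of input features.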

5. The method of claim 1, wherein the plurality of first visual fusion features are derived based on the plurality of first text fusion features.

6. The method of claim 5, wherein fusing, based on the plurality of first text features, at least one group of adjacent first visual features among the plurality of first visual features to obtain the plurality of first visual fusion features comprises:

generating a plurality of first visual fusion weights corresponding to the plurality of first visual features one to one based on the plurality of first text fusion features; and

based on the plurality of first visual fusion weights, performing visual weighted average pooling on the plurality of first visual features to obtain the plurality of first visual fusion features,

wherein a second hyper-parameter corresponding to a weighted average pooling of the plurality of first visual features is determined based on the plurality of first text fusion features, the second hyper-parameter comprising a pooling window size and a step size corresponding to a weighted average pooling of the plurality of first visual features.
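One way the first visual fusion weights of claim 6 could be generated from the first text fusion features is sketched below. The scoring rule (a sigmoid of the dot product between each visual feature and the mean text fusion feature) and the helper names `mean_vector` and `fusion_weights_from_context` are assumptions for illustration; the claim only requires one weight per first visual feature, generated from the first text fusion features.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fusion_weights_from_context(features, context_features):
    """Return one weight in (0, 1) per feature, scored against the other modality."""
    context = mean_vector(context_features)
    weights = []
    for vec in features:
        score = sum(a * b for a, b in zip(vec, context))
        weights.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid keeps weights positive
    return weights


# Hypothetical values: two text fusion features guide four visual features.
text_fusion_features = [[1.0, 0.0], [1.0, 2.0]]
visual_features = [[2.0, 0.0], [0.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
visual_fusion_weights = fusion_weights_from_context(visual_features, text_fusion_features)
```

The resulting weights could then be passed to a weighted average pooling step over the visual features.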

7. The method of claim 1, wherein extracting a plurality of first text features characterizing text modality information of text content included by each of the plurality of text regions comprises:

performing word segmentation on the text content included in each text region of the plurality of text regions to obtain at least one text word corresponding to the text region;

performing word embedding on the at least one text word corresponding to each text region of the plurality of text regions to obtain at least one word embedding feature corresponding to the text region; and

obtaining the plurality of first text features based on the at least one word embedding feature corresponding to each text region of the plurality of text regions.

8. The method of claim 7, wherein the plurality of first text features are in one-to-one correspondence with the plurality of text regions, the plurality of first visual features are in one-to-one correspondence with the plurality of text regions,

wherein obtaining the plurality of first text features based on at least one word embedding feature corresponding to each text region of the plurality of text regions comprises:

fusing the at least one word embedding feature corresponding to each text region of the plurality of text regions to obtain a first text feature corresponding to the text region.
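The steps of claims 7 and 8 can be sketched as follows. The whitespace tokenizer, the toy `EMBEDDINGS` table, and mean pooling as the fusion operation are all assumptions for illustration; the claims do not specify a segmentation method, an embedding model, or how the word embedding features are fused.

```python
# Hypothetical toy embedding table; a real system would use a trained model.
EMBEDDINGS = {
    "invoice": [1.0, 0.0],
    "total": [0.0, 1.0],
    "amount": [1.0, 1.0],
}
UNKNOWN = [0.0, 0.0]


def region_text_feature(text):
    """Segment a region's text, embed each word, and fuse by averaging."""
    words = text.lower().split()  # assumed segmentation: split on whitespace
    vectors = [EMBEDDINGS.get(word, UNKNOWN) for word in words]
    if not vectors:
        return list(UNKNOWN)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# One first text feature per recognized text region.
region_texts = ["Invoice", "Total amount"]
first_text_features = [region_text_feature(text) for text in region_texts]
```

Averaging yields exactly one text feature per text region, which gives the one-to-one correspondence recited in claim 8.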

9. The method of claim 1, wherein deriving an image processing result based on the plurality of second visual features and the plurality of second text features comprises:

upsampling the plurality of second visual features to obtain a plurality of target visual extension features, the number of the plurality of target visual extension features being the same as the number of the plurality of first visual features;

upsampling the plurality of second text features to obtain a plurality of target text extension features, the number of the plurality of target text extension features being the same as the number of the plurality of first text features; and

obtaining the image processing result based on the plurality of target visual extension features and the plurality of target text extension features.

10. The method of claim 9, wherein upsampling the plurality of second visual features to obtain a plurality of target visual extension features comprises:

copying at least a portion of the plurality of second visual features and splicing the copies with the plurality of second visual features to obtain the plurality of target visual extension features,

wherein upsampling the plurality of second text features to obtain a plurality of target text extension features comprises:

copying at least a portion of the plurality of second text features and splicing the copies with the plurality of second text features to obtain the plurality of target text extension features.
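A minimal sketch of upsampling by copying and splicing, as in claims 9 and 10, is shown below. The nearest-neighbour repetition pattern and the function name `upsample_by_copy` are assumptions; the claims only require that copies of at least some second features be spliced in so that the count matches the number of first features.

```python
def upsample_by_copy(features, target_count):
    """Repeat feature vectors in order until `target_count` vectors exist."""
    if not features or target_count < len(features):
        raise ValueError("target_count must be at least the number of features")
    source_count = len(features)
    # Output position i copies the source feature that covers it.
    return [list(features[i * source_count // target_count])
            for i in range(target_count)]


# Hypothetical second text features restored to the original count of four.
second_text_features = [[1.0, 1.5], [2.0, 2.5]]
target_text_extension = upsample_by_copy(second_text_features, 4)
```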

11. The method of claim 9, wherein deriving the image processing result based on the plurality of target visual extension features and the plurality of target text extension features comprises:

enhancing the plurality of target visual extension features and the plurality of target text extension features based on an attention mechanism to obtain a plurality of target visual features and a plurality of target text features; and

obtaining the image processing result based on the plurality of target visual features and the plurality of target text features.

12. The method of claim 1, wherein deriving an image processing result based on the plurality of second visual features and the plurality of second text features comprises:

based on the plurality of second visual features, respectively fusing at least one group of adjacent second text features in the plurality of second text features to obtain a plurality of second text fusion features, wherein the number of the plurality of second text fusion features is smaller than that of the plurality of second text features;

based on the plurality of second text features, respectively fusing at least one group of adjacent second visual features in the plurality of second visual features to obtain a plurality of second visual fusion features, wherein the number of the plurality of second visual fusion features is smaller than that of the plurality of second visual features;

enhancing the plurality of second visual fusion features and the plurality of second text fusion features based on an attention mechanism to obtain a plurality of third visual features and a plurality of third text features;

upsampling the plurality of third visual features to obtain a plurality of target visual features, the number of the plurality of target visual features being the same as the number of the plurality of first visual features;

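upsampling the plurality of third text features to obtain a plurality of target text features, the number of the plurality of target text features being the same as the number of the plurality of first text features; and

obtaining the image processing result based on the plurality of target visual features and the plurality of target text features.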

13. The method of claim 12, wherein upsampling the plurality of third visual features to obtain a plurality of target visual features comprises:

upsampling the plurality of third visual features to obtain a plurality of intermediate visual features, the number of the plurality of intermediate visual features being the same as the number of the plurality of second visual features;

fusing the plurality of second visual features with corresponding ones of the plurality of intermediate visual features, respectively, to obtain a plurality of target visual fusion features; and

upsampling the plurality of target visual fusion features to obtain the plurality of target visual features,

wherein upsampling the plurality of third text features to obtain a plurality of target text features comprises:

upsampling the plurality of third text features to obtain a plurality of intermediate text features, the number of the plurality of intermediate text features being the same as the number of the plurality of second text features;

fusing the plurality of second text features with corresponding intermediate text features in the plurality of intermediate text features respectively to obtain a plurality of target text fusion features; and

upsampling the plurality of target text fusion features to obtain the plurality of target text features.
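The two-stage upsampling of claim 13 can be sketched as below: third features are first expanded to the count of the second features, fused with them, and then expanded again to the count of the first features. Repetition as the upsampling operation, element-wise addition as the fusion operation, and the names `repeat_to` and `decode` are assumptions for illustration only.

```python
def repeat_to(features, target_count):
    """Upsample by repeating feature vectors in order (assumed method)."""
    source_count = len(features)
    return [list(features[i * source_count // target_count])
            for i in range(target_count)]


def decode(third_features, second_features, first_count):
    """Upsample, fuse with the earlier-stage features, then upsample again."""
    intermediate = repeat_to(third_features, len(second_features))
    fused = [
        [a + b for a, b in zip(second, inter)]  # assumed fusion: element-wise sum
        for second, inter in zip(second_features, intermediate)
    ]
    return repeat_to(fused, first_count)


# Hypothetical one-dimensional features: 1 third, 2 second, 4 first features.
third_visual = [[10.0]]
second_visual = [[1.0], [2.0]]
target_visual = decode(third_visual, second_visual, first_count=4)
```

The same function applies unchanged to the text branch, since the text steps of claim 13 mirror the visual steps.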

14. The method of claim 1, wherein enhancing the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain a plurality of second visual features and a plurality of second text features comprises:

enhancing a plurality of visual query features corresponding to the plurality of first visual fusion features based on a plurality of text key features and a plurality of text value features corresponding to the plurality of first text fusion features to obtain the plurality of second visual features; and

enhancing a plurality of text query features corresponding to the plurality of first text fusion features based on a plurality of visual key features and a plurality of visual value features corresponding to the plurality of first visual fusion features to obtain the plurality of second text features.
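The bidirectional enhancement of claim 14 resembles cross-attention between the two modalities, and a minimal sketch under that reading follows. Scaled dot-product attention, the absence of learned query, key, and value projections (the fusion features are used directly), and the names `softmax` and `cross_attend` are assumptions; the claim does not specify how the query, key, and value features are derived.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Enhance each query with an attention-weighted mix of the values."""
    scale = math.sqrt(len(keys[0]))
    enhanced = []
    for query in queries:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        attention = softmax(scores)
        enhanced.append([
            sum(w * value[d] for w, value in zip(attention, values))
            for d in range(len(values[0]))
        ])
    return enhanced


# Hypothetical fusion features; projections are omitted in this sketch.
visual_fused = [[1.0, 0.0], [0.0, 1.0]]
text_fused = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
second_visual = cross_attend(visual_fused, text_fused, text_fused)  # visual queries, text keys/values
second_text = cross_attend(text_fused, visual_fused, visual_fused)  # text queries, visual keys/values
```

Each modality keeps its own feature count after enhancement: two second visual features and three second text features in this example.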

15. A method of training an image processing model, comprising:

determining a real image processing result of a sample image and a plurality of sample text regions in the sample image and text content included in each of the plurality of sample text regions;

extracting a plurality of first sample visual features characterizing visual modality information of the plurality of sample text regions;

extracting a plurality of first sample text features, the plurality of first sample text features characterizing text modality information of sample text content included in the plurality of sample text regions;

based on the plurality of first sample visual features, fusing at least one group of adjacent first sample text features among the plurality of first sample text features, respectively, using a neural network model to obtain a plurality of first sample text fusion features, the number of the plurality of first sample text fusion features being less than the number of the plurality of first sample text features;

based on the plurality of first sample text features, fusing at least one group of adjacent first sample visual features among the plurality of first sample visual features, respectively, using the neural network model to obtain a plurality of first sample visual fusion features, the number of the plurality of first sample visual fusion features being less than the number of the plurality of first sample visual features;

enhancing, using the neural network model, the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on an attention mechanism to obtain a plurality of second sample visual features and a plurality of second sample text features;

obtaining a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features; and

adjusting parameters of the neural network model based on the real image processing result and the predicted image processing result to obtain an image processing model.
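The parameter adjustment step of claim 15 can be illustrated with a deliberately small stand-in model. Everything specific here is an assumption: a linear classifier replaces the neural network model, the real image processing result is taken to be one class label per feature, and cross-entropy loss with plain gradient descent is used because the claim names neither a loss function nor an optimizer.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, feature):
    """Predicted class probabilities for one feature vector."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    return softmax(logits)


def mean_loss(weights, features, labels):
    """Mean cross-entropy between predicted and real results."""
    losses = [-math.log(predict(weights, f)[y]) for f, y in zip(features, labels)]
    return sum(losses) / len(losses)


def train_step(weights, features, labels, learning_rate):
    """One gradient-descent update of the model parameters."""
    grads = [[0.0] * len(row) for row in weights]
    for feature, label in zip(features, labels):
        probs = predict(weights, feature)
        for c, prob in enumerate(probs):
            error = prob - (1.0 if c == label else 0.0)
            for d, x in enumerate(feature):
                grads[c][d] += error * x / len(features)
    return [[w - learning_rate * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


# Hypothetical enhanced sample features and their real (labelled) results.
sample_features = [[1.0, 0.0], [0.0, 1.0]]
real_results = [0, 1]
weights = [[0.0, 0.0], [0.0, 0.0]]
loss_before = mean_loss(weights, sample_features, real_results)
for _ in range(20):
    weights = train_step(weights, sample_features, real_results, learning_rate=0.5)
loss_after = mean_loss(weights, sample_features, real_results)
```

In a full model, the same update would propagate back through the fusion and attention stages so that the fusion weights are learned together with the prediction head.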

16. The method of claim 15, wherein fusing, based on the plurality of first sample visual features, at least one group of adjacent first sample text features among the plurality of first sample text features, respectively, using the neural network model to obtain the plurality of first sample text fusion features comprises:

generating, using the neural network model, a plurality of first sample fusion weights in one-to-one correspondence with the plurality of first sample text features based on the plurality of first sample visual features; and

fusing the plurality of first sample text features based on the plurality of first sample fusion weights to obtain the plurality of first sample text fusion features.

17. The method of claim 16, wherein fusing the plurality of first sample text features based on the plurality of first sample fusion weights to obtain the plurality of first sample text fusion features comprises:

performing weighted average pooling on the plurality of first sample text features based on the plurality of first sample fusion weights to obtain the plurality of first sample text fusion features.

18. The method of claim 17, wherein a first sample hyper-parameter corresponding to the weighted average pooling of the plurality of first sample text features is determined based on the plurality of first sample visual features, the first sample hyper-parameter comprising a pooling window size and a step size of the weighted average pooling of the plurality of first sample text features.

19. The method of claim 15, wherein the plurality of first sample visual fusion features are derived based on the plurality of first sample text fusion features.

20. The method of claim 19, wherein fusing, based on the plurality of first sample text features, at least one group of adjacent first sample visual features among the plurality of first sample visual features, respectively, using the neural network model to obtain the plurality of first sample visual fusion features comprises:

generating a plurality of first sample visual fusion weights in one-to-one correspondence with the plurality of first sample visual features based on the plurality of first sample text fusion features; and

based on the plurality of first sample visual fusion weights, performing visual weighted average pooling on the plurality of first sample visual features to obtain the plurality of first sample visual fusion features,

wherein a second sample hyper-parameter corresponding to the weighted average pooling of the plurality of first sample visual features is determined based on the plurality of first sample text fusion features, the second sample hyper-parameter comprising a pooling window size and a step size of the weighted average pooling of the plurality of first sample visual features.

21. The method of claim 15, wherein extracting a plurality of first sample text features characterizing text modality information of sample text content included by each sample text region of the plurality of sample text regions comprises:

performing word segmentation on the sample text content included in each sample text region of the plurality of sample text regions to obtain at least one sample text word corresponding to the sample text region;

performing word embedding on the at least one sample text word corresponding to each sample text region of the plurality of sample text regions to obtain at least one sample word embedding feature corresponding to the sample text region; and

obtaining the plurality of first sample text features based on the at least one sample word embedding feature corresponding to each sample text region of the plurality of sample text regions.

22. The method of claim 21, wherein the plurality of first sample text features are in one-to-one correspondence with the plurality of sample text regions, the plurality of first sample visual features are in one-to-one correspondence with the plurality of sample text regions,

wherein obtaining the plurality of first sample text features based on the at least one sample word embedding feature corresponding to each sample text region of the plurality of sample text regions comprises:

fusing the at least one sample word embedding feature corresponding to each sample text region of the plurality of sample text regions to obtain a first sample text feature corresponding to the sample text region.

23. The method of claim 15, wherein deriving a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features comprises:

upsampling the plurality of second sample visual features to obtain a plurality of sample target visual extension features, the number of the plurality of sample target visual extension features being the same as the number of the plurality of first sample visual features;

upsampling the plurality of second sample text features to obtain a plurality of sample target text extension features, the number of the plurality of sample target text extension features being the same as the number of the plurality of first sample text features; and

obtaining the predicted image processing result based on the plurality of sample target visual extension features and the plurality of sample target text extension features.

24. The method of claim 23, wherein upsampling the plurality of second sample visual features to obtain a plurality of sample target visual extension features comprises:

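copying at least a portion of the plurality of second sample visual features and splicing the copies with the plurality of second sample visual features to obtain the plurality of sample target visual extension features,

wherein upsampling the plurality of second sample text features to obtain a plurality of sample target text extension features comprises:

copying at least a portion of the plurality of second sample text features and splicing the copies with the plurality of second sample text features to obtain the plurality of sample target text extension features.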

25. The method of claim 23, wherein deriving the predicted image processing result based on the plurality of sample target visual extension features and the plurality of sample target text extension features comprises:

enhancing the plurality of sample target visual extension features and the plurality of sample target text extension features based on an attention mechanism to obtain a plurality of sample target visual features and a plurality of sample target text features; and

obtaining the predicted image processing result based on the plurality of sample target visual features and the plurality of sample target text features.

26. The method of claim 15, wherein deriving a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features comprises:

based on the plurality of second sample visual features, respectively fusing at least one group of adjacent second sample text features in the plurality of second sample text features to obtain a plurality of second sample text fusion features, wherein the number of the plurality of second sample text fusion features is smaller than that of the plurality of second sample text features;

based on the plurality of second sample text features, respectively fusing at least one group of adjacent second sample visual features in the plurality of second sample visual features to obtain a plurality of second sample visual fusion features, wherein the number of the plurality of second sample visual fusion features is smaller than that of the plurality of second sample visual features;

enhancing the plurality of second sample visual fusion features and the plurality of second sample text fusion features based on an attention mechanism to obtain a plurality of third sample visual features and a plurality of third sample text features;

upsampling the plurality of third sample visual features to obtain a plurality of sample target visual features, the number of the plurality of sample target visual features being the same as the number of the plurality of first sample visual features;

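upsampling the plurality of third sample text features to obtain a plurality of sample target text features, the number of the plurality of sample target text features being the same as the number of the plurality of first sample text features; and

obtaining the predicted image processing result based on the plurality of sample target visual features and the plurality of sample target text features.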

27. The method of claim 26, wherein upsampling the plurality of third sample visual features to obtain a plurality of sample target visual features comprises:

upsampling the plurality of third sample visual features to obtain a plurality of sample intermediate visual features, the number of the plurality of sample intermediate visual features being the same as the number of the plurality of second sample visual features;

fusing the plurality of second sample visual features with corresponding sample intermediate visual features in the plurality of sample intermediate visual features respectively to obtain a plurality of sample target visual fusion features; and

upsampling the plurality of sample target visual fusion features to obtain the plurality of sample target visual features,

wherein upsampling the plurality of third sample text features to obtain a plurality of sample target text features comprises:

upsampling the plurality of third sample text features to obtain a plurality of sample intermediate text features, the number of the plurality of sample intermediate text features being the same as the number of the plurality of second sample text features;

fusing the plurality of second sample text features with corresponding sample intermediate text features in the plurality of sample intermediate text features, respectively, to obtain a plurality of sample target text fusion features; and

upsampling the plurality of sample target text fusion features to obtain the plurality of sample target text features.

28. The method of claim 15, wherein enhancing the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on an attention mechanism to obtain a plurality of second sample visual features and a plurality of second sample text features comprises:

enhancing a plurality of sample visual query features corresponding to the plurality of first sample visual fusion features based on a plurality of sample text key features and a plurality of sample text value features corresponding to the plurality of first sample text fusion features to obtain the plurality of second sample visual features; and

enhancing a plurality of sample text query features corresponding to the plurality of first sample text fusion features based on a plurality of sample visual key features and a plurality of sample visual value features corresponding to the plurality of first sample visual fusion features to obtain the plurality of second sample text features.

29. An image processing apparatus comprising:

a text recognition unit configured to perform text recognition on a target image to obtain a plurality of text regions and text contents included in each of the plurality of text regions;

a first extraction unit configured to extract a plurality of first visual features characterizing visual modality information of the plurality of text regions;

a second extraction unit configured to extract a plurality of first text features characterizing text modality information of text content included in the plurality of text regions;

a first fusion unit configured to fuse, based on the plurality of first visual features, at least one group of adjacent first text features among the plurality of first text features, respectively, to obtain a plurality of first text fusion features, wherein the number of the plurality of first text fusion features is smaller than that of the plurality of first text features;

a second fusion unit configured to fuse, based on the plurality of first text features, at least one group of adjacent first visual features among the plurality of first visual features, respectively, to obtain a plurality of first visual fusion features, wherein the number of the plurality of first visual fusion features is smaller than that of the plurality of first visual features;

a first enhancement unit configured to enhance the plurality of first visual fusion features and the plurality of first text fusion features based on an attention mechanism to obtain a plurality of second visual features and a plurality of second text features; and

a first processing unit configured to obtain an image processing result based on the plurality of second visual features and the plurality of second text features.

30. A training apparatus for an image processing model, comprising:

a determination unit configured to determine a real image processing result of a sample image, a plurality of sample text regions in the sample image, and text content included in each of the plurality of sample text regions;

a third extraction unit configured to extract a plurality of first sample visual features characterizing visual modality information of the plurality of sample text regions;

a fourth extraction unit configured to extract a plurality of first sample text features characterizing text modality information of sample text content included in the plurality of sample text regions;

a third fusion unit configured to fuse, based on the plurality of first sample visual features, at least one group of adjacent first sample text features among the plurality of first sample text features, respectively, using a neural network model to obtain a plurality of first sample text fusion features, the number of the plurality of first sample text fusion features being less than the number of the plurality of first sample text features;

a fourth fusion unit configured to fuse, based on the plurality of first sample text features, at least one group of adjacent first sample visual features among the plurality of first sample visual features, respectively, using the neural network model to obtain a plurality of first sample visual fusion features, the number of the plurality of first sample visual fusion features being less than the number of the plurality of first sample visual features;

a second enhancement unit configured to enhance, using the neural network model, the plurality of first sample visual fusion features and the plurality of first sample text fusion features based on an attention mechanism to obtain a plurality of second sample visual features and a plurality of second sample text features;

a second processing unit configured to obtain a predicted image processing result based on the plurality of second sample visual features and the plurality of second sample text features; and

a parameter adjustment unit configured to adjust parameters of the neural network model based on the real image processing result and the predicted image processing result to obtain an image processing model.

31. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor, wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.

32. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-14.

33. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-14.