CN115422389B - Method and device for processing text image and training method of neural network

Info

Publication number
CN115422389B
CN115422389B (application CN202211382996.8A)
Authority
CN
China
Prior art keywords
text
query
feature
features
target
Prior art date
Legal status
Active
Application number
CN202211382996.8A
Other languages
Chinese (zh)
Other versions
CN115422389A (en)
Inventor
吕鹏原 (Lyu Pengyuan)
章成全 (Zhang Chengquan)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211382996.8A priority Critical patent/CN115422389B/en
Publication of CN115422389A publication Critical patent/CN115422389A/en
Application granted granted Critical
Publication of CN115422389B publication Critical patent/CN115422389B/en

Classifications

    • G06F 16/53 — Information retrieval of still image data; Querying
    • G06F 16/55 — Information retrieval of still image data; Clustering; Classification
    • G06F 16/5846 — Retrieval characterised by metadata automatically derived from the content, using extracted text
    • G06N 3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 30/148 — Character recognition; Image acquisition; Segmentation of character regions
    • G06V 30/19007 — Character recognition using electronic means; Matching; Proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method of processing a text image, a neural network, and a training method thereof, relates to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and can be applied to scenes such as optical character recognition (OCR). The method comprises the following steps: inputting the text image into an encoding sub-network to obtain a first text image feature; inputting a trained plurality of text query features and the first text image feature into a first decoding sub-network to obtain a plurality of query result features corresponding to the plurality of text query features; inputting the plurality of query result features into a classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features; determining at least one target query feature among the plurality of text query features based on the plurality of classification results; and inputting at least one query result feature corresponding to the at least one target query feature, together with the first text image feature, into a prediction sub-network to obtain text image processing results corresponding to the at least one target query feature.

Description

Method and device for processing text image and training method of neural network
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning, image processing, and computer vision, which may be applied to scenes such as OCR, and more particularly to a method of processing a text image using a neural network, a training method of the neural network, a neural network, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
Optical character recognition (OCR), i.e., the detection and recognition of characters in natural scenes, can be widely applied across industries such as education, medical care, and finance. Technologies derived from it, such as recognition of common cards and receipts, automatic document entry, and photo-based question search, have greatly improved the degree of intelligence and the productivity of traditional industries and have facilitated people's daily learning and life. In recent years, although end-to-end character detection and recognition in natural scenes has developed rapidly, many problems remain unsolved.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a method of processing a text image using a neural network, a training method of the neural network, a neural network, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a method of processing a text image using a neural network, the neural network including an encoding sub-network, a first decoding sub-network, a classification sub-network, and a prediction sub-network. The method comprises: inputting the text image into the encoding sub-network to obtain a first text image feature; inputting a trained plurality of text query features and the first text image feature into the first decoding sub-network to obtain a plurality of query result features corresponding to the plurality of text query features; inputting the plurality of query result features into the classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features, wherein each classification result indicates whether the corresponding text query feature has queried a text instance in the text image; determining at least one target query feature among the plurality of text query features based on the plurality of classification results, wherein the at least one target query feature comprises the text query features that have queried a text instance; and inputting at least one query result feature corresponding to the at least one target query feature, together with the first text image feature, into the prediction sub-network to obtain text image processing results corresponding to the at least one target query feature.
According to another aspect of the present disclosure, there is provided a training method of a neural network, the neural network including an encoding sub-network, a first decoding sub-network, a classification sub-network, and a prediction sub-network. The method comprises: acquiring a sample text image and the respective real labels of a plurality of sample texts in the sample text image; inputting the sample text image into the encoding sub-network to obtain a first text image feature; inputting a plurality of text query features and the first text image feature into the first decoding sub-network to obtain a plurality of query result features corresponding to the plurality of text query features; inputting the plurality of query result features into the classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features, wherein each classification result indicates whether the corresponding text query feature has queried a text instance in the text image; inputting the plurality of query result features and the first text image feature into the prediction sub-network to obtain text image processing results corresponding to the plurality of text query features; determining a composite loss value based on the respective real labels of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features; and adjusting parameters of the neural network and the plurality of text query features based on the composite loss value to obtain a trained neural network and a trained plurality of text query features.
According to another aspect of the present disclosure, there is provided a neural network, including: an encoding sub-network configured to process a text image to obtain a first text image feature; a first decoding sub-network configured to process a trained plurality of text query features and the first text image feature to obtain a plurality of query result features corresponding to the plurality of text query features; a classification sub-network configured to process the plurality of query result features to obtain a plurality of classification results corresponding to the plurality of text query features, the classification results indicating whether the corresponding text query features have queried a text instance in the text image; and a prediction sub-network configured to process at least one query result feature corresponding to at least one target query feature and the first text image feature to obtain text image processing results corresponding to the at least one target query feature, wherein the at least one target query feature comprises the text query features, determined from among the plurality of text query features based on the plurality of classification results, that have queried a text instance.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above method.
According to another aspect of the disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the method described above.
According to one or more embodiments of the present disclosure, the first text image feature is decoded using the trained plurality of text query features, the resulting query result features are used to determine which text query features have queried a text instance, and the query result features corresponding to the target query features (i.e., the text query features that have queried a text instance), together with the first text image feature, are then used to obtain the text image processing results. This simplifies the flow of text image processing, realizes end-to-end text image processing that does not depend on detection results, provides stronger robustness, and improves the efficiency and precision of text image processing.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a method of processing a text image according to an example embodiment of the present disclosure;
FIG. 3 shows a flowchart of obtaining text image processing results using a prediction sub-network according to an exemplary embodiment of the present disclosure;
FIG. 4 shows a flowchart of obtaining text instance segmentation results using a text instance segmentation sub-network according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a flow chart of a method of training a neural network according to an example embodiment of the present disclosure;
FIG. 6 illustrates a flow chart for determining a composite loss value based on respective true labels of a plurality of sample texts and text image processing results corresponding to a plurality of text query features according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a block diagram of a neural network, according to an example embodiment of the present disclosure;
FIG. 8 shows a block diagram of a neural network, according to an example embodiment of the present disclosure; and
FIG. 9 sets forth a block diagram of exemplary electronic devices that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, it will be recognized by those of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, existing end-to-end OCR methods generally either use a detection network to locate text regions, extract the features of those regions, and feed them into a recognition network, or detect and recognize individual characters and then join the characters into text through post-processing. These methods depend heavily on the detection result: even a slight drop in detection accuracy causes a large drop in recognition accuracy, so they lack robustness.
To solve the above problems, the present disclosure decodes the first text image feature using the trained plurality of text query features, determines from the resulting query result features which text query features have successfully queried a text instance, and then uses the query result features and the first text image feature to obtain the text image processing results, thereby simplifying the flow of text image processing, realizing end-to-end text image processing that does not depend on detection results, providing stronger robustness, and improving the speed and quality of text image processing.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the method of processing a text image and/or the training method of the neural network to be performed.
In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software-as-a-service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
A user may use client devices 101, 102, 103, 104, 105, and/or 106 to capture text images. The client device may provide an interface that enables the user to interact with it; for example, the user may capture text images with the client device's camera or upload images stored on the client device to the server. The client device may also output information to the user via the interface; for example, it may present the results produced by the text image processing method running on the server for an image the user uploaded. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general-purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart-screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (e.g., GOOGLE Chrome OS), or various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smartphones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. Gaming systems may include a variety of handheld gaming devices, Internet-enabled gaming devices, and the like. A client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general-purpose computers, special-purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server combined with a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The databases 130 may reside in various locations. For example, a database used by the server 120 may be local to the server 120, or may be remote from the server 120 and communicate with it via a network-based or dedicated connection. The databases 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by applications may be of different types, such as key-value stores, object stores, or regular stores backed by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to an aspect of the present disclosure, a method of processing a text image using a neural network is provided. The neural network includes an encoding sub-network, a first decoding sub-network, a classification sub-network, and a prediction sub-network. As shown in fig. 2, the method includes: step S201, inputting the text image into the encoding sub-network to obtain a first text image feature; step S202, inputting a trained plurality of text query features and the first text image feature into the first decoding sub-network to obtain a plurality of query result features corresponding to the plurality of text query features; step S203, inputting the plurality of query result features into the classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features, wherein each classification result indicates whether the corresponding text query feature has queried a text instance in the text image; step S204, determining at least one target query feature among the plurality of text query features based on the plurality of classification results, wherein the at least one target query feature comprises the text query features that have queried a text instance; and step S205, inputting at least one query result feature corresponding to the at least one target query feature, together with the first text image feature, into the prediction sub-network to obtain text image processing results corresponding to the at least one target query feature.
In this way, the first text image feature is decoded using the trained plurality of text query features, the resulting query result features are used to determine which text query features have successfully queried a text instance, and the query result features together with the first text image feature are then used to obtain the text image processing results. This simplifies the text image processing flow, realizes end-to-end text image processing that does not depend on detection results, provides stronger robustness, and improves the speed and quality of text image processing.
The text image may be any image that includes one or more text regions. In particular, the method of processing text images and neural network of the present disclosure can be used to process images of natural scenes containing arbitrarily shaped text.
In some embodiments, the encoding sub-network can be any convolutional neural network (CNN)-based network, Transformer-based network, hybrid of the two, or other network structure that can extract visual features from the original text image as the image feature input to the decoding sub-network. Those skilled in the art can use existing neural networks with image feature extraction capabilities as the encoding sub-network; exemplary encoding sub-networks include VGG, ResNet, DenseNet, MobileNet, ViT, PVT, Swin Transformer, and the like. In addition, the encoding sub-network can be built as required. Before the original text image is input into the encoding sub-network, the text image can be preprocessed to improve the accuracy of the final text image processing result. The text image feature may be a text image feature sequence obtained by blocking (patch) and deforming (reshape) the feature map obtained by encoding the text image; the sequence may include a plurality of text image sub-features corresponding to a plurality of image regions (or pixels) in the text image.
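As a minimal PyTorch sketch of the above (the ResNet-50 backbone, tensor sizes, and variable names are illustrative assumptions, not details fixed by the present disclosure), the encoding sub-network can be any feature extractor whose output feature map is flattened into a sequence of text image sub-features:

    import torch
    import torchvision

    # Assumed backbone; any CNN- or Transformer-based extractor would do.
    backbone = torchvision.models.resnet50(weights=None)
    encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc

    image = torch.randn(1, 3, 512, 512)   # preprocessed text image (B, C, H, W)
    feature_map = encoder(image)           # (1, 2048, 16, 16)

    # "Blocking" and "deformation": flatten spatial positions into a sequence
    # of sub-features, one per image region.
    b, c, h, w = feature_map.shape
    first_text_image_feature = feature_map.flatten(2).transpose(1, 2)  # (B, H*W, C)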
In some embodiments, the first decoding sub-network is similar in function and structure to the decoder of a Transformer network. The first decoding sub-network may decode the text image features based on the trained plurality of text query (Query) features using an attention mechanism to obtain a plurality of query result features in one-to-one correspondence with the text query features. The number of text query features is predetermined during the training phase and indicates the maximum number of text instances that can be detected or recognized in a text image. In one exemplary embodiment, there are N text query features. How the plurality of text query features are trained is described in detail below.
According to some embodiments, the first decoding sub-network may comprise a self-attention sub-network and a cross-attention sub-network. Step S202 of inputting the trained plurality of text query features and the text image features into the first decoding sub-network to obtain a plurality of query result features corresponding to the plurality of text query features includes: processing the plurality of text query features based on a self-attention mechanism using the self-attention sub-network to obtain an enhanced plurality of text query features; and decoding the text image features based on cross-attention with the enhanced plurality of text query features using the cross-attention sub-network to obtain the plurality of query result features.
Therefore, by using the self-attention sub-network, associations between the text query features (and the potential text instances) can be established, improving the neural network's ability to perform downstream tasks such as text detection and text recognition on the text image as a whole, so that more accurate text image processing results can be obtained.
In some embodiments, the text image feature (e.g., the plurality of text image sub-features described above) may be converted into a plurality of text image key (Key) features and a plurality of text image value (Value) features, and the cross-attention sub-network may then perform cross-attention computation over the plurality of text query features (or the enhanced plurality of text query features output by the self-attention sub-network described above), the plurality of text image key features, and the plurality of text image value features to obtain the plurality of query result features.
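A minimal sketch of one such decoding layer, assuming standard multi-head attention (the dimensions and names are illustrative, not taken from the present disclosure):

    import torch
    import torch.nn as nn

    class QueryDecoderLayer(nn.Module):
        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, queries, image_seq):
            # Self-attention relates the text query features to each other,
            # yielding the enhanced text query features.
            enhanced, _ = self.self_attn(queries, queries, queries)
            queries = self.norm1(queries + enhanced)
            # Cross-attention: the image feature sequence supplies the key and
            # value features; the output is the query result features.
            decoded, _ = self.cross_attn(queries, image_seq, image_seq)
            return self.norm2(queries + decoded)

    layer = QueryDecoderLayer()
    queries = torch.randn(1, 100, 256)        # N = 100 trained text query features
    image_seq = torch.randn(1, 16 * 16, 256)  # first text image feature sequence
    query_result_features = layer(queries, image_seq)  # (1, 100, 256)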
In some embodiments, the classification sub-network may be a multi-layer perceptron; it is understood that the classification sub-network may also adopt other network structures, which are not limited herein. The classification sub-network may output a classification result for each query result feature individually, or may process the whole sequence of query result features to obtain the classification result corresponding to each query result feature (i.e., to the corresponding text query feature). A classification result may characterize the probability that the corresponding text query feature has queried a text instance in the text image. In some embodiments, a threshold filter may be used to determine, based on these probabilities, which text query features have queried text instances.
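For example (a sketch under the multi-layer perceptron assumption above; the 0.5 threshold and all sizes are assumed values):

    import torch
    import torch.nn as nn

    classifier = nn.Sequential(              # classification sub-network (an MLP)
        nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
    )

    query_result_features = torch.randn(1, 100, 256)
    scores = classifier(query_result_features).squeeze(-1)  # (1, 100) probabilities
    is_target = scores > 0.5                 # threshold filter: which queries hit
    target_result_features = query_result_features[is_target]  # (K, 256)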
According to some embodiments, the prediction sub-network may comprise a text instance segmentation sub-network. As shown in fig. 3, the step S205 of inputting the at least one query result feature corresponding to the at least one target query feature and the first text image feature into the prediction sub-network to obtain the text image processing results corresponding to the at least one target query feature may include: step S301, inputting the at least one query result feature corresponding to the at least one target query feature and the first text image feature into the text instance segmentation sub-network to obtain a text instance segmentation result corresponding to the at least one target query feature, wherein the text instance segmentation result indicates the text region in the text image corresponding to the text instance queried by the corresponding target query feature. In this way, the text instance segmentation result corresponding to each target query feature can be obtained quickly, and using both the query result features and the first text image feature makes the segmentation results more accurate.
In some embodiments, only the query result features corresponding to the text query features that have queried text instances may be processed to obtain the corresponding text instance segmentation results; alternatively, the query result features corresponding to all N text query features may be processed to obtain N text instance segmentation results. The size of the N text instance segmentation results may be, for example, N × H × W. In some embodiments, H and W may be the same as or less than the height and width of the original text image.
In some embodiments, the text instance segmentation sub-network may obtain, from a query result feature and the text image feature corresponding to a text query feature, an original result of size H × W in which each pixel indicates the probability that the corresponding position in the original text image belongs to the text instance queried by that text query feature; a text envelope, that is, the corresponding text instance segmentation result, can then be obtained by binarizing the original result and extracting connected components.
According to some embodiments, the text instance segmentation sub-network may include a mask embedding sub-network and a second decoding sub-network. As shown in fig. 4, the step S301 of inputting the at least one query result feature corresponding to the at least one target query feature and the first text image feature into the text instance segmentation sub-network to obtain a text instance segmentation result corresponding to the at least one target query feature may include: step S401, inputting the first text image feature into the second decoding sub-network to obtain a second text image feature, wherein the second text image feature has a target length, a target width, and a target channel number; step S402, inputting each of the at least one query result feature into the mask embedding sub-network to obtain at least one mask embedding feature corresponding to the at least one target query feature, wherein the vector length of each of the at least one mask embedding feature is the same as the target channel number; and step S403, for each target query feature of the at least one target query feature, multiplying the mask embedding feature of that target query feature by the second text image feature to obtain a text instance segmentation result corresponding to that target query feature, the text instance segmentation result having the target length and the target width.
Therefore, mask embedding features whose length equals the channel number of the image features are obtained from the query result features, so that the text instance segmentation result corresponding to each target query feature can be obtained quickly and conveniently.
In some embodiments, the mask embedding sub-network may be a multi-layer perceptron; it is understood that the mask embedding sub-network may also adopt other network structures, which are not limited herein. In some embodiments, the size of the second text image feature may be, for example, C1 × H × W, and the size of the mask embedding features may be, for example, N × C1, so that directly multiplying the mask embedding features by the second text image feature yields text instance segmentation results of size N × H × W, that is, N text instance segmentation results each of size H × W.
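Continuing with the example sizes above, a sketch of this multiplication (variable names assumed):

    import torch
    import torch.nn as nn

    C1, H, W, N = 256, 128, 128, 100
    mask_embed = nn.Sequential(              # mask embedding sub-network (an MLP)
        nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, C1)
    )

    query_result_features = torch.randn(N, 256)
    second_text_image_feature = torch.randn(C1, H, W)  # from the second decoding sub-network

    embeddings = mask_embed(query_result_features)     # (N, C1): length equals channel number
    masks = torch.einsum("nc,chw->nhw", embeddings, second_text_image_feature)
    probs = masks.sigmoid()  # (N, H, W): per-pixel probability of each text instance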
According to some embodiments, the prediction sub-network comprises a character segmentation sub-network. Step S205 of inputting the at least one query result feature corresponding to the at least one target query feature and the first text image feature into the prediction sub-network to obtain the text image processing results corresponding to the at least one target query feature may include: step S302, inputting the first text image feature into the character segmentation sub-network to obtain a plurality of character segmentation results, wherein the plurality of character segmentation results correspond to a plurality of character categories, and each of the plurality of character segmentation results indicates the character regions in the text image corresponding to its character category; and step S303, for each target query feature of the at least one target query feature, determining the text content of the text instance queried by that target query feature by using the text instance segmentation result corresponding to that target query feature and the plurality of character segmentation results.
In this way, a plurality of global character segmentation results corresponding to the plurality of character categories are obtained using the first text image feature, and are combined with the text instance segmentation result corresponding to each target query feature to obtain accurate text content for each queried text instance.
In some embodiments, the character categories may include, for example, different Chinese characters, letters, numbers, special symbols, and the like. In one exemplary embodiment, the character categories may include 26 letters and 10 numbers.
According to some embodiments, the step S302 of inputting the first text image feature into the character segmentation sub-network to obtain a plurality of character segmentation results comprises: inputting the first text image feature into the character segmentation sub-network to obtain the plurality of character segmentation results and a background segmentation result, the background segmentation result indicating the plurality of character regions corresponding to the characters in the text image.
In some embodiments, in step S303, for each target query feature of the at least one target query feature, determining the text content of the text instance queried by that target query feature by using the corresponding text instance segmentation result and the plurality of character segmentation results may include: for each target query feature, determining, from the plurality of character regions indicated by the background segmentation result, at least one character region lying within the text region indicated by the text instance segmentation result corresponding to that target query feature; and determining the character category corresponding to each of the at least one character region based on the plurality of character segmentation results, so as to obtain the text content of the text instance queried by that target query feature.
Therefore, by generating a background segmentation result that includes the positions of all characters in the text image and determining the characters contained in each text instance from the text instance segmentation result, the character segmentation results, and the background segmentation result, the recognition accuracy of text content is improved; in particular, accurate recognition results can be output for irregularly shaped or long text.
In one exemplary embodiment, the method of processing a text image of the present disclosure may process the first text image feature of an input text image based on the plurality of text query features to predict N classification results and N text instance segmentation results. The N classification results can be threshold-filtered to retain the positive samples; then, for each queried text instance, the character predictions within the text region indicated by its text instance segmentation result are determined based on the character segmentation results and the background segmentation result, and post-processed to obtain the text content of that text instance.
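A sketch of this readout under assumed tensor layouts (which channel is background, the 0.5 thresholds, and the final grouping step are all assumptions):

    import torch

    C3, H, W = 37, 128, 128                 # e.g. 26 letters + 10 digits + background (assumed)
    char_seg = torch.randn(C3, H, W).softmax(0)  # character + background segmentation
    background = char_seg[-1]                # assumed: last channel is background
    instance_mask = torch.rand(H, W) > 0.5   # one queried text instance's region

    # Character pixels inside this instance: foreground according to the
    # background segmentation, restricted to the text instance's region.
    char_pixels = (background < 0.5) & instance_mask
    best_class = char_seg[:-1].argmax(0)     # (H, W) most likely character class

    # Post-processing (grouping char_pixels into connected character regions
    # and ordering them along the text line) then yields the text content.
    chars_in_instance = best_class[char_pixels]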
According to some embodiments, the first text image feature has a feature length H' less than the target length H and a feature width W' less than the target width W. The second decoding sub-network and the character segmentation sub-network may each be configured to upsample the first text image feature, and the character segmentation results may have the target length and the target width. Since the features extracted by the encoding sub-network usually have a low resolution, upsampling improves recognition accuracy. Further, producing the character segmentation results and the text instance segmentation results at the same resolution yields more accurate text image processing results.
In some embodiments, the character segmentation sub-network may include a third decoding sub-network for upsampling configured to process the first text image feature to obtain a third text image feature having a size of C2 × H × W, and a character segmentation result prediction sub-network configured to process the third text image feature to predict a character segmentation result. The size of the character segmentation result may be C3 × H × W, where C3 is the number of character classes.
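A sketch of such an upsampling character head, with assumed layer choices; C2 is the example channel number from the text, and C3 = 37 (e.g., 26 letters + 10 digits + background) is an assumed class count:

    import torch
    import torch.nn as nn

    C2, C3 = 256, 37
    third_decoding_subnetwork = nn.Sequential(  # upsamples the first text image feature
        nn.Conv2d(2048, C2, kernel_size=1),
        nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        nn.Conv2d(C2, C2, kernel_size=3, padding=1), nn.ReLU(),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )
    char_prediction_head = nn.Conv2d(C2, C3, kernel_size=1)  # one map per class

    first_feature = torch.randn(1, 2048, 16, 16)              # low-resolution H' x W'
    third_feature = third_decoding_subnetwork(first_feature)  # (1, C2, 128, 128)
    char_segmentation = char_prediction_head(third_feature)   # (1, C3, 128, 128)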
In this way, text of any shape and any length can be recognized efficiently and accurately.
According to another aspect of the present disclosure, a method of training a neural network is provided. The neural network includes an encoding sub-network, a first decoding sub-network, a classification sub-network, and a prediction sub-network. As shown in fig. 5, the method includes: step S501, acquiring a sample text image and the respective real labels of a plurality of sample texts in the sample text image; step S502, inputting the sample text image into the encoding sub-network to obtain a first text image feature; step S503, inputting a plurality of text query features and the first text image feature into the first decoding sub-network to obtain a plurality of query result features corresponding to the plurality of text query features; step S504, inputting the plurality of query result features into the classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features, wherein each classification result indicates whether the corresponding text query feature has queried a text instance in the text image; step S505, inputting the plurality of query result features and the first text image feature into the prediction sub-network to obtain text image processing results corresponding to the plurality of text query features; step S506, determining a composite loss value based on the respective real labels of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features; and step S507, adjusting parameters of the neural network and the plurality of text query features based on the composite loss value to obtain a trained neural network and a trained plurality of text query features. It is understood that the operations of steps S502 to S505 in fig. 5 are similar to the operations of steps S201 to S203 and S205 in fig. 2 and are not repeated here.
Therefore, the neural network and the plurality of text query features are trained on samples, so that the trained neural network and the trained plurality of text query features can produce accurate text image processing results.
In some embodiments, the plurality of sample texts may be a plurality of groups of text in the image. The real label of a sample text may include the real text region, the real character regions, and the real character categories of the sample text. The real text region may include, for example, an outer bounding box of the sample text, and may also include information such as the position, shape, and size of the sample text. The real character regions may include outer bounding boxes of the characters in the sample text, and may also include information on the positions, shapes, and sizes of the characters. The real character categories indicate what the characters in the sample text are.
In some embodiments, before training, the plurality of text query features may be generated by random initialization, or may be directly assigned. In an exemplary embodiment, the assignment may be performed using a predetermined function, such as a position encoding method shown in the following formula.
$\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad \mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$

where $pos$ refers to the position of each text query feature among the plurality of text query features, $d_{\mathrm{model}}$ refers to the dimension of each text query feature, and $2i$ and $2i+1$ denote the even-numbered and odd-numbered encoding positions within each text query feature, respectively.
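A sketch of this initialization (the query count N = 100 and dimension 256 are assumed values):

    import torch

    def init_text_query_features(n_queries: int, d_model: int) -> torch.Tensor:
        pos = torch.arange(n_queries, dtype=torch.float32).unsqueeze(1)  # (N, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)             # even dims
        div = torch.pow(10000.0, i / d_model)
        pe = torch.zeros(n_queries, d_model)
        pe[:, 0::2] = torch.sin(pos / div)   # even encoding positions: sine
        pe[:, 1::2] = torch.cos(pos / div)   # odd encoding positions: cosine
        return pe

    text_query_features = torch.nn.Parameter(init_text_query_features(100, 256))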
Training a neural network and a plurality of text query features by utilizing a sample text image and corresponding real labels can enable the trained plurality of text query features to have the capability of querying a text example in the text image.
According to some embodiments, the composite loss value may include a classification loss value corresponding to each of the plurality of text query features. As shown in fig. 6, the step S506 of determining the composite loss value based on the respective real labels of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features may include: step S601, matching the plurality of sample texts with the plurality of text query features based on the plurality of sample texts and the text image processing results corresponding to the plurality of text query features; and step S602, for each text query feature of the plurality of text query features, determining a classification loss value corresponding to that text query feature based on whether its text image processing result was successfully matched with one of the plurality of sample texts.
In this way, after the text image processing results are matched with the sample texts, the classification results are used as a supervision signal to train the neural network and the plurality of text query features, which improves the prediction performance of the trained neural network and yields more effective text query features.
According to some embodiments, the real label may comprise the real text region of the corresponding sample text, and the prediction sub-network comprises a text instance segmentation sub-network. Step S505 of inputting the plurality of query result features and the first text image feature into the prediction sub-network to obtain text image processing results corresponding to the plurality of text query features may include: inputting the plurality of query result features and the first text image feature into the text instance segmentation sub-network to obtain text instance segmentation results corresponding to the plurality of text query features, wherein each text instance segmentation result indicates the text region in the text image corresponding to the corresponding text query feature, and wherein the plurality of sample texts and the plurality of text query features are matched based on the real text regions of the plurality of sample texts and the text regions indicated by the text instance segmentation results corresponding to the plurality of text query features.
In some embodiments, the composite loss value may comprise a text instance segmentation loss value. Step S506 of determining the composite loss value based on the respective real labels of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features may include: step S603, determining a text instance segmentation loss value based on the real text regions of the matched sample texts and the text regions indicated by the text instance segmentation results corresponding to the matched text query features.
In this way, the real text regions of the sample texts are matched with the text instance segmentation results obtained from the query result features and the text image feature, and the real text regions are used as a supervision signal for training, which improves the prediction performance of the trained neural network and yields more effective text query features.
According to some embodiments, the text instance segmentation sub-network may include a mask embedding sub-network and a second decoding sub-network. Inputting the plurality of query result features and the first text image feature into the text instance segmentation sub-network to obtain text instance segmentation results corresponding to the plurality of text query features comprises: inputting the first text image feature into the second decoding sub-network to obtain a second text image feature, wherein the second text image feature has a target length, a target width, and a target channel number; inputting each of the plurality of query result features into the mask embedding sub-network to obtain a plurality of mask embedding features corresponding to the plurality of text query features, wherein the vector length of each of the plurality of mask embedding features is the same as the target channel number; and for each text query feature of the plurality of text query features, multiplying the mask embedding feature of that text query feature by the second text image feature to obtain a text instance segmentation result corresponding to that text query feature, the text instance segmentation result having the target length and the target width.
According to some embodiments, the real label may include the real character regions and real character categories of the characters in the corresponding sample text, and the prediction sub-network may comprise a character segmentation sub-network. Step S505 of inputting the plurality of query result features and the first text image feature into the prediction sub-network to obtain text image processing results corresponding to the plurality of text query features includes: inputting the first text image feature into the character segmentation sub-network to obtain a plurality of character segmentation results, wherein the plurality of character segmentation results correspond to a plurality of character categories, and each of the plurality of character segmentation results indicates the character regions in the text image corresponding to its character category.
In some embodiments, the composite loss value comprises a character segmentation loss value. Step S506 of determining the composite loss value based on the respective real labels of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features includes: step S604, determining a character segmentation loss value based on the plurality of character segmentation results and the real character regions and real character categories of the characters in the plurality of sample texts.
In this way, the neural network and the plurality of text query features are trained with the character segmentation results as a supervision signal, which improves the prediction performance of the trained neural network and yields more effective text query features.
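A sketch of combining these terms into the composite loss value (the particular loss functions and weights are assumed hyper-parameters, not values fixed by the present disclosure):

    import torch.nn.functional as F

    def composite_loss(cls_logits, cls_labels,         # classification loss inputs
                       mask_logits, matched_gt_masks,  # text instance segmentation
                       char_logits, char_labels,       # character segmentation
                       w_cls=1.0, w_mask=1.0, w_char=1.0):
        l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels)
        l_mask = F.binary_cross_entropy_with_logits(mask_logits, matched_gt_masks)
        l_char = F.cross_entropy(char_logits, char_labels)  # per-pixel class labels
        return w_cls * l_cls + w_mask * l_mask + w_char * l_char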
In some embodiments, during training, the global real character regions can directly supervise the character segmentation result prediction. For the text instance segmentation result prediction, however, a label must be assigned dynamically to each text instance. For the real text regions of the M sample texts and the text instance segmentation results corresponding to the N text query features, the Intersection over Union (IoU) may first be computed between each pair to obtain an IoU matrix, and sample texts are then assigned to the N text query features by the Hungarian algorithm. If a text instance segmentation result is matched with a real text region, its ground truth is the assigned sample text, and the label of the classification result of the corresponding text query feature is set to 1; otherwise it is set to 0. For text instance segmentation results that are not matched with any sample text, no segmentation loss value is calculated.
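A sketch of this dynamic assignment (mask shapes are illustrative; whether a minimum-IoU threshold is applied before accepting a match is an implementation assumption):

    import torch
    from scipy.optimize import linear_sum_assignment

    def iou_matrix(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        pred, gt = pred.float(), gt.float()  # (N, H, W) and (M, H, W) binary masks
        inter = torch.einsum("nhw,mhw->nm", pred, gt)
        union = pred.flatten(1).sum(1)[:, None] + gt.flatten(1).sum(1)[None, :] - inter
        return inter / union.clamp(min=1.0)

    pred_masks = torch.rand(100, 128, 128) > 0.5  # N text instance segmentation results
    gt_masks = torch.rand(5, 128, 128) > 0.5      # M real text regions

    ious = iou_matrix(pred_masks, gt_masks)              # (N, M) IoU matrix
    rows, cols = linear_sum_assignment((-ious).numpy())  # Hungarian: maximize total IoU
    cls_labels = torch.zeros(pred_masks.shape[0])
    cls_labels[rows] = 1.0   # matched queries labelled 1; unmatched ones labelled 0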
According to some embodiments, the first text image feature has a feature length less than the target length and a feature width less than the target width, and the second decoding subnetwork and the character segmentation subnetwork are each configured to upsample the first text image feature, wherein the character segmentation result has the target length and the target width.
According to some embodiments, the first decoding subnetwork may comprise a self-attention subnetwork and a cross-attention subnetwork. Step S503 of inputting the plurality of text query features and the text image features into the first decoding subnetwork to obtain a plurality of query result features corresponding to the plurality of text query features may include: processing the plurality of text query features based on a self-attention mechanism by using the self-attention subnetwork to obtain a plurality of enhanced text query features; and decoding, by the cross-attention subnetwork, the text image features with the enhanced plurality of text query features based on cross-attention to obtain the plurality of query result features.
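This two-stage decoding matches the standard query-based (DETR-style) decoder pattern. A minimal sketch under that assumption is shown below; the class name FirstDecoder, the dimensions, and the single-layer depth are illustrative choices, not details from the patent.

```python
import torch
import torch.nn as nn

class FirstDecoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries: torch.Tensor, img_feat: torch.Tensor):
        # queries: (B, N, dim) text query features;
        # img_feat: (B, H*W, dim) flattened text image features.
        enhanced, _ = self.self_attn(queries, queries, queries)
        # Cross-attention: the enhanced queries attend to the image features.
        results, _ = self.cross_attn(enhanced, img_feat, img_feat)
        return results  # (B, N, dim) query result features
```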
According to another aspect of the present disclosure, a neural network is provided. As shown in fig. 7, the neural network 700 includes: an encoding subnetwork 710 configured to process the text image 702 to obtain first text image features; a first decoding subnetwork 720 configured to process the trained plurality of text query features 704 and the first text image features to obtain a plurality of query result features 706 corresponding to the plurality of text query features; a classification subnetwork 730 configured to process the plurality of query result features to obtain a plurality of classification results 708 corresponding to the plurality of text query features, the classification results indicating whether the corresponding text query features queried a text instance in the text image; and a prediction subnetwork 740 configured to process the first text image features and at least one query result feature corresponding to at least one target query feature to obtain text image processing results 712 corresponding to the at least one target query feature, wherein the at least one target query feature comprises text query features, determined among the plurality of text query features based on the plurality of classification results, that queried a text instance. It can be understood that the operations of sub-networks 710 through 740 in fig. 7 are similar to the operations of steps S201, S203, and S205 in fig. 2, and are not described herein again.
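To make the data flow concrete, the following sketch composes the four sub-networks at inference time. The wrapper class, the placeholder sub-modules passed in as callables, and the two-way read-out of the classification results are assumptions for illustration only; batch handling is omitted for clarity.

```python
import torch
import torch.nn as nn

class TextQueryNetwork(nn.Module):
    """High-level composition of the sub-networks of neural network 700."""
    def __init__(self, encoder, first_decoder, classifier, predictor,
                 num_queries: int = 100, dim: int = 256):
        super().__init__()
        self.encoder = encoder              # 710: image -> first image feature
        self.first_decoder = first_decoder  # 720: queries + feature -> results
        self.classifier = classifier        # 730: results -> classifications
        self.predictor = predictor          # 740: feature + results -> output
        # The trained text query features are parameters learned jointly
        # with the rest of the network.
        self.query_feats = nn.Parameter(torch.randn(num_queries, dim))

    def forward(self, text_image: torch.Tensor):
        img_feat = self.encoder(text_image)
        query_results = self.first_decoder(self.query_feats, img_feat)
        cls_scores = self.classifier(query_results)   # (num_queries, 2)
        # Keep only the target query features that queried a text instance.
        hit = cls_scores.argmax(dim=-1) == 1
        return self.predictor(img_feat, query_results[hit])
```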
According to some embodiments, as shown in fig. 8, the prediction subnetwork 840 of the neural network 800 may include a text instance segmentation subnetwork 850 configured to process the first text image feature and at least one query result feature corresponding to the at least one target query feature to obtain a text instance segmentation result corresponding to the at least one target query feature. The text instance segmentation result indicates a text region in the text image corresponding to the text instance queried by the corresponding target query feature. It can be understood that the text image 802, the encoding subnetwork 810, the first decoding subnetwork 820, the plurality of text query features 804, the plurality of query result features 806, the classification subnetwork 830, the plurality of classification results 808, and the text image processing result 812 in fig. 8 are respectively similar to the corresponding network structures in fig. 7, and are not described herein again.
According to some embodiments, the text instance segmentation subnetwork 850 may include: a second decoding subnetwork (not shown in the figure) configured to process the first text image feature to obtain a second text image feature, the second text image feature having a target length, a target width, and a target channel number; and a mask embedding subnetwork (not shown in the figure) configured to process each of the at least one query result feature to obtain at least one mask embedded feature corresponding to the at least one target query feature, wherein the vector length of each of the at least one mask embedded feature is the same as the target channel number. The text instance segmentation subnetwork 850 may be configured to, for each of the at least one target query feature, multiply the mask embedded feature of the target query feature with the second text image feature to obtain a text instance segmentation result corresponding to the target query feature, the text instance segmentation result having the target length and the target width.
According to some embodiments, the prediction subnetwork 840 can include a character segmentation subnetwork 860 configured to process the first text image feature to obtain a plurality of character segmentation results 814, the plurality of character segmentation results corresponding to a plurality of character categories, and each of the plurality of character segmentation results indicating a character region in the text image corresponding to the character category to which the character segmentation result corresponds. The prediction subnetwork 840 may be configured to determine, for each of at least one target query feature, the textual content of the text instance queried by the target query feature using the text instance segmentation result and the plurality of character segmentation results corresponding to the target query feature.
According to some embodiments, the character segmentation sub-network 860 may be configured to process the first text image feature to obtain a plurality of character segmentation results and a background segmentation result, the background segmentation result indicating a plurality of character regions corresponding to a plurality of characters in the text image. The prediction subnetwork 840 can be configured to: for each target query feature in the at least one target query feature, determine, among the plurality of character regions indicated by the background segmentation result, at least one character region located in the text region indicated by the text instance segmentation result corresponding to the target query feature; and determine the character category corresponding to each of the at least one character region based on the plurality of character segmentation results to obtain the text content of the text instance queried by the target query feature.
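The read-out described above can be sketched as follows: intersect the instance's text region with the character foreground, split it into connected components, and vote a category for each component. The connected-component step, the score-sum vote, and the left-to-right ordering are simplifying assumptions, as are all names in the sketch.

```python
import numpy as np
from scipy import ndimage

def read_text(instance_mask, background_seg, char_seg, categories):
    """instance_mask: (H, W) bool text region for one target query feature;
    background_seg: (H, W) bool foreground map of all character regions;
    char_seg: (K, H, W) per-category score maps; categories: K labels."""
    # Character regions that fall inside this text instance's region.
    chars_in_region = np.logical_and(instance_mask, background_seg)
    labeled, n = ndimage.label(chars_in_region)   # connected components
    chars = []
    for comp in range(1, n + 1):
        region = labeled == comp
        # Category with the highest total score over the character region.
        scores = char_seg[:, region].sum(axis=1)
        cx = ndimage.center_of_mass(region)[1]    # x-centroid for ordering
        chars.append((cx, categories[int(scores.argmax())]))
    # Order the characters left-to-right (a simplifying assumption).
    return "".join(c for _, c in sorted(chars))
```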
According to some embodiments, the first text image feature may have a feature length less than a target length and a feature width less than a target width, and the second decoding subnetwork and the character segmentation subnetwork may each be configured to upsample the first text image feature. The character segmentation result may have a target length and a target width.
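Since the first text image feature is smaller than the target size, both sub-networks must upsample it. A minimal sketch of one such upsampling decoder is given below; the two 2x transposed-convolution stages (a 4x overall factor) and the layer choices are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn

class UpsampleDecoder(nn.Module):
    """Upsamples the first text image feature, e.g. from (H/4, W/4) to the
    target length and width (H, W), producing out_ch output channels."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(in_ch // 2, out_ch, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, in_ch, H/4, W/4) -> (B, out_ch, H, W)
        return self.net(x)
```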
According to some embodiments, the first decoding subnetwork 820 may comprise: a self-attention subnetwork (not shown in the figure) configured to process the plurality of text query features based on a self-attention mechanism to obtain an enhanced plurality of text query features; and a cross-attention subnetwork (not shown in the figure) configured to decode the text image features with the enhanced plurality of text query features based on cross-attention to obtain the plurality of query result features.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 9, a block diagram of an electronic device 900 that may serve as a server or a client of the present disclosure will now be described; it is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the device 900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 908 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 909 allows the device 900 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the method of processing a text image and/or the training method of a neural network. For example, in some embodiments, the method of processing a text image and/or the training method of a neural network may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method of processing a text image and/or the training method of a neural network described above can be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of processing a text image and/or the training method of a neural network.
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system that addresses the drawbacks of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatuses are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples, but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. It should be appreciated that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (17)

1. A method of processing a text image using a neural network, the neural network comprising an encoding subnetwork, a first decoding subnetwork, a classification subnetwork, and a prediction subnetwork, the prediction subnetwork comprising a text instance segmentation subnetwork and a character segmentation subnetwork, the method comprising:
inputting a text image into the encoding subnetwork to obtain a first text image feature;
inputting the trained plurality of text query features and the first text image feature into the first decoding subnetwork to obtain a plurality of query result features corresponding to the plurality of text query features;
inputting the plurality of query result features into the classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features, the classification results indicating whether the corresponding text query features queried a text instance in the text image;
determining at least one target query feature in the plurality of text query features based on the plurality of classification results, the at least one target query feature comprising a text query feature that queried a text instance; and
inputting at least one query result feature corresponding to the at least one target query feature and the first text image feature into the prediction sub-network to obtain text image processing results corresponding to the at least one target query feature, comprising:
inputting at least one query result feature corresponding to the at least one target query feature and the first text image feature into the text instance segmentation sub-network to obtain a text instance segmentation result corresponding to the at least one target query feature, wherein the text instance segmentation result indicates a text region in the text image corresponding to a text instance queried by the corresponding target query feature;
inputting the first text image feature into the character segmentation sub-network to obtain a plurality of character segmentation results, the plurality of character segmentation results corresponding to a plurality of character categories, and each of the plurality of character segmentation results indicating a character region in the text image corresponding to the character category to which the character segmentation result corresponds; and
for each target query feature in the at least one target query feature, determining the text content of the text instance queried by the target query feature by using the text instance segmentation result corresponding to the target query feature and the plurality of character segmentation results.
2. The method of claim 1, wherein the text instance segmentation sub-network comprises a mask embedding sub-network and a second decoding sub-network, and wherein inputting the at least one query result feature corresponding to the at least one target query feature and the first text image feature into the text instance segmentation sub-network to obtain the text instance segmentation result corresponding to the at least one target query feature comprises:
inputting the first text image feature into the second decoding subnetwork to obtain a second text image feature, wherein the second text image feature has a target length, a target width and a target channel number;
inputting each query result feature of the at least one query result feature into the mask embedding sub-network to obtain at least one mask embedding feature corresponding to the at least one target query feature, wherein the vector length of each mask embedding feature of the at least one mask embedding feature is the same as the number of target channels; and
for each target query feature in the at least one target query feature, multiplying the mask embedded feature of the target query feature by the second text image feature to obtain a text instance segmentation result corresponding to the target query feature, wherein the text instance segmentation result has the target length and the target width.
3. The method of claim 1, wherein inputting the first text image feature into the character segmentation sub-network to obtain a plurality of character segmentation results comprises:
inputting the first text image feature into the character segmentation sub-network to obtain the plurality of character segmentation results and a background segmentation result, the background segmentation result indicating a plurality of character regions corresponding to a plurality of characters in the text image,
and wherein, for each target query feature in the at least one target query feature, determining the text content of the text instance queried by the target query feature by using the text instance segmentation result corresponding to the target query feature and the plurality of character segmentation results comprises:
for each target query feature in the at least one target query feature, determining at least one character region in the text region indicated by the text instance segmentation result corresponding to the target query feature from the plurality of character regions indicated by the background segmentation result; and
determining the character category corresponding to each of the at least one character region based on the plurality of character segmentation results to obtain the text content of the text instance queried by the target query feature.
4. The method of claim 2, wherein the first text image feature has a feature length that is less than the target length and a feature width that is less than the target width, and wherein the second decoding subnetwork and the character segmentation subnetwork are each configured to upsample the first text image feature, wherein the character segmentation result has the target length and the target width.
5. The method of claim 1, wherein the first decoding subnetwork comprises a self-attention subnetwork and a cross-attention subnetwork, and wherein inputting the trained plurality of text query features and the first text image feature into the first decoding subnetwork to obtain a plurality of query result features corresponding to the plurality of text query features comprises:
processing the plurality of text query features based on a self-attention mechanism by using the self-attention subnetwork to obtain a plurality of enhanced text query features; and
decoding, by the cross-attention subnetwork, the text image feature with the enhanced plurality of text query features based on cross-attention to obtain the plurality of query result features.
6. A method of training a neural network, the neural network comprising an encoding subnetwork, a first decoding subnetwork, a classification subnetwork, and a prediction subnetwork, the prediction subnetwork comprising a text instance segmentation subnetwork and a character segmentation subnetwork, the method comprising:
acquiring a sample text image and a real label of each of a plurality of sample texts in the sample text image, wherein the real label comprises a real text region of the corresponding sample text, and real character regions and real character categories of characters in the corresponding sample text;
inputting the sample text image into the encoding subnetwork to obtain a first text image feature;
inputting a plurality of text query features and the first text image feature into the first decoding subnetwork to obtain a plurality of query result features corresponding to the plurality of text query features;
inputting the plurality of query result features into the classification sub-network to obtain a plurality of classification results corresponding to the plurality of text query features, the classification results indicating whether the corresponding text query features queried a text instance in the text image;
inputting the plurality of query result features and the first text image feature into the prediction subnetwork to obtain text image processing results corresponding to the plurality of text query features, comprising:
inputting the plurality of query result features and the first text image feature into the text instance segmentation sub-network to obtain text instance segmentation results corresponding to the plurality of text query features, wherein the text instance segmentation results indicate text regions in the text image corresponding to the corresponding text query features, and the plurality of sample texts and the plurality of text query features are matched based on real text regions of the plurality of sample texts and text regions of the plurality of text instance segmentation results corresponding to the plurality of text query features; and
inputting the first text image feature into the character segmentation sub-network to obtain a plurality of character segmentation results, the plurality of character segmentation results corresponding to a plurality of character categories, and each of the plurality of character segmentation results indicating a character region in the text image corresponding to the character category to which the character segmentation result corresponds;
determining a comprehensive loss value based on the real label of each of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features, including:
determining a text instance segmentation loss value based on a real text region of the matched sample text and a text region indicated by a text instance segmentation result corresponding to the matched text query feature, wherein the comprehensive loss value comprises the text instance segmentation loss value; and
determining a character segmentation loss value based on the plurality of character segmentation results and real character regions and real character categories of characters in the plurality of sample texts, wherein the comprehensive loss value comprises the character segmentation loss value; and
and adjusting parameters of the neural network and the plurality of text query features based on the comprehensive loss value to obtain a trained neural network and a plurality of trained text query features.
7. The method of claim 6, wherein the comprehensive loss value comprises a classification loss value corresponding to each of the plurality of text query features, and wherein determining the comprehensive loss value based on the real label of each of the plurality of sample texts and the text image processing results corresponding to the plurality of text query features comprises:
matching the plurality of sample texts with the plurality of text query features based on text image processing results corresponding to the plurality of sample texts and the plurality of text query features; and
for each text query feature of the plurality of text query features, determining a classification loss value corresponding to the text query feature based on whether a text image processing result corresponding to the text query feature successfully matches the plurality of sample texts.
8. The method of claim 6, wherein the text instance segmentation sub-network comprises a mask embedding sub-network and a second decoding sub-network, and wherein inputting the plurality of query result features and the first text image feature into the text instance segmentation sub-network to obtain text instance segmentation results corresponding to the plurality of text query features comprises:
inputting the first text image feature into the second decoding subnetwork to obtain a second text image feature, wherein the second text image feature has a target length, a target width and a target channel number;
inputting each query result feature of the plurality of query result features into the mask embedding sub-network to obtain a plurality of mask embedding features corresponding to the plurality of text query features, wherein the vector length of each mask embedding feature of the plurality of mask embedding features is the same as the number of target channels; and
for each text query feature of the plurality of text query features, multiplying the mask embedded feature of the text query feature by the second text image feature to obtain a text instance segmentation result corresponding to the text query feature, wherein the text instance segmentation result has the target length and the target width.
9. The method of claim 8, wherein the first text image feature has a feature length less than the target length and a feature width less than the target width, and wherein the second decoding subnetwork and the character segmentation subnetwork are each configured to upsample the first text image feature, wherein the character segmentation result has the target length and the target width.
10. The method of claim 6, wherein the first decoding subnetwork comprises a self-attention subnetwork and a cross-attention subnetwork, and wherein inputting the plurality of text query features and the first text image feature into the first decoding subnetwork to obtain a plurality of query result features corresponding to the plurality of text query features comprises:
processing the plurality of text query features based on a self-attention mechanism by using the self-attention subnetwork to obtain a plurality of enhanced text query features; and
decoding, by the cross-attention subnetwork, the text image feature with the enhanced plurality of text query features based on cross-attention to obtain the plurality of query result features.
11. An apparatus for processing a text image using a neural network, the apparatus comprising:
an encoding subnetwork configured to process the text image to obtain a first text image feature;
a first decoding subnetwork configured to process the trained plurality of text query features and the first text image features to obtain a plurality of query result features corresponding to the plurality of text query features;
a classification subnetwork configured to process the plurality of query result features to obtain a plurality of classification results corresponding to the plurality of text query features, the classification results indicating whether a text instance is queried in the text image by the corresponding text query feature; and
a prediction subnetwork configured to process the first text image feature and at least one query result feature corresponding to at least one target query feature to obtain text image processing results corresponding to the at least one target query feature, wherein the at least one target query feature comprises text query features, determined among the plurality of text query features based on the plurality of classification results, that queried a text instance, the prediction subnetwork comprising:
a text instance segmentation sub-network configured to process at least one query result feature corresponding to the at least one target query feature and the first text image feature to obtain a text instance segmentation result corresponding to the at least one target query feature, wherein the text instance segmentation result indicates a text region in the text image corresponding to a text instance queried by the corresponding target query feature; and
a character segmentation sub-network configured to process the first text image feature to obtain a plurality of character segmentation results, the plurality of character segmentation results corresponding to a plurality of character categories, and each of the plurality of character segmentation results indicating a character region in the text image corresponding to the character category to which the character segmentation result corresponds,
wherein the prediction sub-network is configured to determine, for each of the at least one target query feature, the text content of the text instance queried by the target query feature using the text instance segmentation result corresponding to the target query feature and the plurality of character segmentation results.
12. The apparatus of claim 11, wherein the text instance segmentation sub-network comprises:
a second decoding subnetwork configured to input the first text image feature into the second decoding subnetwork to obtain a second text image feature, the second text image feature having a target length, a target width, and a target number of channels; and
a mask embedding subnetwork configured to input each of the at least one query result feature into the mask embedding subnetwork, respectively, to obtain at least one mask-embedded feature corresponding to the at least one target query feature, wherein a vector length of each of the at least one mask-embedded feature is the same as the target number of channels,
wherein the sub-network of text instance segmentations is configured to, for each of the at least one target query feature, multiply the mask-embedded feature of that target query feature with the second text image feature to obtain a text instance segmentation result corresponding to that target query feature, the text instance segmentation result having the target length and the target width.
13. The apparatus of claim 11, wherein the character segmentation sub-network is configured to input the first text image feature into the character segmentation sub-network to obtain the plurality of character segmentation results and a background segmentation result, wherein the background segmentation result indicates a plurality of character regions corresponding to a plurality of characters in the text image,
wherein the prediction subnetwork is configured to:
for each target query feature in the at least one target query feature, determining at least one character region in the text region indicated by the text instance segmentation result corresponding to the target query feature from the plurality of character regions indicated by the background segmentation result; and
determining the character category corresponding to each of the at least one character region based on the plurality of character segmentation results to obtain the text content of the text instance queried by the target query feature.
14. The apparatus of claim 12, wherein the first text image feature has a feature length less than the target length and a feature width less than the target width, and wherein the second decoding subnetwork and the character segmentation subnetwork are each configured to upsample the first text image feature, wherein the character segmentation result has the target length and the target width.
15. The apparatus of claim 11, wherein the first decoding subnetwork comprises:
a self-attention subnetwork configured to process the plurality of text query features based on a self-attention mechanism to obtain an enhanced plurality of text query features; and
a cross-attention subnetwork configured to decode the text image features with the enhanced plurality of text query features based on cross-attention to obtain the plurality of query result features.
16. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
17. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1-10.
CN202211382996.8A 2022-11-07 2022-11-07 Method and device for processing text image and training method of neural network Active CN115422389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211382996.8A CN115422389B (en) 2022-11-07 2022-11-07 Method and device for processing text image and training method of neural network


Publications (2)

Publication Number Publication Date
CN115422389A (en) 2022-12-02
CN115422389B (en) 2023-04-07

Family

ID: 84207942

Family Applications (1)

CN202211382996.8A (Active): Method and device for processing text image and training method of neural network

Country Status (1)

CN: CN115422389B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116844168A (en) * 2023-06-30 2023-10-03 北京百度网讯科技有限公司 Text determining method, training method and device for deep learning model
CN116993996B (en) * 2023-09-08 2024-01-12 腾讯科技(深圳)有限公司 Method and device for detecting object in image
CN117274575A (en) * 2023-09-28 2023-12-22 北京百度网讯科技有限公司 Training method of target detection model, target detection method, device and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863437A (en) * 2022-04-21 2022-08-05 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 End-to-end identification method for scene text with any shape
US20210110189A1 (en) * 2019-10-14 2021-04-15 Shenzhen Malong Technologies Co., Ltd. Character-based text detection and recognition
CN114495101A (en) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 Text detection method, and training method and device of text detection network
CN114495102A (en) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition network
CN114724133B (en) * 2022-04-18 2024-02-02 北京百度网讯科技有限公司 Text detection and model training method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant