CN113822143A - Text image processing method, device, equipment and storage medium

Info

Publication number: CN113822143A
Application number: CN202110872093.7A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 赵志远, 王洪振, 黄珊
Assignee (current and original): Tencent Technology Shenzhen Co Ltd
Prior art keywords: text, region, feature, result, initial

Classifications

    • G06F18/2415: Pattern recognition; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; learning methods


Abstract

The embodiments of the present application provide a text image processing method, apparatus, device, and storage medium, applicable to fields such as image processing, cloud computing, and artificial intelligence. The method comprises the following steps: extracting an initial feature map of a text image to be processed; determining, according to the initial feature map, text content features and spatial position features of each text region in at least one text region contained in the text image to be processed; for each text region, concatenating the text content features and the spatial position features of the text region to obtain the region features of the text region; and determining a sorting result for the text regions based on the region features of each text region, obtaining a text recognition result for each text region, and ordering the text recognition results of the text regions based on the sorting result to obtain the text recognition result of the image to be processed. With the embodiments of the present application, the text content in a text image can be effectively ordered, and the applicability is high.

Description

Text image processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for processing a text image.
Background
Optical Character Recognition (OCR) layout analysis has long been an important research direction. With the development of science and technology, information has grown explosively; when facing large volumes of OCR text, manual extraction of content information is very limited in capacity and high in cost, so more and more OCR layout analysis methods have been developed. Existing OCR layout analysis schemes include text classifiers based on CNNs (Convolutional Neural Networks) or GCNs (Graph Convolutional Networks) that classify the text, tables, and pictures (or titles, authors, abstracts, etc.) in a document.
However, most existing OCR layout analysis solutions only classify the text content of OCR. The overall ordering of discrete text information is also very important in layout analysis, yet there is no effective solution to the problem of ordering discrete text.
Disclosure of Invention
The embodiments of the present application provide a text image processing method, apparatus, device, and storage medium, which can effectively order the text content in text images and have high applicability.
In one aspect, an embodiment of the present application provides a method for processing a text image, where the method includes:
extracting an initial feature map of a text image to be processed;
determining, according to the initial feature map, text content features and spatial position features of each text region in at least one text region contained in the text image to be processed;
for each text region, concatenating the text content features and the spatial position features of the text region to obtain the region features of the text region;
determining a sorting result of the text regions based on the region features of each text region, wherein the sorting result represents the output order of the text recognition results of the text regions;
and acquiring a text recognition result of each text region, and ordering the text recognition results of the text regions based on the sorting result to obtain the text recognition result of the image to be processed.
On the other hand, an embodiment of the present application provides a text image processing apparatus, including:
an initial feature map extraction module, configured to extract an initial feature map of the text image to be processed;
an initial feature map processing module, configured to determine, according to the initial feature map, text content features and spatial position features of each text region in at least one text region contained in the text image to be processed;
a region feature determination module, configured to concatenate, for each text region, the text content features and the spatial position features of the text region to obtain the region features of the text region;
a sorting result determination module, configured to determine a sorting result of the text regions based on the region features of each text region, where the sorting result represents the output order of the text recognition results of the text regions;
and a text sorting module, configured to acquire the text recognition result of each text region and order the text recognition results of the text regions based on the sorting result, to obtain the text recognition result of the image to be processed.
Wherein, the initial feature map processing module is configured to:
determining the position relationship among the feature points in the initial feature map based on the positions of the feature points in the initial feature map;
performing feature extraction on the initial feature map based on the feature values of the feature points in the initial feature map and the position relationship among the feature points, to obtain a target feature map;
determining a classification result corresponding to each feature point in the target feature map according to the target feature map, wherein the classification result represents that each feature point in the target feature map belongs to a text region or a background region;
determining at least one text region in the target feature map according to the classification result corresponding to each feature point in the target feature map;
for each text region, determining the text content feature corresponding to the text region according to the feature value of each feature point corresponding to the text region in the target feature map, and determining the spatial position feature corresponding to the text region according to the position of each feature point corresponding to the text region in the target feature map.
Optionally, the initial feature map processing module is configured to:
determining the distance between the feature points in the initial feature map based on the positions of the feature points in the initial feature map;
and constructing a graph structure based on the feature values of the feature points in the initial feature map and the distances between them, and performing feature extraction on the graph structure to obtain a target feature map, wherein each feature point in the initial feature map corresponds to one node in the graph structure, and connecting edges are arranged between the nodes corresponding to feature points whose distance is smaller than or equal to a set value.
Optionally, the area characteristic determining module is configured to:
fusing the feature values of the same channel in the feature points corresponding to the text region in the target feature map to obtain fused feature values corresponding to the channels;
and determining the text content characteristics corresponding to the text area based on the fusion characteristic value corresponding to the text area.
Optionally, the sorting result determining module is configured to:
predicting, based on a feature sequence comprising the region features of the text regions, the probability corresponding to each region feature at each time step, and determining the sorting result of the text regions based on the probability corresponding to each region feature at each time step;
wherein the probability corresponding to one region feature at one time step represents the probability that the rank of the text region corresponding to that region feature among the text regions corresponds to that time step.
Optionally, the sorting result determining module is configured to:
encoding the region features of each text region in the feature sequence to obtain an encoding result of the feature sequence;
for each time step, predicting the probability corresponding to each region feature at the time step based on the encoding result and the historical output result corresponding to the time step;
wherein the historical output result corresponding to the first time step is a preset feature, and for each time step other than the first, the historical output result corresponding to the time step comprises the prediction result corresponding to the previous time step, the prediction result being determined based on the region feature of the text region corresponding to the maximum probability among the probabilities at the previous time step.
Optionally, the sorting result determining module is configured to:
encoding the region features of each text region in the feature sequence to obtain the hidden state feature corresponding to each region feature and the encoding result of the feature sequence;
for each time step, determining a first feature corresponding to the time step based on the encoding result and the historical output result corresponding to the time step; performing feature extraction on the hidden state features to obtain the second feature corresponding to each region feature, and obtaining the probability corresponding to each region feature at the time step according to the correlation between the second feature corresponding to each region feature and the first feature.
Optionally, the text sorting module is configured to:
and for each text area, obtaining a text recognition result of the text area based on the text content characteristics of the text area.
Optionally, in the text image processing apparatus, determining, according to the initial feature map, the text content features and spatial position features of each text region in at least one text region contained in the text image to be processed, and concatenating, for each text region, the text content features and spatial position features of the text region to obtain the region features of the text region, are implemented by a graph processing model, where the graph processing model is trained by a model training module, and the model training module is configured to:
acquiring a training sample set, wherein the training sample set comprises at least one sample text image, a text region and a background region of the sample text image are marked with a first sample label, and the first sample label represents a real result of the corresponding region belonging to the text region or the background region;
extracting the sample initial features of each sample text image;
for each sample text image, inputting the sample initial features of the sample text image into an initial graph processing model to obtain a prediction classification result for each feature point in the sample target feature map corresponding to the sample text image, wherein the prediction classification result represents the predicted result of each feature point in the sample target feature map belonging to a text region or a background region;
determining a real result that each feature point in a sample target feature map corresponding to each sample text image belongs to a text region or a background region based on the first sample label of each sample text image;
and determining a first training loss value based on the real result and the prediction result, training the initial graph processing model based on the first training loss value and the training sample set until the first training loss value meets a first training end condition, and determining the model at the end of training as the graph processing model.
Optionally, in the text image processing apparatus, determining the sorting result of the text regions based on the region features of each text region is implemented by a text sorting model, wherein each text region of the sample text image is marked with a second sample label, the second sample label represents the real sorting result of the text regions of the sample text image, the text sorting model is trained by the model training module, and the model training module is configured to:
determining sample feature sequences based on the graph processing model, wherein each sample feature sequence comprises region features of text regions of a sample text image;
for each sample text image, inputting a sample feature sequence corresponding to the sample text image into an initial text sorting model to obtain a prediction sorting result of a text region of the sample text image;
and determining a second training loss value based on the real sorting result and the predicted sorting result, training the initial text sorting model based on the second training loss value and each sample feature sequence, and, when the second training loss value meets a second training end condition, determining the model at the end of training as the text sorting model.
In another aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other;
the memory is used for storing computer programs;
the processor is configured to execute the processing method of the text image provided by the embodiment of the application when the computer program is called.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the text image processing method provided in the embodiment of the present application.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the processing method of the text image provided by the embodiment of the application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
By determining the text content features and spatial position features of each text region in the text image to be processed, the output order of the text recognition results of the text regions can be determined effectively based on the content and spatial position of each text region corresponding to the discrete text information in the image. This improves the accuracy of the output order of the text recognition results and the readability of the resulting text recognition result of the text image, and the applicability is high.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a text image processing method according to an embodiment of the present application;
FIG. 2 is a network architecture diagram of a feature extraction model provided by embodiments of the present application;
FIG. 3 is a flowchart of a method for determining text content features and spatial location features provided by an embodiment of the present application;
FIG. 4 is a network architecture diagram of a graph processing model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a scenario for determining a text region according to an embodiment of the present application;
fig. 6 is a schematic network architecture diagram of a text image processing method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a text image processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is an interdisciplinary field involving a wide range of technologies at both the hardware and software levels. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Optionally, in the embodiments of the present application, ordering the text recognition results of the text regions in the text image to be processed specifically involves Natural Language Processing (NLP), a branch of artificial intelligence. Natural language processing is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flowchart of a text image processing method according to an embodiment of the present application. As shown in fig. 1, a method for processing a text image according to an embodiment of the present application may include the following steps:
and step S11, extracting an initial feature map of the text image to be processed.
In this embodiment of the application, the text image to be processed may be any image containing text information, such as a website screenshot, a web page snapshot, a document picture, or a web page image containing discrete text information, which is not limited herein.
Specifically, after the text image to be processed is obtained, it may be preprocessed based on a feature extraction tool, a feature extraction architecture built on a neural network, or the like, to obtain the initial feature map of the text image to be processed, which at the same time reduces the resolution of the image and the amount of data to be processed.
Feature extraction architectures built on neural networks include architectures based on Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), and the like, such as the VGG-16 architecture based on a CNN. The specific choice may be determined based on the requirements of the actual application scenario, which is not limited herein.
As an example, after the text image to be processed is acquired, it may be processed by a feature extraction model with the VGG-16 architecture to extract its initial feature map. The VGG-16 architecture uses 3 × 3 convolution kernels and 2 × 2 max-pooling layers and comprises 16 weight layers, namely 13 convolutional layers and 3 fully connected layers; under the action of these layers, the initial feature map of the text image to be processed can be extracted effectively and with good generalization.
To ensure that the initial feature map of the text image to be processed takes the form of a point set, the VGG-16-based feature extraction model in the embodiment of the application may remove the fully connected layers of the VGG-16 architecture, thereby constructing a new feature extraction model. As shown in fig. 2, fig. 2 is a network architecture diagram of a feature extraction model provided in the embodiment of the present application. The feature extraction model provided by the embodiment of the present application may be composed of a plurality of CBR network layers and pooling layers (Max Pool), where a CBR network layer is an integrated network layer composed of a convolutional network (Conv), a batch normalization network (BatchNorm), and an activation function (ReLU); different CBRs have different data processing capabilities, and the pooling layers may be max-pooling layers.
As an example, for a high-resolution text image I_s to be processed, the resolution of I_s may be adjusted to 1024 × 1024, so the 3-channel input to the convolutional feature extraction model has data size (1024, 1024, 3). After processing by the model, whose five 2 × 2 pooling stages downsample each spatial dimension by a factor of 32 (1024 / 2^5 = 32), the resulting initial feature map I_cnn has data size (32, 32, 512). The number of channels is then reduced to 8 by multi-layer one-dimensional convolution, so that the final data size of the initial image feature I_cnn of the text image to be processed is (h, w, c) = (32, 32, 8).
Where h denotes the height of the initial image feature, w denotes the width of the initial image feature, and c denotes the number of channels of the initial image feature. It should be particularly noted that the feature extraction model provided in the embodiment of the present application is only an example, and may be determined based on the requirements of an actual application scenario, and is not limited herein.
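As a concrete illustration of the extractor described above, the following is a minimal PyTorch sketch. The exact layer counts and channel widths of fig. 2 are not specified, so those below are illustrative assumptions, and a single 1 × 1 convolution stands in for the multi-layer one-dimensional channel reduction:

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Conv + BatchNorm + ReLU block, as described for the feature extractor."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class FeatureExtractor(nn.Module):
    """VGG-16-style backbone with the fully connected layers removed.
    Five 2x2 max-pool stages downsample a 1024x1024 input to 32x32;
    a final 1x1 convolution reduces the channels from 512 to 8."""
    def __init__(self):
        super().__init__()
        cfg = [(3, 64), (64, 128), (128, 256), (256, 512), (512, 512)]
        stages = []
        for in_ch, out_ch in cfg:
            stages += [CBR(in_ch, out_ch), CBR(out_ch, out_ch), nn.MaxPool2d(2)]
        self.backbone = nn.Sequential(*stages)
        self.reduce = nn.Conv2d(512, 8, kernel_size=1)  # channel reduction 512 -> 8

    def forward(self, x):
        return self.reduce(self.backbone(x))

x = torch.randn(1, 3, 1024, 1024)   # (N, C, H, W) input text image
feat = FeatureExtractor()(x)        # -> (1, 8, 32, 32) initial feature map I_cnn
```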
Step S12, determining text content characteristics and spatial position characteristics of each text area in at least one text area contained in the text image to be processed according to the initial characteristic diagram.
In the embodiment of the application, the initial feature map of the text image to be processed can represent features of various items of information in the image to be processed, such as features of text information and background information. The background information of the text image to be processed includes, but is not limited to, color background information, picture information and other non-text information, and may be determined based on the requirements of the actual application scene, which is not limited herein.
Based on the above, after the initial feature map of the text image to be processed is determined, the text content features and the spatial position features of each text region in the text image to be processed can be determined based on the initial feature map of the text image to be processed.
For example, suppose the text image to be processed contains text paragraph 1, text paragraph 2, and background information. After the initial feature map of the text image to be processed is determined, the text content features and spatial position features of text paragraph 1 and text paragraph 2 can be determined based on the initial feature map.
Fig. 3 shows a specific implementation of determining, according to the initial feature map, the text content features and spatial position features of each text region in at least one text region contained in the text image to be processed. Referring to fig. 3, fig. 3 is a flowchart of a method for determining text content features and spatial position features according to an embodiment of the present application, which specifically includes the following steps:
and step S121, determining the position relation among the characteristic points in the initial characteristic diagram based on the positions of the characteristic points in the initial characteristic diagram.
Specifically, the positional relationship between the feature points in the initial feature map may be represented by the distance between the feature points in the initial feature map. And the distance between the characteristic points in the initial characteristic diagram is the distance between the relative positions of the characteristic points in the initial characteristic diagram.
And S122, extracting the features of the initial feature map based on the feature values of the feature points in the initial feature map and the position relation between the feature points in the initial feature map to obtain a target feature map.
In the embodiment of the present application, in the case where the positional relationship between the feature points in the initial feature map is represented by the distance between the feature points, before feature extraction is performed on the initial feature map, a graph structure may be constructed based on the feature values of the feature points in the initial feature map and the positional relationships between them.
Each feature point in the initial feature map corresponds to one node in the graph structure, the feature value of each feature point serves as the node feature value of the corresponding node, and connecting edges are arranged between the nodes corresponding to feature points whose distance in the initial feature map is smaller than or equal to a set value.
In other words, to determine the connecting edges between the nodes in the graph structure, the distance between any feature point in the initial feature map and every other feature point can be computed, and a connecting edge is established between each pair of feature points whose distance is smaller than or equal to the set value, so that a graph structure consisting of nodes and edges is constructed.
Optionally, the connecting edges between the nodes in the graph structure may also be determined based on the feature values of the feature points in the initial feature map. For example, for two feature points with the same feature value and/or a feature value difference smaller than or equal to a set value, it may be determined that there is a connecting edge between their two corresponding nodes in the graph structure.
Optionally, the connecting edges may also be determined based on both the feature values of the feature points and the distances between them. If the difference between the feature values of two feature points is smaller than or equal to a set value and the distance between them is also smaller than or equal to a set value, it may be determined that there is a connecting edge between their two corresponding nodes in the graph structure.
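Under the distance-based rule above, the graph construction can be sketched as follows. This is a minimal NumPy version; the use of Euclidean grid distance and the threshold value are illustrative assumptions:

```python
import numpy as np

def build_graph(feature_map, dist_thresh=1.5):
    """Build a graph from an (h, w, c) feature map: one node per feature
    point, with an edge between points whose grid distance is <= dist_thresh."""
    h, w, c = feature_map.shape
    nodes = feature_map.reshape(h * w, c)                # node feature values
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # node positions
    # pairwise Euclidean distances between feature-point positions
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    adjacency = (dist <= dist_thresh) & (dist > 0)       # no self-loops
    return nodes, coords, adjacency
```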
In the embodiment of the application, after the graph structure is constructed based on the initial feature map, feature extraction can be performed on the graph structure based on a graph processing model to obtain the target feature map.
The graph processing model includes, but is not limited to, neural network models constructed based on Graph Convolutional Networks (GCN) and related graph processing tools, such as model architectures based on deep graph convolutional networks, densely connected convolutional networks, and the like, and may be determined based on the requirements of the actual application scenario, which is not limited herein. Referring to fig. 4, fig. 4 is a schematic diagram of a network architecture of a graph processing model provided in the embodiment of the present application. The network architecture shown in fig. 4 includes a plurality of dense connection modules, convolutional layers, pooling layers, and the like, connected in sequence. To reduce the vanishing-gradient problem caused by increasing network depth, all layers of the graph processing model can be directly connected while maximizing information transfer between them: each layer concatenates the inputs of all previous layers and then passes its output features to all subsequent layers.
It should be particularly noted that the network structure shown in fig. 4 is only an example of the graph processing model in the embodiment of the present application, and a network architecture of the graph processing model may be specifically determined based on requirements of an actual application scenario, which is not limited herein.
Further, after a graph structure is constructed based on the feature values of the feature points in the initial feature map and the position relationship between the feature points in the initial feature map, the graph structure may be processed based on the graph processing model to perform feature extraction on the graph structure to obtain a target feature map.
As an example, after the text image to be processed is processed by the feature extraction model, the data size of the obtained initial feature map I_cnn is (h, w, c) = (32, 32, 8). Through the graph processing model, a graph structure can be constructed based on the feature values of the feature points in the initial feature map and the positional relationships between them, and feature extraction can be performed on the graph structure to obtain the target feature map.
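A minimal sketch of the dense-connection idea in the graph processing model is given below, assuming a simple normalized-adjacency message-passing layer. The module sizes and layer counts of fig. 4 are not specified, so the names and dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class DenseGCNBlock(nn.Module):
    """Densely connected graph-convolution stack: layer k consumes the
    concatenation of the block input and all earlier layer outputs,
    sketching the dense-connection idea of fig. 4."""
    def __init__(self, in_dim, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(in_dim + k * growth, growth) for k in range(num_layers)
        )

    def forward(self, x, adj_norm):
        # x: (num_nodes, in_dim); adj_norm: normalized adjacency (num_nodes, num_nodes)
        feats = [x]
        for layer in self.layers:
            h = torch.cat(feats, dim=-1)                   # concat all previous inputs
            feats.append(torch.relu(adj_norm @ layer(h)))  # aggregate over neighbours
        return torch.cat(feats, dim=-1)                    # outputs passed onward
```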
And S123, determining a classification result corresponding to each feature point in the target feature map according to the target feature map.
In this embodiment of the present application, each feature point in the target feature map may be classified based on the graph processing model, and whether each feature point belongs to the text region or the background region is determined based on the classification result; that is, the classification result determines whether the information corresponding to each feature point belongs to a text region or a background region in the text image to be processed. The classification result of a feature point in the target feature map characterizes whether that feature point belongs to a text region or the background region.
For each feature point in the target feature map, binary classification is performed on the feature point to obtain its classification result. If the classification result of a feature point is a first value (e.g., 1) or is greater than a set probability value, it may be determined that the feature point belongs to the text region. If the classification result of a feature point is a second value (e.g., 0) or is not greater than the set probability value, it may be determined that the feature point belongs to the background region.
For example, through the graph processing model, an initial feature map I_cnn with data size (h, w, c) = (32, 32, 8) is ultimately treated as 1024 feature points (32 × 32); a graph structure is constructed based on the positional relationships and feature values of these 1024 feature points, and a classification result is then obtained for each of them.
And step S124, determining at least one text area in the target feature map according to the classification result corresponding to each feature point in the target feature map.
In the embodiment of the present application, based on the classification result of each feature point in the target feature map, feature points that have the same classification result and are adjacent to each other may be determined as one connected domain; the target connected domains whose classification result characterizes the text region are then determined from among the connected domains, and each target connected domain is determined as a text region in the target feature map.
As shown in fig. 5, fig. 5 is a schematic view of a scene for determining a text region according to an embodiment of the present application. The initial feature map of the text image to be processed is processed through successive network layers of the feature extraction model (e.g., ResNet- or DenseNet-style layers), where each layer corresponds to one intermediate feature map and the last layer corresponds to the target feature map. The target feature map is further processed to obtain the classification result of each feature point in it. For each feature point in the target feature map, if its classification result indicates that it belongs to the text region, the feature point may be marked with a first identifier (for example, a first preset value); if its classification result indicates that it belongs to the background region, the feature point may be marked with a second identifier (for example, a second preset value). Then, based on the marking results of the feature points in the target feature map, feature points that have the same classification result and are adjacent to each other are determined as one connected domain. If the feature points characterizing the text region are marked with the first preset value 1, the connected domains corresponding to the text regions in the target feature map are as shown in fig. 5.
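A minimal sketch of this connected-domain grouping follows, assuming SciPy's ndimage.label for 4-connected components and a 0.5 probability threshold (both illustrative choices):

```python
import numpy as np
from scipy import ndimage

def text_regions_from_classification(probs, h=32, w=32, thresh=0.5):
    """Group adjacent feature points classified as text into connected
    domains; each connected domain is one candidate text region."""
    mask = probs.reshape(h, w) > thresh          # 1 = text, 0 = background
    labeled, num_regions = ndimage.label(mask)   # 4-connected components
    regions = [np.argwhere(labeled == k + 1) for k in range(num_regions)]
    return regions  # list of (num_points, 2) arrays of (row, col) positions
```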
Step S125, for each text region, determining a text content feature corresponding to the text region according to the feature value of each feature point corresponding to the text region in the target feature map, and determining a spatial position feature corresponding to the text region according to the position of each feature point corresponding to the text region in the target feature map.
In the embodiment of the present application, since the initial feature map includes a plurality of channels, the target feature map likewise includes a feature map corresponding to each channel. In this case, for each text region in the target feature map, the feature value of each feature point corresponding to the text region can be determined per channel.
Further, for the feature points of a text region in the target feature map, the feature values belonging to the same channel may be fused to obtain a fused feature value for that channel. For example, the average of the feature values of the region's feature points on a given channel may be taken as the fused feature value for that channel, thereby obtaining a fused feature value for each of the different channels.
Optionally, for each feature point of the text region in the target feature map, fusing the feature values of the feature points corresponding to the same channel may be performed by taking a maximum value, taking a minimum value, and the like, and may be specifically determined based on the requirements of the actual application scenario, which is not limited herein.
Further, for each text region of the target feature map, a fused feature value sequence may be further obtained based on the fused feature values of the text region corresponding to different channels corresponding to the feature points in the target feature map. And the length of the fusion characteristic value sequence is consistent with the number of channels corresponding to the target characteristic diagram. Based on this, for each text region, the fusion feature value sequence corresponding to the text region may be determined as the text content feature corresponding to the text region.
In the embodiment of the present application, after each text region in the target feature map is determined, the spatial position feature corresponding to the text region may be determined based on the position of each feature point corresponding to each text region in the target feature map.
Specifically, the position of the text region in the target feature map may be determined based on the positions, such as coordinates, of the feature points corresponding to the text region in the target feature map, and the height and width of the text region may likewise be determined from those positions, so that the spatial position feature (x, y, w, h) of the text region is determined based on its position and its corresponding width and height in the target feature map. Here, x and y determine the position of the text region in the target feature map, w represents the width of the text region, and h represents its height.
And step S13, for each text region, splicing the text content features and the spatial position features of the text region to obtain the region features of the text region.
In the embodiment of the present application, for each text region included in the target feature map, the text content feature and the spatial position feature corresponding to the text region may be fused to obtain the region feature of the text region.
When the text content features and the spatial position features corresponding to each text region are fused, the text content features and the spatial position features corresponding to each text region can be spliced to obtain the region features corresponding to the text region.
As an example, the region feature corresponding to each text region in the target feature map may be represented as S_i = CONCAT(s_i, (x, y, w, h)), i ∈ {1, 2, …, n}, where i denotes the index of a text region, CONCAT(s_i, (x, y, w, h)) denotes the concatenation of the text content feature s_i of text region i with its spatial position feature (x, y, w, h), and n denotes the number of text regions in the target feature map.
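A minimal sketch of steps S125 and S13 combined, assuming mean fusion per channel (one of the fusion options named above) and an axis-aligned bounding box for the spatial position feature (x, y, w, h):

```python
import numpy as np

def region_feature(target_map, points):
    """Region feature S_i = CONCAT(s_i, (x, y, w, h)): channel-wise mean of
    the region's feature points, concatenated with its bounding box."""
    rows, cols = points[:, 0], points[:, 1]
    s_i = target_map[rows, cols, :].mean(axis=0)   # fused value per channel
    y, x = rows.min(), cols.min()
    h = rows.max() - rows.min() + 1
    w = cols.max() - cols.min() + 1
    return np.concatenate([s_i, [x, y, w, h]])     # content + spatial features
```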
And step S14, determining the sorting result of each text region based on the region features of each text region.
In the embodiment of the application, after the region features of the text regions in the text image to be processed are determined based on the implementation manner, a feature sequence including the region features of the text regions may be determined, and the ranking result of the text regions is determined based on the feature sequence including the region features of the text regions.
Specifically, the above feature sequence may be input into a text ranking model to determine a ranking result of each text region in the text image to be processed based on the text ranking model. The text sorting model may be a text sorting model constructed based on a Pointer Network (PN), or may be a text sorting model constructed based on other neural Network architectures, such as a Sequence2Sequence architecture, and may be specifically determined based on actual application scene requirements, which is not limited herein.
Specifically, the feature sequence may be input into a text ranking model to predict a probability corresponding to each region feature in the feature sequence at each time step, and a ranking result of the text region may be determined based on the probability corresponding to each region feature at each time step. The probability corresponding to one region feature in the feature sequence at one time step represents the probability that the sequence of the text region corresponding to the region feature in each text region corresponds to the time step.
As an example, if the feature sequence includes a region feature 1 corresponding to the text region 1, a region feature 2 corresponding to the text region 2, and a region feature 3 corresponding to the text region 3. At this time, a feature sequence including the region feature 1, the region feature 2, and the region feature 3 is input into the text ranking model. At a first time step, the probabilities that region feature 1, region feature 2, and region feature 3 correspond to the first time step are determined. If the probability of the region feature 2 corresponding to the first time step is the highest, the rank of the text region corresponding to the region feature 2 is determined as the first rank.
Further, at the second time step, the probabilities that the region feature 1, the region feature 2, and the region feature 3 correspond to the second time step are determined, and further, based on the probability that each region feature corresponds to the second time step, the ranking of the text regions corresponding to the region feature (except the region feature 2) corresponding to the maximum probability is determined as the second ranking, and the rankings of the text regions corresponding to the remaining region features are determined as the third ranking. And sequentially arranging the first sequence, the second sequence and the third sequence to obtain a sequencing result of each text region in the text image to be processed.
In the embodiment of the present application, when determining the probability corresponding to each time step of each region feature in the feature sequence, the region feature of each text region in the feature sequence may be encoded to obtain an encoding result of the feature sequence. And for each time step, determining the probability corresponding to each region feature at the time step based on the coding result and the historical output result corresponding to the current time step.
The historical output result corresponding to the first time step is a preset feature, and for each time step except the first time step, the historical output result corresponding to the time step comprises a prediction result corresponding to a previous time step of the time step, and the prediction result is determined based on the region feature of the text region corresponding to the maximum probability in the probabilities corresponding to the previous time step.
As an example, if the feature sequence includes a region feature 1 corresponding to the text region 1, a region feature 2 corresponding to the text region 2, and a region feature 3 corresponding to the text region 3. At this time, the feature sequence including the region feature 1, the region feature 2 and the region feature 3 is input into the text sorting model, and each region feature in the feature sequence is encoded to obtain an encoding result of the feature sequence. At a first time step, based on the preset features and the coding features, a probability that each region feature corresponds to the first time step may be determined. If the probability of the region feature 2 corresponding to the first time step is the largest, the region feature 2 may be determined as the historical output result corresponding to the next time step.
Further, for the second time step, the probability that each region feature corresponds to the second time step may be determined based on the encoding feature and region feature 2. Among the remaining region features other than region feature 2, if the probability that region feature 1 corresponds to the second time step is the largest, region feature 1 may be determined as the historical output result corresponding to the next time step. By analogy, the probability corresponding to each region feature at each time step can be obtained.
In the embodiment of the present application, when the region features of the text regions in the feature sequence are encoded, the hidden state feature corresponding to each region feature and the encoding result of the feature sequence can be obtained during the encoding process.
For each time step, a first feature corresponding to the time step is determined based on the encoding result and the historical output result corresponding to that time step; this first feature is the attention feature corresponding to the time step. The first feature (attention feature) corresponding to each time step can be expressed as

q_t = W_1 · CONCAT(Z_G, h_{t-1})

where W_1 is a learnable parameter for computing the attention feature, which may be determined based on the actual model architecture and the requirements of the application scenario and is not limited herein; Z_G is the encoding result corresponding to the region features in the feature sequence; and t denotes the time step. When t = 1 (the first time step), the historical output result corresponding to the time step is the preset feature v_input; at each time step after the first (t > 1), the historical output result corresponding to the time step is the prediction result h_{t-1} corresponding to the previous time step.
For each time step, feature extraction can be performed on the hidden state feature corresponding to each region feature to obtain the second feature corresponding to that region feature, which can be expressed as k_i = W_2 · h_i, where W_2 is a learnable parameter for computing the second feature, likewise determined based on the actual model architecture and the requirements of the application scenario and not limited herein, and h_i denotes the hidden state feature corresponding to region feature i.
Further, when the probability corresponding to each region feature at a time step is determined based on the encoding result and the historical output result corresponding to that time step, the probability may be determined based on the correlation between the second feature corresponding to each region feature and the first feature.
Specifically, for each time step, the correlation coefficient a_i between the first feature and the second feature corresponding to each region feature can be determined. For any region feature i, if the historical output results of the time steps before the current one already include the prediction corresponding to region feature i (i.e., region feature i attained the maximum probability at some earlier time step), its correlation coefficient a_i at the current time step is the preset value A; in practice, A is typically a large negative value, so that already-output regions receive near-zero probability. Otherwise, the correlation coefficient a_i is

a_i = (q · k_i) / √d

where q is the first feature corresponding to the time step, k_i is the second feature corresponding to region feature i, and d is the dimension of the feature sequence.
Further, for each time step, after the correlation coefficients a_i are determined, the probability of each region feature i corresponding to the time step may be determined based on the softmax function:

p_i = exp(a_i) / Σ_{j=1}^{n} exp(a_j)

where n is the number of region features corresponding to all text regions. The maximum probability is then determined from the probabilities of the region features at the time step, and the historical output result corresponding to the next time step is determined based on it.
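Putting the formulas above together, the decoding loop can be sketched as follows. This is a greedy sketch assuming hidden holds the hidden state features h_i as rows, z_g is the encoding result Z_G, v_input is the preset feature, the preset value A is taken as a large negative number, and the shapes of W1 and W2 are assumptions:

```python
import math
import torch

def pointer_decode(hidden, z_g, v_input, W1, W2, mask_value=-1e9):
    """Greedy pointer-network decoding: at each time step, score every
    region feature against the query q_t, mask already-selected regions,
    and emit the argmax as the next position in the reading order."""
    n, d = hidden.shape
    keys = hidden @ W2.T                      # k_i = W2 * h_i
    prev = v_input                            # historical output at t = 1
    selected = torch.zeros(n, dtype=torch.bool)
    order = []
    for _ in range(n):
        q = W1 @ torch.cat([z_g, prev])       # q_t from encoding + history
        scores = keys @ q / math.sqrt(d)      # a_i = (q . k_i) / sqrt(d)
        scores = scores.masked_fill(selected, mask_value)  # preset value A
        probs = torch.softmax(scores, dim=0)  # p_i at this time step
        i = int(probs.argmax())
        order.append(i)
        selected[i] = True
        prev = hidden[i]                      # h_{t-1} for the next step
    return order                              # reading order of text regions
```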
And step S15, acquiring the text recognition results of the text regions, and sorting the text recognition results of the text regions based on the sorting result to obtain the text recognition result of the image to be processed.
In the embodiment of the present application, the text recognition result of each text region may be obtained based on a natural language processing technology, such as OCR, and the text recognition results of the text regions may then be sorted based on the sorting result of the text regions. That is, the output order of the text recognition results is determined based on the sorting result, and the text recognition results of the text regions are output according to that order.
For each text region, the text recognition result of the text region may be determined based on the text content features corresponding to the text region.
In this embodiment of the application, the determining, in step S12 in fig. 1, of the text content features and spatial position features of each text region in at least one text region contained in the text image to be processed according to the initial feature map, and the concatenating, in step S13, of the text content features and spatial position features of each text region to obtain the region features of the text region, may be implemented by a graph processing model.
The graph processing model is specifically determined in the following way:
acquiring a training sample set, wherein the training sample set comprises at least one sample text image, a text region and a background region of the sample text image are marked with a first sample label, and the first sample label represents a real result that the corresponding region belongs to the text region or the background region;
extracting the sample initial features of each sample text image;
for each sample text image, inputting the sample initial features of the sample text image into an initial graph processing model to obtain a prediction classification result for each feature point in the sample target feature map corresponding to the sample text image, wherein the prediction classification result represents the prediction result of each feature point in the sample target feature map belonging to a text region or a background region;
determining a real result that each feature point in a sample target feature map corresponding to each sample text image belongs to a text region or a background region based on the first sample label of each sample text image;
and determining a first training loss value based on the real result and the prediction result, training the initial graph processing model based on the first training loss value and the training sample set, and determining the model at the end of training as the graph processing model when the first training loss value meets a first training end condition.
For a specific manner of determining the prediction classification result of each feature point in the sample target feature map corresponding to each sample text image, refer to the embodiments of determining the classification result corresponding to each feature point in the target feature map corresponding to the text image to be processed shown in fig. 1 to 2, which is not described herein again.
The first training loss value may be determined based on a cross entropy loss function or other loss functions, and may be determined based on actual application scenario requirements, which is not limited herein.
As an example, the first training loss value may be determined as the cross entropy

$$L_1 = -\frac{1}{n} \sum_{r} \left[ y_{gt}(r) \log y_{pd}(r) + \bigl(1 - y_{gt}(r)\bigr) \log\bigl(1 - y_{pd}(r)\bigr) \right]$$

where $r$ ranges over the feature points in the sample target feature map, $y_{pd}$ denotes the prediction result, $y_{gt}$ denotes the real result, and $n$ denotes the number of feature points in the sample target feature map.
Based on the first training loss value and the sample text images in the training sample set, iterative training can be performed on the initial graph processing model in the manner described above, with the relevant parameters of the initial graph processing model adjusted in each training pass. When the first training loss value meets the training end condition, the model at the end of training can be determined as the final graph processing model. The training end condition may be that the first training loss value reaches a convergence state, or that the first training loss value is lower than a preset threshold, and may be specifically determined based on the requirements of the actual application scenario, which is not limited herein.
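A hedged PyTorch sketch of this training loop is given below. The single-linear-layer GraphProcessingModel stand-in, the feature dimension, the learning rate and the threshold-based end condition are all assumptions for illustration; the patent does not fix the network internals.

```python
import torch
import torch.nn as nn

class GraphProcessingModel(nn.Module):
    """Stand-in for the initial graph processing model: all it must do
    here is emit a text/background probability per feature point."""
    def __init__(self, dim=64):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, feats):              # feats: (n_points, dim)
        return torch.sigmoid(self.head(feats)).squeeze(-1)

model = GraphProcessingModel()
criterion = nn.BCELoss()                   # cross entropy over feature points
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(sample_feats, labels, threshold=0.05):
    """One iteration; labels hold the real result per feature point
    (1.0 = text region, 0.0 = background region)."""
    pred = model(sample_feats)             # prediction classification result
    loss = criterion(pred, labels)         # first training loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item() < threshold         # first training end condition met?
```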
In this embodiment of the application, in step S14 in fig. 1, determining the sorting result of each text region based on the region features of each text region may be implemented by a text sorting model.
Where each text region of each sample text image in the training sample set is marked with a second sample label, the second sample label representing the real sorting result of the text regions of the sample text image, the text sorting model is determined in the following manner:
determining sample feature sequences based on the graph processing model, wherein each sample feature sequence comprises region features of text regions of one sample text image;
for each sample text image, inputting the sample feature sequence corresponding to the sample text image into the initial text sorting model to obtain a predicted sorting result of the text regions of the sample text image;
and determining a second training loss value based on the real sorting result and the predicted sorting result, training the initial text sorting model based on the second training loss value and each sample feature sequence, and determining the model at the end of training as the text sorting model when the second training loss value meets a second training end condition.
For a specific manner of predicting the sorting result of the text region of each sample text image, reference may be made to the embodiments of determining the sorting result of each text region in the text image to be processed shown in fig. 1 to 2, which are not described herein again.
The second training loss value may be determined based on a cross entropy loss function or other loss functions, and may be determined based on actual application scenario requirements, which is not limited herein.
Based on the second training loss value and the sample feature sequences determined with the graph processing model, iterative training can likewise be performed on the initial text sorting model in the manner described above, with the relevant parameters of the initial text sorting model adjusted in each training pass. When the second training loss value meets the training end condition, the model at the end of training can be determined as the final text sorting model. The training end condition may be that the second training loss value reaches a convergence state, or that the second training loss value is lower than a preset threshold, and may be specifically determined based on the requirements of the actual application scenario, which is not limited herein.
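Assuming the text sorting model emits one score per region feature at each time step, the second training loss value can be sketched as a step-wise cross entropy against the real sorting result; the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def second_training_loss(step_logits, true_order):
    """step_logits: (T, n) region scores at each of T time steps.
    true_order:  (T,) index of the region that truly comes t-th.
    Cross entropy between each step's predicted distribution and the
    real sorting result is one loss choice consistent with the text."""
    return F.cross_entropy(step_logits, true_order)

# Example: three regions whose true reading order is region 2, 0, then 1.
logits = torch.randn(3, 3)
loss = second_training_loss(logits, torch.tensor([2, 0, 1]))
```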
Optionally, the sample initial features of each sample text image can also be obtained through a feature extraction model trained jointly with the other models. In this case, before the initial graph processing model, the sample text images in the training sample set are first input into an initial feature extraction model to obtain the sample initial features of each sample text image. The sample initial features are then input into the initial graph processing model to obtain the sample feature sequence of each sample text image, and the initial text sorting model is trained based on the sample feature sequences obtained from the initial graph processing model. After the first training loss value and the second training loss value are respectively determined in the manner described above, a total training loss value is determined based on the first training loss value and the second training loss value.
Iterative training is then performed on the initial feature extraction model, the initial graph processing model and the initial text sorting model according to the total training loss value and the training sample set, with the relevant parameters of all three models adjusted in each training pass. When the total training loss value meets the training end condition, the models at the end of training can be determined as the final feature extraction model, graph processing model and text sorting model. The training end condition may be that the total training loss value reaches a convergence state, or that the total training loss value is lower than a preset threshold, and may be specifically determined based on the requirements of the actual application scenario, which is not limited herein.
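The joint pass might look like the sketch below. All model and label names are placeholders, and summing the two losses without weights is an assumption; the text only states that the total training loss value is determined from the first and second loss values.

```python
import torch
import torch.nn.functional as F

def joint_train_step(images, point_labels, order_labels,
                     feature_extractor, graph_model, sorting_model,
                     optimizer):
    """One iteration over all three models (illustrative sketch)."""
    feats = feature_extractor(images)         # sample initial features
    seq, point_pred = graph_model(feats)      # feature sequence + point classes
    step_logits = sorting_model(seq)          # per-step ordering scores

    loss1 = F.binary_cross_entropy(point_pred, point_labels)  # first loss
    loss2 = F.cross_entropy(step_logits, order_labels)        # second loss
    total = loss1 + loss2                     # total training loss value

    optimizer.zero_grad()
    total.backward()                          # adjusts all three models
    optimizer.step()
    return total.item()
```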
The graph processing model and the text sorting model can be determined in the above manner, and the output order of the text recognition results of the text regions contained in the text image to be processed can then be determined based on the feature extraction model, the graph processing model and the text sorting model. Referring to fig. 6, fig. 6 is a schematic network architecture diagram of a text image processing method according to an embodiment of the present application. As shown in fig. 6, an initial feature map of the text image to be processed may be determined by the feature extraction model. Based on the graph processing model, the text content features and spatial position features of each text region in at least one text region contained in the text image to be processed are determined according to the initial feature map, and for each text region the text content features and spatial position features of the text region are spliced to obtain the region features of the text region. Through the text sorting model, the sorting result of the text regions can be determined based on the feature sequence obtained from the graph processing model, thereby determining the output order of the text recognition results of the text regions in the text image to be processed.
The sample text images in the training sample set may be obtained from user history access records, text image samples or big data, or from a database, cloud storage or blockchain used for storing text images, and may be specifically determined based on the requirements of the actual application scenario, which is not limited herein. A database can be regarded as an electronic filing cabinet, that is, a place for storing electronic files, and can be used to store the training sample set in the present application.
Based on technologies such as data mining in big data, text images can be mined to form a training sample set in the application.
A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms; it is essentially a decentralized database, a chain of data blocks associated using cryptography. In the present application, the data blocks in the blockchain may store the training sample set.
Cloud storage is a new concept extended and developed from the concept of cloud computing: through functions such as cluster application, grid technology and distributed storage file systems, a large number of storage devices of various types in a network (also referred to as storage nodes) are aggregated by application software or application interfaces to work cooperatively and jointly store, among other things, large numbers of text images.
In the embodiment of the application, by determining the text content features and spatial position features of each text region in the text image to be processed, the output order of the text recognition results of the text regions can be determined effectively based on the content and spatial position features of the text regions corresponding to the discrete text information in the text image to be processed. This improves the accuracy of the output order of the text recognition results and the readability of the resulting text recognition result of the text image, and has high applicability.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a text image processing apparatus according to an embodiment of the present application. The processing device of the text image provided by the embodiment of the application comprises:
an initial feature map extraction module 71, configured to extract an initial feature map of the text image to be processed;
an initial feature map processing module 72, configured to determine, according to the initial feature map, text content features and spatial position features of each text region in at least one text region included in the to-be-processed text image;
a region feature determining module 73, configured to splice, for each text region, text content features and spatial position features of the text region to obtain region features of the text region;
a sorting result determining module 74, configured to determine a sorting result of each text region based on a region feature of each text region, where the sorting result represents an output order of a text recognition result of each text region;
the text sorting module 75 is configured to obtain a text recognition result of each text region, and sort the text recognition results of each text region based on the region sorting result to obtain a text recognition result of the image to be processed.
In some embodiments, the initial feature map processing module 72 is configured to:
determining the position relation among the characteristic points in the initial characteristic diagram based on the positions of the characteristic points in the initial characteristic diagram;
extracting the features of the initial feature map based on the feature values of the feature points in the initial feature map and the position relationship between the feature points in the initial feature map to obtain a target feature map;
determining a classification result corresponding to each feature point in the target feature map according to the target feature map, wherein the classification result represents that each feature point in the target feature map belongs to a text region or a background region;
determining at least one text region in the target feature map according to the classification result corresponding to each feature point in the target feature map;
for each text region, determining the text content feature corresponding to the text region according to the feature value of each feature point corresponding to the text region in the target feature map, and determining the spatial position feature corresponding to the text region according to the position of each feature point corresponding to the text region in the target feature map.
In some embodiments, the initial feature map processing module 72 is configured to:
determining the distance between each feature point in the initial feature map based on the position of each feature point in the initial feature map;
and constructing a graph structure based on the characteristic values of the characteristic points in the initial characteristic graph and the distances between the characteristic points in the initial characteristic graph, and performing characteristic extraction on the graph structure to obtain a target characteristic graph, wherein each characteristic point in the initial characteristic graph corresponds to one node in the graph structure, and connecting edges are arranged between the nodes corresponding to the characteristic points of which the distances are smaller than or equal to a set value.
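For illustration, a graph of this kind can be built as follows; the 8-neighbour distance threshold used as the set value is an assumption, and the O(n^2) pairwise-distance computation is kept only for clarity.

```python
import numpy as np

def build_graph(initial_feature_map, set_value=1.5):
    """Nodes are the feature points of the initial feature map;
    connecting edges link feature points whose grid distance is
    <= set_value (1.5 links each point to its 8 neighbours).

    initial_feature_map: (H, W, C) array of feature values.
    Returns node features (H*W, C) and a boolean adjacency matrix.
    """
    H, W, C = initial_feature_map.shape
    node_feats = initial_feature_map.reshape(H * W, C)
    ys, xs = np.mgrid[0:H, 0:W]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adjacency = (dist <= set_value) & (dist > 0)   # connecting edges
    return node_feats, adjacency
```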
In some embodiments, the region characteristic determining module 73 is configured to:
fusing the feature values of the same channel in the feature points corresponding to the text region in the target feature map to obtain fused feature values corresponding to the channels;
and determining the text content characteristics corresponding to the text area based on the fusion characteristic value corresponding to the text area.
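A minimal sketch of this per-channel fusion, assuming mean fusion over the region's feature points (max or sum fusion would fit the description equally well):

```python
import numpy as np

def text_content_feature(target_feature_map, region_mask):
    """target_feature_map: (H, W, C) target feature map.
    region_mask: (H, W) boolean mask of the feature points classified
    into this text region.
    Returns the (C,) fused feature value per channel."""
    return target_feature_map[region_mask].mean(axis=0)
```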
In some embodiments, the above ranking result determining module 74 is configured to:
predicting a probability corresponding to each of the region features at each time step in the feature sequence based on a feature sequence including the region feature of each of the text regions, and determining a ranking result of each of the text regions based on the probability corresponding to each of the region features at each time step;
the probability corresponding to one of the region features at one time step represents the probability that the ranking of the text region corresponding to the region feature in each of the text regions corresponds to the time step.
In some embodiments, the above ranking result determining module 74 is configured to:
coding the region characteristics of each text region in the characteristic sequence to obtain the coding result of the characteristic sequence;
for each time step, predicting the probability corresponding to each region feature at the time step based on the coding result and the historical output result corresponding to the time step;
the historical output result corresponding to the first time step is a preset feature, and for each time step except the first time step, the historical output result corresponding to the time step comprises a prediction result corresponding to the previous time step of the time step, and the prediction result is determined based on the region feature of the text region corresponding to the maximum probability in the probabilities corresponding to the previous time step.
In some embodiments, the above ranking result determining module 74 is configured to:
coding the region characteristics of each text region in the characteristic sequence to obtain the corresponding hidden state characteristics of each region characteristic and the coding result of the characteristic sequence;
for each time step, determining a first feature corresponding to the time step based on the encoding result and the historical output result corresponding to the time step; and performing feature extraction on the hidden state features to obtain the second feature corresponding to each region feature, and obtaining the probability corresponding to each region feature at the time step according to the correlation between the second feature corresponding to each region feature and the first feature.
In some embodiments, the text sorting module 75 is configured to:
and for each text region, obtaining a text recognition result of the text region based on the text content features of the text region.
In some embodiments, the operations of the text image processing apparatus of determining, according to the initial feature map, the text content features and spatial position features of each text region in at least one text region contained in the text image to be processed, and of splicing, for each text region, the text content features and spatial position features of the text region to obtain the region features of the text region, are implemented by a graph processing model. The graph processing model is trained by a model training module, and the model training module is configured to:
acquiring a training sample set, wherein the training sample set comprises at least one sample text image, a text region and a background region of the sample text image are marked with a first sample label, and the first sample label represents a real result of the corresponding region belonging to the text region or the background region;
extracting sample initial characteristics of each sample text image;
for each sample text image, inputting the sample initial features of the sample text image into an initial graph processing model to obtain a prediction classification result for each feature point in the sample target feature map corresponding to the sample text image, wherein the prediction classification result represents the prediction result of each feature point in the sample target feature map belonging to a text region or a background region;
determining a real result that each feature point in a sample target feature map corresponding to each sample text image belongs to a text region or a background region based on the first sample label of each sample text image;
and determining a first training loss value based on the real result and the prediction result, and training the initial graph processing model based on the first training loss value and the training sample set until the first training loss value meets a first training end condition, and determining the model at the end of training as the graph processing model.
In some embodiments, the determination by the text image processing apparatus of the sorting result of each text region based on the region features of each text region is implemented by a text sorting model; each text region of the sample text image is labeled with a second sample label, the second sample label representing the real sorting result of the text regions of the sample text image. The text sorting model is trained by the model training module, and the model training module is configured to:
determining sample feature sequences based on the graph processing model, wherein each sample feature sequence comprises region features of text regions of a sample text image;
for each sample text image, inputting a sample feature sequence corresponding to the sample text image into an initial text sorting model to obtain a prediction sorting result of a text region of the sample text image;
and determining a second training loss value based on the real sorting result and the predicted sorting result, training the initial text sorting model based on the second training loss value and each sample feature sequence, and determining the model at the end of training as the text sorting model when the second training loss value meets a second training end condition.
In a specific implementation, the processing apparatus for text images provided in this embodiment of the present application can execute the processing method for text images provided in this embodiment of the present application through each built-in functional module, and the implementation principles are similar, and are not described herein again.
The processing device for text images provided by the embodiment of the present application may be a computer program (including program code) running in a computer device, for example, the device is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application.
In some embodiments, the text image processing device provided by the embodiments of the present application may be implemented by combining hardware and software. By way of example, the device may be a processor in the form of a hardware decoding processor, programmed to execute the text image processing method provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 8, the electronic device 1000 in this embodiment may include: a processor 1001, a network interface 1004, and a memory 1005; in addition, the electronic device 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, it may also be at least one storage device located remotely from the processor 1001. As shown in fig. 8, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the electronic device 1000 shown in fig. 8, the network interface 1004 may provide a network communication function, the user interface 1003 is an interface for providing input to a user, and the processor 1001 may be configured to call the device control application stored in the memory 1005 to implement the text image processing method provided by the embodiment of the present application.
It should be understood that in some possible embodiments, the processor 1001 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory; for example, the memory may also store device type information.
In a specific implementation, the electronic device 1000 may execute, through each built-in functional module thereof, an implementation manner provided in each step in the text image processing method provided in the embodiment of the present application, which may be specifically referred to the implementation manner provided in each step, and is not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to implement the implementation manners provided in the steps of the text image processing method provided in the embodiment of the present application, which may be specifically referred to the implementation manners provided in the steps, and are not described herein again.
The computer-readable storage medium may be an internal storage unit of the apparatus and/or the electronic device provided in any of the foregoing embodiments, for example, a hard disk or a memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, which are provided on the electronic device. The computer readable storage medium may further include a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), and the like. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the implementation manners provided by the steps in the text image processing method provided by the embodiment of the application.
The terms "first", "second", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or electronic device that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or electronic device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the scope of the present application, which is defined by the appended claims.

Claims (13)

1. A method for processing a text image, comprising:
extracting an initial characteristic diagram of a text image to be processed;
determining text content characteristics and spatial position characteristics of each text region in at least one text region contained in the text image to be processed according to the initial characteristic map;
for each text region, splicing the text content features and the spatial position features of the text region to obtain the region features of the text region;
determining a sorting result of each text region based on the region characteristics of each text region, wherein the sorting result represents the output sequence of the text recognition result of each text region;
and acquiring the text recognition result of each text region, and sorting the text recognition results of each text region based on the sorting result to obtain the text recognition result of the image to be processed.
2. The method according to claim 1, wherein the determining the text content feature and the spatial position feature of each text region in at least one text region included in the text image to be processed according to the initial feature map comprises:
determining the position relation among the characteristic points in the initial characteristic diagram based on the positions of the characteristic points in the initial characteristic diagram;
performing feature extraction on the initial feature map based on the feature values of the feature points in the initial feature map and the position relationship between the feature points in the initial feature map to obtain a target feature map;
determining a classification result corresponding to each feature point in the target feature map according to the target feature map, wherein the classification result represents that each feature point in the target feature map belongs to a text region or a background region;
determining at least one text region in the target feature map according to the classification result corresponding to each feature point in the target feature map;
for each text region, determining text content characteristics corresponding to the text region according to the characteristic values of the characteristic points corresponding to the text region in the target characteristic diagram, and determining spatial position characteristics corresponding to the text region according to the positions of the characteristic points corresponding to the text region in the target characteristic diagram.
3. The method according to claim 2, wherein the determining the position relationship between the feature points in the initial feature map based on the positions of the feature points in the initial feature map comprises:
determining the distance between each feature point in the initial feature map based on the position of each feature point in the initial feature map;
the extracting features of the initial feature map based on the feature values of the feature points in the initial feature map and the position relationship between the feature points in the initial feature map to obtain a target feature map includes:
and constructing a graph structure based on the characteristic values of the characteristic points in the initial characteristic graph and the distances between the characteristic points in the initial characteristic graph, and performing characteristic extraction on the graph structure to obtain a target characteristic graph, wherein each characteristic point in the initial characteristic graph corresponds to one node in the graph structure, and connecting edges are arranged between the nodes corresponding to the characteristic points of which the distances are smaller than or equal to a set value.
4. The method according to claim 2, wherein the target feature map includes feature maps of a plurality of channels, and for each text region, the determining the text content feature corresponding to the text region according to the feature value of each feature point corresponding to the text region in the target feature map includes:
fusing the feature values of the same channel in the feature points corresponding to the text region in the target feature map to obtain fused feature values corresponding to the channels;
and determining the text content characteristics corresponding to the text region based on the fusion characteristic value corresponding to the text region.
5. The method according to claim 1, wherein the determining the sorting result of each text region based on the region feature corresponding to each text region comprises:
predicting the probability corresponding to each region feature at each time step in the feature sequence based on the feature sequence containing the region feature of each text region, and determining the sorting result of each text region based on the probability corresponding to each region feature at each time step;
the probability corresponding to one region feature at one time step characterizes the probability that the ranking of the text regions corresponding to the region features in each text region corresponds to the time step.
6. The method according to claim 5, wherein the predicting the probability corresponding to each region feature in the feature sequence at each time step based on the feature sequence including the region feature of each text region comprises:
coding the region characteristics of each text region in the characteristic sequence to obtain a coding result of the characteristic sequence;
for each time step, predicting the probability corresponding to each region feature at the time step based on the coding result and the historical output result corresponding to the time step;
the historical output result corresponding to the first time step is a preset feature, and for each time step except the first time step, the historical output result corresponding to the time step comprises a prediction result corresponding to the previous time step of the time step, and the prediction result is determined based on the region feature of the text region corresponding to the maximum probability in the probabilities corresponding to the previous time step.
7. The method according to claim 6, wherein the encoding the region feature of each text region in the feature sequence to obtain the encoding result of the feature sequence comprises:
coding the region characteristics of each text region in the characteristic sequence to obtain the corresponding hidden state characteristics of each region characteristic and the coding result of the characteristic sequence;
the predicting, for each time step, a probability corresponding to each of the region features at the time step based on the coding result and a historical output result corresponding to the time step includes:
for each time step, determining a first feature corresponding to the time step based on the coding result and the historical output result corresponding to the time step; and performing feature extraction on the hidden state features to obtain the second feature corresponding to each region feature, and obtaining the probability corresponding to each region feature at the time step according to the correlation between the second feature corresponding to each region feature and the first feature.
8. The method according to claim 1, wherein the obtaining of the text recognition result of each text region comprises:
and for each text region, obtaining a text recognition result of the text region based on the text content characteristics of the text region.
9. The method according to claim 1, wherein the determining, according to the initial feature map, the text content features and the spatial position features of each text region in at least one text region contained in the text image to be processed, and the splicing, for each text region, the text content features and the spatial position features of the text region to obtain the region features of the text region are realized by a graph processing model;
wherein the graph processing model is determined by:
acquiring a training sample set, wherein the training sample set comprises at least one sample text image, a text region and a background region of the sample text image are marked with a first sample label, and the first sample label represents a real result of the corresponding region belonging to the text region or the background region;
extracting sample initial characteristics of each sample text image;
for each sample text image, inputting the sample initial features of the sample text image into an initial graph processing model to obtain a prediction classification result for each feature point in the sample target feature map corresponding to the sample text image, wherein the prediction classification result represents the prediction result of each feature point in the sample target feature map belonging to a text region or a background region;
determining a real result that each feature point in a sample target feature map corresponding to each sample text image belongs to a text region or a background region based on the first sample label of each sample text image;
and determining a first training loss value based on the real result and the prediction result, training the initial graph processing model based on the first training loss value and the training sample set until the first training loss value meets a first training end condition, and determining the model at the end of training as the graph processing model.
10. The method according to claim 9, wherein the determining the sorting result of each text region based on the region feature of each text region is implemented by a text sorting model;
wherein each text region of the sample text image is labeled with a second sample label, the second sample label represents the real sorting result of each text region of the sample text image, and the text sorting model is determined by:
determining sample feature sequences based on the graph processing model, wherein each sample feature sequence comprises the region features of the text regions of one sample text image;
for each sample text image, inputting the sample feature sequence corresponding to the sample text image into an initial text sorting model to obtain a predicted sorting result of the text regions of the sample text image;
and determining a second training loss value based on the real sorting result and the predicted sorting result, training the initial text sorting model based on the second training loss value and each sample feature sequence, and determining the model at the end of training as the text sorting model when the second training loss value meets a second training end condition.
11. An apparatus for processing a text image, the apparatus comprising:
the initial characteristic image extraction module is used for extracting an initial characteristic image of the text image to be processed;
the initial feature map processing module is used for determining text content features and spatial position features of each text region in at least one text region contained in the text image to be processed according to the initial feature map;
the region feature determination module is used for splicing the text content features and the spatial position features of the text regions to obtain the region features of the text regions;
the sorting result determining module is used for determining a sorting result of each text region based on the region features of each text region, the sorting result representing the output order of the text recognition result of each text region;
and the text sorting module is used for acquiring the text recognition result of each text region, and sorting the text recognition results of each text region based on the sorting result to obtain the text recognition result of the image to be processed.
12. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1 to 10 when the computer program is invoked.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any one of claims 1 to 10.
CN202110872093.7A 2021-07-30 2021-07-30 Text image processing method, device, equipment and storage medium Pending CN113822143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110872093.7A CN113822143A (en) 2021-07-30 2021-07-30 Text image processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110872093.7A CN113822143A (en) 2021-07-30 2021-07-30 Text image processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113822143A true CN113822143A (en) 2021-12-21

Family

ID=78924117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110872093.7A Pending CN113822143A (en) 2021-07-30 2021-07-30 Text image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113822143A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373098A (en) * 2021-12-31 2022-04-19 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7185039B2 (en) Image classification model training method, image processing method and apparatus, and computer program
CN108304835B (en) character detection method and device
CN112434721A (en) Image classification method, system, storage medium and terminal based on small sample learning
CN111324696B (en) Entity extraction method, entity extraction model training method, device and equipment
CN110796204B (en) Video tag determining method, device and server
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
CN112949476B (en) Text relation detection method, device and storage medium based on graph convolution neural network
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN115455171B (en) Text video mutual inspection rope and model training method, device, equipment and medium
CN114638960A (en) Model training method, image description generation method and device, equipment and medium
CN115438215A (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN116049397A (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN114611672A (en) Model training method, face recognition method and device
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
CN114998583A (en) Image processing method, image processing apparatus, device, and storage medium
CN113822143A (en) Text image processing method, device, equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN116741396A (en) Article classification method and device, electronic equipment and storage medium
CN115984886A (en) Table information extraction method, device, equipment and storage medium
CN115168609A (en) Text matching method and device, computer equipment and storage medium
CN115410185A (en) Method for extracting specific name and unit name attributes in multi-modal data
CN114913339A (en) Training method and device of feature map extraction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination