LU503005B1 - A layout-unconstrained method based on graph reasoning network for reading text block

Publication number: LU503005B1
Application number: LU503005A
Authority: LU (Luxembourg)
Inventors: Ziyan Li, Lianwen Jin
Applicant: Univ South China Tech
Filing date: 2022-11-05
Publication date: 2023-05-05
Other languages: French (fr)

Classifications

    • G06F16/5846: Retrieval of still image data using metadata automatically derived from the content, using extracted text
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N5/02: Knowledge representation; Symbolic representation
    • G06N5/04: Inference or reasoning models
    • G06N5/045: Explanation of inference; Explainable artificial intelligence [XAI]
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V20/63: Scene text, e.g. street names
    • G06V30/16: Character recognition; Image preprocessing
    • G06V30/19187: Graphical models, e.g. Bayesian networks or Markov models


Abstract

The invention discloses a layout-unconstrained method based on a graph reasoning network for reading text blocks, which belongs to the technical field of pattern recognition and artificial intelligence, and comprises the following steps: acquiring text pictures with unconstrained layout, and constructing a convolution network; extracting the visual feature map of the text picture through the convolution network, and performing character recognition on the visual feature map pixel by pixel; based on the output value of the convolution network, optimizing the convolution network with the aggregation cross-entropy loss function to obtain an unordered character set; constructing a graph reasoning network, and reasoning about the relationships between characters in the character set through the graph reasoning network to obtain a character connection set; integrating the character connection set, and translating the integrated character connection set into a reading order to obtain the recognition result of the text picture. The invention can predict a more accurate character sequence, and through the character connection prediction model of the graph reasoning network, the language information in an independent corpus can be mined in depth.

Description

A LAYOUT-UNCONSTRAINED METHOD BASED ON GRAPH
REASONING NETWORK FOR READING TEXT BLOCK
TECHNICAL FIELD
The invention belongs to the technical field of pattern recognition and artificial intelligence, and in particular to a layout-unconstrained method based on graph reasoning network for reading text block.
BACKGROUND
Natural scene character recognition has a wide range of application scenarios and is a research hotspot in the field of artificial intelligence. At present, scene text recognition problems fall into two kinds: regular text recognition and irregular text recognition. Compared with the former, irregular text recognition has aroused more research interest, because in open scenes the problem is more challenging and the corresponding methods have more practical value. With society's increasing pursuit of spiritual culture, text instances in natural scenes have become more and more abundant; the arrangement of characters is no longer limited to regular linear forms, and irregular, artistic character layout designs are broadly observable. In recent years, scholars have explored several important directions for irregular text recognition, including arbitrarily shaped text, perspective-deformed text, low-quality text, disturbed text and so on, and have made remarkable progress. However, no existing literature explores the recognition of text instances with irregular character layout (text block reading).
If the existing mainstream text recognition methods are directly applied to the problem of text block reading, they will not achieve ideal results, for the following reasons. First, the implicit language model of RNN-based methods does not clearly define the linkage relationships between characters, and this language model is also limited by the dictionary capacity of the training samples (synthetic text images). Secondly, in methods based on sequence modeling, the height information of the CNN feature map is usually compressed, which loses the ability to recognize text instances with a two-dimensional layout. Thirdly, the character position cues in the existing open training sets are too uniform, which leads to poor generalization of attention-based methods on text instances with unconstrained character layout. Fourthly, segmentation-based methods lack an explicit language model, so they cannot solve the complicated reading-order problem in text block instances.
SUMMARY
The purpose of the present invention is to provide a layout-unconstrained method based on graph reasoning network for reading text block, so as to solve the problems existing in the prior art.
To achieve the above purpose, the present invention provides a layout-unconstrained method based on graph reasoning network for reading text block, which comprises:
Acquiring text pictures with unconstrained layout, and constructing a convolution network; extracting the visual feature map of the text picture through the convolution network, and performing character recognition on the visual feature map pixel by pixel; based on the output value of the convolution network, optimizing the convolution network with the aggregation cross-entropy loss function to obtain an unordered character set; constructing a graph reasoning network, and reasoning the relationships between characters in the character set through the graph reasoning network to obtain a character connection set; integrating the character connection set, and translating the integrated character connection set into a reading order to obtain the recognition result of the text picture.
Preferably, the process of performing character recognition on the visual feature map pixel by pixel comprises:
Pre-processing the text picture, taking the pre-processed text picture as input, extracting the visual feature map through the convolution network, and converting the depth dimension into the category number of the alphabet through the fully connected layer in the convolution network.
Preferably, the process of obtaining the unordered character set comprises:
Based on the output value of the convolution network and the number of categories, obtaining a probability matrix of character categories, and respectively summing, sorting and counting the probability matrix to obtain an unordered character set.
Preferably, the process of summing, sorting and counting the probability matrix respectively comprises:
Based on the probability matrix, summing the probability matrix along the time step dimension to obtain an aggregated probability vector; optimizing the convolution network through the aggregation cross-entropy loss function, and sorting the aggregated probability vector by the probability values of the character categories to obtain a sorted probability vector; based on the sorted probability vector, performing character counting by integerizing the corresponding probability values element by element to obtain a character sequence length N, and intercepting the first N counted characters to obtain an unordered character set.
Preferably, the character connection set comprises a character global linkage prediction set and a character local linkage prediction set.
Preferably, the process of obtaining the character global linkage prediction set comprises:
Based on the unordered character set, taking each character in the unordered character set as a graph node, where the graph node is characterized by splicing the category embedding and serial number embedding of the corresponding character; through the graph attention layer in the graph reasoning network, performing relationship modeling for each graph node feature to obtain global modeling features, performing nonlinear activation of the global modeling features with the Softmax function, and taking the index of the maximum activation value with the argmax function to obtain the global linkage prediction set of characters.
Preferably, the process of obtaining the character local linkage prediction set comprises:
Taking each character in the unordered character set as a composition anchor point, taking several characters adjacent to the composition anchor point in the local graph as graph nodes, and obtaining normalized node features by subtracting the original features of the composition anchor point from the original features of all graph nodes; through the graph attention layer in the graph reasoning network, performing relationship modeling for the node features of the local graphs to obtain local modeling features, and performing nonlinear activation of the local modeling features with the Sigmoid function to obtain a local linkage prediction set based on the composition anchor points.
Preferably, the process of obtaining the recognition result of the text picture comprises:
Selecting a corresponding linkage prediction set based on the global linkage prediction set of characters and the local linkage prediction set of characters according to the confidence degree of connection classification;
Based on the linkage prediction set, using each character in the linkage prediction set as the start node of a linked list in turn, obtaining the connection predictions of the nodes through recursion to construct new nodes of the linked list, and constructing a unidirectional connection linked list based on the new nodes; calculating the lengths of all unidirectional linked lists, and obtaining the recognition result of the text picture by taking the longest linked list as the reading order of the character set.
The invention has the following technical effects:
According to the invention, text pictures with unconstrained layout are acquired and a convolution network is constructed; the visual feature map of the text picture is extracted through the convolution network, and character recognition is performed on the visual feature map pixel by pixel; based on the output value of the convolution network, the convolution network is optimized with the aggregation cross-entropy loss function to obtain an unordered character set; a graph reasoning network is constructed, and the relationships between characters in the character set are reasoned through the graph reasoning network to obtain a character connection set; the character connection set is integrated and translated into a reading order to obtain the recognition result of the text picture.
The invention can effectively alleviate the problem of scale sensitivity through aggregation cross-entropy character recognition, and can predict a more accurate character sequence through the convolution network probability matrix; through the character connection prediction model of the graph reasoning network, it can break away from the dictionary limitation of the training text pictures and mine the language information in an independent corpus.
BRIEF DESCRIPTION OF THE FIGURES
The drawings that form a part of this application are used to provide a further understanding of this application. The illustrative embodiments of this application and their descriptions are used to explain this application, and do not constitute undue limitations on this application. In the attached drawings:
FIG. 1 is a flowchart of a text block recognition method in an embodiment of the present invention;
FIG. 2 is a flow chart of aggregation cross-entropy character recognition based on top-K sorting in the embodiment of the present invention;
FIG. 3 is a flowchart of an integration module in an embodiment of the present invention.
DESCRIPTION OF THE INVENTION
It should be noted that the embodiments in this application and the features in the embodiments can be combined with each other without conflict. The application will be described in detail with reference to the drawings and examples.
It should be noted that the steps shown in the flowchart of the figures can be executed in a computer system such as a set of computer-executable instructions, and, although a logical sequence is shown in the flowchart, in some cases the steps shown or described can be executed in a sequence different from that given here.
Embodiment 1
As shown in FIG. 1, the embodiment provides a layout-unconstrained method based on graph reasoning network for reading text block, which comprises:
Acquiring text pictures with unconstrained layout, and constructing a convolution network; extracting the visual feature map of the text picture through the convolution network, and performing character recognition on the visual feature map pixel by pixel; based on the output value of the convolution network, optimizing the convolution network with the aggregation cross-entropy loss function to obtain an unordered character set; constructing a graph reasoning network, and reasoning the relationships between characters in the character set through the graph reasoning network to obtain a character connection set; integrating the character connection set, and translating the integrated character connection set into a reading order to obtain the recognition result of the text picture.
In this embodiment, the specific process of the text block recognition method comprises:
Model input
Using synthetic text blocks and public text lines as training samples, and taking text blocks collected in real scenes as test samples; preprocessing each sample and assigning the corresponding character category labels as well as global and local connection relationship labels. The preprocessing of each sample comprises: under the condition of keeping the aspect ratio fixed, scaling the longest side of the text picture in the training and test samples to 256 pixels, and scaling the other side according to the corresponding aspect ratio to obtain the preprocessed sample.
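As a non-limiting illustration, the aspect-ratio-preserving scaling described above may be sketched as follows (Python with the Pillow library is assumed here; the invention does not mandate any particular library):

```python
from PIL import Image

def preprocess(img: Image.Image, long_side: int = 256) -> Image.Image:
    """Scale the longest side to `long_side` pixels, keeping aspect ratio fixed."""
    w, h = img.size
    scale = long_side / max(w, h)
    # Round the short side; the long side becomes exactly `long_side`.
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    return img.resize((new_w, new_h), Image.BILINEAR)
```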
Character recognition module
S2.1, adopting a fully convolutional network as the backbone network to extract the visual features, in which the fully convolutional network adopts the fourth layer to the antepenultimate layer of ResNet-50;
S2.2, connecting the last feature map of the backbone network to a fully connected layer, with the purpose of converting the depth dimension of the last feature map into the number of prediction categories, and obtaining the character probability prediction matrix $y_t^k$, where $t = \{x_i \mid 1 \le i \le HW\}$ is the time step dimension, $H$ and $W$ are the numbers of rows and columns of the probability matrix respectively, and $k = \{x_i \mid 1 \le i \le |C_\varepsilon|\}$ is the category dimension, with $C_\varepsilon$ the alphabet set.
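A minimal sketch of S2.1 and S2.2 is given below, assuming PyTorch and torchvision; the exact ResNet-50 slice used here is an approximation of the "fourth to antepenultimate layer" range described above, not a definitive implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CharRecognizer(nn.Module):
    """Fully convolutional backbone plus per-pixel character classifier (a sketch)."""
    def __init__(self, num_classes: int):  # num_classes = |C_eps|, incl. the blank
        super().__init__()
        resnet = resnet50()
        # Keep an intermediate slice of ResNet-50 as the feature extractor
        # (through layer3, 1024 channels); an approximation of the patent's range.
        self.backbone = nn.Sequential(*list(resnet.children())[:-3])
        self.classifier = nn.Linear(1024, num_classes)  # depth -> category count

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(images)             # (B, 1024, H, W)
        feat = feat.flatten(2).transpose(1, 2)   # (B, H*W, 1024): one pixel per time step
        return self.classifier(feat).softmax(-1) # probability matrix y_t^k
```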
Unordered character set: as shown in FIG. 2,
S3.1, given the probability matrix $y_t^k$, firstly summing over the time step dimension to obtain the aggregated probability vector $y_k$, and optimizing it with the aggregation cross-entropy (ACE) loss function, whose expression is:

$$\mathcal{L}(I, S) = -\sum_{k=1}^{|C_\varepsilon|} \frac{N_k}{T} \ln \frac{y_k}{T}, \qquad y_k = \sum_{t=1}^{T} y_t^k$$

where $I$, $S$ and $N_k$ are the input sample, the sequence label corresponding to the sample, and the frequency of the character $k$ appearing in the sequence label $S$, respectively.
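The ACE loss follows directly from the formula above. The following PyTorch sketch assumes the blank class count is included in the label counts so that each row sums to T:

```python
import torch

def ace_loss(probs: torch.Tensor, counts: torch.Tensor) -> torch.Tensor:
    """Aggregation cross-entropy (ACE) loss, a minimal sketch.

    probs:  (B, T, C) per-time-step class probabilities y_t^k
    counts: (B, C) label frequencies N_k, with the blank class count
            chosen so that each row sums to T
    """
    T = probs.size(1)
    y_k = probs.sum(dim=1)                             # aggregate over time steps
    loss = -(counts / T) * torch.log(y_k / T + 1e-10)  # -(N_k/T) ln(y_k/T)
    return loss.sum(dim=1).mean()
```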
S3.2, sorting $y_k$ in descending order of probability value to obtain a sorted vector $z = F(y_k),\ k \in C_\varepsilon$, where $F(\cdot)$ is a descending sorting function and each element of $z$ consists of a character category and its corresponding probability value $z^k$;
S3.3, calculating the length of the character sequence as $N = T - |C_b|$, where $T$ is the total number of time steps and $|C_b|$ is the predicted number of blank symbols;
S3.4, counting the characters by category based on each element of $z$, in which each probability value $z^k$ is integerized by the function $H(\cdot)$, and the integer value is the predicted count of the corresponding category $C_k$ in the sample; finally, by intercepting the first $N$ counted characters, the unordered character set $S''$ is obtained, expressed as:

$$S'' = M_N\left(H(z^k) * C_k\right)$$

where $M_N(\cdot)$ is the merge function, which merges the counting results into a character list in sequence and intercepts the first $N$ elements. $H(\cdot)$ is an integerizing function, expressed as:

$$H(x) = \begin{cases} 1 & \text{if } \lambda < x < 0.5, \\ \operatorname{round}(x) & \text{otherwise,} \end{cases}$$

where $\operatorname{round}(\cdot)$ is the rounding function and $\lambda$ is a tolerance factor that can improve the regression performance of character recognition.
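A plain-Python sketch of S3.2 through S3.4 follows; the value 0.1 used for the tolerance factor λ is purely illustrative, since the factor is set by experiment:

```python
def decode_unordered_set(z_sorted, T, blank_count, lam=0.1):
    """Recover the unordered character set S'' (steps S3.2-S3.4).

    z_sorted: (category, aggregated probability) pairs in descending
    probability order; blank_count: predicted number of blank symbols;
    lam: tolerance factor lambda (0.1 is an illustrative value).
    """
    def H(x):  # integerize a probability value into a character count
        return 1 if lam < x < 0.5 else round(x)

    N = T - blank_count              # S3.3: character sequence length
    chars = []
    for category, prob in z_sorted:  # S3.4: count characters by category
        chars.extend([category] * H(prob))
    return chars[:N]                 # merge and keep the first N characters
```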
Linkage reasoning module
S4.1, composing the global graph network by taking each character in the above set $S''$ as a graph node, where the feature of each graph node is the concatenation of the category embedding and the serial number embedding of the corresponding character, denoted $h = \{h_1, h_2, \ldots, h_N\}$ with $h_i = \{c_i, s_i\}$, where $c_i$ and $s_i$ are the category feature and serial number feature of node $i$ respectively;
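The node feature construction of S4.1 may be sketched as follows; the embedding dimension of 128 is an illustrative choice, not a value fixed by the method:

```python
import torch
import torch.nn as nn

class NodeEmbedder(nn.Module):
    """Graph node features h_i = [c_i ; s_i] (category + serial number), per S4.1."""
    def __init__(self, num_classes: int, max_chars: int, dim: int = 128):
        super().__init__()
        self.cat_emb = nn.Embedding(num_classes, dim)  # category embedding c_i
        self.idx_emb = nn.Embedding(max_chars, dim)    # serial number embedding s_i

    def forward(self, categories: torch.Tensor, serials: torch.Tensor) -> torch.Tensor:
        # (N,) category ids, (N,) serial numbers -> (N, 2*dim) node features
        return torch.cat([self.cat_emb(categories), self.idx_emb(serials)], dim=-1)
```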
S4.2, using $M$ stacked graph attention layers to model the relationships among the node features $h$ of the global graph, obtaining the modeling features $x = \{x_1, x_2, \ldots, x_N\}$. The expression of the graph attention layer is:

$$y_i^{l+1} = \sigma\left(\sum_{j=1}^{N} \alpha_{ij} W y_j^{l}\right)$$

where $y_i^l$ is the feature of node $i$ in the $l$-th layer, $\sigma(\cdot)$ is a nonlinear activation function, and $\alpha_{ij}$ is the value of the adjacency matrix at row $i$ and column $j$, learned by the attention mechanism. The input feature of the first graph attention layer is $h$.
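A minimal sketch of one such layer appears below. The pairwise scoring used here to learn the adjacency values α_ij, and the use of ReLU for σ, are common choices and are assumptions, as the text above does not fix them:

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    """One layer of y_i^{l+1} = sigma(sum_j alpha_ij * W y_j^l)."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)  # shared linear transform W
        self.att = nn.Linear(2 * dim, 1)          # scores alpha_ij from node pairs

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        n = y.size(0)
        wy = self.W(y)                                         # (N, dim)
        pairs = torch.cat([wy.unsqueeze(1).expand(n, n, -1),   # feature of node i
                           wy.unsqueeze(0).expand(n, n, -1)],  # feature of node j
                          dim=-1)
        alpha = torch.softmax(self.att(pairs).squeeze(-1), dim=-1)  # learned adjacency
        return torch.relu(alpha @ wy)              # sigma is taken as ReLU here
```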
S4.3, after the features $x$ are activated by the Softmax function, the index of the maximum activation is obtained by the argmax function to obtain the global connection prediction $O_g = \{r_1, r_2, \ldots, r_N\}$ of the characters, where $r_i = \operatorname{argmax}(p_i)$ and $p_i$ is the global connection probability of character $i$, expressed as:

$$p_i = \frac{\exp(W x_i)}{\sum_{j} \exp(W x_j)}$$

where $W$ is a learnable linear transformation.
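The global prediction then reduces to a softmax followed by argmax. In this sketch, treating the N characters themselves as the candidate successors (the output dimension of W) is an assumption about the output space:

```python
import torch

def global_linkage(x: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Global connection prediction r_i = argmax(p_i), per S4.3.

    x: (N, dim) global modeling features; W: (dim, N) learnable linear
    transformation (candidate-successor output space is an assumption).
    """
    p = torch.softmax(x @ W, dim=-1)  # p_i = exp(W x_i) / sum_j exp(W x_j)
    return p.argmax(dim=-1)           # index of the predicted successor
```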
S4.4, composing the local graph networks by taking each character of the set $S''$ as an anchor point in turn, and looking for the $K$ characters adjacent to the anchor point as graph nodes, denoted $v = \{v_1, v_2, \ldots, v_K\}$, where the original features of the graph nodes (including the anchor point) are obtained by concatenating the corresponding character category and serial number embeddings. For each local graph, the original feature $h_q$ of the anchor point is subtracted from the original features of all nodes to obtain the normalized node features, denoted $h' = \{h_1 - h_q, h_2 - h_q, \ldots, h_K - h_q\}$, where $K$ is set to 3 by experiment;
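The local graph composition of S4.4 may be sketched as follows; selecting neighbours by Euclidean distance between character centres is an assumption, since the text above only states that the K characters are adjacent to the anchor:

```python
import torch

def build_local_graph(h: torch.Tensor, positions: torch.Tensor, q: int, K: int = 3):
    """Local graph around anchor q: K nearest characters, anchor-normalized.

    h: (N, dim) original node features; positions: (N, 2) character centres,
    used here as the adjacency criterion (an assumption). Returns the
    normalized features h'_i = h_i - h_q for the K neighbours.
    """
    dist = (positions - positions[q]).norm(dim=-1)
    dist[q] = float("inf")                     # exclude the anchor itself
    nbr = dist.topk(K, largest=False).indices  # K nearest neighbours
    return h[nbr] - h[q]                       # normalized node features
```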
S4.5, using $M$ stacked graph attention layers with the same structure as in S4.2 to model the relationships among the node features $h'$ of each local graph, obtaining the modeling features $x'$; the weights of these graph attention layers are independent and are not shared with the attention layers of S4.2.
S4.6, after activating the features with the Sigmoid function, the connection prediction $O_l(q) = \{p'_1, p'_2, \ldots, p'_K\}$ based on the anchor point $q$ is obtained, where $p'_i$ represents the connection probability of node $i$ in the local graph, expressed as:

$$p'_i = \frac{1}{1 + \exp(-W x'_i)}$$

finally, the linkage prediction set $O_l = \{O_l(q_1), O_l(q_2), \ldots, O_l(q_N)\}$ of the $N$ local graphs is obtained. Given the prediction $O_l(q_i)$ of node $i$ in the local graph, the local connection $r'_i$ is further calculated as:

$$r'_i = \begin{cases} \varnothing & \text{if } \max(O_l(q_i)) < 0.5, \\ \operatorname{argmax}_{x \in [1, K]}(p'_x) & \text{otherwise;} \end{cases}$$

therefore, the connection set of the local graph network can be expressed as $O'_l = \{r'_1, r'_2, \ldots, r'_N\}$.
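The local prediction and thresholding of S4.6 take only a few lines; in this sketch, w stands for the learnable transformation applied before the Sigmoid:

```python
import torch

def local_linkage(x_local: torch.Tensor, w: torch.Tensor):
    """Local connection r'_i for one anchor's K-node graph, per S4.6.

    x_local: (K, dim) local modeling features; w: (dim,) learnable weights.
    """
    p = torch.sigmoid(x_local @ w)  # p'_i = 1 / (1 + exp(-w^T x'_i))
    if p.max() < 0.5:
        return None                 # empty set: no confident connection
    return int(p.argmax())          # argmax over the K local nodes
```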
Model output
As shown in FIG. 3, the integration module integrates the connection prediction of the global graph and the local graph, which comprises the following steps:
S5, first, judging whether the local connection $r'_i$ of node $i$ is the empty set: (1) if yes, directly using the global graph connection prediction $r_i$; (2) if not, comparing the confidences of the two connection predictions for selection, where $\lambda$ is an adjustable significance factor that represents the significance of the local graph prediction relative to the global graph prediction, and is set by experiment.
According to the integrated prediction results, the character reading order is obtained through the following steps: firstly, each character is taken as the root node of a new linked list in turn, and new nodes are recursively constructed according to the connection predictions to obtain a linked list; then, the longest connection sequence among all linked lists is taken as the character reading order to obtain the final text block recognition result.
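A plain-Python sketch of the integration module and reading-order recovery of S5 is given below; the exact rule for weighing the two confidences with the significance factor λ is an assumption (written here as λ·local ≥ global):

```python
def reading_order(global_pred, global_conf, local_pred, local_conf, lam=1.0):
    """Integrate global/local linkage predictions and recover reading order.

    global_pred[i] / local_pred[i]: predicted successor of character i
    (None for the empty set); *_conf[i]: connection confidences.
    `lam` is the adjustable significance factor; its value and the exact
    comparison rule are assumptions, as the factor is set by experiment.
    """
    n = len(global_pred)
    nxt = []
    for i in range(n):
        if local_pred[i] is None:                    # S5(1): empty -> use global
            nxt.append(global_pred[i])
        elif lam * local_conf[i] >= global_conf[i]:  # S5(2): compare confidences
            nxt.append(local_pred[i])
        else:
            nxt.append(global_pred[i])

    best = []
    for root in range(n):                 # each character heads a new linked list
        chain, seen = [root], {root}
        while nxt[chain[-1]] is not None and nxt[chain[-1]] not in seen:
            chain.append(nxt[chain[-1]])  # recursively extend the linked list
            seen.add(chain[-1])
        if len(chain) > len(best):
            best = chain                  # keep the longest linked list
    return best                           # node indices in reading order
```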
Technical effects of this embodiment are as follows:
In this embodiment, text pictures with unconstrained layout are acquired and a convolution network is constructed; the visual feature map of the text picture is extracted through the convolution network, and character recognition is performed on the visual feature map pixel by pixel; based on the output value of the convolution network, the convolution network is optimized with the aggregation cross-entropy loss function to obtain an unordered character set; a graph reasoning network is constructed, and the relationships between characters in the character set are reasoned through the graph reasoning network to obtain a character connection set; the character connection set is integrated and translated into a reading order to obtain the recognition result of the text picture.
The invention can effectively alleviate the problem of scale sensitivity through aggregation cross-entropy character recognition, and can predict a more accurate character sequence through the convolution network probability matrix; through the character connection prediction model of the graph reasoning network, it can break away from the dictionary limitation of the training text pictures and mine the language information in an independent corpus.
The above are only the preferred embodiments of this application, but the scope of protection of this application is not limited to this. Any changes or substitutions that can be easily thought of by those skilled in the technical field within the technical scope disclosed in this application should be covered by the scope of protection of this application. Therefore, the scope of protection of this application should be based on the scope of protection of the claims.

Claims (8)

CLAIMS
1. A layout-unconstrained method based on graph reasoning network for reading text block, characterized by comprising the following steps: acquiring text pictures with unconstrained layout, and constructing a convolution network; extracting the visual feature map of the text picture through the convolution network, and performing character recognition on the visual feature map pixel by pixel; based on the output value of the convolution network, optimizing the convolution network with the aggregation cross-entropy loss function to obtain an unordered character set; constructing a graph reasoning network, and reasoning the relationships between characters in the character set through the graph reasoning network to obtain a character connection set; integrating the character connection set, and translating the integrated character connection set into a reading order to obtain the recognition result of the text picture.
2. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 1, characterized in that, the process of performing character recognition on the visual feature map pixel by pixel comprises: pre-processing the text picture, taking the pre-processed text picture as input, extracting the visual feature map through the convolution network, and converting the depth dimension into the category number of the alphabet through the full connection layer in the convolution network.
3. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 2, characterized in that, the process of obtaining the unordered character set comprises: based on the output value of the convolution network and the number of categories, obtaining a probability matrix of character categories, and respectively summing, sorting and counting the probability matrix to obtain an unordered character set.
4. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 3, characterized in that, the process of summing, sorting and counting the probability matrix respectively comprises: based on the probability matrix, summing the probability matrix along the time step dimension to obtain an aggregated probability vector; optimizing the convolution network through the aggregation cross-entropy loss function, and sorting the aggregated probability vector by the probability values of the character categories to obtain a sorted probability vector; based on the sorted probability vector, performing character counting by integerizing the corresponding probability values element by element to obtain a character sequence length N, and intercepting the first N counted characters to obtain an unordered character set.
5. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 1, characterized in that, the character connection set comprises a character global linkage prediction set and a character local linkage prediction set.
6. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 5, characterized in that, the process of obtaining the character global linkage prediction set comprises: based on the unordered character set, taking each character in the unordered character set as a graph node, where the graph node is characterized by embedding and splicing the category embedding and serial number embedding of corresponding characters; through the graph attention layer in the graph reasoning network, performing relationship modeling for each graph node feature to obtain global modeling features, performing nonlinear activation for the global modeling features by the Softmax function, and taking the corresponding index of the maximum activation value by the argmax function to obtain the global linkage prediction set of characters.
7. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 6, characterized in that, the process of obtaining the character local linkage prediction set comprises: taking each character in the unordered character set as a composition anchor point, taking several characters adjacent to the composition anchor point in the local graph as graph nodes, and obtaining the normalized node features by subtracting the original features of the composition anchor point from the original features of all graph nodes; through the graph attention layer in the graph reasoning network, performing relationship modeling for the node features of the local graphs to obtain local modeling features, and performing nonlinear activation of the local modeling features with the Sigmoid function to obtain a local linkage prediction set based on the composition anchor points.
8. The layout-unconstrained method based on graph reasoning network for reading text block according to claim 7, characterized in that, the process of obtaining the recognition result of the text picture comprises: selecting a corresponding linkage prediction set from the global linkage prediction set of characters and the local linkage prediction set of characters according to the confidence degree of connection classification; based on the linkage prediction set, using each character in the linkage prediction set as the start node of a linked list in turn, obtaining the connection predictions of the nodes through recursion to construct new nodes of the linked list, and constructing a unidirectional connection linked list based on the new nodes; calculating the lengths of all unidirectional linked lists, and obtaining the recognition result of the text picture by taking the longest linked list as the reading order of the character set.
