CN110889404A

CN110889404A - Irregular text recognition system and method based on correction network

Info

Publication number: CN110889404A
Application number: CN201911145879.8A
Authority: CN
Inventors: 张雨柔; 李锐; 于治楼
Original assignee: Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Current assignee: Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Priority date: 2019-11-21
Filing date: 2019-11-21
Publication date: 2020-03-17

Abstract

The invention discloses an irregular text recognition system and method based on a correction network. The recognition system comprises a text correction network and a text recognition network. The recognition method comprises the following steps: converting the irregular text picture into a regular text picture through the text correction network; and recognizing the regular text picture through the text recognition network and outputting the corresponding text information. The method first corrects the irregular text through the text correction network, for example by presenting the text in the picture in the horizontal direction and removing irrelevant noise information from the picture, and then performs recognition through the subsequent text recognition network. Processing the irregular text picture with a correction network avoids geometric constraints, allows various complicated irregular text pictures to be corrected, reduces the difficulty of the subsequent text recognition, and makes recognition more efficient.

Description

Irregular text recognition system and method based on correction network

Technical Field

The invention relates to the technical field of computer vision, in particular to an irregular text recognition system and method based on a correction network.

Background

Text recognition in natural scenes helps people acquire information in real life more conveniently and understand their surroundings. However, most text in natural scenes is irregular: it may be curved, cut, or mixed with a lot of noise information. Recognition of regular text is already well developed thanks to deep networks, but it cannot be applied directly to irregular text. At present, most methods for irregular text pictures are based on the attention mechanism: they do not correct the irregular text, and instead use attention to locate, at each step, the position of the text information of interest on the original picture and recognize the text in the picture directly. Previous methods have some limitations, such as the need for more supervisory information during training and the possibility of introducing more noise when a radial transformation is used.

Disclosure of Invention

The invention aims to provide an irregular text recognition system and method based on a correction network that address the above shortcomings.

The technical scheme adopted by the invention is as follows:

An irregular text recognition system based on a correction network, comprising a text correction network and a text recognition network, wherein:

text correction network: used for converting the irregular text picture into a regular text picture;

text recognition network: used for recognizing the regular text picture and generating text information.

As an optimization, the text correction network of the present invention includes a prediction network and a picture gridding module, wherein:

prediction network: based on a convolutional neural network, obtains the position deviation of each pixel for converting the irregular text picture into the regular text picture;

picture gridding module: generates a grid map for the irregular text picture, obtains the coordinate information of each pixel on the irregular text picture, combines the coordinate information of each pixel with the corresponding position deviation, and outputs the converted coordinate information of each pixel.

As an optimization, the text recognition network adopts an encoder-decoder structure: the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.

The invention also relates to an irregular text recognition method based on the correction network, which comprises the following steps:

converting the irregular text picture into a regular text picture through a text correction network;

and recognizing the regular text picture through a text recognition network and outputting corresponding text information.

As an optimization, the step of converting the irregular text picture into the regular text picture comprises:

obtaining, based on a convolutional neural network, the position deviation coordinates of each pixel for converting the irregular text picture into the regular text picture;

acquiring the original position coordinates of each pixel on the original irregular text picture based on a regularization processing mode;

normalizing the original position coordinates of each pixel to obtain the normalized coordinates of each pixel;

and summing the normalized coordinates of each pixel and the corresponding position deviation coordinates to obtain the converted position coordinates of each pixel.
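The summation in the last step can be sketched as follows. This is only a minimal illustration in Python, assuming that coordinates and deviations are stored as lists of (x, y) pairs; the function name `add_offsets` and the sample values are hypothetical and do not come from the invention.

```python
def add_offsets(normalized_coords, offsets):
    """Element-wise sum of normalized (x, y) coordinates and (dx, dy) deviations."""
    if len(normalized_coords) != len(offsets):
        raise ValueError("each pixel needs exactly one position deviation")
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(normalized_coords, offsets)]


# Three pixels on the middle row of a picture, already normalized to [-1, 1].
coords = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
# Hypothetical deviations of the kind a Tanh-bounded prediction could produce.
offsets = [(0.25, -0.5), (0.0, 0.5), (-0.25, 0.0)]
converted = add_offsets(coords, offsets)
```

Because both inputs are expressed on the same normalized scale, the sum needs no further unit conversion.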

As an optimization, the step of obtaining, based on the convolutional neural network, the position deviation coordinates of each pixel for converting the irregular text picture into the regular text picture includes:

taking the pixel values of the irregular text picture as input and obtaining a two-channel feature map through a prediction network, wherein one channel corresponds to the X-axis deviation coordinate, the other channel corresponds to the Y-axis deviation coordinate, and the prediction network is built on a convolutional neural network;

since the feature map is smaller than the irregular text picture, resizing the feature map to the size of the irregular text picture through a resize function, thereby obtaining the position deviation coordinate corresponding to each pixel on the irregular text picture.
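The resize step can be illustrated with a small bilinear interpolation routine applied to one deviation channel. This is a sketch under assumptions: the text does not name a specific resize function in this passage, so the function `resize_bilinear`, its corner-aligned sampling convention, and the 2 x 2 sample channel are all hypothetical.

```python
def resize_bilinear(channel, out_h, out_w):
    """Resize a 2-D list of floats to (out_h, out_w) by bilinear interpolation.

    Corner values of the input map onto corner values of the output
    (a corner-aligned convention, chosen here only for simplicity).
    """
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for i in range(out_h):
        # Fractional source row that output row i reads from.
        sy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for j in range(out_w):
            sx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
            bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A 2 x 2 X-axis deviation channel resized to a 3 x 3 picture.
dx_small = [[-1.0, 1.0],
            [0.0, 0.5]]
dx_full = resize_bilinear(dx_small, 3, 3)
```

Each interpolated value is a weighted average of its neighbours, so the resized channel cannot leave the value range of the input channel.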

As an optimization, the prediction network comprises five layers: the first layer comprises a maximum pooling layer, the second layer comprises a convolutional layer and a maximum pooling layer, the third layer comprises a convolutional layer and a maximum pooling layer, the fourth layer comprises a convolutional layer and a maximum pooling layer, and the fifth layer comprises a convolutional layer, wherein: the convolutional layers of the second to fourth layers are each followed by a batch normalization layer and a ReLU activation function layer, and the convolutional layer of the fifth layer is followed by a batch normalization layer and a Tanh activation function layer.
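The five-layer structure can be written down as a plain configuration, which also shows why the output feature map is smaller than the input picture. The sketch below is illustrative only: kernel sizes, strides, and channel counts are not given in the text, so 2 x 2 max pooling with stride 2 and size-preserving convolutions are assumptions, and the names `PREDICTION_NETWORK` and `output_size` as well as the 64 x 256 input are hypothetical.

```python
# Operations of each of the five layers, in order.
# Assumed: every "maxpool" halves height and width; every "conv" keeps them.
PREDICTION_NETWORK = [
    ["maxpool"],
    ["conv", "batchnorm", "relu", "maxpool"],
    ["conv", "batchnorm", "relu", "maxpool"],
    ["conv", "batchnorm", "relu", "maxpool"],
    ["conv", "batchnorm", "tanh"],
]


def output_size(height, width, stages=PREDICTION_NETWORK, pool=2):
    """Spatial size of the deviation map under the assumptions above."""
    for stage in stages:
        for op in stage:
            if op == "maxpool":
                height //= pool
                width //= pool
    return height, width


# Hypothetical 64 x 256 input picture.
offset_map_size = output_size(64, 256)
```

Under these assumptions the four pooling operations reduce each side by a factor of 16, and the final Tanh keeps every output value within [-1, 1].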

As an optimization, the step of normalizing the original position coordinates of each pixel to obtain the normalized coordinates of each pixel includes:

acquiring the width w and the height h of the original picture, and dividing the original position coordinates of each pixel by [w/2, h/2] to obtain the normalized coordinates.
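The normalization can be illustrated with a short example. It assumes that the original position coordinates are measured from the picture center, so that the picture borders fall at ±w/2 and ±h/2; the function name `normalize_coords` and the 100 x 32 picture are hypothetical.

```python
def normalize_coords(coords, width, height):
    """Divide centered (x, y) coordinates by (width / 2, height / 2)."""
    half_w, half_h = width / 2, height / 2
    return [(x / half_w, y / half_h) for x, y in coords]


# Two top corners, the center, and the bottom-right corner of a 100 x 32 picture.
points = [(-50.0, -16.0), (50.0, -16.0), (0.0, 0.0), (50.0, 16.0)]
normalized = normalize_coords(points, width=100, height=32)
```

With this convention the picture borders map to -1 and 1 on both axes, the same range as a Tanh output.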

As an optimization, the step of recognizing the regular text picture and outputting corresponding text information is as follows: the text recognition network adopts an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism to finally obtain an output based on the character probability distribution.
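One decoding step of an attention mechanism can be sketched as below. The text does not specify how attention scores are computed, so dot-product scoring is an assumption made purely for illustration; the function name `attention_step` and the toy feature vectors are hypothetical.

```python
import math


def attention_step(decoder_state, encoder_features):
    """Return attention weights and the context vector for one decoding step."""
    # Assumed scoring: dot product of the decoder state with each encoder feature.
    scores = [sum(s * f for s, f in zip(decoder_state, feat))
              for feat in encoder_features]
    # Softmax, shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder features.
    dim = len(encoder_features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, encoder_features))
               for d in range(dim)]
    return weights, context


# Three encoder feature vectors, e.g. one per horizontal position of the picture.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_step([2.0, 0.0], features)
```

The weights indicate which positions of the regular text picture the decoder attends to when it emits the next character.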

As an optimization, the establishment of the text correction network and the text recognition network comprises the following steps:

building the text correction network structure: the prediction network of the text correction network is built on a convolutional neural network with the pixel values of the irregular text picture as input, and has a five-layer structure: the first layer consists of one maximum pooling layer, the second layer includes one convolutional layer and one maximum pooling layer, the third layer includes one convolutional layer and one maximum pooling layer, the fourth layer includes three convolutional layers and one maximum pooling layer, and the fifth layer includes one convolutional layer, wherein: the convolutional layers of the second to fourth layers are each followed by a batch normalization layer and a ReLU activation function layer, the convolutional layer of the fifth layer is followed by a batch normalization layer and a Tanh activation function layer, and the deviation prediction information of two channels is output;

building the text recognition network structure: the text recognition network is built on an encoder-decoder structure, the encoder is built on a convolutional neural network and a recurrent neural network, the decoder is built on a bidirectional LSTM combined with an attention mechanism, and characters are output based on the character probability distribution;

establishing a data set: selecting a data set, and dividing the data set into a training set and a test set;

network training: the network parameters of the text correction network and the text recognition network are learned with a curriculum learning strategy: first, the text recognition network is trained on regular text pictures; then the text recognition network is fixed and the text correction network is trained on irregular text; finally, the text correction network and the text recognition network are trained simultaneously in an end-to-end manner.
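The three training stages can be summarized as a small schedule that records which network is updated at each stage. This is a sketch under assumptions: the stage order follows the text, but the data used in the final end-to-end stage is not stated, so "irregular" in stage 3 is an assumption, and the names `CURRICULUM` and `frozen_networks` are hypothetical.

```python
# Which network's parameters are updated in each training stage.
CURRICULUM = [
    {"stage": 1, "data": "regular", "update": ["recognition"]},
    {"stage": 2, "data": "irregular", "update": ["correction"]},
    # Assumed data for the joint stage; the text only says "end-to-end".
    {"stage": 3, "data": "irregular", "update": ["correction", "recognition"]},
]


def frozen_networks(stage_cfg, networks=("correction", "recognition")):
    """Networks whose parameters stay fixed during the given stage."""
    return [name for name in networks if name not in stage_cfg["update"]]


frozen_per_stage = [frozen_networks(cfg) for cfg in CURRICULUM]
```

In an actual training loop, each stage would disable gradient updates for the networks returned by `frozen_networks` before iterating over its data.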

The invention has the following advantages:

1. The recognition method of the invention first corrects the irregular text through the text correction network, for example by presenting the text in the picture in the horizontal direction and removing irrelevant noise information from the picture, and then performs recognition through the subsequent text recognition network. Processing the irregular text picture with a correction network avoids geometric constraints, allows various complicated irregular text pictures to be corrected, reduces the difficulty of the subsequent text recognition, and makes recognition more efficient.

2. The structural attention mechanism in the text recognition network obtains more contextual text information and is more robust, which improves recognition accuracy.

3. The invention trains the network structure in a weakly supervised manner: training needs only the original pictures and the corresponding text labels, and no additional supervision information.

4. The invention adopts a curriculum learning strategy during training and iteratively trains and updates the network structure, so that network training is more effective and more efficient.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

The invention is further described below with reference to the accompanying drawings:

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The present invention is further described in the following with reference to the drawings and the specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not to be construed as limiting the present invention, and the embodiments and the technical features of the embodiments can be combined with each other without conflict.

It is to be understood that the terms first, second, and the like in the description of the embodiments of the invention are used for distinguishing between the descriptions and not necessarily for describing a sequential or chronological order. The "plurality" in the embodiment of the present invention means two or more.

The term "and/or" in the embodiment of the present invention only describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A exists alone, B exists alone, or A and B exist at the same time. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

Example one

As shown in FIG. 1, the present embodiment provides an irregular text recognition system based on a correction network, which includes a text correction network and a text recognition network, wherein:

Text correction network: used for converting the irregular text picture into a regular text picture. Specifically, the text correction network includes a prediction network and a picture gridding module, wherein: the prediction network, based on a convolutional neural network, obtains the position deviation of each pixel for converting the irregular text picture into the regular text picture; the picture gridding module generates a grid map from the irregular text picture, obtains the coordinate information of each pixel on the irregular text picture, combines the coordinate information of each pixel with the corresponding position deviation, and outputs the converted coordinate information of each pixel.

Text recognition network: used for recognizing the regular text picture and generating text information. The text recognition network adopts an encoder-decoder structure: the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.

Example two

The embodiment provides a method for constructing an irregular text recognition system based on a correction network, which comprises the following steps:

establishing a data set: selecting data sets, such as IIIT5K-Words, Street View Text, ICDAR2003, ICDAR2013, CUTE80, ICDAR2015, etc., and dividing the data sets into training sets and test sets;

Example three

As shown in FIG. 1, the present embodiment provides an irregular text recognition method based on a correction network, which uses the irregular text recognition system trained in Example two and includes the following steps:

S1, converting the irregular text picture into a regular text picture through the text correction network; this step specifically comprises the following:

S11, obtaining, based on the convolutional neural network, the position deviation coordinates of each pixel for converting the irregular text picture into the regular text picture: specifically, the irregular text picture is input into the prediction network with its pixel values as input, and a two-channel feature map is obtained through the prediction network, where one channel corresponds to the X-axis deviation coordinate and the other channel corresponds to the Y-axis deviation coordinate. Which channel corresponds to the X axis and which to the Y axis is determined by the processing mode; the X axis corresponds to the width direction of the irregular text picture, and the Y axis corresponds to its height direction. The feature maps of the two channels are smaller than the irregular text picture, so they are resized to the size of the irregular text picture by bilinear interpolation to obtain the position deviation coordinates corresponding to each pixel on the irregular text picture. It should be noted that the channel values on the feature map output by the prediction network all lie within the range [-1, 1]. When the feature map is resized to the size of the original irregular text picture by interpolation, the size of the irregular text picture is expressed in pixels, that is, the interpolation makes the number of pixels of the feature map equal to that of the original irregular text picture, so the position deviation coordinates obtained for each pixel also lie within the range [-1, 1];

in the above, the prediction network is built on a convolutional neural network and includes five layers: the first layer comprises a maximum pooling layer, the second layer comprises a convolutional layer and a maximum pooling layer, the third layer comprises a convolutional layer and a maximum pooling layer, the fourth layer comprises a convolutional layer and a maximum pooling layer, and the fifth layer comprises a convolutional layer, wherein: the convolutional layers of the second to fourth layers are each followed by a batch normalization layer and a ReLU activation function layer, and the convolutional layer of the fifth layer is followed by a batch normalization layer and a Tanh activation function layer;

S12, acquiring the original position coordinates of each pixel on the original irregular text picture based on the regularization processing mode; consistent with the feature map, the original position coordinates are obtained with the center of the irregular text picture as the coordinate origin, the width direction of the picture as the X axis with rightward as the positive direction, and the height direction of the picture as the Y axis with downward as the positive direction. During this processing the size of the irregular text picture is measured in length units, so the original position coordinate values of the pixels are expressed in length units;

S13, normalizing the original position coordinates of each pixel to obtain the normalized coordinates of each pixel; specifically, the width w and the height h of the original picture are obtained, and the original position coordinates of each pixel are divided by [w/2, h/2] to obtain the normalized coordinates. After normalization, the normalized coordinates correspond in scale to the position deviation coordinates, which facilitates the subsequent summation;

S14, summing the normalized coordinates of each pixel and the corresponding position deviation coordinates to obtain the converted position coordinates of each pixel, and assembling all the converted pixels into the regular text picture;
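Steps S12 to S14 can be sketched together on a tiny grayscale picture. The text does not say how the converted pixels are assembled, so this sketch assumes backward sampling: each output pixel reads the source picture at its converted coordinate with bilinear interpolation, clamped at the border. The function names `sample_bilinear` and `rectify`, the use of pixel indices as length units, and the sample picture are all hypothetical.

```python
def sample_bilinear(image, x, y):
    """Read a grayscale image at fractional pixel position (x, y), clamped to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
    bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def rectify(image, offsets):
    """Assemble the output picture from grid coordinates plus deviations.

    offsets[r][c] is the (dx, dy) deviation of pixel (r, c) in normalized units.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            # S12: coordinates measured from the picture center, Y pointing down.
            cx, cy = c - (w - 1) / 2, r - (h - 1) / 2
            # S13: normalize by (w / 2, h / 2).
            nx, ny = cx / (w / 2), cy / (h / 2)
            # S14: add the predicted deviation.
            dx, dy = offsets[r][c]
            tx, ty = nx + dx, ny + dy
            # Assumed assembly: map back to pixel units and sample the source.
            px, py = tx * (w / 2) + (w - 1) / 2, ty * (h / 2) + (h - 1) / 2
            row.append(sample_bilinear(image, px, py))
        out.append(row)
    return out


picture = [[0.0, 1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0, 7.0]]
zero = [[(0.0, 0.0)] * 4 for _ in range(2)]
shift = [[(0.5, 0.0)] * 4 for _ in range(2)]  # 0.5 normalized = one pixel at w = 4
unchanged = rectify(picture, zero)
shifted = rectify(picture, shift)
```

With zero deviations the picture is reproduced unchanged, and a constant X deviation of 0.5 makes each output pixel read its right-hand neighbour, which shows how the deviation field moves content.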

S2, recognizing the regular text picture through the text recognition network and outputting the corresponding text information, wherein the text recognition network adopts an encoder-decoder structure; the encoder is based on a convolutional neural network and a recurrent neural network, where the convolutional neural network may adopt a network structure such as AlexNet, VGG, or ResNet; and the decoder is based on a bidirectional LSTM combined with an attention mechanism to finally obtain an output based on the character probability distribution. The technology for generating text information from regular text pictures is mature, and the technical scheme of this embodiment can be understood and optimized with reference to the prior art.
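Turning the per-step character probability distributions into text can be sketched with greedy decoding. The text does not state the decoding rule, so taking the most probable character at each step and stopping at an end symbol is an assumption; the function name `greedy_decode`, the character set, and the probabilities are hypothetical.

```python
def greedy_decode(step_probs, charset, eos="<eos>"):
    """Pick the most probable character at each step until the end symbol."""
    chars = []
    for probs in step_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        if charset[best] == eos:
            break
        chars.append(charset[best])
    return "".join(chars)


charset = ["a", "b", "c", "<eos>"]
# One probability distribution over the character set per decoding step.
step_probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.2, 0.1, 0.6, 0.1],
    [0.05, 0.05, 0.1, 0.8],
]
text = greedy_decode(step_probs, charset)
```

Other decoding rules, such as beam search, could be substituted without changing the rest of the pipeline.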

The above-mentioned embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

1. An irregular text recognition system based on a correction network, characterized in that: the system comprises a text correction network and a text recognition network, wherein:

2. The irregular text recognition system based on a correction network of claim 1, wherein: the text correction network comprises a prediction network and a picture gridding module, wherein:

prediction network: based on a convolutional neural network, obtains the position deviation of each pixel for converting the irregular text picture into the regular text picture;

picture gridding module: generates a grid map for the irregular text picture, obtains the coordinate information of each pixel on the irregular text picture, combines the coordinate information of each pixel with the corresponding position deviation, and outputs the converted coordinate information of each pixel, thereby obtaining the regular text picture.

3. The irregular text recognition system based on a correction network of claim 1, wherein: the text recognition network adopts an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.

4. An irregular text recognition method based on a correction network is characterized in that: the method comprises the following steps:

5. The irregular text recognition method based on a correction network of claim 4, wherein: the step of converting the irregular text picture into the regular text picture comprises:

6. The irregular text recognition method based on a correction network of claim 5, wherein: the step of obtaining, based on the convolutional neural network, the position deviation coordinates of each pixel for converting the irregular text picture into the regular text picture comprises:

7. The irregular text recognition method based on a correction network of claim 6, wherein: the prediction network comprises five layers: the first layer comprises a maximum pooling layer, the second layer comprises a convolutional layer and a maximum pooling layer, the third layer comprises a convolutional layer and a maximum pooling layer, the fourth layer comprises a convolutional layer and a maximum pooling layer, and the fifth layer comprises a convolutional layer, wherein: the convolutional layers of the second to fourth layers are each followed by a batch normalization layer and a ReLU activation function layer, and the convolutional layer of the fifth layer is followed by a batch normalization layer and a Tanh activation function layer.

8. The irregular text recognition method based on a correction network of claim 6, wherein: the step of normalizing the original position coordinates of each pixel to obtain the normalized coordinates of each pixel comprises:

9. The irregular text recognition method based on a correction network of claim 4, wherein: the step of recognizing the regular text picture and outputting corresponding text information comprises: the text recognition network adopts an encoder-decoder structure, the encoder uses a convolutional neural network for feature extraction, and the decoder uses a bidirectional LSTM recurrent neural network combined with an attention mechanism to finally obtain an output based on the character probability distribution.

10. The irregular text recognition method based on a correction network of claim 4, wherein: the establishment of the text correction network and the text recognition network comprises the following steps: