CN110674777A - Optical character recognition method in patent text scene - Google Patents

Optical character recognition method in patent text scene

Info

Publication number
CN110674777A
Authority
CN
China
Prior art keywords
text
lstm
output
network model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910940612.1A
Other languages
Chinese (zh)
Inventor
饶云波
郭毅
程亦茗
张孟涵
王艺霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910940612.1A priority Critical patent/CN110674777A/en
Publication of CN110674777A publication Critical patent/CN110674777A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical fields of computer vision, image processing and convolutional neural networks, and particularly relates to an optical character recognition method for patent text scenes. The invention combines a CNN and an LSTM so as to retain the advantages of both, addressing the CNN's weak handling of sequence correlations and the LSTM's limited extraction of image features. The invention further adopts the CTC loss computation method, which removes the need for alignment and thereby solves the problem that sample data are difficult to align during text recognition.

Description

Optical character recognition method in patent text scene
Technical Field
The invention belongs to the technical fields of computer vision, image processing and convolutional neural networks, and particularly relates to an optical character recognition method for patent text scenes.
Background
With the continuous advance of computer hardware and software and the maturing of Artificial Intelligence (AI), applying deep learning to optical character recognition is of great practical significance. Optical character recognition converts the characters of bills, newspapers, books, manuscripts and other printed matter into image information by optical input methods such as scanning, and then converts that image information into computer-readable text using character recognition technology. Its accuracy is affected by factors including the writer's habits, the printing quality of the document, the scanning quality of the scanner, the recognition method, and the training and testing samples. From image to final output, the pipeline comprises image input, image pre-processing, character feature extraction, comparison and recognition, and manual correction of wrongly recognized characters.
OCR technology has broad application prospects: current text recognition algorithms are already deployed in industry, and many optical character recognition products are available on the market, so the field has great application value.
Current OCR technologies can be classified into two categories according to feature extraction methods:
(1) The traditional method: first, connected-component analysis locates the text in the picture; then rows and columns are segmented by binarization, row/column projection analysis and rules; finally the output is obtained after semantic error correction. Its main disadvantages are: (a) feature extraction is time-consuming, and character recognition models are usually trained on hand-designed features (such as histograms of oriented gradients), whose generalization ability drops rapidly when the font changes; (b) over-reliance on the character segmentation result severely degrades accuracy under overlapping characters and noise interference; (c) good results are generally obtained only in simple scenes, with poor performance in complex scenes.
(2) Deep-learning-based optical character recognition: training a character recognition engine is a typical image classification problem. Current deep-learning methods exploit the strength of CNNs in extracting high-level image semantics and of LSTMs in processing temporal sequences, abandon the matching of hand-designed features against design templates, and build an end-to-end recognition network. In simple scenes such networks generally reach recognition accuracy above 90%, and in complex scenes the improvement over traditional methods is even more pronounced. However, these models have too many parameters and too high a computational cost: a deep network structure is often required for accurate feature extraction, and an overly deep structure suffers from vanishing gradients. Because text generally exhibits forward and backward sequence correlation, a CNN is much weaker than an LSTM at extracting such sequential features. An LSTM can extract features from a time series, but the traditional LSTM handles only short-term memory, because an overly long sequence again causes the gradient to vanish.
Disclosure of Invention
Based on deep learning, the invention aims to recognize text in patent scenes efficiently and accurately and to improve the degree of automation of patent entry, by fully exploiting the effectiveness of the LSTM (a kind of RNN) in processing and predicting time-ordered events and the advantage of the CNN in extracting deep semantics.
The technical scheme of the invention mainly comprises two parts: the first is to build and train a network model, where the overall model is divided into a text detection network and a text recognition network; the second is to perform recognition with the trained model. The overall algorithm framework is shown in figure 1. The method is realized by the following steps:
Preparing a sample set: patent text pictures in tif format, containing Chinese, English, digits and punctuation, are taken as the samples, and data enhancement is performed with image processing methods such as stretching, blurring, random cropping, perspective transformation and color inversion to obtain the sample set.
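A minimal Python/OpenCV sketch of this data enhancement step is given below. The augmentation parameters (scale range, blur kernel, crop margin, corner jitter) are illustrative assumptions, not values taken from the patent.

```python
import cv2
import numpy as np

def augment(img):
    """Return augmented variants of one grayscale page image:
    stretching, blurring, random cropping, perspective transformation, color inversion."""
    h, w = img.shape[:2]
    out = []
    # stretching: rescale the width by a random factor
    fx = np.random.uniform(0.8, 1.2)
    out.append(cv2.resize(img, (int(w * fx), h)))
    # blurring
    out.append(cv2.GaussianBlur(img, (5, 5), 0))
    # random cropping: trim a small random margin on each side
    mx = np.random.randint(0, max(1, w // 10))
    my = np.random.randint(0, max(1, h // 10))
    out.append(img[my:h - my, mx:w - mx])
    # perspective transformation: jitter the four corners slightly
    d = 0.05 * min(h, w)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-d, d, src.shape).astype(np.float32)
    out.append(cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst),
                                   (w, h), borderValue=255))
    # color inversion
    out.append(255 - img)
    return out
```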
Building a deep neural network model: the whole model is built from a CNN and a Bi-LSTM (bidirectional long short-term memory network); text regions are generated first and the detection result is then produced from them. The network structure is shown in figure 2.
Text detection network model: a new base network is built from 3 convolution layers and 3 squeeze-and-excitation modules (SE blocks). Each SE block has two output branches: one branch is left untouched, while the other passes through a pooling layer, a fully connected layer, a ReLU excitation layer, another fully connected layer and a sigmoid excitation layer; the two branch results are then added and output. During computation this network assigns a different weight to the features of each channel, so that feature extraction better matches the actual application scene. The base network is shown in figure 3.
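The following Keras sketch illustrates one possible form of such a squeeze-and-excitation module, assuming TensorFlow/Keras as in the experiments below. The reduction ratio is an assumption; the text describes adding the two branches, whereas the standard SE block re-weights channels by multiplication, so both fusions are shown.

```python
from tensorflow.keras import layers

def se_block(x, reduction=16, fuse="add"):
    """Two branches: an untouched identity branch and a
    pooling -> FC -> ReLU -> FC -> sigmoid branch, fused at the end."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                          # pooling layer
    s = layers.Dense(channels // reduction, activation="relu")(s)   # FC + ReLU excitation
    s = layers.Dense(channels, activation="sigmoid")(s)             # FC + sigmoid excitation
    s = layers.Reshape((1, 1, channels))(s)                         # broadcast over H x W
    if fuse == "add":
        return layers.Add()([x, s])        # fusion by addition, as described in the text
    return layers.Multiply()([x, s])       # standard SE channel re-weighting
```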
Text detection is a problem in the field of object detection. Lower network layers perceive small targets better, while higher layers perceive large targets, including context, better. The feature extraction network is therefore designed to draw on several feature outputs, forming a multi-scale feature extraction network. In practical problems the features extracted by different channels should not carry the same weight, so during feature extraction the features of different channels are given different output weights.
Text recognition network model: the model is built from a Bi-LSTM and a CNN, and the CTC algorithm replaces the traditional smoothLoss loss function. The network consists of 4 depthwise separable modules and 1 Bi-LSTM module. The input is a text-line picture output by the text detection network; features are first extracted by the depthwise separable modules, the feature sequence is fed into the Bi-LSTM for per-frame prediction, translation is then performed by CTC, and the result is output. The structure of the text recognition network model is shown in figure 4.
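A compact Keras sketch of this recognition branch is shown below (depthwise separable convolution blocks, map-to-sequence reshaping, Bi-LSTM, per-frame softmax for CTC). The filter counts, pooling sizes and character-set size are assumptions chosen so that the shapes work out, not values from the patent.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 5000 + 1   # assumed character set size + 1 CTC blank

def sep_block(x, filters, pool):
    """Depthwise 3x3 + BN, pointwise 1x1 + BN, ReLU, max pooling."""
    x = layers.DepthwiseConv2D(3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, 1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.MaxPooling2D(pool)(x)

def build_recognizer(width=280):
    inp = layers.Input((32, width, 1))            # text-line picture from the detection network
    x = sep_block(inp, 64, (2, 2))
    x = sep_block(x, 128, (2, 2))
    x = sep_block(x, 256, (2, 1))                 # shrink height only
    x = sep_block(x, 512, (2, 1))
    x = layers.Conv2D(512, (2, 1))(x)             # collapse remaining height to 1
    x = layers.Reshape((-1, 512))(x)              # map-to-sequence: one 512-d vector per column
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)   # per-frame distribution for CTC
    return models.Model(inp, out)
```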
Training the network model with the data set, iteratively updating the network parameters to obtain the optimal model.
The model training comprises two parts, namely training of a text detection network and training of a text recognition network.
Text detection network training:
1. Through forward propagation, the convolution modules fully extract the feature information of the text picture; the feature map produced by the base network module has size W × H × C, where W is the feature map width, H the height and C the number of output channels.
2. After C 3×3 convolution kernels, the data are fed into a Bi-LSTM network to obtain a W×256-dimensional output, which then passes through a 512-dimensional fully connected layer. The output layer is divided into 2 parts. The first part performs coordinate regression with 512×(4+10): 512 means each point has 512 features, 10 means each point has 10 prediction box sizes (10 candidate boxes of different scales are generated), and 4 means each prediction box is described by a quadruple (xmin, xmax, ymin, ymax) giving the coordinates of two corner points. The second part uses 512×(2+10) for class prediction; 512 and 10 have the same meaning as in the first part, and 2 indicates the two classes, background or not.
3. A total of W × H × 10 prediction boxes are generated for each picture, and redundant boxes are removed by non-maximum suppression (NMS) with the threshold set to 0.7 (a code sketch is given after this training procedure).
4. The offset of each candidate box relative to the ground-truth box is calculated for prediction box regression.
5. The final prediction box is obtained from the category score and the coordinates. The overall loss function is the sum of a classification loss L_s^cl and a regression loss L_v^re. The first part, L_s^cl, uses the softmax function to supervise whether an anchor contains text information; s_i is the score of the i-th category and s_i* = 1 indicates a true text anchor. The second part, L_v^re, is an L1-smooth function used to learn the y-direction offset regression of anchors that contain text, where v_j is the size of the j-th text prediction box, β is the task weight, and N_s and N_v are normalization parameters giving the number of samples of the corresponding task. The formulas are as follows:
L(s_i, v_j) = (1/N_s) Σ_i L_s^cl(s_i, s_i*) + (β/N_v) Σ_j L_v^re(v_j, v_j*)
L_s^cl(s_i, s_i*) = −[s_i* log s_i + (1 − s_i*) log(1 − s_i)]   (softmax cross-entropy over text / non-text)
L_v^re(v_j, v_j*) = smoothL1(v_j − v_j*), where smoothL1(x) = 0.5x² for |x| < 1 and |x| − 0.5 otherwise
6. The obtained prediction boxes are combined using a text line construction method: two boxes are recursively merged into a group until no further merge is possible. The merging conditions are: 1) the box is the nearest neighbour of the target box and less than 50 pixels away; 2) their overlap (intersection-over-union) is greater than 0.7.
7. The weight parameters of each network layer are updated by back propagation according to the loss function.
This completes the text detection network training.
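A NumPy sketch of steps 3 and 6 above follows: non-maximum suppression at threshold 0.7, then recursive merging of boxes that are less than 50 pixels apart horizontally and whose vertical extents overlap by more than 0.7 (one reading of the overlap condition in step 6). The box format (xmin, ymin, xmax, ymax) and helper names are illustrative.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (xmin, ymin, xmax, ymax)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.7):
    """Step 3: keep the highest-scoring box, drop boxes overlapping it by more than thresh."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

def merge_text_lines(boxes, max_gap=50, min_overlap=0.7):
    """Step 6: recursively merge neighbouring boxes into text lines."""
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                gap = max(a[0], b[0]) - min(a[2], b[2])        # horizontal gap (< 0 means overlap)
                v_inter = min(a[3], b[3]) - max(a[1], b[1])    # vertical intersection
                v_overlap = max(0, v_inter) / (min(a[3] - a[1], b[3] - b[1]) + 1e-9)
                if gap < max_gap and v_overlap > min_overlap:
                    boxes[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```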
Text recognition network training:
1. Through forward propagation, an input picture of size 1 × W × 32 passes through four depthwise separable convolution modules that extract the feature information of the text picture; the final output is a feature map with 512 channels, height 1 and width W′, where W′ denotes the width W after reduction by the pooling layers.
2. Because the features extracted by the CNN cannot be fed directly to the Bi-LSTM, a sequence of feature vectors must be extracted: the feature vectors are generated column by column from left to right on the feature map, each column contains 512 features, so each feature vector is 512-dimensional and W′ feature vectors are obtained in total.
3. The sequence then passes through 1 Bi-LSTM module with 256 hidden nodes, one feature vector being fed into each of the W′ time steps of the Bi-LSTM; the softmax probability distribution of the characters is finally obtained, forming a W′ × (number of character classes) posterior probability matrix that is used as the input to the CTC algorithm.
4. The label sequence with the highest combined probability is found by the CTC algorithm and output.
5. The loss function O is formulated as follows, where X is the input sequence, l is the output label sequence, S is the training set and p(l|X) represents the probability of the output sequence l given the input X (a code sketch of this step follows the list):
O = −Σ_{(X, l) ∈ S} ln p(l | X)
6. Likewise, back propagation is performed according to the loss function and the network weight parameters are updated.
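The per-frame softmax matrix from the Bi-LSTM feeds the CTC objective and decoder of steps 4 and 5. A minimal TensorFlow/Keras sketch follows; it uses the standard Keras backend CTC helpers rather than any implementation taken from the patent.

```python
import tensorflow as tf
from tensorflow.keras import backend as K

# y_pred: per-frame softmax output of the Bi-LSTM, shape (batch, frames, classes)
# labels: padded integer label sequences, shape (batch, max_label_len)
# frame_len, label_len: (batch, 1) tensors holding the true lengths

def ctc_objective(labels, y_pred, frame_len, label_len):
    """Step 5: O = -sum over the batch of ln p(l | X)."""
    per_sample = K.ctc_batch_cost(labels, y_pred, frame_len, label_len)  # -ln p(l|X) per sample
    return tf.reduce_sum(per_sample)

def ctc_best_path(y_pred, frame_len):
    """Step 4: greedy decoding, returning the most probable label sequence."""
    decoded, _ = K.ctc_decode(y_pred, tf.squeeze(frame_len, axis=-1), greedy=True)
    return decoded[0]
```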
The advantage of the method is that, unlike the traditional approach, feature learning is carried out with a Bi-LSTM and a CNN: combining the CNN and the LSTM yields a new network structure model, the CTC algorithm performs probability prediction at the final character output stage, and traditional image processing is applied at the end, so that optical character recognition in the patent scene is greatly improved. With the development of artificial intelligence, methods such as deep learning are moving from academia into industry, which gives the method strong practical significance. Owing to advances in hardware and algorithms, the demands on recognition accuracy and degree of automation are also rising.
The invention combines the CNN and the LSTM, retaining the advantages of both, and addresses the CNN's weak handling of sequence correlations and the LSTM's limited extraction of image features. The invention further adopts the CTC loss computation method, which solves, without requiring alignment, the problem that sample data are difficult to align during text recognition. For optical character recognition in a patent scene, traditional methods are introduced for preprocessing and splitting the feature regions; most current OCR applications do not perform background detection or character orientation adjustment on irregular pictures, and optimization of optical character recognition for patent pictures is lacking. As the figures show, such targeted processing has a great influence on the final result. The invention therefore has broad application prospects, and the deep-learning-based method has real practical value and research significance for OCR applications and research in this specific scene.
Drawings
FIG. 1 is an algorithmic framework of the present invention;
FIG. 2 is a diagram of a neural network model of the present invention;
FIG. 3 is a diagram of an infrastructure network architecture;
FIG. 4 is a diagram of a text recognition network architecture;
FIG. 5 shows the data set and labels: (a) is the data label diagram and (b) is the data presentation diagram;
FIG. 6 is a flow chart of the Train algorithm;
FIG. 7 is a graph of network operation results;
FIG. 8 is a feature region segmentation map;
FIG. 9 is an Excel class screenshot;
FIG. 10 is a write module effect display diagram;
FIG. 11 is test figure 1: (a) is the raw input image and (b) is the model test result;
FIG. 12 is test figure 2: (a) is the raw input image and (b) is the model test result;
FIG. 13 compares model effects: (a) is the raw input image, (b) is the model test result with preprocessing, and (c) is the model test result without preprocessing.
Detailed Description
The following describes the applicability of the invention in connection with a simulation example.
Training environment:
CPU: Intel i7-8700K; GPU: NVIDIA GeForce 2080 Ti; OS: Ubuntu 16.04.
Data verification environment:
CPU: 2.7 GHz Intel Core i5; GPU: Intel Iris Graphics 6100; OS: macOS 10.14.6.
The development language is Python 3.5, with the open-source framework Keras and TensorFlow as the back end; third-party libraries such as OpenCV and NumPy are also used.
1. Data set preparation
Patent text pictures in tif format are used. The data set contains 500,000 original pictures comprising Chinese, English, digits and punctuation; data enhancement with image processing methods such as stretching, blurring, random cropping, perspective transformation and color inversion yields a final data set of about 3,000,000 pictures. The data set is split into a training set and a validation set in a 99:1 ratio (a minimal split sketch is given below); data labels are produced with the text_render tool, generating a label file train.txt and the picture data, as shown in figure 5.
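A tiny sketch of the 99:1 split follows. The assumption that train.txt holds one "image path, tab, label" entry per line is illustrative; the exact layout produced by the text_render tool is not specified here.

```python
import random

# Assumed layout of train.txt: one "<image_path>\t<label text>" entry per line.
with open("train.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

random.seed(0)
random.shuffle(lines)
cut = len(lines) * 99 // 100          # 99:1 train/validation split
train_lines, val_lines = lines[:cut], lines[cut:]
```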
2. Begin training
The iteration number epoch is set to 4, the batch size is set to 16, and the picture width and height are limited to 280 and 32 respectively. The learning rate lr changes dynamically with the epoch according to the following formula.
lr = 0.0006 × 0.3^epoch
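In Keras this schedule can be attached as a callback; a short sketch follows, assuming the epoch index is counted from 0.

```python
from tensorflow.keras.callbacks import LearningRateScheduler

# lr = 0.0006 * 0.3 ** epoch, with epoch = 0, 1, 2, 3
lr_schedule = LearningRateScheduler(lambda epoch: 0.0006 * 0.3 ** epoch)
# model.fit(train_data, epochs=4, batch_size=16, callbacks=[lr_schedule])
```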
The training Python script is run: a session is created first, then the network structure and the data set path are loaded. The training algorithm flowchart is shown in figure 6, and a screenshot of the run results is shown in figure 7.
A weight.h5 file is obtained after training finishes, after which the characters of the patent text are recognized and written out.
3. The layout of the patent picture is preprocessed and recognition is then performed.
1) The input picture is first scaled and cropped to a standard 224 × 224 size. This step prevents the loss of precision caused by irregular pictures.
2) Noise in the irregular picture is removed with a filter, and binarization, rotation and similar operations are applied to bring out the features of the optical characters.
3) A coordinate system is established with the upper-left corner as the origin, and the coordinates of the area containing the content to be recognized are extracted. The corresponding area is cropped to generate an intermediate picture, so that the feature region is enlarged, as shown in figure 8, and a large amount of irrelevant information is removed (a sketch of steps 1) to 3) follows this list).
4) The results are written into an Excel document, which is read and written with the Python package openpyxl. The classes of data to be written are shown in figure 9. First, a compare_excel(self, sheet) -> bool function decides whether a new document is created or an existing one is appended to. Because there are many types of patent pictures, the data may need to be written into an existing row or a new row, so the invention performs several checks based on keywords such as the patent number and the data name. Finally the data are written into the Excel document, as shown in figure 10.
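A minimal OpenCV sketch of preprocessing steps 1) to 3) is given below. The denoising and binarization choices (median filter, Otsu threshold) and the region dictionary are assumptions for illustration; the actual patent layout determines the real coordinates.

```python
import cv2

def preprocess(path, regions):
    """Steps 1)-3): normalise the page to 224 x 224, denoise and binarise it,
    then crop the regions holding the content to be recognised.
    `regions` maps a field name to (x, y, w, h) in the normalised
    coordinate system, with the origin at the upper-left corner."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (224, 224))                       # step 1: standard size
    img = cv2.medianBlur(img, 3)                            # step 2: remove noise
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return {name: img[y:y + h, x:x + w]                     # step 3: cut feature regions
            for name, (x, y, w, h) in regions.items()}
```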
After this series of image processing steps, the network model is used for recognition; the test results are good, as shown in figures 11 and 12.
As the figures show, the method clearly improves recognition accuracy; the recognition result can be displayed as an intermediate output stream and modified manually, and the final result is stored automatically in an Excel table. The final recognition precision is very high, and the system is essentially ready for industrial deployment.
The invention provides a novel network structure and algorithm model: Bi-LSTM + CNN + CTC. The text detection network adopts an SE-block structure and builds a new base network for feature extraction; this module fully accounts for the influence of the different channel dimensions on the features, and extracts features better than other feature extraction network models. The text recognition network builds its CNN from a new depthwise convolution module, and the loss function computes character probabilities with the CTC algorithm in place of the smoothLoss function. While maintaining model precision, the number of model parameters and the amount of computation are greatly reduced.
In the recognition stage, the picture is preprocessed to unify its size, the feature regions are then identified and cropped, and the trained network model performs recognition to generate an intermediate result. Since optical character recognition cannot yet reach one hundred percent accuracy, manual review is still necessary. If the image is fed directly into the network model without preprocessing, the results are poor; figure 13 shows the importance of image preprocessing for the final result, and recognition with preprocessing is qualitatively more accurate than recognition without it.

Claims (1)

1. An optical character recognition method under a patent text scene is characterized by comprising the following steps:
S1, obtaining patent text pictures in tif format and preprocessing them to serve as a sample set;
S2, building a deep neural network model comprising a text detection network model and a text recognition network model;
the text detection network model consists of 3 convolutional layers, 3 squeeze-and-excitation modules and 1 Bi-LSTM, where each convolutional layer is connected to a squeeze-and-excitation module; each squeeze-and-excitation module comprises two output branches: one branch is left unprocessed, while the other passes in turn through a pooling layer, a fully connected layer, a ReLU excitation layer, a fully connected layer and a sigmoid excitation layer, and the two branch results are finally added and output; the last squeeze-and-excitation module is connected to the Bi-LSTM through a 3 × 3 convolution kernel, and the final output passes through a fully connected layer;
the text recognition network model is composed of a Bi-LSTM and a CNN; the input first passes through depthwise separable modules built from the CNN, where each module comprises as many 3 × 3 convolution kernels as input channels, batch normalization after their superposition, then a 1 × 1 convolution layer, and finally batch normalization, an activation function and a max pooling layer before output; the last depthwise separable module is connected to the Bi-LSTM module, which is connected to the sequence translation module;
S3, training the deep neural network model of step S2 with the sample set obtained in step S1 to obtain a trained neural network model, specifically comprising the following steps:
training a text detection network model: through forward propagation, the convolution modules extract the text picture feature information, and the feature map produced by the base network module has size W × H × C, where W is the feature map width, H the height and C the number of output channels;
target candidate region features are extracted with C 3 × 3 convolution kernels and preset preselected box sizes, then fed into a Bi-LSTM network to obtain a W × 256-dimensional output, which passes through a 512-dimensional fully connected layer; the output layer is divided into 2 parts: the first part performs coordinate regression with 512 × (4+10), where 512 denotes 512 features per point, 10 denotes 10 groups of preselected box sizes per point, and 4 denotes the composition of a preselected box size (xmin, xmax, ymin, ymax), i.e. the coordinates of two corner points; the second part uses 512 × (2+10) for category prediction, 512 and 10 have the same meaning as in the first part, and 2 denotes the two classes, background and non-background;
generating W × H × 10 different preselected boxes in total for the picture, removing redundant boxes with non-maximum suppression, and setting the threshold to 0.7;
calculating the offset of each candidate box relative to the ground-truth box for prediction box regression;
obtaining a final prediction box according to the category score and the coordinates; the overall loss function is the sum of a classification loss L_s^cl and a regression loss L_v^re; the first part, L_s^cl, uses the softmax function to supervise whether the prediction box contains text information, where s_i is the score of the i-th category and s_i* = 1 indicates a true text box; the second part, L_v^re, is an L1-smooth function for learning the y-direction offset regression of prediction boxes containing text, where v_j is the size of the j-th text prediction box, β is the task weight, and N_s and N_v are normalization parameters giving the number of samples of the corresponding task; the formulas are as follows:
L(s_i, v_j) = (1/N_s) Σ_i L_s^cl(s_i, s_i*) + (β/N_v) Σ_j L_v^re(v_j, v_j*)
L_s^cl(s_i, s_i*) = −[s_i* log s_i + (1 − s_i*) log(1 − s_i)]
L_v^re(v_j, v_j*) = smoothL1(v_j − v_j*), where smoothL1(x) = 0.5x² for |x| < 1 and |x| − 0.5 otherwise;
combining the obtained prediction boxes by a text line construction method, recursively merging two boxes into a group until no further merge is possible, where the merging conditions are: 1) the box is the nearest neighbour of the target box and less than 50 pixels away; 2) the intersection-over-union is greater than 0.7;
updating the weight parameters of each network layer through back propagation according to the loss function;
training a text recognition network model:
by forward propagation, an input picture of size 1 × W × 32 passes through four depthwise separable convolution modules that extract the text picture feature information; the final output is a feature map with 512 channels, height 1 and width W′, where W′ denotes the width W after reduction by the pooling layers;
because the features extracted by the CNN cannot be fed directly to the Bi-LSTM, a sequence of feature vectors must be extracted: the feature vectors are generated column by column from left to right on the feature map, each column contains 512 features, each feature vector is 512-dimensional, and W′ feature vectors are obtained in total;
the sequence then passes through 1 Bi-LSTM module with 256 hidden nodes, one feature vector being fed into each of the W′ time steps of the Bi-LSTM; the softmax probability distribution of the characters is finally obtained, forming a W′ × (number of character classes) posterior probability matrix that is used as the input to the CTC algorithm;
finding the label sequence with the highest combined probability through the CTC algorithm and outputting it;
the loss function O is expressed as follows, where X is the input sequence, l is the output label sequence, S is the training set, and p(l|X) represents the probability of the output sequence l given the input X: O = −Σ_{(X, l) ∈ S} ln p(l | X);
performing backward propagation according to the loss function, and updating the network weight parameter;
S4, inputting the patent text picture to be recognized into the trained neural network model to obtain the optical character recognition result.
CN201910940612.1A 2019-09-30 2019-09-30 Optical character recognition method in patent text scene Pending CN110674777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910940612.1A CN110674777A (en) 2019-09-30 2019-09-30 Optical character recognition method in patent text scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910940612.1A CN110674777A (en) 2019-09-30 2019-09-30 Optical character recognition method in patent text scene

Publications (1)

Publication Number Publication Date
CN110674777A true CN110674777A (en) 2020-01-10

Family

ID=69080609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910940612.1A Pending CN110674777A (en) 2019-09-30 2019-09-30 Optical character recognition method in patent text scene

Country Status (1)

Country Link
CN (1) CN110674777A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414908A (en) * 2020-03-16 2020-07-14 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing caption characters in video
CN111985484A (en) * 2020-08-11 2020-11-24 云南电网有限责任公司电力科学研究院 CNN-LSTM-based temperature instrument digital identification method and device
CN112052853A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Text positioning method of handwritten meteorological archive data based on deep learning
CN112052852A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Character recognition method of handwritten meteorological archive data based on deep learning
CN112270174A (en) * 2020-11-10 2021-01-26 清华大学深圳国际研究生院 Rumor detection method and computer readable storage medium
CN112287934A (en) * 2020-08-12 2021-01-29 北京京东尚科信息技术有限公司 Method and device for recognizing characters and obtaining character image feature extraction model
CN112348007A (en) * 2020-10-21 2021-02-09 杭州师范大学 Optical character recognition method based on neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109902622A (en) * 2019-02-26 2019-06-18 中国科学院重庆绿色智能技术研究院 A kind of text detection recognition methods for boarding pass information verifying
CN109977950A (en) * 2019-03-22 2019-07-05 上海电力学院 A kind of character recognition method based on mixing CNN-LSTM network
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109902622A (en) * 2019-02-26 2019-06-18 中国科学院重庆绿色智能技术研究院 A kind of text detection recognition methods for boarding pass information verifying
CN109977950A (en) * 2019-03-22 2019-07-05 上海电力学院 A kind of character recognition method based on mixing CNN-LSTM network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BOLAN SU et al.: "Accurate recognition of words in scenes without character segmentation using recurrent neural network", Pattern Recognition *
YANHUA SHAO et al.: "Using Multi-Scale Infrared Optical Flow-based Crowd motion estimation for Autonomous Monitoring UAV", 2018 Chinese Automation Congress (CAC) *
曾劲松 et al.: "Intelligent classification of massive information based on a conflict game algorithm", 《计算机科学》 (Computer Science) *
谭咏梅 et al.: "Chinese textual entailment recognition method based on CNN and bidirectional LSTM", 《中文信息学报》 (Journal of Chinese Information Processing) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414908A (en) * 2020-03-16 2020-07-14 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing caption characters in video
CN111414908B (en) * 2020-03-16 2023-08-29 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing caption characters in video
CN111985484A (en) * 2020-08-11 2020-11-24 云南电网有限责任公司电力科学研究院 CNN-LSTM-based temperature instrument digital identification method and device
CN112287934A (en) * 2020-08-12 2021-01-29 北京京东尚科信息技术有限公司 Method and device for recognizing characters and obtaining character image feature extraction model
CN112052853A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Text positioning method of handwritten meteorological archive data based on deep learning
CN112052852A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Character recognition method of handwritten meteorological archive data based on deep learning
CN112052852B (en) * 2020-09-09 2023-12-29 国家气象信息中心 Character recognition method of handwriting meteorological archive data based on deep learning
CN112052853B (en) * 2020-09-09 2024-02-02 国家气象信息中心 Text positioning method of handwriting meteorological archive data based on deep learning
CN112348007A (en) * 2020-10-21 2021-02-09 杭州师范大学 Optical character recognition method based on neural network
CN112348007B (en) * 2020-10-21 2023-12-19 杭州师范大学 Optical character recognition method based on neural network
CN112270174A (en) * 2020-11-10 2021-01-26 清华大学深圳国际研究生院 Rumor detection method and computer readable storage medium
CN112270174B (en) * 2020-11-10 2022-04-29 清华大学深圳国际研究生院 Rumor detection method and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110674777A (en) Optical character recognition method in patent text scene
Zhao et al. Document image binarization with cascaded generators of conditional generative adversarial networks
CN111652332B (en) Deep learning handwritten Chinese character recognition method and system based on two classifications
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN112070768B (en) Anchor-Free based real-time instance segmentation method
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN113537227B (en) Structured text recognition method and system
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN111563563B (en) Method for enhancing combined data of handwriting recognition
CN111666937A (en) Method and system for recognizing text in image
CN113673338A (en) Natural scene text image character pixel weak supervision automatic labeling method, system and medium
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
CN114155244A (en) Defect detection method, device, equipment and storage medium
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
CN110503090B (en) Character detection network training method based on limited attention model, character detection method and character detector
US8340428B2 (en) Unsupervised writer style adaptation for handwritten word spotting
Vinokurov Using a convolutional neural network to recognize text elements in poor quality scanned images
Dipu et al. Bangla optical character recognition (ocr) using deep learning based image classification algorithms
CN111144469B (en) End-to-end multi-sequence text recognition method based on multi-dimensional associated time sequence classification neural network
CN115640401A (en) Text content extraction method and device
Kasi et al. A deep learning based cross model text to image generation using DC-GAN
Lai et al. Robust text line detection in equipment nameplate images
Zulkarnain et al. Table information extraction using data augmentation on deep learning and image processing
Ahmed et al. Sub-sampling approach for unconstrained Arabic scene text analysis by implicit segmentation based deep learning classifier
Bureš et al. Semantic text segmentation from synthetic images of full-text documents

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200110