CN111027562A - Optical character recognition method based on multi-scale CNN and RNN combined with attention mechanism


Info

Publication number
CN111027562A
CN111027562A
Authority
CN
China
Prior art keywords
layer
output end
convolution
convolutional
adder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911241447.7A
Other languages
Chinese (zh)
Other versions
CN111027562B (en)
Inventor
李得元 (Li Deyuan)
代超 (Dai Chao)
何帆 (He Fan)
周振 (Zhou Zhen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Power Health Cloud Technology Co ltd
Original Assignee
China Power Health Cloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Power Health Cloud Technology Co ltd filed Critical China Power Health Cloud Technology Co ltd
Priority to CN201911241447.7A
Publication of CN111027562A
Application granted
Publication of CN111027562B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/28Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an optical character recognition method based on a multi-scale CNN and an RNN combined with an attention mechanism, relating to the technical field of image optical character recognition. The method comprises: acquiring a plurality of pictures containing characters to construct a data set, and preprocessing the pictures in the data set to obtain image data and vector labels; inputting the image data and vector labels into a preset network model, and extracting features sequentially through a convolution module, a recurrent neural network and an attention mechanism module in the network model to obtain a feature matrix; inputting the feature matrix into a CTC module in the network model for decoding, calculating the CTC loss function, optimizing the parameters of the preset network model through back propagation of the loss until the network model converges, and outputting the trained network model. The method yields an accurate recognition result and a good recognition effect.

Description

Optical character recognition method based on multi-scale CNN and RNN combined with attention mechanism
Technical Field
The invention relates to the technical field of image optical character recognition, in particular to an optical character recognition method based on a multi-scale CNN and an RNN combined with an attention mechanism.
Background
Optical Character Recognition (OCR) refers to the process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text with a character recognition method. That is, for printed characters, the text of a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software then converts the characters in the image into a text format for further editing by word-processing software. In image optical character recognition, image information is generally acquired by a scanner or a digital camera and stored in an image file; OCR software then reads and analyzes the image file and extracts the character strings through character recognition.
In earlier OCR systems, the recognition process was divided into two steps: single-character segmentation and classification. Typically, a text image containing a sequence of characters was first cut into individual characters by a projection method, and each character was then fed to a CNN for classification. This approach is now somewhat dated; end-to-end character recognition based on deep learning has become more popular. That is, no explicit character-segmentation step is added; character recognition is instead cast as a sequence-learning problem. Although input images vary in scale and text length, after the input image passes through the CNN and RNN, the whole line of text can be recognized at the output; in other words, character segmentation is absorbed into the deep learning model.
At present, end-to-end OCR based on deep learning has two mainstream technologies: CRNN OCR and attention OCR. The two methods differ mainly in the final output layer (the translation layer), namely in how the sequence features learned by the network are converted into the final recognition result. Both adopt a CNN + RNN network structure in the feature learning stage, where CNN denotes a convolutional neural network and RNN a recurrent neural network; however, CRNN OCR uses the CTC algorithm for alignment and decoding, while attention OCR uses an attention mechanism.
Although both technologies currently achieve good OCR recognition results, two problems arise in the feature learning stage of CRNN OCR. In the CNN stage, because only a convolutional neural network is used for information extraction, the extracted information is easily incomplete, causing recognition errors; in the RNN stage, the sequence features are extracted only by the recurrent neural network, which cannot guarantee that the sequence features are fully extracted, resulting in a poor recognition effect.
Disclosure of Invention
The invention aims to: solve the problems that incomplete information extraction and incomplete sequence-feature extraction easily occur in the feature learning stage of the conventional CRNN OCR, causing erroneous recognition results and a poor recognition effect. To this end, the invention provides an optical character recognition method based on a multi-scale CNN and an RNN combined with an attention mechanism.
To achieve the above purpose, the invention specifically adopts the following technical scheme:
the optical character recognition method based on the multi-scale CNN and the RNN combined with the attention mechanism comprises the following steps:
S1: acquiring a plurality of pictures containing characters to construct a data set, and preprocessing the pictures in the data set to obtain image data and vector labels;
S2: inputting the image data and vector labels into a preset network model, and extracting features sequentially through a convolution module, a recurrent neural network and an attention mechanism module in the network model to obtain a feature matrix;
S3: inputting the feature matrix into a CTC module in the network model for decoding, calculating the CTC loss function, optimizing the parameters of the preset network model through back propagation of the loss until the network model converges, and outputting the trained network model;
S4: performing optical character recognition on the picture to be recognized by using the trained network model to obtain the final recognition result.
Further, in S1, preprocessing the pictures in the data set to obtain the image data specifically comprises: reading the picture in RGB format, scaling it to (32,256,3), and normalizing the pixel values to obtain the image data.
Further, in S1, preprocessing the pictures in the data set to obtain the vector labels specifically comprises: transcoding the characters in the picture into binary vectors according to the dictionary to obtain the vector labels.
Further, the specific structure of the convolution module is as follows:
the input end of the convolution module is connected with the first convolutional layer; the first convolutional layer comprises 64 convolution kernels, each of size 3 x 3 with a stride of 2 and a ReLU activation function, and the output end of the first convolutional layer is connected with the input end of the second convolutional layer;
the second convolutional layer comprises 128 convolution kernels, each of size 3 x 3 with a stride of 2 and a ReLU activation function, and the output end of the second convolutional layer is connected with the input end of the third convolutional layer;
the third convolutional layer comprises four branches, and the output ends of the four branches are concatenated and then connected with the input end of the fourth convolutional layer;
the fourth convolutional layer comprises 256 convolution kernels, each of size 3 x 3 with a stride of (2,1) and a ReLU activation function, and the output end of the fourth convolutional layer is connected with the input end of the fifth convolutional layer;
the fifth convolutional layer comprises 512 convolution kernels, each of size 3 x 3 with a stride of (2,1) and a ReLU activation function, and the output end of the fifth convolutional layer is connected with the input end of the sixth convolutional layer;
the sixth convolutional layer comprises 512 convolution kernels, each of size 2 x 1 with a stride of 1 and a ReLU activation function, and the output end of the sixth convolutional layer is connected with the input end of the seventh convolutional layer;
the seventh convolutional layer comprises a Squeeze module, and the Squeeze module performs a squeeze operation on the input features to remove the first dimension, yielding the output features of the convolution module.
Further, the four branches of the third convolutional layer are respectively:
the first branch is a convolution branch comprising 128 convolution kernels, each of size 1 x 1, with a ReLU activation function;
the second branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 1, and a ReLU activation function;
the third branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 3, and a ReLU activation function;
the fourth branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 5, and a ReLU activation function.
Further, the specific structure of the recurrent neural network and attention mechanism module is as follows:
the first layer comprises two branches; the first branch comprises a Position Embedding layer, the output end of the Position Embedding layer is connected with adder A and with the first Multi-Head Attention layer respectively, the output end of the first Multi-Head Attention layer is connected with adder A, the output end of adder A is connected with the first Layer Normalization layer, the output end of the first Layer Normalization layer is connected with adder B and with the first Position-wise Feed-Forward layer respectively, the output end of the first Position-wise Feed-Forward layer is connected with adder B, the output end of adder B is connected with the second Layer Normalization layer, and the output end of the second Layer Normalization layer is connected with adder C;
the second branch comprises a first bidirectional LSTM layer, and the output end of the first bidirectional LSTM layer is connected with adder C;
the second layer comprises two branches; the first branch comprises a second Multi-Head Attention layer connected with the output end of the second Layer Normalization layer, the output end of the second Multi-Head Attention layer and the output end of the second Layer Normalization layer are both connected with adder D, the output end of adder D is connected with the third Layer Normalization layer, the output end of the third Layer Normalization layer is connected with the second Position-wise Feed-Forward layer and with adder E respectively, the output end of the second Position-wise Feed-Forward layer is connected with adder E, the output end of adder E is connected with the fourth Layer Normalization layer, and the output end of the fourth Layer Normalization layer is connected with adder F;
the second branch comprises a second bidirectional LSTM layer, the output end of adder C is connected with the second bidirectional LSTM layer, and the output end of the second bidirectional LSTM layer is connected with adder F;
the third layer comprises a fifth Layer Normalization layer; the output end of adder F is connected with the fifth Layer Normalization layer, the output end of the fifth Layer Normalization layer is connected with the third bidirectional LSTM layer, the output end of the third bidirectional LSTM layer is connected with the fully connected layer, the number of neurons of the fully connected layer is the number of characters plus 1, and finally the feature matrix is output.
Further, in S3, the CTC loss function is optimized using the Adam gradient descent algorithm.
Further, S4 specifically comprises:
S4.1: reading the picture to be recognized in RGB format, scaling it to (32,256,3), and normalizing its pixel values to obtain the image data to be recognized;
S4.2: inputting the image data to be recognized into the trained network model, and extracting features through the convolution module, the recurrent neural network and the attention mechanism module in the trained network model to obtain the feature matrix to be recognized;
S4.3: decoding the feature matrix to be recognized by the CTC module in the trained network model to obtain a decoding result;
S4.4: comparing the decoding result with the dictionary to obtain the final recognition result.
The invention has the following beneficial effects:
1. In the feature learning stage, the method adds multi-scale convolution in the CNN stage, so that information over a wider receptive field can be obtained; in the RNN stage, the RNN and the attention mechanism are combined to jointly extract the sequence features, which ensures that the sequence features are fully extracted. As a result, the recognition result is more accurate and the recognition effect is better.
Drawings
FIG. 1 is a schematic process flow diagram of an embodiment of the present invention.
FIG. 2 is a schematic diagram of a network model structure according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a convolution module structure according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a recurrent neural network and attention mechanism module structure according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a picture to be recognized according to an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention by those skilled in the art, the present invention will be described in further detail below with reference to the accompanying drawings and the following examples.
Example 1
As shown in FIG. 1, the present embodiment provides an optical character recognition method based on a multi-scale CNN and an RNN combined with an attention mechanism, comprising:
S1: acquiring a plurality of pictures containing characters to construct a data set, and preprocessing the pictures in the data set to obtain image data and vector labels;
the pictures in the data set are preprocessed to obtain the image data, specifically: the picture is read in RGB format and scaled to (32,256,3), and the pixel values are normalized to obtain the image data;
the pictures in the data set are preprocessed to obtain the vector labels, specifically: the characters in the picture are transcoded into binary vectors according to the dictionary to obtain the vector labels.
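The following is a minimal Python sketch of this preprocessing (Python is used for all illustrative snippets below), assuming OpenCV and NumPy. The file path and the toy dictionary are illustrative only, and encode_label returns integer indices, the form the CTC loss expects, while the binary (one-hot) expansion named above would be np.eye(N)[label].

```python
import cv2
import numpy as np

def preprocess_image(path):
    """Read a picture as RGB, scale it to (32, 256, 3), normalize pixels."""
    img = cv2.imread(path)                      # OpenCV loads BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert to RGB format
    img = cv2.resize(img, (256, 32))            # dsize is (width, height)
    return img.astype(np.float32) / 255.0       # normalize pixel values

def encode_label(text, char_dict):
    """Transcode characters to indices via the dictionary; a binary
    (one-hot) expansion would be np.eye(len(char_dict))[label]."""
    return np.array([char_dict[c] for c in text], dtype=np.int32)

# Hypothetical usage with a toy dictionary (one index per character).
char_dict = {c: i for i, c in enumerate("健康体检结果")}
image = preprocess_image("sample.jpg")          # shape (32, 256, 3)
label = encode_label("体检结果", char_dict)
```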
S2: inputting the image data and vector labels into the preset network model shown in FIG. 2, and extracting features sequentially through the convolution module, the recurrent neural network and the attention mechanism module in the network model to obtain a feature matrix;
As shown in FIG. 3, the specific structure of the convolution module is as follows:
the input features of the convolution module are the image data and the vector labels, which are fed into the convolution module through its input end; the input end is connected with the first convolutional layer, the first convolutional layer comprises 64 convolution kernels, each of size 3 x 3 with a stride of 2 and a ReLU activation function, and the output end of the first convolutional layer is connected with the input end of the second convolutional layer;
the second convolutional layer comprises 128 convolution kernels, each of size 3 x 3 with a stride of 2 and a ReLU activation function, and the output end of the second convolutional layer is connected with the input end of the third convolutional layer;
the third convolutional layer comprises four branches, and the output ends of the four branches are concatenated and then connected with the input end of the fourth convolutional layer;
the first branch is a convolution branch comprising 128 convolution kernels, each of size 1 x 1, with a ReLU activation function;
the second branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 1, and a ReLU activation function;
the third branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 3, and a ReLU activation function;
the fourth branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 5, and a ReLU activation function;
the output features of the four branches are spliced and then input into the fourth convolutional layer;
the fourth convolutional layer comprises 256 convolution kernels, each of size 3 x 3 with a stride of (2,1) and a ReLU activation function, and the output end of the fourth convolutional layer is connected with the input end of the fifth convolutional layer;
the fifth convolutional layer comprises 512 convolution kernels, each of size 3 x 3 with a stride of (2,1) and a ReLU activation function, and the output end of the fifth convolutional layer is connected with the input end of the sixth convolutional layer;
the sixth convolutional layer comprises 512 convolution kernels, each of size 2 x 1 with a stride of 1 and a ReLU activation function, and the output end of the sixth convolutional layer is connected with the input end of the seventh convolutional layer;
the seventh convolutional layer comprises a Squeeze module; the Squeeze module performs a squeeze operation on the input features to remove the first dimension, changing the size of the output feature matrix from (1,32,512) to (32,512) and yielding the output features of the convolution module.
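Read against FIG. 3, the convolution module can be sketched in TensorFlow/Keras as below. This is an interpretation under stated assumptions, not the patent's reference implementation: `same` padding is assumed for all but the final 2 x 1 convolution, so the sequence length after the squeeze (64 here) depends on these padding choices and may differ from the (32,512) quoted above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_module(inputs):
    # First and second convolutional layers: 3 x 3 kernels, stride 2, ReLU.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)

    # Third layer: four parallel branches with different receptive fields.
    b1 = layers.Conv2D(128, 1, activation="relu")(x)
    b2 = layers.SeparableConv2D(128, 3, padding="same", dilation_rate=1,
                                activation="relu")(x)
    b3 = layers.SeparableConv2D(128, 3, padding="same", dilation_rate=3,
                                activation="relu")(x)
    b4 = layers.SeparableConv2D(128, 3, padding="same", dilation_rate=5,
                                activation="relu")(x)
    x = layers.Concatenate()([b1, b2, b3, b4])   # splice the branch outputs

    # Fourth to sixth layers: downsample height only, keep the width axis.
    x = layers.Conv2D(256, 3, strides=(2, 1), padding="same", activation="relu")(x)
    x = layers.Conv2D(512, 3, strides=(2, 1), padding="same", activation="relu")(x)
    x = layers.Conv2D(512, (2, 1), strides=1, padding="valid", activation="relu")(x)

    # Seventh layer: squeeze away the singleton height dimension,
    # turning (batch, 1, W, 512) into a (batch, W, 512) sequence.
    return tf.squeeze(x, axis=1)

inputs = layers.Input(shape=(32, 256, 3))
features = conv_module(inputs)                   # (batch, 64, 512) here
```

The four parallel branches apply different effective receptive fields to the same feature map before concatenation, which is the multi-scale ingredient the summary credits with a wider field of view.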
As shown in FIG. 4, the specific structure of the recurrent neural network and attention mechanism module is as follows:
the output features of the convolution module serve as the input features of the recurrent neural network and attention mechanism module; the first layer comprises two branches, the first branch comprises a Position Embedding layer, the output end of the Position Embedding layer is connected with adder A and with the first Multi-Head Attention layer respectively, the output end of the first Multi-Head Attention layer is connected with adder A, the output end of adder A is connected with the first Layer Normalization layer, the output end of the first Layer Normalization layer is connected with adder B and with the first Position-wise Feed-Forward layer respectively, the output end of the first Position-wise Feed-Forward layer is connected with adder B, the output end of adder B is connected with the second Layer Normalization layer, and the output end of the second Layer Normalization layer is connected with adder C;
the second branch comprises a first bidirectional LSTM layer, and the output end of the first bidirectional LSTM layer is connected with adder C;
the second layer comprises two branches; the first branch comprises a second Multi-Head Attention layer connected with the output end of the second Layer Normalization layer, the output end of the second Multi-Head Attention layer and the output end of the second Layer Normalization layer are both connected with adder D, the output end of adder D is connected with the third Layer Normalization layer, the output end of the third Layer Normalization layer is connected with the second Position-wise Feed-Forward layer and with adder E respectively, the output end of the second Position-wise Feed-Forward layer is connected with adder E, the output end of adder E is connected with the fourth Layer Normalization layer, and the output end of the fourth Layer Normalization layer is connected with adder F;
the second branch comprises a second bidirectional LSTM layer, the output end of adder C is connected with the second bidirectional LSTM layer, and the output end of the second bidirectional LSTM layer is connected with adder F;
the third layer comprises a fifth Layer Normalization layer; the output end of adder F is connected with the fifth Layer Normalization layer, the output end of the fifth Layer Normalization layer is connected with the third bidirectional LSTM layer, the output end of the third bidirectional LSTM layer is connected with the fully connected layer Dense, the number of neurons of the fully connected layer is the number of characters plus 1, and finally the feature matrix is output.
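A corresponding TensorFlow/Keras sketch of FIG. 4 follows. The head count, feed-forward width, sequence length and class count are illustrative assumptions not fixed by the patent; each level sums a Transformer-style branch with a bidirectional LSTM branch, mirroring adders C and F above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_block(x, d_model=512, num_heads=8, d_ff=2048):
    # Multi-Head Attention -> adder -> Layer Normalization, then
    # Position-wise Feed-Forward -> adder -> Layer Normalization.
    a = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=d_model // num_heads)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, a]))
    f = layers.Dense(d_ff, activation="relu")(x)
    f = layers.Dense(d_model)(f)
    return layers.LayerNormalization()(layers.Add()([x, f]))

def sequence_module(features, seq_len=64, d_model=512, num_classes=5001):
    # First layer, branch 1: position embedding plus a Transformer block;
    # branch 2: a bidirectional LSTM. Adder C sums the two branches.
    pos = layers.Embedding(seq_len, d_model)(tf.range(seq_len))
    t1 = transformer_block(features + pos)
    l1 = layers.Bidirectional(
        layers.LSTM(d_model // 2, return_sequences=True))(features)
    c = layers.Add()([t1, l1])

    # Second layer mirrors the first on both branches; adder F fuses again.
    t2 = transformer_block(t1)
    l2 = layers.Bidirectional(
        layers.LSTM(d_model // 2, return_sequences=True))(c)
    f = layers.Add()([t2, l2])

    # Third layer: fifth Layer Normalization, third bidirectional LSTM,
    # then Dense with (number of characters + 1) units for the CTC blank;
    # num_classes = 5001 stands in for an unspecified dictionary size.
    x = layers.LayerNormalization()(f)
    x = layers.Bidirectional(
        layers.LSTM(d_model // 2, return_sequences=True))(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```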
S3: inputting the feature matrix into the CTC module of the network model for decoding, wherein the CTC module in this embodiment is a CTC decoder; calculating the CTC loss function and optimizing it with the Adam gradient descent algorithm, adjusting the parameters of the preset network model through back propagation of the loss until the network model converges, and outputting the trained network model.
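A hedged sketch of this training step, assuming the Keras backend's ctc_batch_cost (dense labels plus the true input and label lengths) and an illustrative learning rate:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # illustrative

@tf.function
def train_step(model, images, labels, input_len, label_len):
    # labels: dense index matrix; input_len/label_len: shape (batch, 1).
    with tf.GradientTape() as tape:
        y_pred = model(images, training=True)   # (batch, time, classes)
        loss = tf.reduce_mean(
            K.ctc_batch_cost(labels, y_pred, input_len, label_len))
    # Back-propagate the CTC loss and update the model parameters.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```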
S4: performing optical character recognition on the picture to be recognized shown in FIG. 5 by using the trained network model to obtain the final recognition result, which specifically comprises:
S4.1: reading the picture to be recognized in RGB format, scaling it to (32,256,3), and normalizing its pixel values to obtain the image data to be recognized;
S4.2: inputting the image data to be recognized into the trained network model, and extracting features through the convolution module, the recurrent neural network and the attention mechanism module in the trained network model to obtain the feature matrix to be recognized;
S4.3: decoding the feature matrix to be recognized by the CTC module in the trained network model to obtain a decoding result;
S4.4: comparing the decoding result with the dictionary to obtain the final recognition result: the character sequence [ '健', '康', '体', '检', '结', '果' ] ("health checkup result").
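Finally, a greedy CTC-decoding sketch for S4.3 and S4.4; inv_dict (index back to character) is the inverse of the hypothetical dictionary from the preprocessing sketch, and beam search is a drop-in alternative via greedy=False.

```python
import numpy as np
from tensorflow.keras import backend as K

def ctc_decode_to_text(y_pred, inv_dict):
    # y_pred: (batch, time, classes) softmax output of the trained model.
    input_len = np.full(y_pred.shape[0], y_pred.shape[1])
    decoded, _ = K.ctc_decode(y_pred, input_length=input_len, greedy=True)
    # -1 pads the decoded sequences; drop it before the dictionary lookup.
    return ["".join(inv_dict[int(i)] for i in seq if i != -1)
            for seq in decoded[0].numpy()]
```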
The above description is only a preferred embodiment of the present invention and is not intended to limit it; the scope of the present invention is defined by the appended claims, and all structural changes made using the contents of the description and drawings of the present invention are embraced therein.

Claims (8)

1. An optical character recognition method based on a multi-scale CNN and an RNN combined with an attention mechanism, characterized by comprising the following steps:
S1: acquiring a plurality of pictures containing characters to construct a data set, and preprocessing the pictures in the data set to obtain image data and vector labels;
S2: inputting the image data and vector labels into a preset network model, and extracting features sequentially through a convolution module, a recurrent neural network and an attention mechanism module in the network model to obtain a feature matrix;
S3: inputting the feature matrix into a CTC module in the network model for decoding, calculating the CTC loss function, optimizing the parameters of the preset network model through back propagation of the loss until the network model converges, and outputting the trained network model;
S4: performing optical character recognition on the picture to be recognized by using the trained network model to obtain the final recognition result.
2. The optical character recognition method according to claim 1, wherein in S1, the pictures in the data set are preprocessed to obtain the image data, specifically: the picture is read in RGB format and scaled to (32,256,3), and the pixel values of the picture are normalized to obtain the image data.
3. The optical character recognition method according to claim 1, wherein in S1, the pictures in the data set are preprocessed to obtain the vector labels, specifically: the characters in the picture are transcoded into binary vectors according to the dictionary to obtain the vector labels.
4. The optical character recognition method according to claim 1, wherein the specific structure of the convolution module is as follows:
the input end of the convolution module is connected with the first convolutional layer; the first convolutional layer comprises 64 convolution kernels, each of size 3 x 3 with a stride of 2 and a ReLU activation function, and the output end of the first convolutional layer is connected with the input end of the second convolutional layer;
the second convolutional layer comprises 128 convolution kernels, each of size 3 x 3 with a stride of 2 and a ReLU activation function, and the output end of the second convolutional layer is connected with the input end of the third convolutional layer;
the third convolutional layer comprises four branches, and the output ends of the four branches are concatenated and then connected with the input end of the fourth convolutional layer;
the fourth convolutional layer comprises 256 convolution kernels, each of size 3 x 3 with a stride of (2,1) and a ReLU activation function, and the output end of the fourth convolutional layer is connected with the input end of the fifth convolutional layer;
the fifth convolutional layer comprises 512 convolution kernels, each of size 3 x 3 with a stride of (2,1) and a ReLU activation function, and the output end of the fifth convolutional layer is connected with the input end of the sixth convolutional layer;
the sixth convolutional layer comprises 512 convolution kernels, each of size 2 x 1 with a stride of 1 and a ReLU activation function, and the output end of the sixth convolutional layer is connected with the input end of the seventh convolutional layer;
the seventh convolutional layer comprises a Squeeze module, and the Squeeze module performs a squeeze operation on the input features to remove the first dimension, yielding the output features of the convolution module.
5. The optical character recognition method according to claim 4, wherein the four branches of the third convolutional layer are respectively:
the first branch is a convolution branch comprising 128 convolution kernels, each of size 1 x 1, with a ReLU activation function;
the second branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 1, and a ReLU activation function;
the third branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 3, and a ReLU activation function;
the fourth branch is a depthwise separable convolution branch comprising 128 convolution kernels, each of size 3 x 3, with a stride of 1, a dilation rate of 5, and a ReLU activation function.
6. The optical character recognition method according to claim 1, wherein the specific structure of the recurrent neural network and attention mechanism module is as follows:
the first layer comprises two branches; the first branch comprises a Position Embedding layer, the output end of the Position Embedding layer is connected with adder A and with the first Multi-Head Attention layer respectively, the output end of the first Multi-Head Attention layer is connected with adder A, the output end of adder A is connected with the first Layer Normalization layer, the output end of the first Layer Normalization layer is connected with adder B and with the first Position-wise Feed-Forward layer respectively, the output end of the first Position-wise Feed-Forward layer is connected with adder B, the output end of adder B is connected with the second Layer Normalization layer, and the output end of the second Layer Normalization layer is connected with adder C;
the second branch comprises a first bidirectional LSTM layer, and the output end of the first bidirectional LSTM layer is connected with adder C;
the second layer comprises two branches; the first branch comprises a second Multi-Head Attention layer connected with the output end of the second Layer Normalization layer, the output end of the second Multi-Head Attention layer and the output end of the second Layer Normalization layer are both connected with adder D, the output end of adder D is connected with the third Layer Normalization layer, the output end of the third Layer Normalization layer is connected with the second Position-wise Feed-Forward layer and with adder E respectively, the output end of the second Position-wise Feed-Forward layer is connected with adder E, the output end of adder E is connected with the fourth Layer Normalization layer, and the output end of the fourth Layer Normalization layer is connected with adder F;
the second branch comprises a second bidirectional LSTM layer, the output end of adder C is connected with the second bidirectional LSTM layer, and the output end of the second bidirectional LSTM layer is connected with adder F;
the third layer comprises a fifth Layer Normalization layer; the output end of adder F is connected with the fifth Layer Normalization layer, the output end of the fifth Layer Normalization layer is connected with the third bidirectional LSTM layer, the output end of the third bidirectional LSTM layer is connected with the fully connected layer, the number of neurons of the fully connected layer is the number of characters plus 1, and finally the feature matrix is output.
7. The optical character recognition method according to claim 1, wherein in S3, the CTC loss function is optimized using the Adam gradient descent algorithm.
8. The optical character recognition method according to any one of claims 1-7, wherein S4 specifically comprises:
S4.1: reading the picture to be recognized in RGB format, scaling it to (32,256,3), and normalizing its pixel values to obtain the image data to be recognized;
S4.2: inputting the image data to be recognized into the trained network model, and extracting features through the convolution module, the recurrent neural network and the attention mechanism module in the trained network model to obtain the feature matrix to be recognized;
S4.3: decoding the feature matrix to be recognized by the CTC module in the trained network model to obtain a decoding result;
S4.4: comparing the decoding result with the dictionary to obtain the final recognition result.
CN201911241447.7A 2019-12-06 2019-12-06 Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism Active CN111027562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911241447.7A CN111027562B (en) 2019-12-06 2019-12-06 Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism


Publications (2)

Publication Number Publication Date
CN111027562A 2020-04-17
CN111027562B (en) 2023-07-18

Family

ID=70204520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911241447.7A Active CN111027562B (en) 2019-12-06 2019-12-06 Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism

Country Status (1)

Country Link
CN (1) CN111027562B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368831A (en) * 2017-07-19 2017-11-21 中国人民解放军国防科学技术大学 English words and digit recognition method in a kind of natural scene image
CN108875722A (en) * 2017-12-27 2018-11-23 北京旷视科技有限公司 Character recognition and identification model training method, device and system and storage medium
US20190251431A1 (en) * 2018-02-09 2019-08-15 Salesforce.Com, Inc. Multitask Learning As Question Answering
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
CN109753954A (en) * 2018-11-14 2019-05-14 安徽艾睿思智能科技有限公司 The real-time positioning identifying method of text based on deep learning attention mechanism
CN109543681A (en) * 2018-11-20 2019-03-29 中国石油大学(华东) Character recognition method under a kind of natural scene based on attention mechanism
CN109992783A (en) * 2019-04-03 2019-07-09 同济大学 Chinese term vector modeling method
CN110147788A (en) * 2019-05-27 2019-08-20 东北大学 A kind of metal plate and belt Product labelling character recognition method based on feature enhancing CRNN
CN110400275A (en) * 2019-07-22 2019-11-01 中电健康云科技有限公司 One kind being based on full convolutional neural networks and the pyramidal color calibration method of feature

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Baoguang Shi: "An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, p. 2299 *
Hao Wei: "Biomedical Named Entity Recognition via A Hybrid Neural Network Model", 2019 IEEE 14th International Conference on Intelligent Systems and Knowledge Engineering, Proceedings, p. 456 *
Vaswani, Ashish: "Attention Is All You Need", 31st Annual Conference on Neural Information Processing Systems, pp. 1-15 *
Zhang Dongmei (张冬梅): "Malicious domain name detection algorithm based on LSTM and multi-head attention mechanism", Software (《软件》), pp. 83-90 *
Xing Jiliang (邢吉亮): "Research on relation classification with a Bi-LSTM recurrent neural network combined with an attention mechanism", China Masters' Theses Full-text Database, pages 138-2026 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898606B (en) * 2020-05-19 2023-04-07 武汉东智科技股份有限公司 Night imaging identification method for superimposing transparent time characters in video image
CN111898606A (en) * 2020-05-19 2020-11-06 武汉东智科技股份有限公司 Night imaging identification method for superimposing transparent time characters in video image
CN112052889B (en) * 2020-08-28 2023-05-05 西安电子科技大学 Laryngoscope image recognition method based on double-gating recursion unit decoding
CN112052889A (en) * 2020-08-28 2020-12-08 西安电子科技大学 Laryngoscope image identification method based on double-gating recursive unit decoding
CN112508023A (en) * 2020-10-27 2021-03-16 重庆大学 Deep learning-based end-to-end identification method for code-spraying characters of parts
CN112183486A (en) * 2020-11-02 2021-01-05 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN112183486B (en) * 2020-11-02 2023-08-01 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN112836748A (en) * 2021-02-02 2021-05-25 太原科技大学 Casting identification character recognition method based on CRNN-CTC
CN112990181A (en) * 2021-04-30 2021-06-18 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and storage medium
CN113516124A (en) * 2021-05-29 2021-10-19 大连民族大学 Electric energy meter electricity consumption information identification algorithm based on computer vision technology
CN113516124B (en) * 2021-05-29 2023-08-11 大连民族大学 Electric energy meter electricity consumption identification algorithm based on computer vision technology
CN113537339B (en) * 2021-07-14 2023-06-02 中国地质大学(北京) Method and system for identifying symbiotic or associated minerals based on multi-label image classification
CN113537339A (en) * 2021-07-14 2021-10-22 中国地质大学(北京) Method and system for identifying symbiotic or associated minerals based on multi-label image classification
CN114724168A (en) * 2022-05-10 2022-07-08 北京百度网讯科技有限公司 Training method of deep learning model, text recognition method, text recognition device and text recognition equipment
CN116072274A (en) * 2023-03-06 2023-05-05 四川互慧软件有限公司 Automatic dispatch system for medical care of ambulance
CN116758544A (en) * 2023-08-17 2023-09-15 泓浒(苏州)半导体科技有限公司 Wafer code recognition system based on image processing
CN116758544B (en) * 2023-08-17 2023-10-20 泓浒(苏州)半导体科技有限公司 Wafer code recognition system based on image processing

Also Published As

Publication number Publication date
CN111027562B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111027562A (en) Optical character recognition method based on multi-scale CNN and RNN combined with attention mechanism
CN107239801B (en) Video attribute representation learning method and video character description automatic generation method
CN112818951B (en) Ticket identification method
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN112686219B (en) Handwritten text recognition method and computer storage medium
US11915465B2 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN112836702B (en) Text recognition method based on multi-scale feature extraction
CN113901952A (en) Print form and handwritten form separated character recognition method based on deep learning
CN111539417B (en) Text recognition training optimization method based on deep neural network
CN111027553A (en) Character recognition method for circular seal
CN113961710B (en) Fine-grained thesis classification method and device based on multi-mode layered fusion network
He Research on text detection and recognition based on OCR recognition technology
CN112418225A (en) Offline character recognition method for address scene recognition
CN117251795A (en) Multi-mode false news detection method based on self-adaptive fusion
CN111242829A (en) Watermark extraction method, device, equipment and storage medium
CN115035531A (en) Retail terminal character recognition method and system
Chen et al. Scene text recognition based on deep learning: a brief survey
CN113901913A (en) Convolution network for ancient book document image binaryzation
CN116311275B (en) Text recognition method and system based on seq2seq language model
CN116994282B (en) Reinforcing steel bar quantity identification and collection method for bridge design drawing
CN115861663B (en) Document image content comparison method based on self-supervision learning model
CN114581906B (en) Text recognition method and system for natural scene image
Manzoor et al. A Novel System for Multi-Linguistic Text Identification and Recognition in Natural Scenes using Deep Learning
Sharma et al. Feature Extraction and Image Recognition of Cursive Handwritten English Words Using Neural Network and IAM Off‐Line Database
CN117079288B (en) Method and model for extracting key information for recognizing Chinese semantics in scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant