CN115497095A

CN115497095A - OCR character recognition method and system based on attention mechanism

Info

Publication number: CN115497095A
Application number: CN202211182141.0A
Authority: CN
Inventors: 张盛洪; 张国慧; 张志坚; 罗瑞明; 王硕君; 英树祥; 邓雄文; 梁岸平; 蒋秀
Original assignee: Guangdong Power Grid Co Ltd; Jiangmen Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd; Jiangmen Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2022-12-20

Abstract

The invention provides an OCR character recognition method and system based on an attention mechanism. By using a multi-scale feature fusion method with an attention mechanism, more text features are retained, which reduces missed text. In addition, when the final feature map is obtained, coordinate attention is used to capture long-range feature correlations, which facilitates the detection of long text. Meanwhile, simple post-processing is adopted, so that the accuracy and inference speed of text detection are improved and the text recognition result is more accurate.

Description

OCR character recognition method and system based on attention mechanism

Technical Field

The invention belongs to the technical field of text recognition, and particularly relates to an OCR character recognition method and system based on an attention mechanism.

Background

At present, uploading a business license is one means by which a user obtains authentication, and the content of the business license generally has to be filled in manually. For a license with a large amount of text, the filling process is time-consuming, labor-intensive, and prone to errors. In the prior art, the text recognition steps for a business license are complicated and computationally expensive, which reduces the efficiency of text recognition.

OCR (optical character recognition) technology refers to the process in which an electronic device (e.g., a scanner or digital camera) examines printed characters on paper and then translates their shapes into computer text using character recognition methods. Existing OCR-based recognition methods are mostly built on traditional models. Such models often take a long time to detect text, perform poorly on long text, frequently miss small-scale text when text appears at multiple scales, and have low accuracy against complex backgrounds or on blurred images.

When detecting text in an image, the prior art often misses small-scale text, often produces several bounding boxes for a single long text line, and, when the image is not sharp enough, suffers from poor detection results, poor model robustness, and overly long inference time.

Disclosure of Invention

In view of this, the present invention aims to solve the problems of missed text and poor bounding box detection in conventional OCR character recognition technology.

In order to solve the technical problems, the invention provides the following technical scheme:

In a first aspect, the present invention provides an OCR character recognition method based on an attention mechanism, including the following steps:

preprocessing an image of an input picture to be recognized, and constructing a required word bank;

sending the processed picture into a text detection network to obtain a text bounding box coordinate, and performing text feature detection on the processed picture by the text detection network based on an attention mechanism;

clipping the input image according to the coordinates of the text bounding box to obtain a series of pictures only containing one line of text;

and sending the cut pictures into a text recognition network in sequence, and obtaining a final text recognition result after word bank comparison.

Further, image preprocessing is performed on the input picture to be recognized, and a required word bank is constructed, specifically including:

reading an input image and decoding the image into an image matrix with an RGB format;

keeping the width-to-height ratio of the image, and scaling the short edge in the image to 736 pixels;

normalizing the image matrix;

and constructing a corresponding word bank for the characters to be identified.

Further, the processed picture is sent to a text detection network to obtain a text bounding box coordinate, and the text detection network performs text feature detection on the processed picture based on an attention mechanism, and specifically includes:

the processed image is sent to a residual backbone network for preliminary extraction of features;

the residual network has four residual modules; the last-layer feature map of each residual module is taken out to construct a feature pyramid, whose layers are labeled the 1st, 2nd, 3rd and 4th layers from top to bottom;

firstly, performing attention feature fusion on the features of the 1 st layer and the 2 nd layer and performing convolution operation to obtain corrected feature maps of the 1 st layer and the 2 nd layer;

performing the attention feature fusion operation on the corrected layer 2 feature map and the layer 3 feature map, and then performing the attention feature fusion operation on the obtained corrected layer 3 feature map and the layer 4 feature map;

upsampling all layers of the corrected feature pyramid to the scale of the lower-layer feature map and splicing (concatenating) them;

performing secondary feature re-correction on the spliced feature map through coordinate attention;

setting a pixel threshold to 0.2, setting values greater than 0.2 in the final feature map to 1, and setting values less than or equal to 0.2 to 0, to obtain a binary map;

in the binary map, 1 represents a text area and 0 represents a non-text area; text contours are obtained by using a function in OpenCV, and the text box with the maximum confidence is selected as the final text contour, so that the coordinates of the text bounding box are obtained.

Further, the method for clipping the input image according to the coordinates of the text bounding box to obtain a series of pictures only containing one line of text lines specifically comprises the following steps:

clipping the input image according to the coordinates of the text bounding box;

arranging the cut images from top to bottom and from left to right;

scaling these pictures to a size of 32x100 pixels.

Further, the cut pictures are sequentially sent to a text recognition network, and a final text recognition result is obtained after word bank comparison, and the method specifically comprises the following steps:

sending the cut pictures into a text recognition network in sequence;

extracting text features through a CNN, and converting the feature map into a feature sequence;

sending the feature sequence to an RNN (recurrent neural network) for text prediction and recognition;

and inputting the predicted recognition result into a CTC (connectionist temporal classification) algorithm, and obtaining a final recognition result after word bank comparison.
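The CTC step and the word bank comparison can be illustrated with a minimal sketch. The source does not specify the decoding strategy or how the comparison is performed, so everything here is an assumption: greedy (best-path) CTC decoding is used, index 0 is assumed to be the CTC blank, the character set stands in for a character-level word bank, and the fuzzy match against a hypothetical list of expected entries is only one possible reading of "word bank comparison". The function names are hypothetical.

```python
from difflib import get_close_matches

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)

def ctc_greedy_decode(frame_scores, charset):
    """Best-path CTC decoding.

    frame_scores: one list of class scores per time step.
    charset: charset[i - 1] is the character for class index i.
    """
    # Pick the highest-scoring class for every frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    chars = []
    previous = BLANK
    for index in best:
        # Collapse repeated indices, then drop blanks.
        if index != BLANK and index != previous:
            chars.append(charset[index - 1])
        previous = index
    return "".join(chars)

def match_lexicon(text, lexicon, cutoff=0.6):
    """Snap the decoded text to the closest lexicon entry, if one is close enough."""
    matches = get_close_matches(text, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else text
```

A blank between two identical indices keeps both characters, which is how CTC represents doubled letters; when no lexicon entry is similar enough, the decoded text is returned unchanged.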

In a second aspect, the present invention provides an OCR character recognition system based on attention mechanism, including:

the preprocessing unit is used for preprocessing the image of the input picture to be recognized and constructing a required word bank;

the first processing unit is used for sending the processed picture into a text detection network to obtain a text boundary box coordinate, and the text detection network performs text feature detection on the processed picture based on an attention mechanism;

the second processing unit is used for cutting the input image according to the coordinates of the text bounding box to obtain a series of pictures only containing one line of text lines;

and the recognition unit is used for sequentially sending the cut pictures into a text recognition network and obtaining a final text recognition result after word bank comparison.

Further, in the preprocessing unit, image preprocessing is performed on the input picture to be recognized, and a required word bank is constructed, which specifically includes:

reading an input image and decoding the input image into an image matrix with an RGB format;

normalizing the image matrix;

and constructing a corresponding word bank for the characters to be identified.

Further, in the first processing unit, the processed picture is sent to a text detection network to obtain coordinates of a text bounding box, and the text detection network performs text feature detection on the processed picture based on an attention mechanism, and specifically includes:

performing the attention feature fusion operation on the corrected layer 2 feature map and the layer 3 feature map, and performing the attention feature fusion operation on the obtained corrected layer 3 feature map and the layer 4 feature map;

Further, in the second processing unit, the input image is cropped according to the coordinates of the text bounding box to obtain a series of pictures only including a line of text lines, which specifically includes:

clipping the input image according to the coordinates of the text bounding box;

arranging the cut images from top to bottom and from left to right;

scaling these pictures to a size of 32x100 pixels.

Further, in the recognition unit, the cut pictures are sequentially sent to a text recognition network, and a final text recognition result is obtained after word bank comparison, which specifically comprises:

sending the cut pictures into a text recognition network in sequence;

sending the feature sequence to an RNN (recurrent neural network) for text prediction and recognition;

In conclusion, the invention provides an OCR character recognition method and system based on an attention mechanism. By using a multi-scale feature fusion method with an attention mechanism, more text features are retained, which reduces missed text. In addition, when the final feature map is obtained, coordinate attention is used to capture long-range feature correlations, which facilitates the detection of long text. Meanwhile, simple post-processing is adopted, so that the accuracy and inference speed of text detection are improved and the text recognition result is more accurate.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of an OCR character recognition method based on an attention mechanism according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a text detection network according to an embodiment of the present invention;

Fig. 3 is a structural diagram of attention feature fusion according to an embodiment of the present invention;

Fig. 4 is a diagram illustrating the convolution operation according to an embodiment of the present invention;

Fig. 5 is a structural diagram of coordinate attention according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

OCR (optical character recognition) technology refers to the process in which an electronic device (e.g., a scanner or digital camera) examines printed characters on paper and then translates their shapes into computer text using character recognition methods. Conventional OCR-based recognition methods are mostly built on traditional models. Such models often take a long time to detect text, perform poorly on long text, frequently miss small-scale text when text appears at multiple scales, and have low accuracy against complex backgrounds or on blurred images.

When detecting text in a picture, the prior art often misses small-scale text, often produces several bounding boxes for a single long text line, and, when the picture is not sharp enough, suffers from poor detection results, poor model robustness, and overly long inference time.

Based on this, the invention provides an OCR character recognition method and system based on an attention mechanism.

An embodiment of an OCR character recognition method based on the attention mechanism according to the present invention is described in detail below.

Referring to fig. 1, the present embodiment provides an OCR character recognition method based on an attention mechanism, including:

Step 1: preprocessing the input picture to be recognized and constructing the required word bank.

Step 2: sending the processed picture into a text detection network to obtain the coordinates of the text bounding box, the text detection network performing text feature detection on the processed picture based on an attention mechanism.

Step 3: cropping the input image according to the coordinates of the text bounding box to obtain a series of pictures each containing only one line of text.

Step 4: sending the cropped pictures into a text recognition network in sequence, and obtaining a final text recognition result after word bank comparison.

In an alternative embodiment, the preprocessing and constructing the word stock in step 1 includes:

1.1: the input image is read and decoded into an image matrix in RGB format.

1.2: the width-to-height ratio of the image is maintained and the short edge of the image is scaled to 736 pixels.

1.3: the image matrix is normalized.

1.4: a word bank is constructed for the characters to be identified.
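Steps 1.2 and 1.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the image is assumed to be already decoded into an H x W x 3 RGB array (step 1.1), nearest-neighbour resampling stands in for whatever interpolation is actually used, and the mean and standard deviation values are placeholders because the source does not state the normalization constants. All function names are hypothetical.

```python
import numpy as np

SHORT_EDGE = 736  # target length of the short edge, as stated in step 1.2

def scaled_size(height, width, short_edge=SHORT_EDGE):
    """Return (new_height, new_width) that keeps the aspect ratio."""
    scale = short_edge / min(height, width)
    return round(height * scale), round(width * scale)

def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of an (H, W, C) array (assumed interpolation)."""
    h, w = image.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return image[rows][:, cols]

def preprocess(image, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Scale the short edge to 736 pixels and normalize each channel.

    mean and std are placeholder values, not taken from the source.
    """
    new_h, new_w = scaled_size(*image.shape[:2])
    resized = resize_nearest(image, new_h, new_w).astype(np.float32) / 255.0
    return (resized - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)
```

For example, a 368 x 1000 input would be scaled to 736 x 2000 before normalization.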

In an alternative embodiment, the structure of the text detection network described in step 2 is shown in fig. 2. The process of further processing using the text detection network is as follows:

2.1: the processed image is sent into a residual backbone network for preliminary feature extraction.

2.2: the residual network has four residual modules; the last-layer feature map of each residual module is taken out to construct a feature pyramid, whose layers are labeled the 1st, 2nd, 3rd and 4th layers from top to bottom.

2.3: attention feature fusion is first performed on the features of the 1st and 2nd layers; the fusion structure is shown in Fig. 3.

The feature map of the 1st layer is up-sampled so that it has the same width as the feature map of the 2nd layer, and the two are added pixel by pixel. The sum then passes through two branches: one branch first compresses the spatial dimensions to 1 through global pooling and then applies the convolution module, while the other branch applies the convolution module directly.

The convolution module is shown in Fig. 4: a 1x1 convolution first compresses the channels to reduce memory consumption; normalization is followed by an activation function to add nonlinearity among the features; a second 1x1 convolution then expands the channels back to the original channel number, and normalization is applied again.

The feature maps produced by the two branches are added pixel by pixel, the attention weight is calculated with a Sigmoid activation function, and the weight is multiplied pixel by pixel with the feature maps of the 1st and 2nd layers respectively to obtain the corrected feature maps of the 1st and 2nd layers.
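The attention feature fusion described above can be sketched in NumPy. This is an illustrative sketch under several assumptions: both feature maps are assumed to already have the same number of channels, nearest-neighbour up-sampling is used, ReLU is used as the activation function, and the normalization layers are omitted for brevity. The weight matrices and function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour up-sampling of a (C, H, W) feature map."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[:, rows][:, :, cols]

def conv_module(x, w_down, w_up):
    """1x1 conv (compress) -> ReLU -> 1x1 conv (expand); normalization omitted."""
    hidden = np.maximum(np.einsum("oc,chw->ohw", w_down, x), 0.0)
    return np.einsum("oc,chw->ohw", w_up, hidden)

def attention_feature_fusion(f1, f2, params):
    """Return the corrected layer-1 and layer-2 feature maps."""
    up1 = upsample_nearest(f1, f2.shape[1], f2.shape[2])
    summed = up1 + f2                                  # pixel-by-pixel addition
    pooled = summed.mean(axis=(1, 2), keepdims=True)   # global pooling to 1x1
    global_branch = conv_module(pooled, *params["global"])
    local_branch = conv_module(summed, *params["local"])
    weight = sigmoid(global_branch + local_branch)     # broadcast over H and W
    return weight * up1, weight * f2
```

Because the attention weight lies strictly between 0 and 1, each corrected feature map is a per-pixel, per-channel attenuation of its input.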

2.4: the attention feature fusion operation is performed on the corrected 2nd-layer feature map and the 3rd-layer feature map, and then on the resulting corrected 3rd-layer feature map and the 4th-layer feature map.

2.5: each layer of the corrected feature pyramid is up-sampled to the scale of the lower-layer feature map, and the layers are spliced (concatenated).

2.6: secondary feature re-correction is performed on the spliced feature map through coordinate attention. Coordinate attention is shown in Fig. 5:

The feature map is globally pooled along the X axis and the Y axis respectively; after a Reshape operation the two results are spliced along the spatial dimension, and a convolution module adds nonlinearity. The result is then split into two paths with a split operation, the attention weights of the X axis and the Y axis are obtained through a convolution operation and a Sigmoid activation function, and the feature map is multiplied by them pixel by pixel in sequence to obtain the feature map after secondary correction.
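A minimal NumPy sketch of the coordinate attention step follows. It assumes average pooling for the global pooling, ReLU as the activation inside the convolution module, and no normalization layers; none of these details is stated in the source. The weight matrices `w_shared`, `w_h` and `w_w` and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution on x of shape (C_in, L) with weight of shape (C_out, C_in)."""
    return weight @ x

def coordinate_attention(feat, w_shared, w_h, w_w):
    """Re-weight a (C, H, W) feature map with per-row and per-column attention."""
    _, h, _ = feat.shape
    pooled_h = feat.mean(axis=2)   # (C, H): pooled along the X (width) axis
    pooled_w = feat.mean(axis=1)   # (C, W): pooled along the Y (height) axis
    # Splice the two descriptors along the spatial dimension: (C, H + W).
    joint = np.concatenate([pooled_h, pooled_w], axis=1)
    joint = np.maximum(conv1x1(joint, w_shared), 0.0)   # (C_mid, H + W)
    # Split back into the two directions.
    part_h, part_w = np.split(joint, [h], axis=1)
    attn_h = sigmoid(conv1x1(part_h, w_h))[:, :, None]  # (C, H, 1)
    attn_w = sigmoid(conv1x1(part_w, w_w))[:, None, :]  # (C, 1, W)
    # Multiply in sequence to obtain the re-corrected feature map.
    return feat * attn_h * attn_w
```

Each attention vector spans a full row or column of the map, which is how this kind of attention can relate positions that are far apart along a long text line.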

2.7: the pixel threshold is set to 0.2; values greater than 0.2 in the final feature map are set to 1 and values less than or equal to 0.2 are set to 0, giving a binary map.

2.8: in the binary map, 1 represents a text region and 0 represents a non-text region. Text contours are obtained using a function in OpenCV, and the text box with the maximum confidence is selected as the final text contour, from which the coordinates of the text bounding box are determined.
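Steps 2.7 and 2.8 can be illustrated with a dependency-free sketch. The source obtains contours with an unnamed OpenCV function; here a simple 4-connected flood fill is swapped in so the example runs with the standard library only. The source also does not define how confidence is computed, so the mean probability inside each region is used as an assumed score. Function names are hypothetical.

```python
from collections import deque

THRESHOLD = 0.2  # pixel threshold stated in step 2.7

def binarize(prob_map, threshold=THRESHOLD):
    """Values greater than the threshold become 1; all others become 0."""
    return [[1 if value > threshold else 0 for value in row] for row in prob_map]

def text_boxes(prob_map, threshold=THRESHOLD):
    """Return [((x0, y0, x1, y1), score), ...] for each 4-connected text region."""
    binary = binarize(prob_map, threshold)
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    results = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Flood fill from this unvisited text pixel (stand-in for contours).
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            # Assumed confidence: mean probability over the region's pixels.
            score = sum(prob_map[py][px] for py, px in pixels) / len(pixels)
            results.append(((min(xs), min(ys), max(xs), max(ys)), score))
    return results
```

The box coordinates are inclusive pixel indices; a value of exactly 0.2 is treated as non-text, matching the "less than or equal to" rule in step 2.7.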

In an alternative embodiment, the specific process of step 3 is as follows:

3.1: the input image is cropped according to the coordinate points obtained in step 2.

3.2: the cropped images are arranged from top to bottom and from left to right.

3.3: the pictures are scaled to a size of 32x100 pixels.
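Step 3 can be sketched in plain Python as follows. The sketch assumes that boxes are given as inclusive (x0, y0, x1, y1) pixel coordinates, that "32x100" means 32 pixels high by 100 pixels wide, and that nearest-neighbour resampling is acceptable; the source states none of these details. Sorting uses the top edge first and the left edge second, which is one simple reading of "top to bottom and left to right". All names are hypothetical.

```python
def sort_reading_order(boxes):
    """Sort (x0, y0, x1, y1) boxes top to bottom, then left to right."""
    return sorted(boxes, key=lambda box: (box[1], box[0]))

def crop(image, box):
    """Cut a box out of an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = box  # inclusive pixel coordinates (assumed convention)
    return [row[x0:x1 + 1] for row in image[y0:y1 + 1]]

def resize_nearest(image, out_h=32, out_w=100):
    """Nearest-neighbour resize to out_h rows by out_w columns."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def cut_text_lines(image, boxes):
    """Crop every box in reading order and scale each crop to 32x100."""
    return [resize_nearest(crop(image, box)) for box in sort_reading_order(boxes)]
```

Boxes on the same visual line whose top edges differ by a few pixels would need a row tolerance in the sort key; that refinement is left out here.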

In an alternative embodiment, the specific process of step 4 is as follows:

4.1: the pictures from step 3 are sent to the text recognition network in sequence.

4.2: text features are extracted through a CNN, and the feature map is converted into a feature sequence.

4.3: the feature sequence is sent to an RNN (recurrent neural network) for text prediction and recognition.
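The conversion from feature map to feature sequence in step 4.2 can be illustrated with a short NumPy sketch. It assumes a (C, H, W) layout and treats each column of the feature map as one time step, a common convention in CNN plus RNN recognizers; the source does not describe the exact layout, and the function name is hypothetical.

```python
import numpy as np

def map_to_sequence(feature_map):
    """Turn a (C, H, W) feature map into W frames, each a vector of length C * H."""
    c, h, w = feature_map.shape
    # Move the width axis first so that each column becomes one frame.
    return feature_map.transpose(2, 0, 1).reshape(w, c * h)
```

The resulting frames are ordered left to right, so a recurrent network reading them in order follows the text line from its start to its end.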

The embodiment provides an OCR character recognition method based on an attention mechanism. By using a multi-scale feature fusion method with an attention mechanism, more text features are retained, which reduces missed text. In addition, when the final feature map is obtained, coordinate attention is used to capture long-range feature correlations, which facilitates the detection of long text. Meanwhile, simple post-processing is adopted, so that the accuracy and inference speed of text detection are improved and the text recognition result is more accurate.

Compared with the prior art, the character recognition method provided by the embodiment has the following advantages:

1. Compared with traditional optical character recognition methods, the method adopts a new deep learning model and offers higher efficiency, lower time consumption, less training, and higher text recognition precision.

2. Fusion attention is embedded into the multi-scale feature pyramid. During feature fusion, the attention mechanism corrects the inconsistency among scales and retains more scale information, so text at different scales is detected better.

3. Finally, coordinate attention is used to obtain the final feature map. This attention captures correlations among features over longer distances and, especially for long text, reduces boundary detection errors, so text of different lengths is detected better.

4. A simple binarization post-processing operation is used, which shortens the inference time of the model.

The foregoing is a detailed description of an embodiment of an OCR character recognition method based on attention mechanism according to the present invention, and the following is a detailed description of an embodiment of an OCR character recognition system based on attention mechanism according to the present invention.

The embodiment provides an attention mechanism-based OCR character recognition system, which includes a preprocessing unit, a first processing unit, a second processing unit and a recognition unit.

In this embodiment, the preprocessing unit is configured to perform image preprocessing on an input picture to be recognized, and construct a required lexicon.

Specifically, in the preprocessing unit, image preprocessing is performed on an input picture to be recognized, and a required word bank is constructed, which specifically includes:

normalizing the image matrix;

and constructing a corresponding word stock for the characters to be recognized.

In this embodiment, the first processing unit is configured to send the processed picture to a text detection network to obtain a text bounding box coordinate, and the text detection network performs text feature detection on the processed picture based on an attention mechanism.

Specifically, in the first processing unit, the processed picture is sent to a text detection network to obtain a text bounding box coordinate, and the method specifically includes:

the processed image is sent to a residual backbone network for primary extraction of features;

performing secondary feature re-correction on the spliced feature map through coordinate attention;

setting a pixel threshold to 0.2, setting values greater than 0.2 in the final feature map to 1, and setting values less than or equal to 0.2 to 0, to obtain a binary map;

In this embodiment, the second processing unit is configured to crop the input image according to the coordinates of the text bounding box, so as to obtain a series of pictures including only one line of text.

Specifically, in the second processing unit, the cropping is performed on the input image according to the coordinates of the text bounding box to obtain a series of pictures only including a line of text lines, and the method specifically includes:

clipping the input image according to the coordinates of the text bounding box;

arranging the cut images from top to bottom and from left to right;

scaling these pictures to a size of 32x100 pixels.

In this embodiment, the recognition unit is configured to sequentially send the cut pictures to a text recognition network, and obtain a final text recognition result after word library comparison.

Specifically, in the recognition unit, the cut pictures are sequentially sent to a text recognition network, and a final text recognition result is obtained after word bank comparison, which specifically comprises:

sending the cut pictures into a text recognition network in sequence;

extracting text features through a CNN, and converting the feature map into a feature sequence;

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An OCR character recognition method based on an attention mechanism is characterized by comprising the following steps:

sending the processed picture into a text detection network to obtain a text bounding box coordinate, wherein the text detection network performs text feature detection on the processed picture based on an attention mechanism;

cutting the input image according to the coordinates of the text bounding box to obtain a series of pictures only containing one line of text lines;

2. An OCR character recognition method based on an attention mechanism as claimed in claim 1, wherein the image preprocessing is performed on the input picture to be recognized, and a required lexicon is constructed, specifically comprising:

normalizing the image matrix;

and constructing a corresponding word bank for the characters to be identified.

3. An OCR character recognition method based on an attention mechanism as claimed in claim 2, wherein the step of sending the processed picture to a text detection network to obtain coordinates of a text bounding box, wherein the text detection network performs text feature detection on the processed picture based on the attention mechanism specifically comprises:

in the binary map, 1 represents a text area and 0 represents a non-text area; text contours are obtained by using a function in OpenCV, and the text box with the maximum confidence is selected as the final text contour, so that the coordinates of the text bounding box are obtained.

4. An OCR character recognition method based on an attention mechanism as claimed in claim 3, wherein the cropping of the input image according to the coordinates of the text bounding box to obtain a series of pictures containing only one line of text includes:

clipping the input image according to the coordinates of the text bounding box;

arranging the cut images from top to bottom and from left to right;

scaling these pictures to a size of 32x100 pixels.

5. An OCR character recognition method based on an attention mechanism as claimed in claim 4, wherein the clipped pictures are sequentially sent to a text recognition network, and a final text recognition result is obtained after the word stock comparison, specifically comprising:

sending the cut pictures into a text recognition network in sequence;

6. An attention-based OCR character recognition system comprising:

the first processing unit is used for sending the processed pictures into a text detection network to obtain the coordinates of a text bounding box, and the text detection network is used for carrying out text feature detection on the processed pictures based on an attention mechanism;

the second processing unit is used for cutting the input image according to the coordinates of the text bounding box to obtain a series of pictures only containing a line of text lines;

7. An OCR character recognition system based on an attention mechanism as claimed in claim 6, wherein in the preprocessing unit, the image preprocessing is performed on the input picture to be recognized, and a required lexicon is constructed, specifically comprising:

normalizing the image matrix;

8. An attention-based OCR character recognition system according to claim 7, wherein in the first processing unit, the processed picture is sent to a text detection network, and the text detection network performs text feature detection on the processed picture based on the attention mechanism to obtain coordinates of a text bounding box, specifically comprising:

the residual network has four residual modules; the last-layer feature map of each residual module is taken out to construct a feature pyramid, whose layers are labeled the 1st, 2nd, 3rd and 4th layers from top to bottom;

in the binary map, 1 represents a text area and 0 represents a non-text area; text contours are obtained by using a function in OpenCV, and the text box with the maximum confidence is selected as the final text contour, so that the coordinates of the text bounding box are obtained.

9. An attention-based OCR character recognition system according to claim 8 and wherein said second processing unit is operative to crop the input image according to the coordinates of the text bounding box to obtain a series of pictures containing only one line of text, and specifically comprises:

clipping the input image according to the coordinates of the text bounding box;

arranging the cut images from top to bottom and from left to right;

scaling these pictures to a size of 32x100 pixels.

10. An OCR character recognition system based on an attention mechanism as claimed in claim 9, wherein in the recognition unit, the clipped pictures are sequentially sent to a text recognition network, and a final text recognition result is obtained after the word stock comparison, specifically comprising:

sending the cut pictures into a text recognition network in sequence;