CN110598690B - End-to-end optical character detection and recognition method and system - Google Patents

End-to-end optical character detection and recognition method and system

Info

Publication number
CN110598690B
Authority
CN
China
Prior art keywords
text image
region
image
interest
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910707220.0A
Other languages
Chinese (zh)
Other versions
CN110598690A (en)
Inventor
蔡华
陈运文
王文广
纪达麒
马振宇
周炳诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datagrand Information Technology Shanghai Co ltd
Original Assignee
Datagrand Information Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Information Technology Shanghai Co ltd filed Critical Datagrand Information Technology Shanghai Co ltd
Priority to CN201910707220.0A
Publication of CN110598690A
Application granted
Publication of CN110598690B
Legal status: Active

Classifications

    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/40: Extraction of image or video features
    • G06V20/63: Scene text, e.g. street names

Abstract

The invention discloses an end-to-end optical character detection and recognition method and system. The recognition method comprises the following steps: extracting image features to obtain a region of interest; classifying the region of interest to obtain the angle information of its bounding box; segmenting the region of interest to obtain the contour information of the text image within it; dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information, and adjusting the coordinates of the circles and the content they delineate so as to rectify the text image; and recognizing the rectified text image. The invention integrates an iso-deformation transformation network, realizing accurate rectification of curved text regions.

Description

End-to-end optical character detection and recognition method and system
Technical Field
The invention belongs to the field of character recognition, and particularly relates to an end-to-end optical character detection and recognition method and system.
Background
Traditional OCR methods split text detection and text recognition into two separate stages: an input picture first passes through a detector that localizes the text, and the detected text regions are then cropped out and sent to a recognition network. This is time-consuming, and the detection and recognition stages cannot share features. A further drawback is that the detected boxes may be insufficiently accurate, which hampers recognition, for example when blank areas around the text edges are included in the box.
Meanwhile, existing OCR methods perform poorly on curved text. The difficulty is that a horizontal or quadrilateral detection box, even after affine transformation, cannot localize the text region accurately: the text occupies only a small part of such a box, most of which is background, and a horizontal or slanted box cannot unwarp the text, so recognition methods such as the convolutional recurrent neural network (CRNN) based on long short-term memory (LSTM) are not effective. Moreover, since convolutional neural networks (CNNs) for image feature extraction are not designed with rotation invariance in mind, the ability of a CNN to extract rotation-invariant features is generally weak; a CNN can only learn rotation invariance through data augmentation (manually mirroring, rotating, scaling the samples, and so on).
Disclosure of Invention
To address the problems in the prior art, the invention provides an end-to-end optical character detection and recognition method and system.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an end-to-end optical character detection recognition method, the recognition method comprising: extracting image features to obtain a region of interest; classifying the region of interest to obtain angle information of a frame of the region of interest; segmenting an interested region to obtain text image contour information in the region; dividing the text image into a plurality of circles based on polar coordinates based on the angle information and the text image contour information, and adjusting the coordinates of the circles and the delineating content so as to trim the text image; and identifying the trimmed text image.
Preferably, extracting the image features comprises: inputting the image into a feature pyramid network to obtain a backbone feature map of the image; and inputting the backbone feature map into a region proposal network to obtain the region of interest.
Preferably, classifying the region of interest comprises: classifying the region of interest into a specific category and performing regression on its bounding box.
Preferably, segmenting the region of interest comprises: deconvolving the region of interest to generate a mask of the text image.
Preferably, dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the text image contour information comprises: finding the centerline of the text image based on the angle information and the contour information; drawing a first circle centered at one end of the centerline; and drawing subsequent circles at predetermined intervals along the centerline until the text image is entirely covered by the areas the circles delineate.
Preferably, finding the centerline of the text image comprises: selecting a point on the boundary of the text image; determining the tangent line through the point and then the normal line through the point perpendicular to the tangent; moving the point along the normal into the interior of the text image until it is equidistant from the two sides of the text image boundary, at which position the point lies on the centerline; and fitting a plurality of such points to obtain the centerline of the text image.
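As a rough illustration of the two preferred steps above, the sketch below (Python/NumPy) approximates the centerline of a near-horizontal binary text mask by the per-column midpoint of the text pixels, then places circle centers at fixed intervals along it. This is a simplified stand-in for the tangent-and-normal construction described in the patent; the function names and the fixed radius are assumptions for exposition.

```python
import numpy as np

def approximate_centerline(mask):
    """Approximate the centerline of a binary text mask (H, W) by the midpoint
    of the text pixels in each column, i.e. the point equidistant from the two
    sides of the boundary. Valid for near-horizontal text only."""
    points = []
    for x in range(mask.shape[1]):
        ys = np.flatnonzero(mask[:, x])
        if ys.size:                                   # column contains text pixels
            points.append((x, (ys[0] + ys[-1]) / 2.0))
    return np.array(points)                           # ordered (x, y) samples

def place_circles(centerline, radius, step):
    """Place circle centers at fixed arc-length intervals along the centerline,
    mirroring 'drawing subsequent circles at predetermined intervals'."""
    centers, travelled = [centerline[0]], 0.0
    for prev, cur in zip(centerline[:-1], centerline[1:]):
        travelled += float(np.hypot(*(cur - prev)))   # arc length covered so far
        if travelled >= step:
            centers.append(cur)
            travelled = 0.0
    return [(tuple(c), radius) for c in centers]      # (center, radius) circles
```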
Preferably, the rectified text image is recognized using a convolutional recurrent neural network.
An end-to-end optical character detection and recognition system, the recognition system comprising: an image feature extraction module, which extracts image features to obtain a region of interest; a classification module, connected to the image feature extraction module, which classifies the region of interest to obtain the angle information of its bounding box; a segmentation module, connected to the image feature extraction module, which segments the region of interest to obtain the contour information of the text image within it; an iso-deformation transformation module, connected to the image feature extraction module, the classification module and the segmentation module, which divides the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information and adjusts the coordinates of the circles and the content they delineate so as to rectify the text image; and
a character recognition module, connected to the iso-deformation transformation module, which recognizes the rectified text image.
Compared with the prior art, the invention has the following beneficial effects:
1. the recognition system integrates an iso-deformation transformation module, realizing accurate rectification of curved text regions;
2. the network is a multi-task learning structure, jointly performing element classification, text recognition and instance segmentation;
3. the network structure extracts image pyramid features through a convolution module, enabling detection and recognition of text at different scales;
4. the system places no restriction on the script, and is suitable for detection and recognition of text in all languages;
5. the extracted image features are shared by the classification module, the segmentation module, the iso-deformation transformation module and so on; features are not extracted repeatedly, which improves efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention and the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. It is apparent that the following drawings show only some embodiments of the invention, and that a person skilled in the art could obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of the overall structure of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a convolutional network structure of the image feature extraction module.
Fig. 3 is a schematic diagram of a network structure of the classification module.
Fig. 4 is a schematic diagram of a network structure of the segmentation module.
Fig. 5 is a schematic diagram of the sliding, centering and iso-deformation transformation.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
As shown in fig. 1, the embodiment mainly comprises an image feature extraction module, an element classification and instance segmentation module, an iso-deformation transformation module and a character recognition module.
1 Image feature extraction module
As shown in fig. 2, the image feature extraction module provides shared image feature information for the whole system, improving both computational efficiency and the accuracy of the results.
An image feature pyramid can be constructed by a feature pyramid network (FPN) from the output features of the convolutional network blocks. Objects of different sizes have different features: simple objects can be distinguished by shallow features, while complex objects require deep features. In fig. 2 the convolutional network is divided into 5 stages whose outputs correspond to [C1, C2, C3, C4, C5]. Deep convolution is performed on the input image, and a 1×1 convolution layer is then applied to [C1, C2, C3, C4, C5] to extract the features of each convolution block, yielding the image feature pyramid [P1, P2, P3, P4, P5]. The features of P5 are upsampled so that their size matches the features of C4 after 1×1 convolution, the two are added element-wise, and the result is taken as P4; meanwhile, P5 is passed through a 3×3 convolution to obtain the features used as input to the region proposal network (RPN). The same operation is applied in turn to P4, P3 and P2, accumulating the processed low-level and high-level features. The purpose of this is that low-level features provide more accurate position information, whereas the localization information of a deep network carries errors introduced by repeated downsampling and upsampling; combining the two builds a deeper feature pyramid that fuses multiple levels of feature information and outputs distinct features. In other words, the performance of the standard feature-extraction pyramid is enhanced by adding a second pyramid that selects high-level features from the first and passes them down to the lower layers, allowing the features of each stage to be combined with both high-level and low-level features. The idea behind this is to obtain strong semantic information, which improves detection performance, while constructing the feature pyramid with deeper layers exploits more robust information.
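As a concrete illustration of the upsample-and-add pathway just described, the following is a minimal PyTorch sketch; the backbone channel counts and the choice of nearest-neighbor upsampling are assumptions for exposition, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN top-down pathway: 1x1 lateral convolutions, upsample-and-add,
    then 3x3 smoothing convolutions. Channel counts are illustrative assumptions."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats = [C2, C3, C4, C5], fine to coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        outs = [laterals[-1]]                         # start from the deepest level
        for lvl in range(len(laterals) - 2, -1, -1):  # top-down: upsample and add
            up = F.interpolate(outs[0], size=laterals[lvl].shape[-2:], mode="nearest")
            outs.insert(0, laterals[lvl] + up)        # element-wise addition
        return [sm(o) for sm, o in zip(self.smooth, outs)]  # [P2, P3, P4, P5]

# Example: four backbone maps at strides 4/8/16/32 yield four 256-channel levels.
fpn = SimpleFPN()
pyramid = fpn([torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32),
               torch.randn(1, 1024, 16, 16), torch.randn(1, 2048, 8, 8)])
```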
The RPN (region proposal network) is a lightweight neural network that scans the image with a sliding window and finds the regions where objects are present. The regions scanned by the RPN are called anchors and correspond to rectangles distributed over the image region. The sliding window is realized by the convolutional structure of the RPN, which does not scan the image directly but scans the backbone feature maps; this allows the RPN to effectively reuse the extracted features and avoid repeated computation. The RPN generates a number of anchor boxes from the feature maps [P1, P2, P3, P4, P5] of different scales, and a subset of RoIs (regions of interest) is retained after NMS (non-maximum suppression). Because the feature maps [P1, P2, P3, P4, P5] have different strides, they are aligned separately, then concatenated and fed into the fully connected element classification task, the fully convolutional pixel segmentation task and the iso-deformation transformation task.
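The NMS step that prunes the anchor boxes can be sketched as the standard greedy IoU suppression below (NumPy). The patent does not spell out its NMS variant, so this is the textbook form under that assumption.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)   # IoU with the kept box
        order = order[1:][iou <= iou_threshold]  # suppress heavy overlaps
    return keep
```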
2 Classification module
As shown in fig. 3, the RoI classifier performs classification, and regression yields a bounding box. Unlike the RPN, which can only distinguish two categories (foreground and background), this network is deeper and can classify a region into a specific category. At the same time the bounding box is fine-tuned, further adjusting its position and size so as to enclose the target.
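A head of this kind might look as follows (a PyTorch sketch; the hidden sizes, the 7×7 RoI feature shape, and exposing the border angle as a separate regression output are assumptions made for illustration, not the patent's exact configuration).

```python
import torch.nn as nn

class RoIClassifierHead(nn.Module):
    """Per-RoI head producing class scores, box refinement and a border angle.
    Feature size and class count are illustrative assumptions."""
    def __init__(self, in_features=256 * 7 * 7, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.cls_score = nn.Linear(1024, num_classes)      # specific category
        self.bbox_pred = nn.Linear(1024, num_classes * 4)  # box refinement
        self.angle_pred = nn.Linear(1024, 1)               # border angle

    def forward(self, roi_feat):  # roi_feat: (N, 256*7*7) flattened RoI features
        x = self.fc(roi_feat)
        return self.cls_score(x), self.bbox_pred(x), self.angle_pred(x)
```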
3 Segmentation module
Text can be accurately detected using the instance segmentation method, which generates a mask of the text region. Deconvolution is applied to the RoI feature region to obtain a mask region consistent with the size of the input picture, from which the text is obtained.
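A minimal sketch of such a deconvolution mask head is shown below (PyTorch); the depth, channel counts and 14×14 RoI input size are assumptions, with the final resize back to the input resolution left to a subsequent step.

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Upsamples RoI features with a transposed convolution and predicts a
    per-pixel text mask; depth and channel counts are illustrative assumptions."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.mask_logits = nn.Conv2d(256, 1, kernel_size=1)  # single text class

    def forward(self, roi_feat):  # roi_feat: (N, 256, 14, 14)
        x = self.convs(roi_feat)
        x = self.deconv(x).relu()     # 14x14 -> 28x28 via deconvolution
        return self.mask_logits(x)    # per-pixel text/background logits
```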
4 Iso-deformation transformation module
As shown in fig. 5, in the transformation structure of this embodiment, the angle information of a continuous text region can be obtained from the regression box in the classification module. The angle information and the segmented text region information are then used to find the centerline of the text region; based on this centerline and the contour boundary of the text region, the text region can be unrolled horizontally. Text of any shape, such as horizontal text, multidirectional text and curved text, can be fitted well.
This embodiment randomly selects one pixel as a starting point and centers it. The search process then proceeds in two opposite directions, sliding and centering, until the ends are reached. This process generates two ordered sequences of points in opposite directions, which can be combined into a final central axis that follows the course of the text and accurately describes its shape. In addition, the embodiment uses local geometric attributes to describe the structure of a text instance and converts a predicted curved text instance into a canonical form, which greatly lightens the work of the subsequent recognition stage.
This transformation to canonical form describes the text by a series of ordered, overlapping disks, each located on the central axis of the text region with a variable radius and direction. The geometric properties of a text instance (central axis, radius, direction) are estimated by a fully convolutional network (FCN), characterizing a text region as a series of ordered, overlapping disks, each crossed by the centerline and having a variable radius r and direction θ. The module can change shape to accommodate different variations such as rotation, scaling and bending. Mathematically, a text instance t containing several characters can be viewed as a sequence S(t), the set of such disks. Each disk D has a set of geometric properties: r is defined as half the local width of t, and the direction θ is the tangent of the centerline at the center point c. The text region t can thus easily be reconstructed by computing the union of the overlapping disks in S(t). Note that the disks do not correspond one-to-one to the characters of the text instance; rather, the geometric properties of the disk sequence allow an irregular text instance to be corrected and converted into a horizontal rectangle friendlier to the text recognizer. First an inscribed circle is found at the boundary, then the circle moves slowly at small intervals along the centerline, drawing inscribed circles as it goes; that is, the text in the text region is transformed to the horizontal direction via the inscribed circles, completing the iso-deformation transformation.
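The disk-by-disk straightening can be sketched as follows: each disk contributes one column of the output strip, sampled along the normal to the centerline at its center with bilinear interpolation (SciPy). The grayscale input, the output height and the one-column-per-disk sampling density are assumptions for exposition.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def rectify_with_disks(image, centers, radii, thetas, out_h=32):
    """Straighten a curved text region described by ordered disks.
    Each disk (center c, radius r, direction theta) yields one output column,
    sampled along the normal of the centerline at c."""
    columns = []
    for (cx, cy), r, th in zip(centers, radii, thetas):
        n = np.array([-np.sin(th), np.cos(th)])  # unit normal to the tangent
        t = np.linspace(-1.0, 1.0, out_h)        # span the local width 2r
        xs = cx + t * r * n[0]
        ys = cy + t * r * n[1]
        # Bilinear sampling of the grayscale image at the (row, col) positions
        columns.append(map_coordinates(image.astype(float), [ys, xs], order=1))
    return np.stack(columns, axis=1)             # (out_h, num_disks) horizontal strip
```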
Theoretically, suppose there is a pattern x that can be transformed into other forms by some transformation T; the transformed pattern is denoted T(x|w), and in principle all the transformation parameters w can be determined (learned) from the original pattern. Of course, the transformation itself is not known. That is, what is studied is either learning the transformation itself or learning a recognition model that is invariant to it. For common transformations, such as spatial transformations, the recognition model should be invariant. Invariance to transformations is typically hard-coded by using convolutional neural networks (CNNs), and a common technique for achieving transformation-invariant recognition is to augment the training set with spatially transformed versions of the original images. Ideally, a machine learning system should be able to extrapolate beyond the range of parameter values seen in the training set.
Thus a conventional CNN cannot generalize the concept of rotation without additional means (not merely inferring an unseen rotation angle, but transferring the recognition ability for an encountered angle from one category to another). Text is made of characters and can be recognized from shape features to a certain degree. When a window slides over the picture, the same features are detected wherever they appear, so translation invariance is a property the network has by itself. Rotation invariance, by contrast, is the invariance of the spatial structure among the small features inside a feature: different objects have different characteristic structures, and the neural network must learn them in order to possess rotation invariance. Likewise, invariances such as scaling and slight deformation have to be learned.
5 Character recognition module
The network features after the iso-deformation transformation, together with the image features obtained by the convolutional network, are fed into the text recognition module to recognize the text. The main structure of this module is a convolutional recurrent neural network (CRNN), a combination of a deep convolutional neural network (CNN) and a recurrent neural network (RNN) that can learn directly from sequence labels and produce a sequence of labels. The character recognition module in this embodiment comprises a bidirectional long short-term memory network (Bi-LSTM), a fully connected layer and a connectionist temporal classification (CTC) decoder. The high-order features extracted by the preceding convolution module are mapped into a time-major sequence and fed into the RNN for encoding; the Bi-LSTM captures the long-range dependencies of the input sequence features. The hidden states computed at each time step in the two directions are then summed and passed through a fully connected layer to obtain, for each state, a distribution over the character class set; finally, CTC converts the per-frame classification scores into a character label sequence, yielding the text recognition output.
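A compact sketch of such a recognition head is given below (PyTorch). One deliberate simplification is hedged here: the description sums the two directional hidden states, whereas nn.LSTM's bidirectional output concatenates them, so this sketch feeds the concatenated states to the fully connected layer; the layer sizes, the 37-class vocabulary and the greedy CTC decoding are likewise illustrative assumptions. During training, the (N, W, classes) log-probabilities would go to nn.CTCLoss.

```python
import torch
import torch.nn as nn

class CRNNHead(nn.Module):
    """Convolutional features -> Bi-LSTM -> per-timestep class scores for CTC.
    Sizes and vocabulary are illustrative assumptions."""
    def __init__(self, feat_channels=256, hidden=256, num_classes=37):
        super().__init__()
        self.rnn = nn.LSTM(feat_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # concatenated fwd/bwd states

    def forward(self, feat):  # feat: (N, C, 1, W) after pooling height to 1
        seq = feat.squeeze(2).permute(0, 2, 1)    # -> (N, W, C), width as time
        out, _ = self.rnn(seq)                    # Bi-LSTM over the width steps
        return self.fc(out).log_softmax(-1)       # (N, W, classes) for CTC

def greedy_ctc_decode(log_probs, blank=0):
    """Collapse repeats and remove blanks: the simplest CTC decoding."""
    best = log_probs.argmax(-1)                   # (N, W) best class per step
    decoded = []
    for seq in best.tolist():
        out, prev = [], blank
        for s in seq:
            if s != prev and s != blank:          # new non-blank symbol
                out.append(s)
            prev = s
        decoded.append(out)
    return decoded
```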
While the foregoing embodiments describe the present invention in detail, it will be apparent to one skilled in the art that modifications and improvements can be made on the basis of this disclosure without departing from the spirit and scope of the invention.

Claims (6)

1. An end-to-end optical character detection and recognition method, comprising:
extracting image features to obtain a region of interest;
classifying the region of interest to obtain the angle information of its bounding box;
segmenting the region of interest to obtain the contour information of the text image within it;
dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the text image contour information, comprising: finding the centerline of the text image based on the angle information and the contour information; drawing a first circle centered at one end of the centerline; and drawing subsequent circles at predetermined intervals along the centerline until the text image is entirely covered by the areas the circles delineate; wherein finding the centerline of the text image comprises: selecting a point on the boundary of the text image; determining the tangent line through the point and then the normal line through the point perpendicular to the tangent; moving the point along the normal into the interior of the text image until it is equidistant from the two sides of the text image boundary, at which position the point lies on the centerline; and fitting a plurality of such points to obtain the centerline of the text image;
adjusting the coordinates of the circles and the content they delineate so as to rectify the text image;
and recognizing the rectified text image.
2. The end-to-end optical character detection and recognition method according to claim 1, wherein extracting the image features comprises:
inputting the image into a feature pyramid network to obtain a backbone feature map of the image;
and inputting the backbone feature map into a region proposal network to obtain the region of interest.
3. The end-to-end optical character detection and recognition method according to claim 1, wherein classifying the region of interest comprises:
classifying the region of interest into a specific category and performing regression on its bounding box.
4. The end-to-end optical character detection and recognition method according to claim 1, wherein segmenting the region of interest comprises:
deconvolving the region of interest to generate a mask of the text image.
5. The end-to-end optical character detection and recognition method according to claim 1, wherein the rectified text image is recognized using a convolutional recurrent neural network.
6. An end-to-end optical character detection and recognition system, the recognition system comprising:
an image feature extraction module, which extracts image features to obtain a region of interest;
a classification module, connected to the image feature extraction module, which classifies the region of interest to obtain the angle information of its bounding box;
a segmentation module, connected to the image feature extraction module, which segments the region of interest to obtain the contour information of the text image within it;
an iso-deformation transformation module, connected to the image feature extraction module, the classification module and the segmentation module, which divides the text image into a plurality of polar-coordinate-based circles according to the angle information and the text image contour information and adjusts the coordinates of the circles and the content they delineate so as to rectify the text image, wherein dividing the text image into a plurality of polar-coordinate-based circles comprises: finding the centerline of the text image based on the angle information and the contour information; drawing a first circle centered at one end of the centerline; and drawing subsequent circles at predetermined intervals along the centerline until the text image is entirely covered by the areas the circles delineate; and wherein finding the centerline of the text image comprises: selecting a point on the boundary of the text image; determining the tangent line through the point and then the normal line through the point perpendicular to the tangent; moving the point along the normal into the interior of the text image until it is equidistant from the two sides of the text image boundary, at which position the point lies on the centerline; and fitting a plurality of such points to obtain the centerline of the text image; and
a character recognition module, connected to the iso-deformation transformation module, which recognizes the rectified text image.
CN201910707220.0A 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system Active CN110598690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707220.0A CN110598690B (en) 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707220.0A CN110598690B (en) 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system

Publications (2)

Publication Number Publication Date
CN110598690A CN110598690A (en) 2019-12-20
CN110598690B 2023-04-28

Family

ID=68853317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707220.0A Active CN110598690B (en) 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system

Country Status (1)

Country Link
CN (1) CN110598690B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275034B (en) * 2020-01-19 2023-09-12 天翼数字生活科技有限公司 Method, device, equipment and storage medium for extracting text region from image
CN111401357B (en) * 2020-02-12 2023-09-15 杭州电子科技大学 Pointer type instrument reading method based on text detection
CN111507328A (en) * 2020-04-13 2020-08-07 北京爱咔咔信息技术有限公司 Text recognition and model training method, system, equipment and readable storage medium
CN111524106B (en) * 2020-04-13 2021-05-28 推想医疗科技股份有限公司 Skull fracture detection and model training method, device, equipment and storage medium
CN111539438B (en) 2020-04-28 2024-01-12 北京百度网讯科技有限公司 Text content identification method and device and electronic equipment
CN112115836A (en) * 2020-09-11 2020-12-22 北京金堤科技有限公司 Information verification method and device, computer readable storage medium and electronic equipment
CN112036398B (en) * 2020-10-15 2024-02-23 北京一览群智数据科技有限责任公司 Text correction method and system
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method, system and system
CN113743400B (en) * 2021-07-16 2024-02-20 华中科技大学 Electronic document intelligent examination method and system based on deep learning
CN114842487B (en) * 2021-12-09 2023-11-03 上海鹑火信息技术有限公司 Identification method and system for salomile characters

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8170372B2 (en) * 2010-08-06 2012-05-01 Kennedy Michael B System and method to find the precise location of objects of interest in digital images

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ORB mismatch elimination method based on Mask R-CNN; Zhang Bo et al.; Chinese Journal of Liquid Crystals and Displays (液晶与显示); 2018-08-15 (No. 08); full text *

Also Published As

Publication number Publication date
CN110598690A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN110598690B (en) End-to-end optical character detection and recognition method and system
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
US11568639B2 (en) Systems and methods for analyzing remote sensing imagery
CN105931295B (en) A kind of geologic map Extracting Thematic Information method
CN104915636B (en) Remote sensing image road recognition methods based on multistage frame significant characteristics
CN104751187B (en) Meter reading automatic distinguishing method for image
CN102509091B (en) Airplane tail number recognition method
CN108090906B (en) Cervical image processing method and device based on region nomination
US8655070B1 (en) Tree detection form aerial imagery
CN103049763B (en) Context-constraint-based target identification method
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
CN111860348A (en) Deep learning-based weak supervision power drawing OCR recognition method
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN110008900B (en) Method for extracting candidate target from visible light remote sensing image from region to target
CN109726717A (en) A kind of vehicle comprehensive information detection system
Jiao et al. A survey of road feature extraction methods from raster maps
Chen et al. Automatic building extraction via adaptive iterative segmentation with LiDAR data and high spatial resolution imagery fusion
CN102147867A (en) Method for identifying traditional Chinese painting images and calligraphy images based on subject
CN113408584A (en) RGB-D multi-modal feature fusion 3D target detection method
CN107992856A (en) High score remote sensing building effects detection method under City scenarios
CN110458019B (en) Water surface target detection method for eliminating reflection interference under scarce cognitive sample condition
CN114581307A (en) Multi-image stitching method, system, device and medium for target tracking identification
CN111275732B (en) Foreground object image segmentation method based on depth convolution neural network
Sirmacek et al. Road detection from remotely sensed images using color features
Bulatov et al. Land cover classification in combined elevation and optical images supported by OSM data, mixed-level features, and non-local optimization algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant