CN110598690A - End-to-end optical character detection and identification method and system - Google Patents


Info

Publication number
CN110598690A
Authority
CN
China
Prior art keywords
region
text image
image
interest
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910707220.0A
Other languages
Chinese (zh)
Other versions
CN110598690B (en)
Inventor
蔡华
陈运文
王文广
纪达麒
马振宇
周炳诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daerguan Information Technology (shanghai) Co Ltd
Original Assignee
Daerguan Information Technology (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daerguan Information Technology (shanghai) Co Ltd
Priority to CN201910707220.0A
Publication of CN110598690A
Application granted
Publication of CN110598690B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names

Abstract

The invention discloses an end-to-end optical character detection and recognition method and system. The recognition method comprises: extracting image features to obtain a region of interest; classifying the region of interest to obtain angle information of its bounding box; segmenting the region of interest to obtain the contour information of the text image within it; dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information, and adjusting the circles and the coordinates of their content so as to rectify the text image; and recognizing the rectified text image. The invention integrates an equivariant transformation realized by a transformation network, achieving accurate rectification of curved text regions.

Description

End-to-end optical character detection and identification method and system
Technical Field
The invention belongs to the field of character recognition, and particularly relates to an end-to-end optical character detection and recognition method and system.
Background
Traditional OCR methods split the task into two separate stages: a picture is input, text detection first locates the characters, and the detected text regions are then cropped and fed into a separate recognition network. This is relatively time-consuming, and the detection and recognition stages do not share features. A further disadvantage is that detection may not be accurate enough, which makes recognition harder; for example, the detected box may include large blank areas around the text.
Meanwhile, conventional OCR methods perform poorly on curved text. The difficulty is that affine transformation of a horizontal or quadrilateral detection box cannot accurately locate the character area: the characters occupy only a small part of such a box, most of which is background, and neither a horizontal nor an oblique detection box can rectify the text. As a result, recognition methods based on the convolutional recurrent neural network (CRNN) with long short-term memory (LSTM) perform poorly. Moreover, because convolutional neural networks (CNNs) used for image feature extraction are not designed with rotation invariance in mind, they are generally weak at extracting rotation-invariant features; a CNN can only acquire rotation invariance through data augmentation (artificially mirroring, rotating, scaling, etc. the samples).
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an end-to-end optical character detection and recognition method and system.
In order to achieve the purpose, the invention adopts the following technical scheme:
an end-to-end optical character detection and recognition method, the recognition method comprising: extracting image features to obtain a region of interest; classifying the region of interest to obtain angle information of its bounding box; segmenting the region of interest to obtain the contour information of the text image within it; dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information, and adjusting the circles and the coordinates of their content so as to rectify the text image; and recognizing the rectified text image.
Preferably, the extracting of image features comprises: inputting the image into a feature pyramid network to obtain a backbone feature map of the image; and inputting the backbone feature map into a region proposal network to obtain the region of interest.
Preferably, the classifying of the region of interest comprises: classifying the region of interest into a specific category, and performing regression on the bounding box of the region of interest.
Preferably, the segmenting of the region of interest comprises: deconvolving the region of interest to generate a mask of the text image.
Preferably, dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information comprises: finding the center line of the text image based on the angle information and the contour information; drawing a first circle centered at one end of the center line; and drawing subsequent circles at predetermined intervals along the center line until the text image is completely covered by the areas the circles define.
Preferably, finding the center line of the text image comprises: selecting a point on a boundary of the text image; determining the tangent line through that point and then the perpendicular to the tangent through the same point; moving the point along the perpendicular until it is equidistant, along the perpendicular, from the two boundaries of the text image, at which position it lies on the center line; and fitting a plurality of such points to obtain the center line of the text image.
Preferably, recognizing the rectified text image employs a convolutional recurrent neural network.
An end-to-end optical character detection and recognition system, the recognition system comprising: an image feature extraction module, which extracts image features to obtain a region of interest; a classification module, connected to the image feature extraction module, which classifies the region of interest to obtain angle information of its bounding box; a segmentation module, connected to the image feature extraction module, which segments the region of interest to obtain the contour information of the text image within it; an equivariant transformation module, connected to the image feature extraction module, the classification module and the segmentation module, which divides the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information and adjusts the circles and the coordinates of their content so as to rectify the text image; and
a character recognition module, connected to the equivariant transformation module, which recognizes the rectified text image.
Compared with the prior art, the invention has the following beneficial effects:
1. an equivariant transformation module is fused into the intelligent recognition system, realizing accurate rectification of curved text regions;
2. the network is a multi-task learning structure covering element classification, text recognition and instance segmentation;
3. in the network structure, image pyramid features are extracted by a convolution module, enabling detection and recognition of text at different scales;
4. the system does not restrict the script, and is applicable to intelligent detection and recognition of text in any language;
5. the extracted image features are shared by the classification module, the segmentation module and the equivariant transformation module, so features are not extracted repeatedly and efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic overall structure diagram of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a convolution network structure of the image feature extraction module.
Fig. 3 is a schematic diagram of a network structure of the classification module.
Fig. 4 is a schematic diagram of a network structure of the segmentation module.
FIG. 5 is a schematic diagram of the sliding, centering and equivariant transformation structure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
As shown in fig. 1, the present embodiment mainly includes an image feature extraction module, an element classification and instance segmentation module, an equivariant transformation module, and a character recognition module.
1 Image feature extraction module
As shown in fig. 2, the image feature extraction module provides shared image feature information for the entire system, thereby improving the calculation efficiency and the accuracy of the calculation result.
An image feature pyramid can be constructed with a feature pyramid network (FPN) from the output features of the convolutional network blocks. Targets of different sizes have different features: simple targets can be distinguished by shallow features, while complex objects require deep features. In fig. 2 the convolutional network is divided into 5 stages, whose outputs correspond to [C1, C2, C3, C4, C5]. The input image is passed through these deep convolutional stages, and a 1x1 convolution layer is then applied to [C1, C2, C3, C4, C5] to extract features from each convolutional block, yielding the image feature pyramid [P1, P2, P3, P4, P5]. The features of P5 are upsampled to the size of the 1x1-convolved features of C4, an element-wise addition is performed on the two, and the result becomes P4; a 3x3 convolution on P5 produces the features used as input to the region proposal network (RPN). The same operation is applied in turn to P4, P3 and P2, accumulating the processed low-level features with the high-level ones. The purpose of this accumulation is that low-level features provide more accurate position information, while the localization information of a deep network carries errors from repeated down-sampling and up-sampling; combining the two builds a deeper feature pyramid that fuses multi-level feature information and outputs it at different scales.
That is, the performance of the standard feature extraction pyramid is improved by adding a second pyramid, which takes high-level features from the first pyramid and passes them down to the lower layers. This process allows the features at every level to be combined with both higher- and lower-level features. The idea is to obtain strong semantic information, which improves detection performance, and to build the feature pyramid deeper so as to use more robust information.
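As an illustrative sketch only (not the patent's implementation), the top-down merge of the two pyramids can be expressed as follows; here `lateral` stands in for the 1x1 convolution and is assumed to already equalize channel counts, and nearest-neighbour repetition stands in for the upsampling:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def build_fpn(c_feats, lateral):
    """Top-down FPN merge: the coarsest level is lateral(C_top);
    each finer level is lateral(C_i) + upsample(previous level)."""
    tops = [lateral(c_feats[-1])]
    for c in reversed(c_feats[:-1]):
        tops.append(lateral(c) + upsample2x(tops[-1]))
    return list(reversed(tops))  # finest level first, e.g. [P2, P3, P4, P5]
```

With identity laterals and all-ones inputs, each finer level accumulates one more top-down contribution, which makes the merge easy to verify.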
The RPN is a lightweight neural network that scans an image with a sliding window and finds regions containing targets. The regions scanned by the RPN are called anchors, corresponding to rectangles distributed over the image region. The sliding window is realized by the RPN's convolutions, and the RPN scans the backbone feature map rather than the image itself, which lets it efficiently reuse the extracted features and avoid duplicate computation. The RPN generates a number of anchors over the multi-scale feature maps [P1, P2, P3, P4, P5]; after non-maximum suppression (NMS), part of the RoIs (regions of interest) are retained. Because the feature maps [P1, P2, P3, P4, P5] have different strides, they are aligned separately and then concatenated, and the concatenated features are fed into the fully connected element classification, fully convolutional pixel segmentation, and equivariant transformation tasks.
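The anchor-filtering step can be illustrated with a plain NumPy non-maximum suppression routine (a minimal sketch; a real RPN also applies score thresholds and per-level anchor generation):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thr=0.5):
    # Greedily keep the highest-scoring box, dropping overlaps above thr.
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < thr for j in keep):
            keep.append(int(i))
    return keep
```

Two heavily overlapping anchors collapse to the higher-scoring one, while a distant anchor survives.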
2 Classification module
As shown in fig. 3, the RoI classifier performs classification and regresses a bounding box. It is deeper than the RPN and can classify regions into specific classes, whereas the RPN only distinguishes foreground from background. It also fine-tunes the bounding box, further refining its position and size to enclose the target.
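The box refinement can be sketched with the standard Faster R-CNN delta parameterization (an assumption here, since the patent does not spell out the regression targets):

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Refine an RoI (x1, y1, x2, y2) with predicted (dx, dy, dw, dh):
    the centre shifts by a fraction of the box size, width/height scale by exp()."""
    w, h = box[2] - box[0], box[3] - box[1]
    cx, cy = box[0] + 0.5 * w, box[1] + 0.5 * h
    cx, cy = cx + deltas[0] * w, cy + deltas[1] * h
    w, h = w * np.exp(deltas[2]), h * np.exp(deltas[3])
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])
```

Zero deltas leave the box unchanged; a small positive dx shifts it right by that fraction of its width.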
3 Segmentation module
The instance segmentation method can accurately detect characters and generate a mask of the character areas. The RoI feature region is deconvolved to obtain a text mask consistent with the size of the input picture.
4 Equivariant transformation module
As shown in fig. 5, in the transformation structure of this embodiment, the angle information of a text region can be obtained from the regression box of the continuous text region in the classification module. The center line of the text region is then found using the angle information and the segmented text region information, and according to the center line and the contour boundary of the text region, the text region can be flattened out horizontally. This fits text of any shape well, such as horizontal text, multi-directional text, and curved text.
This embodiment randomly selects one pixel as a starting point and centers it. The search then proceeds in two opposite directions, sliding and centering, until the ends are reached. This process generates two ordered sequences of points in opposite directions, which combine into a final central axis that follows the progression of the text and describes its shape accurately. In addition, the embodiment uses local geometric attributes to describe the structure of a text instance and converts a predicted curved text instance into a canonical form, which greatly reduces the work of the subsequent recognition stage.
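A much-simplified version of the centering step can be sketched as follows; it assumes roughly horizontal text and takes per-column midpoints of the binary mask, rather than the tangent/perpendicular search described above:

```python
import numpy as np

def centerline(mask):
    """For each column of a binary text mask, return (x, centre_y, radius):
    the midpoint between the top and bottom foreground pixels and half the local height."""
    pts = []
    for x in range(mask.shape[1]):
        ys = np.nonzero(mask[:, x])[0]
        if ys.size:
            pts.append((x, (ys[0] + ys[-1]) / 2.0, (ys[-1] - ys[0] + 1) / 2.0))
    return pts
```

For a horizontal band of foreground pixels, every column yields the same centre and radius, which is the degenerate (straight-text) case of the central axis.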
This canonical-form conversion describes text by a series of ordered, overlapping disks, each located on the central axis of the text region, with varying radius and direction. The geometric attributes of a text instance (e.g., center-line points, radius, direction) are estimated by a fully convolutional network (FCN), characterizing the text region as a series of ordered, overlapping disks, each crossed by the center line and carrying a variable radius r and direction θ. The network module can change shape to accommodate rotation, scaling and bending. Mathematically, a text instance t containing several characters can be viewed as a sequence s(t), a set of such disks. Each disk D carries a set of geometric attributes: r is defined as half the local width of t, and the direction θ is that of the tangent of the center line at the center point c. Thus the text region t can easily be reconstructed by computing the union of the disks in s(t). Note that the disks do not correspond one-to-one to the characters of the text instance. However, the geometric attributes of the disk sequence can rectify an irregular text instance into a horizontal rectangle that is friendlier to the text recognizer: an inscribed circle is first found at the boundary and then moved along the center line at small intervals while drawing inscribed circles; that is, the text in the text area is transformed to the horizontal direction through the inscribed circles, completing the equivariant transformation.
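Given the disk attributes along the centre line, the rectification to a horizontal strip can be sketched like this (a toy nearest-neighbour resampler assuming vertical disk cross-sections; the actual transformation would also use the direction θ):

```python
import numpy as np

def rectify(img, disks):
    """disks: list of (x, centre_y, radius) along the centre line.
    Resample each disk's vertical extent to a fixed height, column by column,
    producing a horizontal strip for the recognizer."""
    height = 2 * max(int(r) for _, _, r in disks) + 1
    out = np.zeros((height, len(disks)), dtype=img.dtype)
    for i, (x, cy, r) in enumerate(disks):
        ys = np.linspace(cy - r, cy + r, height).round().astype(int)
        ys = np.clip(ys, 0, img.shape[0] - 1)
        out[:, i] = img[ys, int(x)]
    return out
```

Each output column is one disk; a curved centre line therefore straightens into a constant-height band.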
In theory, suppose we have a pattern x that can be changed into other forms by some transformation T; call the transformed pattern T(x|w), where all transformation parameters w can be determined (learned) from the original pattern. Of course, this transformation is not known in advance. That is, what we study is learning either the transformation itself or a recognition model that is invariant to it. The common class of transformations to which a recognition model should be invariant is spatial transformations. Invariance to such transformations is typically hard-coded by using a convolutional neural network (CNN). A common technique for achieving equivariant recognition is to extend the training set with spatially transformed versions of the original images. Ideally, a machine learning system should be able to extrapolate beyond the range of parameter values seen in the training set.
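The augmentation technique mentioned above can be sketched minimally; each training sample is extended with its four 90-degree rotations (mirroring and scaling would be added similarly):

```python
import numpy as np

def augment_rotations(sample):
    # Return the sample together with its 90/180/270-degree rotations.
    return [np.rot90(sample, k) for k in range(4)]
```

Training on all four views is what lets a plain CNN behave approximately invariantly to those rotations without any architectural change.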
Thus, a conventional CNN cannot generalize the concept of rotation without additional means (not only inferring unseen rotation angles, but transferring recognition of encountered angles from one category to another). Characters are image patterns and can, to some extent, be recognized from shape features. Sliding a window over the picture naturally yields translation invariance, which is inherent in the network: as long as the same feature is present, it can be detected regardless of where it is. Rotation invariance concerns the invariance of the spatial structure among the small features inside a feature: if different objects have distinct, unique structures, a neural network that learns them should exhibit rotation invariance. Similarly, invariance to scaling, slight deformation, and the like should be learned.
5 Character recognition module
The network features produced by the equivariant transformation, together with the image features from the convolutional network, are input into the character recognition module to recognize the text. The main structure of this module is a convolutional recurrent neural network (CRNN), a combination of a deep convolutional neural network (CNN) and a recurrent neural network (RNN), which can learn directly from sequence labels to generate a sequence of class labels. The character recognition module in this embodiment comprises a bidirectional long short-term memory network (Bi-LSTM), a fully connected layer, and a connectionist temporal classification (CTC) decoder. The high-level features extracted by the convolution module are mapped into a time-major sequence and fed into the RNN for encoding. The Bi-LSTM captures long-range dependencies in the input sequence features. The hidden states computed at each time step in both directions are then summed and passed through a fully connected layer to obtain, for each state, a distribution over the character class set; finally, CTC converts the per-frame classification scores into a character label sequence, yielding the text recognition output.
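The final CTC step can be illustrated with a greedy (best-path) decoder; this is a sketch of the standard collapse rule (merge consecutive repeats, then drop blanks), not the beam-search decoders often used in practice:

```python
import numpy as np

def ctc_greedy_decode(frame_scores, blank=0):
    """frame_scores: (T, C) per-frame class scores.
    Take the argmax per frame, collapse consecutive repeats, drop blanks."""
    best = frame_scores.argmax(axis=1)
    labels, prev = [], blank
    for b in best:
        if b != prev and b != blank:
            labels.append(int(b))
        prev = b
    return labels
```

A blank frame between two identical labels keeps them distinct, which is how CTC represents doubled characters.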
Although the present invention has been described in detail with reference to the above embodiments, it will be understood by those skilled in the art that modifications or improvements may be made based on the disclosure of the present invention without departing from its spirit and scope, and such modifications and improvements fall within the spirit and scope of the invention.

Claims (8)

1. An end-to-end optical character detection and recognition method, characterized in that the recognition method comprises:
extracting image features to obtain a region of interest;
classifying the region of interest to obtain angle information of its bounding box;
segmenting the region of interest to obtain the contour information of the text image within it;
dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information, and adjusting the circles and the coordinates of their content so as to rectify the text image; and
recognizing the rectified text image.
2. The end-to-end optical character detection and recognition method of claim 1, wherein the extracting of image features comprises:
inputting the image into a feature pyramid network to obtain a backbone feature map of the image; and
inputting the backbone feature map into a region proposal network to obtain the region of interest.
3. The end-to-end optical character detection and recognition method of claim 1, wherein the classifying of the region of interest comprises:
classifying the region of interest into a specific category, and performing regression on the bounding box of the region of interest.
4. The end-to-end optical character detection and recognition method of claim 1, wherein the segmenting of the region of interest comprises:
deconvolving the region of interest to generate a mask of the text image.
5. The end-to-end optical character detection and recognition method of claim 1, wherein dividing the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information comprises:
finding the center line of the text image based on the angle information and the contour information;
drawing a first circle centered at one end of the center line; and
drawing subsequent circles at predetermined intervals along the center line until the text image is completely covered by the areas the circles define.
6. The end-to-end optical character detection and recognition method of claim 5, wherein finding the center line of the text image comprises:
selecting a point on a boundary of the text image;
determining the tangent line through the point and then the perpendicular to the tangent through the point;
moving the point along the perpendicular until it is equidistant, along the perpendicular, from the two boundaries of the text image, the point then lying on the center line; and
fitting a plurality of such points to obtain the center line of the text image.
7. The end-to-end optical character detection and recognition method of claim 1, wherein the recognizing of the rectified text image uses a convolutional recurrent neural network to recognize the text image.
8. An end-to-end optical character detection and recognition system, the recognition system comprising: an image feature extraction module, which extracts image features to obtain a region of interest;
a classification module, connected to the image feature extraction module, which classifies the region of interest to obtain angle information of its bounding box;
a segmentation module, connected to the image feature extraction module, which segments the region of interest to obtain the contour information of the text image within it;
an equivariant transformation module, connected to the image feature extraction module, the classification module and the segmentation module, which divides the text image into a plurality of polar-coordinate-based circles according to the angle information and the contour information and adjusts the circles and the coordinates of their content so as to rectify the text image; and
a character recognition module, connected to the equivariant transformation module, which recognizes the rectified text image.
CN201910707220.0A 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system Active CN110598690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707220.0A CN110598690B (en) 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707220.0A CN110598690B (en) 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system

Publications (2)

Publication Number Publication Date
CN110598690A true CN110598690A (en) 2019-12-20
CN110598690B CN110598690B (en) 2023-04-28

Family

ID=68853317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707220.0A Active CN110598690B (en) 2019-08-01 2019-08-01 End-to-end optical character detection and recognition method and system

Country Status (1)

Country Link
CN (1) CN110598690B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275034A (en) * 2020-01-19 2020-06-12 世纪龙信息网络有限责任公司 Method, device, equipment and storage medium for extracting text region from image
CN111401357A (en) * 2020-02-12 2020-07-10 杭州电子科技大学 Pointer instrument reading method based on text detection
CN111507328A (en) * 2020-04-13 2020-08-07 北京爱咔咔信息技术有限公司 Text recognition and model training method, system, equipment and readable storage medium
CN111524106A (en) * 2020-04-13 2020-08-11 北京推想科技有限公司 Skull fracture detection and model training method, device, equipment and storage medium
CN111539438A (en) * 2020-04-28 2020-08-14 北京百度网讯科技有限公司 Text content identification method and device and electronic equipment
CN112036398A (en) * 2020-10-15 2020-12-04 北京一览群智数据科技有限责任公司 Text correction method and system
CN112115836A (en) * 2020-09-11 2020-12-22 北京金堤科技有限公司 Information verification method and device, computer readable storage medium and electronic equipment
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method and system
CN113743400A (en) * 2021-07-16 2021-12-03 华中科技大学 Electronic official document intelligent examination method and system based on deep learning
WO2022105521A1 (en) * 2020-11-20 2022-05-27 深圳壹账通智能科技有限公司 Character recognition method and apparatus for curved text image, and computer device
CN114842487A (en) * 2021-12-09 2022-08-02 上海鹑火信息技术有限公司 Method and system for identifying veronica characters

Citations (2)

Publication number Priority date Publication date Assignee Title
US20120033852A1 (en) * 2010-08-06 2012-02-09 Kennedy Michael B System and method to find the precise location of objects of interest in digital images
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
US20120033852A1 (en) * 2010-08-06 2012-02-09 Kennedy Michael B System and method to find the precise location of objects of interest in digital images
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张博等: "基于Mask R-CNN的ORB去误匹配方法", 《液晶与显示》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275034A (en) * 2020-01-19 2020-06-12 世纪龙信息网络有限责任公司 Method, device, equipment and storage medium for extracting text region from image
CN111275034B (en) * 2020-01-19 2023-09-12 天翼数字生活科技有限公司 Method, device, equipment and storage medium for extracting text region from image
CN111401357A (en) * 2020-02-12 2020-07-10 杭州电子科技大学 Pointer-type instrument reading method based on text detection
CN111401357B (en) * 2020-02-12 2023-09-15 杭州电子科技大学 Pointer-type instrument reading method based on text detection
CN111524106B (en) * 2020-04-13 2021-05-28 推想医疗科技股份有限公司 Skull fracture detection and model training method, device, equipment and storage medium
CN111507328A (en) * 2020-04-13 2020-08-07 北京爱咔咔信息技术有限公司 Text recognition and model training method, system, equipment and readable storage medium
CN111524106A (en) * 2020-04-13 2020-08-11 北京推想科技有限公司 Skull fracture detection and model training method, device, equipment and storage medium
CN111539438A (en) * 2020-04-28 2020-08-14 北京百度网讯科技有限公司 Text content identification method and device and electronic equipment
US11810384B2 (en) 2020-04-28 2023-11-07 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for recognizing text content and electronic device
CN111539438B (en) * 2020-04-28 2024-01-12 北京百度网讯科技有限公司 Text content identification method and device and electronic equipment
CN112115836A (en) * 2020-09-11 2020-12-22 北京金堤科技有限公司 Information verification method and device, computer readable storage medium and electronic equipment
CN112036398A (en) * 2020-10-15 2020-12-04 北京一览群智数据科技有限责任公司 Text correction method and system
CN112036398B (en) * 2020-10-15 2024-02-23 北京一览群智数据科技有限责任公司 Text correction method and system
WO2022105521A1 (en) * 2020-11-20 2022-05-27 深圳壹账通智能科技有限公司 Character recognition method and apparatus for curved text image, and computer device
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method and system
CN113743400A (en) * 2021-07-16 2021-12-03 华中科技大学 Electronic official document intelligent examination method and system based on deep learning
CN113743400B (en) * 2021-07-16 2024-02-20 华中科技大学 Electronic official document intelligent examination method and system based on deep learning
CN114842487A (en) * 2021-12-09 2022-08-02 上海鹑火信息技术有限公司 Method and system for identifying veronica characters
CN114842487B (en) * 2021-12-09 2023-11-03 上海鹑火信息技术有限公司 Method and system for identifying veronica characters

Also Published As

Publication number Publication date
CN110598690B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN110598690B (en) End-to-end optical character detection and recognition method and system
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
Liao et al. Rotation-sensitive regression for oriented scene text detection
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN104751187B (en) Automatic image-based meter reading recognition method
US8655070B1 (en) Tree detection from aerial imagery
Das et al. Use of salient features for the design of a multistage framework to extract roads from high-resolution multispectral satellite images
CN102509091B (en) Airplane tail number recognition method
CN108090906B (en) Cervical image processing method and device based on region proposals
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
Khan et al. Deep learning approaches to scene text detection: a comprehensive review
CN103049763B (en) Context-constraint-based target identification method
CN111091105A (en) Remote sensing image target detection method based on new frame regression loss function
CN111738055B (en) Multi-category text detection system and bill form detection method based on the same
Jiao et al. A survey of road feature extraction methods from raster maps
CN111914698A (en) Method and system for segmenting human body in image, electronic device and storage medium
CN112232371A (en) American license plate recognition method based on YOLOv3 and text recognition
CN112766184A (en) Remote sensing target detection method based on multi-level feature selection convolutional neural network
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
CN110210415A (en) Vehicle-mounted laser point cloud road marking recognition method based on graph structure
Zhou et al. Building segmentation from airborne VHR images using Mask R-CNN
Mukhiddinov et al. Robust text recognition for Uzbek language in natural scene images
CN113989604A (en) Tire DOT information identification method based on end-to-end deep learning
CN109117841B (en) Scene text detection method based on stroke width transformation and convolutional neural network
Han et al. Accurate and robust vanishing point detection method in unstructured road scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant