CN114821601A - End-to-end English handwritten text detection and recognition technology based on deep learning - Google Patents

End-to-end English handwritten text detection and recognition technology based on deep learning

Info

Publication number
CN114821601A
CN114821601A (application CN202210391966.7A)
Authority
CN
China
Prior art keywords
text
picture
matrix
english
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210391966.7A
Other languages
Chinese (zh)
Inventor
王嵬 (Wang Wei)
Current Assignee
Beijing Zhiyun Zaiqi Technology Co ltd
Original Assignee
Beijing Zhiyun Zaiqi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhiyun Zaiqi Technology Co ltd filed Critical Beijing Zhiyun Zaiqi Technology Co ltd
Priority to CN202210391966.7A
Publication of CN114821601A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention relates to the technical field of text detection and recognition, and discloses an end-to-end English handwritten text detection and recognition technology based on deep learning. It comprises a text detection and recognition method with the following steps: S1, preprocessing text detection data; S2, training a DBNet algorithm model; S3, preprocessing text recognition data; S4, training a CRNN algorithm model; S5, predicting with the DBNet model; S6, applying an affine transformation; S7, predicting with the CRNN model; S8, sorting and splicing the text regions; and S9, filtering and replacing scratched-out characters. The technology optimizes the data and the algorithms specifically for English handwriting: training DBNet directly on the three RGB channels greatly improves its accuracy on text regions and mitigates the impact that brightness, angle, contrast, and irregular formats and fonts of handwritten English have on model robustness.

Description

End-to-end English handwritten text detection and recognition technology based on deep learning
Technical Field
The invention relates to the technical field of text detection and recognition, in particular to an end-to-end English handwritten text detection and recognition technology based on deep learning.
Background
Common deep-learning text detection techniques fall into two categories: regression-based and segmentation-based. Regression-based methods divide further into box regression and pixel-value regression. Box-regression methods, chiefly CTPN, the TextBoxes series, and EAST, detect regular-shaped text well but cannot accurately detect irregular text. Moreover, words in handwritten English text are poorly separated, so detection cannot be performed accurately word by word; instead, whole text lines composed of several words are often detected as a single box. Pixel-value-regression methods, chiefly CRAFT and SA-Text, can detect curved text and perform excellently on small text, but their real-time performance is insufficient.
Segmentation-based algorithms such as PSENet are not limited by text shape and work well on text of various shapes, but their post-processing is often complex and therefore time-consuming. Some algorithms specifically improve the post-processing: DBNet, for example, uses differentiable binarization, which approximates the binarization step with a differentiable function so that it can be incorporated into training. This increases model robustness, yields more accurate boundaries, greatly simplifies post-processing, and reduces time consumption. Commonly used deep-learning text recognition techniques are either CTC-based or attention-based: CNN+RNN+Attention suits short text, while CNN+RNN+CTC recognizes long text better in general scenes and also performs better overall. Recognition of real photographs of handwritten English remains poor, mainly because such photographs suffer from brightness, angle, and contrast problems, which place high demands on data quality and model robustness. Handwritten English pictures also have varied, highly interfering backgrounds; the text commonly slants, sticks together, or has been corrected to varying degrees; and writing habits differ from person to person. All of this greatly affects model accuracy.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides an end-to-end English handwritten text detection and recognition technology based on deep learning, so as to solve the problems in the background technology.
(II) technical scheme
In order to achieve the above purpose, the invention provides the following technical scheme: an end-to-end English handwritten text detection and recognition technology based on deep learning, comprising a text detection and recognition method with the following steps:
s1, preprocessing text detection data;
s2, training a DBNet algorithm model;
s3, preprocessing text recognition data;
s4, training a CRNN algorithm model;
s5, predicting a DBNet model;
s6, affine transformation;
s7, predicting a CRNN model;
s8, sorting and splicing the text areas;
and S9, filtering and replacing the scratched-out character.
Preferably, the step S1 further includes:
(1) based on statistics of letter width and height in English handwritten pictures, repeated verification determined the locally optimal input size to be 1280 × 1280;
(2) most input pictures differ in length and width and must be padded, and the ratio used for equal-proportion scaling is recorded. A background fill of 0 (all black) is too close to the writing color of the text regions, while 255 (all white) differs noticeably from the paper color of the handwritten homework pictures; comparative verification determined a fill value of 245 to be locally optimal;
(3) the input picture is normalized: all three channels are divided by 255, then standardized by subtracting [0.485, 0.456, 0.406] and dividing by [0.229, 0.224, 0.225] channel-wise;
(4) text regions are annotated with the x and y coordinates of four points, taking the upper-left corner of the picture as the origin;
(5) for the text regions of the training data, each text box is shrunk by a ratio of 0.4 to obtain the shrink_map, marked 1 inside the shrunk box and 0 elsewhere; each text box is also expanded outwards and shrunk inwards by the same 0.4 ratio to obtain the threshold_map, which is in fact a gradient region with maximum value 0.7 and minimum value 0.3: the closer to the original text box border, the larger the value; the farther away, the smaller.
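The preprocessing above (equal-proportion scaling toward 1280 × 1280, padding with 245, then mean/std normalization) can be sketched as follows. This is a minimal sketch: the nearest-neighbour resize stands in for whatever interpolation the real pipeline uses, and `preprocess` is a hypothetical name.

```python
import numpy as np

def preprocess(img, target=1280, pad_value=245):
    """Scale an H x W x 3 uint8 image in equal proportion so its longer side
    is `target`, pad the rest with 245 (near-white), then normalize."""
    h, w = img.shape[:2]
    scale = target / max(h, w)            # equal-proportion ratio, recorded for later
    new_h, new_w = int(h * scale), int(w * scale)
    # nearest-neighbour resize via index mapping (stand-in for a real resize)
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)
    canvas[:new_h, :new_w] = resized
    # divide by 255, then subtract the per-channel mean and divide by the std
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    out = (canvas.astype(np.float32) / 255.0 - mean) / std
    return out.astype(np.float32), scale
```

The returned `scale` is what step S5 later uses to map predicted boxes back to original-picture coordinates.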
Preferably, the step S2 further includes:
(1) the preprocessed picture is input and features are extracted at multiple scales with a ResNet-50 backbone; the picture is downscaled to 1/4, 1/8, 1/16, and 1/32 of its size while the channel count grows correspondingly to 256, 512, 1024, and 2048;
(2) an FPN combines the features of the different scales, and a probability matrix P and a threshold matrix T are produced from the combined feature information;
(3) the probability matrix and threshold matrix are converted into an approximate binarization matrix by differentiable binarization:

B(i,j) = 1 / (1 + e^(-k·(P(i,j) - T(i,j))))

where B is the approximate binarization matrix; being differentiable, it can participate in backpropagation;
(4) the model outputs three matrices, the probability matrix P, the threshold matrix T, and the approximate binarization matrix B, with three corresponding losses:
Loss1: computed from P and shrink_map; the function is cross entropy;
Loss2: computed from T, mask, and threshold_map, where mask is a matrix that is 1 in text regions and 0 elsewhere; the function is sum(abs(T - threshold_map) * mask) / sum(mask);
Loss3: computed from B and shrink_map as
intersection = (B * shrink_map * mask).sum;
union = (B * mask).sum + (shrink_map * mask).sum;
loss = 1 - 2.0 * intersection / union.
The model is trained iteratively by backpropagation with total Loss = Loss1 + 10 × Loss2 + Loss3; training stops after the specified number of rounds, or once the loss fluctuates repeatedly without further decrease.
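A minimal NumPy sketch of the differentiable binarization and the three losses above (cross entropy, masked L1, and a dice loss), combined with the 1 : 10 : 1 weighting. The amplification factor k = 50 follows the DBNet paper; the function names are illustrative.

```python
import numpy as np

def diff_binarize(P, T, k=50):
    # B = 1 / (1 + exp(-k (P - T))): a differentiable stand-in for hard thresholding
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

def total_loss(P, T, B, shrink_map, threshold_map, mask, eps=1e-6):
    # Loss1: binary cross entropy between P and shrink_map
    p = P.clip(eps, 1 - eps)
    loss1 = -(shrink_map * np.log(p) + (1 - shrink_map) * np.log(1 - p)).mean()
    # Loss2: masked L1 between T and threshold_map
    loss2 = (np.abs(T - threshold_map) * mask).sum() / (mask.sum() + eps)
    # Loss3: dice loss between B and shrink_map inside the mask
    inter = (B * shrink_map * mask).sum()
    union = (B * mask).sum() + (shrink_map * mask).sum()
    loss3 = 1 - 2.0 * inter / (union + eps)
    # total loss with the 1 : 10 : 1 weighting from the text
    return loss1 + 10 * loss2 + loss3
```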
Preferably, the step S3 further includes:
(1) the text recognition model does not need the whole picture; it only needs sub-images cropped from the text regions, with the corresponding texts as labels, as input and output data;
(2) every sub-image and its corresponding text from the same picture are processed as follows;
(3) each sub-image is scaled in equal proportion to a height of 32; the length is not constrained but must not exceed 320. No normalization or standardization is applied; the original pixel values are kept as input;
(4) for the text labels, the model dictionary keeps only digits, English punctuation, and upper- and lower-case letters, targeting English handwriting. During preprocessing, Chinese punctuation is converted to the corresponding English punctuation. In addition, '_' is introduced specifically as the label for scratched-out words: handwriting often contains misspelled words that are struck through and rewritten, and labeling them with '_' further improves the recognition rate. A scratched-out word is converted into a run of underscores of the same length; for example, a scratched-out 'apple' is labeled '_____', an underscore run of length 5;
(5) to ensure training performance, batched input is used, which requires sub-images of equal length. All sub-images from the same picture are combined at random into large 3 × 32 × 320 sub-images, and every 32 large sub-images form one batch of dimension 32 × 3 × 32 × 320.
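The batching scheme in (5), packing variable-width sub-images of height 32 into fixed 320-wide slots, might look like the following sketch. Function names are assumed, and real training would also randomize the order of the sub-images.

```python
import numpy as np

def _finish(pieces, used, H, W):
    # concatenate the collected pieces along width and zero-pad to the full slot
    row = np.concatenate(pieces, axis=1)
    pad = np.zeros((H, W - used, 3), dtype=row.dtype)
    return np.concatenate([row, pad], axis=1)

def pack_subgraphs(subs, H=32, W=320):
    """Pack H x w x 3 sub-images (w <= W, as guaranteed by preprocessing) into
    H x W x 3 slots: fill each 320-px slot along the width axis, pad the rest."""
    slots, cur, used = [], [], 0
    for s in subs:
        w = s.shape[1]
        if used + w > W:                      # slot full: close it and start a new one
            slots.append(_finish(cur, used, H, W))
            cur, used = [], 0
        cur.append(s)
        used += w
    if cur:
        slots.append(_finish(cur, used, H, W))
    return np.stack(slots)                    # N x H x W x 3; every 32 slots = one batch
```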
Preferably, the step S4 further includes:
(1) pictures of batch size B = 32, channels C = 3, height H = 32, and width W = 320 (32 × 3 × 32 × 320) are input; the CNN extracts features, reducing the height to 1/32 and the width to 1/4 while the channels grow from 3 to 512, giving an output of 32 × 512 × 1 × 80;
(2) the matrix output by the CNN enters the LSTM with T set to 80 (W/4); the LSTM output has dimensions 32 × 80 × nclass, where nclass is the total number of characters in the dictionary, and softmax is applied to the matrix output by the LSTM;
(3) after the softmax comes the CTC (Connectionist Temporal Classification) loss, which drives backpropagation; its main purpose is to handle the mismatch between picture length and actual text length.
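The tensor shapes in the CRNN forward pass can be tracked explicitly; `nclass` below is an assumed illustrative value (the real dictionary holds digits, English punctuation, upper- and lower-case letters, and the blank).

```python
# Shape bookkeeping for the CRNN forward pass described above.
B, C, H, W = 32, 3, 32, 320             # one input batch: 32 x 3 x 32 x 320

# CNN: height -> H/32, width -> W/4, channels 3 -> 512
feat_shape = (B, 512, H // 32, W // 4)   # (32, 512, 1, 80)

# squeeze the height of 1 and treat the width as the time axis
T = W // 4                               # 80 timesteps
seq_shape = (T, B, 512)                  # LSTM input: T x B x C

nclass = 100                             # assumed dictionary size, for illustration
logits_shape = (B, T, nclass)            # per-timestep scores, softmaxed then fed to CTC
```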
Preferably, the step S5 further includes:
(1) the input picture is processed as in step one and fed into the trained DBNet model;
(2) only the probability matrix P is needed during prediction, which greatly improves performance; a fixed threshold is then applied to binarize it into a matrix B of size 1280 × 1280;
(3) text regions are obtained from matrix B, dilated by the same 0.4 ratio used during picture preprocessing to recover the real text regions, and the corresponding four coordinate points are extracted.
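Prediction-time binarization with a fixed threshold, plus mapping back through the recorded preprocessing scale, can be sketched as below. The threshold value and function name are assumptions, and real post-processing would group connected components and un-shrink each box by the 0.4 ratio rather than take one global bounding box.

```python
import numpy as np

def predict_boxes(P, scale, thresh=0.3):
    """Binarize the probability map P with a fixed threshold, then map the
    detected region back to original-image pixels via the preprocessing scale."""
    B = (P > thresh).astype(np.uint8)       # hard threshold replaces DB at test time
    ys, xs = np.nonzero(B)
    if len(xs) == 0:
        return None
    # bounding box in the 1280 x 1280 padded space, then back to original pixels
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float64)
    return box / scale
```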
Preferably, the step S6 further includes:
(1) an affine transformation must be applied using the coordinate points of the one or more text regions predicted by the text detection algorithm, for the following reasons;
(2) the quadrilateral formed by the four predicted vertex coordinates fits tightly around the letters and words actually written, but its sides are usually not parallel to the sides of the picture, because students write at a slant or upload tilted pictures;
(3) in that case, cropping the pixels of the actual text region for the next recognition step by drawing an axis-aligned rectangle around the box vertices would include extra regions of irrelevant text compared with the original sub-image, hurting recognition accuracy;
(4) instead, an affine transformation crops the sub-image directly along the predicted vertex coordinates, so the processed sub-image fits the text region much more closely.
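A pure-NumPy sketch of the affine crop: three corners of the predicted box (top-left, top-right, bottom-left) are mapped to the corners of an upright w × h patch, the 2 × 3 affine matrix is solved from those correspondences, and the crop is filled by inverse mapping with nearest-neighbour sampling. A production pipeline would typically call `cv2.warpAffine` instead; this just shows the mechanics.

```python
import numpy as np

def affine_crop(img, quad, w, h):
    """Crop a tilted text box to an upright w x h patch. `quad` lists the
    top-left, top-right, and bottom-left corners as (x, y) points."""
    src = np.asarray(quad[:3], dtype=np.float64)           # 3 source corners
    dst = np.array([[0, 0], [w - 1, 0], [0, h - 1]], float)  # their upright targets
    # solve the dst -> src affine map A so that src = A @ [x, y, 1]
    M = np.vstack([dst.T, np.ones(3)])                     # 3x3: columns are [x, y, 1]
    A = src.T @ np.linalg.inv(M)                           # 2x3 affine matrix
    # inverse mapping: for every destination pixel, sample the source pixel
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(w * h)])
    sx, sy = (A @ coords).round().astype(int)
    sx = sx.clip(0, img.shape[1] - 1)
    sy = sy.clip(0, img.shape[0] - 1)
    return img[sy, sx].reshape(h, w, -1)
```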
Preferably, the step S7 further includes:
(1) each sub-image produced in step six is scaled to H = 32, with the length W unconstrained;
(2) it is fed into the trained CRNN model, which outputs a character sequence of length W/4;
(3) the output string is then merged and de-duplicated. The invention defines the blank character as '#': repeats between blanks are first collapsed, then the blanks are removed and the pieces joined. For example, 'aa#p#ppp#ll#ee' becomes 'apple' after merging.
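The merge-and-deduplicate rule in (3) is exactly greedy CTC decoding; a minimal sketch with '#' as the blank:

```python
def ctc_decode(chars, blank="#"):
    """Collapse repeated characters, then drop blanks: 'aa#p#ppp#ll#ee' -> 'apple'."""
    out, prev = [], None
    for c in chars:
        if c != prev and c != blank:   # keep a char only when it changes and isn't blank
            out.append(c)
        prev = c
    return "".join(out)
```

Note that the blank lets genuine double letters survive: without '#' between them, the two l's of 'apple' would collapse into one.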
Preferably, the step S8 further includes:
(1) the coordinates of the one or more text regions predicted by the DBNet model are sorted in ascending order along the horizontal direction;
(2) text regions are taken in order; denote the current one A. The nearest text box to A in the horizontal direction whose vertical intersection-over-union with A exceeds 0.3 and whose height ratio with A exceeds 0.3 is denoted B; B is the neighboring text box of A;
(3) B becomes the new A, and the previous step repeats until no neighboring text region can be found, at which point the search for that line of text ends;
(4) the recorded text regions are removed from the candidate set, and steps (1) and (2) repeat until the candidate set is empty;
(5) the texts corresponding to the sorted text regions are merged.
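The line-grouping procedure in steps (1) through (5) can be sketched as follows; boxes are (x, y, w, h) tuples paired with their recognized text, and the helper names are illustrative.

```python
def v_overlap(a, b):
    # vertical intersection-over-union of two (x, y, w, h) boxes
    top = max(a[1], b[1])
    bot = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, bot - top)
    union = (a[3] + b[3]) - inter
    return inter / union if union else 0.0

def join_lines(regions, thresh=0.3):
    """Group (box, text) pairs into lines left-to-right: starting from the
    leftmost unused box, repeatedly attach the next box whose vertical IoU and
    height ratio both exceed `thresh`, then join each line's texts."""
    todo = sorted(regions, key=lambda r: r[0][0])   # ascending x
    lines = []
    while todo:
        cur = todo.pop(0)
        line = [cur]
        while True:
            nxt = next((r for r in todo
                        if v_overlap(cur[0], r[0]) > thresh
                        and min(cur[0][3], r[0][3]) / max(cur[0][3], r[0][3]) > thresh),
                       None)
            if nxt is None:                          # no neighbor: the line ends
                break
            todo.remove(nxt)
            line.append(nxt)
            cur = nxt
        lines.append(" ".join(t for _, t in line))
    return lines
```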
Preferably, the step S9 further includes:
(1) this step targets the case where a wrong word in English handwriting is scratched out and the correction is written above or below it;
(2) after sorting, the text is checked for scratched-out characters;
(3) if they exist, isolated text regions are searched for and judged by the following conditions:
whether the isolated region lies above or below the scratched-out character region, with the two text regions intersecting or separated by far less than the line spacing of the picture's overall text;
whether the horizontal intersection-over-union of the isolated region and the scratched-out character region exceeds 0.5 and their width ratio exceeds 0.5;
(4) if the conditions in (3) are met, the text of the scratched-out character region is replaced by the text of the isolated region, and the isolated region's text is deleted; if not, the scratched-out characters are simply removed.
(III) advantageous effects
Compared with the prior art, the invention provides an end-to-end English handwritten text detection and recognition technology based on deep learning with the following beneficial effects:
The data and algorithms are optimized specifically for English handwriting. DBNet is trained with the RGB three-channel information retained, to cope with the complex and variable conditions of real photographs of handwriting. Text detection for printed matter generally uses grayscale images: printed input is usually regular and clear, so the information a grayscale image retains lets the algorithm distinguish text from non-text regions well. Real photographs of handwriting, however, involve complex scenes and insufficient clarity, so the information a grayscale image retains is limited. Training directly on the three RGB channels greatly improves DBNet's accuracy on text regions and mitigates the impact that brightness, angle, contrast, and irregular formats and fonts of handwriting have on model robustness.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a technical scheme, namely an end-to-end English handwritten text detection and recognition technology based on deep learning, comprising a text detection and recognition method with the following steps:
s1, preprocessing text detection data; in step S1, the method further includes:
(1) according to the statistics of the width and the height of letters in the English handwritten picture, verifying and judging the local optimal size of the input picture to be 1280 x 1280 for multiple times;
(2) most input pictures are different in length and width, filling is needed, the scaling ratio of equal proportion scaling is recorded, the background filling is 0, namely, all black, is too close to the writing color of a text area, the filling is 255, namely, all white, and has a certain degree of color difference with the original operation book color of the handwritten picture, and the filling 245 effect is determined to be locally optimal through comparison and verification;
(3) normalizing the input picture, dividing the three channels by 255, then normalizing, subtracting [0.485,0.456,0.406] from the three channels, and then dividing by [0.229,0.224,0.225 ];
(4) the text area mark takes the upper left corner of the picture as an origin, and marks the text area by using x and y coordinates of four points;
(5) reducing the text box pair according to the proportion of 0.4 to obtain a shrink _ map for the text area of the training data, marking 1 in the reduced text box, and marking 0 in other areas; for the text area of the training data, according to a ratio of 0.4, expanding the text box outwards and contracting the text box inwards to obtain a threshold _ map, actually, the text box is a gradual change area, the maximum value is set to be 0.7, the minimum value is set to be 0.3, and the closer to the original text box, the larger the value is, the farther away the value is, the smaller the value is.
S2, training a DBNet algorithm model; in step S2, the method further includes:
(1) inputting a preprocessed picture, performing feature extraction on the picture at different scales by utilizing a resnet50 algorithm, reducing the scales of the picture to be 1/4,1/8,1/16 and 1/32, and correspondingly increasing the number of channels to be 256, 512,1024 and 2048;
(2) combining features of different scales by using an FPN algorithm, and converting a probability matrix P and a threshold matrix T according to combined feature information;
(3) converting probability matrix and threshold matrix into approximate binary matrix by using differentiable binary
Figure BDA0003597296600000081
B is an approximate binary matrix which can be differentiated and can be used for backward propagation;
(4) the model outputs three matrixes, namely a probability matrix P, a threshold matrix T and an approximate binarization matrix B, three loss correspondingly exist, and the three loss comprise:
loss 1: calculating through P and shrink _ map, wherein the function is cross entropy;
loss 2: calculating by T, mask and threshold _ map, wherein the mask is a matrix with a text area of 1 and other areas of 0, and the function is abs (T-threshold _ map) mask/mask;
loss 3: calculated by B and shrink _ map, the calculation process is
intersection=B*shrink_map*mask;
union=(B*mask).sum+(shrink_map*mask).sum;
loss=1-2.0*intersection/union。
And (3) performing iteration through a back propagation training model, wherein the total Loss is 1+10 Loss2+ Loss3, and stopping training when the iteration is not reduced until the specified number of rounds or the repeated fluctuation of the Loss is no longer reduced.
S3, preprocessing text recognition data; in step S3, the method further includes:
(1) the text recognition model only needs to cut out partial text areas as subgraphs and marks corresponding texts as input and output data without needing a whole picture;
(2) for each sub-image and corresponding text in the same picture, the following processing is carried out;
(3) scaling the subgraph to 32 in equal proportion, not limiting the length, not exceeding 320, simultaneously not performing normalization and standardization processing, and keeping the original image pixel points as input;
(4) for text marks, aiming at English handwriting, a model dictionary only keeps numbers, English punctuations, capital and small letters, in the data preprocessing process, Chinese punctuations are converted into corresponding English punctuations, meanwhile, pertinently increased _ 'is used as marks for scratching off words, because the situation that wrongly written words need to be scratched off and rewritten exists in a handwriting font, the _' mark scratch-off words can be added to further improve the recognition rate, the scratched-off words are converted into underline groups with corresponding lengths, such as 'applets', and the text marks are '_____' underline groups with the length of 5;
(5) in order to ensure the training performance, batch input is selected, the subgraph length is ensured to be the same, the processing method is to combine all the subgraphs in the same graph randomly to form 3 × 32 × 320 big subgraphs, and each 32 big subgraphs form a batch with the dimension of 32 × 3 × 32 × 320.
S4, training a CRNN algorithm model; in step S4, the method further includes:
(1) inputting pictures (32 × 3 × 32 × 320) of a batch B, a height H, a width W, and a channel C equal to 3, extracting features through CNN, reducing the height to 1/32 and the width to 1/4, and changing the channel from 3 to 512(32 × 1 × 80) [ B H W C ];
(2) entering the matrix output by the CNN network, setting T to 80(W/4) in the LSTM, outputting dimensions 32 x 1 (W/4) nclass by the LSTM, wherein the nclass represents the total number of characters in the dictionary, and performing softmax on the matrix output by the LSTM network;
(3) after softmax, a loss function ctc (connectionist Temporal classification) is connected, and the backward propagation is mainly performed to solve the problem that the picture length does not correspond to the actual text length.
S5, predicting a DBNet model; in step S5, the method further includes:
(1) processing an input picture according to the first step, and inputting the input picture into a trained DBNet model;
(2) only a probability matrix P is needed in the prediction process, the performance is greatly improved, then a fixed threshold value is defined, binarization is carried out, and a matrix B is obtained, wherein the size of the matrix B is 1280 x 1280;
(3) and obtaining a text region according to the matrix B, simultaneously amplifying according to the proportion of 0.4 during picture preprocessing to obtain a real text region, and extracting corresponding four coordinate point information.
S6, affine transformation; in step S6, the method further includes:
(1) affine transformation is required to be carried out according to coordinate point information of one or more text regions in a picture predicted by a text detection algorithm, because the following conditions exist;
(2) the quadrangle formed by four vertex coordinates of the text box predicted by the algorithm is occupied around letters and words actually written, but actually, the sides of the text box are not parallel to the sides of the picture actually because students write obliquely or pictures uploaded obliquely;
(3) under the circumstance, in order to cut out specific pixel points of an actual text region for the next text recognition and conversion, a rectangle with one side parallel to a picture needs to be drawn according to the vertex of a text box, so that a subgraph for text recognition has more regions of irrelevant text than the original subgraph, and the accuracy of text recognition is influenced;
(4) and (3) applying affine transformation to cut out the subgraph directly according to the predicted vertex coordinates of the text box, so that the processed subgraph can be more closely attached to the text area.
S7, predicting a CRNN model; in step S7, the method further includes:
(1) scaling according to H-32 according to the subgraph processed in the step six, wherein the length W is not limited;
(2) inputting the characters into a trained CRNN model, and outputting characters with the length of W/4;
(3) according to the output character string, merging and de-duplication are needed, the invention defines blank characters as "#", de-duplication among blank characters, and then merging is carried out after removing the blank characters, for example, "aa # p # ppp # ll # ee" becomes "applet" after merging.
S8, sorting and splicing the text areas; in step S8, the method further includes:
(1) according to the coordinates of the text regions predicted by the DBNet model, performing ascending order sorting on the coordinates of one or more text regions in the horizontal direction;
(2) extracting text areas in sequence, marking the text areas as A, searching the text areas which are more than 0.3 of the intersection ratio of the nearest text box in the horizontal direction of the text areas and the text box on the vertical direction and have the height ratio of more than 0.3, and marking the text areas as B, wherein B is the adjacent text box of A;
(3) and taking B as a new A, and repeating the operation in the previous step until no adjacent text area can be found. Judging the line of text area searching to end;
(4) text regions which are recorded are removed from the candidate set, and the steps (1) and (2) are repeated until the candidate set is empty;
(5) and merging the texts corresponding to the text regions which are already sorted.
S9, cut-out character filtering and replacing, step S9 further comprises:
(1) aiming at the situation that the wrong word is scratched off in English handwriting and the complementary writing is carried out above or below the wrong word;
(2) after sorting, checking whether the scratched-out characters exist or not;
(3) if the isolated text region exists, searching for the isolated text region, and judging the conditions as follows:
whether the isolated region is above or below the scratched-out character region, wherein the intersection of the two text regions or the distance between the two text regions is far smaller than the distance between lines of the whole text of the picture;
the intersection ratio of the isolated area and the horizontal text box of the scratched-out character area is greater than 0.5, and the width ratio is greater than 0.5;
(4) if the requirement (3) is met, replacing the text of the character area to be scratched out with the text of the isolated area, and deleting the text of the isolated area; if (3) is not satisfied, the scratched-out character is directly removed.
The working principle of the device is as follows: (1) first, preprocess the picture file: scale and pad it to a fixed size in a targeted manner, and normalize it;
(2) input the processed picture into the DBNet algorithm; after passing through DBNet, the preprocessed picture data yields box data containing the coordinate points of each text box mapped back to the size of the original picture;
(3) input the box data into the CLS algorithm for script classification: cut out a sub-picture according to the coordinate points of each text box, and use the printed/handwritten classification algorithm to distinguish whether the text in the sub-picture is printed or handwritten;
(4) CRNN1 is the English printed-text recognition algorithm, used to recognize the text in a sub-picture;
(5) CRNN2 is the English handwriting recognition algorithm, used to recognize the text in a sub-picture;
(6) the text box sorting module determines the position of each text box on the picture from its coordinate information and outputs the boxes in order;
(7) combine the recognized sub-texts with their corresponding coordinate information, filter and replace crossed-out characters, and generate the complete text.
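The orchestration of modules (1)-(7) above can be sketched as a single dispatch function. All module callables (detector, classifier, recognizers, sorter, merger) are injected as parameters here; their names and signatures are illustrative assumptions, not the patent's actual interfaces:

```python
def recognize_page(image, detect, classify, recognizers, crop, sort_boxes, merge):
    """End-to-end flow: detect boxes, sort them into reading order, classify
    each sub-picture as print or handwriting, recognize with the matching
    CRNN, then merge (including strike-out filtering) into the final text."""
    boxes = detect(image)                    # (2) DBNet -> text box coordinates
    entries = []
    for box in sort_boxes(boxes):            # (6) reading order from coordinates
        sub = crop(image, box)               # cut the sub-picture from the original
        kind = classify(sub)                 # (3) "print" -> CRNN1, "hand" -> CRNN2
        entries.append((box, recognizers[kind](sub)))   # (4)/(5) recognition
    return merge(entries)                    # (7) strike-out handling + merging
```

This structure keeps the pipeline testable: each stage can be replaced by a stub.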
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. An end-to-end English handwritten text detection and recognition technology based on deep learning, comprising a text detection and recognition method, characterized by comprising the following steps:
s1, preprocessing text detection data;
s2, training a DBNet algorithm model;
s3, preprocessing text recognition data;
s4, training a CRNN algorithm model;
s5, predicting a DBNet model;
s6, affine transformation;
s7, predicting a CRNN model;
s8, sorting and splicing the text areas;
and S9, filtering and replacing crossed-out characters.
2. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S1 further includes:
(1) based on statistics of letter width and height in English handwritten pictures, repeated verification determined the locally optimal input picture size to be 1280 x 1280;
(2) most input pictures differ in length and width and therefore need padding after equal-proportion scaling, with the scaling ratio recorded; padding with 0 (all black) is too close to the writing color of the text regions, while padding with 255 (all white) differs noticeably in color from the exercise-book background of the original handwritten picture; comparison and verification determined that padding with 245 is locally optimal;
(3) normalize the input picture: divide all three channels by 255, then standardize by subtracting [0.485, 0.456, 0.406] from the three channels and dividing by [0.229, 0.224, 0.225];
(4) text region labels take the upper-left corner of the picture as the origin and mark each text region with the x and y coordinates of four points;
(5) for the text regions of the training data, shrink each text box by a ratio of 0.4 to obtain the shrink_map, marking 1 inside the shrunk box and 0 elsewhere; likewise expand and shrink each text box by a ratio of 0.4 to obtain the threshold_map, which is in fact a gradient region with maximum value 0.7 and minimum value 0.3: the closer to the original text box border, the larger the value, and the farther away, the smaller the value.
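The geometry of steps (2) and (3) can be sketched as follows: compute the equal-proportion scale and padding needed to reach 1280 x 1280, then apply the per-channel standardization. The function names are assumptions of this sketch; only the constants (1280, 245, the mean/std vectors) come from the claim:

```python
def letterbox_params(w, h, target=1280, pad_value=245):
    """Step (2): equal-proportion scale plus right/bottom padding with 245.
    The scale is returned so detected boxes can be mapped back later."""
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_right, pad_bottom = target - new_w, target - new_h
    return scale, (new_w, new_h), (pad_right, pad_bottom), pad_value

MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Step (3): divide by 255, subtract the per-channel mean, divide by std."""
    return [((v / 255.0) - m) / s for v, m, s in zip(rgb, MEAN, STD)]
```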
3. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S2 further includes:
(1) input the preprocessed picture and extract features at different scales using the resnet50 algorithm; the picture scales are reduced to 1/4, 1/8, 1/16 and 1/32, with the channel counts correspondingly increased to 256, 512, 1024 and 2048;
(2) combine the features of different scales using the FPN algorithm, and compute a probability matrix P and a threshold matrix T from the combined feature information;
(3) convert the probability matrix and the threshold matrix into an approximate binarization matrix using differentiable binarization:
B(i,j) = 1 / (1 + exp(-k * (P(i,j) - T(i,j))))
where k is an amplification factor; B is an approximate binarization matrix that is differentiable and can therefore be used for backpropagation;
(4) the model outputs three matrices, namely the probability matrix P, the threshold matrix T and the approximate binarization matrix B; correspondingly there are three losses, which comprise:
Loss1: calculated from P and shrink_map; the function is cross entropy;
Loss2: calculated from T, mask and threshold_map, where mask is a matrix that is 1 in text regions and 0 elsewhere; the function is sum(abs(T - threshold_map) * mask) / sum(mask);
Loss3: calculated from B and shrink_map; the calculation process is
intersection = (B * shrink_map * mask).sum;
union = (B * mask).sum + (shrink_map * mask).sum;
loss = 1 - 2.0 * intersection / union.
Iterate the model with backpropagation training, with total Loss = Loss1 + 10 * Loss2 + Loss3; stop training when the specified number of rounds is reached or the loss fluctuates repeatedly and no longer decreases.
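The differentiable binarization and the Loss2/Loss3 formulas above can be written out element-wise on flattened maps. This is a minimal numeric sketch (plain Python lists standing in for matrices; function names are this example's own):

```python
import math

def db_binarize(P, T, k=50.0):
    """Differentiable binarization: B = 1 / (1 + exp(-k * (P - T)));
    k = 50 is an assumed amplification factor, not stated in the claim."""
    return [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(P, T)]

def threshold_loss(T, threshold_map, mask):
    """Loss2: masked L1 distance between predicted and target threshold maps."""
    num = sum(abs(t - g) * m for t, g, m in zip(T, threshold_map, mask))
    den = sum(mask)
    return num / den if den else 0.0

def dice_loss(B, shrink_map, mask):
    """Loss3: 1 - 2 * intersection / union over masked pixels."""
    inter = sum(b * s * m for b, s, m in zip(B, shrink_map, mask))
    union = (sum(b * m for b, m in zip(B, mask))
             + sum(s * m for s, m in zip(shrink_map, mask)))
    return 1.0 - 2.0 * inter / union

def total_loss(loss1, loss2, loss3):
    """Weighted total from the claim: Loss1 + 10 * Loss2 + Loss3."""
    return loss1 + 10.0 * loss2 + loss3
```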
4. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S3 further includes:
(1) the text recognition model does not need the whole picture; it only needs the cropped partial text regions as sub-pictures, with the corresponding texts as labels, to serve as input and output data;
(2) each sub-picture and its corresponding text within the same picture are processed as follows:
(3) scale the sub-picture proportionally to a height of 32; the length is not limited but must not exceed 320; apply no normalization or standardization, keeping the original image pixels as input;
(4) for the text labels, targeting English handwriting, the model dictionary keeps only digits, English punctuation, and upper- and lower-case letters; during data preprocessing, Chinese punctuation is converted into the corresponding English punctuation. At the same time, '_' is specifically added as the label for crossed-out words: because handwriting contains cases where a wrongly written word is crossed out and rewritten, adding the '_' label for crossed-out words can further improve the recognition rate. A crossed-out word is converted into a group of underscores of the corresponding length; for example, a crossed-out 'apple' is labeled '_____', an underscore group of length 5;
(5) to ensure training performance, batch input is chosen, which requires sub-pictures of equal length; the processing method is to randomly combine all sub-pictures from the same picture into large sub-pictures of 3 x 32 x 320, with every 32 large sub-pictures forming a batch with dimensions 32 x 3 x 32 x 320.
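The packing of step (5) amounts to bin-packing variable-width sub-pictures (already scaled to height 32) into 320-wide composites, 32 composites per batch. A greedy sketch, with assumed function and parameter names (the claim does not specify the packing order beyond "randomly combine"):

```python
def pack_batches(sub_widths, slot_w=320, batch_size=32):
    """Greedily fill 320-wide slots with sub-pictures (given by width),
    then group every 32 slots into one batch of index-groups."""
    slots, current, used = [], [], 0
    for i, w in enumerate(sub_widths):
        w = min(w, slot_w)              # a single sub-picture is capped at 320
        if used + w > slot_w and current:
            slots.append(current)       # slot full: start a new composite
            current, used = [], 0
        current.append(i)
        used += w
    if current:
        slots.append(current)
    # each batch holds up to 32 composite sub-pictures (32 x 3 x 32 x 320)
    return [slots[i:i + batch_size] for i in range(0, len(slots), batch_size)]
```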
5. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S4 further includes:
(1) input a batch of pictures (32 x 3 x 32 x 320), with batch B = 32, channel C = 3, height H = 32 and width W = 320; extract features through the CNN, reducing the height to 1/32 and the width to 1/4 while expanding the channels from 3 to 512, giving 32 x 512 x 1 x 80 [B C H W];
(2) input the matrix output by the CNN into the LSTM with T set to 80 (W/4); the LSTM output has dimensions 32 x 80 x nclass, where nclass is the total number of characters in the dictionary; apply softmax to the matrix output by the LSTM network;
(3) after softmax, connect the CTC (Connectionist Temporal Classification) loss function and backpropagate; CTC mainly solves the problem that the picture length does not correspond to the actual text length.
6. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S5 further includes:
(1) process the input picture as in step one and input it into the trained DBNet model;
(2) only the probability matrix P is needed during prediction, which greatly improves performance; then define a fixed threshold and binarize P to obtain matrix B, whose size is fixed at 1280 x 1280;
(3) obtain the text regions from matrix B, expand them by the same 0.4 ratio used during picture preprocessing to recover the real text regions, and extract the information of the corresponding four coordinate points.
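Prediction-time steps (2) and (3) reduce to a fixed-threshold binarization of P followed by expanding each detected box back by the shrink ratio. A sketch for an axis-aligned box; the threshold value 0.3 and the area/perimeter offset formula are assumptions borrowed from common DBNet practice, not stated in the claim:

```python
def binarize(P, thr=0.3):
    """Step (2): threshold the probability map directly (no T needed)."""
    return [[1 if p > thr else 0 for p in row] for row in P]

def unclip(box, ratio=0.4):
    """Step (3): expand a detected box back by the 0.4 ratio used to shrink
    the training labels; offset = area * ratio / perimeter (a simplification
    of the polygon offset, for an axis-aligned box (x0, y0, x1, y1))."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    offset = (w * h) * ratio / (2 * (w + h))
    return (x0 - offset, y0 - offset, x1 + offset, y1 + offset)
```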
7. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S6 further includes:
(1) affine transformation must be applied according to the coordinate point information of the one or more text regions in a picture predicted by the text detection algorithm, because the following situations exist:
(2) the quadrilateral formed by the four vertex coordinates of a text box predicted by the algorithm fits tightly around the actually written letters and words, but in practice its sides are not parallel to the sides of the picture, because students write at a slant or upload tilted pictures;
(3) in this situation, if, for the next step of text recognition, the actual text region were cut out by drawing a rectangle with sides parallel to the picture around the text box vertices, the sub-picture used for text recognition would contain more irrelevant-text regions than the original text box, affecting the accuracy of text recognition;
(4) applying an affine transformation to cut out the sub-picture directly according to the predicted vertex coordinates of the text box makes the processed sub-picture fit the text region much more closely.
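An affine transform is fully determined by three point correspondences, e.g. mapping a tilted box's top-left, top-right and bottom-left corners onto an upright W x H rectangle. A closed-form sketch (in practice a library routine such as OpenCV's affine warp would be used; these helper names are this example's own):

```python
def affine_from_triangle(src, dst):
    """Return (a, b, c, d, e, f) such that x' = a*x + b*y + c, y' = d*x + e*y + f
    maps the three src points onto the three dst points (solved via a 2x2 inverse)."""
    (sx0, sy0), (sx1, sy1), (sx2, sy2) = src
    (dx0, dy0), (dx1, dy1), (dx2, dy2) = dst
    ux, uy = sx1 - sx0, sy1 - sy0          # source edge vectors
    vx, vy = sx2 - sx0, sy2 - sy0
    det = ux * vy - vx * uy                # must be nonzero (non-degenerate triangle)
    a = ((dx1 - dx0) * vy - (dx2 - dx0) * uy) / det
    b = ((dx2 - dx0) * ux - (dx1 - dx0) * vx) / det
    d = ((dy1 - dy0) * vy - (dy2 - dy0) * uy) / det
    e = ((dy2 - dy0) * ux - (dy1 - dy0) * vx) / det
    c = dx0 - a * sx0 - b * sy0            # translation from the first point pair
    f = dy0 - d * sx0 - e * sy0
    return (a, b, c, d, e, f)

def apply_affine(m, pt):
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + b * y + c, d * x + e * y + f)
```

Sampling the source picture at the inverse-mapped coordinates then yields an upright sub-picture that hugs the text region.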
8. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S7 further includes:
(1) scale the sub-picture processed in step six to H = 32, with the length W not limited;
(2) input it into the trained CRNN model, which outputs characters of length W/4;
(3) the output character string needs merging and de-duplication; the invention defines the blank character as "#", de-duplicates between blank characters, and then merges after removing the blank characters; for example, "aa#p#ppp#ll#ee" becomes "apple" after merging.
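The merge-and-deduplicate rule of step (3) is the standard greedy CTC collapse: drop repeated characters between blanks, then drop the blanks. A minimal sketch (the function name is this example's own):

```python
def ctc_merge(raw, blank="#"):
    """Collapse repeats, then strip the blank symbol:
    e.g. 'aa#p#ppp#ll#ee' -> 'apple'."""
    out, prev = [], None
    for ch in raw:
        if ch != prev and ch != blank:
            out.append(ch)      # keep a character only when it changes
        prev = ch               # blanks still reset the repeat tracking
    return "".join(out)
```

The blank between the two runs of 'p' is what allows the genuine double letter to survive the de-duplication.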
9. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S8 further includes:
(1) sort the coordinates of the one or more text regions predicted by the DBNet model in ascending order along the horizontal direction;
(2) extract the text regions in order, marking the current one as A; search for the text region whose box is nearest to A in the horizontal direction, whose vertical intersection-over-union with A's box exceeds 0.3 and whose height ratio to A exceeds 0.3, and mark it as B; B is the adjacent text box of A;
(3) take B as the new A and repeat the previous step until no adjacent text region can be found, at which point the search for this line of text regions ends;
(4) remove the recorded text regions from the candidate set and repeat steps (1) and (2) until the candidate set is empty;
(5) merge the texts corresponding to the sorted text regions.
10. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S9 further includes:
(1) this step targets the situation in English handwriting where a wrongly written word is crossed out and the correction is written above or below it;
(2) after sorting, check whether crossed-out characters exist;
(3) if an isolated text region is found, the judgment conditions are as follows:
whether the isolated region is above or below the crossed-out character region, where the two text regions either intersect or are separated by a distance far smaller than the line spacing of the whole text in the picture;
the horizontal intersection-over-union of the isolated region's box and the crossed-out character region's box is greater than 0.5, and their width ratio is greater than 0.5;
(4) if the conditions in (3) are met, replace the text of the crossed-out character region with the text of the isolated region and delete the isolated region's text; if they are not met, directly remove the crossed-out characters.
CN202210391966.7A 2022-04-14 2022-04-14 End-to-end English handwritten text detection and recognition technology based on deep learning Pending CN114821601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210391966.7A CN114821601A (en) 2022-04-14 2022-04-14 End-to-end English handwritten text detection and recognition technology based on deep learning


Publications (1)

Publication Number Publication Date
CN114821601A true CN114821601A (en) 2022-07-29

Family

ID=82536098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210391966.7A Pending CN114821601A (en) 2022-04-14 2022-04-14 End-to-end English handwritten text detection and recognition technology based on deep learning

Country Status (1)

Country Link
CN (1) CN114821601A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3598339A1 (en) * 2018-07-19 2020-01-22 Tata Consultancy Services Limited Systems and methods for end-to-end handwritten text recognition using neural networks
CN112183372A (en) * 2020-09-29 2021-01-05 深圳数联天下智能科技有限公司 Text recognition method, device and equipment and readable storage medium
CN112560845A (en) * 2020-12-23 2021-03-26 京东方科技集团股份有限公司 Character recognition method and device, intelligent meal taking cabinet, electronic equipment and storage medium
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN113971809A (en) * 2021-10-25 2022-01-25 多伦科技股份有限公司 Text recognition method and device based on deep learning and storage medium
CN114241492A (en) * 2021-12-17 2022-03-25 黑盒科技(广州)有限公司 Method for recognizing handwritten text of composition manuscript paper and reproducing text structure



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination