CN114821601A - End-to-end English handwritten text detection and recognition technology based on deep learning - Google Patents

End-to-end English handwritten text detection and recognition technology based on deep learning

Info

Publication number
CN114821601A
CN114821601A (application CN202210391966.7A)
Authority
CN
China
Prior art keywords
text
picture
matrix
english
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210391966.7A
Other languages
Chinese (zh)
Inventor
王嵬 (Wang Wei)
Current Assignee
Beijing Zhiyun Zaiqi Technology Co ltd
Original Assignee
Beijing Zhiyun Zaiqi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhiyun Zaiqi Technology Co ltd filed Critical Beijing Zhiyun Zaiqi Technology Co ltd
Priority to CN202210391966.7A
Publication of CN114821601A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention relates to the technical field of text detection and recognition, and discloses an end-to-end English handwritten text detection and recognition technology based on deep learning. It comprises a text detection and recognition method with the following steps: S1, preprocessing text detection data; S2, training a DBNet algorithm model; S3, preprocessing text recognition data; S4, training a CRNN algorithm model; S5, predicting with the DBNet model; S6, applying an affine transformation; S7, predicting with the CRNN model; S8, sorting and splicing the text regions; and S9, filtering and replacing scratched-out characters. The technology optimizes the data and the algorithms specifically for English handwriting: training DBNet directly on the three RGB channels greatly improves its accuracy on text regions and mitigates the impact that brightness, angle, contrast, and irregular formats and fonts of handwritten English have on model robustness.

Description

End-to-end English handwritten text detection and recognition technology based on deep learning
Technical Field
The invention relates to the technical field of text detection and recognition, in particular to an end-to-end English handwritten text detection and recognition technology based on deep learning.
Background
Common deep-learning text detection techniques fall into two categories: regression-based and segmentation-based. Regression-based methods divide further into box regression and pixel-value regression. Box-regression methods, chiefly CTPN, the TextBoxes series, and EAST, detect regular-shaped text well but cannot accurately detect irregular text. Moreover, words in handwritten English text are poorly separated, so detection cannot be performed accurately word by word; instead, whole text lines composed of several words are often detected as a single box. Pixel-value-regression methods, chiefly CRAFT and SA-Text, can detect curved text and perform excellently on small text, but their real-time performance is insufficient.
Segmentation-based algorithms such as PSENet are not limited by text shape and work well on text of various shapes, but their post-processing is often complex and therefore time-consuming. Some algorithms specifically improve the post-processing: DBNet, for example, uses differentiable binarization, which approximates the binarization step with a differentiable function so that it can be incorporated into training. This increases model robustness, yields more accurate boundaries, greatly simplifies post-processing, and reduces time consumption. Commonly used deep-learning text recognition techniques are either CTC-based or attention-based: CNN+RNN+Attention suits short text, while CNN+RNN+CTC recognizes long text better in general scenes and also performs better overall. Recognition of real photographs of handwritten English remains poor, mainly because such photographs suffer from brightness, angle, and contrast problems, which place high demands on data quality and model robustness. Handwritten English pictures also have varied, highly interfering backgrounds; the text commonly slants, sticks together, or has been corrected to varying degrees; and writing habits differ from person to person. All of this greatly affects model accuracy.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides an end-to-end English handwritten text detection and recognition technology based on deep learning, so as to solve the problems in the background technology.
(II) technical scheme
In order to achieve the above purpose, the invention provides the following technical scheme: an end-to-end English handwritten text detection and recognition technology based on deep learning, comprising a text detection and recognition method with the following steps:
s1, preprocessing text detection data;
s2, training a DBNet algorithm model;
s3, preprocessing text recognition data;
s4, training a CRNN algorithm model;
s5, predicting a DBNet model;
s6, affine transformation;
s7, predicting a CRNN model;
s8, sorting and splicing the text areas;
and S9, filtering and replacing the scratched-out character.
Preferably, the step S1 further includes:
(1) based on statistics of letter width and height in English handwritten pictures, repeated verification determined the locally optimal input size to be 1280 × 1280;
(2) most input pictures differ in length and width and must be padded, and the ratio used for equal-proportion scaling is recorded. A background fill of 0 (all black) is too close to the writing color of the text regions, while 255 (all white) differs noticeably from the paper color of the handwritten homework pictures; comparative verification determined a fill value of 245 to be locally optimal;
(3) the input picture is normalized: all three channels are divided by 255, then standardized by subtracting [0.485, 0.456, 0.406] and dividing by [0.229, 0.224, 0.225] channel-wise;
(4) text regions are annotated with the x and y coordinates of four points, taking the upper-left corner of the picture as the origin;
(5) for the text regions of the training data, each text box is shrunk by a ratio of 0.4 to obtain the shrink_map, marked 1 inside the shrunk box and 0 elsewhere; each text box is also expanded outwards and shrunk inwards by the same 0.4 ratio to obtain the threshold_map, which is in fact a gradient region with maximum value 0.7 and minimum value 0.3: the closer to the original text box border, the larger the value; the farther away, the smaller.
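The preprocessing above (equal-proportion scaling toward 1280 × 1280, padding with 245, then mean/std normalization) can be sketched as follows. This is a minimal sketch: the nearest-neighbour resize stands in for whatever interpolation the real pipeline uses, and `preprocess` is a hypothetical name.

```python
import numpy as np

def preprocess(img, target=1280, pad_value=245):
    """Scale an H x W x 3 uint8 image in equal proportion so its longer side
    is `target`, pad the rest with 245 (near-white), then normalize."""
    h, w = img.shape[:2]
    scale = target / max(h, w)            # equal-proportion ratio, recorded for later
    new_h, new_w = int(h * scale), int(w * scale)
    # nearest-neighbour resize via index mapping (stand-in for a real resize)
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)
    canvas[:new_h, :new_w] = resized
    # divide by 255, then subtract the per-channel mean and divide by the std
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    out = (canvas.astype(np.float32) / 255.0 - mean) / std
    return out.astype(np.float32), scale
```

The returned `scale` is what step S5 later uses to map predicted boxes back to original-picture coordinates.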
Preferably, the step S2 further includes:
(1) the preprocessed picture is input and features are extracted at multiple scales with a ResNet-50 backbone; the picture is downscaled to 1/4, 1/8, 1/16, and 1/32 of its size while the channel count grows correspondingly to 256, 512, 1024, and 2048;
(2) an FPN combines the features of the different scales, and a probability matrix P and a threshold matrix T are produced from the combined feature information;
(3) the probability matrix and threshold matrix are converted into an approximate binarization matrix by differentiable binarization:

B(i,j) = 1 / (1 + e^(-k·(P(i,j) - T(i,j))))

where B is the approximate binarization matrix; being differentiable, it can participate in backpropagation;
(4) the model outputs three matrices, the probability matrix P, the threshold matrix T, and the approximate binarization matrix B, with three corresponding losses:
Loss1: computed from P and shrink_map; the function is cross entropy;
Loss2: computed from T, mask, and threshold_map, where mask is a matrix that is 1 in text regions and 0 elsewhere; the function is sum(abs(T - threshold_map) * mask) / sum(mask);
Loss3: computed from B and shrink_map as
intersection = (B * shrink_map * mask).sum;
union = (B * mask).sum + (shrink_map * mask).sum;
loss = 1 - 2.0 * intersection / union.
The model is trained iteratively by backpropagation with total Loss = Loss1 + 10 × Loss2 + Loss3; training stops after the specified number of rounds, or once the loss fluctuates repeatedly without further decrease.
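A minimal NumPy sketch of the differentiable binarization and the three losses above (cross entropy, masked L1, and a dice loss), combined with the 1 : 10 : 1 weighting. The amplification factor k = 50 follows the DBNet paper; the function names are illustrative.

```python
import numpy as np

def diff_binarize(P, T, k=50):
    # B = 1 / (1 + exp(-k (P - T))): a differentiable stand-in for hard thresholding
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

def total_loss(P, T, B, shrink_map, threshold_map, mask, eps=1e-6):
    # Loss1: binary cross entropy between P and shrink_map
    p = P.clip(eps, 1 - eps)
    loss1 = -(shrink_map * np.log(p) + (1 - shrink_map) * np.log(1 - p)).mean()
    # Loss2: masked L1 between T and threshold_map
    loss2 = (np.abs(T - threshold_map) * mask).sum() / (mask.sum() + eps)
    # Loss3: dice loss between B and shrink_map inside the mask
    inter = (B * shrink_map * mask).sum()
    union = (B * mask).sum() + (shrink_map * mask).sum()
    loss3 = 1 - 2.0 * inter / (union + eps)
    # total loss with the 1 : 10 : 1 weighting from the text
    return loss1 + 10 * loss2 + loss3
```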
Preferably, the step S3 further includes:
(1) the text recognition model does not need the whole picture; it only needs sub-images cropped from the text regions, with the corresponding texts as labels, as input and output data;
(2) every sub-image and its corresponding text from the same picture are processed as follows;
(3) each sub-image is scaled in equal proportion to a height of 32; the length is not constrained but must not exceed 320. No normalization or standardization is applied; the original pixel values are kept as input;
(4) for the text labels, the model dictionary keeps only digits, English punctuation, and upper- and lower-case letters, targeting English handwriting. During preprocessing, Chinese punctuation is converted to the corresponding English punctuation. In addition, '_' is introduced specifically as the label for scratched-out words: handwriting often contains misspelled words that are struck through and rewritten, and labeling them with '_' further improves the recognition rate. A scratched-out word is converted into a run of underscores of the same length; for example, a scratched-out 'apple' is labeled '_____', an underscore run of length 5;
(5) to ensure training performance, batched input is used, which requires sub-images of equal length. All sub-images from the same picture are combined at random into large 3 × 32 × 320 sub-images, and every 32 large sub-images form one batch of dimension 32 × 3 × 32 × 320.
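The batching scheme in (5), packing variable-width sub-images of height 32 into fixed 320-wide slots, might look like the following sketch. Function names are assumed, and real training would also randomize the order of the sub-images.

```python
import numpy as np

def _finish(pieces, used, H, W):
    # concatenate the collected pieces along width and zero-pad to the full slot
    row = np.concatenate(pieces, axis=1)
    pad = np.zeros((H, W - used, 3), dtype=row.dtype)
    return np.concatenate([row, pad], axis=1)

def pack_subgraphs(subs, H=32, W=320):
    """Pack H x w x 3 sub-images (w <= W, as guaranteed by preprocessing) into
    H x W x 3 slots: fill each 320-px slot along the width axis, pad the rest."""
    slots, cur, used = [], [], 0
    for s in subs:
        w = s.shape[1]
        if used + w > W:                      # slot full: close it and start a new one
            slots.append(_finish(cur, used, H, W))
            cur, used = [], 0
        cur.append(s)
        used += w
    if cur:
        slots.append(_finish(cur, used, H, W))
    return np.stack(slots)                    # N x H x W x 3; every 32 slots = one batch
```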
Preferably, the step S4 further includes:
(1) pictures of batch size B = 32, channels C = 3, height H = 32, and width W = 320 (32 × 3 × 32 × 320) are input; the CNN extracts features, reducing the height to 1/32 and the width to 1/4 while the channels grow from 3 to 512, giving an output of 32 × 512 × 1 × 80;
(2) the matrix output by the CNN enters the LSTM with T set to 80 (W/4); the LSTM output has dimensions 32 × 80 × nclass, where nclass is the total number of characters in the dictionary, and softmax is applied to the matrix output by the LSTM;
(3) after the softmax comes the CTC (Connectionist Temporal Classification) loss, which drives backpropagation; its main purpose is to handle the mismatch between picture length and actual text length.
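The tensor shapes in the CRNN forward pass can be tracked explicitly; `nclass` below is an assumed illustrative value (the real dictionary holds digits, English punctuation, upper- and lower-case letters, and the blank).

```python
# Shape bookkeeping for the CRNN forward pass described above.
B, C, H, W = 32, 3, 32, 320             # one input batch: 32 x 3 x 32 x 320

# CNN: height -> H/32, width -> W/4, channels 3 -> 512
feat_shape = (B, 512, H // 32, W // 4)   # (32, 512, 1, 80)

# squeeze the height of 1 and treat the width as the time axis
T = W // 4                               # 80 timesteps
seq_shape = (T, B, 512)                  # LSTM input: T x B x C

nclass = 100                             # assumed dictionary size, for illustration
logits_shape = (B, T, nclass)            # per-timestep scores, softmaxed then fed to CTC
```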
Preferably, the step S5 further includes:
(1) the input picture is processed as in step one and fed into the trained DBNet model;
(2) only the probability matrix P is needed during prediction, which greatly improves performance; a fixed threshold is then applied to binarize it into a matrix B of size 1280 × 1280;
(3) text regions are obtained from matrix B, dilated by the same 0.4 ratio used during picture preprocessing to recover the real text regions, and the corresponding four coordinate points are extracted.
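Prediction-time binarization with a fixed threshold, plus mapping back through the recorded preprocessing scale, can be sketched as below. The threshold value and function name are assumptions, and real post-processing would group connected components and un-shrink each box by the 0.4 ratio rather than take one global bounding box.

```python
import numpy as np

def predict_boxes(P, scale, thresh=0.3):
    """Binarize the probability map P with a fixed threshold, then map the
    detected region back to original-image pixels via the preprocessing scale."""
    B = (P > thresh).astype(np.uint8)       # hard threshold replaces DB at test time
    ys, xs = np.nonzero(B)
    if len(xs) == 0:
        return None
    # bounding box in the 1280 x 1280 padded space, then back to original pixels
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float64)
    return box / scale
```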
Preferably, the step S6 further includes:
(1) an affine transformation must be applied using the coordinate points of the one or more text regions predicted by the text detection algorithm, for the following reasons;
(2) the quadrilateral formed by the four predicted vertex coordinates fits tightly around the letters and words actually written, but its sides are usually not parallel to the sides of the picture, because students write at a slant or upload tilted pictures;
(3) in that case, cropping the pixels of the actual text region for the next recognition step by drawing an axis-aligned rectangle around the box vertices would include extra regions of irrelevant text compared with the original sub-image, hurting recognition accuracy;
(4) instead, an affine transformation crops the sub-image directly along the predicted vertex coordinates, so the processed sub-image fits the text region much more closely.
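A pure-NumPy sketch of the affine crop: three corners of the predicted box (top-left, top-right, bottom-left) are mapped to the corners of an upright w × h patch, the 2 × 3 affine matrix is solved from those correspondences, and the crop is filled by inverse mapping with nearest-neighbour sampling. A production pipeline would typically call `cv2.warpAffine` instead; this just shows the mechanics.

```python
import numpy as np

def affine_crop(img, quad, w, h):
    """Crop a tilted text box to an upright w x h patch. `quad` lists the
    top-left, top-right, and bottom-left corners as (x, y) points."""
    src = np.asarray(quad[:3], dtype=np.float64)           # 3 source corners
    dst = np.array([[0, 0], [w - 1, 0], [0, h - 1]], float)  # their upright targets
    # solve the dst -> src affine map A so that src = A @ [x, y, 1]
    M = np.vstack([dst.T, np.ones(3)])                     # 3x3: columns are [x, y, 1]
    A = src.T @ np.linalg.inv(M)                           # 2x3 affine matrix
    # inverse mapping: for every destination pixel, sample the source pixel
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(w * h)])
    sx, sy = (A @ coords).round().astype(int)
    sx = sx.clip(0, img.shape[1] - 1)
    sy = sy.clip(0, img.shape[0] - 1)
    return img[sy, sx].reshape(h, w, -1)
```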
Preferably, the step S7 further includes:
(1) each sub-image produced in step six is scaled to H = 32, with the length W unconstrained;
(2) it is fed into the trained CRNN model, which outputs a character sequence of length W/4;
(3) the output string is then merged and de-duplicated. The invention defines the blank character as '#': repeats between blanks are first collapsed, then the blanks are removed and the pieces joined. For example, 'aa#p#ppp#ll#ee' becomes 'apple' after merging.
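The merge-and-deduplicate rule in (3) is exactly greedy CTC decoding; a minimal sketch with '#' as the blank:

```python
def ctc_decode(chars, blank="#"):
    """Collapse repeated characters, then drop blanks: 'aa#p#ppp#ll#ee' -> 'apple'."""
    out, prev = [], None
    for c in chars:
        if c != prev and c != blank:   # keep a char only when it changes and isn't blank
            out.append(c)
        prev = c
    return "".join(out)
```

Note that the blank lets genuine double letters survive: without '#' between them, the two l's of 'apple' would collapse into one.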
Preferably, the step S8 further includes:
(1) the coordinates of the one or more text regions predicted by the DBNet model are sorted in ascending order along the horizontal direction;
(2) text regions are taken in order; denote the current one A. The nearest text box to A in the horizontal direction whose vertical intersection-over-union with A exceeds 0.3 and whose height ratio with A exceeds 0.3 is denoted B; B is the neighboring text box of A;
(3) B becomes the new A, and the previous step repeats until no neighboring text region can be found, at which point the search for that line of text ends;
(4) the recorded text regions are removed from the candidate set, and steps (1) and (2) repeat until the candidate set is empty;
(5) the texts corresponding to the sorted text regions are merged.
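The line-grouping procedure in steps (1) through (5) can be sketched as follows; boxes are (x, y, w, h) tuples paired with their recognized text, and the helper names are illustrative.

```python
def v_overlap(a, b):
    # vertical intersection-over-union of two (x, y, w, h) boxes
    top = max(a[1], b[1])
    bot = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, bot - top)
    union = (a[3] + b[3]) - inter
    return inter / union if union else 0.0

def join_lines(regions, thresh=0.3):
    """Group (box, text) pairs into lines left-to-right: starting from the
    leftmost unused box, repeatedly attach the next box whose vertical IoU and
    height ratio both exceed `thresh`, then join each line's texts."""
    todo = sorted(regions, key=lambda r: r[0][0])   # ascending x
    lines = []
    while todo:
        cur = todo.pop(0)
        line = [cur]
        while True:
            nxt = next((r for r in todo
                        if v_overlap(cur[0], r[0]) > thresh
                        and min(cur[0][3], r[0][3]) / max(cur[0][3], r[0][3]) > thresh),
                       None)
            if nxt is None:                          # no neighbor: the line ends
                break
            todo.remove(nxt)
            line.append(nxt)
            cur = nxt
        lines.append(" ".join(t for _, t in line))
    return lines
```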
Preferably, the step S9 further includes:
(1) this step targets the case where a wrong word in English handwriting is scratched out and the correction is written above or below it;
(2) after sorting, the text is checked for scratched-out characters;
(3) if they exist, isolated text regions are searched for and judged by the following conditions:
whether the isolated region lies above or below the scratched-out character region, with the two text regions intersecting or separated by far less than the line spacing of the picture's overall text;
whether the horizontal intersection-over-union of the isolated region and the scratched-out character region exceeds 0.5 and their width ratio exceeds 0.5;
(4) if the conditions in (3) are met, the text of the scratched-out character region is replaced by the text of the isolated region, and the isolated region's text is deleted; if not, the scratched-out characters are simply removed.
(III) advantageous effects
Compared with the prior art, the invention provides an end-to-end English handwritten text detection and recognition technology based on deep learning with the following beneficial effects:
The data and algorithms are optimized specifically for English handwriting. DBNet is trained with the RGB three-channel information retained, to cope with the complex and variable conditions of real photographs of handwriting. Text detection for printed matter generally uses grayscale images: printed input is usually regular and clear, so the information a grayscale image retains lets the algorithm distinguish text from non-text regions well. Real photographs of handwriting, however, involve complex scenes and insufficient clarity, so the information a grayscale image retains is limited. Training directly on the three RGB channels greatly improves DBNet's accuracy on text regions and mitigates the impact that brightness, angle, contrast, and irregular formats and fonts of handwriting have on model robustness.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a technical scheme, namely an end-to-end English handwritten text detection and recognition technology based on deep learning, comprising a text detection and recognition method with the following steps:
s1, preprocessing text detection data; in step S1, the method further includes:
(1) according to the statistics of the width and the height of letters in the English handwritten picture, verifying and judging the local optimal size of the input picture to be 1280 x 1280 for multiple times;
(2) most input pictures are different in length and width, filling is needed, the scaling ratio of equal proportion scaling is recorded, the background filling is 0, namely, all black, is too close to the writing color of a text area, the filling is 255, namely, all white, and has a certain degree of color difference with the original operation book color of the handwritten picture, and the filling 245 effect is determined to be locally optimal through comparison and verification;
(3) normalizing the input picture, dividing the three channels by 255, then normalizing, subtracting [0.485,0.456,0.406] from the three channels, and then dividing by [0.229,0.224,0.225 ];
(4) the text area mark takes the upper left corner of the picture as an origin, and marks the text area by using x and y coordinates of four points;
(5) reducing the text box pair according to the proportion of 0.4 to obtain a shrink _ map for the text area of the training data, marking 1 in the reduced text box, and marking 0 in other areas; for the text area of the training data, according to a ratio of 0.4, expanding the text box outwards and contracting the text box inwards to obtain a threshold _ map, actually, the text box is a gradual change area, the maximum value is set to be 0.7, the minimum value is set to be 0.3, and the closer to the original text box, the larger the value is, the farther away the value is, the smaller the value is.
S2, training a DBNet algorithm model; in step S2, the method further includes:
(1) inputting a preprocessed picture, performing feature extraction on the picture at different scales by utilizing a resnet50 algorithm, reducing the scales of the picture to be 1/4,1/8,1/16 and 1/32, and correspondingly increasing the number of channels to be 256, 512,1024 and 2048;
(2) combining features of different scales by using an FPN algorithm, and converting a probability matrix P and a threshold matrix T according to combined feature information;
(3) converting probability matrix and threshold matrix into approximate binary matrix by using differentiable binary
Figure BDA0003597296600000081
B is an approximate binary matrix which can be differentiated and can be used for backward propagation;
(4) the model outputs three matrixes, namely a probability matrix P, a threshold matrix T and an approximate binarization matrix B, three loss correspondingly exist, and the three loss comprise:
loss 1: calculating through P and shrink _ map, wherein the function is cross entropy;
loss 2: calculating by T, mask and threshold _ map, wherein the mask is a matrix with a text area of 1 and other areas of 0, and the function is abs (T-threshold _ map) mask/mask;
loss 3: calculated by B and shrink _ map, the calculation process is
intersection=B*shrink_map*mask;
union=(B*mask).sum+(shrink_map*mask).sum;
loss=1-2.0*intersection/union。
And (3) performing iteration through a back propagation training model, wherein the total Loss is 1+10 Loss2+ Loss3, and stopping training when the iteration is not reduced until the specified number of rounds or the repeated fluctuation of the Loss is no longer reduced.
S3, preprocessing text recognition data; in step S3, the method further includes:
(1) the text recognition model only needs to cut out partial text areas as subgraphs and marks corresponding texts as input and output data without needing a whole picture;
(2) for each sub-image and corresponding text in the same picture, the following processing is carried out;
(3) scaling the subgraph to 32 in equal proportion, not limiting the length, not exceeding 320, simultaneously not performing normalization and standardization processing, and keeping the original image pixel points as input;
(4) for text marks, aiming at English handwriting, a model dictionary only keeps numbers, English punctuations, capital and small letters, in the data preprocessing process, Chinese punctuations are converted into corresponding English punctuations, meanwhile, pertinently increased _ 'is used as marks for scratching off words, because the situation that wrongly written words need to be scratched off and rewritten exists in a handwriting font, the _' mark scratch-off words can be added to further improve the recognition rate, the scratched-off words are converted into underline groups with corresponding lengths, such as 'applets', and the text marks are '_____' underline groups with the length of 5;
(5) in order to ensure the training performance, batch input is selected, the subgraph length is ensured to be the same, the processing method is to combine all the subgraphs in the same graph randomly to form 3 × 32 × 320 big subgraphs, and each 32 big subgraphs form a batch with the dimension of 32 × 3 × 32 × 320.
S4, training a CRNN algorithm model; in step S4, the method further includes:
(1) inputting pictures (32 × 3 × 32 × 320) of a batch B, a height H, a width W, and a channel C equal to 3, extracting features through CNN, reducing the height to 1/32 and the width to 1/4, and changing the channel from 3 to 512(32 × 1 × 80) [ B H W C ];
(2) entering the matrix output by the CNN network, setting T to 80(W/4) in the LSTM, outputting dimensions 32 x 1 (W/4) nclass by the LSTM, wherein the nclass represents the total number of characters in the dictionary, and performing softmax on the matrix output by the LSTM network;
(3) after softmax, a loss function ctc (connectionist Temporal classification) is connected, and the backward propagation is mainly performed to solve the problem that the picture length does not correspond to the actual text length.
S5, predicting a DBNet model; in step S5, the method further includes:
(1) processing an input picture according to the first step, and inputting the input picture into a trained DBNet model;
(2) only a probability matrix P is needed in the prediction process, the performance is greatly improved, then a fixed threshold value is defined, binarization is carried out, and a matrix B is obtained, wherein the size of the matrix B is 1280 x 1280;
(3) and obtaining a text region according to the matrix B, simultaneously amplifying according to the proportion of 0.4 during picture preprocessing to obtain a real text region, and extracting corresponding four coordinate point information.
S6, affine transformation; in step S6, the method further includes:
(1) affine transformation is required to be carried out according to coordinate point information of one or more text regions in a picture predicted by a text detection algorithm, because the following conditions exist;
(2) the quadrangle formed by four vertex coordinates of the text box predicted by the algorithm is occupied around letters and words actually written, but actually, the sides of the text box are not parallel to the sides of the picture actually because students write obliquely or pictures uploaded obliquely;
(3) under the circumstance, in order to cut out specific pixel points of an actual text region for the next text recognition and conversion, a rectangle with one side parallel to a picture needs to be drawn according to the vertex of a text box, so that a subgraph for text recognition has more regions of irrelevant text than the original subgraph, and the accuracy of text recognition is influenced;
(4) and (3) applying affine transformation to cut out the subgraph directly according to the predicted vertex coordinates of the text box, so that the processed subgraph can be more closely attached to the text area.
S7, predicting a CRNN model; in step S7, the method further includes:
(1) scaling according to H-32 according to the subgraph processed in the step six, wherein the length W is not limited;
(2) inputting the characters into a trained CRNN model, and outputting characters with the length of W/4;
(3) according to the output character string, merging and de-duplication are needed, the invention defines blank characters as "#", de-duplication among blank characters, and then merging is carried out after removing the blank characters, for example, "aa # p # ppp # ll # ee" becomes "applet" after merging.
S8, sorting and splicing the text areas; in step S8, the method further includes:
(1) according to the coordinates of the text regions predicted by the DBNet model, performing ascending order sorting on the coordinates of one or more text regions in the horizontal direction;
(2) extracting text areas in sequence, marking the text areas as A, searching the text areas which are more than 0.3 of the intersection ratio of the nearest text box in the horizontal direction of the text areas and the text box on the vertical direction and have the height ratio of more than 0.3, and marking the text areas as B, wherein B is the adjacent text box of A;
(3) and taking B as a new A, and repeating the operation in the previous step until no adjacent text area can be found. Judging the line of text area searching to end;
(4) text regions which are recorded are removed from the candidate set, and the steps (1) and (2) are repeated until the candidate set is empty;
(5) and merging the texts corresponding to the text regions which are already sorted.
S9, cut-out character filtering and replacing, step S9 further comprises:
(1) aiming at the situation that the wrong word is scratched off in English handwriting and the complementary writing is carried out above or below the wrong word;
(2) after sorting, checking whether the scratched-out characters exist or not;
(3) if the isolated text region exists, searching for the isolated text region, and judging the conditions as follows:
whether the isolated region is above or below the scratched-out character region, wherein the intersection of the two text regions or the distance between the two text regions is far smaller than the distance between lines of the whole text of the picture;
the intersection ratio of the isolated area and the horizontal text box of the scratched-out character area is greater than 0.5, and the width ratio is greater than 0.5;
(4) if the requirement (3) is met, replacing the text of the character area to be scratched out with the text of the isolated area, and deleting the text of the isolated area; if (3) is not satisfied, the scratched-out character is directly removed.
The working principle of the device is as follows: (1) first, preprocess the picture file: scale and pad it to a fixed size in a targeted manner, and normalize it;
(2) input the processed picture into the DBNet algorithm; after passing through DBNet, the preprocessed picture data yields box data containing the coordinate points of each text box mapped back to the size of the original picture;
(3) input the box data into the CLS algorithm for script classification: cut out a sub-picture according to the coordinate points of each text box, and use the printed/handwritten classification algorithm to distinguish whether the text in the sub-picture is printed or handwritten;
(4) CRNN1 is the English printed-text recognition algorithm, used to recognize the text in a sub-picture;
(5) CRNN2 is the English handwriting recognition algorithm, used to recognize the text in a sub-picture;
(6) the text box sorting module determines the position of each text box on the picture from its coordinate information and outputs the boxes in order;
(7) combine the recognized sub-texts with their corresponding coordinate information, filter and replace crossed-out characters, and generate the complete text.
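The orchestration of modules (1)-(7) above can be sketched as a single dispatch function. All module callables (detector, classifier, recognizers, sorter, merger) are injected as parameters here; their names and signatures are illustrative assumptions, not the patent's actual interfaces:

```python
def recognize_page(image, detect, classify, recognizers, crop, sort_boxes, merge):
    """End-to-end flow: detect boxes, sort them into reading order, classify
    each sub-picture as print or handwriting, recognize with the matching
    CRNN, then merge (including strike-out filtering) into the final text."""
    boxes = detect(image)                    # (2) DBNet -> text box coordinates
    entries = []
    for box in sort_boxes(boxes):            # (6) reading order from coordinates
        sub = crop(image, box)               # cut the sub-picture from the original
        kind = classify(sub)                 # (3) "print" -> CRNN1, "hand" -> CRNN2
        entries.append((box, recognizers[kind](sub)))   # (4)/(5) recognition
    return merge(entries)                    # (7) strike-out handling + merging
```

This structure keeps the pipeline testable: each stage can be replaced by a stub.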
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. An end-to-end English handwritten text detection and recognition technology based on deep learning, comprising a text detection and recognition method, characterized by comprising the following steps:
s1, preprocessing text detection data;
s2, training a DBNet algorithm model;
s3, preprocessing text recognition data;
s4, training a CRNN algorithm model;
s5, predicting a DBNet model;
s6, affine transformation;
s7, predicting a CRNN model;
s8, sorting and splicing the text areas;
and S9, filtering and replacing crossed-out characters.
2. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S1 further includes:
(1) based on statistics of letter width and height in English handwritten pictures, repeated verification determined the locally optimal input picture size to be 1280 x 1280;
(2) most input pictures differ in length and width and therefore need padding after equal-proportion scaling, with the scaling ratio recorded; padding with 0 (all black) is too close to the writing color of the text regions, while padding with 255 (all white) differs noticeably in color from the exercise-book background of the original handwritten picture; comparison and verification determined that padding with 245 is locally optimal;
(3) normalize the input picture: divide all three channels by 255, then standardize by subtracting [0.485, 0.456, 0.406] from the three channels and dividing by [0.229, 0.224, 0.225];
(4) text region labels take the upper-left corner of the picture as the origin and mark each text region with the x and y coordinates of four points;
(5) for the text regions of the training data, shrink each text box by a ratio of 0.4 to obtain the shrink_map, marking 1 inside the shrunk box and 0 elsewhere; likewise expand and shrink each text box by a ratio of 0.4 to obtain the threshold_map, which is in fact a gradient region with maximum value 0.7 and minimum value 0.3: the closer to the original text box border, the larger the value, and the farther away, the smaller the value.
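The geometry of steps (2) and (3) can be sketched as follows: compute the equal-proportion scale and padding needed to reach 1280 x 1280, then apply the per-channel standardization. The function names are assumptions of this sketch; only the constants (1280, 245, the mean/std vectors) come from the claim:

```python
def letterbox_params(w, h, target=1280, pad_value=245):
    """Step (2): equal-proportion scale plus right/bottom padding with 245.
    The scale is returned so detected boxes can be mapped back later."""
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_right, pad_bottom = target - new_w, target - new_h
    return scale, (new_w, new_h), (pad_right, pad_bottom), pad_value

MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Step (3): divide by 255, subtract the per-channel mean, divide by std."""
    return [((v / 255.0) - m) / s for v, m, s in zip(rgb, MEAN, STD)]
```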
3. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S2 further includes:
(1) input the preprocessed picture and extract features at different scales using the resnet50 algorithm; the picture scales are reduced to 1/4, 1/8, 1/16 and 1/32, with the channel counts correspondingly increased to 256, 512, 1024 and 2048;
(2) combine the features of different scales using the FPN algorithm, and compute a probability matrix P and a threshold matrix T from the combined feature information;
(3) convert the probability matrix and the threshold matrix into an approximate binarization matrix using differentiable binarization:
B(i,j) = 1 / (1 + exp(-k * (P(i,j) - T(i,j))))
where k is an amplification factor; B is an approximate binarization matrix that is differentiable and can therefore be used for backpropagation;
(4) the model outputs three matrices, namely the probability matrix P, the threshold matrix T and the approximate binarization matrix B; correspondingly there are three losses, which comprise:
Loss1: calculated from P and shrink_map; the function is cross entropy;
Loss2: calculated from T, mask and threshold_map, where mask is a matrix that is 1 in text regions and 0 elsewhere; the function is sum(abs(T - threshold_map) * mask) / sum(mask);
Loss3: calculated from B and shrink_map; the calculation process is
intersection = (B * shrink_map * mask).sum;
union = (B * mask).sum + (shrink_map * mask).sum;
loss = 1 - 2.0 * intersection / union.
Iterate the model with backpropagation training, with total Loss = Loss1 + 10 * Loss2 + Loss3; stop training when the specified number of rounds is reached or the loss fluctuates repeatedly and no longer decreases.
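The differentiable binarization and the Loss2/Loss3 formulas above can be written out element-wise on flattened maps. This is a minimal numeric sketch (plain Python lists standing in for matrices; function names are this example's own):

```python
import math

def db_binarize(P, T, k=50.0):
    """Differentiable binarization: B = 1 / (1 + exp(-k * (P - T)));
    k = 50 is an assumed amplification factor, not stated in the claim."""
    return [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(P, T)]

def threshold_loss(T, threshold_map, mask):
    """Loss2: masked L1 distance between predicted and target threshold maps."""
    num = sum(abs(t - g) * m for t, g, m in zip(T, threshold_map, mask))
    den = sum(mask)
    return num / den if den else 0.0

def dice_loss(B, shrink_map, mask):
    """Loss3: 1 - 2 * intersection / union over masked pixels."""
    inter = sum(b * s * m for b, s, m in zip(B, shrink_map, mask))
    union = (sum(b * m for b, m in zip(B, mask))
             + sum(s * m for s, m in zip(shrink_map, mask)))
    return 1.0 - 2.0 * inter / union

def total_loss(loss1, loss2, loss3):
    """Weighted total from the claim: Loss1 + 10 * Loss2 + Loss3."""
    return loss1 + 10.0 * loss2 + loss3
```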
4. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S3 further includes:
(1) the text recognition model does not need the whole picture; it only needs the cropped partial text regions as sub-pictures, with the corresponding texts as labels, to serve as input and output data;
(2) each sub-picture and its corresponding text within the same picture are processed as follows:
(3) scale the sub-picture proportionally to a height of 32; the length is not limited but must not exceed 320; apply no normalization or standardization, keeping the original image pixels as input;
(4) for the text labels, targeting English handwriting, the model dictionary keeps only digits, English punctuation, and upper- and lower-case letters; during data preprocessing, Chinese punctuation is converted into the corresponding English punctuation. At the same time, '_' is specifically added as the label for crossed-out words: because handwriting contains cases where a wrongly written word is crossed out and rewritten, adding the '_' label for crossed-out words can further improve the recognition rate. A crossed-out word is converted into a group of underscores of the corresponding length; for example, a crossed-out 'apple' is labeled '_____', an underscore group of length 5;
(5) to ensure training performance, batch input is chosen, which requires sub-pictures of equal length; the processing method is to randomly combine all sub-pictures from the same picture into large sub-pictures of 3 x 32 x 320, with every 32 large sub-pictures forming a batch with dimensions 32 x 3 x 32 x 320.
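The packing of step (5) amounts to bin-packing variable-width sub-pictures (already scaled to height 32) into 320-wide composites, 32 composites per batch. A greedy sketch, with assumed function and parameter names (the claim does not specify the packing order beyond "randomly combine"):

```python
def pack_batches(sub_widths, slot_w=320, batch_size=32):
    """Greedily fill 320-wide slots with sub-pictures (given by width),
    then group every 32 slots into one batch of index-groups."""
    slots, current, used = [], [], 0
    for i, w in enumerate(sub_widths):
        w = min(w, slot_w)              # a single sub-picture is capped at 320
        if used + w > slot_w and current:
            slots.append(current)       # slot full: start a new composite
            current, used = [], 0
        current.append(i)
        used += w
    if current:
        slots.append(current)
    # each batch holds up to 32 composite sub-pictures (32 x 3 x 32 x 320)
    return [slots[i:i + batch_size] for i in range(0, len(slots), batch_size)]
```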
5. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S4 further includes:
(1) input a batch of pictures (32 x 3 x 32 x 320), with batch B = 32, channel C = 3, height H = 32 and width W = 320; extract features through the CNN, reducing the height to 1/32 and the width to 1/4 while expanding the channels from 3 to 512, giving 32 x 512 x 1 x 80 [B C H W];
(2) input the matrix output by the CNN into the LSTM with T set to 80 (W/4); the LSTM output has dimensions 32 x 80 x nclass, where nclass is the total number of characters in the dictionary; apply softmax to the matrix output by the LSTM network;
(3) after softmax, connect the CTC (Connectionist Temporal Classification) loss function and backpropagate; CTC mainly solves the problem that the picture length does not correspond to the actual text length.
6. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S5 further includes:
(1) process the input picture as in step one and input it into the trained DBNet model;
(2) only the probability matrix P is needed during prediction, which greatly improves performance; then define a fixed threshold and binarize P to obtain matrix B, whose size is fixed at 1280 x 1280;
(3) obtain the text regions from matrix B, expand them by the same 0.4 ratio used during picture preprocessing to recover the real text regions, and extract the information of the corresponding four coordinate points.
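Prediction-time steps (2) and (3) reduce to a fixed-threshold binarization of P followed by expanding each detected box back by the shrink ratio. A sketch for an axis-aligned box; the threshold value 0.3 and the area/perimeter offset formula are assumptions borrowed from common DBNet practice, not stated in the claim:

```python
def binarize(P, thr=0.3):
    """Step (2): threshold the probability map directly (no T needed)."""
    return [[1 if p > thr else 0 for p in row] for row in P]

def unclip(box, ratio=0.4):
    """Step (3): expand a detected box back by the 0.4 ratio used to shrink
    the training labels; offset = area * ratio / perimeter (a simplification
    of the polygon offset, for an axis-aligned box (x0, y0, x1, y1))."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    offset = (w * h) * ratio / (2 * (w + h))
    return (x0 - offset, y0 - offset, x1 + offset, y1 + offset)
```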
7. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S6 further includes:
(1) affine transformation must be applied according to the coordinate point information of the one or more text regions in a picture predicted by the text detection algorithm, because the following situations exist:
(2) the quadrilateral formed by the four vertex coordinates of a text box predicted by the algorithm fits tightly around the actually written letters and words, but in practice its sides are not parallel to the sides of the picture, because students write at a slant or upload tilted pictures;
(3) in this situation, if, for the next step of text recognition, the actual text region were cut out by drawing a rectangle with sides parallel to the picture around the text box vertices, the sub-picture used for text recognition would contain more irrelevant-text regions than the original text box, affecting the accuracy of text recognition;
(4) applying an affine transformation to cut out the sub-picture directly according to the predicted vertex coordinates of the text box makes the processed sub-picture fit the text region much more closely.
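An affine transform is fully determined by three point correspondences, e.g. mapping a tilted box's top-left, top-right and bottom-left corners onto an upright W x H rectangle. A closed-form sketch (in practice a library routine such as OpenCV's affine warp would be used; these helper names are this example's own):

```python
def affine_from_triangle(src, dst):
    """Return (a, b, c, d, e, f) such that x' = a*x + b*y + c, y' = d*x + e*y + f
    maps the three src points onto the three dst points (solved via a 2x2 inverse)."""
    (sx0, sy0), (sx1, sy1), (sx2, sy2) = src
    (dx0, dy0), (dx1, dy1), (dx2, dy2) = dst
    ux, uy = sx1 - sx0, sy1 - sy0          # source edge vectors
    vx, vy = sx2 - sx0, sy2 - sy0
    det = ux * vy - vx * uy                # must be nonzero (non-degenerate triangle)
    a = ((dx1 - dx0) * vy - (dx2 - dx0) * uy) / det
    b = ((dx2 - dx0) * ux - (dx1 - dx0) * vx) / det
    d = ((dy1 - dy0) * vy - (dy2 - dy0) * uy) / det
    e = ((dy2 - dy0) * ux - (dy1 - dy0) * vx) / det
    c = dx0 - a * sx0 - b * sy0            # translation from the first point pair
    f = dy0 - d * sx0 - e * sy0
    return (a, b, c, d, e, f)

def apply_affine(m, pt):
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + b * y + c, d * x + e * y + f)
```

Sampling the source picture at the inverse-mapped coordinates then yields an upright sub-picture that hugs the text region.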
8. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S7 further includes:
(1) scale the sub-picture processed in step six to H = 32, with the length W not limited;
(2) input it into the trained CRNN model, which outputs characters of length W/4;
(3) the output character string needs merging and de-duplication; the invention defines the blank character as "#", de-duplicates between blank characters, and then merges after removing the blank characters; for example, "aa#p#ppp#ll#ee" becomes "apple" after merging.
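The merge-and-deduplicate rule of step (3) is the standard greedy CTC collapse: drop repeated characters between blanks, then drop the blanks. A minimal sketch (the function name is this example's own):

```python
def ctc_merge(raw, blank="#"):
    """Collapse repeats, then strip the blank symbol:
    e.g. 'aa#p#ppp#ll#ee' -> 'apple'."""
    out, prev = [], None
    for ch in raw:
        if ch != prev and ch != blank:
            out.append(ch)      # keep a character only when it changes
        prev = ch               # blanks still reset the repeat tracking
    return "".join(out)
```

The blank between the two runs of 'p' is what allows the genuine double letter to survive the de-duplication.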
9. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S8 further includes:
(1) sort the coordinates of the one or more text regions predicted by the DBNet model in ascending order along the horizontal direction;
(2) extract the text regions in order, marking the current one as A; search for the text region whose box is nearest to A in the horizontal direction, whose vertical intersection-over-union with A's box exceeds 0.3 and whose height ratio to A exceeds 0.3, and mark it as B; B is the adjacent text box of A;
(3) take B as the new A and repeat the previous step until no adjacent text region can be found, at which point the search for this line of text regions ends;
(4) remove the recorded text regions from the candidate set and repeat steps (1) and (2) until the candidate set is empty;
(5) merge the texts corresponding to the sorted text regions.
10. The deep learning based end-to-end English handwritten text detection and recognition technology of claim 1, characterized in that: the step S9 further includes:
(1) this step targets the situation in English handwriting where a wrongly written word is crossed out and the correction is written above or below it;
(2) after sorting, check whether crossed-out characters exist;
(3) if an isolated text region is found, the judgment conditions are as follows:
whether the isolated region is above or below the crossed-out character region, where the two text regions either intersect or are separated by a distance far smaller than the line spacing of the whole text in the picture;
the horizontal intersection-over-union of the isolated region's box and the crossed-out character region's box is greater than 0.5, and their width ratio is greater than 0.5;
(4) if the conditions in (3) are met, replace the text of the crossed-out character region with the text of the isolated region and delete the isolated region's text; if they are not met, directly remove the crossed-out characters.
CN202210391966.7A 2022-04-14 2022-04-14 End-to-end English handwritten text detection and recognition technology based on deep learning Pending CN114821601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210391966.7A CN114821601A (en) 2022-04-14 2022-04-14 End-to-end English handwritten text detection and recognition technology based on deep learning


Publications (1)

Publication Number Publication Date
CN114821601A true CN114821601A (en) 2022-07-29

Family

ID=82536098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210391966.7A Pending CN114821601A (en) 2022-04-14 2022-04-14 End-to-end English handwritten text detection and recognition technology based on deep learning

Country Status (1)

Country Link
CN (1) CN114821601A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3598339A1 (en) * 2018-07-19 2020-01-22 Tata Consultancy Services Limited Systems and methods for end-to-end handwritten text recognition using neural networks
CN112183372A (en) * 2020-09-29 2021-01-05 深圳数联天下智能科技有限公司 Text recognition method, device and equipment and readable storage medium
CN112560845A (en) * 2020-12-23 2021-03-26 京东方科技集团股份有限公司 Character recognition method and device, intelligent meal taking cabinet, electronic equipment and storage medium
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN113971809A (en) * 2021-10-25 2022-01-25 多伦科技股份有限公司 Text recognition method and device based on deep learning and storage medium
CN114241492A (en) * 2021-12-17 2022-03-25 黑盒科技(广州)有限公司 Method for recognizing handwritten text of composition manuscript paper and reproducing text structure



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination