CN112036405A - Detection and identification method for handwritten document text - Google Patents
Detection and identification method for handwritten document text
- Publication number
- CN112036405A CN112036405A CN202010896671.6A CN202010896671A CN112036405A CN 112036405 A CN112036405 A CN 112036405A CN 202010896671 A CN202010896671 A CN 202010896671A CN 112036405 A CN112036405 A CN 112036405A
- Authority
- CN
- China
- Prior art keywords
- text
- text line
- network
- picture
- line
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention relates in particular to a method for detecting and recognizing handwritten document text. The method comprises two parts, text line positioning and text line detection. The text line positioning network trains a modified VGG-11 on a picture to find the possible starting positions of text lines in the picture. The text line detection network advances incrementally forward along each text line: from the obtained starting position and rotation angle, a viewing window is obtained by resampling and input to a CNN, which regresses the rotation angle of the next position; this is repeated until the edge of the picture is reached, and a normalized text line picture is finally generated and input to a text line recognition network, which recognizes it and outputs the recognition result. The method not only overcomes interference factors in natural scenes and detects and recognizes text accurately, but also advances recursively along the extension direction of each text line, so that even curved text lines are finally detected.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a detection and identification method for a handwritten document text.
Background
The problem of detecting the positions of text blocks in complex color images of natural scenes was first raised at the end of the twentieth century. Solving it promises great economic and cultural benefits, so it quickly became a hotspot in the fields of computer vision and document analysis. In the decades since the problem was raised, a variety of text detection and recognition methods have been proposed.
For text detection, there are currently mainly the following methods:
1. Energy-minimization-based methods: most are built on conditional random fields and Markov random fields and treat text line detection as an energy minimization problem, so as to resolve interference between text lines;
2. Connected-component-based methods: the core idea is to find small parts and assemble them into larger parts, remove the non-character parts with a classifier, and finally extract the characters from the image and merge them into text regions; the most representative examples are maximally stable extremal regions (MSER) and the stroke width transform (SWT);
3. Deep-learning-based methods: a convolutional neural network extracts high-dimensional features from the image to realize text detection and recognition.
For text recognition, there are currently mainly the following methods:
1. Character-based methods, which perform character-level recognition; reliable character recognition makes bottom-up text recognition easier to implement;
2. Word-based methods, which treat text recognition as word recognition;
3. Sequence-based methods, which convert text recognition into a sequence recognition problem: the text is represented as a character sequence, and a convolutional recurrent neural network handles sequences of arbitrary length.
Text detection and recognition in handwritten documents photographed in natural scenes differs from conventional OCR and poses much greater challenges:
first, scene complexity: noise, deformation, non-uniform illumination, partial occlusion, and confusion between characters and background all degrade detection and recognition;
second, character diversity: color, size, orientation, font, language, and partial character distortion also degrade detection and recognition.
Solving the problem brings great cultural and economic benefits, such as helping visually impaired people read documents and enabling real-time photo translation. However, because handwritten document pictures shot in natural scenes contain many interference factors, traditional text detection and recognition methods do not transfer well to them. On this basis, the invention provides a method for detecting and recognizing handwritten document text.
Disclosure of Invention
In order to make up for the defects of the prior art, the invention provides a simple and efficient method for detecting and identifying the handwritten document text.
The invention is realized by the following technical scheme:
a detection and recognition method of handwritten document text is characterized by comprising the following steps: the method comprises two parts of text line positioning and text line detection;
the text line positioning network trains a modified VGG-11 on a picture and regresses the (x0, y0) coordinates, the scale s0, the rotation angle θ0, and the text line occurrence likelihood p0, thereby finding the possible starting positions of text lines in the picture;
the text line detection network advances incrementally forward along the text line: from the text line start position and rotation angle (xi, yi, θi) obtained by the text line positioning network, a viewing window is obtained by resampling and input to a CNN, which regresses the (xi+1, yi+1, θi+1) of the next position; the process is repeated until the picture edge is reached, and a normalized text line picture is finally generated and input to a text line recognition network, which recognizes it and outputs the recognition result.
Before input to the text line positioning network, the data set is processed: all text line pictures are output together with json labeling information, which comprises the image path, the region coordinates of each line of text, the coordinates of the region occupied by each word within the line, and the text content of each line.
The processing method of the text line positioning network comprises the following steps:
S1, reading the json file of the image labels, traversing it, and removing entries with erroneous labels;
S2, resizing the input image to 512 pixels wide and sampling 256×256 image patches over the whole picture; each patch is allowed to extend outside the image, with the out-of-image area filled with the average color of the image patch edges;
S3, inputting each 16×16 input image block into the modified VGG-11 network for training; the network regresses the (x0, y0) coordinates, the scale s0, the rotation angle θ0, and the text line occurrence likelihood p0;
S4, after training, setting p0 = 1 and setting the (x0, y0) coordinates, the scale s0, and the rotation angle θ0 equal to 0;
S5, after the text line positioning module has determined the starting position of a text line in the picture, the text line detection network advances incrementally along the path of the text line to determine the complete text line region.
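The average-color padding of step S2 can be sketched as follows. This is a minimal illustration of the idea; the function and parameter names (`sample_patch`, `cx`, `cy`) are assumptions, not taken from the patent:

```python
import numpy as np

def sample_patch(img, cx, cy, size=256):
    """Crop a size x size patch centred at (cx, cy) from img (H, W, 3).

    Parts of the patch falling outside the image are filled with the
    average colour of the image border, approximating the padding step
    described above.
    """
    h, w = img.shape[:2]
    # average colour of the 1-pixel border of the image
    border = np.concatenate([
        img[0, :].reshape(-1, 3), img[-1, :].reshape(-1, 3),
        img[:, 0].reshape(-1, 3), img[:, -1].reshape(-1, 3),
    ])
    fill = border.mean(axis=0)

    patch = np.tile(fill, (size, size, 1)).astype(img.dtype)
    half = size // 2
    # overlapping region between the patch window and the image
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    if x0 < x1 and y0 < y1:
        patch[y0 - (cy - half):y1 - (cy - half),
              x0 - (cx - half):x1 - (cx - half)] = img[y0:y1, x0:x1]
    return patch
```

A patch centred near an image corner then comes back half image content and half border-average fill.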
The modified VGG-11 network omits the fully connected layers and the last pooling layer of the classic VGG-11 network; all convolution layers use kernels of the same size, 3×3, with stride 1 and padding 1.
In step S4, the training process uses a loss function proposed for the multi-box object detection problem to align the maximum-probability predicted text line start positions with the target positions; the loss function is as follows:
wherein tm is the target position, pn is the likelihood of SOL (start-of-line) occurrence, Xnm is a bidirectional alignment matrix between the N predicted positions and the M target positions, α is a parameter weighing the relative importance of the position loss against the confidence loss (0.01 by default), and ln is the initial prediction (xn, yn, sn, θn) of the convolutional neural network; given (l, p, t), Xnm is computed so as to minimize L; the calculation formula of ln is as follows:
ln = (-sin(θn)·sn + xn, -cos(θn)·sn + yn, sin(θn)·sn + xn, cos(θn)·sn + yn)
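The formula for ln above translates directly into code. A small sketch (the helper name `initial_prediction_points` is illustrative, not from the patent):

```python
import math

def initial_prediction_points(x, y, s, theta):
    """Expand a network prediction (x, y, s, theta) into the four-tuple
    ln of the formula above: the two endpoints of the start-of-line
    segment, offset from (x, y) by the scale s along direction theta.
    """
    return (-math.sin(theta) * s + x,
            -math.cos(theta) * s + y,
            math.sin(theta) * s + x,
            math.cos(theta) * s + y)
```

With theta = 0 the two points sit vertically above and below (x, y) at distance s.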
the processing method of the text line detection network comprises the following steps:
S1, reading the json file of the image labels, traversing it, and removing entries with erroneous labels;
S2, the text line detection network operates recursively and incrementally: from the text line start position and rotation angle (xi, yi, θi) obtained by the text line positioning network, a viewing window is obtained by resampling;
S3, inputting the viewing window to the CNN, which regresses the (xi+1, yi+1, θi+1) of the next position;
S4, repeating the above steps until the picture edge is reached; the size of the viewing window is determined by the scale s0 predicted by the text line positioning module and remains unchanged.
In step S2, the resampling of the viewing window is similar to a spatial transformer network: image coordinates are mapped to viewing window coordinates by an affine transformation matrix;
the first viewing window matrix is W0 = A·WSOL, where the matrix A is a forward propagation matrix responsible for providing context information so that the text line detection network can position the text line correctly;
the calculation formulas of the matrix A and the matrix WSOL are as follows:
the parameters are obtained by prediction of a text line positioning network;
according to the matrix Wi, a 32×32 viewing window is extracted; the text line detection network then regresses xi, yi and θi; from the regressed xi, yi, θi a prediction matrix Pi is formed, and the next matrix is computed as Wi = Pi·Wi-1;
the calculation formula of the prediction matrix Pi is as follows:
to locate a text line, the text line is treated as a series of pairs of upper and lower coordinate points pu,i and pl,i; each coordinate pair is computed from the upper and lower midpoints of the prediction window;
a mean squared error (MSE) loss function is used when training the convolutional neural network; its calculation formula is as follows:
the text detection network starts from the first target positions tu,0 and tl,0 and resets to the corresponding target position every 4 steps, so that when the text line detection network drifts away from the handwritten text line it can recover the correct path without introducing large errors during training;
to enhance the robustness of the text line detection network, after the target position is reset, a translation of Δx, Δy ∈ [-2, 2] pixels and a rotation of Δθ ∈ [-0.1, 0.1] radians are randomly applied to the target position.
The text line detection network outputs a normalized text line picture, which is input to the text line recognition network; the text line recognition network uses a conventional convolutional neural network and a bidirectional recurrent neural network, with CTC loss computed at the top of the framework, so that it recognizes input text line images of variable length and outputs the text line recognition result.
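The CTC layer mentioned above aligns per-frame network outputs with a label sequence. Its decoding side can be sketched as a generic greedy decode (merge repeated labels, then drop blanks); this is standard CTC post-processing, not the patent's own implementation:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame argmax label sequence into an output label
    sequence, as CTC decoding does: merge repeated labels, then drop
    the blank symbol.
    """
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

For example, the frame sequence [blank, a, a, blank, a, b, b, blank] decodes to "a a b": the blank between the two runs of `a` is what keeps a doubled letter from collapsing.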
The beneficial effects of the invention are: the method for detecting and recognizing handwritten document text not only overcomes interference factors in natural scenes and detects and recognizes text accurately, but also advances recursively along the extension direction of each text line, so that even curved text lines are finally detected.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of a system for detecting and recognizing handwritten document texts according to the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The detection and identification method of the handwritten document text is based on a deep learning technology and comprises two parts, namely text line positioning and text line detection;
the text line positioning network trains a modified VGG-11 on a picture and regresses the (x0, y0) coordinates, the scale s0, the rotation angle θ0, and the text line occurrence likelihood p0, thereby finding the possible starting positions of text lines in the picture;
the text line detection network advances incrementally forward along the text line: from the text line start position and rotation angle (xi, yi, θi) obtained by the text line positioning network, a viewing window is obtained by resampling and input to a CNN, which regresses the (xi+1, yi+1, θi+1) of the next position; the process is repeated until the picture edge is reached, and a normalized text line picture is finally generated and input to a text line recognition network, which recognizes it and outputs the recognition result.
Before input to the text line positioning network, the data set is processed: all text line pictures are output together with json labeling information, which comprises the image path, the region coordinates of each line of text, the coordinates of the region occupied by each word within the line, and the text content of each line.
The processing method of the text line positioning network comprises the following steps:
S1, reading the json file of the image labels, traversing it, and removing entries with erroneous labels;
S2, resizing the input image to 512 pixels wide and sampling 256×256 image patches over the whole picture; each patch is allowed to extend outside the image, with the out-of-image area filled with the average color of the image patch edges;
S3, inputting each 16×16 input image block into the modified VGG-11 network for training; the network regresses the (x0, y0) coordinates, the scale s0, the rotation angle θ0, and the text line occurrence likelihood p0;
S4, after training, setting p0 = 1 and setting the (x0, y0) coordinates, the scale s0, and the rotation angle θ0 equal to 0;
S5, after the text line positioning module has determined the starting position of a text line in the picture, the text line detection network advances incrementally along the path of the text line to determine the complete text line region.
The modified VGG-11 network omits the fully connected layers and the last pooling layer of the classic VGG-11 network; all convolution layers use kernels of the same size, 3×3, with stride 1 and padding 1.
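Assuming the standard VGG-11 convolutional configuration, removing the final pooling layer leaves four 2× down-samplings, while the 3×3, stride-1, padding-1 convolutions preserve spatial size. A shape-only sketch (the layer list is taken from the classic VGG-11; its use here is an assumption based on the description above):

```python
# Classic VGG-11 convolutional configuration with the final max-pool
# (and all fully connected layers) removed.  Integers are conv output
# channels; "M" is a 2x2 max-pool with stride 2.
VGG11_TRUNK = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512]

def output_shape(h, w, cfg=VGG11_TRUNK):
    """Track (channels, height, width) through the truncated network.

    3x3 convs with stride 1 and padding 1 keep h and w unchanged;
    each max-pool halves them.
    """
    channels = 3
    for layer in cfg:
        if layer == "M":
            h, w = h // 2, w // 2
        else:
            channels = layer
    return channels, h, w
```

A 256×256 patch then yields a 512-channel 16×16 feature grid, which would be consistent with the 16×16 blocks mentioned in step S3.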
In step S4, the training process uses a loss function proposed for the multi-box object detection problem to align the maximum-probability predicted text line start positions with the target positions; the loss function is as follows:
wherein tm is the target position, pn is the likelihood of SOL (start-of-line) occurrence, Xnm is a bidirectional alignment matrix between the N predicted positions and the M target positions, α is a parameter weighing the relative importance of the position loss against the confidence loss (0.01 by default), and ln is the initial prediction (xn, yn, sn, θn) of the convolutional neural network; given (l, p, t), Xnm is computed so as to minimize L; the calculation formula of ln is as follows:
ln = (-sin(θn)·sn + xn, -cos(θn)·sn + yn, sin(θn)·sn + xn, cos(θn)·sn + yn)
the processing method of the text line detection network comprises the following steps:
S1, reading the json file of the image labels, traversing it, and removing entries with erroneous labels;
S2, the text line detection network operates recursively and incrementally: from the text line start position and rotation angle (xi, yi, θi) obtained by the text line positioning network, a viewing window is obtained by resampling;
S3, inputting the viewing window to the CNN, which regresses the (xi+1, yi+1, θi+1) of the next position;
S4, repeating the above steps until the picture edge is reached; the size of the viewing window is determined by the scale s0 predicted by the text line positioning module and remains unchanged.
In step S2, the resampling of the viewing window is similar to a spatial transformer network: image coordinates are mapped to viewing window coordinates by an affine transformation matrix;
the first viewing window matrix is W0 = A·WSOL, where the matrix A is a forward propagation matrix responsible for providing context information so that the text line detection network can position the text line correctly;
the calculation formulas of the matrix A and the matrix WSOL are as follows:
the parameters are obtained by prediction of a text line positioning network;
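Since the formulas for A and WSOL are given only as a figure, the resampling step itself can still be sketched: under the assumption that W maps image coordinates to viewing window coordinates, each window pixel is pulled through the inverse of W. Nearest-neighbour sampling is used for brevity, and all names are illustrative:

```python
import numpy as np

def resample_window(img, W, size=32):
    """Extract a size x size viewing window from img using a 3x3
    homogeneous affine matrix W (image coords -> window coords), in the
    spirit of a spatial transformer: each window pixel is mapped back
    into image coordinates via W's inverse and sampled.
    """
    h, w = img.shape[:2]
    Winv = np.linalg.inv(W)
    out = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    for j in range(size):
        for i in range(size):
            x, y, _ = Winv @ np.array([i, j, 1.0])
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:  # outside pixels stay zero
                out[j, i] = img[yi, xi]
    return out
```

With W equal to the identity the window is simply a crop of the top-left corner of the image.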
according to the matrix Wi, a 32×32 viewing window is extracted; the text line detection network then regresses xi, yi and θi; from the regressed xi, yi, θi a prediction matrix Pi is formed, and the next matrix is computed as Wi = Pi·Wi-1;
the calculation formula of the prediction matrix Pi is as follows:
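The recursion Wi = Pi·Wi-1 composes homogeneous 2-D transforms. The exact entries of Pi are given by a figure omitted here, so the sketch below assumes Pi is a rotation plus translation built from the regressed step; the names are illustrative:

```python
import numpy as np

def step_matrix(dx, dy, dtheta):
    """Homogeneous 2-D transform for one follow step: rotate by dtheta,
    then translate by (dx, dy).  Treating the regressed (xi, yi, thetai)
    as a relative step is an assumption, not the patent's exact form.
    """
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0.0, 0.0, 1.0]])

def next_window(W_prev, dx, dy, dtheta):
    # Wi = Pi @ Wi-1: compose the new step with the previous window matrix.
    return step_matrix(dx, dy, dtheta) @ W_prev
```

Two pure-translation steps compose additively, which is the behaviour the incremental follow loop relies on.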
to locate a text line, the text line is treated as a series of pairs of upper and lower coordinate points pu,i and pl,i; each coordinate pair is computed from the upper and lower midpoints of the prediction window;
a mean squared error (MSE) loss function is used when training the convolutional neural network; its calculation formula is as follows:
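Because the MSE formula itself is given only as a figure, a plain-Python stand-in over the upper/lower point pairs is sketched below; the exact normalization used in the patent may differ:

```python
def point_pair_mse(pred, target):
    """Mean squared error between predicted and target point sequences,
    each a list of (x, y) tuples covering the upper and lower text line
    points.  Function and argument names are illustrative.
    """
    assert len(pred) == len(target) and pred
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(pred)
```
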
the text detection network starts from the first target positions tu,0 and tl,0 and resets to the corresponding target position every 4 steps, so that when the text line detection network drifts away from the handwritten text line it can recover the correct path without introducing large errors during training;
to enhance the robustness of the text line detection network, after the target position is reset, a translation of Δx, Δy ∈ [-2, 2] pixels and a rotation of Δθ ∈ [-0.1, 0.1] radians are randomly applied to the target position.
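The random perturbation applied after each target reset can be sketched as follows; the helper name is illustrative:

```python
import random

def jitter_target(x, y, theta, rng=random):
    """Randomly perturb a reset target position: a translation of
    dx, dy in [-2, 2] pixels and a rotation of dtheta in [-0.1, 0.1]
    radians, as described above.
    """
    return (x + rng.uniform(-2.0, 2.0),
            y + rng.uniform(-2.0, 2.0),
            theta + rng.uniform(-0.1, 0.1))
```
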
The text line detection network outputs a normalized text line picture, which is input to the text line recognition network; the text line recognition network uses a conventional convolutional neural network and a bidirectional recurrent neural network, with CTC loss computed at the top of the framework, so that it recognizes input text line images of variable length and outputs the text line recognition result.
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.
Claims (9)
1. A detection and recognition method of handwritten document text is characterized by comprising the following steps: the method comprises two parts of text line positioning and text line detection;
the text line positioning network trains a modified VGG-11 on a picture and regresses the (x0, y0) coordinates, the scale s0, the rotation angle θ0, and the text line occurrence likelihood p0, thereby finding the possible starting positions of text lines in the picture;
the text line detection network advances incrementally forward along the text line: from the text line start position and rotation angle (xi, yi, θi) obtained by the text line positioning network, a viewing window is obtained by resampling and input to a CNN, which regresses the (xi+1, yi+1, θi+1) of the next position; the process is repeated until the picture edge is reached, and a normalized text line picture is finally generated and input to a text line recognition network, which recognizes it and outputs the recognition result.
2. The method for detecting and recognizing handwritten document text according to claim 1, characterized in that: before input to the text line positioning network, the data set is processed: all text line pictures are output together with json labeling information, which comprises the image path, the region coordinates of each line of text, the coordinates of the region occupied by each word within the line, and the text content of each line.
3. The method for detecting and recognizing handwritten document text according to claim 1 or 2, characterized in that: the processing method of the text line positioning network comprises the following steps:
S1, reading the json file of the image labels, traversing it, and removing entries with erroneous labels;
S2, resizing the input image to 512 pixels wide and sampling 256×256 image patches over the whole picture; each patch is allowed to extend outside the image, with the out-of-image area filled with the average color of the image patch edges;
S3, inputting each 16×16 input image block into the modified VGG-11 network for training; the network regresses the (x0, y0) coordinates, the scale s0, the rotation angle θ0, and the text line occurrence likelihood p0;
S4, after training, setting p0 = 1 and setting the (x0, y0) coordinates, the scale s0, and the rotation angle θ0 equal to 0;
S5, after the text line positioning module has determined the starting position of a text line in the picture, the text line detection network advances incrementally along the path of the text line to determine the complete text line region.
4. The method for detecting and recognizing handwritten document text according to claim 3, characterized in that: the modified VGG-11 network omits the fully connected layers and the last pooling layer of the classic VGG-11 network; all convolution layers use kernels of the same size, 3×3, with stride 1 and padding 1.
5. The method for detecting and recognizing handwritten document text according to claim 3 or 4, characterized in that: in step S4, the training process uses a loss function proposed for the multi-box object detection problem to align the maximum-probability predicted text line start positions with the target positions; the loss function is as follows:
wherein tm is the target position, pn is the likelihood of SOL (start-of-line) occurrence, Xnm is a bidirectional alignment matrix between the N predicted positions and the M target positions, α is a parameter weighing the relative importance of the position loss against the confidence loss (0.01 by default), and ln is the initial prediction (xn, yn, sn, θn) of the convolutional neural network; given (l, p, t), Xnm is computed so as to minimize L; the calculation formula of ln is as follows:
ln = (-sin(θn)·sn + xn, -cos(θn)·sn + yn, sin(θn)·sn + xn, cos(θn)·sn + yn).
6. the method for detecting and recognizing handwritten document text according to claim 1 or 2, characterized in that: the processing method of the text line detection network comprises the following steps:
s1, reading a json file of an image label, traversing the json file, and removing a part with an error label;
S2, the text line detection network operates recursively and incrementally: from the text line start position and rotation angle (xi, yi, θi) obtained by the text line positioning network, a viewing window is obtained by resampling;
S3, inputting the viewing window to the CNN, which regresses the (xi+1, yi+1, θi+1) of the next position;
S4, repeating the above steps until the picture edge is reached; the size of the viewing window is determined by the scale s0 predicted by the text line positioning module and remains unchanged.
7. The method for detecting and recognizing handwritten document text according to claim 6, characterized in that: in step S2, the resampling of the viewing window is similar to a spatial transformer network: image coordinates are mapped to viewing window coordinates by an affine transformation matrix;
the first viewing window matrix is W0 = A·WSOL, where the matrix A is a forward propagation matrix responsible for providing context information so that the text line detection network can position the text line correctly;
the calculation formulas of the matrix A and the matrix WSOL are as follows:
the parameters are obtained by prediction of a text line positioning network;
according to the matrix Wi, a 32×32 viewing window is extracted; the text line detection network then regresses xi, yi and θi; from the regressed xi, yi, θi a prediction matrix Pi is formed, and the next matrix is computed as Wi = Pi·Wi-1;
The prediction matrix PiThe calculation formula of (a) is as follows:
to locate a text line, the text line is treated as a series of pairs of upper and lower coordinate points pu,i and pl,i; each coordinate pair is computed from the upper and lower midpoints of the prediction window;
8. The method for detecting and recognizing handwritten document text according to claim 7, characterized in that: a mean squared error (MSE) loss function is used when training the convolutional neural network; its calculation formula is as follows:
the text detection network starts from the first target positions tu,0 and tl,0 and resets to the corresponding target position every 4 steps, so that when the text line detection network drifts away from the handwritten text line it can recover the correct path without introducing large errors during training;
to enhance the robustness of the text line detection network, after the target position is reset, a translation of Δx, Δy ∈ [-2, 2] pixels and a rotation of Δθ ∈ [-0.1, 0.1] radians are randomly applied to the target position.
9. The method for detecting and recognizing handwritten document text according to claim 6, 7 or 8, characterized in that: the text line detection network outputs a normalized text line picture, which is input to the text line recognition network; the text line recognition network uses a conventional convolutional neural network and a bidirectional recurrent neural network, with CTC loss computed at the top of the framework, so that it recognizes input text line images of variable length and outputs the text line recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010896671.6A CN112036405A (en) | 2020-08-31 | 2020-08-31 | Detection and identification method for handwritten document text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112036405A true CN112036405A (en) | 2020-12-04 |
Family
ID=73586026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010896671.6A Pending CN112036405A (en) | 2020-08-31 | 2020-08-31 | Detection and identification method for handwritten document text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112036405A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447078A (en) * | 2018-10-23 | 2019-03-08 | 四川大学 | A kind of detection recognition method of natural scene image sensitivity text |
CN110287960A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院信息工程研究所 | The detection recognition method of curve text in natural scene image |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN110837835A (en) * | 2019-10-29 | 2020-02-25 | 华中科技大学 | End-to-end scene text identification method based on boundary point detection |
Non-Patent Citations (2)
Title |
---|
朱健菲; 应自炉; 陈鹏飞: "Handwritten text line extraction under a joint regression-clustering framework", Journal of Image and Graphics, no. 08 *
王涛; 江加和: "Arbitrary-orientation text recognition based on semantic segmentation", Applied Science and Technology, no. 03 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Show, attend and read: A simple and strong baseline for irregular text recognition | |
US10817741B2 (en) | Word segmentation system, method and device | |
US8755595B1 (en) | Automatic extraction of character ground truth data from images | |
CN113591546A (en) | Semantic enhanced scene text recognition method and device | |
CN110751154B (en) | Complex environment multi-shape text detection method based on pixel-level segmentation | |
Wang et al. | Towards end-to-end text spotting in natural scenes | |
Bukhari et al. | High performance layout analysis of Arabic and Urdu document images | |
CN113537227B (en) | Structured text recognition method and system | |
CN112364862B (en) | Histogram similarity-based disturbance deformation Chinese character picture matching method | |
CN109886274A (en) | Social security card identification method and system based on opencv and deep learning | |
CN111476232A (en) | Water washing label detection method, equipment and storage medium | |
Sahare et al. | Robust character segmentation and recognition schemes for multilingual Indian document images | |
Sanjrani et al. | Handwritten optical character recognition system for Sindhi numerals | |
CN113313113A (en) | Certificate information acquisition method, device, equipment and storage medium | |
CN115810197A (en) | Multi-mode electric power form recognition method and device | |
CN112686219B (en) | Handwritten text recognition method and computer storage medium | |
Panda et al. | Odia offline typewritten character recognition using template matching with unicode mapping | |
Wicht et al. | Camera-based sudoku recognition with deep belief network | |
CN116704523B (en) | Text typesetting image recognition system for publishing and printing equipment | |
CN112949523A (en) | Method and system for extracting key information from identity card image picture | |
Razzak et al. | Fuzzy based preprocessing using fusion of online and offline trait for online urdu script based languages character recognition | |
Dat et al. | An improved CRNN for Vietnamese Identity Card Information Recognition. | |
Karthik et al. | Segmentation and Recognition of Handwritten Kannada Text Using Relevance Feedback and Histogram of Oriented Gradients–A Novel Approach | |
Gao et al. | Recurrent calibration network for irregular text recognition | |
CN112036405A (en) | Detection and identification method for handwritten document text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||