CN110837835A - End-to-end scene text identification method based on boundary point detection

Publication number: CN110837835A
Application number: CN201911038568.1A
Authority: CN (China)
Prior art keywords: text, network, rpn, multidirectional, boundary point
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN110837835B (en)
Inventor
刘文予
白翔
许永超
王豪
卢普
张辉
杨明锟
何梦超
王永攀
Current Assignee: Huazhong University of Science and Technology
Original Assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN201911038568.1A
Publication of CN110837835A; application granted; publication of CN110837835B
Legal status: Active

Classifications

    • G06V30/153: Segmentation of character regions using recognition of characters or words
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections
    • G06V30/10: Character recognition


Abstract

The invention discloses an end-to-end scene text recognition method based on boundary point detection. Text features are extracted by a feature pyramid network and used by a region extraction network to generate candidate text boxes; a multidirectional rectangle detection network then detects a more accurate multidirectional bounding box for each text instance; next, the sequences of upper and lower boundary points of the text inside the multidirectional bounding box are detected; finally, the detected boundary point sequences are used to transform text of arbitrary shape into horizontal text for a subsequent attention-based sequence recognition network, and a beam search algorithm finds the best matching word for the predicted sequence in a given dictionary to obtain the final text recognition result. The method can simultaneously detect and recognize scene text of arbitrary shape in natural images, including horizontal, multidirectional and curved text, without character-level annotation, and can be trained entirely end to end.

Description

End-to-end scene text identification method based on boundary point detection
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a scene text end-to-end identification method based on boundary point detection.
Background
Scene text detection and recognition is a very active and challenging research direction in the field of computer vision, and many practical applications depend on it, such as network information security monitoring systems, intelligent transportation systems, and assistance for the visually impaired.
In most prior work, scene text detection and recognition are treated as two separate processes: first, a trained detector locates character regions in a natural scene picture; second, the detected character regions are fed into a recognition module to obtain the character content. The detection and recognition tasks are highly correlated and complementary: on the one hand, the quality of the detection step determines the accuracy of recognition; on the other hand, the recognition result can also provide feedback for detection. Processing them separately may therefore lead to sub-optimal performance for both detection and recognition.
Recently, various methods have provided end-to-end recognition solutions, and they can be roughly divided into two types. The first type follows a similar processing flow: a text instance is first represented as a horizontal or multidirectional bounding box, the text bounding box is detected with a detection network, and the text image or feature is then cropped from the image or feature map according to the detected bounding box and recognized by a subsequent text recognition network. Because text instances are described as horizontal or multidirectional bounding boxes, such schemes have difficulty handling arbitrarily shaped text. The second type consists of a text detector based on instance segmentation and a text recognizer based on character segmentation: text of arbitrary shape is detected by segmenting the instance text region, and the text is recognized by semantic segmentation in two-dimensional space, so that irregular text instances can be recognized. However, such methods require character-level annotation, and the recognition network cannot model character sequence information. An economical and efficient end-to-end recognition method is therefore needed to handle scene text of arbitrary shape.
Disclosure of Invention
The invention aims to provide an end-to-end scene text recognition method based on boundary point detection, which consists of a text detector based on boundary point detection and a text recognizer based on attention-driven sequence recognition. Text of arbitrary shape is detected by detecting the boundary points of text instances; according to the detected boundary points of a text instance, the arbitrarily shaped text is rectified into horizontal text with a thin plate spline interpolation algorithm; irregular text instances are then recognized by feeding the rectified text to the attention-based sequence recognition text recognizer. The method can detect and recognize text instances of arbitrary shape and can be trained entirely end to end.
In order to achieve the above object, the present invention provides an end-to-end recognition method for scene texts with arbitrary shapes, comprising the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level labeling on texts in any shapes of all pictures in an original data set, wherein labels are the clockwise vertex coordinates of polygons of text bounding boxes in word level and word character sequences of the texts, and obtaining a standard training data set with labels;
and (1.2) defining a scene text end-to-end identification network model based on boundary point detection, wherein the scene text end-to-end identification network model based on boundary point detection is composed of a characteristic pyramid structure network, a region extraction network, a multi-direction rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism. Calculating a training label according to the standard training data set with the label in the step (1.1), designing a loss function, and training the scene text end-to-end recognition network based on the boundary point detection by using a reverse conduction method to obtain a scene text end-to-end recognition network model based on the boundary point detection; the method specifically comprises the following substeps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism; the characteristic pyramid structure network is formed by adding a bottom-up connection, a top-down connection and a transverse connection by taking a ResNet-50 deep convolution neural network as a basic network, and is used for extracting and fusing characteristics with different resolutions from an input standard data set picture; inputting the extracted features of different scales into a region extraction network to obtain a candidate text region, and after the alignment operation of the region of interest, obtaining the candidate text region of a fixed scale; inputting a candidate text region with the resolution of 7 multiplied by 7 extracted by a region extraction network into a rapid region classification regression network, predicting the probability that the input candidate text region is a positive sample through classification branches, providing a more accurate candidate text region, calculating the offset of the candidate text region relative to a real text region through regression branches, and adjusting the position of the candidate text region; the multidirectional rectangle detection network is composed of 3 full-connection layers FC1, FC2 and FC3, and outputs a prediction vector with dimension 5, which respectively represents the offset of the center of a candidate text region from the center of a minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle and the rotation angle of the minimum circumscribed rectangle. The boundary point detection network is composed of 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and a full-connection layer, and outputs a vector with dimension of 28, wherein the vector respectively represents the offset of 7 boundary points of the upper boundary and the lower boundary of the text example; the attention-based sequence recognition network is composed of three convolutional layers and an attention-based model, and the attention model outputs probability distribution of predicted characters at each step.
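As an illustration of the two prediction heads just described, the following is a minimal PyTorch-style sketch: the multidirectional rectangle head uses three fully connected layers to output 5 values, and the boundary point head uses four convolutional layers plus one fully connected layer to output 28 values (2 offsets for each of 14 boundary points). The hidden widths and channel counts are assumptions made only for illustration; the patent text does not specify them.

```python
import torch
import torch.nn as nn

class OrientedRectHead(nn.Module):
    """3 FC layers -> 5 values: (dx, dy, dh, dw, theta) of the minimum rotated rect."""
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024):  # sizes are assumptions
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 5)

    def forward(self, roi_feat):                 # roi_feat: (N, 256, 7, 7)
        x = roi_feat.flatten(1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)                       # (N, 5)

class BoundaryPointHead(nn.Module):
    """4 conv layers + 1 FC layer -> 28 values: (dx_i, dy_i) for 2*K = 14 points."""
    def __init__(self, in_ch=256, num_points=14):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
            for _ in range(4)
        ])
        self.fc = nn.Linear(in_ch * 7 * 7, 2 * num_points)

    def forward(self, roi_feat):                 # roi_feat: (N, 256, 7, 7)
        x = self.convs(roi_feat)
        return self.fc(x.flatten(1))             # (N, 28)

head = OrientedRectHead()
bp_head = BoundaryPointHead()
roi = torch.randn(2, 256, 7, 7)
print(head(roi).shape, bp_head(roi).shape)   # torch.Size([2, 5]) torch.Size([2, 28])
```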
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and the upper and lower boundary points of each text instance on the original image according to the labeled standard training set and the feature maps, and providing training labels for the region extraction network, the multidirectional rectangle detection network and the boundary point detection network respectively: for the labeled standard training set I_tr, the ground-truth label of an input picture I_tri contains polygons P = {P_1, P_2 … P_m} representing the text regions and character strings S = {s_1, s_2 … s_m} representing the text content, where P_i is the polygonal bounding box of a text region in picture I_tri, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m is the number of polygonal text label boxes, and s_i is the text content of polygon P_i.
For a given standard dataset I_tr, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the smallest horizontal rectangular bounding box of the polygonal text label box, denoted G_d(x, y, h, w) and represented by the rectangle's center point (x, y), height h and width w. For the region extraction network, according to the labeled bounding boxes G_d(x, y, h, w), each pixel of each feature map output by the feature pyramid is mapped back onto the original image, and a number of initial bounding boxes are generated from the candidate text regions predicted by the region extraction network. The Jaccard coefficient of each initial bounding box Q_0 with respect to the labeled bounding boxes G_d is then computed. If the Jaccard coefficients between Q_0 and all labeled bounding boxes G_d are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e. there is at least one labeled bounding box G_d whose Jaccard coefficient with Q_0 is not less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the maximum Jaccard coefficient according to the following formulas:

x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)

where x_0, y_0 are the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0, h_0 are the width and height of Q_0, Δx, Δy are the horizontal and vertical offsets of the center point of Q_0 relative to the center point of G_d, and exp is the exponential function. The training label of the region extraction network is then:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
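To make the label computation concrete, the sketch below inverts the decode formulas above to obtain the regression offsets for one matched pair of boxes; the inversion is an assumption that is consistent with the stated formulas.

```python
import math

def encode_rpn_target(q0, gd):
    """q0, gd: horizontal boxes as (cx, cy, w, h).
    Returns (dx, dy, dw, dh) such that the decode formulas
    x = x0 + w0*dx, y = y0 + h0*dy, w = w0*exp(dw), h = h0*exp(dh)
    recover gd from q0."""
    x0, y0, w0, h0 = q0
    x, y, w, h = gd
    dx = (x - x0) / w0
    dy = (y - y0) / h0
    dw = math.log(w / w0)
    dh = math.log(h / h0)
    return dx, dy, dw, dh

# Example: an initial box centered at (100, 50) with size 64x32 matched to a
# labeled box centered at (108, 54) with size 80x40.
print(encode_rpn_target((100, 50, 64, 32), (108, 54, 80, 40)))
```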
For the multidirectional rectangle detection network, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the minimum multidirectional rectangular bounding box of the polygonal text label box, denoted G_rotate(x, y, h, w, θ) and represented by the rectangle's center point (x, y), height h, width w and rotation angle θ. Let the candidate text region refined by the region extraction network be G_rpn(x_rpn, y_rpn, w_rpn, h_rpn). The predicted position offsets are computed from the following formulas:

x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)

From these formulas, the training label of the multidirectional rectangle detection network is obtained as:

gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
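One practical way to obtain the minimum multidirectional rectangle G_rotate from a labeled polygon is OpenCV's minAreaRect, sketched below purely for illustration (OpenCV's angle convention differs between versions and may need remapping to the θ convention used here).

```python
import numpy as np
import cv2

def polygon_to_rotated_rect(polygon_pts):
    """polygon_pts: list of (x, y) vertices of the labeled text polygon.
    Returns (cx, cy, w, h, theta) of the minimum-area rotated rectangle."""
    pts = np.asarray(polygon_pts, dtype=np.float32)
    (cx, cy), (w, h), theta = cv2.minAreaRect(pts)
    return cx, cy, w, h, theta

# Example with a slightly slanted quadrilateral label box.
poly = [(10, 20), (110, 30), (108, 60), (8, 50)]
print(polygon_to_rotated_rect(poly))
```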
For the boundary point detection network, the training labels are computed as follows:

a. Setting default boundary points: based on the detected multidirectional rectangular bounding box G_rotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box G_horizon(x, y, h, w), and K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, giving an upper and a lower default boundary point sequence P_du = {p_1, p_2 … p_K} and P_dd = {p_1, p_2 … p_K}, with P_d = P_du ∪ P_dd.

b. Generating target boundary points:
a) The polygon P is first split along its long sides into two point sets P_1 = {p_1, p_2 … p_l} and P_2 = {p_l+1, …, p_m}, where p denotes a vertex of the polygon.
b) From P_1 and P_2, the boundary points of the upper and lower boundaries are generated: P_tu = {p_1, p_2 … p_K} and P_td = {p_1, p_2 … p_K}, with P_t = P_tu ∪ P_td.

c. The training label gt_bp = {(Δx_i, Δy_i) | i ∈ [0, 2K)} is computed as

Δx_i = x_i^t - x_i^d,  Δy_i = y_i^t - y_i^d

where (x_i^t, y_i^t) and (x_i^d, y_i^d) denote the coordinates of the i-th target boundary point and the i-th default boundary point respectively.

For the attention-based sequence recognition network, each text instance in the input image is annotated with a corresponding character string of length n, s_i = (c_0, c_1, …, c_(n-1)) with c_i ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}, describing the text content. The training label of the recognition network is gt_recog = (onehot(c_0), onehot(c_1), …, onehot(c_(n-1))), where onehot(c_i) denotes converting character c_i into one-hot form. Combining the above, the final training label is gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
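The sketch below illustrates steps a to c for the boundary point labels: K default points sampled at equal intervals on each long side of the horizontal box, and offsets computed between target and default points. The plain-difference form of the offsets and the NumPy layout are assumptions made only to be consistent with the notation above.

```python
import numpy as np

def default_boundary_points(x, y, w, h, K=7):
    """Equally spaced default points on the top and bottom edges of the
    horizontal box G_horizon given by center (x, y), width w, height h."""
    xs = np.linspace(x - w / 2, x + w / 2, K)
    top = np.stack([xs, np.full(K, y - h / 2)], axis=1)     # P_du
    bottom = np.stack([xs, np.full(K, y + h / 2)], axis=1)  # P_dd
    return np.concatenate([top, bottom], axis=0)            # P_d, shape (2K, 2)

def boundary_point_targets(target_pts, default_pts):
    """gt_bp = {(dx_i, dy_i)}: offsets of target points w.r.t. default points."""
    return (np.asarray(target_pts) - np.asarray(default_pts)).reshape(-1)

P_d = default_boundary_points(x=64, y=16, w=128, h=32, K=7)
P_t = P_d + np.random.uniform(-3, 3, P_d.shape)    # stand-in for real target points
print(boundary_point_targets(P_t, P_d).shape)       # (28,)
```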
(1.2.3) Taking the standard training dataset I_tr as the input of the recognition network model, features are extracted with the feature pyramid network module: the images of the standard training dataset I_tr are fed into the bottom-up ResNet-50 structure of the feature pyramid network, the convolutional layer units that do not change the feature map size are defined as one level (levels {P2, P3, P4, P5, P6}), and the final output convolutional feature F of each level is extracted. The top-down connections of the feature pyramid module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connections fuse the feature of each level upsampled in the top-down pass with the feature generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}, as shown in fig. 3.
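A compact sketch of the lateral and top-down fusion just described is given below; the channel counts and the use of nearest-neighbour upsampling are assumptions, and the backbone features C2 to C5 would come from the ResNet-50 stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Top-down + lateral fusion over backbone features, as in a feature pyramid."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                 # feats: [C2, C3, C4, C5], high to low res
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):   # top-down pathway
            up = F.interpolate(laterals[i + 1], size=laterals[i].shape[-2:],
                               mode="nearest")
            laterals[i] = laterals[i] + up
        outs = [s(l) for s, l in zip(self.smooth, laterals)]          # F2..F5
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))  # extra F6 level
        return outs

fpn = MiniFPN()
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048),
                                                 (64, 32, 16, 8))]
print([o.shape for o in fpn(feats)])
```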
(1.2.4) The features extracted by the feature pyramid network are fed into the region extraction network, anchors are assigned, the feature maps are adjusted with the region-of-interest alignment method, and candidate text boxes are generated: for an input picture I_trk, the 5 stage features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network, and anchors are defined per stage {P2, P3, P4, P5, P6} with feature scales of 32², 64², 128², 256², 512² for the different stages and 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1} at each scale; in this way 25 candidate text boxes of different scales and aspect ratios {F_tr1, F_tr2, …, F_tr25}, denoted F_trp with subscript p = 1, …, 25, can be extracted. In the region extraction network, the probability P_rpn that each candidate text box is a correct text region bounding box is predicted by the classification branch, and the candidate text box offsets are predicted by the regression branch:

Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn)

The candidate text boxes predicted as correct text region bounding boxes are selected and fed to the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The correct text regions selected by the region extraction network are converted into candidate text regions of fixed 7 × 7 scale by the region-of-interest alignment operation, and the multidirectional rectangle prediction network predicts the multidirectional bounding box of the text instance within each fixed-scale candidate text region. Specifically, the multidirectional rectangle prediction network predicts Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), comprising 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network finally learns to predict the multidirectional bounding box of the text instance.
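To make the anchor configuration concrete, the sketch below enumerates the 25 scale and aspect-ratio combinations (one anchor area per pyramid level, five ratios per level) and converts them to anchor widths and heights; whether a ratio is read as width:height or height:width is an assumption.

```python
import math

AREAS = [32**2, 64**2, 128**2, 256**2, 512**2]   # one anchor area per level P2..P6
RATIOS = [1/5, 1/2, 1, 2, 5]                     # read here as width:height (assumed)

def anchor_shapes():
    """Yield (level, width, height) for the 25 anchor shapes."""
    for level, area in zip(("P2", "P3", "P4", "P5", "P6"), AREAS):
        for ratio in RATIOS:
            w = math.sqrt(area * ratio)
            h = math.sqrt(area / ratio)
            yield level, w, h

for level, w, h in anchor_shapes():
    print(f"{level}: {w:.1f} x {h:.1f}")
```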
(1.2.5) After the multidirectional bounding box of each text instance has been predicted by the multidirectional rectangle prediction network, a candidate text region of fixed 7 × 7 scale is generated by a rotated region-of-interest alignment operation. The boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network finally learns to predict the boundary points of the text instance.
(1.2.6) After the boundary points of each text instance have been predicted by the boundary point prediction network, a sampling grid is generated by the thin plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into a horizontal feature map of fixed 16 × 64 scale. This feature map is fed into the attention-based sequence recognition network to predict the text content. The recognition network consists of 3 convolutional layers and an RNN whose basic unit is the GRU. After the 3 convolutional layers the text feature resolution is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop character), each dimension taking a value in [0, 1] and the values summing to 1. The predicted probability distributions P_recog of all steps are combined with the beam search algorithm to predict the character sequence S_q.
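The sketch below illustrates the rectification step in a simplified form: a sampling grid is interpolated between corresponding upper and lower boundary points and the feature is resampled into a horizontal 16 × 64 map with grid_sample. A real thin plate spline fits a smooth warp through the control points; the linear interpolation used here is only a stand-in for it.

```python
import torch
import torch.nn.functional as F

def rectify_by_boundary_points(feat, top_pts, bottom_pts, out_h=16, out_w=64):
    """feat: (1, C, H, W) feature map; top_pts/bottom_pts: (K, 2) boundary points
    in normalized coordinates [-1, 1]. Returns a horizontal (1, C, out_h, out_w) map."""
    K = top_pts.shape[0]
    # Interpolate the K control points to out_w columns along the text direction.
    t = torch.linspace(0, K - 1, out_w)
    i0 = t.floor().long().clamp(max=K - 2)
    frac = (t - i0.float()).unsqueeze(1)
    top = top_pts[i0] * (1 - frac) + top_pts[i0 + 1] * frac        # (out_w, 2)
    bot = bottom_pts[i0] * (1 - frac) + bottom_pts[i0 + 1] * frac  # (out_w, 2)
    # Blend vertically from the upper boundary to the lower boundary.
    v = torch.linspace(0, 1, out_h).view(out_h, 1, 1)
    grid = top.unsqueeze(0) * (1 - v) + bot.unsqueeze(0) * v       # (out_h, out_w, 2)
    return F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)

feat = torch.randn(1, 256, 32, 32)
top = torch.stack([torch.linspace(-0.9, 0.9, 7), torch.full((7,), -0.5)], dim=1)
bot = torch.stack([torch.linspace(-0.9, 0.9, 7), torch.full((7,), 0.5)], dim=1)
print(rectify_by_boundary_points(feat, top, bot).shape)   # (1, 256, 16, 64)
```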
(1.2.7) Taking the training label gt computed in step (1.2.2) as the expected output of the network and the prediction labels from steps (1.2.4), (1.2.5) and (1.2.6) as the network prediction output, an objective loss function between the expected output and the prediction output is designed for the network model constructed in (1.2.1). The overall objective loss function is composed of the losses of the region extraction network, the multidirectional rectangle prediction network, the boundary point prediction network and the sequence recognition network, and is expressed as:

L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α1·L_or(Y_or) + α2·L_bp(Y_bp) + α3·L_recog(P_recog)

where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangle detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, and L_recog(P_recog) is the loss function of the sequence recognition network; α1, α2 and α3 are the weight coefficients of L_or, L_bp and L_recog respectively, and are all simply set to 1.

According to the designed overall objective loss function, the model is trained iteratively with the back-propagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, training first runs on the synthetic text dataset (SynthText) to obtain initial network parameters, and then continues on real datasets to fine-tune the network parameters.
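In a training step the overall objective is simply the weighted sum given above; a minimal sketch with placeholder per-branch loss values:

```python
def total_loss(l_rpn, l_or, l_bp, l_recog, a1=1.0, a2=1.0, a3=1.0):
    """Overall objective L = L_rpn + a1*L_or + a2*L_bp + a3*L_recog, all weights set to 1."""
    return l_rpn + a1 * l_or + a2 * l_bp + a3 * l_recog

# Placeholder per-branch loss values from one batch.
print(total_loss(0.42, 0.17, 0.25, 0.88))
```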
(2) The character recognition is carried out on the text picture to be recognized by utilizing the trained model, and the character recognition method comprises the following substeps:
(2.1) The features extracted from the scene text picture to be detected and recognized are fed in turn into the region extraction network and the multidirectional rectangle detection network to generate multidirectional candidate text regions, and non-maximum suppression is applied to filter them and obtain more accurate multidirectional candidate text regions: a picture I_tstk of the dataset I_tst to be detected is fed into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions. Because the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap each other, non-maximum suppression is then applied to the positions of all positive text quadrilaterals, with the following steps: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression (NMS) with a Jaccard coefficient threshold of 0.2 is applied to the boxes kept in the previous step, yielding the finally retained positive text quadrilateral bounding boxes. Fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and fed into the multidirectional rectangle prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ). The predicted multidirectional text bounding box is computed from the predicted center coordinates, width, height and rotation angle of the multidirectional rectangle; the multidirectional text features are rotated into horizontal features according to the predicted multidirectional text bounding boxes and fed into the boundary point detection network, which predicts the regression amounts Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries. Combining the 14 preset default boundary points, the boundary point coordinates within the horizontal box are computed with the formula in (1.2.2), and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the boundary point positions in the original image.
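The score filtering and NMS described in the two numbered steps above can be sketched as follows; axis-aligned IoU is used for simplicity, whereas the actual boxes are quadrilaterals, so this is an approximation for illustration only.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_and_nms(boxes, scores, score_thr=0.5, iou_thr=0.2):
    """Keep boxes with score >= 0.5, then greedy NMS at IoU threshold 0.2."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[0, 0, 100, 40], [5, 2, 102, 42], [200, 50, 280, 90]], float)
scores = np.array([0.92, 0.81, 0.40])
print(filter_and_nms(boxes, scores))   # second box suppressed, third below threshold
```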
(2.2) According to the boundary points of the text instance predicted in step (2.1), a sampling grid is generated with the thin plate spline interpolation algorithm and the text features of arbitrary shape are rectified into horizontal ones. The rectified text feature resolution is 16 × 64, and the feature map is fed into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, …, p_(N-1)}, where p_i denotes the probability distribution predicted by the RNN at each step, with dimension 63, and N denotes the maximum step size of the RNN, taken as 35. During testing, prediction stops as soon as the prediction at step k is the stop character, so the finally predicted probability distribution sequence is {p_0, p_1, …, p_(k-1)}. According to these probability distributions, the class of maximum probability at each step is the character predicted for that step, and the predicted character sequence S_q is finally obtained.
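The per-step decoding just described (take the class of maximum probability at each step and stop at the stop character) can be sketched as below; the ordering of the 62 characters in the alphabet is an assumption.

```python
import numpy as np

# 62 characters followed by a stop symbol; the ordering is an assumption.
ALPHABET = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
STOP = 62

def greedy_decode(prob_seq):
    """prob_seq: (N, 63) per-step probability distributions output by the RNN.
    Returns the predicted character sequence, stopping at the stop symbol."""
    chars = []
    for p in prob_seq:
        k = int(np.argmax(p))
        if k == STOP:
            break
        chars.append(ALPHABET[k])
    return "".join(chars)

# Toy example: three confident steps ('c', 'a', 't') followed by the stop symbol.
steps = np.full((4, 63), 1e-4)
for t, k in enumerate([12, 10, 29, STOP]):
    steps[t, k] = 1.0
print(greedy_decode(steps))   # -> "cat"
```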
Through the technical scheme, compared with the prior art, the invention has the following technical effects:
(1) the accuracy is high: aiming at the problem of recognizing texts in any shapes in scene texts, the method converts the texts in any shapes into horizontal texts by predicting boundary points of the texts, and more accurately detects the text positions and recognizes the texts.
(2) The speed is high: the detection and recognition model provided by the invention has the advantages that the detection and recognition accuracy is ensured, the training speed is high, iterative training is not needed, and the whole network can be trained end to end.
(3) The universality is strong: the invention discloses an end-to-end trainable text detection and recognition model, which can not only simultaneously detect and recognize texts, but also process texts in various shapes without marking at a character level, including horizontal, directional and curved texts;
(4) The robustness is strong: the invention can cope with changes in text scale and shape, and can simultaneously detect and recognize horizontal, oriented and curved text.
Drawings
FIG. 1 is a flowchart of a method for recognizing a scene text end-to-end based on boundary point detection according to the present invention, in which a solid arrow represents training and a dotted arrow represents testing;
FIG. 2 is a diagram of an end-to-end recognition network model for scene text based on boundary point detection according to the present invention;
FIG. 3 is a schematic diagram of a network structure of a feature pyramid structure module in an end-to-end scene text recognition model based on boundary point detection according to the present invention;
FIG. 4 is a diagram of a sequence recognition network structure based on attention mechanism in a scene text end-to-end recognition model based on boundary point detection according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms of the present invention are explained and explained first:
ResNet-50: a neural network for classification, consisting mainly of 50 convolutional layers together with pooling layers and shortcut connections. The convolutional layers extract picture features; the pooling layers reduce the dimensionality of the feature vectors output by the convolutional layers and reduce overfitting; the shortcut connections propagate gradients and alleviate the vanishing and exploding gradient problems. The network parameters can be updated with the back-propagation algorithm;
Region extraction network: a network for generating candidate text regions. A sliding window is used on the extracted feature map to produce fully connected features of a specific dimension, from which two fully connected branches classify and regress candidate text regions; finally, candidate text regions of different scales and aspect ratios are generated for the subsequent networks according to the different anchors and ratios.
Jaccard coefficient: the Jaccard coefficient measures the similarity and difference between finite sample sets. In the field of text detection the Jaccard coefficient is by default equal to the IoU (Intersection over Union), i.e. the intersection area divided by the union area of two boxes, and describes the overlap between a predicted text box produced by the model and the original labeled text box; the larger the IoU, the higher the overlap and the more accurate the detection.
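For the quadrilateral text boxes used here, the Jaccard coefficient can be computed from polygon areas; the sketch below uses the shapely library (an assumed dependency, purely for illustration).

```python
from shapely.geometry import Polygon

def jaccard(quad_a, quad_b):
    """Jaccard coefficient (IoU) of two quadrilaterals given as 4 (x, y) vertices."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    inter = pa.intersection(pb).area
    union = pa.union(pb).area
    return inter / union if union > 0 else 0.0

a = [(0, 0), (100, 0), (100, 40), (0, 40)]
b = [(10, 5), (110, 5), (110, 45), (10, 45)]
print(jaccard(a, b))   # overlap ratio of the two label boxes
```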
Non-maximum suppression (NMS): a post-processing algorithm widely used in computer vision detection. According to a set threshold, overlapping detection boxes are filtered by iteratively sorting, traversing and discarding, removing redundant detection boxes to obtain the final detection result.
Thin plate spline interpolation algorithm (TPS): an interpolation method that finds a smooth surface of minimal bending energy passing through all control points. With this algorithm, text of arbitrary shape can be transformed into a horizontal shape with minimal distortion of the characters as a whole.
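As a small illustration of thin plate spline warping from control points, the sketch below fits a TPS mapping with SciPy's radial basis interpolator, treating equally spaced points on a horizontal rectangle and the detected boundary points as corresponding control points. This is a simplified illustration under assumed toy data, not the patented implementation.

```python
import numpy as np
from scipy.interpolate import Rbf

# Source: 2K detected boundary points (here a slightly curved toy example).
xs = np.linspace(0, 100, 7)
src = np.concatenate([np.stack([xs, 10 + 5 * np.sin(xs / 30)], 1),   # upper boundary
                      np.stack([xs, 40 + 5 * np.sin(xs / 30)], 1)])  # lower boundary
# Target: the same points laid out on a horizontal 16 x 64 rectangle.
txs = np.linspace(0, 64, 7)
dst = np.concatenate([np.stack([txs, np.zeros(7)], 1),
                      np.stack([txs, np.full(7, 16.0)], 1)])

# Thin-plate-spline mapping from rectified coordinates back to image coordinates.
fx = Rbf(dst[:, 0], dst[:, 1], src[:, 0], function='thin_plate')
fy = Rbf(dst[:, 0], dst[:, 1], src[:, 1], function='thin_plate')
print(fx(32, 8), fy(32, 8))   # image location sampled for the rectified center pixel
```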
As shown in fig. 1, the method for recognizing a scene text end-to-end based on boundary point detection of the present invention includes the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level labeling on texts in any shapes of all pictures in an original data set, wherein labels are the clockwise vertex coordinates of polygons of text bounding boxes in word level and word character sequences of the texts, and obtaining a standard training data set with labels;
and (1.2) defining a scene text end-to-end identification network model based on boundary point detection, wherein the model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism. Calculating a training label according to the standard training data set with the label in the step (1.1), designing a loss function, and training the scene text end-to-end recognition network based on the boundary point detection by using a reverse conduction method to obtain a scene text end-to-end recognition network model based on the boundary point detection; the method specifically comprises the following substeps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism, as shown in fig. 2; the feature pyramid structure network is shown in fig. 3, and is formed by adding a bottom-up connection, a top-down connection and a transverse connection to a base network of a ResNet-50 deep convolutional neural network, and is used for extracting features fused with different resolutions from an input standard data set picture; inputting the extracted features of different scales into a region extraction network to obtain a candidate text region, and after the alignment operation of the region of interest, obtaining the candidate text region of a fixed scale; inputting a candidate text region with the resolution of 7 multiplied by 7 extracted by a region extraction network into a rapid region classification regression network, predicting the probability that the input candidate text region is a positive sample through classification branches, providing a more accurate candidate text region, calculating the offset of the candidate text region relative to a real text region through regression branches, and adjusting the position of the candidate text region; the multidirectional rectangle detection network is composed of 3 full-connection layers FC1, FC2 and FC3, and outputs a prediction vector with dimension 5, which respectively represents the offset of the center of a candidate text region from the center of a minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle and the rotation angle of the minimum circumscribed rectangle. The boundary point detection network is composed of 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and a full-connection layer, and outputs a vector with dimension of 28, wherein the vector respectively represents the offset of 7 boundary points of the upper boundary and the lower boundary of the text example; the attention-based sequence recognition network is shown in fig. 4 and is composed of three convolutional layers and an attention-based model, and the attention-based model outputs a probability distribution of a predicted character at each step.
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and the upper and lower boundary points of each text instance on the original image according to the labeled standard training set and the feature maps, and providing training labels for the region extraction network, the multidirectional rectangle detection network and the boundary point detection network respectively: for the labeled standard training set I_tr, the ground-truth label of an input picture I_tri contains polygons P = {P_1, P_2 … P_m} representing the text regions and character strings S = {s_1, s_2 … s_m} representing the text content, where P_i is the polygonal bounding box of a text region in picture I_tri, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m is the number of polygonal text label boxes, and s_i is the text content of polygon P_i.
For a given standard dataset I_tr, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the smallest horizontal rectangular bounding box of the polygonal text label box, denoted G_d(x, y, h, w) and represented by the rectangle's center point (x, y), height h and width w. For the region extraction network, according to the labeled bounding boxes G_d(x, y, h, w), each pixel of each feature map output by the feature pyramid is mapped back onto the original image, and a number of initial bounding boxes are generated from the candidate text regions predicted by the region extraction network. The Jaccard coefficient of each initial bounding box Q_0 with respect to the labeled bounding boxes G_d is then computed. If the Jaccard coefficients between Q_0 and all labeled bounding boxes G_d are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e. there is at least one labeled bounding box G_d whose Jaccard coefficient with Q_0 is not less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the maximum Jaccard coefficient according to the following formulas:

x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)

where x_0, y_0 are the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0, h_0 are the width and height of Q_0, Δx, Δy are the horizontal and vertical offsets of the center point of Q_0 relative to the center point of G_d, and exp is the exponential function. The training label of the region extraction network is then:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
For the multidirectional rectangle detection network, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the minimum multidirectional rectangular bounding box of the polygonal text label box, denoted G_rotate(x, y, h, w, θ) and represented by the rectangle's center point (x, y), height h, width w and rotation angle θ. Let the candidate text region refined by the region extraction network be G_rpn(x_rpn, y_rpn, w_rpn, h_rpn). The predicted position offsets are computed from the following formulas:

x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)

From these formulas, the training label of the multidirectional rectangle detection network is obtained as:

gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
For the boundary point detection network, the training labels are computed as follows:

a. Setting default boundary points: according to the detected multidirectional rectangular bounding box G_rotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box G_horizon(x, y, h, w), and K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, giving an upper and a lower default boundary point sequence P_du = {p_1, p_2 … p_K} and P_dd = {p_1, p_2 … p_K}, with P_d = P_du ∪ P_dd.

b. Generating target boundary points:
a) The polygon P is first split along its long sides into two point sets P_1 = {p_1, p_2 … p_l} and P_2 = {p_l+1, …, p_m}, where p denotes a vertex of the polygon.
b) P_1 and P_2 are fed into Algorithm 1, which generates the boundary points of the upper and lower boundaries: P_tu = {p_1, p_2 … p_K} and P_td = {p_1, p_2 … p_K}, with P_t = P_tu ∪ P_td.

c. The training label gt_bp = {(Δx_i, Δy_i) | i ∈ [0, 2K)} is computed as

Δx_i = x_i^t - x_i^d,  Δy_i = y_i^t - y_i^d

where (x_i^t, y_i^t) and (x_i^d, y_i^d) denote the coordinates of the i-th target boundary point and the i-th default boundary point respectively.

For the attention-based sequence recognition network, each text instance in the input image is annotated with a corresponding character string of length n, s_i = (c_0, c_1, …, c_(n-1)) with c_i ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}, describing the text content. The training label of the recognition network is gt_recog = (onehot(c_0), onehot(c_1), …, onehot(c_(n-1))), where onehot(c_i) denotes converting character c_i into one-hot form. Combining the above, the final training label is gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
(1.2.3) Taking the standard training dataset I_tr as the input of the recognition network model, features are extracted with the feature pyramid network module: the images of the standard training dataset I_tr are fed into the bottom-up ResNet-50 structure of the feature pyramid network, the convolutional layer units that do not change the feature map size are defined as one level (levels {P2, P3, P4, P5, P6}), and the final output convolutional feature F of each level is extracted. The top-down connections of the feature pyramid module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connections fuse the feature of each level upsampled in the top-down pass with the feature generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}, as shown in fig. 3.
(1.2.4) The features extracted by the feature pyramid network are fed into the region extraction network, anchors are assigned, the feature maps are adjusted with the region-of-interest alignment method, and candidate text boxes are generated: for an input picture I_trk, the 5 stage features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network, and anchors are defined per stage {P2, P3, P4, P5, P6} with feature scales of 32², 64², 128², 256², 512² for the different stages and 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1} at each scale; in this way 25 candidate text boxes of different scales and aspect ratios {F_tr1, F_tr2, …, F_tr25}, denoted F_trp with subscript p = 1, …, 25, can be extracted. In the region extraction network, the probability P_rpn that each candidate text box is a correct text region bounding box is predicted by the classification branch, and the candidate text box offsets are predicted by the regression branch:

Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn)

The candidate text boxes predicted as correct text region bounding boxes are selected and fed to the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The correct text regions selected by the region extraction network are converted into candidate text regions of fixed 7 × 7 scale by the region-of-interest alignment operation, and the multidirectional rectangle prediction network predicts the multidirectional bounding box of the text instance within each fixed-scale candidate text region. Specifically, the multidirectional rectangle prediction network predicts Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), comprising 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network finally learns to predict the multidirectional bounding box of the text instance.
(1.2.5) After the multidirectional bounding box of each text instance has been predicted by the multidirectional rectangle prediction network, a candidate text region of fixed 7 × 7 scale is generated by a rotated region-of-interest alignment operation. The boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network finally learns to predict the boundary points of the text instance.
(1.2.6) After the boundary points of each text instance have been predicted by the boundary point prediction network, a sampling grid is generated by the thin plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into a horizontal feature map of fixed 16 × 64 scale. This feature map is fed into the attention-based sequence recognition network to predict the text content. As shown in fig. 4, the recognition network consists of 3 convolutional layers and an RNN whose basic unit is the GRU. After the 3 convolutional layers the text feature resolution is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop character), each dimension taking a value in [0, 1] and the values summing to 1. The predicted probability distributions P_recog of all steps are combined with the beam search algorithm to predict the character sequence S_q.
(1.2.7) Taking the training label gt computed in step (1.2.2) as the expected output of the network and the prediction labels from steps (1.2.4), (1.2.5) and (1.2.6) as the network prediction output, an objective loss function between the expected output and the prediction output is designed for the network model constructed in (1.2.1). The overall objective loss function is composed of the losses of the region extraction network, the multidirectional rectangle prediction network, the boundary point prediction network and the sequence recognition network, and is expressed as:

L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α1·L_or(Y_or) + α2·L_bp(Y_bp) + α3·L_recog(P_recog)

where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangle detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, and L_recog(P_recog) is the loss function of the sequence recognition network; α1, α2 and α3 are the weight coefficients of L_or, L_bp and L_recog respectively, and are all simply set to 1.

According to the designed overall objective loss function, the model is trained iteratively with the back-propagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, training first runs on the synthetic text dataset (SynthText) to obtain initial network parameters, and then continues on real datasets to fine-tune the network parameters.
The character recognition is carried out on the text picture to be recognized by utilizing the trained model, and the character recognition method comprises the following substeps:
(2.1) The features extracted from the scene text picture to be detected and recognized are fed in turn into the region extraction network and the multidirectional rectangle detection network to generate multidirectional candidate text regions, and non-maximum suppression is applied to filter them and obtain more accurate multidirectional candidate text regions: a picture I_tstk of the dataset I_tst to be detected is fed into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions. Because the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap each other, non-maximum suppression is then applied to the positions of all positive text quadrilaterals, with the following steps: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression (NMS) with a Jaccard coefficient threshold of 0.2 is applied to the boxes kept in the previous step, yielding the finally retained positive text quadrilateral bounding boxes. Fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and fed into the multidirectional rectangle prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ). The predicted multidirectional text bounding box is computed from the predicted center coordinates, width, height and rotation angle of the multidirectional rectangle; the multidirectional text features are rotated into horizontal features according to the predicted multidirectional text bounding boxes and fed into the boundary point detection network, which predicts the regression amounts Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries. Combining the 14 preset default boundary points, the boundary point coordinates within the horizontal box are computed with the formula in (1.2.2), and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the boundary point positions in the original image.
(2.2) According to the boundary points of the text instance predicted in step (2.1), a sampling grid is generated with the thin plate spline interpolation algorithm and the text features of arbitrary shape are rectified into horizontal ones. The rectified text feature resolution is 16 × 64, and the feature map is fed into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, …, p_(N-1)}, where p_i denotes the probability distribution predicted by the RNN at each step, with dimension 63, and N denotes the maximum step size of the RNN, taken as 35. During testing, prediction stops as soon as the prediction at step k is the stop character, so the finally predicted probability distribution sequence is {p_0, p_1, …, p_(k-1)}. According to these probability distributions, the class of maximum probability at each step is the character predicted for that step, and the predicted character sequence S_q is finally obtained.
[Algorithm 1, referenced in step b) above, is given as an image in the original document.]

Claims (10)

1. A scene text end-to-end identification method based on boundary point detection is characterized by comprising the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level labeling on texts in any shapes of all pictures in an original data set, wherein labels are the clockwise vertex coordinates of polygons of text bounding boxes in word level and word character sequences of the texts, and obtaining a standard training data set with labels;
(1.2) defining a scene text end-to-end recognition network model based on boundary point detection, calculating a training label according to (1.1) a standard training data set with labels, designing a loss function, and training the scene text end-to-end recognition network based on boundary point detection by using a reverse conduction method to obtain the scene text end-to-end recognition network model based on boundary point detection; the method comprises the following steps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism;
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and upper and lower boundary points of a text example on an original image according to a standard training set with labels and a characteristic diagram, and respectively providing training labels for the area extraction network, the multidirectional rectangular detection network, the boundary point detection network and a sequence identification network based on an attention mechanism;
(1.2.3) taking the standard training dataset I_tr as the input of the recognition network model and extracting features with the feature pyramid network module;
(1.2.4) inputting the features extracted by the feature pyramid network into a region extraction network, and generating a candidate text box by using a region-of-interest alignment method to adjust a feature map through anchor point distribution; generating a candidate text region with a fixed scale of 7 multiplied by 7 by a correct text region selected by the region extraction network through region-of-interest alignment operation, and predicting a multidirectional bounding box of a text example in the candidate text region with the fixed scale by a multidirectional rectangular prediction network;
(1.2.5) after a multidirectional bounding box of each text example is predicted by the multidirectional rectangular prediction network, generating a candidate text region with a fixed scale of 7 x 7 through a rotary region-of-interest alignment operation, and finally learning and predicting boundary points of the text examples by the network;
(1.2.6) after the boundary points of each text instance have been predicted by the boundary point prediction network, generating a sampling grid with a thin plate spline interpolation algorithm, rectifying the text features of arbitrary shape into a horizontal feature map of fixed 16 × 64 scale, feeding the feature map into the attention-based sequence recognition network to predict the text content, and predicting the character sequence S_q from the prediction probability distributions P_recog of all steps;
(1.2.7) taking the training label gt as the expected output of the network and the prediction labels as the network prediction output, and designing an objective loss function between the expected output and the prediction output for the constructed network model;
(2) the character recognition is carried out on the text picture to be recognized by utilizing the trained model, and the character recognition method comprises the following substeps:
(2.1) sequentially inputting the extracted features of the text picture of the scene to be detected and identified into a region extraction network and a multidirectional rectangular detection network to generate a multidirectional candidate text region, and filtering the multidirectional candidate text region by carrying out non-maximum suppression operation to obtain a more accurate multidirectional candidate text region; rotating the multi-directional text characteristics into horizontal characteristics according to the predicted multi-directional text bounding boxes, and inputting the horizontal characteristics into a boundary point detection network; calculating coordinates of the boundary points in the horizontal frame by using a formula in (1.2.2) in combination with 14 preset default boundary points, and then rotating the predicted coordinates of the boundary points counterclockwise by theta by using the rotation angle of the multidirectional rectangle predicted in (2.1) to obtain the positions of the boundary points in the original image;
(2.2) generating a sampling grid with the thin-plate spline interpolation algorithm according to the boundary points of the text instance predicted in step (2.1), rectifying the text features of arbitrary shape into a horizontal shape, inputting the feature map into the sequence recognition network to obtain a probability distribution sequence, taking the class with the maximum probability at each step as the current predicted character according to the probability distributions, and finally obtaining the predicted character sequence S_q.
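As an illustration only (not the patented implementation), the counterclockwise rotation of boundary points by θ mentioned in step (2.1) can be sketched as a plain coordinate rotation about the rectangle centre; the function name, the use of radians and the toy numbers below are assumptions.

```python
import numpy as np

def rotate_points_ccw(points, center, theta):
    """Rotate an (N, 2) array of (x, y) points counterclockwise by theta
    (radians) around `center`, e.g. to map boundary points predicted in the
    horizontal frame back onto the original, rotated text instance."""
    points = np.asarray(points, dtype=np.float64)
    cx, cy = center
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # Standard 2-D rotation about an arbitrary centre.
    shifted = points - np.array([cx, cy])
    rotated = np.stack([
        cos_t * shifted[:, 0] - sin_t * shifted[:, 1],
        sin_t * shifted[:, 0] + cos_t * shifted[:, 1],
    ], axis=1)
    return rotated + np.array([cx, cy])

# Example: 14 boundary points in the horizontal frame, box centre (100, 50),
# predicted rotation angle of 30 degrees.
pts_horizontal = np.random.rand(14, 2) * [128, 32] + [36, 34]
pts_original = rotate_points_ccw(pts_horizontal, center=(100, 50),
                                 theta=np.deg2rad(30))
```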
2. The method for end-to-end recognition of scene text based on boundary point detection according to claim 1, wherein the scene text end-to-end recognition network model based on boundary point detection in step (1.2.1) is specifically:
the scene text end-to-end recognition network model based on boundary point detection is composed of a feature pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and an attention-based sequence recognition network; the feature pyramid structure network takes a ResNet-50 deep convolutional neural network as its base network and adds bottom-up connections, top-down connections and lateral connections, and is used for extracting and fusing features of different resolutions from the input standard data set pictures; the extracted features of different scales are input into the region extraction network to obtain candidate text regions, and candidate text regions of a fixed scale are obtained after the region-of-interest alignment operation; the 7 × 7 candidate text regions extracted by the region extraction network are input into a fast region classification and regression network, whose classification branch predicts the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and whose regression branch computes the offset of the candidate text region relative to the real text region and adjusts the position of the candidate text region; the multidirectional rectangular detection network comprises 3 fully connected layers FC1, FC2 and FC3 and outputs a prediction vector of dimension 5, representing respectively the offset of the center of the candidate text region from the center of the minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle, and the rotation angle of the minimum circumscribed rectangle; the boundary point detection network comprises 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and one fully connected layer, and outputs a vector of dimension 28 representing the offsets of the 7 boundary points on each of the upper and lower boundaries of the text instance; the attention-based sequence recognition network is composed of three convolutional layers and an attention model, and the attention model outputs the probability distribution of the predicted character at each step.
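A minimal PyTorch sketch of the two prediction heads described above follows; only the output dimensions (5 and 28), the 3-FC and 4-conv-plus-FC layouts and the 7 × 7 RoI size come from the claim, while the channel widths, hidden sizes and activation choices are assumptions.

```python
import torch
import torch.nn as nn

class OrientedRectHead(nn.Module):
    """3 fully connected layers ending in a 5-dim vector:
    (dx, dy, dw, dh, theta) of the minimum circumscribed rectangle."""
    def __init__(self, in_features=7 * 7 * 256, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 5)

    def forward(self, x):              # x: (N, in_features) pooled RoI feature
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)             # (N, 5)

class BoundaryPointHead(nn.Module):
    """4 convolutional layers followed by one FC layer producing a 28-dim
    vector: x/y offsets of 7 upper and 7 lower boundary points."""
    def __init__(self, in_channels=256, roi_size=7):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_channels, in_channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(4)
        ])
        self.fc = nn.Linear(in_channels * roi_size * roi_size, 28)

    def forward(self, x):              # x: (N, C, 7, 7) rotated RoI feature
        x = self.convs(x)
        return self.fc(x.flatten(1))   # (N, 28)
```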
3. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.2) is specifically as follows:
for the labeled standard training set I_tr, the ground-truth label of an input picture contains a polygon set P = {p_1, p_2, ..., p_m} representing the text regions and a character string set S = {s_1, s_2, ..., s_m} representing the text content; for an input picture I_tri, P_i is the polygonal bounding box of a text region in picture I_tri, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m denotes the number of polygonal text label boxes, and s_i is the text content inside polygon P_i;
for a given standard data set I_tr, the polygon P = {p_1, p_2, ..., p_m} in the data set label is first converted into the minimum horizontal rectangular bounding box of the polygonal text label box, represented by the center point (x, y), height h and width w of the rectangle as G_d(x, y, h, w); for the region extraction network, according to the labeled bounding boxes G_d(x, y, h, w) of the labeled data set, each pixel on each of the feature maps output by the feature pyramid is mapped back to the original image, a number of initial bounding boxes are generated according to the candidate text regions predicted by the region extraction network, and the Jaccard coefficient of each initial bounding box Q_0 with respect to the labeled bounding boxes G_d is computed; when the Jaccard coefficients between all labeled bounding boxes G_d and the initial bounding box Q_0 are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e., when at least one labeled bounding box G_d has a Jaccard coefficient with Q_0 of not less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the maximum Jaccard coefficient according to the following formulas (an encode/decode sketch of this parameterisation follows this claim):
x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)
where x_0 and y_0 are respectively the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0 and h_0 are respectively the width and height of the initial bounding box Q_0, Δx and Δy are respectively the horizontal and vertical offsets of the center point of Q_0 relative to the center point of G_d, and exp denotes the exponential operation; the training label of the region extraction network is thus obtained as:
gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
for the multidirectional rectangular detection network, the polygon P = {p_1, p_2, ..., p_m} in the data set label is first converted into the minimum multidirectional rectangular bounding box of the polygonal text label box, represented by the center point (x, y), height h, width w and rotation angle θ of the rectangle as G_rotate(x, y, h, w, θ); the candidate text region corrected by the region extraction network is G_rpn(x_rpn, y_rpn, w_rpn, h_rpn), and the predicted position offsets are computed as follows:
x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)
the training label of the multidirectional rectangular detection network obtained by the formula is as follows:
gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
for the boundary point detection network, the training labels are calculated as follows:
a. setting default boundary points:
according to the detected multidirectional rectangular bounding box G_rotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box G_horizon(x, y, h, w); K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, giving an upper and a lower default boundary point sequence P_du = {p_1, p_2, ..., p_K} and P_dd = {p_1, p_2, ..., p_K}, with P_d = P_du ∪ P_dd;
b. generating target boundary points:
first, the polygon P is divided along the long side into two parts, P_1 = {p_1, p_2, ..., p_l} and P_2 = {p_(l+1), ..., p_m}, where p denotes a vertex of the polygon;
according to P_1 and P_2, boundary points of the upper and lower boundaries are generated: P_tu = {p_1, p_2, ..., p_K} and P_td = {p_1, p_2, ..., p_K}, with P_t = P_tu ∪ P_td;
c. calculating the training label gt_bp = {(Δx_i, Δy_i) | i ∈ [0, 2K)} according to the following formulas:
Δx_i and Δy_i are given by the formula images in the original claim, in which the two coordinate symbols denote respectively the coordinates of the i-th target boundary point and the coordinates of the i-th default boundary point;
for the attention-based sequence recognition network, each text instance in the input image is labeled with a character string of length n, s_i = {(c_0, c_1, ..., c_(n-1)) | c_i ∈ {0, 1, ..., 9, a, b, ..., z, A, B, ..., Z}}, describing the text content; the training label of the recognition network is gt_recog = (onehot(c_0), onehot(c_1), ..., onehot(c_(n-1))), where onehot(c_i) denotes converting the character c_i into one-hot encoded form;
the final training label is generated as: gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
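The horizontal-box offset parameterisation used above (x = x_0 + w_0·Δx, y = y_0 + h_0·Δy, w = w_0·exp(Δw), h = h_0·exp(Δh)) can be illustrated with a small encode/decode pair; this is a sketch of the stated formulas only, and the function names are ours.

```python
import numpy as np

def encode_offsets(anchor, target):
    """Anchor/target boxes given as (cx, cy, w, h); returns (dx, dy, dw, dh)
    such that decode_offsets(anchor, offsets) recovers the target box,
    matching x = x0 + w0*dx, y = y0 + h0*dy, w = w0*exp(dw), h = h0*exp(dh)."""
    x0, y0, w0, h0 = anchor
    x, y, w, h = target
    return np.array([(x - x0) / w0, (y - y0) / h0,
                     np.log(w / w0), np.log(h / h0)])

def decode_offsets(anchor, offsets):
    """Inverse of encode_offsets: apply predicted offsets to an anchor box."""
    x0, y0, w0, h0 = anchor
    dx, dy, dw, dh = offsets
    return np.array([x0 + w0 * dx, y0 + h0 * dy,
                     w0 * np.exp(dw), h0 * np.exp(dh)])

anchor = (50.0, 40.0, 64.0, 32.0)
gt_box = (58.0, 44.0, 80.0, 30.0)
assert np.allclose(decode_offsets(anchor, encode_offsets(anchor, gt_box)), gt_box)
```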
4. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.3) is specifically as follows:
the pictures of the standard training data set I_tr are input into the bottom-up ResNet-50 structure of the feature pyramid network; the groups of convolutional layers that do not change the size of the feature map are defined as stages {P2, P3, P4, P5, P6}, and the final output convolutional features F of each stage are extracted; the top-down connections in the feature pyramid network module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure in the feature pyramid network module fuses the features of each stage upsampled in the top-down pass with the features generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}.
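A hedged PyTorch sketch of the top-down and lateral fusion described in this claim follows; the 1 × 1 lateral and 3 × 3 output convolutions, the 256 output channels and nearest-neighbour upsampling follow the common feature-pyramid recipe and are assumptions rather than details taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Fuses bottom-up backbone features {C2..C5} into {F2..F5} via
    1x1 lateral convolutions and top-down nearest-neighbour upsampling."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3,
                                              padding=1) for _ in in_channels)

    def forward(self, feats):                      # feats: [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down pathway
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [conv(x) for conv, x in zip(self.output, laterals)]

# Toy usage with feature maps at strides 4, 8, 16, 32 of a 256x256 input.
feats = [torch.randn(1, c, s, s)
         for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
f2, f3, f4, f5 = SimpleFPN()(feats)
```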
5. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.4) is specifically as follows:
for input picture ItrkExtracting 5 stage features { F2, F3, F4, F5 and F6} through a feature pyramid network, and defining the feature scale of the anchor at different stages as {32 } according to stages { P2, P3, P4, P5 and P6}2,642,1282,2562,5122Each scale layer has 5 length-width ratios {1:5, 1: }2, 1:1, 2:1, 5:1 }; thus, 25 candidate text boxes { Ftr with different scales and proportions can be extracted1,Ftr2,…,Ftr25Is denoted as FtrpSubscript p ═ 1, …, 25; in the region extraction network, the probability that each candidate text box is a correct text region bounding box is predicted to be P through classificationrpnPredicting candidate textbox offsets by regression: y isrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn);
the candidate text boxes predicted as correct text region bounding boxes are selected and input to the subsequent multidirectional rectangular detection network, boundary point detection network and attention-based sequence recognition network; the multidirectional rectangular prediction network predicts the quantity Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), which comprises 4 predicted offsets and one predicted angle, and by computing the loss function and back-propagating it the network finally learns to predict the multidirectional bounding box of the text instance.
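The 25 anchor shapes of this claim (5 scales × 5 aspect ratios) can be sketched as follows; interpreting the scale s as an anchor area of s² and the ratio as width:height is our assumption, since the claim only lists the scales and ratios themselves.

```python
import numpy as np

scales = [32, 64, 128, 256, 512]                    # anchor area = scale**2
ratios = [(1, 5), (1, 2), (1, 1), (2, 1), (5, 1)]   # read as width:height

def anchor_shapes(scales, ratios):
    """Return the 25 (w, h) anchor shapes, preserving area = scale**2 for
    every aspect ratio (a common convention, not stated in the claim)."""
    shapes = []
    for s in scales:
        area = float(s * s)
        for rw, rh in ratios:
            w = np.sqrt(area * rw / rh)   # from w/h = rw/rh and w*h = area
            h = area / w
            shapes.append((w, h))
    return np.array(shapes)

print(anchor_shapes(scales, ratios).shape)   # (25, 2)
```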
6. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.5) is specifically as follows:
after the multidirectional rectangular prediction network predicts the multidirectional bounding box of each text instance, a candidate text region with a fixed scale of 7 × 7 is generated through the rotated region-of-interest alignment operation; the boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}, and by computing the loss function and back-propagating it the network finally learns to predict the boundary points of the text instance.
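A sketch of turning the 28 regression outputs of this claim into 14 boundary-point coordinates; since the offset formulas appear as formula images in the original filing, the sketch simply adds the offsets to the default points and assumes pixel units, which is an assumption rather than the patented definition.

```python
import numpy as np

def default_boundary_points(w, h, k=7):
    """K equally spaced default points on the top edge and K on the bottom
    edge of a horizontal w x h box, matching the sampling in claim 3 (a)."""
    xs = np.linspace(0, w, k)
    top = np.stack([xs, np.zeros(k)], axis=1)
    bottom = np.stack([xs, np.full(k, float(h))], axis=1)
    return np.concatenate([top, bottom], axis=0)      # (2K, 2)

def apply_boundary_offsets(defaults, offsets):
    """offsets: flat vector of length 2*2K = 28, ordered (dx_0, dy_0, dx_1, ...).
    Assumed to be in pixels here; the patent defines the exact normalisation."""
    return defaults + np.asarray(offsets, dtype=np.float64).reshape(-1, 2)

defaults = default_boundary_points(w=64, h=16)        # (14, 2)
points = apply_boundary_offsets(defaults, np.zeros(28))
```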
7. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.6) is specifically as follows:
a sampling grid is generated by the thin-plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into horizontal features with a fixed scale of 16 × 64; the recognition network is composed of 3 convolutional layers and an RNN whose basic unit is the GRU; after the 3 convolutional layers the resolution of the text features is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop symbol), in which each dimension takes a value in [0, 1] and the values sum to 1; the character sequence S_q is predicted by combining the predicted probability distributions P_recog of all steps with the beam search algorithm.
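This claim combines the per-step distributions P_recog with a beam search; below is a generic beam-search sketch over a fixed sequence of 63-way distributions, where the beam width and the summed log-probability scoring are our own choices rather than details from the patent.

```python
import numpy as np

def beam_search(prob_seq, stop_id, beam_width=5):
    """prob_seq: (T, V) per-step probability distributions.
    Returns the highest-scoring character-id sequence, truncated at stop_id.
    Scores are summed log-probabilities; beams that emit stop_id are frozen."""
    beams = [([], 0.0, False)]            # (ids, log_prob, finished)
    for probs in prob_seq:
        log_p = np.log(probs + 1e-12)
        candidates = []
        for ids, score, done in beams:
            if done:
                candidates.append((ids, score, True))
                continue
            for v in np.argsort(log_p)[-beam_width:]:
                candidates.append((ids + [int(v)], score + log_p[v],
                                   int(v) == stop_id))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    ids, _, _ = beams[0]
    return [i for i in ids if i != stop_id]

T, V = 10, 63
probs = np.random.dirichlet(np.ones(V), size=T)       # dummy distributions
print(beam_search(probs, stop_id=V - 1))
```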
8. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.7) is specifically as follows:
taking the training label gt calculated in step (1.2.2) as the expected output of the network and the prediction labels from steps (1.2.4), (1.2.5) and (1.2.6) as the network prediction output, an objective loss function between the expected output and the prediction output is designed for the network model constructed in (1.2.1); the overall objective loss function is composed of the region extraction network, multidirectional rectangular prediction network, boundary point prediction network and sequence recognition network terms, and its expression is: L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α_1·L_or(Y_or) + α_2·L_bp(Y_bp) + α_3·L_recog(P_recog), where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangular detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, L_recog(P_recog) is the loss function of the sequence recognition network, and α_1, α_2, α_3 are respectively the weight coefficients of the loss functions L_or, L_bp and L_recog;
according to the designed overall objective loss function, the model is iteratively trained with the back-propagation algorithm so as to minimize the overall objective loss function and obtain the optimal network model; for the scene text detection and recognition task, the training process first iterates on a synthetic text data set to obtain initial network parameters, and then trains on a real data set to fine-tune the network parameters.
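The weighted sum in this claim reduces to a one-line combination; the default weight values below are placeholders, not values disclosed by the patent.

```python
def total_loss(l_rpn, l_or, l_bp, l_recog, alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """L = L_rpn + a1*L_or + a2*L_bp + a3*L_recog, as in the claim above.
    The default weights of 1.0 are placeholders, not values from the patent."""
    return l_rpn + alpha1 * l_or + alpha2 * l_bp + alpha3 * l_recog
```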
9. The method for recognizing the scene text end to end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (2.1) is specifically as follows:
the k-th picture I_tstk of the data set I_tst to be detected is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions; since the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap one another, a non-maximum suppression operation is then performed on the positions of all positive text quadrilaterals, with the following specific steps: 1) a predicted text bounding box is retained if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression with a Jaccard coefficient threshold of 0.2 is performed on the text boxes retained in the previous step to obtain the final retained positive text quadrilateral bounding boxes; fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and input to the multidirectional rectangular prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), and the predicted multidirectional text bounding box is computed from the predicted center point coordinates, width, height and rotation angle of the multidirectional rectangle; according to the predicted multidirectional text bounding box, the multidirectional text features are rotated into horizontal features and input into the boundary point detection network, which predicts the regression quantities Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries; the coordinates of the boundary points in the horizontal box are calculated using the formula in (1.2.2) together with the 14 preset default boundary points, and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the positions of the boundary points in the original image.
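Steps 1) and 2) of this claim are a score threshold followed by non-maximum suppression at a Jaccard (IoU) threshold of 0.2; the sketch below uses axis-aligned boxes for simplicity, whereas the claim applies the operation to text quadrilaterals, so the overlap computation would differ in practice.

```python
import numpy as np

def jaccard(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def filter_text_boxes(boxes, scores, score_thr=0.5, iou_thr=0.2):
    """Keep boxes with score >= 0.5, then greedy NMS at Jaccard 0.2."""
    keep_mask = scores >= score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(scores)[::-1]
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        if order.size == 1:
            break
        ious = jaccard(boxes[i], boxes[order[1:]])
        order = order[1:][ious < iou_thr]
    return boxes[kept], scores[kept]
```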
10. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (2.2) is specifically as follows:
the rectified text features have a resolution of 16 × 64; the feature map is input into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, ..., p_(N-1)}, where p_i denotes the probability distribution predicted by the RNN at each step and N denotes the maximum number of RNN steps; during testing, prediction stops when the predicted value at the k-th step is the stop symbol, so the probability distribution of the predicted sequence is {p_0, p_1, ..., p_(k-1)}; according to the probability distributions, the class with the maximum probability at each step is taken as the current predicted character, and the predicted character sequence S_q is finally obtained.
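The decoding in this claim (arg-max class per step, stopping at the stop symbol) can be sketched as follows; the ordering of the 62-character alphabet is an assumption, since the claim only lists the character set.

```python
import numpy as np

# Assumed character table: digits, lower-case, upper-case, then a stop symbol.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
STOP_ID = len(ALPHABET)                      # index 62

def greedy_decode(prob_seq):
    """prob_seq: (N, 63) per-step probability distributions.
    Take the arg-max class per step and stop at the stop symbol."""
    chars = []
    for probs in prob_seq:
        k = int(np.argmax(probs))
        if k == STOP_ID:
            break
        chars.append(ALPHABET[k])
    return "".join(chars)

dummy = np.random.dirichlet(np.ones(63), size=8)
print(greedy_decode(dummy))
```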
CN201911038568.1A 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection Active CN110837835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911038568.1A CN110837835B (en) 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911038568.1A CN110837835B (en) 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection

Publications (2)

Publication Number Publication Date
CN110837835A true CN110837835A (en) 2020-02-25
CN110837835B CN110837835B (en) 2022-11-08

Family

ID=69575725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911038568.1A Active CN110837835B (en) 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection

Country Status (1)

Country Link
CN (1) CN110837835B (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476235A (en) * 2020-03-31 2020-07-31 成都数之联科技有限公司 Method for synthesizing 3D curved surface text picture
CN111507333A (en) * 2020-04-21 2020-08-07 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111553349A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
CN111553361A (en) * 2020-03-19 2020-08-18 四川大学华西医院 Pathological section label identification method
CN111753714A (en) * 2020-06-23 2020-10-09 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111767921A (en) * 2020-06-30 2020-10-13 上海媒智科技有限公司 Express bill positioning and correcting method and device
CN111898570A (en) * 2020-08-05 2020-11-06 盐城工学院 Method for recognizing text in image based on bidirectional feature pyramid network
CN112036405A (en) * 2020-08-31 2020-12-04 浪潮云信息技术股份公司 Detection and identification method for handwritten document text
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112101359A (en) * 2020-11-11 2020-12-18 广州华多网络科技有限公司 Text formula positioning method, model training method and related device
CN112101355A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112200202A (en) * 2020-10-29 2021-01-08 上海商汤智能科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112308051A (en) * 2020-12-29 2021-02-02 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
CN112765955A (en) * 2021-01-22 2021-05-07 中国人民公安大学 Cross-modal instance segmentation method under Chinese reference expression
CN112800801A (en) * 2021-02-03 2021-05-14 珠海格力电器股份有限公司 Method and device for recognizing pattern in image, computer equipment and storage medium
WO2021098861A1 (en) * 2019-11-21 2021-05-27 上海高德威智能交通系统有限公司 Text recognition method, apparatus, recognition device, and storage medium
CN113298167A (en) * 2021-06-01 2021-08-24 北京思特奇信息技术股份有限公司 Character detection method and system based on lightweight neural network model
CN113298054A (en) * 2021-07-27 2021-08-24 国际关系学院 Text region detection method based on embedded spatial pixel clustering
CN113343980A (en) * 2021-06-10 2021-09-03 西安邮电大学 Natural scene text detection method and system
CN113591864A (en) * 2021-07-28 2021-11-02 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
WO2021232464A1 (en) * 2020-05-20 2021-11-25 南京理工大学 Character offset detection method and system
CN113807336A (en) * 2021-08-09 2021-12-17 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN114155540A (en) * 2021-11-16 2022-03-08 深圳市联洲国际技术有限公司 Character recognition method, device and equipment based on deep learning and storage medium
CN114266800A (en) * 2021-12-24 2022-04-01 中设数字技术股份有限公司 Multi-rectangular bounding box algorithm and generation system for graphs
CN115482538A (en) * 2022-11-15 2022-12-16 上海安维尔信息科技股份有限公司 Material label extraction method and system based on Mask R-CNN
CN116884013A (en) * 2023-07-21 2023-10-13 江苏方天电力技术有限公司 Text vectorization method of engineering drawing
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN117975467A (en) * 2024-04-02 2024-05-03 华南理工大学 Bridge type end-to-end character recognition method
WO2024092484A1 (en) * 2022-11-01 2024-05-10 Boe Technology Group Co., Ltd. Computer-implemented object detection method, object detection apparatus, and computer-readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977620A (en) * 2017-11-29 2018-05-01 华中科技大学 A kind of multi-direction scene text single detection method based on full convolutional network
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977620A (en) * 2017-11-29 2018-05-01 华中科技大学 A kind of multi-direction scene text single detection method based on full convolutional network
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LUO et al.: "MORAN: A Multi-Object Rectified Attention Network for scene text recognition", PATTERN RECOGNITION *
ZHANG et al.: "Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION *

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021098861A1 (en) * 2019-11-21 2021-05-27 上海高德威智能交通系统有限公司 Text recognition method, apparatus, recognition device, and storage medium
US11928872B2 (en) 2019-11-21 2024-03-12 Shanghai Goldway Intelligent Transportation System Co., Ltd. Methods and apparatuses for recognizing text, recognition devices and storage media
CN111553361A (en) * 2020-03-19 2020-08-18 四川大学华西医院 Pathological section label identification method
CN111476235A (en) * 2020-03-31 2020-07-31 成都数之联科技有限公司 Method for synthesizing 3D curved surface text picture
CN111476235B (en) * 2020-03-31 2023-04-25 成都数之联科技股份有限公司 Method for synthesizing 3D curved text picture
CN111507333A (en) * 2020-04-21 2020-08-07 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111507333B (en) * 2020-04-21 2023-09-15 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111553349B (en) * 2020-04-26 2023-04-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
CN111553349A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
WO2021232464A1 (en) * 2020-05-20 2021-11-25 南京理工大学 Character offset detection method and system
CN111753714A (en) * 2020-06-23 2020-10-09 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111753714B (en) * 2020-06-23 2023-09-01 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111767921A (en) * 2020-06-30 2020-10-13 上海媒智科技有限公司 Express bill positioning and correcting method and device
CN111898570A (en) * 2020-08-05 2020-11-06 盐城工学院 Method for recognizing text in image based on bidirectional feature pyramid network
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112070082B (en) * 2020-08-24 2023-04-07 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112036405A (en) * 2020-08-31 2020-12-04 浪潮云信息技术股份公司 Detection and identification method for handwritten document text
CN112101355B (en) * 2020-09-25 2024-04-02 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112101355A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112200202A (en) * 2020-10-29 2021-01-08 上海商汤智能科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112101359A (en) * 2020-11-11 2020-12-18 广州华多网络科技有限公司 Text formula positioning method, model training method and related device
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112308051A (en) * 2020-12-29 2021-02-02 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN112308051B (en) * 2020-12-29 2021-10-29 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN112765955B (en) * 2021-01-22 2023-05-26 中国人民公安大学 Cross-modal instance segmentation method under Chinese finger representation
CN112765955A (en) * 2021-01-22 2021-05-07 中国人民公安大学 Cross-modal instance segmentation method under Chinese reference expression
CN112800801A (en) * 2021-02-03 2021-05-14 珠海格力电器股份有限公司 Method and device for recognizing pattern in image, computer equipment and storage medium
CN112800801B (en) * 2021-02-03 2022-11-11 珠海格力电器股份有限公司 Method and device for recognizing pattern in image, computer equipment and storage medium
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
CN112733822B (en) * 2021-03-31 2021-07-27 上海旻浦科技有限公司 End-to-end text detection and identification method
CN113298167A (en) * 2021-06-01 2021-08-24 北京思特奇信息技术股份有限公司 Character detection method and system based on lightweight neural network model
CN113343980B (en) * 2021-06-10 2023-06-09 西安邮电大学 Natural scene text detection method and system
CN113343980A (en) * 2021-06-10 2021-09-03 西安邮电大学 Natural scene text detection method and system
CN113298054A (en) * 2021-07-27 2021-08-24 国际关系学院 Text region detection method based on embedded spatial pixel clustering
CN113298054B (en) * 2021-07-27 2021-10-08 国际关系学院 Text region detection method based on embedded spatial pixel clustering
CN113591864A (en) * 2021-07-28 2021-11-02 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
CN113807336A (en) * 2021-08-09 2021-12-17 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN113807336B (en) * 2021-08-09 2023-06-30 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN114155540B (en) * 2021-11-16 2024-05-03 深圳市联洲国际技术有限公司 Character recognition method, device, equipment and storage medium based on deep learning
CN114155540A (en) * 2021-11-16 2022-03-08 深圳市联洲国际技术有限公司 Character recognition method, device and equipment based on deep learning and storage medium
CN114266800B (en) * 2021-12-24 2023-05-05 中设数字技术股份有限公司 Method and system for generating multiple rectangular bounding boxes of plane graph
CN114266800A (en) * 2021-12-24 2022-04-01 中设数字技术股份有限公司 Multi-rectangular bounding box algorithm and generation system for graphs
WO2024092484A1 (en) * 2022-11-01 2024-05-10 Boe Technology Group Co., Ltd. Computer-implemented object detection method, object detection apparatus, and computer-readable medium
CN115482538A (en) * 2022-11-15 2022-12-16 上海安维尔信息科技股份有限公司 Material label extraction method and system based on Mask R-CNN
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN116884013A (en) * 2023-07-21 2023-10-13 江苏方天电力技术有限公司 Text vectorization method of engineering drawing
CN117975467A (en) * 2024-04-02 2024-05-03 华南理工大学 Bridge type end-to-end character recognition method

Also Published As

Publication number Publication date
CN110837835B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN110837835B (en) End-to-end scene text identification method based on boundary point detection
CN108549893B (en) End-to-end identification method for scene text with any shape
US10762376B2 (en) Method and apparatus for detecting text
WO2020108311A1 (en) 3d detection method and apparatus for target object, and medium and device
Ma et al. ReLaText: Exploiting visual relationships for arbitrary-shaped scene text detection with graph convolutional networks
Zang et al. Vehicle license plate recognition using visual attention model and deep learning
Rekha et al. Hand gesture recognition for sign language: A new hybrid approach
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
Chiang et al. Recognizing text in raster maps
CN112541491B (en) End-to-end text detection and recognition method based on image character region perception
CN112446370B (en) Method for identifying text information of nameplate of power equipment
CN110598690A (en) End-to-end optical character detection and identification method and system
Cao et al. Robust vehicle detection by combining deep features with exemplar classification
CN112766184A (en) Remote sensing target detection method based on multi-level feature selection convolutional neural network
CN111476210A (en) Image-based text recognition method, system, device and storage medium
Wang et al. Spatially prioritized and persistent text detection and decoding
Katper et al. Deep neural networks combined with STN for multi-oriented text detection and recognition
Zhang et al. A vertical text spotting model for trailer and container codes
CN113420648B (en) Target detection method and system with rotation adaptability
Mohammad et al. Contour-based character segmentation for printed Arabic text with diacritics
Ghadhban et al. Segments interpolation extractor for finding the best fit line in Arabic offline handwriting recognition words
Turk et al. Computer vision for mobile augmented reality
CN114330247A (en) Automatic insurance clause analysis method based on image recognition
CN111476226B (en) Text positioning method and device and model training method
Shi et al. Fuzzy support tensor product adaptive image classification for the internet of things

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant