CN108549893B - End-to-end recognition method for scene text of arbitrary shape


Info

Publication number: CN108549893B
Application number: CN201810294058.XA
Authority: CN (China)
Prior art keywords: text, network, character, region, rcnn
Other languages: Chinese (zh)
Other versions: CN108549893A
Legal status: Active
Inventors: 白翔, 吕鹏原, 廖明辉, 姚聪, 储佳佳
Assignee (current and original): Huazhong University of Science and Technology

Events:
    • Application CN201810294058.XA filed by Huazhong University of Science and Technology
    • Publication of CN108549893A
    • Priority to PCT application PCT/CN2019/080354 (WO2019192397A1)
    • Application granted; publication of CN108549893B

Classifications

    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06N3/084 Learning methods: backpropagation, e.g. using gradient descent
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V20/63 Scene text, e.g. street names
    • G06V30/153 Segmentation of character regions using recognition of characters or words


Abstract

The invention discloses an end-to-end recognition method for scene text of arbitrary shape. Text features are extracted by a feature pyramid network and candidate text boxes are generated by a region extraction network; the positions of the candidate text boxes are then adjusted by the fast region classification regression branch to obtain more accurate text bounding box positions; next, the bounding box positions are input into the segmentation branch, and the predicted character sequence is obtained with a pixel voting algorithm; finally, the predicted character sequence is processed with a weighted edit distance algorithm, and the best-matching word for the predicted sequence is found in a given dictionary to obtain the final text recognition result. The method can simultaneously detect and recognize scene text of arbitrary shape in natural images, including horizontal, multi-oriented and curved text, and can be trained entirely end to end. Compared with the prior art, the detection and recognition method of the invention achieves excellent accuracy and universality and has strong practical application value.

Description

End-to-end recognition method for scene text of arbitrary shape
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an end-to-end recognition method for scene text of arbitrary shape.
Background
Scene text detection and recognition is a very active and challenging research direction in the field of computer vision, and many real-life applications depend on it, such as picture-based geo-location, real-time translation, and assistance for the visually impaired.
Scene text detection and recognition aims to detect and recognize text from natural scenes simultaneously, i.e. it divides into the two tasks of detection and recognition. In most previous research, text detection and recognition are handled separately: in the first step, a trained detector detects text regions in a natural scene picture; in the second step, the text regions detected in the first step are input into a recognition module to obtain the text content. However, these two tasks are highly correlated and complementary: on the one hand, the quality of the detection step determines the accuracy of recognition; on the other hand, the recognition result can also provide feedback to the detection. Such separate processing may keep detection and recognition from reaching optimal performance.
Recently, two end-to-end trainable frameworks for scene text recognition have been proposed. Benefiting from the complementarity between detection and recognition, these unified models significantly outperform previous approaches. However, both methods have two major drawbacks: first, they cannot be trained completely end to end; second, they can only recognize horizontal or oriented text, whereas the shape of text in real scene pictures varies greatly, from horizontal or oriented to curved. Therefore, an end-to-end recognition method that can handle scene text of arbitrary shape needs to be designed.
Disclosure of Invention
The invention aims to provide an end-to-end recognition method for scene text of arbitrary shape, consisting of a text detector based on instance segmentation and a text recognizer based on character segmentation. Text of arbitrary shape is detected by segmenting instance text regions, and text is recognized by semantic segmentation in two-dimensional space, so that irregular text instances are recognized. The method can detect and recognize text instances of arbitrary shape and can be trained completely end to end.
In order to achieve the above object, the present invention provides an end-to-end recognition method for scene text of arbitrary shape, which addresses scene text detection and recognition from a completely new perspective and comprises the following steps:
(1) training an arbitrarily-shaped scene text end-to-end recognition network model, comprising the following sub-steps:
(1.1) word-level labeling is performed on the multi-oriented text of all pictures in the original data set; the labels are the clockwise vertex coordinates of the word-level polygonal text bounding boxes and the character sequences of the words, giving a labeled standard training data set;
(1.2) an arbitrary-shape scene text end-to-end recognition network model is defined, consisting of a feature pyramid network, a region extraction network, a fast region classification regression branch network and a segmentation branch network. Training labels are calculated from the labeled standard training data set of step (1.1), a loss function is designed, and the arbitrary-shape scene text end-to-end recognition network is trained with backpropagation to obtain the arbitrary-shape scene text end-to-end recognition network model; this specifically comprises the following sub-steps:
(1.2.1) constructing an arbitrary-shape scene text end-to-end recognition network model, which consists of a feature pyramid network, a region extraction network, a fast region classification regression branch network and a segmentation branch network. The feature pyramid network, shown in fig. 3, is formed by adding bottom-up, top-down and lateral connections to a ResNet-50 deep convolutional neural network backbone, and extracts features fusing different resolutions from the input standard data set pictures. The extracted features of different scales are input into the region extraction network to obtain candidate text regions; after region-of-interest alignment, candidate text regions of fixed scale are obtained and input into the fast region classification regression branch network and the segmentation branch network respectively. The 7 × 7 candidate text regions extracted by the region extraction network are input into the fast region classification regression network, whose classification branch predicts the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and whose regression branch computes the offset of the candidate text region relative to the real text region and adjusts its position. As shown in fig. 4, the segmentation branch network consists of four convolutional layers Conv1, Conv2, Conv3, Conv4, one deconvolution layer DeConv and one final convolutional layer Conv5; the 16 × 64 candidate text regions extracted by the region extraction network are input into the segmentation branch, and convolution and deconvolution operations finally generate 38 target segmentation maps of resolution 32 × 128: 1 global text instance segmentation map used to predict the precise position of the text region, and 36 character segmentation maps plus 1 character background segmentation map used to obtain the predicted character sequence through a pixel voting algorithm.
(1.2.2) generating horizontal initial bounding boxes on the original image according to the labeled standard training data set and the feature maps, and generating training labels for the region extraction network module, the fast region classification regression branch network module and the segmentation branch network module of the recognition network model: for the labeled standard training data set I_tr, the ground-truth label of an input picture I_tr_i contains polygons P = {p_1, p_2, …, p_m} describing the text regions and character labels C = {c_1 = (cc_1, cl_1), c_2 = (cc_2, cl_2), …, c_n = (cc_n, cl_n)} giving the category and position of each character, where P_i is the polygonal bounding box of a text region in picture I_tr_i, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m denotes the number of polygonal text labels, and cc_k and cl_k are respectively the category and position of the k-th character in the text; in the present invention C is not required for all training samples.
For a given standard data set I_tr, the polygons P = {p_1, p_2, …, p_m} in the data set labels are first converted to the minimum horizontal rectangular bounding boxes of the polygonal text label boxes; such a box is denoted G_d = (x, y, h, w) by the center point (x, y), height h and width w of the rectangle. For the region extraction network, according to the labeled bounding boxes G_d = (x, y, h, w) of the labeled data set, every pixel of each feature map output by the feature pyramid is mapped back to the original image, several initial bounding boxes are generated as the candidate text regions predicted by the region extraction network, and the position offset of each initial bounding box Q_0 relative to a labeled bounding box G_d is computed: when the Jaccard coefficients between Q_0 and all labeled bounding boxes G_d are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e. when at least one labeled bounding box G_d has a Jaccard coefficient with Q_0 of no less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the largest Jaccard coefficient according to the formulas:

x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)

where x_0, y_0 are the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0, h_0 are the width and height of Q_0, Δx, Δy are the horizontal and vertical position offsets of the center point of Q_0 relative to the center point of G_d, and exp is the exponential operation. The training label of the region extraction network is thus obtained as:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
For the fast region classification regression branch network, the training labels can similarly be calculated as:

gt_rcnn = (Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn, P_rcnn)
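To make the geometry of these offset labels concrete, the four formulas above can be read as the following encode/decode pair. This is an illustrative sketch, not code from the patent; the function names are hypothetical.

```python
import math

def encode_offsets(gd, q0):
    """Compute (dx, dy, dh, dw) of labeled box gd relative to initial box q0.
    Boxes are (x, y, h, w) with (x, y) the center point, per the patent's notation."""
    x, y, h, w = gd
    x0, y0, h0, w0 = q0
    dx = (x - x0) / w0          # horizontal center offset, normalized by box width
    dy = (y - y0) / h0          # vertical center offset, normalized by box height
    dh = math.log(h / h0)       # log-scale height ratio, so h = h0 * exp(dh)
    dw = math.log(w / w0)       # log-scale width ratio, so w = w0 * exp(dw)
    return dx, dy, dh, dw

def decode_offsets(q0, offsets):
    """Invert the encoding: recover (x, y, h, w) from q0 and predicted offsets."""
    x0, y0, h0, w0 = q0
    dx, dy, dh, dw = offsets
    return (x0 + w0 * dx, y0 + h0 * dy, h0 * math.exp(dh), w0 * math.exp(dw))
```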
For the segmentation branch network, two types of target labels need to be generated: a global label for text instance segmentation and character labels for character semantic segmentation. For a given positive candidate text box r, the best-matching horizontal rectangle is obtained first, from which the matching polygon and character boxes are obtained; the matching polygon and character boxes are then shifted and resized to align the candidate text box r with the target label of preset height H and width W according to the formulas:

B_x = (B_x0 − min(r_x)) × W / (max(r_x) − min(r_x))
B_y = (B_y0 − min(r_y)) × H / (max(r_y) − min(r_y))

where (r_x, r_y) are the vertices of the candidate text box r, and (B_x, B_y) and (B_x0, B_y0) are the updated vertices and the original vertices of the polygon and of all character boxes; specifically, r_x is the set of abscissas of all vertices of the candidate text box r, r_y is the set of ordinates of all its vertices, and B_x, B_x0, B_y, B_y0 are defined analogously. The target global label X_g is then generated by drawing the standard polygon on a zero-initialized mask and filling its value with 1. For the character labels, taking the center as the origin, each standard character box is shrunk to one eighth of the size of the original box so that the character masks do not overlap each other; the shrunk character boxes are drawn on a zero-initialized mask and filled with their corresponding category indices to generate the character label X_c. If C does not exist, all pixels in the character maps are set to −1 and are ignored during optimization. This finally gives the overall segmentation branch label gt_mask = X. Combining the above labels gt_rpn, gt_rcnn and gt_mask generates the final training label:

gt = {Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn, Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn, P_rcnn, X};
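A minimal sketch of how the two mask targets could be rasterized, assuming the polygon and character boxes have already been aligned to the H × W target by the formulas above. The OpenCV drawing calls, helper names and class-index convention (1..36 for characters, 0 for background) are assumptions of this illustration.

```python
import numpy as np
import cv2

H, W = 32, 128  # target label size of the segmentation branch

def make_global_label(polygon):
    """Draw the aligned text polygon on a zero-initialized mask, filled with 1 (X_g)."""
    mask = np.zeros((H, W), dtype=np.uint8)
    cv2.fillPoly(mask, [polygon.astype(np.int32)], 1)
    return mask

def shrink_box(box, factor=1 / 8):
    """Shrink a character box about its own center so character masks don't overlap."""
    center = box.mean(axis=0)
    return center + (box - center) * factor

def make_char_label(char_boxes, char_classes):
    """Draw shrunk character boxes filled with their category index (X_c). If the
    character annotation C does not exist, every pixel is set to -1 and ignored
    during optimization."""
    if char_boxes is None:
        return np.full((H, W), -1, dtype=np.int32)
    mask = np.zeros((H, W), dtype=np.uint8)
    for box, cls in zip(char_boxes, char_classes):
        small = shrink_box(np.asarray(box, dtype=np.float32))
        cv2.fillPoly(mask, [small.astype(np.int32)], int(cls))
    return mask.astype(np.int32)
```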
(1.2.3) taking the standard training data set I_tr as the input of the recognition network model, features are extracted with the feature pyramid network module: the pictures of the standard training data set I_tr are input into the bottom-up ResNet-50 structure of the feature pyramid network, in which each group of convolutional layer units that does not change the feature map size is defined as a stage (stages {P2, P3, P4, P5, P6}), and the final output convolution features F of every stage are extracted. The top-down connections of the feature pyramid network module upsample the output convolution features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure of the feature pyramid network module fuses the features of each stage upsampled in the top-down pass with the features generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}; the process is shown in fig. 3.
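One top-down/lateral fusion step of this kind can be sketched in PyTorch as follows; the 256 output channels and the layer names are common FPN conventions assumed here, not values fixed by the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """Fuse one bottom-up ResNet-50 feature C_i with the upsampled top-down feature."""
    def __init__(self, c_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_channels, out_channels, kernel_size=1)         # 1x1 lateral conv
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c_i, top_down):
        # upsample the coarser top-down feature to the lateral feature's spatial size
        up = F.interpolate(top_down, size=c_i.shape[-2:], mode="nearest")
        return self.smooth(self.lateral(c_i) + up)  # element-wise sum, then smooth
```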
(1.2.4) inputting the features extracted by the feature pyramid network into the region extraction network, assigning anchors, adjusting the feature maps with the region-of-interest alignment method, and generating candidate text boxes:

For an input picture I_tr_k, 5 stages of features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network; according to the stages {P2, P3, P4, P5, P6}, the anchor scales at the different stages are defined as {32², 64², 128², 256², 512²}, and each scale has 3 aspect ratios {1:2, 1:1, 2:1}; 15 feature maps of different scales and ratios {Ftr_1, Ftr_2, …, Ftr_15} can thus be extracted, denoted Ftr_p with subscript p = 1, …, 15.

Through the region-of-interest alignment operation, candidate text regions of fixed scale are generated from the features Ftr_p: a candidate text region R_rcnn of resolution 7 × 7 for the fast region classification regression network, and a candidate text region R_mask of resolution 16 × 64 for the segmentation branch. The classification branch predicts the probability P_rpn that each candidate text box is a correct text region bounding box, and the regression branch predicts the candidate text box offsets:

Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn).
(1.2.5) the 7 × 7 candidate text regions R_rcnn generated by the region extraction network are input into the fast region classification regression branch network module, the loss function is computed through the classification and regression branches and back-propagated, and the predicted text bounding boxes are finally generated: the module is divided into a classification branch and a regression branch. The 7 × 7 candidate text region R_rcnn is input into the classification branch, and convolution operations output the classification score P_rcnn of the predicted bounding box, i.e. the probability that the bounding box is predicted as a positive text box, a decimal with value in [0, 1]. R_rcnn is also input into the regression branch, which outputs the predicted regression offset Y_rcnn = (Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn), composed of 4 decimals in [0, 1], i.e. the predicted position offsets of the center-point abscissa, ordinate, height and width of the predicted bounding box G_q, when predicted as a positive text box, relative to the center-point abscissa, ordinate, height and width of the labeled bounding box G_d.
(1.2.6) the 16 × 64 candidate text regions R_mask generated by the region extraction network are input into the segmentation branch network module, which generates 38 target segmentation maps based on instance segmentation and semantic segmentation operations: the segmentation branch network module comprises 4 convolutional layers Conv1, Conv2, Conv3, Conv4, one deconvolution layer DeConv and one final convolutional layer Conv5. The 16 × 64 candidate text box R_mask generated by the region extraction network is input into the segmentation branch module, and convolution, deconvolution and related operations finally generate 38 target segmentation maps of scale 32 × 128, {M_global, M_1, M_2, …, M_36, M_background}, outputting a value X for each pixel of each map, with value in [0, 1]. Among the output maps, the global segmentation map M_global directly predicts the text region polygon Pm = {pm_1, pm_2, …, pm_n}, while the character segmentation maps {M_1, M_2, …, M_36} and the character background segmentation map M_background predict the character sequence S_q according to the pixel voting algorithm.
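A sketch of such a segmentation branch under the stated layer layout (Conv1 to Conv4, DeConv, Conv5, 38 output maps of 32 × 128 from 16 × 64 inputs); kernel sizes, channel widths and the sigmoid output are assumptions of this illustration.

```python
import torch.nn as nn

class MaskBranch(nn.Module):
    """Conv1-4 -> DeConv (2x upsample: 16x64 -> 32x128) -> Conv5 -> 38 maps.
    Map 0: global text instance; maps 1-36: characters; map 37: character background."""
    def __init__(self, in_channels=256, mid=256, num_maps=38):
        super().__init__()
        convs = []
        for i in range(4):  # Conv1..Conv4
            convs += [nn.Conv2d(in_channels if i == 0 else mid, mid, 3, padding=1),
                      nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.deconv = nn.ConvTranspose2d(mid, mid, 2, stride=2)  # DeConv: doubles resolution
        self.predict = nn.Conv2d(mid, num_maps, 1)               # Conv5: 38 output maps

    def forward(self, x):                 # x: (N, 256, 16, 64) RoI-aligned features
        x = self.deconv(self.convs(x))
        return self.predict(x).sigmoid()  # per-pixel values in [0, 1], shape (N, 38, 32, 128)
```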
(1.2.7) taking the training label gt as the expected output of the network and the predicted labels ŷ = {P_rpn, Y_rpn, P_rcnn, Y_rcnn, X} as the network prediction output, an objective loss function between the expected output and the predicted output is designed for the constructed network model: the training label gt calculated in step (1.2.2) is the expected output of the network, and the predicted labels of steps (1.2.4), (1.2.5) and (1.2.6) are the network prediction output. For the network model constructed in (1.2.1), the overall objective loss function consists of the loss functions of the region extraction network, the fast region classification regression branch network and the segmentation branch network, with the expression:

L(P_rpn, Y_rpn, P_rcnn, Y_rcnn, X) = L_rpn(P_rpn, Y_rpn) + α1·L_rcnn(P_rcnn, Y_rcnn) + α2·L_mask(X)
where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_rcnn(P_rcnn, Y_rcnn) is the loss function of the fast region classification regression branch network, L_mask(X) is the loss function of the segmentation branch network, and α1, α2 are the weight coefficients of the loss functions L_rcnn and L_mask respectively, both simply set to 1.
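The weighted sum can be composed as below; the individual terms (binary cross-entropy for classification and masks, smooth-L1 for regression) and the dictionary interface are assumed choices for illustration, since the text fixes only the overall form with α1 = α2 = 1.

```python
import torch.nn.functional as F

def total_loss(rpn_out, rcnn_out, mask_out, gt, alpha1=1.0, alpha2=1.0):
    """L = L_rpn + a1*L_rcnn + a2*L_mask with a1 = a2 = 1, as in the text."""
    l_rpn = (F.binary_cross_entropy(rpn_out["score"], gt["p_rpn"])      # P_rpn term
             + F.smooth_l1_loss(rpn_out["offsets"], gt["y_rpn"]))       # Y_rpn term
    l_rcnn = (F.binary_cross_entropy(rcnn_out["score"], gt["p_rcnn"])   # P_rcnn term
              + F.smooth_l1_loss(rcnn_out["offsets"], gt["y_rcnn"]))    # Y_rcnn term
    l_mask = F.binary_cross_entropy(mask_out, gt["masks"])              # X term over the 38 maps
    return l_rpn + alpha1 * l_rcnn + alpha2 * l_mask
```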
According to the designed overall objective loss function, the model is iteratively trained with the backpropagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, training first iterates on a synthetic text data set (SynthText) to obtain initial network parameters, and then continues on real data sets to fine-tune the network parameters.
(2) Text recognition is performed on the text picture to be recognized with the trained model, comprising the following sub-steps:
(2.1) features are extracted from the scene text picture to be detected and recognized and input into the fast region classification regression branch network to generate candidate text regions, which are filtered by a non-maximum suppression operation to obtain more accurate candidate text regions: the k-th picture I_tst_k of the data set I_tst to be detected is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the model generates initial bounding boxes, which are input into the fast region classification regression branch network. For each initial bounding box G_q, the classification branch outputs the prediction value P_rcnn of the classification score, i.e. the score with which G_q is predicted as a positive sample; the regression branch outputs a predicted regression offset Y_rcnn = (Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn) composed of 4 decimals, i.e. the position offsets of the center-point abscissa, ordinate, height and width of G_q, when predicted as a positive text box, relative to those of the labeled bounding box G_d; from these position offsets, the position Q_z of the quadrilateral text bounding box predicted by the network can be calculated.

The predicted text bounding boxes Q_z are filtered by a non-maximum suppression operation to obtain the output result: the network model regresses a horizontal quadrilateral position for every initial bounding box Q_0 on the feature maps Ftst_p that is predicted as positive text, and the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tst_k usually overlap each other, so non-maximum suppression is applied to the positions of all positive text quadrilaterals. The specific steps are: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) a non-maximum suppression operation (NMS) with a Jaccard coefficient of 0.2 is applied to the text boxes kept in the previous step to obtain the finally kept quadrilateral bounding boxes of positive text.
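The two filtering steps can be sketched as a plain greedy NMS; the corner-format box representation and function names are assumptions of this illustration.

```python
import numpy as np

def iou(a, b):
    """Jaccard coefficient of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, scores, score_thr=0.5, nms_thr=0.2):
    """Step 1: keep boxes with classification score >= 0.5.
    Step 2: greedy NMS, suppressing any box whose Jaccard with a kept box exceeds 0.2."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in kept):
            kept.append(i)
    return kept  # indices of the finally kept positive text boxes
```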
(2.2) the predicted candidate text regions are input into the segmentation branch network for text instance segmentation and character segmentation, generating a global text instance segmentation mask and character segmentation masks respectively; the polygonal word text region is obtained by computing the contour of the text region on the global text instance segmentation mask, and the character sequence is obtained by prediction with the pixel voting algorithm on the character segmentation masks: the predicted quadrilateral text bounding box position Q_z is input into the segmentation branch, which generates the 38 target segmentation maps. First, the contour of the text region is computed directly from the global text instance segmentation mask, giving the text region polygon. Second, the character sequence S_q is generated with the pixel voting algorithm.
For the 36 character segmentation maps {M_1, M_2, …, M_36}, the value p_ci(x, y) of a pixel in the i-th segmentation map represents the probability that the pixel p_g(x, y) at the corresponding position of the global text segmentation map is the character z_i, where z_i is the i-th of the 36 characters {0, 1, …, 9, a, b, …, z}; the probabilities at corresponding pixel positions across the 36 character segmentation maps sum to 1, i.e.

Σ_{i=1}^{36} p_ci(x, y) = 1.
For the character background segmentation map M_background, the map is first binarized; on the binarized background map, the set of character regions is defined as R = {r_1, r_2, …, r_n}, where r_i is the i-th character region on the character background segmentation map and n is the number of all characters on the background segmentation map.

The pixel voting algorithm proceeds as follows: first, the set of regions in the 36 character segmentation maps connected to the character region r_i of the character background segmentation map is defined as C_i = {c_i1, c_i2, …, c_i36}, where c_ij is the region block in the j-th character segmentation map corresponding to the i-th character region of the character background segmentation map; then, for the region r_i and the corresponding connected regions C_i, the predicted character is obtained with the pixel voting algorithm: first, the mean of the values of all pixels inside each c_ij of the connected regions C_i is calculated; second, the c_ij_max with the largest mean is found, and the character category z_j_max corresponding to its character map M_j_max is the predicted character of this character region; finally, performing this operation on every character region r_i of the character background segmentation map yields the final predicted character sequence S_q.
(2.3) the character sequence predicted by the segmentation branch is processed with a weighted edit distance algorithm, and the best-matching word for the predicted sequence is found in a given dictionary to obtain the final recognition result: in the pixel voting stage, the probabilities of all character categories for each character region of the predicted sequence are available, and according to these probabilities different weights are defined for the deletion, insertion and substitution operations. For a deletion, the cost is the probability with which the character was predicted as the currently deleted character; for an insertion, the cost is the average probability of the two characters adjacent to the insertion position; for a substitution, the cost is computed as max(1 − s1/s2, 0), where s1 and s2 are the probabilities of the candidate character and of the predicted character to be replaced. Matching the predicted character string against a given dictionary with the weighted edit distance, using these different weights for deletion, insertion and substitution, adjusts the predicted word, improving accuracy and giving the final recognition result.
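The weighted edit distance can be sketched as the classical dynamic program with the three costs above; the prob[i][c] interface to the pixel-voting probabilities is an assumption of this illustration.

```python
def weighted_edit_distance(pred, word, prob):
    """pred: predicted string; word: dictionary word; prob[i][c]: probability that
    character region i of pred is class c. Deletion costs the probability of the
    deleted character; insertion costs the mean probability of its neighbours;
    substitution costs max(1 - s1/s2, 0) with s1 the candidate, s2 the prediction."""
    n, m = len(pred), len(word)
    if n == 0:
        return float(m)  # degenerate case: pure insertions
    def del_cost(i):
        return prob[i][pred[i]]
    def ins_cost(i):  # mean probability of the two regions adjacent to the gap
        left, right = max(i - 1, 0), min(i, n - 1)
        return (prob[left][pred[left]] + prob[right][pred[right]]) / 2
    def sub_cost(i, c):
        s1, s2 = prob[i][c], prob[i][pred[i]]
        return 0.0 if c == pred[i] else max(1 - s1 / s2, 0.0)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(i - 1)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(i - 1),
                          D[i][j - 1] + ins_cost(i),
                          D[i - 1][j - 1] + sub_cost(i - 1, word[j - 1]))
    return D[n][m]

# the dictionary word minimizing this distance is taken as the recognition result
```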
Through the above technical scheme, compared with the prior art, the invention has the following technical effects:
(1) High accuracy: for the problem of recognizing arbitrarily shaped text in scene text, the invention innovatively detects text with instance segmentation and recognizes text with semantic segmentation, locating text positions and recognizing text more accurately.
(2) High speed: the detection and recognition model of the invention trains quickly while maintaining detection and recognition accuracy.
(3) Strong universality: the invention discloses an end-to-end trainable text detection and recognition model, which not only detects and recognizes text simultaneously with complete end-to-end training, but also handles text of various shapes, including horizontal, oriented and curved text;
(4) Strong robustness: the invention can cope with changes in text scale and shape, and can simultaneously detect and recognize horizontal, oriented and curved text.
Drawings
FIG. 1 is a flow chart of an arbitrary-shaped scene text end-to-end recognition method of the present invention, in which a solid arrow represents training and a dashed arrow represents testing;
FIG. 2 is a diagram of an arbitrarily shaped scene text end-to-end recognition network model of the present invention;
FIG. 3 is a schematic diagram of a network structure of a feature pyramid structure module in an arbitrary-shaped scene text end-to-end recognition model according to the present invention;
FIG. 4 is a diagram of a segmentation branch network structure in an arbitrary-shaped scene text end-to-end recognition model according to the present invention;
FIG. 5 is a schematic diagram of the pixel voting algorithm used in the test stage of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms of the present invention are first explained:
ResNet-50: a neural network for classification, mainly comprising 50 convolutional layers together with pooling layers and shortcut connections. The convolutional layers extract picture features; the pooling layers reduce the dimensionality of the feature vectors output by the convolutional layers and reduce overfitting; the shortcut connections propagate gradients and alleviate the vanishing and exploding gradient problems. The network parameters can be updated by the backpropagation algorithm;
Region extraction network: a network for generating candidate text regions; a sliding window on the extracted feature maps generates fully connected features of a specific dimension, from which two fully connected branches classify and regress candidate text regions; finally, according to the different anchors and ratios, candidate text regions of different scales and aspect ratios are generated for the subsequent network.
Jaccard coefficient: used to compare similarity and difference between finite sample sets. In the field of text detection, the Jaccard coefficient is by default taken to equal the IoU (intersection over union) of two boxes, i.e. the area of their intersection divided by the area of their union; it describes the overlap between a predicted text box generated by the model and the originally labeled text box: the larger the IoU, the higher the overlap and the more accurate the detection.
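For axis-aligned boxes the IoU is a few lines of arithmetic (see the NMS sketch in step (2.1) above); for the quadrilateral and polygonal text boxes of this invention, a general polygon intersection is needed. A sketch using the shapely library, which is an illustrative choice and not part of the patent:

```python
from shapely.geometry import Polygon

def jaccard(poly_a, poly_b):
    """Jaccard coefficient (IoU) of two polygons given as vertex lists [(x, y), ...]."""
    a, b = Polygon(poly_a), Polygon(poly_b)
    inter = a.intersection(b).area  # overlap area
    union = a.union(b).area         # combined area
    return inter / union if union > 0 else 0.0

# e.g. two unit squares offset by half a side overlap with IoU = 1/3
print(jaccard([(0, 0), (1, 0), (1, 1), (0, 1)],
              [(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)]))
```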
Non-maximum suppression (NMS): a post-processing algorithm widely used in computer-vision detection. According to a set threshold, overlapping detection boxes are filtered by loop iterations of sorting, traversing and rejecting, removing redundant detection boxes to obtain the final detection result.
As shown in fig. 1, the arbitrary-shape scene text end-to-end recognition method of the present invention comprises the following steps:
(1) training an arbitrarily-shaped scene text end-to-end recognition network model, comprising the following sub-steps:
(1.1) word-level labeling is performed on the multi-oriented text of all pictures in the original data set; the labels are the clockwise vertex coordinates of the word-level polygonal text bounding boxes and the character sequences of the words, giving a labeled standard training data set;
(1.2) an arbitrary-shape scene text end-to-end recognition network model is defined, consisting of a feature pyramid network, a region extraction network, a fast region classification regression branch network and a segmentation branch network. Training labels are calculated from the labeled standard training data set of step (1.1), a loss function is designed, and the arbitrary-shape scene text end-to-end recognition network is trained with backpropagation to obtain the arbitrary-shape scene text end-to-end recognition network model; this specifically comprises the following sub-steps:
(1.2.1) constructing an arbitrary-shape scene text end-to-end recognition network model, which consists of a feature pyramid network, a region extraction network, a fast region classification regression branch network and a segmentation branch network. The feature pyramid network, shown in fig. 3, is formed by adding bottom-up, top-down and lateral connections to a ResNet-50 deep convolutional neural network backbone, and extracts features fusing different resolutions from the input standard data set pictures. The extracted features of different scales are input into the region extraction network to obtain candidate text regions; after region-of-interest alignment, candidate text regions of fixed scale are obtained and input into the fast region classification regression branch network and the segmentation branch network respectively. The 7 × 7 candidate text regions extracted by the region extraction network are input into the fast region classification regression network, whose classification branch predicts the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and whose regression branch computes the offset of the candidate text region relative to the real text region and adjusts its position. As shown in fig. 4, the segmentation branch network consists of four convolutional layers Conv1, Conv2, Conv3, Conv4, one deconvolution layer DeConv and one final convolutional layer Conv5; the 16 × 64 candidate text regions extracted by the region extraction network are input into the segmentation branch, and convolution and deconvolution operations finally generate 38 target segmentation maps of resolution 32 × 128: 1 global text instance segmentation map used to predict the precise position of the text region, and 36 character segmentation maps plus 1 character background segmentation map used to obtain the predicted character sequence through a pixel voting algorithm.
(1.2.2) generating horizontal initial bounding boxes on the original image according to the labeled standard training data set and the feature maps, and generating training labels for the region extraction network module, the fast region classification regression branch network module and the segmentation branch network module of the recognition network model: for the labeled standard training data set I_tr, the ground-truth label of an input picture I_tr_i contains polygons P = {p_1, p_2, …, p_m} describing the text regions and character labels C = {c_1 = (cc_1, cl_1), c_2 = (cc_2, cl_2), …, c_n = (cc_n, cl_n)} giving the category and position of each character, where P_i is the polygonal bounding box of a text region in picture I_tr_i, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m denotes the number of polygonal text labels, and cc_k and cl_k are respectively the category and position of the k-th character in the text; in the present invention C is not required for all training samples.
For a given standard data set I_tr, the polygons P = {p_1, p_2, …, p_m} in the data set labels are first converted to the minimum horizontal rectangular bounding boxes of the polygonal text label boxes; such a box is denoted G_d = (x, y, h, w) by the center point (x, y), height h and width w of the rectangle. For the region extraction network, according to the labeled bounding boxes G_d = (x, y, h, w) of the labeled data set, every pixel of each feature map output by the feature pyramid is mapped back to the original image, several initial bounding boxes are generated as the candidate text regions predicted by the region extraction network, and the position offset of each initial bounding box Q_0 relative to a labeled bounding box G_d is computed: when the Jaccard coefficients between Q_0 and all labeled bounding boxes G_d are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e. when at least one labeled bounding box G_d has a Jaccard coefficient with Q_0 of no less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the largest Jaccard coefficient according to the formulas:

x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)

where x_0, y_0 are the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0, h_0 are the width and height of Q_0, Δx, Δy are the horizontal and vertical position offsets of the center point of Q_0 relative to the center point of G_d, and exp is the exponential operation. The training label of the region extraction network is thus obtained as:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
For the fast region classification regression branch network, the training labels can similarly be calculated as:

gt_rcnn = (Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn, P_rcnn)
for a split branch network, two types of target tags need to be generated: global labels for text instance segmentation and character labels for character semantic segmentation; for a given positive candidate text box r, firstly, obtaining a best matching horizontal rectangle, further obtaining a matching polygon and a character box, and then, shifting and resizing the matching polygon and the character box so as to align the candidate text box r with a target label with a preset height H and a preset width W according to the following formula:
Figure BDA0001618307750000161
By=(By0-min(ry))×H/(max(ry))
wherein (r)x,ry) Is the vertex of the candidate text box r, (B)x,By) And
Figure BDA0001618307750000162
are the updated vertices and the original vertices of the polygon and all character boxes, specifically rxSet of abscissas of all vertices of a candidate text box r, ryIs the set of ordinates of all the vertices of the candidate text box r,
Figure BDA0001618307750000163
similarly, the target global label X is then generated by drawing a standard polygon on a zero-initialized mask and filling the value to 1gFor the character label, the character label X is generated by using the center as the origin, reducing the standard character frame to one eighth of the size of the origin frame, avoiding the character masks from overlapping each other, drawing the reduced character frames on the zero initialization mask and using the corresponding category index padding of the reduced character framescIf C does not exist, all pixels in the character layer are set to be-1 and are ignored during optimization, and finally the segmentation branch overall label gt is obtainedmaskX, in combination with the above label gtrpn,gtrcnn,gtmaskGenerating the final training label as follows:
gt={Δxrpn,Δyrpn,Δhrpn,Δwrpn,Prpn,Δxrcnn,Δyrcnn
Δhrcnn,Δwrcnn,Prcnn,X};
(1.2.3) taking the standard training data set I_tr as the input of the recognition network model, features are extracted with the feature pyramid network module: the pictures of the standard training data set I_tr are input into the bottom-up ResNet-50 structure of the feature pyramid network, in which each group of convolutional layer units that does not change the feature map size is defined as a stage (stages {P2, P3, P4, P5, P6}), and the final output convolution features F of every stage are extracted. The top-down connections of the feature pyramid network module upsample the output convolution features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure of the feature pyramid network module fuses the features of each stage upsampled in the top-down pass with the features generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}; the process is shown in FIG. 3.
(1.2.4) inputting the features extracted by the feature pyramid network into the region extraction network, assigning anchors, adjusting the feature maps with the region-of-interest alignment method, and generating candidate text boxes:

For an input picture I_tr_k, 5 stages of features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network; according to the stages {P2, P3, P4, P5, P6}, the anchor scales at the different stages are defined as {32², 64², 128², 256², 512²}, and each scale has 3 aspect ratios {1:2, 1:1, 2:1}; 15 feature maps of different scales and ratios {Ftr_1, Ftr_2, …, Ftr_15} can thus be extracted, denoted Ftr_p with subscript p = 1, …, 15, as illustrated by the sketch below.
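The 15 scale/ratio combinations can be enumerated as follows; the helper name is hypothetical and only the anchor geometry is shown.

```python
# Anchor areas per stage {P2..P6} and the 3 aspect ratios shared by every stage.
SCALES = [32, 64, 128, 256, 512]   # anchor side lengths, area = scale**2
RATIOS = [(1, 2), (1, 1), (2, 1)]  # width : height

def anchor_shapes():
    """Return the 15 (width, height) anchor shapes: 5 scales x 3 aspect ratios,
    each keeping the area scale**2 while changing the aspect ratio."""
    shapes = []
    for s in SCALES:
        for rw, rh in RATIOS:
            k = (s * s / (rw * rh)) ** 0.5  # solve (k*rw) * (k*rh) = s**2
            shapes.append((k * rw, k * rh))
    return shapes

print(len(anchor_shapes()))  # 15 scale/ratio configurations, matching Ftr_1..Ftr_15
```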
Through the region-of-interest alignment operation, candidate text regions of fixed scale are generated from the features Ftr_p: a candidate text region R_rcnn of resolution 7 × 7 for the fast region classification regression network, and a candidate text region R_mask of resolution 16 × 64 for the segmentation branch. The classification branch predicts the probability P_rpn that each candidate text box is a correct text region bounding box, and the regression branch predicts the candidate text box offsets:

Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn).
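Region-of-interest alignment to the two fixed resolutions can be illustrated with torchvision's roi_align operator; the feature map size, box coordinates and spatial scale below are placeholder assumptions.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 200, 336)               # one FPN level (N, C, H, W), illustrative
boxes = torch.tensor([[0, 48.0, 40.0, 240.0, 88.0]])   # (batch_idx, x1, y1, x2, y2)

# 7 x 7 regions R_rcnn for the fast region classification regression branch
r_rcnn = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0)

# 16 x 64 regions R_mask for the segmentation branch (a wide shape that suits text)
r_mask = roi_align(features, boxes, output_size=(16, 64), spatial_scale=1.0)

print(r_rcnn.shape, r_mask.shape)  # torch.Size([1, 256, 7, 7]) torch.Size([1, 256, 16, 64])
```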
(1.2.5) the 7 × 7 candidate text regions R_rcnn generated by the region extraction network are input into the fast region classification regression branch network module, the loss function is computed through the classification and regression branches and back-propagated, and the predicted text bounding boxes are finally generated: the module is divided into a classification branch and a regression branch. The 7 × 7 candidate text region R_rcnn is input into the classification branch, and convolution operations output the classification score P_rcnn of the predicted bounding box, i.e. the probability that the bounding box is predicted as a positive text box, a decimal with value in [0, 1]. R_rcnn is also input into the regression branch, which outputs the predicted regression offset Y_rcnn = (Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn), composed of 4 decimals in [0, 1], i.e. the predicted position offsets of the center-point abscissa, ordinate, height and width of the predicted bounding box G_q, when predicted as a positive text box, relative to the center-point abscissa, ordinate, height and width of the labeled bounding box G_d.
(1.2.6) the 16 × 64 candidate text regions R_mask generated by the region extraction network are input into the segmentation branch network module, which generates 38 target segmentation maps based on instance segmentation and semantic segmentation operations: the segmentation branch network module comprises 4 convolutional layers Conv1, Conv2, Conv3, Conv4, one deconvolution layer DeConv and one final convolutional layer Conv5. The 16 × 64 candidate text box R_mask generated by the region extraction network is input into the segmentation branch module, and convolution, deconvolution and related operations finally generate 38 target segmentation maps of scale 32 × 128, {M_global, M_1, M_2, …, M_36, M_background}, outputting a value X for each pixel of each map, with value in [0, 1]. Among the output maps, the global segmentation map M_global directly predicts the text region polygon Pm = {pm_1, pm_2, …, pm_n}, while the character segmentation maps {M_1, M_2, …, M_36} and the character background segmentation map M_background predict the character sequence S_q according to the pixel voting algorithm.
(1.2.7) taking the training label gt as the expected output of the network and the predicted labels ŷ = {P_rpn, Y_rpn, P_rcnn, Y_rcnn, X} as the network prediction output, an objective loss function between the expected output and the predicted output is designed for the constructed network model: the training label gt calculated in step (1.2.2) is the expected output of the network, and the predicted labels of steps (1.2.4), (1.2.5) and (1.2.6) are the network prediction output. For the network model constructed in (1.2.1), the overall objective loss function consists of the loss functions of the region extraction network, the fast region classification regression branch network and the segmentation branch network, with the expression:

L(P_rpn, Y_rpn, P_rcnn, Y_rcnn, X) = L_rpn(P_rpn, Y_rpn) + α1·L_rcnn(P_rcnn, Y_rcnn) + α2·L_mask(X)
where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_rcnn(P_rcnn, Y_rcnn) is the loss function of the fast region classification regression branch network, L_mask(X) is the loss function of the segmentation branch network, and α1, α2 are the weight coefficients of the loss functions L_rcnn and L_mask respectively, both simply set to 1.
According to the designed overall objective loss function, the model is iteratively trained with the backpropagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, training first iterates on a synthetic text data set (SynthText) to obtain initial network parameters, and then continues on real data sets to fine-tune the network parameters.
(2) Text recognition is performed on the text picture to be recognized with the trained model, comprising the following sub-steps:
(2.1) features are extracted from the scene text picture to be detected and recognized and input into the fast region classification regression branch network to generate candidate text regions, which are filtered by a non-maximum suppression operation to obtain more accurate candidate text regions: the k-th picture I_tst_k of the data set I_tst to be detected is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the model generates initial bounding boxes, which are input into the fast region classification regression branch network. For each initial bounding box G_q, the classification branch outputs the prediction value P_rcnn of the classification score, i.e. the score with which G_q is predicted as a positive sample; the regression branch outputs a predicted regression offset Y_rcnn = (Δx_rcnn, Δy_rcnn, Δh_rcnn, Δw_rcnn) composed of 4 decimals, i.e. the position offsets of the center-point abscissa, ordinate, height and width of G_q, when predicted as a positive text box, relative to those of the labeled bounding box G_d; from these position offsets, the position Q_z of the quadrilateral text bounding box predicted by the network can be calculated.

The predicted text bounding boxes Q_z are filtered by a non-maximum suppression operation to obtain the output result: the network model regresses a horizontal quadrilateral position for every initial bounding box Q_0 on the feature maps Ftst_p that is predicted as positive text, and the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tst_k usually overlap each other, so non-maximum suppression is applied to the positions of all positive text quadrilaterals. The specific steps are: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) a non-maximum suppression operation (NMS) with a Jaccard coefficient of 0.2 is applied to the text boxes kept in the previous step to obtain the finally kept quadrilateral bounding boxes of positive text.
(2.2) the predicted candidate text regions are input into the segmentation branch network for text instance segmentation and character segmentation, generating a global text instance segmentation mask and character segmentation masks respectively; the polygonal word text region is obtained by computing the contour of the text region on the global text instance segmentation mask, and the character sequence is obtained by prediction with the pixel voting algorithm on the character segmentation masks: the predicted quadrilateral text bounding box position Q_z is input into the segmentation branch, which generates the 38 target segmentation maps. First, the contour of the text region is computed directly from the global text instance segmentation mask, giving the text region polygon. Second, the character sequence S_q is generated with the pixel voting algorithm.
For the 36 character segmentation maps {M_1, M_2, …, M_36}, the value p_ci(x, y) of a pixel in the i-th segmentation map represents the probability that the pixel p_g(x, y) at the corresponding position of the global text segmentation map is the character z_i, where z_i is the i-th of the 36 characters {0, 1, …, 9, a, b, …, z}; the probabilities at corresponding pixel positions across the 36 character segmentation maps sum to 1, i.e.

Σ_{i=1}^{36} p_ci(x, y) = 1.
For the character background segmentation map M_background, the map is first binarized; on the binarized background map, the set of character regions is defined as R = {r_1, r_2, …, r_n}, where r_i is the i-th character region on the character background segmentation map and n is the number of all characters on the background segmentation map.

The pixel voting algorithm proceeds as follows: first, the set of regions in the 36 character segmentation maps connected to the character region r_i of the character background segmentation map is defined as C_i = {c_i1, c_i2, …, c_i36}, where c_ij is the region block in the j-th character segmentation map corresponding to the i-th character region of the character background segmentation map; then, for the region r_i and the corresponding connected regions C_i, the predicted character is obtained with the pixel voting algorithm: first, the mean of the values of all pixels inside each c_ij of the connected regions C_i is calculated; second, the c_ij_max with the largest mean is found, and the character category z_j_max corresponding to its character map M_j_max is the predicted character of this character region; finally, performing this operation on every character region r_i of the character background segmentation map yields the final predicted character sequence S_q.
(2.3) The character sequence predicted by the segmentation branch is processed with a weighted edit distance algorithm, and the best matching word for the predicted sequence is found in a given dictionary to obtain the final recognition result: in the pixel voting stage, the probabilities of all character categories for each character region of the predicted sequence are available, and different weights are defined for the deletion, insertion and substitution operations according to these probabilities. For a deletion operation, the cost is the probability that the character was predicted as the currently deleted character; for an insertion operation, the cost is the average probability of the two characters adjacent to the insertion position; for a substitution operation, the cost is computed as max(1 − s1/s2, 0), where s1 and s2 are the probabilities of the candidate character and of the predicted character to be replaced. The predicted character string is matched against the given dictionary by the weighted edit distance algorithm with these deletion, insertion and substitution weights, and the predicted word is adjusted accordingly, which improves accuracy and yields the final recognition result. A sketch of this matching step follows.
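A sketch of the weighted edit distance under the stated costs, assuming probs[i] maps every character to its probability at position i of the predicted sequence, and simplifying the boundary insertion cost to 1; names are illustrative:

import numpy as np

def weighted_edit_distance(pred, word, probs):
    """pred: predicted string; word: dictionary word; probs: list of dicts,
    probs[i][c] = probability of character c at position i of pred."""
    n, m = len(pred), len(word)
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):                       # delete pred[i-1]
        D[i][0] = D[i - 1][0] + probs[i - 1][pred[i - 1]]
    for j in range(1, m + 1):                       # insert before pred[0]
        D[0][j] = D[0][j - 1] + 1.0                 # boundary cost simplified
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s2 = probs[i - 1][pred[i - 1]]          # predicted char probability
            s1 = probs[i - 1].get(word[j - 1], 0.0) # candidate char probability
            sub = 0.0 if pred[i - 1] == word[j - 1] else max(1 - s1 / s2, 0)
            # insertion cost: average prob of the two chars around the gap
            left = probs[i - 1][pred[i - 1]]
            right = probs[i][pred[i]] if i < n else left
            ins = (left + right) / 2
            dele = probs[i - 1][pred[i - 1]]        # deletion cost
            D[i][j] = min(D[i - 1][j] + dele,
                          D[i][j - 1] + ins,
                          D[i - 1][j - 1] + sub)
    return D[n][m]

def best_match(pred, lexicon, probs):
    return min(lexicon, key=lambda w: weighted_edit_distance(pred, w, probs))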

Claims (10)

1. An end-to-end identification method for scene texts with arbitrary shapes is characterized by comprising the following steps:
(1) training an arbitrarily-shaped scene text end-to-end recognition network model, comprising the following sub-steps:
(1.1) carrying out word-level labeling of the multidirectional text in all pictures of an original data set, the labels being the clockwise polygon vertex coordinates of each word-level text bounding box and the character sequence of each word, to obtain a labeled standard training data set;
(1.2) defining a scene text end-to-end recognition network model in any shape, calculating a training label according to the standard training data set with labels in the step (1.1), designing a loss function, and training the scene text end-to-end recognition network by using a reverse conduction method to obtain the scene text end-to-end recognition network model; the method comprises the following steps:
(1.2.1) constructing a scene text end-to-end identification network model in any shape, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a rapid region classification regression branch and a segmentation branch;
(1.2.2) generating a horizontal initial bounding box on an original image according to the feature map, and generating training labels for a region extraction network module, a fast region classification regression branch network module and a segmentation branch network module in the recognition network model;
(1.2.3) taking the standard training data set Itr as input to the recognition network model, and extracting features using the feature pyramid network module;
(1.2.4) inputting the features extracted by the feature pyramid network into the region extraction network, generating candidate text boxes on the feature maps through the anchor distribution, and adjusting them to fixed-scale candidate regions using the region-of-interest alignment method;
(1.2.5) inputting the candidate text boxes into the rapid regional classification regression network module, calculating the loss function through the two branches of classification and regression and back-propagating it, and finally generating the predicted text bounding boxes;
(1.2.6) inputting the candidate text boxes into the segmentation branch network module, and generating target segmentation layers based on instance segmentation and semantic segmentation;
(1.2.7) taking the training label gt as the expected output of the network and the predicted labels (Prpn, Yrpn, Prcnn, Yrcnn, X) as the network prediction output, and designing a target loss function between the expected output and the predicted output for the constructed network model;
(2) the character detection and recognition of the text picture of the scene to be detected and recognized by utilizing the trained model comprises the following substeps:
(2.1) inputting the extracted features of the scene text picture to be detected and recognized into the fast region classification regression branch network to generate candidate text regions, and filtering the candidate text regions with a non-maximum suppression operation to obtain more accurate candidate text regions;
(2.2) inputting the predicted candidate text regions into the segmentation branch network to perform text instance segmentation and character segmentation, respectively generating a global text instance segmentation mask and character segmentation masks, obtaining a polygonal word text region by computing the contour of the text region on the global text instance segmentation mask, and obtaining a character sequence by prediction with the pixel voting algorithm on the character segmentation masks;
and (2.3) processing the character sequence predicted by the segmentation branch through a weighted edit distance algorithm, finding the best matched word of the predicted sequence in the given dictionary, and obtaining the final recognition result.
2. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1, wherein the step (1.2.1) of detecting and recognizing the network model specifically comprises:
the recognition network model consists of a feature pyramid structure network, a region extraction network, a fast region classification regression branch network and a segmentation branch network; the feature pyramid structure network takes the ResNet-50 deep convolutional neural network as the base network and adds bottom-up connections, top-down connections and lateral connections, and is used to extract and fuse features of different resolutions from the input standard training dataset pictures; the extracted features of different scales are input into the region extraction network to obtain candidate text regions, which after region-of-interest alignment become candidate text regions of fixed scale and are input into the fast region classification regression branch network and the segmentation branch network respectively; the candidate text regions with resolution 7 × 7 extracted by the region extraction network are input into the fast region classification regression network, whose classification branch predicts the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and whose regression branch computes the offset of the candidate text region relative to the real text region to adjust the position of the candidate text region; the segmentation branch network is composed of four convolutional layers Conv1, Conv2, Conv3, Conv4, a deconvolutional layer Deconv and a final convolutional layer Conv5; the candidate text regions with resolution 16 × 64 extracted by the region extraction network are input into the segmentation branch, and 38 target segmentation layers with resolution 32 × 128 are finally generated through the convolution and deconvolution operations; among these, 1 global text instance segmentation layer is used to predict the precise position of the text region, while the 36 character segmentation layers and 1 character background segmentation layer are used to obtain the predicted character sequence through the pixel voting algorithm. A sketch of this segmentation branch follows.
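A rough PyTorch sketch of the segmentation branch described in this claim; the channel widths are assumptions, while the layer sequence, the 16 × 64 input, the 32 × 128 output and the 38 maps come from the text:

import torch
import torch.nn as nn

class SegBranch(nn.Module):
    """Conv1-4, one 2x deconvolution, and Conv5 producing 38 maps
    (1 global text + 36 characters + 1 background)."""
    def __init__(self, in_ch=256, mid_ch=256):
        super().__init__()
        self.convs = nn.Sequential(                  # Conv1..Conv4
            *[nn.Sequential(nn.Conv2d(in_ch if i == 0 else mid_ch,
                                      mid_ch, 3, padding=1),
                            nn.ReLU(inplace=True)) for i in range(4)])
        self.deconv = nn.ConvTranspose2d(mid_ch, mid_ch, 2, stride=2)  # 2x up
        self.conv5 = nn.Conv2d(mid_ch, 38, 1)        # 38 target layers
    def forward(self, x):                            # x: (N, 256, 16, 64)
        return self.conv5(self.deconv(self.convs(x)))  # (N, 38, 32, 128)

masks = SegBranch()(torch.zeros(2, 256, 16, 64))
assert masks.shape == (2, 38, 32, 128)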
3. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (1.2.2) is specifically as follows:
for the labeled standard training dataset Itr, the ground-truth label of an input picture Itrk includes a polygon P = {p1, p2 … pm} representing a text region and a character label C = {c1 = (cc1, cl1), c2 = (cc2, cl2), …, cn = (ccn, cln)} indicating the category and position of each character, where Pi is the polygonal bounding box of a text region in the picture Itrk, pij = (xij, yij) is the coordinate of the j-th vertex of the polygon Pi, m denotes the number of polygonal text labels, and cck and clk are respectively the category and position of the k-th character in the text;
for the given standard training dataset Itr, the polygon P = {p1, p2 … pm} in the dataset label is first converted into the minimum horizontal rectangular bounding box of the polygonal text label box, denoted Gd(x, y, h, w) with rectangle center point (x, y), height h and width w; for the region extraction network, according to the standard training dataset labeled bounding boxes Gd(x, y, h, w), each pixel on each feature map output by the feature pyramid corresponds to a position on the original image and generates multiple initial bounding boxes; the Jaccard coefficient of each initial bounding box Q0 relative to the labeled bounding boxes Gd of the standard training dataset is computed: when the Jaccard coefficients of all labeled bounding boxes Gd with the initial bounding box Q0 are less than 0.5, the initial bounding box Q0 is labeled as negative-class non-text, with class label Prpn of value 0; otherwise, i.e. there is at least one labeled bounding box Gd whose Jaccard coefficient with Q0 is not less than 0.5, Q0 is labeled as positive-class text, with class label Prpn of value 1, and the position offset is calculated relative to the labeled box with the maximum Jaccard coefficient, according to the formulas:
x=x0+w0Δx
y=y0+h0Δy
w=w0exp(Δw)
h=h0exp(Δh)
where x0 and y0 are respectively the abscissa and ordinate of the center point of the initial bounding box Q0, w0 and h0 are respectively the width and height of the initial bounding box Q0, Δx and Δy are respectively the horizontal and vertical offsets of the center point of Q0 relative to the center point of Gd, and exp is the exponential operation; the training label of the region extraction network is obtained as follows (a sketch of this offset encoding appears after this claim):
gtrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn,Prpn)
for the fast region classification regression branch network, similarly, the training labels can be calculated as follows:
gtrcnn=(Δxrcnn,Δyrcnn,Δhrcnn,Δwrcnn,Prcnn);
for the segmentation branch network, two types of target labels need to be generated: a global label for text instance segmentation and character labels for character semantic segmentation; for a given positive candidate text box r, the best matching horizontal rectangle is first obtained, from which the matching polygon and character boxes are further obtained, and the matching polygon and character boxes are then shifted and resized to align the candidate text box r with a target label of preset height H and width W, according to the following formulas:
Bx = (Bx0 − min(rx)) × W / (max(rx) − min(rx))
By = (By0 − min(ry)) × H / (max(ry) − min(ry))
wherein (rx, ry) are the vertices of the candidate text box r, and (Bx, By) and (Bx0, By0) are respectively the updated vertices and the original vertices of the polygon and of all character boxes; specifically, rx is the set of abscissas of all vertices of the candidate text box r, ry is the set of ordinates of all vertices of the candidate text box r, and Bx, Bx0, By, By0 are the corresponding updated and original vertex coordinate sets.
Then, the target global label Xg is generated by drawing the normalized polygon on a zero-initialized mask and filling its value with 1; for the character label Xc, taking each character's center as the origin, the normalized character box is shrunk to one eighth of the size of the original box so that character masks do not overlap each other, and the shrunk character boxes are drawn on a zero-initialized mask and filled with their corresponding category indices; if C does not exist, all pixels in the character layers are set to −1 and ignored during optimization; finally the overall segmentation branch label gtmask = X is obtained, and combining the above labels gtrpn, gtrcnn, gtmask, the final training label is generated as:
gt={Δxrpn,Δyrpn,Δhrpn,Δwrpn,Prpn,Δxrcnn,Δyrcnn,Δhrcnn,Δwrcnn,Prcnn,X}.
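The four box-offset formulas in this claim (x = x0 + w0Δx, etc.) amount to the following encoding/decoding pair; a minimal NumPy sketch with illustrative function names:

import numpy as np

def decode_box(anchor, deltas):
    """anchor = (x0, y0, w0, h0), deltas = (dx, dy, dw, dh);
    returns the decoded box (x, y, w, h) per the claim's formulas."""
    x0, y0, w0, h0 = anchor
    dx, dy, dw, dh = deltas
    return (x0 + w0 * dx, y0 + h0 * dy, w0 * np.exp(dw), h0 * np.exp(dh))

def encode_box(anchor, gt):
    """Inverse mapping used to build the training labels gt_rpn / gt_rcnn."""
    x0, y0, w0, h0 = anchor
    x, y, w, h = gt
    return ((x - x0) / w0, (y - y0) / h0, np.log(w / w0), np.log(h / h0))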
4. the method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (1.2.3) is specifically as follows:
pictures of the standard training dataset Itr are input into the bottom-up ResNet-50 network structure of the feature pyramid network; the convolutional layer units that do not change the size of the feature map are defined as stages (stages {P2, P3, P4, P5, P6}), and the final output convolutional features F of each stage are extracted; the top-down connections of the feature pyramid network module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure of the feature pyramid network module fuses the features of each stage upsampled in the top-down process with the features generated in the bottom-up process, producing the final features {F2, F3, F4, F5, F6}. A sketch of this fusion follows.
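A rough PyTorch sketch of the top-down and lateral fusion just described; the channel widths are the usual ResNet-50 stage widths and an assumption here, as are nearest-neighbor upsampling and exact factor-2 stage strides:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """Fuse ResNet-50 stage outputs C2..C5 into pyramid features P2..P6."""
    def __init__(self, in_chs=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_chs)
    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2)  # top-down
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2)
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2)
        p2, p3, p4, p5 = (s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5)))
        p6 = F.max_pool2d(p5, 1, stride=2)   # extra coarse level
        return p2, p3, p4, p5, p6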
5. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (1.2.4) is specifically as follows:
for an input picture Itrk, 5 stage features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network, and according to the stages {P2, P3, P4, P5, P6} the anchor feature scales of the different stages are defined as {32², 64², 128², 256², 512²}, with each scale layer having 3 aspect ratios {1:2, 1:1, 2:1}; 15 feature maps of different scales and ratios {Ftr1, Ftr2, …, Ftr15} can thus be extracted, denoted Ftrp with subscript p = 1, …, 15;
through the region-of-interest alignment operation, candidate text regions of fixed scale are generated on the features Ftrp: candidate text regions Rrcnn with resolution 7 × 7 are generated for the fast region classification regression network, and candidate text regions Rmask with resolution 16 × 64 are generated for the segmentation branch; the classification branch predicts the probability Prpn that each candidate text box is a correct text region bounding box, and the regression branch predicts the candidate text box offsets Yrpn = (Δxrpn, Δyrpn, Δhrpn, Δwrpn). A sketch of the scale/ratio combinations follows.
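The 15 scale/ratio combinations can be enumerated as follows; treating the scale as the square root of the anchor area and the ratio as h:w is an assumption of this sketch:

import numpy as np

SCALES = [32, 64, 128, 256, 512]   # anchor areas are s^2, one per stage
RATIOS = [0.5, 1.0, 2.0]           # aspect ratios 1:2, 1:1, 2:1 (h:w assumed)

def anchor_shapes():
    """Enumerate the 15 (h, w) anchor shapes, keeping the area s^2 fixed."""
    shapes = []
    for s in SCALES:
        for r in RATIOS:
            w = s / np.sqrt(r)     # solve w*h = s^2 with h/w = r
            shapes.append((w * r, w))
    return shapes

print(len(anchor_shapes()))        # 15 scale/ratio combinations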
6. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (1.2.5) is specifically as follows:
the fast region classification regression network is divided into two network branches, classification and regression; the candidate text region Rrcnn of size 7 × 7 is input into the classification branch, which through convolution operations outputs the classification score Prcnn of the predicted bounding box, i.e. the probability that the bounding box is predicted as a positive-class text box, a decimal value between [0, 1]; Rrcnn is input into the regression branch, which outputs the predicted regression offset Yrcnn = (Δxrcnn, Δyrcnn, Δhrcnn, Δwrcnn) consisting of 4 decimals between [0, 1], being the position offsets of the center-point abscissa and ordinate and of the height and width of the predicted bounding box Gq, when predicted as a positive-class text box, relative to the center-point abscissa and ordinate and the height and width of the labeled bounding box Gd. A sketch of these two heads follows.
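A rough PyTorch sketch of the two branches; the fully-connected layout and hidden width are assumptions, and the sigmoid on the regression output only mirrors the claim's statement that the four offsets are decimals in [0, 1]:

import torch
import torch.nn as nn

class FastHead(nn.Module):
    """Classification/regression heads fed by the 7x7 aligned regions."""
    def __init__(self, in_ch=256, hidden=1024):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(in_ch * 7 * 7, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, 1)   # Prcnn: positive-text probability
        self.reg = nn.Linear(hidden, 4)   # Yrcnn: (dx, dy, dh, dw)
    def forward(self, rois):              # rois: (N, 256, 7, 7)
        h = self.fc(rois)
        return torch.sigmoid(self.cls(h)), torch.sigmoid(self.reg(h))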
7. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (1.2.6) is specifically as follows:
the segmentation branch network module comprises 4 convolutional layers Conv1, Conv2, Conv3, Conv4, a deconvolutional layer Deconv, and a final convolutional layer Conv5; the candidate text box Rmask of size 16 × 64 generated by the region extraction network is input into the segmentation branch module, and 38 target segmentation layers {Mglobal, M1, M2, …, M36, Mbackground} of scale 32 × 128 are finally generated through the convolution, deconvolution and related operations; the value X of each pixel in the output layers is a number between [0, 1]; from the global segmentation layer Mglobal, the text region polygon Pm = {pm1, pm2 … pmn} can be directly predicted, and from the character segmentation layers {M1, M2, …, M36} and the character background segmentation layer Mbackground, the character sequence Sq can be predicted according to the pixel voting algorithm.
8. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (1.2.7) is specifically as follows:
taking the training label gt calculated in step (1.2.2) as the expected output of the network, and taking the predicted labels of steps (1.2.4), (1.2.5) and (1.2.6), namely (Prpn, Yrpn), (Prcnn, Yrcnn) and X, as the network prediction output, a target loss function between the expected output and the predicted output is designed for the network model constructed in (1.2.1); the overall target loss function is composed of the loss functions of the region extraction network, the fast region classification regression branch network and the segmentation branch network, and its expression is as follows:
L(Prpn,Yrpn,Prcnn,Yrcnn,X)=Lrpn(Prpn,Yrpn)+α1Lrcnn(Prcnn,Yrcnn)+α2Lmask(X)
where Lrpn(Prpn, Yrpn) is the loss function of the region extraction network, Lrcnn(Prcnn, Yrcnn) is the loss function of the fast region classification regression branch network, Lmask(X) is the loss function of the segmentation branch network, and α1 and α2 are the weight coefficients of the loss functions Lrcnn and Lmask respectively, both simply set to 1;
according to the designed overall target loss function, the model is iteratively trained with the back-propagation algorithm to minimize the overall target loss function and obtain the optimal network model; for the scene character detection and recognition task, the training process first performs iterative training on a synthetic text dataset to obtain initial network parameters, and then trains on a real dataset to fine-tune the network parameters. A sketch of the combined objective follows.
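The combined objective of this claim amounts to the following weighted sum (a sketch; the individual loss terms and the optimizer are assumed to be provided by the surrounding training loop):

def total_loss(L_rpn, L_rcnn, L_mask, alpha1=1.0, alpha2=1.0):
    # L = Lrpn + α1·Lrcnn + α2·Lmask, with both weights simply set to 1
    return L_rpn + alpha1 * L_rcnn + alpha2 * L_mask

# One training iteration under this objective (assumed PyTorch-style tensors):
# loss = total_loss(L_rpn, L_rcnn, L_mask)
# optimizer.zero_grad(); loss.backward(); optimizer.step()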
9. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (2.1) is specifically as follows:
the i-th picture Itstk of the dataset Itst to be detected is input into the model trained in step (1.2); after the model passes through the feature pyramid network and the region extraction network, the generated initial bounding boxes are input into the fast region classification regression branch network; for each initial bounding box Gq, the classification branch outputs a predicted classification score Prcnn as the score with which the initial bounding box Gq is predicted as a positive-class sample; the regression branch outputs a predicted regression offset Yrcnn = (Δxrcnn, Δyrcnn, Δhrcnn, Δwrcnn) consisting of 4 decimals, being the position offsets of the center-point abscissa and ordinate and of the height and width of Gq, when predicted as a positive-class text box, relative to the center-point abscissa and ordinate and the height and width of the labeled bounding box Gd; from these position offsets, the position Qz of the quadrilateral text bounding box predicted by the network can be calculated;
For the predicted text bounding boxes Qz, a non-maximum suppression operation is carried out for filtering to obtain the output result: the network model regresses every initial bounding box Q0 predicted as positive-class text on the feature maps Ftstp to a horizontal quadrilateral position, and for the same test picture Itstk the positive-class text quadrilaterals regressed on the different feature maps usually overlap each other, so a non-maximum suppression operation needs to be performed on the positions of all positive-class text quadrilaterals, with the following specific steps: 1) a predicted text bounding box is retained as a detected text box if and only if its text classification score Prcnn is greater than or equal to 0.5; 2) the non-maximum suppression operation is performed on the text boxes retained in the previous step with a Jaccard coefficient threshold of 0.2, yielding the finally retained quadrilateral bounding boxes of positive-class text.
10. The method for recognizing the scene text in the arbitrary shape end-to-end as claimed in claim 1 or 2, wherein the step (2.2) is specifically as follows:
the predicted quadrilateral text bounding box position Qz is input into the segmentation branch to generate 38 target segmentation layers; first, the contour of the text region is computed directly from the global text instance segmentation mask to obtain the polygon of the text region; second, the character sequence Sq is generated using the pixel voting algorithm;
For the 36 character segmentation layers {M1, M2, …, M36}, the value pci(x, y) of a pixel in the i-th segmentation layer represents the probability that the pixel pg(x, y) at the corresponding position of the global text segmentation layer is the character zi, where zi is the i-th of the 36 characters {0, 1, …, 9, a, b, …, z}; the probabilities at corresponding pixel positions across the 36 character segmentation layers sum to 1, namely
Σ(i=1..36) pci(x, y) = 1
For the character background segmentation layer Mbackground, the layer is first binarized; then, on the binarized background map, the set of character regions on the background layer is defined as R = {r1, r2, …, rn}, where ri is the i-th character region on the character background segmentation layer and n is the number of all characters on the background segmentation layer;
the pixel voting algorithm proceeds as follows: first, for the character region ri of the character background segmentation layer, the set of its corresponding connected regions in the 36 character segmentation layers is defined as Ci = {ci1, ci2, …, ci36}, where cij is the region block in the j-th character segmentation layer corresponding to the i-th character region of the character background segmentation layer; then, for the region ri and the corresponding connected regions Ci, the predicted character is obtained by the pixel voting algorithm in these steps: first, the mean of the values of all pixels in each cij of the connected regions Ci is computed; second, the cij_max with the largest mean is found, and the character class zj_max corresponding to its character layer Mj_max is taken as the predicted character of this character region; finally, performing this operation on every character region ri of the character background segmentation layer yields the final predicted character sequence Sq.
CN201810294058.XA 2018-04-04 2018-04-04 End-to-end identification method for scene text with any shape Active CN108549893B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810294058.XA CN108549893B (en) 2018-04-04 2018-04-04 End-to-end identification method for scene text with any shape
PCT/CN2019/080354 WO2019192397A1 (en) 2018-04-04 2019-03-29 End-to-end recognition method for scene text in any shape

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810294058.XA CN108549893B (en) 2018-04-04 2018-04-04 End-to-end identification method for scene text with any shape

Publications (2)

Publication Number Publication Date
CN108549893A CN108549893A (en) 2018-09-18
CN108549893B true CN108549893B (en) 2020-03-31

Family

ID=63514169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810294058.XA Active CN108549893B (en) 2018-04-04 2018-04-04 End-to-end identification method for scene text with any shape

Country Status (2)

Country Link
CN (1) CN108549893B (en)
WO (1) WO2019192397A1 (en)

Families Citing this family (329)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 End-to-end identification method for scene text with any shape
CN109492672A (en) * 2018-10-17 2019-03-19 福州大学 Under a kind of natural scene quickly, the positioning of the bank card of robust and classification method
CN109583449A (en) * 2018-10-29 2019-04-05 深圳市华尊科技股份有限公司 Character identifying method and Related product
CN109299274B (en) * 2018-11-07 2021-12-17 南京大学 Natural scene text detection method based on full convolution neural network
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment
CN112789623A (en) * 2018-11-16 2021-05-11 北京比特大陆科技有限公司 Text detection method, device and storage medium
CN109559300A (en) * 2018-11-19 2019-04-02 上海商汤智能科技有限公司 Image processing method, electronic equipment and computer readable storage medium
CN109753956A (en) * 2018-11-23 2019-05-14 西北工业大学 The multi-direction text detection algorithm extracted based on dividing candidate area
CN109544564A (en) * 2018-11-23 2019-03-29 清华大学深圳研究生院 A kind of medical image segmentation method
CN109785359B (en) * 2018-11-27 2020-12-04 北京理工大学 Video target detection method based on depth feature pyramid and tracking loss
EP3660731B1 (en) * 2018-11-28 2024-05-22 Tata Consultancy Services Limited Digitization of industrial inspection sheets by inferring visual relations
CN111259878A (en) * 2018-11-30 2020-06-09 中移(杭州)信息技术有限公司 Method and equipment for detecting text
CN111292334B (en) * 2018-12-10 2023-06-09 北京地平线机器人技术研发有限公司 Panoramic image segmentation method and device and electronic equipment
CN109753966A (en) * 2018-12-16 2019-05-14 初速度(苏州)科技有限公司 A kind of Text region training system and method
CN109740484A (en) * 2018-12-27 2019-05-10 斑马网络技术有限公司 The method, apparatus and system of road barrier identification
CN110008808B (en) * 2018-12-29 2021-04-09 北京迈格威科技有限公司 Panorama segmentation method, device and system and storage medium
CN109886286B (en) * 2019-01-03 2021-07-23 武汉精测电子集团股份有限公司 Target detection method based on cascade detector, target detection model and system
CN111489283B (en) * 2019-01-25 2023-08-11 鸿富锦精密工业(武汉)有限公司 Picture format conversion method and device and computer storage medium
CN109858432B (en) * 2019-01-28 2022-01-04 北京市商汤科技开发有限公司 Method and device for detecting character information in image and computer equipment
CN109829437B (en) * 2019-02-01 2022-03-25 北京旷视科技有限公司 Image processing method, text recognition device and electronic system
CN109977997B (en) * 2019-02-13 2021-02-02 中国科学院自动化研究所 Image target detection and segmentation method based on convolutional neural network rapid robustness
CN110176017A (en) * 2019-03-01 2019-08-27 北京纵目安驰智能科技有限公司 A kind of Model for Edge Detection based on target detection, method and storage medium
CN110008950A (en) * 2019-03-13 2019-07-12 南京大学 The method of text detection in the natural scene of a kind of pair of shape robust
CN109948510B (en) * 2019-03-14 2021-06-11 北京易道博识科技有限公司 Document image instance segmentation method and device
CN109919239A (en) * 2019-03-15 2019-06-21 尹显东 A kind of diseases and pests of agronomic crop intelligent detecting method based on deep learning
CN109948533B (en) * 2019-03-19 2021-02-09 讯飞智元信息科技有限公司 Text detection method, device and equipment and readable storage medium
CN109977949B (en) * 2019-03-20 2024-01-26 深圳华付技术股份有限公司 Frame fine adjustment text positioning method and device, computer equipment and storage medium
CN111723627A (en) * 2019-03-22 2020-09-29 北京搜狗科技发展有限公司 Image processing method and device and electronic equipment
CN111753575A (en) * 2019-03-26 2020-10-09 杭州海康威视数字技术股份有限公司 Text recognition method, device and equipment
CN109977952B (en) * 2019-03-27 2021-10-22 深动科技(北京)有限公司 Candidate target detection method based on local maximum
CN109934229B (en) * 2019-03-28 2021-08-03 网易有道信息技术(北京)有限公司 Image processing method, device, medium and computing equipment
CN110135248A (en) * 2019-04-03 2019-08-16 华南理工大学 A kind of natural scene Method for text detection based on deep learning
CN110147786B (en) 2019-04-11 2021-06-29 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in image
CN110032969B (en) * 2019-04-11 2021-11-05 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in image
CN110059753A (en) * 2019-04-19 2019-07-26 北京朗镜科技有限责任公司 Model training method, interlayer are every recognition methods, device, equipment and medium
CN110321923B (en) * 2019-05-10 2021-05-04 上海大学 Target detection method, system and medium for fusion of different-scale receptive field characteristic layers
CN112001406B (en) * 2019-05-27 2023-09-08 杭州海康威视数字技术股份有限公司 Text region detection method and device
CN110147788B (en) * 2019-05-27 2021-09-21 东北大学 Feature enhancement CRNN-based metal plate strip product label character recognition method
CN110276279B (en) * 2019-06-06 2020-06-16 华东师范大学 Method for detecting arbitrary-shape scene text based on image segmentation
CN110348445B (en) * 2019-06-06 2021-07-27 华中科技大学 Instance segmentation method fusing void convolution and edge information
CN110334705B (en) * 2019-06-25 2021-08-03 华中科技大学 Language identification method of scene text image combining global and local information
CN110263877B (en) * 2019-06-27 2022-07-08 中国科学技术大学 Scene character detection method
CN110276351B (en) * 2019-06-28 2022-09-06 中国科学技术大学 Multi-language scene text detection and identification method
CN110287960B (en) * 2019-07-02 2021-12-10 中国科学院信息工程研究所 Method for detecting and identifying curve characters in natural scene image
CN110443140B (en) * 2019-07-05 2023-10-03 平安科技(深圳)有限公司 Text positioning method, device, computer equipment and storage medium
CN110443258B (en) * 2019-07-08 2021-03-02 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN110443141A (en) * 2019-07-08 2019-11-12 深圳中兴网信科技有限公司 Data set processing method, data set processing unit and storage medium
CN110503090B (en) * 2019-07-09 2021-11-09 中国科学院信息工程研究所 Character detection network training method based on limited attention model, character detection method and character detector
CN110363140B (en) * 2019-07-15 2022-11-11 成都理工大学 Human body action real-time identification method based on infrared image
CN110490191B (en) * 2019-07-16 2022-03-04 北京百度网讯科技有限公司 Training method and system of end-to-end model, and Chinese recognition method and system
CN112241736B (en) * 2019-07-19 2024-01-26 上海高德威智能交通系统有限公司 Text detection method and device
CN110427852B (en) * 2019-07-24 2022-04-15 北京旷视科技有限公司 Character recognition method and device, computer equipment and storage medium
CN113159016A (en) * 2019-07-26 2021-07-23 第四范式(北京)技术有限公司 Text position positioning method and system and model training method and system
CN110895695B (en) * 2019-07-31 2023-02-24 上海海事大学 Deep learning network for character segmentation of text picture and segmentation method
CN110503085A (en) * 2019-07-31 2019-11-26 联想(北京)有限公司 A kind of data processing method, electronic equipment and computer readable storage medium
CN110674807A (en) * 2019-08-06 2020-01-10 中国科学院信息工程研究所 Curved scene character detection method based on semi-supervised and weakly supervised learning
CN110458132A (en) * 2019-08-19 2019-11-15 河海大学常州校区 One kind is based on random length text recognition method end to end
CN110516732B (en) * 2019-08-22 2022-03-15 北京地平线机器人技术研发有限公司 Training method of feature pyramid network, and method and device for extracting image features
CN110852324A (en) * 2019-08-23 2020-02-28 上海撬动网络科技有限公司 Deep neural network-based container number detection method
CN110598698B (en) * 2019-08-29 2022-02-15 华中科技大学 Natural scene text detection method and system based on adaptive regional suggestion network
CN110533113B (en) * 2019-09-04 2022-11-11 湖南大学 Method for detecting branch points of tree structure in digital image
CN110533041B (en) * 2019-09-05 2022-07-01 重庆邮电大学 Regression-based multi-scale scene text detection method
CN110738207B (en) * 2019-09-10 2020-06-19 西南交通大学 Character detection method for fusing character area edge information in character image
CN110705535A (en) * 2019-09-19 2020-01-17 安徽七天教育科技有限公司 Method for automatically detecting test paper layout character line
CN110807764A (en) * 2019-09-20 2020-02-18 成都智能迭迦科技合伙企业(有限合伙) Lung cancer screening method based on neural network
CN110751154B (en) * 2019-09-27 2022-04-08 西北工业大学 Complex environment multi-shape text detection method based on pixel-level segmentation
CN110717427B (en) * 2019-09-27 2022-08-12 华中科技大学 Multi-direction object detection method based on vertex sliding
CN110689012A (en) * 2019-10-08 2020-01-14 山东浪潮人工智能研究院有限公司 End-to-end natural scene text recognition method and system
CN111626279B (en) * 2019-10-15 2023-06-02 西安网算数据科技有限公司 Negative sample labeling training method and highly-automatic bill identification method
CN111126401B (en) * 2019-10-17 2023-06-02 安徽清新互联信息科技有限公司 License plate character recognition method based on context information
CN111062381B (en) * 2019-10-17 2023-09-01 安徽清新互联信息科技有限公司 License plate position detection method based on deep learning
CN110766707B (en) * 2019-10-22 2022-09-23 河海大学常州校区 Cavitation bubble image processing method based on multi-operator fusion edge detection technology
CN111222396B (en) * 2019-10-23 2023-07-18 江苏大学 All-weather multispectral pedestrian detection method
CN110765733A (en) * 2019-10-24 2020-02-07 科大讯飞股份有限公司 Text normalization method, device, equipment and storage medium
CN110781967B (en) * 2019-10-29 2022-08-19 华中科技大学 Real-time text detection method based on differentiable binarization
CN110837835B (en) * 2019-10-29 2022-11-08 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN110807422B (en) * 2019-10-31 2023-05-23 华南理工大学 Natural scene text detection method based on deep learning
CN110796143A (en) * 2019-10-31 2020-02-14 天津大学 Scene text recognition method based on man-machine cooperation
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN110956088B (en) * 2019-10-31 2023-06-30 北京易道博识科技有限公司 Overlapped text line positioning and segmentation method and system based on deep learning
CN112749599A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Image enhancement method and device and server
CN110837796B (en) * 2019-11-05 2022-08-19 泰康保险集团股份有限公司 Image processing method and device
CN111104962B (en) * 2019-11-05 2023-04-18 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN112825141B (en) * 2019-11-21 2023-02-17 上海高德威智能交通系统有限公司 Method and device for recognizing text, recognition equipment and storage medium
CN111010605B (en) * 2019-11-26 2021-08-17 杭州东信北邮信息技术有限公司 Method for displaying video picture-in-picture window
CN111062386B (en) * 2019-11-28 2023-12-29 大连交通大学 Natural scene text detection method based on depth pyramid attention and feature fusion
CN110969129B (en) * 2019-12-03 2023-09-01 山东浪潮科学研究院有限公司 End-to-end tax bill text detection and recognition method
CN110929678B (en) * 2019-12-04 2023-04-25 山东省计算中心(国家超级计算济南中心) Method for detecting vulvovaginal candida spores
CN111008600B (en) * 2019-12-06 2023-04-07 中国科学技术大学 Lane line detection method
CN111178148B (en) * 2019-12-06 2023-06-02 天津大学 Ground target geographic coordinate positioning method based on unmanned aerial vehicle vision system
CN111061904B (en) * 2019-12-06 2023-04-18 武汉理工大学 Local picture rapid detection method based on image content identification
CN110991440B (en) * 2019-12-11 2023-10-13 易诚高科(大连)科技有限公司 Pixel-driven mobile phone operation interface text detection method
CN112990188A (en) * 2019-12-13 2021-06-18 华为技术有限公司 Text recognition method and device
CN111104892A (en) * 2019-12-16 2020-05-05 武汉大千信息技术有限公司 Human face tampering identification method based on target detection, model and identification method thereof
CN111061915B (en) * 2019-12-17 2023-04-18 中国科学技术大学 Video character relation identification method
CN111079649B (en) * 2019-12-17 2023-04-07 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN110991403A (en) * 2019-12-19 2020-04-10 同方知网(北京)技术有限公司 Document information fragmentation extraction method based on visual deep learning
CN111126386B (en) * 2019-12-20 2023-06-30 复旦大学 Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN111144469B (en) * 2019-12-20 2023-05-02 复旦大学 End-to-end multi-sequence text recognition method based on multi-dimensional associated time sequence classification neural network
CN111008613B (en) * 2019-12-24 2023-12-19 黑龙江文旅信息科技有限公司 High-density traffic positioning and monitoring method based on field
CN111126266B (en) * 2019-12-24 2023-05-05 上海智臻智能网络科技股份有限公司 Text processing method, text processing system, equipment and medium
CN111046840B (en) * 2019-12-26 2023-06-23 天津理工大学 Personnel safety monitoring method and system based on artificial intelligence in pollution remediation environment
CN111144411B (en) * 2019-12-27 2024-02-27 南京大学 Irregular text correction and identification method and system based on saliency map
CN111160242A (en) * 2019-12-27 2020-05-15 上海眼控科技股份有限公司 Image target detection method, system, electronic terminal and storage medium
CN111160352B (en) * 2019-12-27 2023-04-07 创新奇智(北京)科技有限公司 Workpiece metal surface character recognition method and system based on image segmentation
CN111160372B (en) * 2019-12-30 2023-04-18 沈阳理工大学 Large target identification method based on high-speed convolutional neural network
CN111126410B (en) * 2019-12-31 2022-11-18 讯飞智元信息科技有限公司 Character recognition method, device, equipment and readable storage medium
CN111178358A (en) * 2019-12-31 2020-05-19 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111178364A (en) * 2019-12-31 2020-05-19 北京奇艺世纪科技有限公司 Image identification method and device
CN111145202B (en) * 2019-12-31 2024-03-08 北京奇艺世纪科技有限公司 Model generation method, image processing method, device, equipment and storage medium
CN111191611B (en) * 2019-12-31 2023-10-13 同济大学 Traffic sign label identification method based on deep learning
CN111242122B (en) * 2020-01-07 2023-09-08 浙江大学 Lightweight deep neural network rotating target detection method and system
CN111242027B (en) * 2020-01-13 2023-04-14 北京工业大学 Unsupervised learning scene feature rapid extraction method fusing semantic information
CN111310746B (en) * 2020-01-15 2024-03-01 支付宝实验室(新加坡)有限公司 Text line detection method, model training method, device, server and medium
CN111291759A (en) * 2020-01-17 2020-06-16 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN111310609B (en) * 2020-01-22 2023-04-07 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111428749A (en) * 2020-02-21 2020-07-17 平安科技(深圳)有限公司 Image annotation task pre-verification method, device, equipment and storage medium
CN111340784B (en) * 2020-02-25 2023-06-23 安徽大学 Mask R-CNN-based image tampering detection method
CN113324864B (en) * 2020-02-28 2022-09-20 南京理工大学 Pantograph carbon slide plate abrasion detection method based on deep learning target detection
CN111461114B (en) * 2020-03-03 2023-05-02 华南理工大学 Multi-scale feature pyramid text detection method based on segmentation
CN111368831B (en) * 2020-03-03 2023-05-23 开放智能机器(上海)有限公司 Positioning system and method for vertical text
CN111353458B (en) * 2020-03-10 2023-08-18 腾讯科技(深圳)有限公司 Text box labeling method, device and storage medium
CN113496223A (en) * 2020-03-19 2021-10-12 顺丰科技有限公司 Method and device for establishing text region detection model
CN111553361B (en) * 2020-03-19 2022-11-01 四川大学华西医院 Pathological section label identification method
CN111414855B (en) * 2020-03-19 2023-03-24 国网陕西省电力公司电力科学研究院 Telegraph pole sign target detection and identification method based on end-to-end regression model
CN111310861B (en) * 2020-03-27 2023-05-23 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN113449760A (en) * 2020-03-27 2021-09-28 北京沃东天骏信息技术有限公司 Character recognition method and device
CN111476302B (en) * 2020-04-08 2023-03-24 北京工商大学 fast-RCNN target object detection method based on deep reinforcement learning
CN113516673B (en) * 2020-04-10 2022-12-02 阿里巴巴集团控股有限公司 Image detection method, device, equipment and storage medium
CN111488883A (en) * 2020-04-14 2020-08-04 上海眼控科技股份有限公司 Vehicle frame number identification method and device, computer equipment and storage medium
CN111444919B (en) * 2020-04-17 2023-07-04 南京大学 Method for detecting text with arbitrary shape in natural scene
CN111461133B (en) * 2020-04-20 2023-04-18 上海东普信息科技有限公司 Express delivery surface single item name identification method, device, equipment and storage medium
CN111461101B (en) * 2020-04-20 2023-05-19 上海东普信息科技有限公司 Method, device, equipment and storage medium for identifying work clothes mark
CN111507333B (en) * 2020-04-21 2023-09-15 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111582329B (en) * 2020-04-22 2023-03-28 西安交通大学 Natural scene text character detection and labeling method based on multi-example learning
CN111553345B (en) * 2020-04-22 2023-10-20 上海浩方信息技术有限公司 Method for realizing meter pointer reading identification processing based on Mask RCNN and orthogonal linear regression
CN111507292B (en) * 2020-04-22 2023-05-12 广东光大信息科技股份有限公司 Handwriting board correction method, handwriting board correction device, computer equipment and storage medium
CN111553351A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Semantic segmentation based text detection method for arbitrary scene shape
CN111563502B (en) * 2020-05-09 2023-12-15 腾讯科技(深圳)有限公司 Image text recognition method and device, electronic equipment and computer storage medium
CN111723841A (en) * 2020-05-09 2020-09-29 北京捷通华声科技股份有限公司 Text detection method and device, electronic equipment and storage medium
CN111640089B (en) * 2020-05-09 2023-08-15 武汉精立电子技术有限公司 Defect detection method and device based on feature map center point
CN111597945B (en) * 2020-05-11 2023-08-18 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111524135B (en) * 2020-05-11 2023-12-26 安徽继远软件有限公司 Method and system for detecting defects of tiny hardware fittings of power transmission line based on image enhancement
CN111753653B (en) * 2020-05-15 2024-05-03 中铁第一勘察设计院集团有限公司 High-speed rail contact net fastener identification and positioning method based on attention mechanism
CN111553355B (en) * 2020-05-18 2023-07-28 城云科技(中国)有限公司 Monitoring video-based method for detecting and notifying store outgoing business and managing store owner
CN111753828B (en) * 2020-05-19 2022-12-27 重庆邮电大学 Natural scene horizontal character detection method based on deep convolutional neural network
CN111783523B (en) * 2020-05-19 2022-10-21 中国人民解放军93114部队 Remote sensing image rotating target detection method
CN112001878A (en) * 2020-05-21 2020-11-27 合肥合工安驰智能科技有限公司 Deep learning ore scale measuring method based on binarization neural network and application system
CN111612081B (en) * 2020-05-25 2024-04-02 深圳前海微众银行股份有限公司 Training method, device, equipment and storage medium for recognition model
CN111667469B (en) * 2020-06-03 2023-10-31 北京小白世纪网络科技有限公司 Lung disease classification method, device and equipment
CN111932583A (en) * 2020-06-05 2020-11-13 西安羚控电子科技有限公司 Space-time information integrated intelligent tracking method based on complex background
CN111709987B (en) * 2020-06-11 2023-04-07 上海东普信息科技有限公司 Package volume measuring method, device, equipment and storage medium
CN111860479B (en) * 2020-06-16 2024-03-26 北京百度网讯科技有限公司 Optical character recognition method, device, electronic equipment and storage medium
CN111783572B (en) * 2020-06-17 2023-11-14 泰康保险集团股份有限公司 Text detection method and device
CN111753714B (en) * 2020-06-23 2023-09-01 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111915628B (en) * 2020-06-24 2023-11-24 浙江大学 Single-stage instance segmentation method based on prediction target dense boundary points
CN111898597A (en) * 2020-06-24 2020-11-06 泰康保险集团股份有限公司 Method, device, equipment and computer readable medium for processing text image
CN111985525B (en) * 2020-06-30 2023-09-22 上海海事大学 Text recognition method based on multi-mode information fusion processing
CN111950353B (en) * 2020-06-30 2024-04-19 深圳市雄帝科技股份有限公司 Seal text recognition method and device and electronic equipment
CN111783427B (en) * 2020-06-30 2024-04-02 北京百度网讯科技有限公司 Method, device, equipment and storage medium for training model and outputting information
CN111798516B (en) * 2020-07-01 2023-12-22 广东省特种设备检测研究院珠海检测院 Method for detecting running state quantity and analyzing errors of bridge crane equipment
CN111783763A (en) * 2020-07-07 2020-10-16 厦门商集网络科技有限责任公司 Text positioning box correction method and system based on convolutional neural network
CN111931572B (en) * 2020-07-07 2024-01-09 广东工业大学 Target detection method for remote sensing image
CN111783705B (en) * 2020-07-08 2023-11-14 厦门商集网络科技有限责任公司 Character recognition method and system based on attention mechanism
CN111860264B (en) * 2020-07-10 2024-01-05 武汉理工大学 Multi-task instance-level road scene understanding algorithm based on gradient equalization strategy
CN111814705B (en) * 2020-07-14 2022-08-02 广西师范大学 Pedestrian re-identification method based on batch blocking shielding network
CN111798480A (en) * 2020-07-23 2020-10-20 北京思图场景数据科技服务有限公司 Character detection method and device based on single character and character connection relation prediction
CN111860506B (en) * 2020-07-24 2024-03-29 北京百度网讯科技有限公司 Method and device for recognizing characters
CN111914727B (en) * 2020-07-28 2024-04-26 联芯智能(南京)科技有限公司 Small target human body detection method based on balance sampling and nonlinear feature fusion
CN111898610B (en) * 2020-07-29 2024-04-19 平安科技(深圳)有限公司 Card unfilled corner detection method, device, computer equipment and storage medium
CN111753812A (en) * 2020-07-30 2020-10-09 上海眼控科技股份有限公司 Text recognition method and equipment
CN112016403B (en) * 2020-08-05 2023-07-21 中山大学 Video abnormal event detection method
CN111930622B (en) * 2020-08-10 2023-10-13 中国工商银行股份有限公司 Interface control testing method and system based on deep learning
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN112069910B (en) * 2020-08-11 2024-03-01 上海海事大学 Multi-directional ship target detection method for remote sensing image
CN112200181B (en) * 2020-08-19 2023-10-10 西安理工大学 Character shape approximation method based on particle swarm optimization algorithm
CN112102250B (en) * 2020-08-20 2022-11-04 西北大学 Method for establishing and detecting pathological image detection model with training data as missing label
CN112926372B (en) * 2020-08-22 2023-03-10 清华大学 Scene character detection method and system based on sequence deformation
CN112070082B (en) * 2020-08-24 2023-04-07 西安理工大学 Curve character positioning method based on instance perception component merging network
CN111985439A (en) * 2020-08-31 2020-11-24 中移(杭州)信息技术有限公司 Face detection method, device, equipment and storage medium
CN112036405A (en) * 2020-08-31 2020-12-04 浪潮云信息技术股份公司 Detection and identification method for handwritten document text
CN112052853B (en) * 2020-09-09 2024-02-02 国家气象信息中心 Text positioning method of handwriting meteorological archive data based on deep learning
CN112085122B (en) * 2020-09-21 2024-03-15 中国科学院上海微系统与信息技术研究所 Ontology-based semi-supervised image scene semantic deepening method
CN112101277B (en) * 2020-09-24 2023-07-28 湖南大学 Remote sensing target detection method based on image semantic feature constraint
CN112101386B (en) * 2020-09-25 2024-04-23 腾讯科技(深圳)有限公司 Text detection method, device, computer equipment and storage medium
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112085735B (en) * 2020-09-28 2022-10-25 西安交通大学 Aluminum material image defect detection method based on self-adaptive anchor frame
CN112183545B (en) * 2020-09-29 2024-05-17 佛山市南海区广工大数控装备协同创新研究院 Natural scene text recognition method with arbitrary shape
CN112287977B (en) * 2020-10-06 2024-02-09 武汉大学 Target detection method based on bounding box key point distance
CN112036398B (en) * 2020-10-15 2024-02-23 北京一览群智数据科技有限责任公司 Text correction method and system
CN112215235B (en) * 2020-10-16 2024-04-26 深圳华付技术股份有限公司 Scene text detection method aiming at large character spacing and local shielding
CN112308150B (en) * 2020-11-02 2022-04-15 平安科技(深圳)有限公司 Target detection model training method and device, computer equipment and storage medium
CN112419174B (en) * 2020-11-04 2022-09-20 中国科学院自动化研究所 Image character removing method, system and device based on gate cycle unit
CN112270370B (en) * 2020-11-06 2023-06-02 北京环境特性研究所 Vehicle apparent damage assessment method
CN112434698A (en) * 2020-11-23 2021-03-02 泰康保险集团股份有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN112464943B (en) * 2020-11-25 2023-07-14 创新奇智(南京)科技有限公司 Semantic segmentation method and device based on few samples, electronic equipment and storage medium
CN112418134B (en) * 2020-12-01 2024-02-27 厦门大学 Pedestrian analysis-based multi-stream multi-tag pedestrian re-identification method
CN112529768B (en) * 2020-12-04 2023-01-06 中山大学 Garment editing and generating method based on generation countermeasure network
CN112541491B (en) * 2020-12-07 2024-02-02 沈阳雅译网络技术有限公司 End-to-end text detection and recognition method based on image character region perception
CN112446372B (en) * 2020-12-08 2022-11-08 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112650832B (en) * 2020-12-14 2022-09-06 中国电子科技集团公司第二十八研究所 Knowledge correlation network key node discovery method based on topology and literature characteristics
CN112633343B (en) * 2020-12-16 2024-04-19 国网江苏省电力有限公司检修分公司 Method and device for checking wiring of power equipment terminal strip
CN112598635B (en) * 2020-12-18 2024-03-12 武汉大学 Point cloud 3D target detection method based on symmetric point generation
CN112528997B (en) * 2020-12-24 2022-04-19 西北民族大学 Tibetan-Chinese bilingual scene text detection method based on text center region amplification
CN112669446B (en) * 2020-12-24 2024-04-19 联通(浙江)产业互联网有限公司 Building scene modeling method and device
CN112580738B (en) * 2020-12-25 2021-07-23 特赞(上海)信息科技有限公司 AttentionOCR text recognition method and device based on improvement
CN113435466A (en) * 2020-12-26 2021-09-24 上海有个机器人有限公司 Method, device, medium and terminal for detecting elevator door position and switch state
CN112598683B (en) * 2020-12-27 2024-04-02 北京化工大学 Sweep OCT human eye image segmentation method based on sweep frequency optical coherence tomography
CN112651948B (en) * 2020-12-30 2022-04-12 重庆科技学院 Machine vision-based artemisinin extraction intelligent tracking and identification method
CN112862842B (en) * 2020-12-31 2023-05-12 青岛海尔科技有限公司 Image data processing method and device, storage medium and electronic device
CN112686245B (en) * 2021-01-04 2022-05-13 福州大学 Character and text parallel detection method based on character response
CN112686203B (en) * 2021-01-12 2023-10-31 重庆大学 Vehicle safety warning device detection method based on space priori
CN112801146B (en) * 2021-01-13 2024-03-19 华中科技大学 Target detection method and system
CN112733768B (en) * 2021-01-15 2022-09-09 中国科学技术大学 Natural scene text recognition method and device based on bidirectional characteristic language model
CN112766361A (en) * 2021-01-18 2021-05-07 山东师范大学 Target fruit detection method and detection system under homochromatic background
CN112712535B (en) * 2021-01-18 2024-03-22 长安大学 Mask-RCNN landslide segmentation method based on simulation difficult sample
CN112883795B (en) * 2021-01-19 2023-01-31 贵州电网有限责任公司 Rapid and automatic table extraction method based on deep neural network
CN112651989B (en) * 2021-01-19 2024-01-19 华东理工大学 SEM image molecular sieve particle size statistical method and system based on Mask RCNN example segmentation
CN112766263B (en) * 2021-01-21 2024-02-02 西安理工大学 Identification method for multi-layer control stock relationship share graphs
CN112766262B (en) * 2021-01-21 2024-02-02 西安理工大学 Identification method for single-layer one-to-many and many-to-one share graphs
CN112784737B (en) * 2021-01-21 2023-10-20 上海云从汇临人工智能科技有限公司 Text detection method, system and device combining pixel segmentation and line segment anchor
CN112766194A (en) * 2021-01-26 2021-05-07 上海海洋大学 Detection method for mesoscale ocean eddy
CN112818975A (en) * 2021-01-27 2021-05-18 北京金山数字娱乐科技有限公司 Text detection model training method and device and text detection method and device
CN112990211B (en) * 2021-01-29 2023-07-11 华为技术有限公司 Training method, image processing method and device for neural network
CN112801092B (en) * 2021-01-29 2022-07-15 重庆邮电大学 Method for detecting character elements in natural scene image
CN112766274B (en) * 2021-02-01 2023-07-07 长沙市盛唐科技有限公司 Water gauge image water level automatic reading method and system based on Mask RCNN algorithm
CN112946436A (en) * 2021-02-02 2021-06-11 成都国铁电气设备有限公司 Online intelligent detection method for arc extinction and disconnection of vehicle-mounted contact net insulator
CN112818873B (en) * 2021-02-04 2023-05-26 苏州魔视智能科技有限公司 Lane line detection method and system and electronic equipment
CN112700444B (en) * 2021-02-19 2023-06-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN112883887B (en) * 2021-03-01 2023-07-18 中央财经大学 Building instance automatic extraction method based on high spatial resolution optical remote sensing image
CN113095319B (en) * 2021-03-03 2022-11-15 中国科学院信息工程研究所 Multidirectional scene character detection method and device based on full convolution angular point correction network
CN113065401A (en) * 2021-03-04 2021-07-02 国网河北省电力有限公司 Intelligent platform for full-ticket account reporting
CN113065404B (en) * 2021-03-08 2023-02-24 国网河北省电力有限公司 Method and system for detecting train ticket content based on equal-width character segments
CN113159021A (en) * 2021-03-10 2021-07-23 国网河北省电力有限公司 Text detection method based on context information
CN113033346B (en) * 2021-03-10 2023-08-04 北京百度网讯科技有限公司 Text detection method and device and electronic equipment
CN112966678B (en) * 2021-03-11 2023-01-24 南昌航空大学 Text detection method and system
CN113011597B (en) * 2021-03-12 2023-02-28 山东英信计算机技术有限公司 Deep learning method and device for regression task
US11682220B2 (en) * 2021-03-15 2023-06-20 Optum Technology, Inc. Overlap-aware optical character recognition
CN113052369B (en) * 2021-03-15 2024-05-10 北京农业智能装备技术研究中心 Intelligent agricultural machinery operation management method and system
CN113033377A (en) * 2021-03-16 2021-06-25 北京有竹居网络技术有限公司 Character position correction method, character position correction device, electronic equipment and storage medium
CN112907605B (en) * 2021-03-19 2023-11-17 南京大学 Data enhancement method for instance segmentation
CN113128560B (en) * 2021-03-19 2023-02-24 西安理工大学 CNN regular script style classification method based on attention module enhancement
CN112733822B (en) * 2021-03-31 2021-07-27 上海旻浦科技有限公司 End-to-end text detection and identification method
CN113052759B (en) * 2021-03-31 2023-03-21 华南理工大学 Scene complex text image editing method based on MASK and automatic encoder
CN112926692B (en) * 2021-04-09 2023-05-09 四川翼飞视科技有限公司 Target detection device, method and storage medium based on non-uniform mixed convolution
CN112927245B (en) * 2021-04-12 2022-06-21 华中科技大学 End-to-end instance segmentation method based on instance query
CN113033540A (en) * 2021-04-14 2021-06-25 易视腾科技股份有限公司 Contour fitting and correcting method for scene characters, electronic device and storage medium
CN113033482B (en) * 2021-04-20 2024-01-30 上海应用技术大学 Traffic sign detection method based on regional attention
CN113177389A (en) * 2021-04-23 2021-07-27 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and storage medium
CN113139541B (en) * 2021-04-24 2023-10-24 西安交通大学 Power distribution cabinet dial nixie tube visual identification method based on deep learning
CN113269197B (en) * 2021-04-25 2024-03-08 南京三百云信息科技有限公司 Certificate image vertex coordinate regression system and identification method based on semantic segmentation
CN113762237B (en) * 2021-04-26 2023-08-18 腾讯科技(深圳)有限公司 Text image processing method, device, equipment and storage medium
CN113159053A (en) * 2021-04-27 2021-07-23 北京有竹居网络技术有限公司 Image recognition method and device and computing equipment
CN113269045A (en) * 2021-04-28 2021-08-17 南京大学 Chinese artistic word detection and recognition method under natural scene
CN113191296A (en) * 2021-05-13 2021-07-30 中国人民解放军陆军炮兵防空兵学院 Method for detecting five parameters of target in any orientation based on YOLOV5
CN113139625B (en) * 2021-05-18 2023-12-15 北京世纪好未来教育科技有限公司 Model training method, electronic equipment and storage medium thereof
CN113221773B (en) * 2021-05-19 2022-09-13 中国电子科技集团公司第二十八研究所 Method for quickly constructing airplane classification data set based on remote sensing image
CN113516116B (en) * 2021-05-19 2022-11-22 西安建筑科技大学 Text detection method, system and medium suitable for complex natural scene
CN113177511A (en) * 2021-05-20 2021-07-27 中国人民解放军国防科技大学 Rotating frame intelligent perception target detection method based on multiple data streams
CN113159037B (en) * 2021-05-25 2023-08-08 中国平安人寿保险股份有限公司 Picture correction method, device, computer equipment and storage medium
CN113379761B (en) * 2021-05-25 2023-04-28 重庆顺多利机车有限责任公司 Linkage method and system of multiple AGVs and automatic doors based on artificial intelligence
CN113177553B (en) * 2021-05-31 2022-08-12 哈尔滨工业大学(深圳) Method and device for identifying floor buttons of inner panel of elevator
CN113191358B (en) * 2021-05-31 2023-01-24 上海交通大学 Metal part surface text detection method and system
CN113313173B (en) * 2021-06-01 2023-05-30 中山大学 Human body parsing method based on graph representation and improved Transformer
CN115457531A (en) * 2021-06-07 2022-12-09 京东科技信息技术有限公司 Method and device for recognizing text
CN113362380A (en) * 2021-06-09 2021-09-07 北京世纪好未来教育科技有限公司 Image feature point detection model training method and device and electronic equipment thereof
CN113343980B (en) * 2021-06-10 2023-06-09 西安邮电大学 Natural scene text detection method and system
CN113378815B (en) * 2021-06-16 2023-11-24 南京信息工程大学 Scene text positioning and identifying system and training and identifying method thereof
CN113345106A (en) * 2021-06-24 2021-09-03 西南大学 Three-dimensional point cloud analysis method and system based on multi-scale multi-level Transformer
CN113360655B (en) * 2021-06-25 2022-10-04 中国电子科技集团公司第二十八研究所 Track point classification and text generation method based on sequence annotation
CN113255669B (en) * 2021-06-28 2021-10-01 山东大学 Method and system for detecting text of natural scene with any shape
CN113569650A (en) * 2021-06-29 2021-10-29 上海红檀智能科技有限公司 Unmanned aerial vehicle autonomous inspection positioning method based on electric power tower label identification
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113469177B (en) * 2021-06-30 2024-04-26 河海大学 Deep learning-based drainage pipeline defect detection method and system
WO2023279186A1 (en) * 2021-07-06 2023-01-12 Orbiseed Technology Inc. Methods and systems for extracting text and symbols from documents
CN113435542A (en) * 2021-07-22 2021-09-24 安徽理工大学 Coal and gangue real-time detection method based on deep learning
CN113343990B (en) * 2021-07-28 2021-12-03 浩鲸云计算科技股份有限公司 Key text detection and classification training method for certificate pictures
CN113657213A (en) * 2021-07-30 2021-11-16 五邑大学 Text recognition method, text recognition device and computer-readable storage medium
CN113763326B (en) * 2021-08-04 2023-11-21 武汉工程大学 Pantograph detection method based on Mask scanning R-CNN network
CN113807336B (en) * 2021-08-09 2023-06-30 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN113780087B (en) * 2021-08-11 2024-04-26 同济大学 Postal package text detection method and equipment based on deep learning
CN113643136A (en) * 2021-09-01 2021-11-12 京东科技信息技术有限公司 Information processing method, system and device
CN113807340B (en) * 2021-09-07 2024-03-15 南京信息工程大学 Attention mechanism-based irregular natural scene text recognition method
CN113807351B (en) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device
CN113837168A (en) * 2021-09-22 2021-12-24 易联众智鼎(厦门)科技有限公司 Image text detection and OCR recognition method, device and storage medium
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
TWI807467B (en) * 2021-11-02 2023-07-01 中國信託商業銀行股份有限公司 Key-item detection model building method, business-oriented key-value identification system and method
CN114049625B (en) * 2021-11-11 2024-02-27 西北工业大学 Multidirectional text detection method based on novel image shrinkage method
CN114155540B (en) * 2021-11-16 2024-05-03 深圳市联洲国际技术有限公司 Character recognition method, device, equipment and storage medium based on deep learning
CN114140786B (en) * 2021-12-03 2024-05-17 杭州师范大学 HRNet coding and double-branch decoding-based scene text recognition method
CN114332839A (en) * 2021-12-30 2022-04-12 福州大学 Streetscape text detection method based on multi-space joint perception
CN114332841A (en) * 2021-12-31 2022-04-12 福州大学 Scene text detection method based on selective feature fusion pyramid
CN114399757A (en) * 2022-01-13 2022-04-26 福州大学 Natural scene text recognition method and system for multi-path parallel position correlation network
CN114067321B (en) * 2022-01-14 2022-04-08 腾讯科技(深圳)有限公司 Text detection model training method, device, equipment and storage medium
CN114418001B (en) * 2022-01-20 2023-05-12 北方工业大学 Character recognition method and system based on parameter reconstruction network
CN114419020B (en) * 2022-01-26 2022-10-18 深圳大学 Medical image segmentation method, medical image segmentation device, computer equipment and storage medium
CN114201967B (en) * 2022-02-17 2022-06-10 杭州费尔斯通科技有限公司 Entity identification method, system and device based on candidate entity classification
CN114549958B (en) * 2022-02-24 2023-08-04 四川大学 Night and camouflage target detection method based on context information perception mechanism
CN114359912B (en) * 2022-03-22 2022-06-24 杭州实在智能科技有限公司 Software page key information extraction method and system based on graph neural network
CN114399769B (en) * 2022-03-22 2022-08-02 北京百度网讯科技有限公司 Training method of text recognition model, and text recognition method and device
CN114723946B (en) * 2022-04-11 2024-02-27 合肥工业大学 Preferential direction deviation early warning system and method based on semantic segmentation
CN114862648B (en) * 2022-05-27 2023-06-20 晋城市大锐金马工程设计咨询有限公司 Cross-watermark document encryption method using two documents, A and B
CN114972947B (en) * 2022-07-26 2022-12-06 之江实验室 Depth scene text detection method and device based on fuzzy semantic modeling
CN114972710B (en) * 2022-07-27 2022-10-28 深圳爱莫科技有限公司 Method and system for realizing multi-shape target detection in image
CN115346206B (en) * 2022-10-20 2023-01-31 松立控股集团股份有限公司 License plate detection method based on improved super-resolution deep convolution feature recognition
CN115546778B (en) * 2022-10-22 2023-06-13 清华大学 Scene text detection method and system based on multitask learning
CN115909376A (en) * 2022-11-01 2023-04-04 北京百度网讯科技有限公司 Text recognition method, text recognition model training device and storage medium
CN115422389B (en) * 2022-11-07 2023-04-07 北京百度网讯科技有限公司 Method and device for processing text image and training method of neural network
CN115497106B (en) * 2022-11-14 2023-01-24 合肥中科类脑智能技术有限公司 Battery laser code-spraying identification method based on data enhancement and multitask model
CN116701347B (en) * 2023-05-08 2023-12-05 北京三维天地科技股份有限公司 Data modeling method and system based on category expansion
CN116342627B (en) * 2023-05-23 2023-09-08 山东大学 Intestinal epithelial metaplasia area image segmentation system based on multi-instance learning
CN116434234B (en) * 2023-05-25 2023-10-17 珠海亿智电子科技有限公司 Method, device, equipment and storage medium for detecting and identifying casting blank characters
CN116442393B (en) * 2023-06-08 2024-02-13 山东博硕自动化技术有限公司 Intelligent unloading method, system and control equipment for mixing plant based on video identification
CN116436987B (en) * 2023-06-12 2023-08-22 深圳舜昌自动化控制技术有限公司 IO-Link master station data message transmission processing method and system
CN116524521B (en) * 2023-06-30 2023-09-15 武汉纺织大学 English character recognition method and system based on deep learning
CN116524529B (en) * 2023-07-04 2023-10-27 青岛海信信息科技股份有限公司 Novel method for identifying layers based on graph nesting relationship
CN117078901B (en) * 2023-07-12 2024-04-16 长江勘测规划设计研究有限责任公司 Automatic marking method for single-point bar in steel bar view
CN116740688B (en) * 2023-08-11 2023-11-07 武汉市中西医结合医院(武汉市第一医院) Medicine identification method and system
CN116863482B (en) * 2023-09-05 2023-12-19 华立科技股份有限公司 Mutual inductor detection method, device, equipment and storage medium
CN117037173B (en) * 2023-09-22 2024-02-27 武汉纺织大学 Two-stage English character detection and recognition method and system
CN117409400A (en) * 2023-10-18 2024-01-16 无锡九霄科技有限公司 Complex condition character recognition method based on deep learning network
CN117221146B (en) * 2023-11-09 2024-01-23 成都科江科技有限公司 Interface layout system and layout method for ladder diagram logic configuration
CN117315702B (en) * 2023-11-28 2024-02-23 山东正云信息科技有限公司 Text detection method, system and medium based on set prediction
CN117315238B (en) * 2023-11-29 2024-03-15 福建理工大学 Vehicle target detection method and terminal
CN117436442B (en) * 2023-12-19 2024-03-12 中南大学 Text term multiple segmentation, merging, labeling and splitting method and device
CN117556806B (en) * 2023-12-28 2024-03-22 大连云智信科技发展有限公司 Fine granularity segmentation method for traditional Chinese medicine syndrome names
CN117475038B (en) * 2023-12-28 2024-04-19 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and computer readable storage medium
CN117560456A (en) * 2024-01-11 2024-02-13 卓世未来(天津)科技有限公司 Large model data leakage prevention method and system
CN117975467A (en) * 2024-04-02 2024-05-03 华南理工大学 Bridge type end-to-end character recognition method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9245191B2 (en) * 2013-09-05 2016-01-26 Ebay, Inc. System and method for scene text recognition
CN104751153B (en) * 2013-12-31 2018-08-14 中国科学院深圳先进技术研究院 Method and device for recognizing scene text
CN106778757B (en) * 2016-12-12 2019-06-04 哈尔滨工业大学 Scene text detection method based on text saliency
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 End-to-end identification method for scene text with any shape

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740909A (en) * 2016-02-02 2016-07-06 华中科技大学 Text recognition method for natural scenes based on spatial transformation
CN106446899A (en) * 2016-09-22 2017-02-22 北京市商汤科技开发有限公司 Text detection method and device and text detection training method and device
CN106897732A (en) * 2017-01-06 2017-06-27 华中科技大学 Multi-directional text detection method for natural images based on linked segments
CN107617573A (en) * 2017-09-30 2018-01-23 浙江瀚镪自动化设备股份有限公司 Logistics code recognition and sorting method based on multi-task deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework; Michal Busta et al.; 2017 IEEE International Conference on Computer Vision; 20171231; pp. 2223-2231 *
TextBoxes: A Fast Text Detector with a Single Deep Neural Network; Minghui Liao et al.; Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence; 20171231; pp. 4161-4167 *
Digit Recognition in Natural Scenes Based on Convolutional Neural Networks; Zhou Chengwei; Computer Technology and Development; 20171130; Vol. 27, No. 11; pp. 101-105 *

Also Published As

Publication number Publication date
CN108549893A (en) 2018-09-18
WO2019192397A1 (en) 2019-10-10

Similar Documents

Publication Publication Date Title
CN108549893B (en) End-to-end identification method for scene text with any shape
CN110837835B (en) End-to-end scene text identification method based on boundary point detection
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN108304835B (en) Character detection method and device
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN108229303B (en) Detection recognition and training method, device, equipment and medium for detection recognition network
US10424072B2 (en) Leveraging multi cues for fine-grained object classification
WO2019089578A1 (en) Font identification from imagery
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN110751154B (en) Complex environment multi-shape text detection method based on pixel-level segmentation
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN112541491B (en) End-to-end text detection and recognition method based on image character region perception
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN116645592B (en) Crack detection method based on image processing and storage medium
CN111860309A (en) Face recognition method and system
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Sharma et al. Segmentation of handwritten words using structured support vector machine
Cai et al. IOS-Net: An inside-to-outside supervision network for scale robust text detection in the wild
Mohammad et al. Contour-based character segmentation for printed Arabic text with diacritics
El Abbadi Scene Text detection and Recognition by Using Multi-Level Features Extractions Based on You Only Look Once Version Five (YOLOv5) and Maximally Stable Extremal Regions (MSERs) with Optical Character Recognition (OCR)
Naosekpam et al. UTextNet: a UNet based arbitrary shaped scene text detector
Shi et al. Fuzzy support tensor product adaptive image classification for the internet of things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant