CN110929665A - Natural scene curve text detection method - Google Patents

Natural scene curve text detection method

Info

Publication number
CN110929665A
CN110929665A (application CN201911199614.6A; granted as CN110929665B)
Authority
CN
China
Prior art keywords
text
loss
bounding box
detection method
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911199614.6A
Other languages
Chinese (zh)
Other versions
CN110929665B (en)
Inventor
王敏
蔡鑫鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201911199614.6A priority Critical patent/CN110929665B/en
Publication of CN110929665A publication Critical patent/CN110929665A/en
Application granted granted Critical
Publication of CN110929665B publication Critical patent/CN110929665B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a natural scene curve text detection method, which comprises the following steps: (1) acquiring a plurality of image data sets for training scene curve text detection; (2) performing feature learning on the image data set obtained in step (1) by using a convolutional neural network (CNN), and generating text proposals for an input image by using the dimension-decomposition region proposal network DeRPN; (3) validating and refining the text proposals of step (2) using a refinement network, including text/non-text classification, bounding-box regression, and arbitrary-shape text region representation; (4) carrying out supervised training of the network built in step (3) to obtain a detector model; (5) detecting the picture to be detected by using the detector model of step (4), and outputting the polygonal text areas to obtain the final detection result. The method can localize curved text more tightly and robustly and improves detection performance.

Description

Natural scene curve text detection method
Technical Field
The invention relates to the technical field of image processing, in particular to a natural scene curve text detection method.
Background
Text is the most basic medium for conveying semantic information and is ubiquitous in everyday life: road signs, shop signs, product packaging, restaurant menus, etc. Text in such natural environments is referred to as scene text. Automatically detecting and recognizing scene text is beneficial, with applications in real-time text translation, assistance for the blind, shopping, robotics, smart cars, and education. An end-to-end text recognition system typically comprises two steps: text detection and text recognition. In text detection, text regions are detected and marked with their bounding boxes; in text recognition, text information is retrieved from the detected text regions. Text detection is an essential step toward end-to-end text recognition, without which text cannot be recognized from a scene image. Therefore, scene text detection has attracted much attention in recent years.
Traditional OCR techniques can only process text on printed documents or business cards, whereas scene text detection attempts to detect various text in complex scenes. Scene text detection is a very challenging task due to the complexity of backgrounds and variations in font, size, color, language, lighting conditions, orientation, etc. Before deep learning methods became popular, approaches using manually designed features and traditional classifiers performed poorly. In recent years, however, detection performance has improved greatly thanks to the development of deep learning techniques. Meanwhile, the research focus of scene text detection has shifted from horizontal scene text to multi-oriented scene text, and further to the more challenging curved or arbitrarily shaped scene text.
The main challenges of curved text detection come from irregular shapes and highly variable orientations. Conventional bounding boxes do not generalize well to curved scene text. Because text may appear in various shapes, a traditional quadrilateral bounding box cannot avoid large redundant overlaps, may contain multiple lines of text, and is affected by background noise, so curved scene text is difficult to localize compactly and robustly. Most current object detection methods use the region proposal network (RPN). Although the RPN has proven to be an effective method for generating region proposals, the anchor boxes employed in the RPN are very sensitive, limiting the ability to adapt to different targets: as soon as the anchor boxes deviate significantly from the ground truth in a dataset, performance drops considerably.
Disclosure of Invention
The invention aims to provide a natural scene curve text detection method which can localize curved text tightly and robustly and improve detection performance.
In order to solve the technical problem, the invention provides a natural scene curve text detection method, which comprises the following steps:
(1) acquiring a plurality of image data sets for training scene curve text detection;
(2) performing feature learning on the image data set obtained in the step (1) by using a Convolutional Neural Network (CNN), and generating a text proposal of an input image by using a dimension decomposition area proposal network DeRPN;
(3) validating and refining the text proposal in step (2) using a refining network, including text/non-text classification, bounding box regression, and arbitrary shape text region representation;
(4) carrying out supervision training on the network built in the step (3) to obtain a detector model;
(5) detecting the picture to be detected by using the detector model of step (4), and outputting the polygonal text areas to obtain the final detection result.
Preferably, in step (1), the image dataset is an existing public scene curved-text image dataset, or a curved-text image dataset newly collected from real scenes; the image dataset comprises N training pictures, each training picture contains at least one curved text region, and each picture has an annotation file describing the position information of all text regions in the picture by the vertex coordinates of rectangles or polygons; the annotation file is called a label.
Preferably, in step (2), the features (x) extracted by the convolutional neural network CNN are input to a regression layer and a classification layer. The regression layer, realized by a convolutional layer or a fully-connected layer, is a linear operation that predicts parameterized coordinates (t), and the parameterized coordinates are decoded with respect to the anchors in order to obtain a predicted bounding box (B); the classification layer applies an activation function (e.g., Sigmoid or Softmax, denoted σ) to the predicted values to generate the probability (P_B) of the bounding box. Using VGG16 as the backbone network, the DeRPN is attached to its conv5 layer; through a dimension-decomposition mechanism, the DeRPN introduces anchor strings (a_w, a_h) as independent regression references for object width and height, and simultaneously predicts independent segments (S_w(x,w), S_h(y,h)) and the corresponding probabilities (p_w, p_h) rather than a complete bounding box. The mathematical description of this process is as follows:

(t_w, t_h) = W_r·x + b_r

(S_w(x,w), S_h(y,h)) = decode((t_w, t_h), (a_w, a_h))

(p_w, p_h) = σ(W_c·x + b_c)

where W_r, b_r denote the weight and bias of the regression layer, and W_c, b_c denote the weight and bias of the classification layer; x, y, w, h are the coordinates of the bounding box and x_a, y_a, w_a, h_a are the corresponding anchor-string coordinates; t_w, t_h are the parameterized coordinates of the predicted width and height; S_w(x,w), S_h(y,h) denote the predicted independent width and height segments; a_w, a_h denote the independent regression references for object width and height; and p_w, p_h denote the corresponding width and height probabilities;
since the detection result requires a two-dimensional bounding box, the predicted segments need to be reasonably combined to recover bounding boxes; the combination process is described mathematically as follows:

B(x,y,w,h) = f(S_w(x,w), S_h(y,h))

P_B = g(p_w, p_h)

where f denotes a rule or algorithm for combining predicted segments, g is a function (e.g., arithmetic mean, harmonic mean) that evaluates the probability of a combined bounding box, P_B denotes the probability of the generated bounding box, and B(x,y,w,h) denotes the combined bounding box.
Preferably, in step (3), the geometric properties of the text (the text region, the text center line, and the bounding-box offsets) are utilized to accurately represent the shape of the text bounding box of step (2), wherein the text center line is obtained by shrinking the text bounding box, and the boundary offsets are four channel maps whose values are present only at positions corresponding to positive responses of the center-line feature map; n points are sampled at equal intervals from left to right on the predicted text center line, and at each point a normal perpendicular to the local tangent is drawn and intersected with the upper and lower boundary lines to obtain two boundary points; for each sampled center-line point, four boundary offsets are obtained by calculating the distances from the point to its two associated boundary points; by connecting all boundary points clockwise, a complete text polygon representation can be obtained.
Preferably, in step (4), in constructing a detector model of the natural scene curve text detection method, the following loss function is used to calculate the loss:
L = L_a + λ·L_b

where L, L_a and L_b are the total loss, the first-stage loss and the second-stage loss, respectively, and λ is a weight coefficient balancing the first-stage and second-stage losses.
Preferably, the first stage losses are defined as follows:
L_a = Σ_{j=1}^{N} (1/|R_j|) Σ_{i∈R_j} L_cls(P_i, P_i*) + λ_1·Σ_{j=1}^{N} (1/|G_j|) Σ_{i∈G_j} P_i*·L_reg(t_i, t_i*)

where R_j = {k | S_k = a_j, k = 1, 2, …, M} and G_j = {k | S_k = a_j, k ∈ A}. Here the anchor strings are set to a geometric series {a_n} = (16, 32, 64, 128, 256, 512, 1024); N is the number of items in {a_n}; M is the batch size; S_k denotes the anchor string of the k-th sample; and P_i denotes the predicted probability of the i-th anchor string in the mini-batch. If the anchor string is positive, the ground-truth label P_i* is set to 1; otherwise P_i* is 0. t_i denotes the prediction vector of parameterized coordinates and t_i* is the corresponding ground-truth vector; A is the set of matched anchor strings; R_j denotes the set of indices of anchor strings sharing the same scale, where j indexes the item a_j of {a_n}; similarly, G_j is the set of matched anchor-string indices sharing the same scale. The classification loss L_cls is the cross-entropy loss, the regression loss L_reg is the smooth L1 loss, and λ_1 is the balance parameter between L_cls and L_reg.
Preferably, the second stage losses are defined as follows:
L_b = L_1 + λ_2·L_2 + λ_3·L_3

where L_1, L_2 and L_3 are the text/non-text classification loss, the bounding-box regression loss and the arbitrary-shape representation loss, respectively, and λ_2, λ_3 are balance parameters.
Preferably, the text/non-text classification loss is a binary classification loss, L_1 = L_cls(P, t) = −log P_t, where t is the class label (t = 1 indicates text, t = 0 indicates non-text), and the parameter P = (P_0, P_1) is the confidence of non-text and text after the softmax calculation.
Preferably, the bounding-box regression loss uses the smooth L1 loss:

L_2 = L_reg(v, v*) = Σ_{i∈{x,y,w,h}} smooth_L1(v_i − v_i*)

where v = (v_x, v_y, v_w, v_h) is the target of the bounding-box regression, comprising the center-point coordinates, the width and the height, and v* = (v_x*, v_y*, v_w*, v_h*) is the predicted tuple for each text proposal; v and v* use the parameterization given in Faster R-CNN, which specifies a scale-invariant translation and a log-space height/width shift relative to a target proposal.
Preferably, the arbitrary-shape representation loss is L_3 = μ_1·L_tr + μ_2·L_tcl + μ_3·L_border, where L_tr and L_tcl are Dice-coefficient losses for the text region and the text center line, L_border is calculated with the smooth L1 loss, and μ_1, μ_2, μ_3 are balance parameters.
The invention has the following beneficial effects: the invention uses a new region proposal network, DeRPN, which has strong adaptability; without any hyper-parameter modification, the DeRPN can be used directly for different models, tasks or datasets, and it matches objects with the best regression references, so the network trains more smoothly and yields more accurate region proposals; meanwhile, text of arbitrary shape is represented by the text center line and corresponding offsets, so curved text can be localized tightly and robustly, improving detection performance.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a model architecture according to the present invention.
Detailed Description
As shown in fig. 1, a natural scene curve text detection method includes the following steps:
step 1: acquiring a plurality of image data sets for training scene curve text detection;
the image data set is an existing public scene curve text image data set or a curve text image data set in a temporary collected scene, the image data set comprises N training pictures, each training picture has at least one curve text region, and a labeling file which describes position information of all the text regions in the picture by using vertex coordinates of a rectangle or a polygon is provided, and the labeling file is called a label.
Step 2: performing feature learning on the image data set obtained in the step 1 by using a Convolutional Neural Network (CNN), and generating a text proposal of an input image by using a dimension decomposition area proposal network DeRPN;
Features (x) extracted by the convolutional neural network CNN are input to a regression layer and a classification layer. The regression layer, realized by a convolutional layer or a fully-connected layer, is a linear operation that predicts parameterized coordinates (t), which are decoded with respect to the anchors in order to obtain a predicted bounding box (B); the classification layer applies an activation function (e.g., Sigmoid or Softmax, denoted σ) to the predicted values to generate the probability (P_B) of the bounding box. Step 2 uses VGG16 as the backbone network and attaches the DeRPN to its conv5 layer; through a dimension-decomposition mechanism, the DeRPN introduces anchor strings (a_w, a_h) as independent regression references for object width and height, and simultaneously predicts independent segments (S_w(x,w), S_h(y,h)) and the corresponding probabilities (p_w, p_h) rather than a complete bounding box. The mathematical description of this process is as follows:

(t_w, t_h) = W_r·x + b_r

(S_w(x,w), S_h(y,h)) = decode((t_w, t_h), (a_w, a_h))

(p_w, p_h) = σ(W_c·x + b_c)

where W_r, b_r denote the weight and bias of the regression layer, and W_c, b_c denote the weight and bias of the classification layer; x, y, w, h are the coordinates of the bounding box and x_a, y_a, w_a, h_a are the corresponding anchor-string coordinates; t_w, t_h are the parameterized coordinates of the predicted width and height; S_w(x,w), S_h(y,h) denote the predicted independent width and height segments; a_w, a_h denote the independent regression references for object width and height; and p_w, p_h denote the corresponding width and height probabilities.
Since the detection result requires a two-dimensional bounding box, the predicted segments need to be reasonably combined to recover bounding boxes; the combination process is described mathematically as follows:

B(x,y,w,h) = f(S_w(x,w), S_h(y,h))

P_B = g(p_w, p_h)

where f denotes a rule or algorithm for combining predicted segments, g is a function (e.g., arithmetic mean, harmonic mean) that evaluates the probability of a combined bounding box, P_B denotes the probability of the generated bounding box, and B(x,y,w,h) denotes the combined bounding box.
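By way of illustration (not part of the original disclosure), the following Python sketch shows the dimension-decomposition prediction at a single feature-map position; the channel count, random initialization, sigmoid activation and array layout are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of DeRPN prediction at one feature-map position, assuming
# N anchor strings per dimension; weights here are random placeholders.
rng = np.random.default_rng(0)
C, N = 512, 7                                  # conv5 channels; items in {a_n}
x = rng.standard_normal(C)                     # feature vector at one position

W_r = rng.standard_normal((4 * N, C)) * 0.01   # regression: (t_x,t_w) and (t_y,t_h)
b_r = np.zeros(4 * N)
W_c = rng.standard_normal((2 * N, C)) * 0.01   # classification: one score per segment
b_c = np.zeros(2 * N)

t = W_r @ x + b_r                              # parameterized coordinates
p = 1.0 / (1.0 + np.exp(-(W_c @ x + b_c)))     # sigma -> segment probabilities
t_xw, t_yh = t[:2 * N].reshape(N, 2), t[2 * N:].reshape(N, 2)
p_w, p_h = p[:N], p[N:]                        # width / height probabilities
```

Note that width and height segments are scored independently of one another; a full bounding box and its probability P_B = g(p_w, p_h) are only assembled afterwards by the combination rule f.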
The DeRPN reasonably matches objects to anchor strings according to length, rather than according to IoU as in the RPN; the best-matching anchor strings are obtained by the following formula:

M_j = { i | i = argmin_k |ln e_j − ln a_k| } ∪ { i+1 | e_j ∈ T_i }

where M_j denotes the index set of anchor strings matching the j-th object, e_j is an object edge (width or height), N and q denote the number of items and the common ratio of the geometric series {a_n}, respectively, and a_i is the i-th anchor string in {a_n}. The first term in the equation selects the anchor string closest to the edge, and the second term describes a transition interval T_i around the geometric midpoint of adjacent anchor strings; transition intervals are used to reduce ambiguity caused by image noise and ground-truth bias. If e_j falls in the transition interval, both i and i+1 are selected as matching indices.
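A hedged Python sketch of this matching rule follows (not part of the original disclosure); the exact bounds of the transition interval are not recoverable from the text, so it is modelled here as a band of assumed tolerance tau around the geometric midpoint of adjacent anchor strings.

```python
import numpy as np

# Scale matching between an object edge e_j and the anchor strings {a_n}.
# The transition-interval tolerance `tau` is an assumed hyperparameter.
ANCHOR_STRINGS = np.array([16, 32, 64, 128, 256, 512, 1024], dtype=float)

def match_anchor_strings(e_j: float, tau: float = 1.1) -> list:
    logs = np.log(ANCHOR_STRINGS)
    c = int(np.argmin(np.abs(np.log(e_j) - logs)))    # closest string (first term)
    matched = {c}
    # Second term: if e_j lies in the transition interval between adjacent
    # strings a_i and a_{i+1}, both neighbours are matched.
    for i in range(len(ANCHOR_STRINGS) - 1):
        mid = np.sqrt(ANCHOR_STRINGS[i] * ANCHOR_STRINGS[i + 1])
        if mid / tau <= e_j <= mid * tau:
            matched |= {i, i + 1}
    return sorted(matched)
```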
The DeRPN uses a pixel-level combination algorithm that first decodes the predicted width and height segments according to the following four equations:

x = x_a + w_a × t_x

y = y_a + h_a × t_y

w = w_a × exp(t_w)

h = h_a × exp(t_h)

where (x, w) and (y, h) constitute the predicted width and height segments, x_a, y_a, w_a, h_a are the corresponding anchor-string coordinates, and t_x, t_y, t_w, t_h are the predicted parameterized coordinates.
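For clarity, a direct Python transcription of the four decoding equations (an illustrative sketch; the function name decode_segments is an assumption):

```python
import numpy as np

# Direct transcription of the four decoding equations above; all arguments
# may be NumPy arrays of per-pixel values.
def decode_segments(t_x, t_y, t_w, t_h, x_a, y_a, w_a, h_a):
    x = x_a + w_a * t_x        # width-segment centre
    y = y_a + h_a * t_y        # height-segment centre
    w = w_a * np.exp(t_w)      # decoded width
    h = h_a * np.exp(t_h)      # decoded height
    return x, y, w, h
```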
Then, considering the whole set of width segments (denoted W), the width segments are screened by probability and the top N are selected (W_N). For each width segment in W_N, the top k height segments (y^(k), h^(k)) are selected at the corresponding pixel; these width/height segment pairs define a series of specific bounding boxes {(x, y^(k), w, h^(k))}, denoted B_w, whose combined bounding-box probabilities are

P_B = g(p_w, p_h^(k))

Similarly, the above steps are repeated for the height segments to obtain B_h = {(x^(k), y, w^(k), h)}; non-maximum suppression (NMS) with an IoU threshold of 0.7 is then applied to B_w and B_h, and finally the top M bounding boxes after NMS are taken as the text region proposals.
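The following Python sketch (not part of the original disclosure) illustrates the scoring and suppression stage, assuming boxes have already been assembled from width/height segment pairs as described above; corner-format boxes and the geometric mean for g are assumed choices, and box_iou/nms are standard helpers written out for self-containment.

```python
import numpy as np

# Boxes are assumed to be (x1, y1, x2, y2); g is taken as the geometric mean,
# one of the admissible choices for evaluating a combined box's probability.
def box_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: np.ndarray, scores: np.ndarray, thr: float = 0.7) -> list:
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        order = rest[box_iou(boxes[i], boxes[rest]) < thr]
    return keep

# Combined probability of a width/height segment pair (g = geometric mean):
# P_B = np.sqrt(p_w * p_h)
```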
Step 3: validate and refine the text proposals of step 2 using a refinement network, including text/non-text classification, bounding-box regression, and arbitrary-shape text region representation.

The geometric properties of the text (the text region, the text center line, and the bounding-box offsets) are used to accurately represent the shape of the text bounding box of step 2, wherein the text center line is obtained by shrinking the text bounding box, and the boundary offsets are four channel maps whose values are present only at positions corresponding to positive responses of the center-line feature map. n points are sampled at equal intervals from left to right on the predicted text center line, and at each point a normal perpendicular to the local tangent is drawn and intersected with the upper and lower boundary lines to obtain two boundary points. For each sampled center-line point, four boundary offsets are obtained by calculating the distances from the point to its two associated boundary points. By connecting all boundary points clockwise, a complete text polygon representation can be obtained.
Step 4: supervise the training of the network built in step 3 to obtain a detector model. As shown in fig. 2, annotated training images are input to train the model; the training images may be annotated with quadrangles or rectangles.
Designing a two-stage multitask loss function, and calculating loss by using the designed loss function:
L = L_a + λ·L_b

where L, L_a and L_b are the total loss, the first-stage loss and the second-stage loss, respectively, and λ is a weight coefficient balancing the first-stage and second-stage losses.
L_a = Σ_{j=1}^{N} (1/|R_j|) Σ_{i∈R_j} L_cls(P_i, P_i*) + λ_1·Σ_{j=1}^{N} (1/|G_j|) Σ_{i∈G_j} P_i*·L_reg(t_i, t_i*)

where R_j = {k | S_k = a_j, k = 1, 2, …, M} and G_j = {k | S_k = a_j, k ∈ A}. Here the anchor strings are set to a geometric series {a_n} = (16, 32, 64, 128, 256, 512, 1024); N is the number of items in {a_n}; M is the batch size; S_k denotes the anchor string of the k-th sample; and P_i denotes the predicted probability of the i-th anchor string in the mini-batch. If the anchor string is positive, the ground-truth label P_i* is set to 1; otherwise P_i* is 0. t_i denotes the prediction vector of parameterized coordinates and t_i* is the corresponding ground-truth vector. A is the set of matched anchor strings; R_j denotes the set of indices of anchor strings sharing the same scale, where j indexes the item a_j of {a_n}; similarly, G_j is the set of matched anchor-string indices sharing the same scale. The classification loss L_cls is the cross-entropy loss, the regression loss L_reg is the smooth L1 loss, and λ_1 is the balance parameter between L_cls and L_reg.
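An illustrative Python sketch of this scale-balanced loss follows (not part of the original disclosure); array layouts and the boolean encoding of the matched set A are assumptions.

```python
import numpy as np

# Scale-balanced first-stage loss L_a: cross-entropy terms are averaged within
# each same-scale set R_j, smooth-L1 regression terms within each matched set
# G_j, then summed over the N scales and balanced by lambda_1.
def stage_one_loss(P, P_star, t, t_star, scale_idx, matched, lam1=1.0):
    # P, P_star: (M,) predicted probability / 0-1 ground-truth label
    # t, t_star: (M, 2) parameterized coordinates and their targets
    # scale_idx: (M,) scale index j of each anchor string in {a_n}
    # matched:   (M,) bool mask of the matched set A
    eps, loss = 1e-9, 0.0
    for j in np.unique(scale_idx):
        R = scale_idx == j                          # same-scale indices R_j
        ce = -(P_star[R] * np.log(P[R] + eps)
               + (1 - P_star[R]) * np.log(1 - P[R] + eps))
        loss += ce.mean()
        G = R & matched                             # matched same-scale set G_j
        if G.any():
            d = np.abs(t[G] - t_star[G])
            sl1 = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)   # smooth L1
            loss += lam1 * sl1.sum(axis=1).mean()
    return loss
```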
L_b = L_1 + λ_2·L_2 + λ_3·L_3

where L_1, L_2 and L_3 are the text/non-text classification loss, the bounding-box regression loss and the arbitrary-shape representation loss, respectively, and λ_2, λ_3 are balance parameters.
L_1 = L_cls(P, t) = −log P_t. The text/non-text classification loss is a binary classification loss, where t is the class label (t = 1 indicates text, t = 0 indicates non-text), and the parameter P = (P_0, P_1) is the confidence of non-text and text after the softmax calculation.
L_2 = L_reg(v, v*) = Σ_{i∈{x,y,w,h}} smooth_L1(v_i − v_i*)

The bounding-box regression loss uses the smooth L1 loss, where v = (v_x, v_y, v_w, v_h) is the target of the bounding-box regression, comprising the center-point coordinates, the width and the height, and v* = (v_x*, v_y*, v_w*, v_h*) is the predicted tuple for each text proposal; v and v* use the parameterization given in Faster R-CNN, which specifies a scale-invariant translation and a log-space height/width shift relative to a target proposal.
L_3 = μ_1·L_tr + μ_2·L_tcl + μ_3·L_border, where L_tr and L_tcl are Dice-coefficient losses for the text region and the text center line, L_border is calculated with the smooth L1 loss, and μ_1, μ_2, μ_3 are balance parameters.
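An illustrative Python sketch of the second-stage shape losses follows (not part of the original disclosure); the Dice-loss form and the reduction of the border term to a mean are standard choices assumed here.

```python
import numpy as np

# Dice-coefficient losses for the text-region and centreline maps plus a
# smooth-L1 loss on the boundary offsets, weighted by mu_1..mu_3.
def dice_loss(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    inter = float((pred * gt).sum())
    return 1.0 - 2.0 * inter / (float(pred.sum() + gt.sum()) + eps)

def smooth_l1(x: np.ndarray) -> np.ndarray:
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def shape_loss(tr_p, tr_g, tcl_p, tcl_g, off_p, off_g, mu=(1.0, 1.0, 1.0)):
    l_border = float(smooth_l1(off_p - off_g).mean())
    return (mu[0] * dice_loss(tr_p, tr_g)
            + mu[1] * dice_loss(tcl_p, tcl_g)
            + mu[2] * l_border)
```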
Step 5: detect the picture to be detected by using the detector model of step 4, and output the polygonal text areas to obtain the final detection result.
Due to the irregular shapes and highly variable orientations of curved text, conventional bounding boxes do not fit curved scene text well, making it difficult to localize such text compactly and robustly. The RPN has proven to be an effective method for generating region proposals, but the anchor boxes it employs are very sensitive, which limits adaptability to different targets, and poorly set anchor boxes degrade performance. The region proposal network and the arbitrary-shape text region representation used in the invention have strong adaptability, improve the performance of natural scene curved-text detection, and localize text more tightly and robustly.

Claims (10)

1. A natural scene curve text detection method is characterized by comprising the following steps:
(1) acquiring a plurality of image data sets for training scene curve text detection;
(2) performing feature learning on the image data set obtained in the step (1) by using a Convolutional Neural Network (CNN), and generating a text proposal of an input image by using a dimension decomposition area proposal network DeRPN;
(3) validating and refining the text proposal in step (2) using a refining network, including text/non-text classification, bounding box regression, and arbitrary shape text region representation;
(4) carrying out supervision training on the network built in the step (3) to obtain a detector model;
(5) detecting the picture to be detected by using the detector model of step (4), and outputting the polygonal text areas to obtain the final detection result.
2. The natural scene curve text detection method according to claim 1, wherein in step (1), the image dataset is an existing public scene curved-text image dataset or a curved-text image dataset newly collected from real scenes; the image dataset comprises N training pictures, each training picture contains at least one curved text region, and each picture has an annotation file describing the position information of all text regions in the picture by the vertex coordinates of rectangles or polygons; the annotation file is called a label.
3. The natural scene curve text detection method of claim 1, wherein in step (2), the features (x) extracted by the convolutional neural network CNN are input to a regression layer and a classification layer; the regression layer, realized by a convolutional layer or a fully-connected layer, is a linear operation that predicts parameterized coordinates (t), which are decoded with respect to the anchors in order to obtain a predicted bounding box (B); the classification layer applies an activation function to the predicted values to generate the probability (P_B) of the bounding box; VGG16 is used as the backbone network and the DeRPN is attached to its conv5 layer; through a dimension-decomposition mechanism, the DeRPN introduces anchor strings (a_w, a_h) as independent regression references for object width and height, and simultaneously predicts independent segments (S_w(x,w), S_h(y,h)) and the corresponding probabilities (p_w, p_h) rather than a complete bounding box; the mathematical description of this process is as follows:

(t_w, t_h) = W_r·x + b_r

(S_w(x,w), S_h(y,h)) = decode((t_w, t_h), (a_w, a_h))

(p_w, p_h) = σ(W_c·x + b_c)

where W_r, b_r denote the weight and bias of the regression layer, and W_c, b_c denote the weight and bias of the classification layer; x, y, w, h are the coordinates of the bounding box and x_a, y_a, w_a, h_a are the corresponding anchor-string coordinates; t_w, t_h are the parameterized coordinates of the predicted width and height; S_w(x,w), S_h(y,h) denote the predicted independent width and height segments; a_w, a_h denote the independent regression references for object width and height; and p_w, p_h denote the corresponding width and height probabilities;
since the detection result requires a two-dimensional bounding box, the predicted segments need to be reasonably combined to recover bounding boxes; the combination process is described mathematically as follows:

B(x,y,w,h) = f(S_w(x,w), S_h(y,h))

P_B = g(p_w, p_h)

where f denotes a rule or algorithm for combining predicted segments, g is a function that evaluates the probability of a combined bounding box, P_B denotes the probability of the generated bounding box, and B(x,y,w,h) denotes the combined bounding box.
4. The natural scene curve text detection method of claim 1, wherein in step (3), the geometric properties of the text (the text region, the text center line, and the bounding-box offsets) are utilized to accurately represent the shape of the text bounding box of step (2), wherein the text center line is obtained by shrinking the text bounding box, and the boundary offsets are four channel maps whose values are present only at positions corresponding to positive responses of the center-line feature map; n points are sampled at equal intervals from left to right on the predicted text center line, and at each point a normal perpendicular to the local tangent is drawn and intersected with the upper and lower boundary lines to obtain two boundary points; for each sampled center-line point, four boundary offsets are obtained by calculating the distances from the point to its two associated boundary points; by connecting all boundary points clockwise, a complete text polygon representation is obtained.
5. The natural scene curve text detection method according to claim 1, wherein in the step (4), in constructing a detector model of the natural scene curve text detection method, the following loss function is used for calculating the loss:
L = L_a + λ·L_b

where L, L_a and L_b are the total loss, the first-stage loss and the second-stage loss, respectively, and λ is a weight coefficient balancing the first-stage and second-stage losses.
6. The natural scene curve text detection method of claim 5, wherein the first-stage loss is defined as follows:
L_a = Σ_{j=1}^{N} (1/|R_j|) Σ_{i∈R_j} L_cls(P_i, P_i*) + λ_1·Σ_{j=1}^{N} (1/|G_j|) Σ_{i∈G_j} P_i*·L_reg(t_i, t_i*)

where R_j = {k | S_k = a_j, k = 1, 2, …, M} and G_j = {k | S_k = a_j, k ∈ A}; the anchor strings are set to a geometric series {a_n} = (16, 32, 64, 128, 256, 512, 1024); N is the number of items in {a_n}; M is the batch size; S_k denotes the anchor string of the k-th sample; P_i denotes the predicted probability of the i-th anchor string in the mini-batch; if the anchor string is positive, the ground-truth label P_i* is set to 1, otherwise P_i* is 0; t_i denotes the prediction vector of parameterized coordinates and t_i* is the corresponding ground-truth vector; A is the set of matched anchor strings; R_j denotes the set of indices of anchor strings sharing the same scale, where j indexes the item a_j of {a_n}; similarly, G_j is the set of matched anchor-string indices sharing the same scale; the classification loss L_cls is the cross-entropy loss, the regression loss L_reg is the smooth L1 loss, and λ_1 is the balance parameter between L_cls and L_reg.
7. The natural scene curve text detection method of claim 5, wherein the second stage loss is defined as follows:
L_b = L_1 + λ_2·L_2 + λ_3·L_3

where L_1, L_2 and L_3 are the text/non-text classification loss, the bounding-box regression loss and the arbitrary-shape representation loss, respectively, and λ_2, λ_3 are balance parameters.
8. The natural scene curve text detection method of claim 7, wherein the text/non-text classification loss is a binary classification loss, L_1 = L_cls(P, t) = −log P_t, where t is the class label (t = 1 indicates text, t = 0 indicates non-text), and the parameter P = (P_0, P_1) is the confidence of non-text and text after the softmax calculation.
9. The natural scene curve text detection method of claim 1, wherein the bounding-box regression loss uses the smooth L1 loss:

L_2 = L_reg(v, v*) = Σ_{i∈{x,y,w,h}} smooth_L1(v_i − v_i*)

where v = (v_x, v_y, v_w, v_h) is the target of the bounding-box regression, comprising the center-point coordinates, the width and the height, and v* = (v_x*, v_y*, v_w*, v_h*) is the predicted tuple for each text proposal; v and v* use the parameterization given in Faster R-CNN, which specifies a scale-invariant translation and a log-space height/width shift relative to a target proposal.
10. The natural scene curve text detection method of claim 1, wherein the arbitrary-shape representation loss is L_3 = μ_1·L_tr + μ_2·L_tcl + μ_3·L_border, where L_tr and L_tcl are Dice-coefficient losses for the text region and the text center line, L_border is calculated with the smooth L1 loss, and μ_1, μ_2, μ_3 are balance parameters.
CN201911199614.6A 2019-11-29 2019-11-29 Natural scene curve text detection method Active CN110929665B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911199614.6A CN110929665B (en) 2019-11-29 2019-11-29 Natural scene curve text detection method

Publications (2)

Publication Number Publication Date
CN110929665A true CN110929665A (en) 2020-03-27
CN110929665B CN110929665B (en) 2022-08-26

Family

ID=69847731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911199614.6A Active CN110929665B (en) 2019-11-29 2019-11-29 Natural scene curve text detection method

Country Status (1)

Country Link
CN (1) CN110929665B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130204A1 (en) * 2017-10-31 2019-05-02 The University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image
CN109919025A (en) * 2019-01-30 2019-06-21 华南理工大学 Video scene Method for text detection, system, equipment and medium based on deep learning

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3905112A1 (en) * 2020-04-28 2021-11-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing text content and electronic device
US11810384B2 (en) 2020-04-28 2023-11-07 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for recognizing text content and electronic device
CN111753812A (en) * 2020-07-30 2020-10-09 上海眼控科技股份有限公司 Text recognition method and equipment
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112070082B (en) * 2020-08-24 2023-04-07 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112464798A (en) * 2020-11-24 2021-03-09 创新奇智(合肥)科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN113807336A (en) * 2021-08-09 2021-12-17 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN113807336B (en) * 2021-08-09 2023-06-30 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN115131797A (en) * 2022-06-28 2022-09-30 北京邮电大学 Scene text detection method based on feature enhancement pyramid network
CN115131797B (en) * 2022-06-28 2023-06-09 北京邮电大学 Scene text detection method based on feature enhancement pyramid network

Also Published As

Publication number Publication date
CN110929665B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN110929665B (en) Natural scene curve text detection method
CN109299274B (en) Natural scene text detection method based on full convolution neural network
Ghaderizadeh et al. Hyperspectral image classification using a hybrid 3D-2D convolutional neural networks
CN108549893B (en) End-to-end identification method for scene text with any shape
CN112232149B (en) Document multimode information and relation extraction method and system
CN109886121B (en) Human face key point positioning method for shielding robustness
Bouti et al. A robust system for road sign detection and classification using LeNet architecture based on convolutional neural network
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
Farag Recognition of traffic signs by convolutional neural nets for self-driving vehicles
CN112183545A (en) Method for recognizing natural scene text in any shape
Hossain et al. Recognition and solution for handwritten equation using convolutional neural network
Sharma et al. Deep eigen space based ASL recognition system
Sohal Improvement of artificial neural network based character recognition system, using SciLab
Yan Computational Methods for Deep Learning: Theory, Algorithms, and Implementations
He et al. Classification of metro facilities with deep neural networks
Li A deep learning-based text detection and recognition approach for natural scenes
Wu CNN-Based Recognition of Handwritten Digits in MNIST Database
Parashivamurthy et al. Recognition of Kannada character scripts using hybrid feature extraction and ensemble learning approaches
Varlik et al. Filtering airborne LIDAR data by using fully convolutional networks
Shi et al. Fuzzy support tensor product adaptive image classification for the internet of things
Li Special character recognition using deep learning
CN114708591A (en) Document image Chinese character detection method based on single character connection
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
Suvetha et al. Automatic Traffic Sign Detection System With Voice Assistant
Kurama et al. Detection of natural features and objects in satellite images by semantic segmentation using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant