CN106897732A - Multi-oriented text detection method in natural images based on linking text segments - Google Patents

Multi-oriented text detection method in natural images based on linking text segments

Info

Publication number
CN106897732A
Authority
CN
China
Prior art keywords
bounding box
text segment
word
link
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710010596.7A
Other languages
Chinese (zh)
Other versions
CN106897732B (en)
Inventor
Bai Xiang (白翔)
Shi Baoguang (石葆光)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710010596.7A
Publication of CN106897732A
Application granted
Publication of CN106897732B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Abstract

The invention discloses a multi-oriented text detection method for natural images based on linking text segments. Segments and links are the two key notions of the detection method and are defined as follows: a segment is an oriented bounding box on the image that covers a part of a word or text line; a link connects two adjacent segments, indicating that they belong to the same word or text line. Segments and links are detected jointly at several scales, on a regular grid, by a single fully convolutional neural network trained end to end. The final detections are obtained by first joining segments into groups through the links and then combining each group into one bounding box. Compared with the prior art, the proposed method achieves excellent accuracy, speed and model simplicity: it is efficient and robust, copes with complex image backgrounds, and can also detect long lines of non-Latin text in images.

Description

Multi-oriented text detection method in natural images based on linking text segments
Technical field
The invention belongs to the technical field of computer vision, and more particularly relates to a multi-oriented text detection method in natural images based on linking text segments.
Background technology
Reading text in natural images is a challenging and popular task with many practical applications in photo OCR, geo-location and image retrieval. In a text reading system, text detection, i.e. localising text regions with bounding boxes at the word or text-line level, is usually the crucial first step. In a sense, text detection can also be regarded as a particular kind of object detection in which words, characters or text lines are the detection targets.
Although existing techniques have achieved great success by applying object detection methods to text detection, object detection methods still have clear drawbacks when used to localise text regions. First, the aspect ratio of a word or text line is usually much larger than that of a general object, and previous methods have difficulty producing bounding boxes of such ratios. Second, some non-Latin scripts, such as Chinese, have no spaces between adjacent words; existing techniques can only detect words, so they do not apply to such text, because text without spaces provides no visual cue for separating individual words. Third, text in large-scale natural scene images may appear in any orientation, whereas most existing techniques can only detect horizontal text. Text detection in natural scene images therefore remains one of the difficulties of the computer vision field.
Summary of the invention
The object of the invention is to provide a multi-oriented text detection method in natural images based on linking text segments. The method detects text with high accuracy and speed, uses a simple model, is robust, can overcome complex image backgrounds, and can also detect long lines of non-Latin text.
To achieve the above object, the present invention approaches scene text detection from a new perspective and provides a multi-oriented text detection method in natural images based on linking text segments, comprising the following steps:
(1) Train the segment-link detection network model, including the following sub-steps:
(1.1) Annotate the text content of all text images in the training image set at entry level; each label is the four corner coordinates of the initial rectangular bounding box of an entry, giving the training data set.
(1.2) Define a segment detection model that predicts segments and links from the entry-level labels. The network model consists of a cascaded convolutional neural network and convolutional predictors. The segment and link labels are computed from the above training data set, a loss function is designed, and the network is trained by back-propagation, combined with online augmentation and online hard negative mining, to obtain the segment detection model, including the following sub-steps:
(1.2.1) Build the segment detection convolutional neural network model. The first convolutional units, which extract features, come from a pre-trained VGG-16 network, namely convolutional layer 1 through pooling layer 5, with fully connected layers 6 and 7 converted into convolutional layers 6 and 7. Several extra convolutional layers are appended to extract deeper features for detection, namely convolutional layers 8, 9 and 10, and the last layer is convolutional layer 11. Six of these convolutional layers output feature maps of different sizes, which makes it convenient to extract high-quality features at several scales, and segments and links are detected on these six feature maps of different sizes. For these six convolutional layers, a filter of size 3 × 3 is added after each layer as a convolutional predictor to jointly detect segments and links.
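For illustration only, the following is a minimal Python (PyTorch) sketch of a backbone with the structure described in step (1.2.1); it is not the patented implementation. The channel counts of the extra layers, the use of torchvision's VGG-16 and the feature-map tap points are assumptions; only the overall arrangement (a VGG-16 front end, fc6/fc7 re-expressed as conv6/conv7, extra layers conv8 to conv11, and a 3 × 3 convolutional predictor on each of six feature maps) follows the text above.

```python
import torch
import torch.nn as nn
import torchvision


class SegLinkBackbone(nn.Module):
    def __init__(self, num_outputs=31):  # 2 seg scores + 5 offsets + 16 + 8 link scores
        super().__init__()
        # VGG-16 front end: conv1_1 .. pool5 (randomly initialised here)
        self.vgg = torchvision.models.vgg16(weights=None).features
        # fc6 / fc7 of VGG-16 re-expressed as convolutions (conv6, conv7)
        self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
        self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

        # extra layers conv8 .. conv11, each halving the spatial resolution
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout // 2, 1), nn.ReLU(inplace=True),
                nn.Conv2d(cout // 2, cout, 3, stride=2, padding=1),
                nn.ReLU(inplace=True))

        self.conv8 = block(1024, 512)
        self.conv9 = block(512, 256)
        self.conv10 = block(256, 256)
        self.conv11 = block(256, 256)
        # one 3x3 convolutional predictor per feature map used for detection
        self.predictors = nn.ModuleList([
            nn.Conv2d(c, num_outputs, kernel_size=3, padding=1)
            for c in (512, 1024, 512, 256, 256, 256)])

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i == 22:              # activation after conv4_3 in torchvision's VGG-16
                feats.append(x)
        x = torch.relu(self.conv7(torch.relu(self.conv6(x))))
        feats.append(x)
        for extra in (self.conv8, self.conv9, self.conv10, self.conv11):
            x = extra(x)
            feats.append(x)
        # six prediction maps of decreasing resolution, 31 channels each
        return [pred(f) for pred, f in zip(self.predictors, feats)]
```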
(1.2.2) Generate segment bounding-box labels from the annotated word bounding boxes. Let Itr be the original training image set and Itr' the training image set after scaling, with w_I and h_I the width and height of images in Itr' (for example 384 × 384 or 512 × 512 pixels). The i-th image Itr_i' is the model input, and the word bounding boxes annotated on Itr_i' are denoted W_i = [W_i1, ..., W_ip], where W_ij is the j-th word bounding box on the i-th image; the word bounding boxes may be at word level or at entry level, j = 1, ..., p, and p is the total number of word bounding boxes on the i-th image. The feature maps output by the last six convolutional layers form the set Itro_i' = [Itro_i1', ..., Itro_i6'], where Itro_il' is the feature map output by the l-th of these layers and w_l, h_l are its width and height. A coordinate (x, y) on Itro_il' corresponds to a horizontal initial bounding box B_ilq on Itr_i' centred at the point (x_a, y_a), which satisfies
x_a = (w_I / w_l)(x + 0.5),  y_a = (h_I / h_l)(y + 0.5).
The width and height of the initial bounding box B_ilq are both set to a constant a_l, which controls the scale of the output segments, l = 1, ..., 6. The set of initial bounding boxes corresponding to the l-th output feature map Itro_il' is denoted B_il = [B_il1, ..., B_ilm], q = 1, ..., m, where m is the number of initial bounding boxes on that feature map. An initial bounding box B_ilq is labelled positive, with label value 1, as long as its centre lies inside any annotated word bounding box W_ij on Itr' and its size a_l and the height h of that word bounding box satisfy
max(a_l / h, h / a_l) ≤ 1.5,
in which case it is matched to the word bounding box W_ij whose height is closest to a_l. Otherwise, when B_ilq and all word bounding boxes W_i fail to meet the above two conditions, B_ilq is labelled negative with label value 0. Segments are generated on the initial bounding boxes and share their label class; the proportionality constant 1.5 is an empirical value.
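As a hedged illustration of the labelling rule in step (1.2.2), the sketch below (NumPy, assuming the ground-truth word boxes have already been converted to (cx, cy, w, h, theta) form with theta in radians) maps a feature-map location to the centre of its initial bounding box and applies the centre-inside and max(a_l/h, h/a_l) ≤ 1.5 criteria; the helper names are hypothetical.

```python
import numpy as np


def default_box_center(x, y, w_l, h_l, w_I, h_I):
    """Map feature-map location (x, y) on a w_l x h_l map to the image-space
    centre (x_a, y_a) of its initial (default) bounding box."""
    return (w_I / w_l) * (x + 0.5), (h_I / h_l) * (y + 0.5)


def center_inside_rotated_rect(pt, rect):
    """True if point pt = (px, py) lies inside the rotated rectangle
    rect = (cx, cy, w, h, theta)."""
    (px, py), (cx, cy, w, h, theta) = pt, rect
    dx, dy = px - cx, py - cy
    u = dx * np.cos(theta) + dy * np.sin(theta)      # express in the box frame
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return abs(u) <= w / 2 and abs(v) <= h / 2


def match_default_box(center, a_l, word_boxes):
    """Return the index of the matched word box (positive default box) or None
    (negative): the centre must lie inside the word box and the box size a_l
    must satisfy max(a_l / h, h / a_l) <= 1.5, preferring the closest height."""
    best, best_gap = None, None
    for j, (cx, cy, w, h, theta) in enumerate(word_boxes):
        if not center_inside_rotated_rect(center, (cx, cy, w, h, theta)):
            continue
        if max(a_l / h, h / a_l) <= 1.5:
            gap = abs(a_l - h)
            if best is None or gap < best_gap:
                best, best_gap = j, gap
    return best
```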
(1.2.3) Generate segments on the labelled initial bounding boxes produced in step (1.2.2) and compute the offsets of positive segments. A negative segment bounding box s- is simply the negative initial bounding box B-. A positive segment bounding box s+ is obtained from the positive initial bounding box B+ by the following steps: a) let θ_s be the angle between the matched annotated word bounding box W and the horizontal direction, and rotate W clockwise by θ_s about the centre of B+; b) crop W, removing the parts that extend beyond the left and right sides of B+; c) rotate the cropped word bounding box W' counter-clockwise by θ_s about the centre of B+, giving the ground-truth geometric parameters x_s, y_s, w_s, h_s, θ_s of the segment s+; d) compute the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) of s+ relative to B+ from the following formulas:
xs=alΔxs+xa
ys=alΔys+ya
ws=alexp(Δws)
hs=alexp(Δhs)
θs=Δ θs
Here x_s, y_s, w_s, h_s, θ_s are respectively the centre abscissa, centre ordinate, width, height and angle to the horizontal of the segment bounding box s+; x_a, y_a, w_a, h_a are respectively the centre abscissa, centre ordinate, width and height of the horizontal initial bounding box B+; and Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s are respectively the offset of the centre abscissa x_s relative to the initial bounding box B+, the offset of the ordinate y_s relative to the initial bounding box, the offset of the width w_s, the offset of the height h_s, and the offset of the angle θ_s.
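The following NumPy sketch illustrates step (1.2.3) under the same assumptions as above (word and segment boxes given as (cx, cy, w, h, theta), theta in radians, mathematical rotation convention): it derives the ground-truth segment by the rotate, crop and rotate-back procedure and encodes it with the offset formulas above. It is an illustrative reading of the step, not the patented implementation.

```python
import numpy as np


def segment_ground_truth(word_box, x_a, y_a, a_l):
    """Ground-truth segment for a positive initial box: rotate the matched word
    box so it is horizontal, crop it to the initial box's horizontal extent,
    then rotate the cropped box back (steps a)-c) above)."""
    cx, cy, w, h, theta = word_box
    # a) rotate the word-box centre clockwise by theta about (x_a, y_a)
    dx, dy = cx - x_a, cy - y_a
    cx_r = x_a + dx * np.cos(theta) + dy * np.sin(theta)
    cy_r = y_a - dx * np.sin(theta) + dy * np.cos(theta)
    # b) crop to the horizontal extent [x_a - a_l/2, x_a + a_l/2] of the initial box
    left = max(cx_r - w / 2, x_a - a_l / 2)
    right = min(cx_r + w / 2, x_a + a_l / 2)
    cx_c, w_c = (left + right) / 2, right - left
    # c) rotate the cropped centre counter-clockwise by theta about (x_a, y_a)
    dx, dy = cx_c - x_a, cy_r - y_a
    x_s = x_a + dx * np.cos(theta) - dy * np.sin(theta)
    y_s = y_a + dx * np.sin(theta) + dy * np.cos(theta)
    return x_s, y_s, w_c, h, theta          # (x_s, y_s, w_s, h_s, theta_s)


def encode_offsets(segment, x_a, y_a, a_l):
    """d) invert the formulas x_s = a_l*dx + x_a, w_s = a_l*exp(dw), etc."""
    x_s, y_s, w_s, h_s, theta_s = segment
    return ((x_s - x_a) / a_l, (y_s - y_a) / a_l,
            np.log(w_s / a_l), np.log(h_s / a_l), theta_s)
```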
(1.2.4) Compute link labels for the segment bounding boxes produced in step (1.2.3). Segments s are generated on the initial bounding boxes B, so the link labels between segments are identical to the link labels between their corresponding initial bounding boxes. For the feature-map set Itro_i' = [Itro_i1', ..., Itro_i6']: if, within the initial bounding-box set B_il of the same feature map Itro_il', two initial bounding boxes are both labelled positive and are matched to the same word, then the within-layer link between them is labelled positive, otherwise negative; if an initial bounding box in the set B_il corresponding to Itro_il' and an initial bounding box in the set B_i(l-1) corresponding to Itro_i(l-1)' are both labelled positive and are matched to the same word bounding box W_ij, then the cross-layer link between them is labelled positive, otherwise negative.
(1.2.5) Take the scaled training image set Itr' as the input of the segment detection model and predict the segment output s. The model weights and biases are initialised, the learning rate of the first 60,000 training iterations is set to 10^-3, and afterwards the learning rate decays to 10^-4. For each of the last six convolutional layers, at a coordinate (x, y) on the l-th feature map Itro_il', which corresponds to the initial bounding box B_ilq of size a_l centred at the point (x_a, y_a) on the input image Itr_i', the 3 × 3 convolutional predictor predicts the scores c_s with which B_ilq is classified as positive or negative; c_s is a two-dimensional vector whose values lie between 0 and 1. At the same time it predicts five numbers (Δx̂_s, Δŷ_s, Δŵ_s, Δĥ_s, Δθ̂_s) as the geometric offsets in case the box is classified as a positive segment s+, namely the offset of the centre abscissa relative to the positive initial bounding box B+, the offset of the ordinate relative to B+, the offset of the width, the offset of the height, and the offset of the angle.
(1.2.6) Predict within-layer and cross-layer link outputs on the basis of the predicted segments. For within-layer links, at a coordinate (x, y) on the feature map Itro_il', take the neighbouring points (x', y') in the range x-1 ≤ x' ≤ x+1, y-1 ≤ y' ≤ y+1; mapped back to the input image Itr_i', these eight points give the eight within-layer neighbouring segments s(x', y', l) connected to the reference segment s(x, y, l) corresponding to (x, y). The eight within-layer neighbours can be written as the set N^w(x, y, l) = { s(x', y', l) : x-1 ≤ x' ≤ x+1, y-1 ≤ y' ≤ y+1, (x', y') ≠ (x, y) }.
The 3 × 3 convolutional predictor predicts the positive and negative scores c_l1 of the links between s(x, y, l) and its within-layer neighbour set; c_l1 is a 16-dimensional vector (two scores per neighbour), and the superscript w denotes a within-layer link.
For cross-layer links, a cross-layer link connects the segments corresponding to two points on the feature maps output by two consecutive convolutional layers. Because each further convolutional layer halves the width and height of the feature map, the width w_l and height h_l of the l-th output feature map Itro_il' are half the width w_(l-1) and height h_(l-1) of the (l-1)-th feature map Itro_i(l-1)', and the initial bounding-box scale a_l of Itro_il' is twice the scale a_(l-1) of Itro_i(l-1)'. For a point (x, y) on the l-th output feature map Itro_il', take the four cross-layer neighbouring points (x', y') on Itro_i(l-1)' in the range 2x ≤ x' ≤ 2x+1, 2y ≤ y' ≤ 2y+1; the initial bounding box that (x, y) on Itro_il' corresponds to on the input image Itr_i' then spatially overlaps the four initial bounding boxes that these four cross-layer neighbours correspond to on Itr_i'. The four cross-layer neighbouring segments can be written as the set N^c(x, y, l) = { s(x', y', l-1) : 2x ≤ x' ≤ 2x+1, 2y ≤ y' ≤ 2y+1 }.
The 3 × 3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and its neighbouring segment set on layer l-1; c_l2 is an 8-dimensional vector, two scores per cross-layer neighbour, and the superscript c denotes a cross-layer link.
All within-layer links and all cross-layer links together constitute the link set N_s.
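A small sketch of the neighbourhood structure used in steps (1.2.5) and (1.2.6) follows; the per-location channel split (2 segment scores + 5 offsets + 16 within-layer link scores + 8 cross-layer link scores = 31 channels) is implied by the text, while the function names and the exact channel ordering are assumptions.

```python
def within_layer_neighbours(x, y):
    """The 8 within-layer neighbour locations of (x, y) on the same feature map."""
    return [(xp, yp) for xp in (x - 1, x, x + 1)
                     for yp in (y - 1, y, y + 1)
                     if (xp, yp) != (x, y)]


def cross_layer_neighbours(x, y):
    """The 4 locations on the previous (twice-resolution) feature map whose
    initial bounding boxes spatially overlap the box at (x, y) on layer l."""
    return [(xp, yp) for xp in (2 * x, 2 * x + 1)
                     for yp in (2 * y, 2 * y + 1)]


# Assumed per-location channel layout of one 3x3 predictor (31 channels total):
#   0:2    segment positive/negative scores
#   2:7    segment offsets (dx, dy, dw, dh, dtheta)
#   7:23   within-layer link scores, 2 per neighbour (8 neighbours)
#   23:31  cross-layer link scores, 2 per neighbour (4 neighbours)
```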
(1.2.7) Take the segment labels and link labels obtained in steps (1.2.3) and (1.2.4) and the ground-truth offsets of positive segments as the reference output, and take the segment classes and scores and segment offsets predicted in step (1.2.5) and the link scores predicted in step (1.2.6) as the predicted output; design a target loss function between the predicted output and the reference output and train the segment-link detection model by back-propagation so as to minimise the segment classification loss, the segment offset regression loss and the link classification loss. The target loss function designed for the segment-link detection model is the weighted sum of three losses:
L(y_s, c_s, y_l, c_l, ŝ, s) = (1/n_s) L_conf(y_s, c_s) + λ1 (1/n_s) L_loc(ŝ, s) + λ2 (1/n_l) L_conf(y_l, c_l)
where y_s is the label of all segments, c_s is the predicted segment score, y_l is the link label, and c_l is the predicted link score, composed of the within-layer scores c_l1 and the cross-layer scores c_l2; if the i-th initial bounding box is labelled positive then y_s(i) = 1, otherwise 0. L_conf(y_s, c_s) is the softmax loss of the predicted segment scores c_s, L_conf(y_l, c_l) is the softmax loss of the predicted link scores c_l, and L_loc(ŝ, s) is the smooth L1 regression loss between the predicted segment geometric parameters ŝ and the ground truth s. n_s is the number of positive initial bounding boxes and normalises the segment classification and regression losses; n_l is the total number of positive links and normalises the link classification loss; λ1 and λ2 are weight constants, set to 1 in practice.
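For illustration, a minimal PyTorch sketch of an objective with this structure is given below. The tensor shapes are assumptions (seg_logits of shape (N, 2), seg_offsets and seg_targets of shape (N, 5), link_logits of shape (M, 2), 0/1 integer labels); only the structure of the loss (softmax losses for segments and links, smooth L1 regression on positive segments, normalisation by n_s and n_l, λ1 = λ2 = 1) follows the text.

```python
import torch
import torch.nn.functional as F


def seglink_loss(seg_logits, seg_labels, seg_offsets, seg_targets,
                 link_logits, link_labels, lambda1=1.0, lambda2=1.0):
    """Weighted sum of segment classification, segment regression and link
    classification losses, normalised by n_s and n_l as described above."""
    n_s = seg_labels.sum().clamp(min=1).float()    # number of positive initial boxes
    n_l = link_labels.sum().clamp(min=1).float()   # number of positive links
    l_seg_cls = F.cross_entropy(seg_logits, seg_labels, reduction='sum') / n_s
    pos = seg_labels == 1
    l_seg_reg = F.smooth_l1_loss(seg_offsets[pos], seg_targets[pos],
                                 reduction='sum') / n_s
    l_link_cls = F.cross_entropy(link_logits, link_labels, reduction='sum') / n_l
    return l_seg_cls + lambda1 * l_seg_reg + lambda2 * l_link_cls
```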
(1.2.8) During the training of step (1.2.7), augment the training data Itr online and balance positive and negative samples with an online hard negative mining strategy. Before the training images Itr are scaled to the same size and loaded in batches, they are randomly cropped into image blocks such that each block has at least a minimum Jaccard overlap o with a segment's ground-truth bounding box. For multi-oriented text, the data augmentation is performed on the minimum enclosing rectangle of the multi-oriented word bounding box. The overlap coefficient o of each sample is chosen randomly from 0, 0.1, 0.3, 0.5, 0.7 and 0.9, and the size of an image block is between 0.1 and 1 times the original image size; training images are not flipped horizontally. In addition, because negative samples of segments and links make up most of the training samples, positive and negative samples are balanced with an online hard negative mining strategy, applied separately to segments and links, which keeps the ratio of negative to positive samples at no more than 3:1.
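A possible sketch of the online hard negative mining rule of step (1.2.8) is shown below (NumPy); it assumes a per-sample difficulty score is available for ranking negatives, and the same rule would be applied separately to segments and to links.

```python
import numpy as np


def hard_negative_mask(labels, neg_difficulty, ratio=3):
    """labels: 0/1 array; neg_difficulty: per-sample loss (or background
    confidence deficit) used to rank negatives; keep all positives and only the
    hardest negatives so that negatives do not exceed ratio * positives."""
    pos = labels == 1
    n_keep = ratio * max(int(pos.sum()), 1)
    neg_idx = np.flatnonzero(~pos)
    hardest = neg_idx[np.argsort(-neg_difficulty[neg_idx])[:n_keep]]
    mask = pos.copy()
    mask[hardest] = True
    return mask   # boolean mask of samples that contribute to the loss
```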
(2) Detect segments and links on the text image to be detected with the convolutional neural network trained above, including the following sub-steps:
(2.1) Detect segments on the text image to be detected; the feature maps output by different convolutional layers predict segments of different scales, and the feature map output by the same convolutional layer predicts segments of the same scale. The i-th test image Itst_i in the test image set Itst is scaled to a uniform size, which can be set manually according to the images to be detected; the scaled test image is denoted Itst_i'. The image Itst_i' is fed into the segment-link detection model trained in step (1.2), giving the set Itsto_i' = [Itsto_i1', ..., Itsto_i6'] of feature maps output by the last six convolutional layers, where Itsto_il' is the feature map output by the l-th of these layers, l = 1, ..., 6. At every coordinate (x, y) on each output feature map Itsto_il', the 3 × 3 convolutional predictor predicts the scores c_s with which the corresponding initial bounding box B_ilq is classified as a positive or negative segment, and also predicts the five numbers (Δx̂_s, Δŷ_s, Δŵ_s, Δĥ_s, Δθ̂_s) as the geometric offsets in case it is predicted to be a positive segment s+.
(2.2) Detect links between the segments detected on all feature layers of the image; the links include within-layer links and cross-layer links. On the basis of the segments predicted in (2.1), predict the within-layer and cross-layer links: for within-layer links, at a coordinate (x, y) on the feature map Itsto_il', the 3 × 3 convolutional predictor predicts the positive and negative scores c_l1 of the links between s(x, y, l) and its eight within-layer neighbouring segments; for cross-layer links, the predictor predicts the positive and negative scores c_l2 of the links between the reference segment s(x, y, l) on layer l and its four neighbouring segments on layer l-1; c_l1 and c_l2 together constitute the predicted link scores c_l.
(2.3) Combine the segment confidence scores and link confidence scores obtained by detection, where a segment confidence score comprises the positive/negative class score of the segment and its offset score, and output softmax-normalised scores with the convolutional predictors.
(3) Combine segments and links to obtain the output bounding boxes, including the following sub-steps:
(3.1) Filter the segments and links output by the convolutional predictors according to the normalised scores obtained in (2.3), and build a connection graph with the filtered segments as nodes and the filtered links as edges. The segments s and links N_s produced by feeding the text image to be detected of step (2) into the segment detection model are filtered by their scores, with different thresholds α for segments and β for links; the thresholds can be set manually for different data, and in practice α = 0.9 and β = 0.7 can be used for multi-oriented text detection, α = 0.9 and β = 0.5 for multi-lingual long-text detection, and α = 0.6 and β = 0.3 for horizontal text detection. The filtered segments s' are taken as nodes and the filtered links N_s' as edges to build a graph.
(3.2) Perform a depth-first search on the graph to find its connected components; each component, denoted as a set S, contains the segments joined together by links.
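The following Python sketch illustrates steps (3.1) and (3.2): filtering segments and links by the thresholds α and β, building the graph and collecting connected components with an iterative depth-first search. The data layout (per-segment scores, links given as pairs of segment indices) is an assumption.

```python
def connected_segments(seg_scores, link_scores, links, alpha=0.9, beta=0.7):
    """seg_scores: list of segment scores; links: list of (i, j) segment-index
    pairs; link_scores: one score per link. Returns lists of segment indices,
    one list per connected component (one component per text instance)."""
    keep = {i for i, s in enumerate(seg_scores) if s >= alpha}
    adj = {i: [] for i in keep}
    for (i, j), s in zip(links, link_scores):
        if s >= beta and i in keep and j in keep:
            adj[i].append(j)
            adj[j].append(i)
    components, seen = [], set()
    for start in keep:                      # iterative depth-first search
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node])
        components.append(comp)
    return components
```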
(3.3) Combine the segment set S obtained by the depth-first search of step (3.2) into one complete word by the following steps (a code sketch of this combining procedure is given after step (3.3.9)), including:
(3.3.1) Input: the segment set S, where |S| is the number of segments in S, s(i) is the i-th segment, i is the index, x_s(i) and y_s(i) are the centre abscissa and ordinate of the i-th segment bounding box s(i), w_s(i) and h_s(i) are its width and height, and θ_s(i) is its angle to the horizontal;
(3.3.2) θ_b := (1/|S|) Σ_i θ_s(i), where θ_b is the deviation angle of the output bounding box and θ_s(i) is the deviation angle of the i-th segment bounding box in the set, so θ_b is the mean deviation angle of all segments in S;
(3.3.3) find the intercept b of the straight line y = tan(θ_b)·x + b that minimises the sum of the distances from the centre points (x_s(i), y_s(i)) of all segments in S to the line;
(3.3.4) find the two end points (x_p, y_p) and (x_q, y_q) of the line, i.e. the two extreme projections of the segment centres onto the line, where p denotes the first end point and q the second; x_p, y_p are the abscissa and ordinate of the first end point and x_q, y_q those of the second;
(3.3.5) (x_b, y_b) := ((x_p + x_q)/2, (y_p + y_q)/2), where b denotes the output bounding box and x_b, y_b are the abscissa and ordinate of its centre;
(3.3.6) w_b := sqrt((x_q - x_p)^2 + (y_q - y_p)^2) + (w_p + w_q)/2, where w_b is the width of the output bounding box and w_p, w_q are the widths of the segment bounding boxes centred at the points p and q respectively;
(3.3.7) h_b := (1/|S|) Σ_i h_s(i), where h_b is the height of the output bounding box and h_s(i) is the height of the i-th segment bounding box in the set, so h_b is the mean height of all segments in S;
(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b); b is the output bounding box, represented by its centre coordinates, size parameters and angle parameter;
(3.3.9) output the combined bounding box b.
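Below is a NumPy sketch of the combining procedure (3.3.1) to (3.3.9) referred to above; it uses a least-squares intercept in step (3.3.3) and assumes angles are in radians and not close to ±90 degrees, so it is an illustrative simplification rather than the exact patented computation.

```python
import numpy as np


def combine_segments(segments):
    """segments: array of shape (n, 5) with rows (x, y, w, h, theta);
    returns the combined bounding box (x_b, y_b, w_b, h_b, theta_b)."""
    segs = np.asarray(segments, dtype=float)
    xs, ys, ws, hs, thetas = segs.T
    theta_b = thetas.mean()                                    # (3.3.2)
    # (3.3.3) least-squares intercept of the line y = tan(theta_b) * x + b
    b = np.mean(ys - np.tan(theta_b) * xs)
    # (3.3.4) project the segment centres onto the line, take the two extremes
    d = np.array([np.cos(theta_b), np.sin(theta_b)])           # unit direction
    p0 = np.array([0.0, b])                                    # a point on the line
    t = (np.stack([xs, ys], axis=1) - p0) @ d
    (xp, yp), (xq, yq) = p0 + t.min() * d, p0 + t.max() * d
    x_b, y_b = (xp + xq) / 2, (yp + yq) / 2                    # (3.3.5)
    w_p, w_q = ws[t.argmin()], ws[t.argmax()]
    w_b = np.hypot(xq - xp, yq - yp) + (w_p + w_q) / 2         # (3.3.6)
    h_b = hs.mean()                                            # (3.3.7)
    return x_b, y_b, w_b, h_b, theta_b                         # (3.3.8)-(3.3.9)
```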
Through the technical scheme conceived above, and compared with the prior art, the present invention has the following technical effects:
(1) Multi-oriented text can be detected. Text in natural scene images is often arbitrarily oriented or deformed; in the method of the invention a text region is described locally by segment bounding boxes, and a segment bounding box can take any orientation, so multi-oriented or deformed text can be covered.
(2) High flexibility. The method can also detect text lines of arbitrary length, because combining segments relies only on the predicted links; both words and text lines can therefore be detected.
(3) Strong robustness. The method uses segment bounding boxes for local description; such a local description can overcome complex natural image backgrounds and capture text features from the image.
(4) High efficiency. The segment detection model is trained end to end and processes more than 20 images of size 512 × 512 per second, because segments and links are obtained in a single forward pass of the fully convolutional CNN model, with no need to scale or rotate the input image offline.
(5) High versatility. Some non-Latin scripts, such as Chinese, have no spaces between adjacent words; existing techniques can only detect words and do not apply to such text, because text without spaces provides no visual cue for separating individual words. Besides Latin text, the present invention can also detect long lines of non-Latin text, because the method does not rely on spaces for visual separation.
Brief description of the drawings
Fig. 1 is the flow chart of the multi-oriented text detection method in natural images based on segment linking according to the present invention;
Fig. 2 is a schematic diagram of computing the ground-truth parameters of a segment according to the present invention;
Fig. 3 is a schematic diagram of the output composition of the convolutional predictor according to the present invention;
Fig. 4 is the network connection diagram of the segment-link detection model according to the present invention;
Fig. 5 shows, for one embodiment of the invention, the segments and links detected on an image to be detected with the trained segment-link detection network model and the resulting output bounding boxes.
Specific embodiment
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below can be combined with each other as long as they do not conflict with each other.
The technical terms of the invention are first explained and illustrated:
Convolutional Neural Network (CNN): a neural network that can be used for tasks such as image classification and regression. The network generally consists of convolutional layers, down-sampling layers and fully connected layers; the convolutional and down-sampling layers extract image features, while the fully connected layers perform classification or regression. The parameters of the network include the convolution kernels and the parameters and biases of the fully connected layers, and they can be learned from data with the back-propagation algorithm.
VGG16: VGGNet was the runner-up of ILSVRC 2014. It contains 16 CONV/FC layers, has a very uniform architecture that uses only 3 × 3 convolutions and 2 × 2 pooling layers from beginning to end, and has become a classic convolutional neural network model. Its pre-trained model is available for plug-and-play use in Caffe. It demonstrated that network depth is a key component of good performance.
Depth-First Search (DFS): an algorithm for traversing or searching a tree or graph. It traverses the nodes of the tree along its depth, searching the branches of the tree as deep as possible. When all edges at a node v have been explored, the search backtracks to the start node of the edge from which v was discovered, and this process continues until all nodes reachable from the source node have been found. If undiscovered nodes remain, one of them is selected as a new source node and the process is repeated until all nodes have been visited. DFS is a classic algorithm of graph theory; it can be used to produce a topological ordering of the target graph, with which many related graph problems, such as longest-path problems, can be conveniently solved.
As shown in Fig. 1, in one embodiment the multi-oriented text detection method of the present invention based on segment linking is carried out according to steps (1) to (3) described above: the segment-link detection network model is trained as in step (1), segments and links are detected on the images to be detected as in step (2), and the detected segments are combined into output bounding boxes as in step (3); Fig. 5 shows the detection results obtained in this way.

Claims (7)

1. A multi-oriented text detection method in natural images based on linking text segments, characterised in that the method comprises the following steps:
(1) training a segment-link detection network model, including the following sub-steps:
(1.1) annotating the text content of all text images in a training image set at entry level, each label being the four corner coordinates of the initial rectangular bounding box of an entry, to obtain a training data set;
(1.2) defining a segment-link detection network model that predicts segments and links from the entry-level labels, the segment-link detection network model consisting of a cascaded convolutional neural network and convolutional predictors; computing segment and link labels from the above training data set, designing a loss function, and training the segment-link detection network by back-propagation combined with online augmentation and an online hard negative mining method, to obtain the segment-link detection network model;
(2) detecting segments and links on a text image to be detected with the trained segment-link detection network model, including the following sub-steps:
(2.1) detecting segments on the text image to be detected, the feature maps output by different convolutional layers predicting segments of different scales and the feature map output by the same convolutional layer predicting segments of the same scale;
(2.2) detecting links between the segments detected on all feature layers of the text image to be detected, the links including within-layer links and cross-layer links;
(2.3) combining the confidence scores of the detected segments and the confidence scores of the links, a segment confidence score comprising the positive/negative class score of the segment and its offset score, and outputting softmax-normalised scores with the convolutional predictors;
(3) combining segments and links to obtain output bounding boxes, including the following sub-steps:
(3.1) filtering the segments and links output by the convolutional predictors according to the normalised scores obtained in (2.3), and building a connection graph with the filtered segments as nodes and the links as edges;
(3.2) performing a depth-first search on the graph to find its connected components, each component being denoted as a set S and containing the segments joined together by links;
(3.3) combining the segments in each set into one complete entry, computing the bounding box of the complete entry and outputting it.
2. according to claim 1 based on multi-direction Method for text detection, its feature in the natural picture for connecting word section It is that the step (1.2) is specially:
(1.2.1) builds word section detection convolutional neural networks model:The preceding several layers of convolution units for extracting feature come from pre-training VGG-16 networks, preceding several layers of convolution units are respectively converted into for convolutional layer 1 to pond layer 5, full articulamentum 6 and full articulamentum 7 Convolutional layer 6 and convolutional layer 7, it is behind some extra convolutional layers for adding to connect, and the feature for extracting more depth is carried out Detection, including convolutional layer 8, convolutional layer 9, convolutional layer 10, last layer is convolutional layer 11;6 different convolutional layer difference are defeated afterwards Go out various sizes of characteristic pattern, be easy to extract the high-quality characteristics of various yardsticks, detection word section and connection are at this six Carried out on various sizes of characteristic pattern;For this 6 convolutional layers, the wave filter that size is 3 × 3 is all added after each layer and is made It is convolution fallout predictor, to detect word section and connection jointly;
(1.2.2) produces word section bounding box label from the word bounding box of mark:For original training image collection Itr, note scaling Training image collection afterwards is Itr ', wI、hIThe respectively width of Itr ' and height, with the i-th pictures Itri' as mode input, ItriAll word bounding boxs of ' upper mark are denoted as Wi=[Wi1..., Wip], wherein WijFor j-th word on the i-th pictures is surrounded Box, word bounding box is word level or entry rank, and j=1 ..., p, p are ItriThe total quantity of ' upper word bounding box;6 after note The characteristic pattern that layer convolutional layer is exported respectively constitutes set Itroi'=[Itroi1' ..., Itroi6'], wherein Itroil' it is rear 6 The l layers of characteristic pattern of output, w in layer convolutional layerl、hlThe respectively width and height of this feature figure, Itroil' on coordinate (x, Y) correspondence Itri' on (xa, ya) centered on point coordinates the initial bounding box B of levelilq, they meet following equation:
The width and height of an initial bounding box Bilq are both set to a constant al, which controls the scale of the output word segments, l = 1, ..., 6; the set of initial bounding boxes corresponding to the feature map Itroil′ output by the l-th layer is denoted Bil = [Bil1, ..., Bilm], q = 1, ..., m, where m is the number of initial bounding boxes on the l-th output feature map; if the centre of an initial bounding box Bilq is contained in any annotated word bounding box Wij on Itr′, and the size al of Bilq and the height h of that word bounding box Wij satisfy max(al / h, h / al) ≤ 1.5, then this initial bounding box Bilq is labelled as the positive class, its label value is 1, and it is matched to the word bounding box Wij whose height is closest; otherwise, when Bilq satisfies neither of the above two conditions for any word bounding box in Wi, Bilq is labelled as the negative class and its label value is 0; word segments are generated on the initial bounding boxes and share the label class of their initial bounding box;
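The following numpy sketch illustrates this assignment: a feature-map cell is mapped back to an image-space default box of size al and labelled positive when its centre falls inside a ground-truth word box of compatible height. The helper names, the box representation and the height-ratio test written as 1.5 are assumptions made for illustration.

```python
# Hedged sketch of (1.2.2): map a feature-map cell (x, y) on layer l back to an
# image-space default box of size a_l, then label it positive when its centre
# falls inside a ground-truth word box of compatible height.
import numpy as np

def default_box_center(x, y, w_l, h_l, w_img, h_img):
    """Image coordinates (xa, ya) of cell (x, y) on a w_l x h_l feature map."""
    xa = (x + 0.5) * w_img / w_l
    ya = (y + 0.5) * h_img / h_l
    return xa, ya

def label_default_box(xa, ya, a_l, word_boxes):
    """word_boxes: list of dicts with keys cx, cy, w, h, theta (annotated boxes W_ij).

    Returns (label, matched_box): label 1 with the box whose height best fits
    a_l, or (0, None) when no box satisfies both tests.
    """
    best, best_ratio = None, np.inf
    for wb in word_boxes:
        # rotate the centre into the word box's own frame to test containment
        dx, dy = xa - wb["cx"], ya - wb["cy"]
        c, s = np.cos(-wb["theta"]), np.sin(-wb["theta"])
        u, v = c * dx - s * dy, s * dx + c * dy
        inside = abs(u) <= wb["w"] / 2 and abs(v) <= wb["h"] / 2
        ratio = max(a_l / wb["h"], wb["h"] / a_l)   # height-compatibility test (assumed 1.5)
        if inside and ratio <= 1.5 and ratio < best_ratio:
            best, best_ratio = wb, ratio
    return (1, best) if best is not None else (0, None)
```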
(1.2.3) generating word segments on the labelled initial bounding boxes produced in step (1.2.2) and computing the offsets of the positive word segments: a negative word-segment bounding box s− is simply the negative initial bounding box B−; a positive word-segment bounding box s+ is obtained from the positive initial bounding box B+ by the following steps: a) let θs be the angle between the annotated word bounding box W matched to B+ and the horizontal direction, and rotate W clockwise by θs about the centre of B+; b) crop W, removing the parts extending beyond the left and right sides of B+; c) rotate the cropped word bounding box W′ counter-clockwise by θs about the centre of B+, obtaining the ground-truth geometric parameters xs, ys, ws, hs, θs of the word segment s+; d) compute the offsets (Δxs, Δys, Δws, Δhs, Δθs) of s+ relative to B+, the quantities being related as follows:
xs = al · Δxs + xa
ys = al · Δys + ya
ws = al · exp(Δws)
hs = al · exp(Δhs)
θs = Δθs
where xs, ys, ws, hs, θs are respectively the centre abscissa, centre ordinate, width, height and angle with the horizontal direction of the word-segment bounding box s+; xa, ya, wa, ha are respectively the centre abscissa, centre ordinate, width and height of the horizontal initial bounding box B+; and Δxs, Δys, Δws, Δhs, Δθs are respectively the offset of the centre abscissa xs of s+ relative to the initial bounding box B+, the offset of the ordinate ys relative to the initial bounding box, the offset of the width ws, the offset of the height hs and the offset of the angle θs;
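Inverting the relations above gives the encoding used to produce the regression targets; a small numpy sketch of both directions, with illustrative function names:

```python
# Sketch of the offset encoding/decoding in (1.2.3): the segment geometry
# (xs, ys, ws, hs, thetas) is expressed relative to the default box centre
# (xa, ya) and scale a_l, following the formulas above.
import numpy as np

def encode_offsets(seg, xa, ya, a_l):
    xs, ys, ws, hs, ts = seg
    return np.array([(xs - xa) / a_l,      # delta_x
                     (ys - ya) / a_l,      # delta_y
                     np.log(ws / a_l),     # delta_w
                     np.log(hs / a_l),     # delta_h
                     ts])                  # delta_theta

def decode_offsets(offsets, xa, ya, a_l):
    dx, dy, dw, dh, dt = offsets
    return (a_l * dx + xa,                 # xs
            a_l * dy + ya,                 # ys
            a_l * np.exp(dw),              # ws
            a_l * np.exp(dh),              # hs
            dt)                            # thetas
```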
(1.2.4) computing connection labels for the word-segment bounding boxes produced in step (1.2.3): since word segments s are generated on the initial bounding boxes B, the connection labels between segments are identical to the connection labels between their corresponding initial bounding boxes B; for the feature-map set Itroi′ = [Itroi1′, ..., Itroi6′], if within the initial bounding-box set Bil of the same feature map Itroil′ two initial bounding boxes are both labelled positive and match the same word, the within-layer connection between them is labelled positive, otherwise it is labelled negative; if an initial bounding box in the set Bil corresponding to the feature map Itroil′ and an initial bounding box in the set Bi(l−1) corresponding to Itroi(l−1)′ are both labelled positive and match the same word bounding box, the cross-layer connection between them is labelled positive, otherwise it is labelled negative;
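A hedged numpy sketch of this label derivation, assuming each detection layer's segment labels are stored as a map of matched word indices (−1 for negative cells); the array layout and function names are assumptions:

```python
# Sketch of (1.2.4): derive within-layer and cross-layer connection (link)
# labels from the per-cell segment labels.  labels_l is an (h_l, w_l) array of
# matched word indices, -1 for negative cells.
import numpy as np

def within_layer_link_labels(labels_l):
    h, w = labels_l.shape
    out = np.zeros((h, w, 8), dtype=np.int64)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(h):
        for x in range(w):
            for k, (dy, dx) in enumerate(offsets):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    same = labels_l[y, x] >= 0 and labels_l[y, x] == labels_l[ny, nx]
                    out[y, x, k] = 1 if same else 0
    return out

def cross_layer_link_labels(labels_l, labels_prev):
    """labels_prev: map of layer l-1, twice the resolution of layer l."""
    h, w = labels_l.shape
    out = np.zeros((h, w, 4), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            for k, (dy, dx) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
                ny, nx = 2 * y + dy, 2 * x + dx
                if ny < labels_prev.shape[0] and nx < labels_prev.shape[1]:
                    same = labels_l[y, x] >= 0 and labels_l[y, x] == labels_prev[ny, nx]
                    out[y, x, k] = 1 if same else 0
    return out
```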
(1.2.5) taking the scaled training image set Itr′ as the input of the word-segment detection model and outputting the predicted word segments s: the model weights and biases are initialized, the learning rate is set to 10^-3 for the first 60,000 training iterations and then decays to 10^-4; for the last 6 convolutional layers, at coordinate (x, y) of the l-th feature map Itroil′, which corresponds on the input image Itri′ to the initial bounding box Bilq centred at (xa, ya) with size al, the 3 × 3 convolutional predictor predicts the score cs of Bilq being of the positive or negative class, cs being a two-dimensional vector whose values lie between 0 and 1; it simultaneously predicts 5 numbers as the predicted geometric offsets when the box is classified as a positive word segment s+, namely the offsets of the centre abscissa, centre ordinate, width, height and angle of the predicted word-segment bounding box relative to the positive initial bounding box B+;
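For the learning-rate schedule just described, a small PyTorch sketch; the optimiser type and momentum value are assumptions, only the rate values and the 60,000-iteration switch point come from the claim:

```python
# Learning rate 1e-3 for the first 60,000 iterations, then decayed to 1e-4.
import torch

def make_optimizer_and_scheduler(model):
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    # MultiStepLR with gamma=0.1 drops 1e-3 -> 1e-4 at iteration 60,000
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60_000], gamma=0.1)
    return opt, sched

# usage: call sched.step() once per training iteration, after opt.step()
```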
(1.2.6) predicting within-layer and cross-layer connections on the basis of the predicted word segments: for within-layer connections, at a coordinate point (x, y) of the same feature map Itroil′, the neighbouring points (x′, y′) in the range x − 1 ≤ x′ ≤ x + 1, y − 1 ≤ y′ ≤ y + 1 are taken; when these 8 points are mapped to the input image Itri′, the 8 within-layer neighbour word segments s(x′, y′, l) connected to the reference word segment s(x, y, l) corresponding to (x, y) are obtained, and they can be represented as the set
{ s(x′, y′, l) : x − 1 ≤ x′ ≤ x + 1, y − 1 ≤ y′ ≤ y + 1, (x′, y′) ≠ (x, y) };
the 3 × 3 convolutional predictor predicts the positive/negative scores cl1 of the connections between s(x, y, l) and its within-layer neighbour set, where cl1 is a 16-dimensional vector (two scores for each of the 8 neighbours) and the subscript w denotes a within-layer connection;
for cross-layer connections, a cross-layer connection joins the word segments corresponding to two points on the feature maps output by two consecutive convolutional layers; since the width and height of the feature map are halved by every convolutional layer, the width wl and height hl of the l-th output feature map Itroil′ are half of the width wl−1 and height hl−1 of the (l − 1)-th feature map Itroi(l−1)′, and the initial bounding-box scale al corresponding to Itroil′ is twice the scale al−1 corresponding to Itroi(l−1)′; for a point (x, y) on the l-th output feature map Itroil′, the 4 cross-layer neighbour points (x′, y′) in the range 2x ≤ x′ ≤ 2x + 1, 2y ≤ y′ ≤ 2y + 1 are taken on the feature map Itroi(l−1)′; the initial bounding box to which (x, y) on Itroil′ corresponds on the input image Itri′ spatially overlaps the 4 initial bounding boxes to which the 4 cross-layer neighbour points correspond on Itri′, and the 4 cross-layer neighbour word segments s(x′, y′, l−1) can be represented as the set
{ s(x′, y′, l−1) : 2x ≤ x′ ≤ 2x + 1, 2y ≤ y′ ≤ 2y + 1 };
the 3 × 3 convolutional predictor predicts the positive/negative scores cl2 of the cross-layer connections between the reference word segment s(x, y, l) on layer l and its neighbour word-segment set on layer l − 1, where cl2 is an 8-dimensional vector (two scores for each of the 4 cross-layer neighbours);
that is, the predictor predicts the positive/negative scores of the connections between s(x, y, l) and all 4 of its cross-layer neighbour word segments, the subscript c indicating a cross-layer connection;
all within-layer connections and all cross-layer connections together constitute the connection set Ns;
(1.2.7) taking the word-segment labels, the true offsets of the positive word segments and the connection labels obtained in steps (1.2.3) and (1.2.4) as the reference output, and the word-segment classes and scores and word-segment offsets predicted in step (1.2.5) together with the connection scores predicted in step (1.2.6) as the predicted output, an objective loss function between the predicted output and the reference output is designed, and the word-segment/connection detection model is trained continually with back-propagation so as to minimize the word-segment classification loss, the word-segment offset regression loss and the connection classification loss; the objective loss function designed for the word-segment/connection detection model is a weighted sum of these three losses:
L(ys, cs, yl, cl, ŝ, s) = (1 / ns) · Lconf(ys, cs) + λ1 · (1 / ns) · Lloc(ŝ, s) + λ2 · (1 / nl) · Lconf(yl, cl);
where ys is the label vector of all word segments, cs is the predicted word-segment score, yl is the connection label and cl is the predicted connection score, composed of the within-layer connection score cl1 and the cross-layer score cl2; if the i-th initial bounding box is labelled positive then ys(i) = 1, otherwise 0; Lconf(ys, cs) is the softmax loss of the predicted word-segment scores cs, Lconf(yl, cl) is the softmax loss of the predicted connection scores cl, and Lloc(ŝ, s) is the smooth L1 regression loss between the predicted word-segment geometric parameters ŝ and the ground-truth parameters s; ns is the number of positive initial bounding boxes and is used to normalize the word-segment classification and regression losses; nl is the total number of positive connections and is used to normalize the connection classification loss; λ1 and λ2 are weighting constants;
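A hedged PyTorch sketch of this objective, assuming the predictions and labels have been flattened across all locations and layers; tensor shapes, default weights and the helper name are assumptions:

```python
# Weighted objective of (1.2.7): softmax classification losses for segments and
# connections plus a Smooth-L1 regression loss on the positive segments'
# offsets, normalised by n_s and n_l.
import torch
import torch.nn.functional as F

def segment_link_loss(seg_logits, seg_labels, offsets_pred, offsets_gt,
                      link_logits, link_labels, lambda1=1.0, lambda2=1.0):
    """seg_logits: (N, 2); seg_labels: (N,) long; offsets_*: (N, 5);
    link_logits: (M, 2); link_labels: (M,) long.  lambda values are illustrative."""
    pos = seg_labels == 1
    n_s = pos.sum().clamp(min=1).float()
    n_l = (link_labels == 1).sum().clamp(min=1).float()

    l_seg_cls = F.cross_entropy(seg_logits, seg_labels, reduction="sum") / n_s
    l_link_cls = F.cross_entropy(link_logits, link_labels, reduction="sum") / n_l
    l_reg = F.smooth_l1_loss(offsets_pred[pos], offsets_gt[pos], reduction="sum") / n_s

    return l_seg_cls + lambda1 * l_reg + lambda2 * l_link_cls
```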
(1.2.8) during the training of step (1.2.7), the training data Itr are augmented online, and positive and negative samples are balanced with an online hard negative mining strategy; before the training pictures Itr are scaled to the same size and loaded in batches, they are randomly cropped into image patches, each of which must have at least a minimum Jaccard overlap coefficient o with the ground-truth word-segment bounding boxes; for multi-oriented words, the data augmentation is carried out on the minimum enclosing rectangle of the multi-oriented word bounding box; the overlap coefficient o of each sample is chosen randomly from 0, 0.1, 0.3, 0.5, 0.7 and 0.9, and the size of each image patch is between 0.1 and 1 times the original image size; training images are not flipped horizontally; in addition, because negative samples of word segments and connections account for the great majority of the training samples, positive and negative samples are balanced with an online hard negative mining strategy, word segments and connections being mined separately and the ratio of negative to positive samples being kept no larger than 3 : 1.
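A small PyTorch sketch of the online hard negative mining described above; it keeps every positive sample and only the highest-loss negatives, capping negatives at three times the positives. The helper name and the way losses are passed in are assumptions, and the same helper would be applied to segments and to connections separately:

```python
# Online hard negative mining of (1.2.8): keep all positives plus the
# highest-loss negatives, with a negative:positive ratio of at most 3:1.
import torch

def hard_negative_mask(per_sample_loss, labels, neg_pos_ratio=3):
    """per_sample_loss: (N,) unreduced classification loss; labels: (N,) 0/1."""
    pos = labels == 1
    num_neg = int(neg_pos_ratio * max(int(pos.sum()), 1))
    num_neg = min(num_neg, int((labels == 0).sum()))
    neg_loss = per_sample_loss.clone()
    neg_loss[pos] = -1.0                  # exclude positives from the ranking
    _, idx = neg_loss.topk(num_neg)       # hardest negatives
    keep = pos.clone()
    keep[idx] = True
    return keep                           # boolean mask of samples used in the loss
```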
3. The multi-direction text detection method in natural images based on connecting word segments according to claims 1 and 2, characterized in that step (2.1) specifically comprises:
scaling the i-th text image Itsti of the image set Itst to be detected to a uniform size, the specific size being set manually according to the situation of the text images to be detected, the scaled text image being denoted Itsti′; feeding the image Itsti′ into the word-segment/connection detection model trained in step (1.2) to obtain the set Itstoi′ = [Itstoi1′, ..., Itstoi6′] formed by the feature maps output by the last 6 convolutional layers, where Itstoil′ is the feature map output by the l-th of the last 6 convolutional layers, l = 1, ..., 6; at every coordinate (x, y) of each output feature map Itstoil′, the 3 × 3 convolutional predictor predicts the score cs that the initial bounding box Bilq corresponding to (x, y) is a positive or negative word segment, and simultaneously predicts 5 numbers as the geometric offsets when it is predicted to be a positive word segment s+.
4. The multi-direction text detection method in natural images based on connecting word segments according to claims 1 and 2, characterized in that step (2.2) specifically comprises:
predicting within-layer and cross-layer connections on the basis of the word segments predicted in (2.1): for within-layer connections, at a coordinate point (x, y) of the same feature map Itstoil′, the 3 × 3 convolutional predictor predicts the positive/negative within-layer connection scores cl1 between s(x, y, l) and its 8 neighbour word segments; for cross-layer connections, the 3 × 3 convolutional predictor predicts the positive/negative cross-layer connection scores cl2 between the reference word segment s(x, y, l) on layer l and its 4 neighbour word segments on layer l − 1; cl1 and cl2 constitute the predicted connection score cl.
5. The multi-direction text detection method in natural images based on connecting word segments according to claims 1 and 2, characterized in that step (2.3) specifically comprises:
according to the results of steps (2.1) and (2.2), at each coordinate (x, y) of every feature map Itstoil′, concatenating the predicted word-segment score cs, the word-segment offsets, the within-layer connection score cl1 and the cross-layer connection score cl2 into one 33-dimensional vector, and appending an extra softmax layer after the output channels of the convolutional predictor to normalize the word-segment scores and the connection scores respectively.
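The following numpy sketch shows one way such a prediction map can be split and normalized; the channel layout used here (2 class scores + 5 offsets + 16 within-layer + 8 cross-layer connection scores) is an assumption for illustration and need not match the claimed vector length:

```python
# Split one predictor output into segment scores, offsets and connection
# scores, and softmax-normalize every score pair.
import numpy as np

def softmax_pairs(x):
    """Softmax over the last axis of (..., 2) score pairs."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def normalize_prediction(pred):
    """pred: (C, H, W) raw output of one 3x3 convolutional predictor."""
    pred = np.moveaxis(pred, 0, -1)                  # (H, W, C)
    seg_scores   = softmax_pairs(pred[..., 0:2])
    offsets      = pred[..., 2:7]
    within_links = softmax_pairs(pred[..., 7:23].reshape(*pred.shape[:2], 8, 2))
    cross_links  = softmax_pairs(pred[..., 23:31].reshape(*pred.shape[:2], 4, 2))
    return seg_scores, offsets, within_links, cross_links
```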
6. The multi-direction text detection method in natural images based on connecting word segments according to claims 1 and 2, characterized in that step (3.1) specifically comprises:
filtering, by their scores, the fixed number of word segments s and connections Ns produced when the text image to be detected is fed into the word-segment detection model in step (2); setting different filtering thresholds for the word segments s and the connections Ns, namely α and β respectively; and building a graph with the filtered word segments s′ as nodes and the filtered connections Ns′ as edges.
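A plain-Python sketch of this filtering and graph construction, together with the depth-first search of step (3.2); the thresholds, data layout and function names are assumptions:

```python
# Steps (3.1)-(3.2): keep segments with score >= alpha and connections with
# score >= beta, build a graph, and extract connected components by DFS.
def connected_segment_sets(seg_scores, link_scores, links, alpha=0.9, beta=0.7):
    """seg_scores: {seg_id: score}; links: list of (seg_id_a, seg_id_b);
    link_scores: list of scores aligned with `links`."""
    nodes = {i for i, s in seg_scores.items() if s >= alpha}
    adj = {i: [] for i in nodes}
    for (a, b), s in zip(links, link_scores):
        if s >= beta and a in nodes and b in nodes:
            adj[a].append(b)
            adj[b].append(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                      # iterative depth-first search
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        components.append(comp)           # each comp is one word/text-line set S
    return components
```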
7. The multi-direction text detection method in natural images based on connecting word segments according to claims 1 and 2, characterized in that step (3.3) specifically comprises: for a word-segment set S obtained by the depth-first search on the graph in step (3.2), combining its word segments into a complete word through the following steps:
(3.3.1) input: the set S, where |S| is the number of word segments in S, s(i) is the i-th word segment, i is its index, xs(i) and ys(i) are respectively the centre abscissa and ordinate of the word-segment bounding box s(i), ws(i) and hs(i) are respectively the width and height of the word-segment bounding box s(i), and θs(i) is the angle between the word-segment bounding box s(i) and the horizontal direction;
(3.3.2) θb := (1/|S|) · Σi θs(i), where θb is the orientation angle of the output bounding box and θs(i) is the orientation angle of the i-th word-segment bounding box in the set; θb is obtained as the mean orientation angle of all word segments in S;
(3.3.3) find the intercept b of the straight line y = tan(θb) · x + b such that the sum of the distances from the centre points of all word segments in S to the straight line is minimal;
(3.3.4) find the two endpoints (xp, yp) and (xq, yq) on the straight line, p denoting the first endpoint and q the second; xp, yp are respectively the abscissa and ordinate of the first endpoint, and xq, yq those of the second;
(3.3.5) xb := (xp + xq) / 2, yb := (yp + yq) / 2, where b denotes the output bounding box and xb, yb are respectively the abscissa and ordinate of its centre;
(3.3.6) wb := sqrt((xp − xq)^2 + (yp − yq)^2) + (wp + wq) / 2, where wb is the width of the output bounding box and wp, wq are respectively the width of the word-segment bounding box centred at point p and the width of the one centred at point q;
(3.3.7) hb := (1/|S|) · Σi hs(i), where hb is the height of the output bounding box and hs(i) is the height of the i-th word-segment bounding box in the set; hb is obtained as the mean height of all word segments in the word-segment set S;
(3.3.8) b := (xb, yb, wb, hb, θb); b is the output bounding box, represented by its centre coordinates, size parameters and angle parameter;
(3.3.9) output the combined bounding box b.
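A numpy sketch of steps (3.3.1) to (3.3.9); it fits the line of slope tan(θb) through the centroid of the segment centres as a simple stand-in for the distance-minimizing intercept of step (3.3.3), and is otherwise a direct transcription, with illustrative function and variable names:

```python
# Combine the segments of one connected set S into a single oriented box.
import numpy as np

def combine_segments(segments):
    """segments: list of (cx, cy, w, h, theta) tuples for one connected set S."""
    segs = np.asarray(segments, dtype=float)
    cx, cy, w, h, theta = segs.T

    theta_b = theta.mean()                                 # (3.3.2) average angle
    direction = np.array([np.cos(theta_b), np.sin(theta_b)])
    centroid = np.array([cx.mean(), cy.mean()])            # line of slope tan(theta_b)
                                                           # through the centroid (stand-in for 3.3.3)
    # (3.3.4) project centres onto the line and take the extreme projections
    t = (np.stack([cx, cy], axis=1) - centroid) @ direction
    p, q = int(t.argmin()), int(t.argmax())
    end_p = centroid + t[p] * direction
    end_q = centroid + t[q] * direction

    xb, yb = (end_p + end_q) / 2.0                          # (3.3.5) centre
    wb = np.linalg.norm(end_p - end_q) + (w[p] + w[q]) / 2  # (3.3.6) width
    hb = h.mean()                                           # (3.3.7) height
    return xb, yb, wb, hb, theta_b                          # (3.3.8)-(3.3.9)
```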
CN201710010596.7A 2017-01-06 2017-01-06 Multi-direction text detection method in natural images based on connecting word segments Active CN106897732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710010596.7A CN106897732B (en) 2017-01-06 2017-01-06 Multi-direction text detection method in natural images based on connecting word segments

Publications (2)

Publication Number Publication Date
CN106897732A true CN106897732A (en) 2017-06-27
CN106897732B CN106897732B (en) 2019-10-08

Family

ID=59197865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710010596.7A Active CN106897732B (en) Multi-direction text detection method in natural images based on connecting word segments

Country Status (1)

Country Link
CN (1) CN106897732B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050471A (en) * 2014-05-27 2014-09-17 华中科技大学 Natural scene character detection method and system
CN105184312A (en) * 2015-08-24 2015-12-23 中国科学院自动化研究所 Character detection method and device based on deep learning
CN105469047A (en) * 2015-11-23 2016-04-06 上海交通大学 Chinese detection method based on unsupervised learning and deep learning network and system thereof
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN105608456A (en) * 2015-12-22 2016-05-25 华中科技大学 Multi-directional text detection method based on full convolution network
WO2016124103A1 (en) * 2015-02-03 2016-08-11 阿里巴巴集团控股有限公司 Picture detection method and device
CN106156711A (en) * 2015-04-21 2016-11-23 华中科技大学 The localization method of line of text and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAO, Cong: "Research on text detection and recognition in natural images", China Doctoral Dissertations Full-text Database *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11030471B2 (en) 2017-09-25 2021-06-08 Tencent Technology (Shenzhen) Company Limited Text detection method, storage medium, and computer device
WO2019057169A1 (en) * 2017-09-25 2019-03-28 腾讯科技(深圳)有限公司 Text detection method, storage medium, and computer device
CN107766860A (en) * 2017-10-31 2018-03-06 武汉大学 Natural scene image Method for text detection based on concatenated convolutional neutral net
CN107977620A (en) * 2017-11-29 2018-05-01 华中科技大学 A kind of multi-direction scene text single detection method based on full convolutional network
CN107977620B (en) * 2017-11-29 2020-05-19 华中科技大学 Multi-direction scene text single detection method based on full convolution network
CN107844785B (en) * 2017-12-08 2019-09-24 浙江捷尚视觉科技股份有限公司 A kind of method for detecting human face based on size estimation
CN107844785A (en) * 2017-12-08 2018-03-27 浙江捷尚视觉科技股份有限公司 A kind of method for detecting human face based on size estimation
CN108304835A (en) * 2018-01-30 2018-07-20 百度在线网络技术(北京)有限公司 character detecting method and device
CN108427924A (en) * 2018-03-09 2018-08-21 华中科技大学 A kind of text recurrence detection method based on rotational sensitive feature
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 End-to-end identification method for scene text with any shape
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109086663B (en) * 2018-06-27 2021-11-05 大连理工大学 Natural scene text detection method based on scale self-adaption of convolutional neural network
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN109583367A (en) * 2018-11-28 2019-04-05 网易(杭州)网络有限公司 Image text row detection method and device, storage medium and electronic equipment
CN109685718A (en) * 2018-12-17 2019-04-26 中国科学院自动化研究所 Picture quadrate Zoom method, system and device
CN109886286A (en) * 2019-01-03 2019-06-14 武汉精测电子集团股份有限公司 Object detection method, target detection model and system based on cascade detectors
CN109886286B (en) * 2019-01-03 2021-07-23 武汉精测电子集团股份有限公司 Target detection method based on cascade detector, target detection model and system
CN109886264A (en) * 2019-01-08 2019-06-14 深圳禾思众成科技有限公司 A kind of character detecting method, equipment and computer readable storage medium
CN109977997A (en) * 2019-02-13 2019-07-05 中国科学院自动化研究所 Image object detection and dividing method based on convolutional neural networks fast robust
CN110032969A (en) * 2019-04-11 2019-07-19 北京百度网讯科技有限公司 For text filed method, apparatus, equipment and the medium in detection image
CN110032969B (en) * 2019-04-11 2021-11-05 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in image
US11537816B2 (en) * 2019-07-16 2022-12-27 Ancestry.Com Operations Inc. Extraction of genealogy data from obituaries
US20230109073A1 (en) * 2019-07-16 2023-04-06 Ancestry.Com Operations Inc. Extraction of genealogy data from obituaries
US20210019569A1 (en) * 2019-07-16 2021-01-21 Ancestry.Com Operations Inc. Extraction of genealogy data from obituaries
CN110490232A (en) * 2019-07-18 2019-11-22 北京捷通华声科技股份有限公司 Method, apparatus, the equipment, medium of training literal line direction prediction model
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111291759A (en) * 2020-01-17 2020-06-16 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN111444674A (en) * 2020-03-09 2020-07-24 稿定(厦门)科技有限公司 Character deformation method, medium and computer equipment
CN111444674B (en) * 2020-03-09 2022-07-01 稿定(厦门)科技有限公司 Character deformation method, medium and computer equipment
CN111967463A (en) * 2020-06-23 2020-11-20 南昌大学 Method for detecting curve fitting of curved text in natural scene
CN111914822A (en) * 2020-07-23 2020-11-10 腾讯科技(深圳)有限公司 Text image labeling method and device, computer readable storage medium and equipment
CN111914822B (en) * 2020-07-23 2023-11-17 腾讯科技(深圳)有限公司 Text image labeling method, device, computer readable storage medium and equipment
CN115620081A (en) * 2022-09-27 2023-01-17 北京百度网讯科技有限公司 Training method of target detection model, target detection method and device

Also Published As

Publication number Publication date
CN106897732B (en) 2019-10-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant