CN109002463A

CN109002463A - A text detection method based on a deep metric model

Info

Publication number: CN109002463A
Application number: CN201810568042.3A
Authority: CN
Inventors: 赵永彬; 刚毅凝; 李巍; 刘树吉; 陈硕; 熊先亮; 梁凯; 周杨浩; 杨育彬; 郝跃冬; 刘嘉华; 康睿
Original assignee: Nanjing University; Nari Information and Communication Technology Co; Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Current assignee: Nanjing University; State Grid Corp of China SGCC; Nari Information and Communication Technology Co; Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Priority date: 2018-06-05
Filing date: 2018-06-05
Publication date: 2018-12-14

Abstract

The invention discloses a text detection method based on a deep metric model, comprising: step 1, obtaining character-level candidate regions with the MSER detection algorithm; step 2, filtering the candidate regions with a classifier to remove non-character regions; step 3, clustering the remaining characters into text lines according to their geometric position; step 4, splitting each text line into individual words according to heuristic rules; step 5, constructing a word-level training set; step 6, training a deep metric learning model; step 7, classifying the text boxes with the deep metric model obtained in step 6 to obtain the final text regions.

Description

A text detection method based on a deep metric model

Technical field

The invention belongs to the field of computer vision, and in particular relates to a text detection method based on a deep metric model.

Background art

In a machine learning model, the loss function can usually be expressed as a loss term plus a regularization term. The loss term describes how well the model fits the training data; the regularization term constrains the model so that it fits the data without becoming overly complex, which prevents overfitting. Common loss functions in statistical learning include the 0-1 loss, the squared loss, the absolute loss and the logarithmic loss. Deep learning mainly uses the squared loss and the cross-entropy loss based on one-hot encoding. None of these existing loss functions takes the relationship between sample pairs into account; they are simply borrowed from statistical machine learning and do not make full use of other available discriminative information.

Summary of the invention

Object of the invention: in text detection, text-line classification is a typical binary classification problem. The invention introduces the idea of metric learning into deep learning, minimizing the distance between samples of the same class and maximizing the distance between samples of different classes, so that the class boundary becomes more distinct and the discriminative power of the model improves.

To address the shortcomings of current approaches to this binary classification problem, the invention provides a processing method that introduces a deep metric learning model.

The method comprises the following steps:

Step 1: run the MSER (Maximally Stable Extremal Regions) detection algorithm on the input image to obtain character-level candidate regions;

Step 2: construct a character-level training dataset. The training data come mainly from the scene-text datasets ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the text inside each character region is cropped as the positive class; among the candidate regions obtained in step 1, those that do not overlap the positive class are chosen as the negative class. The positive and negative classes together form the character-level training dataset, which is used as input to train a deep neural network. The trained network serves as a classifier (it judges whether a region contains a character) that classifies the candidate characters in the candidate regions and filters out the negative class;

Step 3: take the center point of each candidate region and set a small threshold (typically 5 pixels) on the abscissa of the center points; candidate character regions that fall within this threshold are grouped, along the horizontal direction, into the same text-line region;

Step 4: compute the average distance between characters in each text-line region obtained in step 3. Two characters separated by more than twice the average distance are split into two different words; conversely, two characters separated by less than twice the average distance are assigned to the same word. This yields word-level candidate regions;

Step 5: in the word-level candidate regions obtained in step 4, every character belongs to one word and every word consists of at least one character. All the words thus constructed form the word-level dataset. According to the word-level text annotations (that is, the annotations of the characters contained in the word-level dataset), the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class are taken as the negative class;

Step 6: build a deep metric model and train it with the positive and negative classes obtained in step 5 as input; the trained model can be used for word-level classification;

Step 7: filter the image under test with the deep metric model obtained in step 6 to obtain the final text regions.
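Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: boxes are `(x, y, w, h)` tuples, the "distance between characters" is taken to be the horizontal gap between neighboring boxes, and the function names are hypothetical. The text speaks of the abscissa of each center point; the sketch exposes the coordinate as the `axis` parameter and defaults to the vertical center, which is an assumption about what grouping along the horizontal direction requires.

```python
from statistics import mean

# A box is (x, y, w, h) in pixels; this layout is an assumption of the sketch.

def center(box, axis):
    """Center of a box along axis 0 (x) or axis 1 (y)."""
    return box[axis] + box[axis + 2] / 2.0

def group_into_lines(boxes, threshold=5.0, axis=1):
    """Step 3: chain boxes whose centers differ by at most `threshold`."""
    lines = []
    for box in sorted(boxes, key=lambda b: center(b, axis)):
        if lines and abs(center(box, axis) - center(lines[-1][-1], axis)) <= threshold:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Order the characters of each line from left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]

def split_into_words(line, factor=2.0):
    """Step 4: cut the line wherever a gap exceeds `factor` times the mean gap."""
    if len(line) < 2:
        return [list(line)]
    gaps = [nxt[0] - (cur[0] + cur[2]) for cur, nxt in zip(line, line[1:])]
    avg = mean(gaps)
    words = [[line[0]]]
    for gap, box in zip(gaps, line[1:]):
        if gap > factor * avg:
            words.append([box])      # gap too large: start a new word
        else:
            words[-1].append(box)    # otherwise extend the current word
    return words
```

With six characters on one line and a wide gap in the middle, `split_into_words` returns two words of three characters each; a line with only one or two similar gaps is left unsplit.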

In step 1, when the MSER algorithm is used, the MSER threshold is set to its minimum value of 1, and to detect the text regions of an image the MSER algorithm is applied to four channels: H (hue), L (lightness), S (saturation) and gray.
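The four-channel decomposition can be illustrated with the Python standard library alone. The sketch below is an assumption-laden illustration, not the patented implementation: pixels are plain `(r, g, b)` tuples, the gray channel uses the common BT.601 luma weights (the text does not say which conversion is used), and `detect` is a hypothetical stand-in for an MSER detector supplied by the caller.

```python
import colorsys

def four_channels(rgb_pixels):
    """Split a list of (r, g, b) pixels (0..255) into H, L, S and gray lists."""
    hue, light, sat, gray = [], [], [], []
    for r, g, b in rgb_pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        hue.append(round(h * 255))
        light.append(round(l * 255))
        sat.append(round(s * 255))
        # BT.601 luma weights: an assumed, commonly used grayscale conversion.
        gray.append(round(0.299 * r + 0.587 * g + 0.114 * b))
    return {"H": hue, "L": light, "S": sat, "gray": gray}

def detect_on_all_channels(channels, detect):
    """Run a region detector on every channel and pool the candidates."""
    regions = []
    for name, values in channels.items():
        # `detect` is hypothetical; it stands in for an MSER implementation.
        regions.extend((name, region) for region in detect(values))
    return regions
```

Pooling the detections from all four channels reflects the stated goal of collecting as many candidate regions as possible before filtering.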

In step 2, the dataset is constructed by the practitioner. When constructing it, the similarity between the constructed dataset and the images to be detected should be taken into account; in general, the higher the similarity the better.

In step 5, the negative class of text lines is removed. The word-level dataset is constructed in a way similar to the training set of step 2: according to the word-level annotations, the corresponding regions are cropped as the positive class, and the regions that do not overlap the positive class are taken as the negative class.
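The positive/negative construction shared by steps 2 and 5 can be sketched as follows. This is a minimal sketch assuming `(x, y, w, h)` boxes and a strict "no overlap at all" criterion for negatives; the function names are hypothetical, and the text does not specify how partial overlaps are handled beyond excluding them from the negative class.

```python
def overlap_area(a, b):
    """Intersection area of two (x, y, w, h) boxes; 0 if they do not overlap."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def build_training_set(annotated, candidates):
    """Positives are the annotated regions; negatives are the candidates
    that overlap none of them."""
    positives = list(annotated)
    negatives = [
        c for c in candidates
        if all(overlap_area(c, p) == 0 for p in positives)
    ]
    return positives, negatives
```

Candidates that partially overlap an annotated region end up in neither class here, which keeps ambiguous crops out of the training data.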

Step 6 includes:

Step 6-1: an image $x$ in the word-level dataset obtained in step 5 is mapped by $f$ into d-dimensional Euclidean space, so that:

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \mathrm{margin} < \|f(x_i^a) - f(x_i^n)\|_2^2 \qquad (1)$$

In formula (1), $(x_i^a, x_i^p, x_i^n)$ is a triple; $x_i^a$ and $x_i^p$ are samples of the same class (positive or negative) in the word-level dataset constructed in step 5; $x_i^n$ is a sample whose class differs from that of $x_i^a$; $f(\cdot)$ denotes the deep metric model; and margin is the parameter separating the sample pair $(x_i^a, x_i^p)$ from the sample pair $(x_i^a, x_i^n)$;

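Step 6-2: design the following loss function:

$$L = \sum_{i=1}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \mathrm{margin} \right]_+$$

The gradients are derived as follows, for each triple whose bracketed term is positive (otherwise the gradient is zero):

$$\frac{\partial L}{\partial f(x_i^a)} = 2\left(f(x_i^a) - f(x_i^p)\right) - 2\left(f(x_i^a) - f(x_i^n)\right) = 2\left(f(x_i^n) - f(x_i^p)\right)$$

$$\frac{\partial L}{\partial f(x_i^p)} = -2\left(f(x_i^a) - f(x_i^p)\right) = 2\left(f(x_i^p) - f(x_i^a)\right)$$

$$\frac{\partial L}{\partial f(x_i^n)} = 2\left(f(x_i^a) - f(x_i^n)\right)$$

where $N$ denotes the number of samples, $f(x_i^a)$ denotes the feature extracted by the deep metric model from the i-th anchor sample, $f(x_i^p)$ denotes the feature extracted from a sample of the same class as the i-th anchor, and $f(x_i^n)$ denotes the feature extracted from a sample whose class differs from that of the i-th anchor;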

Step 6-3: train the deep metric model with the loss function. The network of the deep metric model comprises two convolutional layers, two pooling layers and two fully connected layers. All images in the word-level dataset obtained in step 5 are first normalized to 32 × 32. The first convolutional layer has 6 convolution kernels of size 5 × 5; the second has 12 convolution kernels of size 5 × 5; the kernel parameters are initialized randomly. The first convolutional layer outputs 6 feature maps of size 28 × 28. The pooling layers have size 2 × 2 and use max pooling; after the first pooling the feature maps are 14 × 14, and after the second convolution they are 10 × 10. The two fully connected layers have 150 and 50 units respectively. An L2 normalization layer is added after the convolutional part so that the resulting features are normalized. After passing through these layers, every image in the word-level dataset obtained in step 5 becomes an effective feature representation. Finally, a Triplet loss layer is introduced for training; the loss function proposed in step 6-2 is the triplet loss.
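The loss of step 6-2 and its gradients can be checked numerically. The sketch below assumes the standard triplet-loss form, a hinge on the squared distance to the positive minus the squared distance to the negative plus the margin, and works on plain Python lists; the function names are illustrative and the code is not taken from the patent.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(fa, fp, fn, margin):
    """Hinge on d(a, p) - d(a, n) + margin for one triple of features."""
    return max(sq_dist(fa, fp) - sq_dist(fa, fn) + margin, 0.0)

def triplet_grads(fa, fp, fn, margin):
    """Analytic gradients with respect to the anchor, positive and negative."""
    if sq_dist(fa, fp) - sq_dist(fa, fn) + margin <= 0:
        # The constraint is already met: the triple contributes no gradient.
        return [0.0] * len(fa), [0.0] * len(fp), [0.0] * len(fn)
    grad_a = [2 * (n - p) for p, n in zip(fp, fn)]
    grad_p = [2 * (p - a) for a, p in zip(fa, fp)]
    grad_n = [2 * (a - n) for a, n in zip(fa, fn)]
    return grad_a, grad_p, grad_n

def numeric_grad(func, vec, eps=1e-6):
    """Central-difference gradient of `func` at `vec`, for comparison."""
    grad = []
    for i in range(len(vec)):
        up = vec[:i] + [vec[i] + eps] + vec[i + 1:]
        down = vec[:i] + [vec[i] - eps] + vec[i + 1:]
        grad.append((func(up) - func(down)) / (2 * eps))
    return grad
```

Comparing `triplet_grads` with `numeric_grad` on a triple that violates the margin is a quick sanity check before wiring such a loss into a training loop.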

Step 7 includes: for an image under test, character-level candidate regions are detected with the method of step 1; the deep neural network of step 2 removes the negative samples among the candidate regions; the methods of steps 3 and 4 construct candidate word-level regions; and the deep metric model of step 6 then classifies each word-level region and filters out the negative class, which gives the final text regions.
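The order of operations in step 7 can be written as a small composition of stages. In this sketch every callable (`detect_chars`, `is_char`, `group_lines`, `split_words`, `is_text`) is a hypothetical placeholder supplied by the caller; none of them is an API defined in this document.

```python
def detect_text(image, detect_chars, is_char, group_lines, split_words, is_text):
    """Chain the stages of step 7; all callables are caller-supplied stand-ins."""
    candidates = detect_chars(image)                        # step 1: MSER candidates
    chars = [c for c in candidates if is_char(image, c)]    # step 2: character filter
    lines = group_lines(chars)                              # step 3: text lines
    words = [w for line in lines for w in split_words(line)]  # step 4: words
    return [w for w in words if is_text(image, w)]          # step 6 model as filter
```

Keeping the stages as parameters makes it possible to test the control flow with simple stubs before any trained model exists.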

In step 6, the key question for metric learning is how to measure the distance between images. The idea of the invention is to minimize the intra-class distance and maximize the inter-class distance so that the class boundary becomes more distinct. To this end, the Triplet loss is chosen, and the idea is realized by building triples. An image $x$ is mapped into d-dimensional Euclidean space so as to guarantee that $x_i^a$ (the anchor) is closer to $x_i^p$ (the positive sample) of the same class and farther from $x_i^n$ (the negative sample) of a different class. Therefore:

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \mathrm{margin} < \|f(x_i^a) - f(x_i^n)\|_2^2$$

During training, the smaller the loss becomes over the iterations the better; that is, the anchor should be as close as possible to its positive sample and as far as possible from its negative sample. Regarding the value of the margin:

(1) When the margin is small, the loss easily tends to 0. The anchor need not be pulled very close to its positive sample, nor pushed very far from its negative sample, for the loss to approach 0 quickly. A model trained this way often cannot distinguish similar images well.

(2) When the margin is large, the network parameters must work much harder to pull the anchor closer to its positive sample and push it farther from its negative sample. In particular, when the margin is set too large, the loss tends to stay at a very large value.

Setting a reasonable margin is therefore crucial; it is an important indicator of the similarity between samples. A trade-off must be made between distinguishing similar images and the size of the gap between images of different classes. The reasoning above offers some guidance for setting the margin but cannot provide detailed rules directly. In the experiments, a suitable value was chosen by repeated adjustment based on the results of many trials.
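The effect of the margin described in points (1) and (2) can be made concrete by counting how many triples already satisfy the constraint "squared distance to the positive plus margin is less than squared distance to the negative". The sketch below uses made-up two-dimensional embeddings purely for illustration; the function name and the data are assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def satisfied_fraction(triples, margin):
    """Share of (anchor, positive, negative) triples that meet the margin
    constraint and therefore contribute zero loss."""
    ok = sum(
        1 for a, p, n in triples
        if sq_dist(a, p) + margin < sq_dist(a, n)
    )
    return ok / len(triples)

# Toy embeddings, invented for this example only.
toy_triples = [
    ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]),
    ([0.0, 0.0], [0.5, 0.0], [1.0, 0.0]),
    ([0.0, 0.0], [0.2, 0.0], [3.0, 0.0]),
]
```

On these toy triples a margin of 0.1 is satisfied by every triple, so the loss is already zero and nothing more is learned, whereas a margin of 10 is satisfied by none, so every triple keeps producing loss.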

The overall network structure exploits the feature-extraction ability of convolutional layers and the ability of pooling layers to select features and reduce the number of parameters, so it is comparatively complete.
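The feature-map sizes quoted for the network (32 to 28 to 14 to 10) follow from simple arithmetic, which the sketch below reproduces. It assumes stride-1 convolutions without padding and non-overlapping pooling, which is consistent with the stated sizes; the layer names `P1` and `P2`, the 5 × 5 size after the second pooling and the flattened length are derived here and are not stated in the text.

```python
def conv_out(size, kernel):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_out(size, window):
    """Output width of non-overlapping pooling."""
    return size // window

def trace_shapes(input_size=32, kernel=5, pool=2, maps=(6, 12)):
    """Return (layer, feature maps, width) per layer and the flattened length."""
    shapes = []
    size = input_size
    for index, count in enumerate(maps, start=1):
        size = conv_out(size, kernel)
        shapes.append((f"C{index}", count, size))
        size = pool_out(size, pool)
        shapes.append((f"P{index}", count, size))
    flattened = maps[-1] * size * size
    return shapes, flattened
```

Under these assumptions the input to the first fully connected layer would have 12 × 5 × 5 = 300 values, which the two fully connected layers then reduce to 150 and 50.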

In mathematics, a metric, or distance function, defines the distance between the elements of a set. A set equipped with a metric is called a metric space. Metric learning is what is often called similarity-based feature learning; the purpose of learning a distance metric is to measure the degree of similarity between samples, which is one of the core problems of pattern recognition. To compute the similarity between two images, metric learning must consider how to measure it so that images of different classes have low similarity and images of the same class have high similarity. If the target is a face, a suitable distance function must be constructed to quantify facial features such as hair color and face shape; if the target is pose recognition, a distance function that can measure pose similarity must be constructed. Features are diverse; to model these similarities one can, for a specific task, select suitable features and choose a distance function by hand. This approach, however, may demand a great deal of manual time and effort and may not be robust to changes in the data. Metric learning offers an alternative: the distance metric function is learned automatically for the particular task.

The invention introduces the idea of metric learning into deep learning and applies the resulting model to text detection, addressing the classification of positive and negative classes in the text detection process.

Beneficial effects: the invention solves a problem in the prior art by combining metric learning with deep learning and improving the loss function used in deep learning. The original Softmax function is replaced by the Triplet loss, and the learned features are compared using Euclidean distance. The distance between samples of the same class is minimized and the distance between samples of different classes is maximized, so that the difference between classes is larger and the boundary between them is clearer. This improvement has a noticeable effect on the binary classification problem of text-line detection and classification.

Brief description of the drawings

The invention is further described below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become clearer.

Fig. 1 illustrates the idea of deep metric learning.

Fig. 2 illustrates a discrimination method that uses metric learning.

Fig. 3 is a diagram of the Triplet network model.

Fig. 4 is the parameter table of the model.

Fig. 5 shows the detection results obtained by the model.

Fig. 6 is the flow chart of the entire method.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and embodiments.

The invention is suited to text-line detection in images, and in particular to the binary classification of candidate text boxes. It proposes a new method for text detection and classification. 1) For the detection and filtering of candidate regions, the MSER detection algorithm is used to obtain character-level candidate regions, and a character-level training dataset is constructed. The training set comes mainly from the scene-text datasets ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the corresponding text is cropped as the positive class; among the candidate regions produced by the detection algorithm, those that do not overlap the positive class are chosen as the negative class. These are used as input to train a deep learning network, and the resulting classifier classifies the candidate characters and filters out the negative class. 2) The candidate characters are clustered into text lines with a seed-growing algorithm: the center point of each candidate region is taken, a small threshold is set on the abscissa of the center points, and the candidate character regions within this threshold are grouped, along the horizontal direction, into the same text-line region. The average distance between characters is computed, and two characters whose separation is significantly greater than the average distance are split into two different words. 3) The text lines are classified with the deep metric model, which minimizes the distance within a class and maximizes the difference between classes. The invention comprises steps 1 to 7 as set out above, including the word splitting of step 4 and the loss function designed in step 6-2.

Embodiment:

Using the above scheme, the invention was applied to detecting text regions on ICDAR 2011.

The implementation is as follows. The three ICDAR datasets (ICDAR 2003, ICDAR 2011 and ICDAR 2013) are standard datasets for text detection. Candidate characters are first detected with the detection algorithm. After filtering removes the non-character candidates, the characters are clustered into text lines according to geometric information. Each text line is then classified by the trained deep metric learning classifier, non-text regions are removed, and the final detection result is obtained.

Step 1: run the MSER detection algorithm separately on the H, L, S and gray channels to obtain as many candidate regions as possible.

Step 2: construct the character-level training dataset. The training set comes mainly from ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the corresponding text is cropped as the positive class; among the candidate regions produced by the detection algorithm, those that do not overlap the positive class are chosen as the negative class. These are used as input to train a deep learning network, and the resulting classifier classifies the candidate characters and filters out the negative class;

Step 3: cluster the candidate characters into text lines with a seed-growing algorithm. The center point of each candidate region is taken and a small threshold is set on the abscissa of the center points. Candidate character regions within this threshold are grouped, along the horizontal direction, into the same text-line region. The average distance between characters is computed, and two characters whose separation is significantly greater than the average distance are split into two different words;

Step 4: refine the horizontal text boxes obtained in step 3. The average distance between characters is computed, and two characters whose separation is significantly greater than the average distance are split into two different words;

Step 5: construct the word-level dataset. The training set is constructed in a way similar to step 2: according to the word-level annotations, the corresponding regions are cropped as the positive class, and the regions that do not overlap the positive class are taken as the negative class. These are used to train the deep metric model.

Step 6: build and train the deep metric model. The training objective of the deep metric model is shown in Fig. 1 and Fig. 2. In Fig. 1, anchor denotes the anchor image (carrying a text label), positive denotes an image of the same class as the anchor (itself containing text), and negative denotes an image whose class differs from that of the anchor. In Fig. 2, the input is a pair of images; passing them through the neural network (w denotes its parameters) gives the hidden representation h, Distance Metric denotes the predefined distance metric, and the final decision is whether the two images are the same (Same or different). The goal for a triple is that the degree of closeness among its three members is preserved after the deep metric model (similar samples are closer than dissimilar ones). Fig. 3 shows how the deep metric model extracts features from an image, that is, the training process of the deep metric model: on the left is the input image containing text, which passes through a neural network to obtain a feature representation (Feature Representation), and finally the loss value is computed with the Triplet Loss defined in step 6. Fig. 4 gives the parameters of the network structure, that is, the details of the multilayer neural network: Input image size is the input image size (32*32), Kernel size is the convolution kernel size (5*5), Pooling size is the downsampling ratio, C1 and C2 are the two convolutional layers, and F1 and F2 are the two fully connected layers.

Step 7: for an image under test, character-level candidate regions are detected with the method of step 1; the deep neural network of step 2 removes the negative samples among the candidate regions; the methods of steps 3 and 4 construct candidate word-level regions; and the deep metric model of step 6 then classifies each word-level region and filters out the negative class, which gives the final text regions. Fig. 5 shows the detection results obtained, and the flow chart of the invention is shown in Fig. 6.
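Training with a triplet loss requires turning the positive and negative word regions of step 5 into (anchor, positive, negative) triples. The text does not describe how triples are sampled, so the following is only one possible sketch: it draws the anchor class at random, takes two distinct samples from that class and one from the other. The function name, the uniform sampling and the fixed seed are all assumptions.

```python
import random

def make_triples(positives, negatives, count, seed=0):
    """Sample `count` triples; each class needs at least two samples."""
    rng = random.Random(seed)
    triples = []
    for _ in range(count):
        # The anchor may come from either class, as the anchor and its
        # positive only need to share a class.
        if rng.random() < 0.5:
            same, other = positives, negatives
        else:
            same, other = negatives, positives
        anchor, positive = rng.sample(same, 2)   # two distinct same-class samples
        negative = rng.choice(other)             # one sample of the other class
        triples.append((anchor, positive, negative))
    return triples
```

Uniform sampling is the simplest choice; selecting harder triples is a common refinement in metric learning but is not mentioned in the text.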

The invention provides a text detection method based on a deep metric model. There are many ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and these improvements and refinements also fall within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.

Claims

1. A text detection method based on a deep metric model, characterized by comprising the following steps:

Step 1: detect the input image with the MSER detection algorithm to obtain character-level candidate regions;

Step 2: construct a character-level training dataset: according to the annotated character regions, crop the text inside each character region as the positive class; among the candidate regions obtained in step 1, choose those that do not overlap the positive class as the negative class; the positive and negative classes form the character-level training dataset, which is used as input to train a deep neural network; the trained deep neural network is used as a classifier to classify the candidate characters in the candidate regions and filter out the negative class;

Step 3: take the center point of each candidate region and set a small threshold on the abscissa of the center points; candidate character regions within this threshold are grouped, along the horizontal direction, into the same text-line region;

Step 4: compute the average distance between characters in each text-line region obtained in step 3; two characters separated by more than twice the average distance are split into two different words; conversely, two characters separated by less than twice the average distance are assigned to the same word, which yields word-level candidate regions;

Step 5: in the word-level candidate regions obtained in step 4, every character belongs to one word and every word consists of at least one character; all the words thus constructed form the word-level dataset; according to the word-level text annotations, the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class are taken as the negative class;

Step 6: build a deep metric model and train it with the positive and negative classes obtained in step 5 as input; the trained model can be used for word-level classification;

Step 7: filter the image under test with the deep metric model obtained in step 6 to obtain the final text regions.

2. The method according to claim 1, characterized in that, in step 1, when the MSER algorithm is used, the MSER threshold is set to its minimum value of 1, and to detect the text regions of an image the MSER algorithm is applied to four channels: H, L, S and gray.

3. The method according to claim 2, characterized in that step 6 includes:

Step 6-1: an image $x$ in the word-level dataset obtained in step 5 is mapped by $f$ into d-dimensional Euclidean space, so that:

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \mathrm{margin} < \|f(x_i^a) - f(x_i^n)\|_2^2 \qquad (1)$$

In formula (1), $(x_i^a, x_i^p, x_i^n)$ is a triple; $x_i^a$ and $x_i^p$ are samples of the same class in the word-level dataset constructed in step 5; $x_i^n$ is a sample whose class differs from that of $x_i^a$; $f(\cdot)$ denotes the deep metric model; and margin is the parameter separating the sample pair $(x_i^a, x_i^p)$ from the sample pair $(x_i^a, x_i^n)$;

Step 6-2: design the following loss function L:

$$L = \sum_{i=1}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \mathrm{margin} \right]_+$$

The gradients are derived as follows, for each triple whose bracketed term is positive (otherwise the gradient is zero):

$$\frac{\partial L}{\partial f(x_i^a)} = 2\left(f(x_i^n) - f(x_i^p)\right), \qquad \frac{\partial L}{\partial f(x_i^p)} = 2\left(f(x_i^p) - f(x_i^a)\right), \qquad \frac{\partial L}{\partial f(x_i^n)} = 2\left(f(x_i^a) - f(x_i^n)\right)$$

where $N$ denotes the number of samples, $f(x_i^a)$ denotes the feature extracted by the deep metric model from the i-th anchor sample, $f(x_i^p)$ denotes the feature extracted from a sample of the same class as the i-th anchor, and $f(x_i^n)$ denotes the feature extracted from a sample whose class differs from that of the i-th anchor;

Step 6-3: train the deep metric model with the loss function; the network of the deep metric model comprises two convolutional layers, two pooling layers and two fully connected layers; all images in the word-level dataset obtained in step 5 are first normalized to 32 × 32; the first convolutional layer has 6 convolution kernels of size 5 × 5; the second convolutional layer has 12 convolution kernels of size 5 × 5; the kernel parameters are initialized randomly; the first convolutional layer outputs 6 feature maps of size 28 × 28; the pooling layers have size 2 × 2 and use max pooling; after the first pooling the feature maps are 14 × 14; after the second convolution the feature maps are 10 × 10; the two fully connected layers have 150 and 50 units respectively; an L2 normalization layer is added after the convolutional part so that the resulting features are normalized; after passing through these layers, every image in the word-level dataset obtained in step 5 becomes an effective feature representation; finally, a Triplet loss layer is introduced for training, the loss function proposed in step 6-2 being the triplet loss.

4. The method according to claim 3, characterized in that step 7 includes: for an image under test, character-level candidate regions are detected with the method of step 1; the deep neural network of step 2 removes the negative samples among the candidate regions; the methods of steps 3 and 4 construct candidate word-level regions; and the deep metric model of step 6 then classifies each word-level region and filters out the negative class, which gives the final text regions.