A text detection method based on a deep metric model
Technical field
The invention belongs to the field of computer vision, and in particular relates to a text detection method based on a deep metric model.
Background art
In a machine learning model, the loss function can usually be expressed as a loss term plus a regularization term. The loss term describes how well the model fits the training data; the regularization term constrains the model so that it fits the data without becoming overly complex, which prevents overfitting. Loss functions common in statistical learning include the 0-1 loss, the squared loss, the absolute loss and the logarithmic loss. Deep learning mainly uses the squared loss and the cross-entropy loss based on one-hot encoding. None of these existing loss functions considers the relationship between sample pairs; they are simply borrowed from statistical machine learning and do not make full use of other available discriminative information.
Summary of the invention
Object of the invention: text-line classification in text detection is a typical binary classification problem. The invention introduces the idea of metric learning into deep learning, minimizing the distance between samples of the same class and maximizing the distance between samples of different classes, so that the class boundary becomes more distinct and the discriminative power of the model improves.
To address the shortcomings of current approaches to binary classification, the invention provides a processing method that introduces a deep metric learning model.
The method specifically includes the following steps:
Step 1, apply the MSER (Maximally Stable Extremal Regions) detection algorithm to the input image to obtain character-level candidate regions;
Step 2, construct a character-level training dataset. The training data of the invention comes mainly from the scene text datasets ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the text inside each annotated character region is cropped as the positive class; among the candidate regions obtained in step 1, those that do not overlap the positive class are chosen as the negative class. The positive and negative classes form the character-level training dataset, which is used as input to train a deep neural network. The trained deep neural network serves as a classifier (it judges whether a character region contains a character) that classifies the candidate characters in the candidate regions and filters out the negative class;
Step 3, take the center point of each candidate region and set a small threshold (typically 5 pixels) on the abscissa of each center point; candidate character regions that fall within this threshold are all assigned, along the horizontal direction, to the same text-line region;
Step 4, compute the average distance between characters in each text-line region obtained in step 3. Two characters whose distance is greater than twice the average distance are split into two different words; conversely, two characters whose distance is less than twice the average distance are assigned to the same word. This yields word-level candidate regions;
Step 5, in the word-level candidate regions obtained in step 4, each character belongs to one word and each word consists of at least one character. All words so constructed form the word-level dataset. According to the word-level text annotation (that is, the text annotation of the characters contained in the word-level dataset), the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class are taken as the negative class;
Step 6, using the positive and negative classes obtained in step 5 as input, build and train a deep metric model, which can then be used for word-level classification;
Step 7, use the deep metric model obtained in step 6 to filter the image under test and obtain the final text regions.
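Steps 3 and 4 above can be illustrated with a minimal sketch. The function names, the input format (a list of (x, y) center points) and the use of horizontal center-to-center gaps as the "distance between characters" are assumptions made for illustration. The text places the threshold on the abscissa; because characters on one horizontal line share a similar vertical position, this sketch compares the vertical coordinate by default and exposes the compared axis as a parameter.

```python
from statistics import mean


def group_into_lines(centers, threshold=5.0, axis=1):
    """Group candidate-character centers into text lines (step 3).

    centers: list of (x, y) center points of candidate regions.
    axis: index of the coordinate compared against the threshold.
          The default (1, vertical) is an assumption for horizontal text.
    """
    lines = []
    for c in sorted(centers, key=lambda p: p[axis]):
        # Join the current line if this center lies within the threshold
        # of the last center added to it; otherwise start a new line.
        if lines and abs(c[axis] - lines[-1][-1][axis]) <= threshold:
            lines[-1].append(c)
        else:
            lines.append([c])
    # Order the characters of each line along the other axis.
    return [sorted(line, key=lambda p: p[1 - axis]) for line in lines]


def split_into_words(line, factor=2.0):
    """Split one text line into words (step 4).

    line: centers of one line, sorted left to right.
    A new word starts where the gap to the previous character exceeds
    `factor` times the average gap of the line.
    """
    if len(line) < 2:
        return [list(line)]
    gaps = [b[0] - a[0] for a, b in zip(line, line[1:])]
    average_gap = mean(gaps)
    words = [[line[0]]]
    for gap, center in zip(gaps, line[1:]):
        if gap > factor * average_gap:
            words.append([center])
        else:
            words[-1].append(center)
    return words


if __name__ == "__main__":
    demo = [(10, 20), (20, 21), (30, 19), (90, 20), (100, 22), (12, 60), (22, 61)]
    for text_line in group_into_lines(demo):
        print(split_into_words(text_line))
```

Note that a line with only two characters is never split by this rule, because its single gap equals the average gap.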
In step 1, when the MSER algorithm is used, the MSER threshold is set to its minimum value of 1. When detecting the text regions in an image, the MSER algorithm is applied on four channels: H (hue), L (lightness), S (saturation) and grayscale.
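The four-channel detection described above can be sketched as follows. This is only an illustration: the image format (rows of (r, g, b) tuples in 0..255), the grayscale weights and the `detector` argument are assumptions. `detector` stands in for any MSER implementation that takes a single-channel plane and returns a list of regions; no particular library API is implied.

```python
import colorsys


def to_channels(rgb_image):
    """Split an RGB image into H, L, S and grayscale planes scaled to 0..255."""
    planes = {"H": [], "L": [], "S": [], "gray": []}
    for row in rgb_image:
        hls = [colorsys.rgb_to_hls(r / 255, g / 255, b / 255) for r, g, b in row]
        planes["H"].append([round(h * 255) for h, _, _ in hls])
        planes["L"].append([round(l * 255) for _, l, _ in hls])
        planes["S"].append([round(s * 255) for _, _, s in hls])
        # Assumed luma weights; the text does not say how gray is computed.
        planes["gray"].append(
            [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        )
    return planes


def detect_on_all_channels(rgb_image, detector):
    """Run a hypothetical single-channel detector on each plane and pool the results."""
    regions = []
    for name, plane in to_channels(rgb_image).items():
        regions.extend((name, region) for region in detector(plane))
    return regions
```

Pooling the regions found on all four planes gives more candidates than a single grayscale pass, at the cost of more false positives for the later filtering steps to remove.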
In step 2, the dataset is constructed by the user. When constructing it, the similarity between the constructed dataset and the images to be detected should be taken into account. In general, the higher the similarity, the better.
In step 5, the negative class of text lines is removed. The word-level dataset is constructed in a process similar to that of step 2: according to the word-level annotation, the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class serve as the negative class.
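The labeling rule shared by steps 2 and 5 (annotated regions are positive, non-overlapping candidates are negative) can be sketched as below. The box format (x1, y1, x2, y2) and the reading of "not overlapping" as "sharing no area at all" are assumptions; a threshold on the overlap ratio would be another possible reading.

```python
def overlaps(a, b):
    """Return True if boxes a and b, given as (x1, y1, x2, y2), share any area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def label_candidates(annotated_boxes, candidate_boxes):
    """Build positive and negative sets from annotations and detected candidates."""
    positives = list(annotated_boxes)
    # A candidate becomes a negative sample only if it overlaps no positive box.
    negatives = [
        c for c in candidate_boxes
        if not any(overlaps(c, p) for p in positives)
    ]
    return positives, negatives
```

Candidates that partly overlap a positive box are dropped from both sets, which keeps ambiguous crops out of the training data.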
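Step 6 includes:
Step 6-1, each image in the word-level dataset obtained in step 5 is mapped into a d-dimensional Euclidean space, so that:
$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \mathrm{margin} < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \qquad (1)$$
In formula (1), $(x_i^a, x_i^p, x_i^n)$ is a triplet; $x_i^a$ and $x_i^p$ are samples that belong to the same class (positive or negative) in the word-level dataset built in step 5; $x_i^n$ is a sample whose class differs from that of $x_i^a$; $f(\cdot)$ denotes the deep metric model; and margin is the parameter value separating the sample pair $(x_i^a, x_i^p)$ from the sample pair $(x_i^a, x_i^n)$;
Step 6-2, design the following loss function:
$$L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \mathrm{margin} \right]_+$$
The gradients are derived as follows. For every triplet whose bracketed term is positive:
$$\frac{\partial L}{\partial f(x_i^a)} = 2\left( f(x_i^n) - f(x_i^p) \right), \quad \frac{\partial L}{\partial f(x_i^p)} = 2\left( f(x_i^p) - f(x_i^a) \right), \quad \frac{\partial L}{\partial f(x_i^n)} = 2\left( f(x_i^a) - f(x_i^n) \right)$$
and the gradients are zero for every other triplet.
Here, $N$ denotes the number of samples, $[\cdot]_+$ denotes $\max(\cdot, 0)$, $f(x_i^a)$ denotes the feature extracted by the deep metric model from the i-th anchor sample, $f(x_i^p)$ denotes the feature extracted from a sample of the same class as the i-th anchor, and $f(x_i^n)$ denotes the feature extracted from a sample of a different class from the i-th anchor;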
Step 6-3, train the deep metric model with the loss function. The network of the deep metric model comprises two convolutional layers, two pooling layers and two fully connected layers. All images in the word-level dataset obtained in step 5 are first normalized to 32 × 32. The first convolutional layer has 6 kernels of size 5 × 5; the second convolutional layer has 12 kernels of size 5 × 5; the kernel parameters are initialized randomly. The first convolutional layer outputs 6 feature maps of size 28 × 28. The pooling layers have size 2 × 2 and use max pooling. After the first pooling, the feature maps are 14 × 14; after the second convolution, they are 10 × 10. The two fully connected layers have 150 and 50 units respectively. After the convolutions are complete, an L2 normalization layer is added so that the resulting features are standardized. After these layers, every image in the word-level dataset obtained in step 5 becomes an effective feature vector. Finally, a Triplet loss layer is introduced for training; the loss function proposed in step 6-2 is the triplet loss.
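The feature-map sizes quoted in step 6-3 can be checked with a short calculation. The sketch assumes unpadded convolutions with stride 1 and non-overlapping pooling, which is consistent with the sizes 28, 14 and 10 given in the text. The 5 × 5 size after the second pooling and the flattened length of 300 are derived here and are not stated in the text.

```python
def conv_out(size, kernel):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, window):
    """Output width of non-overlapping pooling."""
    return size // window


def feature_sizes(input_size=32, kernel=5, pool=2, channels=(6, 12)):
    """Trace (layer, number of maps, map width) through the conv/pool stack."""
    trace = []
    size = input_size
    for index, maps in enumerate(channels, start=1):
        size = conv_out(size, kernel)
        trace.append((f"C{index}", maps, size))
        size = pool_out(size, pool)
        trace.append((f"P{index}", maps, size))
    flattened = channels[-1] * size * size
    return trace, flattened


if __name__ == "__main__":
    layers, flat = feature_sizes()
    for name, maps, width in layers:
        print(f"{name}: {maps} maps of {width} x {width}")
    print(f"flattened input to the first fully connected layer: {flat}")
```

Under these assumptions the two fully connected layers map the flattened vector to 150 and then to 50 dimensions, so each image ends up as a 50-dimensional feature.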
Step 7 includes: for an image under test, detect character-level candidate regions with the method of step 1; remove the negative samples among the candidate regions with the deep neural network of step 2; construct candidate word-level regions with the methods of steps 3 and 4; then classify each word-level region with the deep metric model of step 6 and filter out the negative class, which yields the final text regions.
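The test-time flow of step 7 can be summarized in a short sketch. The four callables are hypothetical placeholders for the components described in steps 1 to 6 and are passed in as arguments; none of them is an API defined by the invention.

```python
def detect_text(image, detect_candidates, is_character, group_into_words, is_text_word):
    """Run the step 7 pipeline with caller-supplied components.

    detect_candidates: image -> list of character-level candidate regions (step 1)
    is_character:      region -> bool, the character classifier (step 2)
    group_into_words:  list of regions -> list of word-level regions (steps 3 and 4)
    is_text_word:      word region -> bool, the deep metric classifier (step 6)
    """
    candidates = detect_candidates(image)
    characters = [region for region in candidates if is_character(region)]
    words = group_into_words(characters)
    return [word for word in words if is_text_word(word)]
```

Passing the components as functions keeps the sketch independent of any particular detector or network implementation.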
In step 6, the most important question for metric learning is how to measure the distance between images. The idea of the invention is to minimize the intra-class distance and maximize the inter-class distance, so that the class boundary becomes more distinct. To this end, the triplet loss is chosen, and the idea of the invention is realized by constructing triplets. Each image is mapped into a d-dimensional Euclidean space. This ensures that $x_i^a$ (the anchor) is closer to the same-class $x_i^p$ (the positive sample) and farther from the different-class $x_i^n$ (the negative sample). Therefore:
$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \mathrm{margin} < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2$$
During training, the loss should decrease over the iterations; the smaller, the better. That is, the anchor should be as close as possible to its positive sample and as far as possible from its negative sample. Regarding the value of the margin:
(1) When the margin is small, the loss tends to 0 more easily. The anchor need not be pulled very close to its positive sample, nor pushed very far from its negative sample, for the loss to approach 0 quickly. A model trained this way often cannot distinguish similar images well.
(2) When the margin is large, the network parameters must work much harder to shrink the distance between the anchor and its positive sample and to enlarge the distance between the anchor and its negative sample. In particular, when the margin is set too large, the loss often stays at a very large value.
Therefore, setting a reasonable margin is critical; it is an important indicator for measuring the similarity between samples. A good trade-off must be made between distinguishing similar images and the size of the gap between images of different classes. These considerations offer some guidance for setting the margin, but they do not yield a definite, detailed rule. In the experiments, an appropriate value was chosen by repeated adjustment over many trials.
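The effect of the margin can be seen on a single triplet. The sketch below uses pure Python and hand-picked two-dimensional feature vectors; the vectors and the margin values are illustrative assumptions and do not come from the experiments.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


if __name__ == "__main__":
    # Illustrative features: the positive is near the anchor, the negative farther away.
    anchor = l2_normalize([1.0, 0.0])
    positive = l2_normalize([0.9, 0.1])
    negative = l2_normalize([0.6, 0.4])
    for margin in (0.05, 0.2, 1.0):
        print(margin, triplet_loss(anchor, positive, negative, margin))
```

With the small margins the constraint is already satisfied and the loss is zero, so this triplet no longer drives learning; with the large margin the loss stays positive. For L2-normalized features the squared distance never exceeds 4, so a margin above 4 could never be satisfied.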
The overall network structure uses convolutional layers for their ability to extract features and pooling layers for their ability to select features and reduce the number of parameters; it is therefore relatively complete.
In mathematics, a metric, or distance function, defines the distance between the elements of a set. A set equipped with a metric is called a metric space. Metric learning is what is commonly called similarity-based feature learning. The purpose of learning a distance metric is to measure the degree of similarity between samples, and this measurement is one of the core problems of pattern recognition. To compute the similarity between two images, one must decide how to measure similarity so that images of different classes have low similarity and images of the same class have high similarity; this is exactly the problem that metric learning addresses. If the target is a face, a suitable distance function must be constructed to quantify facial features such as hair color and face shape; if the target is pose recognition, a distance function that can measure pose similarity must be constructed. Features are highly varied. To model these similarities, suitable features and a distance function can be selected by hand for the specific task. This approach, however, may require a great deal of manual time and effort, and may be far from robust to changes in the data. As an alternative, metric learning can learn a distance metric function for a particular task directly from that task.
The invention introduces the idea of metric learning into deep learning and applies the resulting model to text detection, specifically to the problem of classifying positive and negative classes during text detection.
Beneficial effects: the invention solves problems in the prior art by combining metric learning with deep learning and improving the loss function used in deep learning. The triplet loss replaces the original Softmax function, and the learned features are compared by Euclidean distance. The distance between samples of the same class is minimized and the distance between samples of different classes is maximized, so that the difference between classes is larger and the boundary between them is clearer. This improvement has a clear effect on the binary classification problem of text-line detection and classification.
Brief description of the drawings
The invention is further described below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become clearer.
Fig. 1 is an example of the idea of deep metric learning.
Fig. 2 is an example of a discrimination method using metric learning.
Fig. 3 is a diagram of the Triplet network model.
Fig. 4 is the parameter table of the model.
Fig. 5 shows the results obtained by model detection.
Fig. 6 is the flow chart of the entire method.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
The invention is suited to the problem of text-line detection in images, and especially to the binary classification of candidate text boxes. It proposes a new method for text detection and classification. 1) For detecting and filtering candidate regions, the MSER detection algorithm is used to obtain character-level candidate regions, and a character-level training dataset is constructed. The training set comes mainly from the scene text datasets ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the corresponding text is cropped as the positive class; among the candidate regions produced by the detection algorithm, those that do not overlap the positive class are chosen as the negative class. With this as input, a deep learning network is trained, and the resulting classifier classifies the candidate characters and filters out the negative class. 2) The candidate characters are clustered into text lines with a seed-growing algorithm. The center point of each candidate region is taken and a small threshold is set on the abscissa of each center point; candidate character regions within this threshold are all assigned, along the horizontal direction, to the same text-line region. The average distance between characters is computed, and a split is made between two characters whose distance is significantly greater than the average, dividing them into two different words. 3) For classifying text lines, a deep metric model is used, which minimizes the distance within a class and maximizes the difference between classes. The invention includes the following steps:
Step 1, apply the MSER (Maximally Stable Extremal Regions) detection algorithm to the input image to obtain character-level candidate regions;
Step 2, construct a character-level training dataset. The training data of the invention comes mainly from the scene text datasets ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the text inside each annotated character region is cropped as the positive class; among the candidate regions obtained in step 1, those that do not overlap the positive class are chosen as the negative class. The positive and negative classes form the character-level training dataset, which is used as input to train a deep neural network. The trained deep neural network serves as a classifier (it judges whether a character region contains a character) that classifies the candidate characters in the candidate regions and filters out the negative class;
Step 3, take the center point of each candidate region and set a small threshold (typically 5 pixels) on the abscissa of each center point; candidate character regions that fall within this threshold are all assigned, along the horizontal direction, to the same text-line region;
Step 4, compute the average distance between characters in each text-line region obtained in step 3. Two characters whose distance is greater than twice the average distance are split into two different words; conversely, two characters whose distance is less than twice the average distance are assigned to the same word. This yields word-level candidate regions;
Step 5, in the word-level candidate regions obtained in step 4, each character belongs to one word and each word consists of at least one character. All words so constructed form the word-level dataset. According to the word-level text annotation (that is, the text annotation of the characters contained in the word-level dataset), the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class are taken as the negative class;
Step 6, using the positive and negative classes obtained in step 5 as input, build and train a deep metric model, which can then be used for word-level classification;
Step 7, use the deep metric model obtained in step 6 to filter the image under test and obtain the final text regions.
In step 1, when the MSER algorithm is used, the MSER threshold is set to its minimum value of 1. When detecting the text regions in an image, the MSER algorithm is applied on four channels: H (hue), L (lightness), S (saturation) and grayscale.
In step 2, the dataset is constructed by the user. When constructing it, the similarity between the constructed dataset and the images to be detected should be taken into account. In general, the higher the similarity, the better.
In step 5, the negative class of text lines is removed. The word-level dataset is constructed in a process similar to that of step 2: according to the word-level annotation, the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class serve as the negative class.
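Step 6 includes:
Step 6-1, each image in the word-level dataset obtained in step 5 is mapped into a d-dimensional Euclidean space, so that:
$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \mathrm{margin} < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \qquad (1)$$
In formula (1), $(x_i^a, x_i^p, x_i^n)$ is a triplet; $x_i^a$ and $x_i^p$ are samples that belong to the same class (positive or negative) in the word-level dataset built in step 5; $x_i^n$ is a sample whose class differs from that of $x_i^a$; $f(\cdot)$ denotes the deep metric model; and margin is the parameter value separating the sample pair $(x_i^a, x_i^p)$ from the sample pair $(x_i^a, x_i^n)$;
Step 6-2, design the following loss function:
$$L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \mathrm{margin} \right]_+$$
Here, $N$ denotes the number of samples, $[\cdot]_+$ denotes $\max(\cdot, 0)$, $f(x_i^a)$ denotes the feature extracted by the deep metric model from the i-th anchor sample, $f(x_i^p)$ denotes the feature extracted from a sample of the same class as the i-th anchor, and $f(x_i^n)$ denotes the feature extracted from a sample of a different class from the i-th anchor;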
Step 6-3, train the deep metric model with the loss function. The network of the deep metric model comprises two convolutional layers, two pooling layers and two fully connected layers. All images in the word-level dataset obtained in step 5 are first normalized to 32 × 32. The first convolutional layer has 6 kernels of size 5 × 5; the second convolutional layer has 12 kernels of size 5 × 5; the kernel parameters are initialized randomly. The first convolutional layer outputs 6 feature maps of size 28 × 28. The pooling layers have size 2 × 2 and use max pooling. After the first pooling, the feature maps are 14 × 14; after the second convolution, they are 10 × 10. The two fully connected layers have 150 and 50 units respectively. After the convolutions are complete, an L2 normalization layer is added so that the resulting features are standardized. After these layers, every image in the word-level dataset obtained in step 5 becomes an effective feature vector. Finally, a Triplet loss layer is introduced for training; the loss function proposed in step 6-2 is the triplet loss.
Step 7 includes: for an image under test, detect character-level candidate regions with the method of step 1; remove the negative samples among the candidate regions with the deep neural network of step 2; construct candidate word-level regions with the methods of steps 3 and 4; then classify each word-level region with the deep metric model of step 6 and filter out the negative class, which yields the final text regions.
Embodiment:
Using the above scheme, the invention detects text regions on ICDAR 2011.
The implementation is as follows. ICDAR 2003, ICDAR 2011 and ICDAR 2013 are standard datasets for text detection. Candidate characters are first detected with the detection algorithm. After filtering removes the non-character candidates, the characters are clustered into text lines according to geometric information. Each text line is then classified by the trained deep metric learning classifier, non-text regions are removed, and the final detection result is obtained.
Step 1, using the MSER detection algorithm, detection is performed separately on the H, L, S and grayscale channels in order to obtain as many candidate regions as possible.
Step 2, construct a character-level training dataset. The training set comes mainly from ICDAR 2003, ICDAR 2011 and ICDAR 2013. According to the annotated character regions, the corresponding text is cropped as the positive class; among the candidate regions produced by the detection algorithm, those that do not overlap the positive class are chosen as the negative class. With this as input, a deep learning network is trained, and the resulting classifier classifies the candidate characters and filters out the negative class;
Step 3, cluster the candidate characters into text lines with a seed-growing algorithm. The center point of each candidate region is taken and a small threshold is set on the abscissa of each center point. Candidate character regions within this threshold are all assigned, along the horizontal direction, to the same text-line region. The average distance between characters is computed, and a split is made between two characters whose distance is significantly greater than the average, dividing them into two different words;
Step 4, refine the horizontal text boxes obtained in step 3: compute the average distance between characters and split between two characters whose distance is significantly greater than the average, dividing them into two different words;
Step 5, construct the word-level dataset. The training set is constructed in a process similar to that of step 2: according to the word-level annotation, the corresponding regions are cropped as the positive class, and regions that do not overlap the positive class serve as the negative class. The deep metric model is then trained.
Step 6, build and train the deep metric model. The training objective of the deep metric model is shown in Fig. 1 and Fig. 2. In Fig. 1, anchor denotes the anchor image (with a text label), positive denotes an image of the same class as the anchor (an image that itself contains text), and negative denotes an image of a different class from the anchor. In Fig. 2, the input is a pair of images; passing them through a neural network (w denotes its parameters) gives the hidden representation h; Distance Metric denotes the predefined metric distance; finally it is judged whether the two are the same image (Same or Different). The triplet objective governs how close the three samples are after passing through the deep metric model (similar samples end up closer than dissimilar ones). Fig. 3 shows how the deep metric model extracts features from an image and how it is trained: on the left is the input image containing text; a neural network produces the feature representation (Feature Representation); finally the loss value is computed with the Triplet Loss defined in step 6. Fig. 4 lists the parameters of the network structure and gives the details of the multilayer neural network: Input image size is the size of the input image (32*32), Kernel size is the size of the convolution kernel (5*5), Pooling size is the downsampling ratio, C1 and C2 denote the two convolutional layers, and F1 and F2 denote the two fully connected layers.
Step 7, for an image under test, detect character-level candidate regions with the method of step 1; remove the negative samples among the candidate regions with the deep neural network of step 2; construct candidate word-level regions with the methods of steps 3 and 4; then classify each word-level region with the deep metric model of step 6 and filter out the negative class, which yields the final text regions. Fig. 5 shows the detection results obtained, and the flow chart of the invention is shown in Fig. 6.
The invention provides a text detection method based on a deep metric model. There are many methods and ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make various improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.