CN103413132A

CN103413132A - Scene image text detection method based on progressive hierarchical cognition

Info

Publication number: CN103413132A
Application number: CN2013102534371A
Authority: CN
Inventors: 刘跃虎; 周刚; 苏远歧; 翟少卓
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2013-06-24
Filing date: 2013-06-24
Publication date: 2013-11-27
Anticipated expiration: 2033-06-24
Also published as: CN103413132B

Abstract

The invention relates to a scene image text detection method based on progressive hierarchical cognition. Starting from the connected components extracted from a scene image, the method first groups them into different connected component sets according to their spatial adjacency and arrangement relations: single connected components, connected component pairs, and connected component rows. Different features are designed for each kind of set, and the text confidence of one kind of set is used as a feature of the sets at the next level. The classifier parameters of each level are learned under supervision through a consistency cognition hypothesis on the connected component sets and a conditional random field model, and the text confidences of the connected component sets are calculated level by level. Finally, the text lines are located. The method integrates appearance features, low-order relations, and high-order relations, calculates parameters and classes directly through classifier algorithms, and can effectively improve the recall and precision of scene image text detection.

Description

A scene image text detection method based on progressive hierarchical cognition

Technical field

The present invention belongs to the technical field of scene image text detection, and specifically relates to a scene image text detection method based on progressive hierarchical cognition.

Background technology

Text detection locates the text regions of an image by means of the visual appearance features of characters, and so provides strong support for subsequent text recognition. As a key technology in text information extraction, text detection has long been an active research problem in computer vision. Text, however, is a special kind of visual target: its size, font, color, and language are all uncertain, and the abundant complex backgrounds in natural scene images are easily confused with it. These factors make text regions in scene images difficult to detect. The main step of existing connected-component-based text detection methods is to separate text connected components from non-text connected components by their differences. But text connected components vary in appearance among themselves and can resemble non-text connected components, which makes this separation difficult.

Combining the appearance features of connected components with their context is therefore a new class of technical approach. The method of Pan uses the context of binary neighbor relations together with appearance features (Pan YF, Hou XW, Liu CL. A Hybrid Approach to Detect and Localize Texts in Natural Scene Images [J]. IEEE Transactions on Image Processing, 2011, 20(3): 800-813). The methods of Yi and Yao analyze text line features through the high-order relations that text connected components form in space (Yi C, Tian Y. Text string detection from natural scenes by structure-based partition and grouping [J]. IEEE Transactions on Image Processing, 2011, 20(9): 2594-2605; Yao C, Bai X, Liu W, et al. Detecting texts of arbitrary orientations in natural images [C], 2012: 1083-1090). However, there is still no theoretical model that integrates appearance features, low-order relations, and high-order relations. This makes both feature design and parameter learning difficult, and limits the generality of the models.

Summary of the invention

To solve the problems of the prior art described above, the object of the present invention is to provide a scene image text detection method based on progressive hierarchical cognition, for use in visual intelligent systems such as vehicle-mounted vision navigation and scene image semantic analysis. In connected component analysis, it effectively improves both precision and recall compared with existing methods.

To achieve the above object, the present invention adopts the following technical solution:

A scene image text detection method based on progressive hierarchical cognition draws on the hierarchical nature of human cognition. Starting from the connected components obtained from a scene image, it first uses the spatial adjacency and arrangement relations of the connected components to form different connected component sets: single connected components, connected component pairs, and connected component rows. It then designs different features for the different connected component sets, and uses the text confidence of one kind of set as a feature of the subsequent sets. The classifier parameters of each level are learned under supervision through the consistency cognition hypothesis on the connected component sets and a conditional random field model, and the text confidences of the connected component sets are calculated in turn. Finally, the text lines are located. The method specifically comprises the following steps:

Step 1: In the first-layer analysis, extract the appearance features of each single connected component, and use a supervised classifier to learn and estimate the text confidence of the single connected component;

Step 2: Before the second-layer analysis, cluster the candidate single connected components two by two into connected component pairs according to their spatial positions;

Step 3: In the second-layer analysis, extract the similarity features and the average connected component energy feature of each connected component pair, and use a supervised classifier to estimate the text confidence of the connected component pair;

Step 4: Before the third-layer analysis, join the candidate connected component pairs into connected component rows according to their connection and arrangement relations;

Step 5: In the third-layer analysis, extract the appearance difference features and gradient histogram feature of each connected component row, together with the mean energy of all its single connected components and the mean energy of its connected component pairs, and use a supervised classifier to locate the text lines.

For single connected components, the designed features are appearance features, comprising geometric features, stroke width features, and texture features.

For connected component pairs, the designed features are similarity features and the average connected component energy feature.

For connected component rows, the designed features are appearance difference features, the gradient histogram feature, the mean energy of all single connected components, and the mean energy of the connected component pairs.

Compared with the prior art, the present invention differs (and innovates) in the following respects:

1) The present invention adopts the hierarchical nature of human cognition: it designs corresponding features for the objects of three levels one by one and propagates the analysis results between levels, progressively filtering out non-text connected components. It introduces the classifier output as the text confidence of a connected component set and propagates it between levels, which can effectively improve the recall and precision of scene image text detection;

2) For model parameter estimation and class inference, the present invention considers appearance features, low-order relations, and high-order relations together, and can calculate the parameters and classes directly through classifier algorithms, whereas existing methods find it difficult to estimate parameters and infer classes under high-order relations.

Brief description of the drawings

Fig. 1 shows the parameter learning and cognitive inference process of the hierarchical cognitive model.

Fig. 2 shows the energy feature analysis and comparison curves of the hierarchical cognitive model: Fig. 2A compares the classification results of three different feature sets in the second layer; Fig. 2B compares the classification results of three different feature sets in the third layer.

Detailed description

The present invention is described in further detail below with reference to the drawings and a specific embodiment.

First, the generation of connected component sets in the invention is described.

To obtain the connected component sets of each layer, the connected components must be clustered. Clustering has two steps. Before the second-layer analysis, candidate connected component pairs are clustered. Then, before the third-layer analysis, the connected component pairs are joined into candidate connected component rows. The two clustering steps are described in detail below.

Two connected components X_i and X_j that appear adjacent and approximately parallel are marked as a candidate connected component pair when they satisfy the following two conditions:

dist(X_i, X_j) < 2·max(max(w_i, h_i), max(w_j, h_j))    (1)

dist_y(X_i, X_j) < 0.5·max(h_i, h_j)    (2)

In formulas (1) and (2), dist(X_i, X_j) is the Euclidean distance between the centroids of the two connected components X_i and X_j; dist_y(X_i, X_j) is the vertical distance between their centroid coordinates; and (w_i, h_i) and (w_j, h_j) are the widths and heights of the bounding boxes of the two connected components, respectively.
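Conditions (1) and (2) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `Component` record and the function names are assumptions introduced here, and the centroid and bounding-box values are assumed to be measured in pixels beforehand.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical record for one connected component."""
    cx: float  # centroid x
    cy: float  # centroid y
    w: float   # bounding-box width
    h: float   # bounding-box height


def is_candidate_pair(a: Component, b: Component) -> bool:
    """Return True when a and b satisfy both condition (1) and condition (2)."""
    dist = math.hypot(a.cx - b.cx, a.cy - b.cy)   # Euclidean centroid distance
    dist_y = abs(a.cy - b.cy)                     # vertical centroid distance
    cond1 = dist < 2 * max(max(a.w, a.h), max(b.w, b.h))
    cond2 = dist_y < 0.5 * max(a.h, b.h)
    return cond1 and cond2


def candidate_pairs(components):
    """Index pairs (i, j), i < j, of all components that pass both conditions."""
    return [
        (i, j)
        for i in range(len(components))
        for j in range(i + 1, len(components))
        if is_candidate_pair(components[i], components[j])
    ]
```

The exhaustive double loop is quadratic in the number of components; it is used here only to keep the sketch short.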

The tilt angle θ_ij of a text connected component pair (X_i, X_j) is defined as the inclination of the line through the centroids of X_i and X_j. The difference between the tilt angles of two text connected component pairs (X_i, X_j) and (X_j, X_k) may not exceed π/12, that is:

|θ_ij - θ_jk| ≤ π/12    (3)

By linking connected component pairs two by two in this way, all connected components lying on one straight line are joined together to form a connected component row.

Next, the calculation of the text confidence of the connected component sets is analyzed.
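The linking rule of condition (3) can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the tilt angle is folded into (-π/2, π/2] so that it does not depend on the order of the two centroids, two pairs are linked only when they share a component, and the groups are merged with a small union-find structure. All function names are hypothetical.

```python
import math


def tilt(p, q):
    """Inclination of the line through centroids p and q, folded into (-pi/2, pi/2]."""
    t = math.atan2(q[1] - p[1], q[0] - p[0])
    if t > math.pi / 2:
        t -= math.pi
    elif t <= -math.pi / 2:
        t += math.pi
    return t


def angle_diff(t1, t2):
    """Smallest difference between two undirected line orientations."""
    d = abs(t1 - t2)
    return min(d, math.pi - d)


def group_rows(centroids, pairs, max_diff=math.pi / 12):
    """Merge pairs that share a component and satisfy condition (3) into rows."""
    parent = list(range(len(pairs)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    tilts = [tilt(centroids[i], centroids[j]) for i, j in pairs]
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            shares_component = bool(set(pairs[a]) & set(pairs[b]))
            if shares_component and angle_diff(tilts[a], tilts[b]) <= max_diff:
                parent[find(a)] = find(b)

    rows = {}
    for idx, (i, j) in enumerate(pairs):
        rows.setdefault(find(idx), set()).update((i, j))
    return sorted(sorted(r) for r in rows.values())
```

In this sketch a component may belong to more than one row when it sits at the corner of two differently oriented pairs; how such cases are resolved is not described in the text.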

Suppose n connected components are clustered under prior rules into a connected component row, forming a graph model G = (v, ε), where v denotes all the nodes and ε denotes the edges between them. These nodes form the observation of the whole random sequence, X = [x_1, x_2, ..., x_n], and the corresponding labeling of the random sequence is Y = [y_1, y_2, ..., y_n]. The nodes satisfy the Markov property (the clustering takes spatial neighbor relations into account). According to the definition of a conditional random field in the literature, when the sequence is labeled Y = Y^*, the conditional probability given the observation X is:

P(Y = Y^* | X) ∝ exp(-E(X, Y^*, C, Λ))    (4)

where E(X, Y^*, C, Λ) is the energy function of the whole graph model, C denotes the cliques in the graph model, and Λ denotes the parameters of the energy function. The hierarchical cognitive model has three kinds of cliques: single connected components, connected component pairs, and connected component rows. The sum of the energies of all cliques forms the overall energy function:

E(X, C) = Σ_{c∈C} V_c(X)    (5)

where V_c(X) denotes the energy of clique c. The parameters Λ of the whole model must then be solved, and the final labeling inferred.
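Because the model has only three kinds of cliques, the sum in (5) can be split by clique type. In the sketch below, C_u, C_b, and C_s are names assumed here (they do not appear in the original) for the sets of single-component, pair, and row cliques:

```latex
E(X, C) = \sum_{c \in C} V_c(X)
        = \sum_{c \in C_u} V_c(X) + \sum_{c \in C_b} V_c(X) + \sum_{c \in C_s} V_c(X),
\qquad C = C_u \cup C_b \cup C_s .
```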

In general, the parameters of a conditional random field are estimated by maximizing the conditional log-likelihood, that is, the conditional probability. Such methods optimize the parameters by seeking to maximize the probability (equivalently, to minimize the energy function). If the clique set C contains multi-node cliques, the parameter optimization problem becomes NP-hard and the parameters are difficult to solve.

For text detection, the present invention makes two assumptions about this problem. First, a set of text connected components usually forms a local association in the image and has no relation to the non-text connected components in the image. Second, only the case in which a connected component set is text matters (Y^* = 1); every other configuration that may occur in the random field is regarded as a non-text sequence (Y^* = 0). It therefore suffices to estimate the text confidence P(Y^* = 1 | X) for the case Y^* = 1, which is called consistency cognition.

The usual approach of obtaining a labeling by minimizing the energy of the random field thus becomes one of solving for the text confidence under a given labeling, so that labeling a connected component set is cast as a binary classification problem. The classifier parameters can then be trained on positive and negative samples through supervised learning, and the classifier output is used directly to fit the overall energy function of the random field, as shown in Fig. 1. The energy values of the cliques at every level are all obtained from classifier outputs. From the classifier's point of view, the outputs of the earlier layers serve as features with strong discriminative power in the classifiers of the later layers. The mature algorithms available for various classifiers also guarantee the validity of the parameter learning results.
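The idea of taking each clique energy directly from a classifier output, and feeding the output of one level into the next as a feature, can be sketched as follows. The text does not name a classifier, so a tiny logistic regression trained by gradient descent is used here purely as a stand-in. The toy feature values, the labels, and all function names are assumptions for illustration.

```python
import math


def confidence(w, x):
    """Text confidence F(x) in (0, 1) from a logistic model; w[-1] is the bias."""
    z = sum(wk * xk for wk, xk in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))


def energy(w, x):
    """Clique energy taken directly from the classifier output: E = 1 - F."""
    return 1.0 - confidence(w, x)


def train_logistic(samples, labels, lr=0.5, epochs=500):
    """Supervised learning on positive (1) and negative (0) samples."""
    dim = len(samples[0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(samples, labels):
            err = confidence(w, x) - y
            for k in range(dim):
                grad[k] += err * x[k]
            grad[dim] += err
        w = [wk - lr * gk / len(samples) for wk, gk in zip(w, grad)]
    return w


# Level 1 (toy data): one appearance feature per single connected component.
unary_x = [[0.9], [0.8], [0.85], [0.1], [0.2], [0.15]]
unary_y = [1, 1, 1, 0, 0, 0]
w_u = train_logistic(unary_x, unary_y)


def pair_features(xi, xj, similarity):
    """Level-2 input: mean level-1 energy of the two members plus a similarity score."""
    mean_energy = (energy(w_u, xi) + energy(w_u, xj)) / 2.0
    return [mean_energy, similarity]


# Level 2 (toy data): the level-1 output is propagated as a feature.
pair_x = [
    pair_features([0.9], [0.8], 0.9),
    pair_features([0.85], [0.9], 0.8),
    pair_features([0.1], [0.2], 0.3),
    pair_features([0.15], [0.8], 0.2),
]
pair_y = [1, 1, 0, 0]
w_b = train_logistic(pair_x, pair_y)
```

Each level is an ordinary binary classification problem, so no joint inference over the random field is needed; any classifier that outputs a confidence in [0, 1] could replace the stand-in.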

On the basis of the clustering rules and of the model parameter learning and cognitive inference described above, the features of the different levels are designed and the corresponding text confidences are calculated.

1) Single connected component level: three types of features are used, namely geometric features f_g, stroke width features f_sw, and texture features f_t. The geometric features f_g comprise the aspect ratio, axis length ratio, occupancy ratio, and compactness of each connected component. The stroke width features f_sw are the stroke width ratio and stroke width variance, designed on the basis of the calculated stroke width of the connected component. The texture features f_t measure the foreground color consistency and background color consistency of the local region of the connected component. With these three kinds of features, a supervised classifier learns the model parameters λ_u and estimates the text confidence F_u(·), from which the energy value E_u(·) of the connected component is obtained:

E_u(X, y_i = 1, λ_u) = 1 - F_u([f_g(X), f_sw(X), f_t(X)], λ_u)    (6)
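A minimal sketch of some first-level geometric features follows. The text names the features but gives no formulas, so the definitions below (bounding-box aspect ratio, pixel count over bounding-box area, and area over squared boundary length) are common choices assumed here; the axis length ratio, stroke width, and texture features are omitted. The function names are hypothetical.

```python
def geometric_features(pixels):
    """Geometric features of one connected component given as (x, y) pixel coordinates."""
    pixel_set = set(pixels)
    xs = [x for x, _ in pixel_set]
    ys = [y for _, y in pixel_set]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    area = len(pixel_set)
    # Boundary pixels: at least one 4-neighbour lies outside the component.
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    boundary = sum(
        1
        for x, y in pixel_set
        if any((x + dx, y + dy) not in pixel_set for dx, dy in neighbours)
    )
    return {
        "aspect_ratio": width / height,
        "occupancy_ratio": area / (width * height),
        "compactness": area / boundary ** 2,
    }


def unary_energy(text_confidence):
    """Equation (6): E_u = 1 - F_u, with F_u supplied by any trained classifier."""
    return 1.0 - text_confidence
```

For a solid 4 x 2 rectangle, every pixel is a boundary pixel, so the sketch yields an aspect ratio of 2.0, an occupancy ratio of 1.0, and a compactness of 8 / 64 = 0.125.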

2) Connected component pair level: two types of features are used, namely the average connected component energy feature f_up and the similarity features f_sa. The average connected component energy feature f_up is the mean of the single connected component energies obtained at the previous level. The similarity features f_sa are the aspect ratio, stroke width ratio, and foreground and background color differences of the two connected components. With these two kinds of features, a supervised classifier likewise learns the parameters λ_b and estimates the text confidence F_b(·) of the connected component pair, from which the energy value E_b(·) of the pair is obtained:

E_b(X, (y_i, y_j) = 1, λ_b) = 1 - F_b([f_up(X), f_sa(X)], λ_b)    (7)
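The second-level feature vector [f_up, f_sa] of equation (7) can be assembled as in the sketch below. How each similarity term is computed is not specified in the text. Here the ratios are taken as min/max ratios and the color differences as Euclidean distances in color space, which are illustrative assumptions, as are the dictionary keys and function names.

```python
import math


def min_max_ratio(a, b):
    """Symmetric ratio in (0, 1]; 1 means the two values are equal."""
    return min(a, b) / max(a, b)


def pair_feature_vector(ci, cj):
    """Build [f_up, f_sa...] for a pair of components described by dictionaries."""
    # f_up: mean of the first-level energies of the two members.
    f_up = (ci["energy"] + cj["energy"]) / 2.0
    # f_sa: similarity of shape, stroke width, and colors.
    f_sa = [
        min_max_ratio(ci["aspect"], cj["aspect"]),
        min_max_ratio(ci["stroke_width"], cj["stroke_width"]),
        math.dist(ci["fg"], cj["fg"]),  # foreground color difference
        math.dist(ci["bg"], cj["bg"]),  # background color difference
    ]
    return [f_up] + f_sa
```

The resulting vector is what a second-level classifier would receive; its confidence F_b then gives the pair energy E_b = 1 - F_b.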

3) Connected component row level: the features are the connected component row energy features f_str, the appearance difference features f_v, and the gradient histogram feature f_hog. The row energy features f_str comprise the mean energy of all single connected components and the mean energy of the connected component pairs. The appearance difference features f_v comprise the height variance, stroke width variance, and foreground color variance of the connected components. The gradient histogram feature is calculated over 4 gradient directions and 6 local regions, 24 dimensions in total, and describes the local texture distribution of a text line. A supervised classifier learns the parameters λ_s of this level and estimates the text confidence F_s(·) of the connected component row, from which the energy value E_s(·) of the row is obtained:

E_s(X, Y^* = 1, λ_s) = 1 - F_s([f_str(X), f_v(X), f_hog(X)], λ_s)    (8)
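A sketch of the third-level feature vector follows. The text fixes only the dimensions (4 gradient directions, 6 local regions, 24 values). The layout assumed here, with six equal-width vertical strips, unsigned orientations, central-difference gradients, and magnitude-weighted votes, is an illustrative guess, and the function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def gradient_histogram(gray, n_regions=6, n_bins=4):
    """24-dim descriptor: n_regions vertical strips x n_bins unsigned orientations."""
    h, w = len(gray), len(gray[0])
    hist = [[0.0] * n_bins for _ in range(n_regions)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            theta = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            b = min(int(theta / (math.pi / n_bins)), n_bins - 1)
            r = min(x * n_regions // w, n_regions - 1)
            hist[r][b] += mag
    flat = [v for region in hist for v in region]
    total = sum(flat)
    return [v / total for v in flat] if total else flat


def row_feature_vector(unary_energies, pair_energies, heights,
                       stroke_widths, fg_grays, gray_patch):
    """Concatenate f_str (2 values), f_v (3 values), and f_hog (24 values)."""
    f_str = [mean(unary_energies), mean(pair_energies)]
    f_v = [pvariance(heights), pvariance(stroke_widths), pvariance(fg_grays)]
    return f_str + f_v + gradient_histogram(gray_patch)
```

The row classifier's confidence F_s on this 29-value vector would then give E_s = 1 - F_s as in equation (8).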

Text lines are finally detected through F_s(·). Experiments show that, by filtering out non-text connected components layer by layer, this hierarchical cognitive model can effectively improve the precision and recall of scene text detection. Localization results are compared on the standard test set ICDAR2005, as shown in Table 1. The energy features passed between levels also prove very effective: as Fig. 2 shows, both the pair energy feature and the row energy feature effectively improve the classification ability of the hierarchical cognitive model.

The foregoing merely illustrates the principles of the present invention.

Table 1. Comparison of text localization results on ICDAR2005

Claims

1. A scene image text detection method based on progressive hierarchical cognition, characterized in that: drawing on the hierarchical nature of human cognition, and starting from the connected components obtained from a scene image, the method first uses the spatial adjacency and arrangement relations of the connected components to form different connected component sets: single connected components, connected component pairs, and connected component rows; it then designs different features for the different connected component sets, and uses the text confidence of one kind of set as a feature of the subsequent sets; the classifier parameters of each level are learned under supervision through the consistency cognition hypothesis on the connected component sets and a conditional random field model, and the text confidences of the connected component sets are calculated in turn; finally, the text lines are located; the method specifically comprises the following steps:

Step 5: In the third-layer analysis, extract the appearance difference features and gradient histogram feature of each connected component row, together with the mean energy of all its single connected components and the mean energy of its connected component pairs, and use a supervised classifier to make the final decision on the text lines.

2. The scene image text detection method based on progressive hierarchical cognition according to claim 1, characterized in that: for single connected components, the designed features are appearance features, comprising geometric features, stroke width features, and texture features.

3. The scene image text detection method based on progressive hierarchical cognition according to claim 1, characterized in that: for connected component pairs, the designed features are similarity features and the average connected component energy feature.

4. The scene image text detection method based on progressive hierarchical cognition according to claim 1, characterized in that: for connected component rows, the designed features are appearance difference features, the gradient histogram feature, the mean energy of all single connected components, and the mean energy of the connected component pairs.