CN102073862B

CN102073862B - Method for quickly calculating layout structure of document image

Info

Publication number: CN102073862B
Application number: CN 201110040357
Authority: CN
Inventors: 马磊; 刘江
Original assignee: SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Current assignee: SHANDONG SHANDA OUMA SOFTWARE CO., LTD.
Priority date: 2011-02-18
Filing date: 2011-02-18
Publication date: 2013-04-17
Anticipated expiration: 2031-02-18
Also published as: CN102073862A

Abstract

The invention discloses a method for quickly calculating the layout structure of a document image. The method comprises the following steps: (1) inputting an image, and performing grayscale conversion if the input image is a true-color image; (2) performing horizontal gradient calculation on the input image; (3) merging the intra-character and inter-character regions of the input image; (4) marking the text lines of the input image; (5) performing target tracking and positioning on the text lines; (6) performing length filtering on the input image to obtain an image of the layout structure; and (7) outputting the image of the layout structure and a grayscale image. The calculation of the layout structure is simple and effective, the method has a degree of adaptability and can handle images deflected by a small angle, and tests on aliased text lines of English handwriting show that the method has good robustness.

Description

A fast method for calculating the layout structure of a document image

Technical field

The present invention relates to a layout structure calculation method, and specifically to a fast method for calculating the layout structure of a document image.

Background art

The layout structure of a document carries both geometric and logical meaning; for example, a typical document image contains basic information such as titles, paragraphs, and lines. The layout structure can be regarded as composed of several mutually disjoint rectangular blocks, and layout analysis computes the features of these rectangular blocks in order to describe the structural features of the document image.

Basic page-geometry analysis algorithms for document images fall into three classes: bottom-up methods, top-down methods, and hybrid methods. Bottom-up methods use local image information and progressively merge regions that share the same attributes (characters are merged into words, words into sentences, and sentences into paragraphs) to arrive at an understanding of the document layout. Such methods can handle documents with different layouts and documents with some skew, but their time cost is high and their region-merging rules are complicated. Top-down methods start from the image as a whole and rely on its projection profile to segment the image step by step until its geometric structure is obtained. These methods have a degree of adaptability, and projection is simple, fast, and easy to implement, but they perform poorly on complex documents; the factors that limit their effectiveness include the randomness of text line positions, the irregularity of region shapes, and the skew of the document image. Some researchers have therefore combined the two approaches so that both adaptability and algorithm performance improve. The present invention uses a bottom-up analysis of character structure that avoids complicated region-merging rules and eliminates non-text regions by length filtering; the top-down part of the analysis considers how text line features are linked, in order to eliminate the staircase effect in text blocks. The layout is thus expressed as a set of continuous line segments, whose positions are abstracted to a set of points in space.

Layout analysis can be used in many image processing fields. In document image content retrieval, matching by layout structure increases retrieval reliability. In mixed text-and-graphics segmentation, layout analysis distinguishes the image properties of different regions so that different enhancement algorithms can be applied to improve image quality and subjective perception. In OCR, recognition performance is closely tied to the accuracy of character segmentation, and layout analysis plays a very important role in character segmentation. Research on layout analysis therefore has significant research value and a practical application background.

Summary of the invention

The technical problem to be solved by the present invention is to provide a fast method for calculating the layout structure of a document image. The layout structure calculation is simple and effective, has a degree of adaptability, can handle images deflected by a small angle, and has good robustness.

The present invention achieves this object by the following method:

A fast method for calculating the layout structure of a document image, characterized in that it comprises the following steps:

(1) inputting an image, and performing grayscale conversion if the input image is an RGB image;

(2) performing horizontal gradient calculation on the input image;

(3) merging the intra-character and inter-character regions of the input image;

(4) marking the text lines of the input image;

(5) performing target tracking and positioning on the text lines;

(6) performing length filtering on the input image to obtain the layout structure image;

(7) outputting the layout structure image and the grayscale image.

As a further limitation of this technical solution, step (2) comprises the following steps:

(2.1) Denote the convolution kernel factor of the horizontal gradient, and denote the gray value of the pixel at a given width and height position in the image; the gray value at the corresponding position of the gradient image is then expressed as:

[Formula (1): image not reproduced in this text]

(2.2) The gray-level histogram of the gradient image is obtained by counting the probability with which pixels of each gray value occur in the image. With the one-dimensional information entropy corresponding to a gray value denoted accordingly, the calculation of the segmentation point is equivalent to the following form:

[Formula (2): image not reproduced in this text]

(2.3) The segmentation point is the one that corresponds to the maximum information entropy. In formula (2), the information entropy is calculated using the normalized probabilities before the segmentation point and after the segmentation point.

As a further limitation of this technical solution, step (3) comprises the following steps:

(3.1) Denote the expansion factor perpendicular to the text direction, the expansion factor along the text direction, the average character width, and the length filtering factor; the following relation then holds:

[Formula (3): image not reproduced in this text]

(3.2) Denote the image obtained after segmentation by formula (2) and the image obtained from it by merging the intra-character and inter-character regions, and denote the gray value of the pixel at a given width and height position in the image; then:

[Formula (4): image not reproduced in this text]

As a further limitation of this technical solution, step (4) comprises the following steps:

(4.1) Denote the vertical gradient detection kernel factor and the text line marking image; taking the upper edge as the text line mark, the following relation holds:

[Formula (5): image not reproduced in this text]

(4.2) The text line mark uses the upper edge or the lower edge. If a text line is interrupted, or the undulation of the character image makes the mark fluctuate, this defect is overcome by repairing the image to be marked before the marking image is calculated; the repaired image is smoother, which favors tracking of the effective target.

Compared with the prior art, the advantages and positive effects of the present invention are as follows. The layout structure calculation is simple and effective, has a degree of adaptability, and can handle images deflected by a small angle; tests on aliased text lines of English handwriting show that the algorithm has good robustness. The algorithmic complexity of the method is low: length filtering and the tracking and positioning of text line targets avoid connectivity computation, and image thinning is replaced by a vertical gradient algorithm, so the method can be used in real-time image processing systems. The method can be applied in many image processing fields, such as document image retrieval, character segmentation, document image classification, document image analysis, and text-and-graphics segmentation.

Brief description of the drawings

Fig. 1 is the single-character image model of the present invention.

Fig. 2 is a schematic diagram of the connectivity of a single-character image.

Fig. 3 is a schematic diagram of the spacing between characters.

Fig. 4 is a schematic diagram of the line spacing between character rows.

Fig. 5 is a schematic diagram of line block repair.

Fig. 6 is a schematic diagram of line target tracking.

Fig. 7 is a schematic diagram of length filtering of non-text regions.

Fig. 8 is the overall flow chart of the preferred embodiment of the present invention.

Fig. 9 is the page image of preferred embodiment one.

Fig. 10 is the grayscale-converted image of preferred embodiment one.

Fig. 11 is the line block marking image of preferred embodiment one.

Fig. 12 is the vertical gradient image of preferred embodiment one.

Fig. 13 is the layout structure image of preferred embodiment one after length filtering.

Fig. 14 is a schematic diagram of the combined layout structure comparison for preferred embodiment one.

Fig. 15 is the layout structure image of preferred embodiment one after tilting.

Fig. 16 is a schematic diagram of the combined comparison of the tilted layout structure for preferred embodiment one.

Fig. 17 is the document image of preferred embodiment two.

Fig. 18 is the layout structure image of preferred embodiment two.

Fig. 19 shows the staircase elimination effect for preferred embodiment two.

Fig. 20 is a schematic diagram of the combined comparison for preferred embodiment two.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and preferred embodiments.

1. Characteristics of text lines in images

In the general sense, a document image is an image that contains only character information, but the document images in actual use are very complex, for example mixed text and graphics, calligraphy, and forms, which poses a great challenge to layout analysis. We also recognize, however, that character images have some special properties: for example, they demand little color resolution but high spatial resolution, which is exactly the opposite of natural images. Making full use of the features of text images is the key to layout analysis. The special properties of document images show mainly in the following aspects:

(1) Large gradients in all directions

Character images carry strong edge information, and the gradient generally reflects edge sharpness, so a pixel with a strong edge (gradient) indicates a possible text region. Expanding the text region represented by such a pixel by a certain size effectively connects the text blocks that belong to one text line. Taking the straight-line character of text lines into account, we use horizontal gradient detection and horizontal text region expansion.

As shown in Fig. 1, the single-character image model is a standard circle. The pixels on each circle carry strong edge information, indicating that the region is a possible text region; when several character models are effectively linked, they form one effective text block.

(2) The parts of a single-character image do not form one complete connected domain

A character image that represents a single character, particularly a Chinese character image, may have several distinct components. When a pixel is a possible text region, the expansion size therefore needs to be greater than the spacing between the components of a basic character; otherwise holes appear in the computed text blocks, which hinders layout analysis.

As shown in Fig. 2, when the text region marked by each pixel is expanded, the vertical expansion size should be at least greater than the number of pixels represented by d in the figure.

(3) Characters in the same text line are separated by gaps

The spacing between characters is an important basis for character segmentation algorithms. In layout analysis, text region expansion connects different character regions into one effective text block, whereas non-text regions lack this key feature. This is equivalent to abstracting each character image as a circle; connecting these circles gives the result of layout analysis.

As shown in Fig. 3, which marks the spacing between characters, effective text region merging requires the text region expansion size to be at least greater than the largest character spacing. Because this size is much smaller than the distance between effective gradient pixels in non-text regions, non-text regions can be removed effectively by length filtering.

(4) Text lines are separated by gaps

Layout structure calculation always expands text regions by a relatively large amount along the text direction, in order to connect the components within a character and the different characters. Expansion perpendicular to the text direction causes different text lines to alias, so the gap between text lines is used to constrain the expansion of the text region represented by each pixel.

As shown in Fig. 4, the line spacing d between character rows is a limiting condition: if the text region is expanded in the vertical direction by more than this distance, two valid text lines can no longer be distinguished.

(5) Text lines are straight

The result of layout analysis is expressed as a set of straight lines, and the straight-line character of text lines is used to track and position those lines. This feature can also be used to estimate and correct the skew angle of a document image.

(6) Characters vary widely in size, language, color, font, and other respects.

2. Layout structure calculation

Based on the above analysis of the features of text lines in images, this section describes in detail the layout structure calculation method and the layout structure representation model.

2.1 Horizontal gradient calculation

Regions with a large horizontal gradient indicate possible text line regions. The threshold that decides whether a gradient of the document image counts as large is determined by the one-dimensional maximum entropy segmentation method.

Denote the convolution kernel factor of the horizontal gradient, and denote the gray value of the pixel at a given width and height position in the image; the gray value at the corresponding position of the gradient image is then expressed as:

[Formula (1): image not reproduced in this text]
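The horizontal gradient step can be illustrated with a minimal Python sketch. The kernel of formula (1) is not legible in this text, so the sketch assumes a central-difference kernel [-1, 0, 1] and takes the absolute response clipped to 255; the function name `horizontal_gradient` and the sample image are illustrative only and are not taken from the patent.

```python
def horizontal_gradient(gray):
    """Absolute horizontal gradient of a grayscale image (list of rows).

    Assumption: a [-1, 0, 1] central-difference kernel; the kernel actually
    used in formula (1) is not available in this text.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    grad = [[0] * width for _ in range(height)]
    for y in range(height):
        row = gray[y]
        # Border columns are left at 0 because the kernel does not fit there.
        for x in range(1, width - 1):
            grad[y][x] = min(255, abs(row[x + 1] - row[x - 1]))
    return grad


# A bright vertical stroke on a dark background, and a flat row.
image = [
    [0, 0, 255, 255, 0, 0],
    [10, 10, 10, 10, 10, 10],
]
gradient = horizontal_gradient(image)
```

The stroke edges in the first row give a strong response, while the flat second row gives none, which is the behavior the text relies on to flag possible text regions.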

Given the complexity of document images, the gradient threshold should be chosen adaptively. The gray-level histogram of the gradient image is obtained by statistics. With the one-dimensional information entropy corresponding to a gray value denoted accordingly, the calculation of the segmentation point is equivalent to the following form:

[Formula (2): image not reproduced in this text]

The segmentation point is the one that corresponds to the maximum information entropy. In formula (2), the information entropy is calculated using the normalized probabilities before the segmentation point and after the segmentation point.
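One common reading of one-dimensional maximum entropy segmentation is sketched below in Python: for every candidate segmentation point, the histogram is split in two, each part is normalized by its own probability mass, and the point that maximizes the sum of the two entropies is kept. Since formula (2) is not legible here, this is an assumed form; the names `gray_histogram`, `entropy`, and `max_entropy_threshold` are hypothetical.

```python
import math


def gray_histogram(image, levels=256):
    """Probability of each gray value in the image (list of rows)."""
    counts = [0] * levels
    total = 0
    for row in image:
        for value in row:
            counts[value] += 1
            total += 1
    return [c / total for c in counts]


def entropy(probabilities, mass):
    """Shannon entropy of probabilities normalized by their total mass."""
    return -sum((p / mass) * math.log(p / mass) for p in probabilities if p > 0)


def max_entropy_threshold(prob):
    """Segmentation point maximizing entropy below plus entropy above it."""
    best_t, best_h = 0, float("-inf")
    for t in range(len(prob) - 1):
        low, high = prob[: t + 1], prob[t + 1:]
        mass_low, mass_high = sum(low), sum(high)
        if mass_low <= 0 or mass_high <= 0:
            continue  # one side is empty, so the split is not meaningful
        h = entropy(low, mass_low) + entropy(high, mass_high)
        if h > best_h:
            best_t, best_h = t, h
    return best_t


# Toy gradient image with two weak and two strong responses.
gradient = [[10, 12, 200, 202]]
threshold = max_entropy_threshold(gray_histogram(gradient))
binary = [[1 if v > threshold else 0 for v in row] for row in gradient]
```

Pixels above the threshold are treated as possible text region pixels in the later steps; the comparison `v > threshold` is a convention chosen for this sketch.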

2.2 Intra-character and inter-character region merging

Computing the layout features requires merging the possible text line regions. Experiments also show that character spacing and the spacing between text lines are difficult to estimate. The spacing between text lines is generally taken to be at least 2 pixels, so the expansion factor perpendicular to the text direction is set to 2, which ensures that text lines are not confused with one another in the layout. The spacing within a single character and the spacing between characters are usually smaller than the character width, so the expansion factor along the text direction is greater than the character width and is usually chosen as twice the character width; too large an expansion factor causes errors in detecting line block length. Once the expansion factor along the text direction is fixed, the effective filter length is that same expansion factor. Denote the expansion factor perpendicular to the text direction, the expansion factor along the text direction, the average character width, and the length filtering factor; the following relation then holds:

[Formula (3): image not reproduced in this text]

The only quantity that needs to be determined is the average character width. In practice, choosing a suitable character width satisfies most application requirements, because most character spacings are smaller than the character width. After two characters are successfully connected, the block length is exactly 3 times the character width, a length that is well suited to length filtering; this is also the important difference between text regions and non-text regions.

Denote the image obtained after segmentation by formula (2) and the image obtained from it by merging the intra-character and inter-character regions, and denote the gray value of the pixel at a given width and height position in the image; then:

[Formula (4): image not reproduced in this text]
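The merging step can be approximated by a rectangular expansion (dilation) of the segmented binary image. The Python sketch below uses the values given in the text, an expansion factor of 2 perpendicular to the text direction and twice the character width along it. How the window is placed around each pixel is an assumption (half of each factor on either side), because formula (4) is not legible here; `merge_regions` is a hypothetical name.

```python
def merge_regions(binary, char_width, vertical_factor=2):
    """Expand every text candidate pixel into a rectangle.

    Assumptions: the horizontal factor is 2 * char_width (as in the text),
    and the window reaches half of each factor on either side of the pixel.
    """
    horizontal_factor = 2 * char_width
    half_h = horizontal_factor // 2
    half_v = vertical_factor // 2
    height = len(binary)
    width = len(binary[0]) if height else 0
    merged = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            for yy in range(max(0, y - half_v), min(height, y + half_v + 1)):
                for xx in range(max(0, x - half_h), min(width, x + half_h + 1)):
                    merged[yy][xx] = 1
    return merged


# Two isolated candidate pixels four columns apart, character width 2.
segmented = [[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
merged = merge_regions(segmented, char_width=2)
```

With a character width of 2 the two candidates are joined into one run, which is what makes the subsequent length filtering meaningful.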

2.3 Text line marking method

The text line marking algorithm reduces the layout information to single pixels, which simplifies subsequent processing. The image to be marked is a binary image, so its upper edge or lower edge alone can describe the position and length of a text line.

Denote the vertical gradient detection kernel factor and the text line marking image; then:

[Formula (5): image not reproduced in this text]

The text line mark uses the upper edge or the lower edge. If a text line is interrupted, or the undulation of the character image makes the mark fluctuate, this defect is overcome by repairing the image to be marked before the marking image is calculated; the repaired image is smoother, which favors tracking of the effective target. As shown in Fig. 5, 1 to 9 each denote a pixel and black shading denotes the text region; when a step appears, the position corresponding to 4 or 6 is filled in as text region.
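A minimal Python sketch of upper-edge marking follows. It assumes that the vertical gradient of formula (5) amounts to keeping a text pixel whenever the pixel directly above it is not text; the actual kernel is not legible in this text. The step repair of Fig. 5 is not included because the figure is not available here, and `mark_upper_edge` is a hypothetical name.

```python
def mark_upper_edge(merged):
    """Keep only the upper edge of each text block in a binary image.

    Assumption: a pixel is an upper-edge pixel if it is text and the pixel
    directly above it is non-text (or lies outside the image).
    """
    height = len(merged)
    width = len(merged[0]) if height else 0
    marks = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            above = merged[y - 1][x] if y > 0 else 0
            if merged[y][x] and not above:
                marks[y][x] = 1
    return marks


# A text block two pixels high between empty rows.
block = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
marks = mark_upper_edge(block)
```

The result is a one-pixel-high trace per text block, which is the single-pixel property that the tracking step depends on.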

2.4 Text line target tracking and positioning

To obtain a valid text line together with its length and position, the line tracking algorithm examines the contiguous pixels in three directions. Eliminating steps makes the tracking and positioning of text line targets very accurate, and the present invention uses the number of pixels on the continuous curve to represent the length of a text line.

As shown in Fig. 6, target marking examines the pixel states in 3 directions. Because layout analysis of text lines uses the upper edge or lower edge of the text line block, the layout structure feature is guaranteed to be one pixel wide, which makes the marking algorithm simpler and more effective: it only needs to check whether any of the three directions adjacent to the current pixel contains a pixel belonging to the layout structure. The marking algorithm describes the basic features of a line with a five-tuple:

(X_min, X_max, Y_min, Y_max, P_total)    (6)

The five-tuple has the following meaning: X_min is the minimum width coordinate of the bounding rectangle of the line; X_max is the maximum width coordinate of the bounding rectangle; Y_min is the minimum height coordinate of the bounding rectangle; Y_max is the maximum height coordinate of the bounding rectangle; and P_total is the effective length of the line.
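The tracking step can be sketched in Python as follows. The sketch assumes that the three directions are right, upper right, and lower right of the current pixel, and that each curve is followed from its leftmost pixel; Fig. 6 is not available here, so both points are assumptions. `LineFeature` and `track_lines` are hypothetical names; the fields mirror the five-tuple of formula (6).

```python
from typing import NamedTuple


class LineFeature(NamedTuple):
    x_min: int
    x_max: int
    y_min: int
    y_max: int
    p_total: int  # number of pixels on the tracked curve


def track_lines(marks):
    """Follow one-pixel-wide curves left to right in a binary mark image."""
    height = len(marks)
    width = len(marks[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    lines = []
    # Column-major scan so that every curve is entered at its leftmost pixel.
    for x in range(width):
        for y in range(height):
            if not marks[y][x] or visited[y][x]:
                continue
            cx, cy = x, y
            x_min = x_max = x
            y_min = y_max = y
            count = 0
            while True:
                visited[cy][cx] = True
                count += 1
                x_max = max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                step = None
                for dy in (0, -1, 1):  # right, upper right, lower right
                    ny, nx = cy + dy, cx + 1
                    if 0 <= ny < height and nx < width \
                            and marks[ny][nx] and not visited[ny][nx]:
                        step = (nx, ny)
                        break
                if step is None:
                    break
                cx, cy = step
            lines.append(LineFeature(x_min, x_max, y_min, y_max, count))
    return lines


# One curve with a one-pixel step and one straight curve.
mark_image = [
    [0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
lines = track_lines(mark_image)
```

Because only three neighbours are examined per pixel, no connected-component labelling is needed, which is consistent with the low complexity claimed in the text.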

2.5 Length filtering

Length filtering takes place after intra-character and inter-character region merging. Its purpose is to filter out non-text regions. The basic idea is to judge whether the two sides of a possible text region pixel satisfy the length filtering requirement. The length corresponding to the pixel is computed by target tracking; if this length exceeds the filter length, the computation stops and the pixel is accepted as a true text region pixel; otherwise it is a non-text region pixel.

As shown in Fig. 7, "*" denotes a possible text region pixel (the target currently being judged) and white pixels denote non-text regions. There are four text region pixels on each of the left and right sides of the current target, so the effective length of the current pixel is 8. In fact, the effective length of every text region pixel shown in the figure is 8; if the filter length parameter is 5, all the text region pixels shown are retained.

Text line features are described by formula (6). Depending on the needs of the application, length filtering may be applied again; because the length of each text line has already been computed during feature description, small targets can easily be filtered out by length.
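The Fig. 7 example can be reproduced with a simplified Python sketch that measures horizontal runs instead of using target tracking. The effective length of a pixel is taken as the number of other text pixels in its run, so a run of nine pixels gives an effective length of 8, matching the figure description; this reading and the name `length_filter` are assumptions.

```python
def length_filter(binary, filter_length):
    """Keep text pixels whose horizontal run is longer than the filter length.

    Simplification: runs are measured along rows only, and the effective
    length of a pixel is the run length minus the pixel itself.
    """
    filtered = []
    for row in binary:
        width = len(row)
        kept = [0] * width
        x = 0
        while x < width:
            if not row[x]:
                x += 1
                continue
            end = x
            while end < width and row[end]:
                end += 1
            effective = (end - x) - 1
            if effective > filter_length:
                for i in range(x, end):
                    kept[i] = 1
            x = end
        filtered.append(kept)
    return filtered


rows = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # run of nine: effective length 8
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # run of three: effective length 2
]
filtered = length_filter(rows, filter_length=5)
```

With a filter length of 5 the long run is retained and the short run is discarded as a non-text region.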

3. Overall flow of the present invention

The algorithm avoids preprocessing of the document image and uses one-dimensional maximum entropy segmentation of the gradient image to determine the segmentation point between text regions and non-text regions. The layout analysis takes full account of the basic features that distinguish document images from natural images, establishes a single-character model, and considers intra-character spacing, character spacing, and line spacing; after length filtering, very good results are obtained.

As shown in Fig. 8, if the input image is an RGB image, grayscale conversion is performed first. Horizontal gradient calculation is then performed on the image. After line block marking, vertical gradient calculation is performed on the image; the vertical gradient is calculated with the same formula as the horizontal gradient, so it is not repeated here. Target tracking then marks the image, and length filtering and step elimination are optionally performed. To make it easy to compare the results of the layout structure calculation, the output image contains both the layout structure and the grayscale image.

Embodiment one

To test the effectiveness of the algorithm, we chose a Sohu web page with a relatively complex layout structure as the test sample. The parameters were initialized as:

[Formula (7): image not reproduced in this text]

(1) Layout analysis of a mixed text-and-graphics image: see Fig. 9 to Fig. 14.

(2) Effect of image skew on the algorithm

See Fig. 15 and Fig. 16. To test the adaptability of the algorithm, we rotated the original image by a certain angle (5 degrees), using bilinear interpolation for the rotation. Viewed in the horizontal direction, the long lines alias. Bilinear interpolation tends to blur boundaries, and skew produces more staircase artifacts along straight lines, but the test case shows little effect on length filtering; the experiment demonstrates that the method is reasonably stable for images with a small skew angle. When the skew angle is too large, the angle can be estimated and the image corrected before layout analysis is carried out. Skew angle estimation based on document image content already has some research foundation; combining morphological methods with the Hough transform can effectively improve the reliability and accuracy of skew angle calculation.

Embodiment two

Layout analysis of English handwriting

See Fig. 17 to Fig. 20. Document images of English handwriting have some typical features: for example, aliasing between lines is frequent and the step phenomenon in line blocks is particularly pronounced. Experiments show that the algorithm is robust in layout analysis of such images, with notable results.

In summary, the layout structure calculation of the present invention is simple and effective, has a degree of adaptability, and can handle images deflected by a small angle; tests on aliased text lines of English handwriting show that the invention has good robustness. The algorithmic complexity of the method is low: length filtering and the tracking and positioning of text line targets avoid connectivity computation, and image thinning is replaced by a vertical gradient algorithm, so the method can be used in real-time image processing systems. The method can be applied in many image processing fields, such as document image retrieval, character segmentation, document image classification, document image analysis, and text-and-graphics segmentation.

Claims

1. A fast method for calculating the layout structure of a document image, characterized in that it comprises the following steps:

(2) performing horizontal gradient calculation on the input image;

(3) merging the intra-character and inter-character regions of the input image;

(4) marking the text lines of the input image;

(5) performing target tracking and positioning on the text lines;

(7) outputting the layout structure image and the grayscale image.

2. The method for calculating the layout structure of a document image according to claim 1, characterized in that step (2) comprises the following steps:

(2.1) Denote the convolution kernel factor of the horizontal gradient, and denote the gray value of the pixel at a given width and height position in the image; the gray value at the corresponding position of the gradient image is then expressed as:

[Formula (1): image not reproduced in this text]

(2.2) The gray-level histogram of the gradient image is obtained by counting the probability with which pixels of each gray value occur in the image. With the one-dimensional information entropy corresponding to a gray value denoted accordingly, the calculation of the segmentation point is equivalent to the following form:

[Formula (2): image not reproduced in this text]

(2.3) The segmentation point is the one that corresponds to the maximum information entropy. In formula (2), the information entropy is calculated using the normalized probabilities before the segmentation point and after the segmentation point.

3. The method for calculating the layout structure of a document image according to claim 2, characterized in that step (3) comprises the following steps:

(3.1) Denote the expansion factor perpendicular to the text direction, the expansion factor along the text direction, the average character width, and the length filtering factor; the following relation then holds:

[Formula (3): image not reproduced in this text]

(3.2) Denote the image obtained after segmentation by formula (2) and the image obtained from it by merging the intra-character and inter-character regions, and denote the gray value of the pixel at a given width and height position in the image; then:

[Formula (4): image not reproduced in this text]

4. The method for calculating the layout structure of a document image according to claim 2, characterized in that step (4) comprises the following steps:

(4.1) Denote the vertical gradient detection kernel factor and the text line marking image; then:

[Formula (5): image not reproduced in this text]

(4.2) The text line mark uses the upper edge or the lower edge. If a text line is interrupted, or the undulation of the character image makes the mark fluctuate, this defect is overcome by repairing the image to be marked before the marking image is calculated; the repaired image is smoother, which favors tracking of the effective target.