CN104966051A

CN104966051A - Method of recognizing layout of document image

Info

Publication number: CN104966051A
Application number: CN201510297257.2A
Authority: CN
Inventors: 时金桥; 范晓鹏; 陈小军; 郭莉; 蒲以国; 文新; 邹亚劼; 王洋
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-06-03
Filing date: 2015-06-03
Publication date: 2015-10-07
Anticipated expiration: 2035-06-03
Also published as: CN104966051B

Abstract

The invention discloses a method for recognizing the layout of a document image. The method comprises the steps of first designing a layout storage function, according to which layout content and a layout serial number generated by the relative word height and alignment of the layout content are saved in a storage; and if an unknown image is subjected to layout analysis and the obtained layout serial number is the same as one in the storage, then extracting the layout information of the unknown image based on a prompt message in the storage. The invention adopts a highly efficient and accurate layout analysis method for recognizing a document image, and is especially suitable for layout recognition of a Chinese official document image.

Description

A layout recognition method for document images

Technical field

The invention belongs to the field of pattern recognition and relates to a layout recognition method for scanned document images.

Background art

In recent years, with China's rapid economic development, government departments have issued more and more directives and policies. National and local policies are issued in the form of official documents, and with advances in technology a growing number of official documents and similar papers are stored as images. Faced with a huge number of official documents in differing layouts, the layout of each document needs to be identified automatically rather than manually.

Official documents here are the official documents of Party and government organs. The kind of an official document is referred to for short as its document type. The "Interim Measures for the Handling of Official Documents of State Administrative Organs", issued by the General Office of the State Council, groups the official documents of state administrative organs into nine categories and thirteen types, including orders, decisions, bulletins, notices, circulars, proposals, reports, requests for instructions, written replies, opinions, letters, and meeting minutes. An official document contains attributes such as the copy number, the security classification and confidentiality period, the urgency level, the issuing authority mark, the document number, the signer, the dividing line in the header, the title, the main recipient, and the body text. In practice, a given official document does not necessarily contain all of these attributes. As the number of official documents grows and electronic devices such as scanners become widespread, official documents are stored as scanned images, so performing layout recognition on such images effectively is necessary.

Up to now there has been no good method for detecting a particular kind of document image among a large number of images and correctly extracting the corresponding information from it. At present, layout analysis uses different techniques for different documents. Ma Zhuan, Ren Zhanpeng, et al. studied an automatic exam-marking system based on OCR. Their approach is top-down: it starts from the page as a whole, emphasizes global information, divides the image into several regions, and then continues to subdivide the main regions according to the hierarchical structure of the text image. Wu Yukun studied a business-card system based on OCR, which uses bottom-up layout analysis: it starts from the image pixels, emphasizes local information, and progressively merges small regions into larger ones (character, word, text line, paragraph, and so on) until the whole image is covered. These methods all target layouts whose fonts are of similar size, and they rely on algorithms such as template matching and connected-component analysis, whose drawbacks are heavy computation and low speed. Existing text-line and character segmentation methods cannot segment accurately where Chinese characters and digits are mixed or where characters of different sizes are mixed, yet in an official-document recognition system the document number, the issuing department, the title, and so on all differ in font size. An efficient and accurate layout analysis method is therefore needed to recognize document images.

Summary of the invention

In view of the above problems, the object of the invention is to provide a layout recognition method for document images that recognizes document images through efficient and accurate layout analysis and is especially suitable for layout recognition of Chinese official-document images.

To achieve this object, the invention adopts the following technical solution:

A layout recognition method for document images, comprising the following steps:

1) Generate a layout feature library from the layout images of different document samples.

Further, the layout feature library stores the layout content of the different document samples together with a layout serial number generated from the relative character height and the alignment of that content.

To extract layout information more accurately, the invention first provides a layout storage function. Through a user interface, a layout image is input and the user draws rectangular boxes to indicate which block is the title, which block is the issuing department, which block is the document number, and so on, and the layout is then stored. The library stores the layout content and a layout serial number generated from the relative character height and the alignment of that content. This serial number is very important for extracting layout information. It is formed from the sequence produced by ranking the blocks and the numeric sequence produced by their alignment. For example, suppose a layout has 3 blocks and the first sequence, produced from the ranking result, is 001221. The first 0 denotes the first block and the second 0 indicates that the first block is the largest; the 1 denotes the second block and the 2 that follows indicates that the second block is the third largest; and so on. The second sequence, produced from the alignment, is 212, where 2 denotes centered and 1 denotes right-aligned. The serial number is therefore 001221212.
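The serial number construction in the example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `layout_serial`, the pixel heights, and the code 0 for left alignment are assumptions (the text only gives 1 for right-aligned and 2 for centered), and single-digit encoding only works for fewer than ten blocks.

```python
from typing import Sequence

# Alignment digits: 1 = right and 2 = center come from the text;
# 0 = left is an assumption made for this sketch.
ALIGN_CODE = {"left": "0", "right": "1", "center": "2"}


def layout_serial(heights: Sequence[float], alignments: Sequence[str]) -> str:
    """Build a layout serial number from per-block character heights and alignments."""
    # Rank the blocks by character height, largest first (rank 0 = largest).
    order = sorted(range(len(heights)), key=lambda i: -heights[i])
    rank = {block: r for r, block in enumerate(order)}
    # First sequence: one (block index, size rank) digit pair per block.
    size_part = "".join(f"{i}{rank[i]}" for i in range(len(heights)))
    # Second sequence: one alignment digit per block.
    align_part = "".join(ALIGN_CODE[a] for a in alignments)
    return size_part + align_part


# Three blocks: the first is the largest, the second the smallest.
serial = layout_serial([40, 18, 24], ["center", "right", "center"])
print(serial)
```

With these hypothetical heights the first sequence is 001221 and the second is 212, which reproduces the serial number 001221212 from the example.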

The layout analysis stage yields only a serial number. If an unknown image undergoes layout analysis and the resulting layout serial number is the same as a serial number in the library, the information in the library is used to extract the layout information of the unknown image. The layout feature library generated in this way improves the accuracy of layout information extraction.

2) Scan the document to be recognized to obtain a scanned image.

This step may also include preprocessing the scanned image. The preprocessing includes denoising (removing ink marks and removing seals), skew correction, and so on.

Some documents pick up ink smudges during printing, and scanning may introduce other noise, especially salt-and-pepper noise. In addition, some document images carry stamped seals, which interfere with the normal layout regions and cause the subsequent OCR (Optical Character Recognition) to return garbled text. Furthermore, skew in the document image interferes with text-line segmentation. The system of the invention therefore needs to provide image denoising to strengthen its robustness and accuracy.

3) Divide the scanned image into regions and determine the body text of the document to be recognized.

Text-line segmentation is performed on the scanned image according to projection information, with the cut positions determined mainly from the texture features of the black and white pixels. The minimum font size among the text lines is found; the text lines are searched bottom-up for the last line of the body text, and then top-down for a first line of the body text that matches that last line. If the first or last line of the body text cannot be found, the first line is marked as 0 and the last line is marked as the final text line. The body text of the document lies between its first line and its last line.

4) Divide the part above the body text of the document to be recognized into regions, and obtain the layout information of each region.

In the part above the body text, lines with the same character height, line spacing, and alignment are placed in the same region. If, within one region, the left side contains several text lines and the right side contains only one, the region must be divided again, with the right-hand text line treated as a subregion of the region.

The divided regions produce a layout serial number, which is generated from the alignment and the relative character height.

The layout information comprises the font size in the region, its rank, and the alignment of the region relative to the whole scanned image.

5) Match the layout information obtained in step 4) against the layout information in the layout feature library. If a match is found, extract the corresponding layout information from the library. If no match is found, match the layout information of each region against preset layout word sets (when the document is an official document, these comprise a title word set, a department word set, and a document-number word set) to obtain the layout recognition result.

Specifically, the layout information obtained in step 4) concerns the document image to be recognized and consists mainly of the layout serial number and the OCR result of each region. The layout information in the layout feature library consists mainly of the rule corresponding to each stored image, namely: 1) the layout serial number; 2) the information labels (that is, the region number corresponding to each item of information), such as which block is the title, which block is the issuing department, and which block is the document number. If an image to be processed matches a serial number, the information labels are used to extract the corresponding information from it. For example, with the label title: 1, the 1 indicates that the first region is the title.
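The lookup described in step 5) can be sketched as a dictionary keyed by serial number. This is only an illustration under stated assumptions: the names `LAYOUT_LIBRARY` and `extract_by_serial`, the field names, and the sample entry are hypothetical, and region numbers are taken to be 1-based as in the "title: 1" example.

```python
from typing import Dict, List, Optional

# Hypothetical library content: serial number -> information labels,
# where each label maps a field name to a 1-based region number.
LAYOUT_LIBRARY: Dict[str, Dict[str, int]] = {
    "001221212": {"title": 1, "issuing_department": 2, "document_number": 3},
}


def extract_by_serial(
    serial: str,
    region_texts: List[str],
    library: Dict[str, Dict[str, int]] = LAYOUT_LIBRARY,
) -> Optional[Dict[str, str]]:
    """Return field -> OCR text if the serial number is in the library, else None."""
    labels = library.get(serial)
    if labels is None:
        return None  # no match: the caller falls back to word-set matching
    # Region numbers are 1-based, so subtract 1 to index the OCR results.
    return {field: region_texts[number - 1] for field, number in labels.items()}


ocr_results = ["region one text", "region two text", "region three text"]
print(extract_by_serial("001221212", ocr_results))
print(extract_by_serial("999999999", ocr_results))
```

Returning `None` on a miss keeps the two branches of step 5) separate: the caller decides when to fall back to the preset word sets.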

The above steps complete the analysis of the image layout and finally extract the corresponding layout information correctly. Finding the body text of the document image and determining the layout regions above the body text are the core of the invention.

The beneficial effects of the invention are as follows:

Compared with the prior art, the layout recognition method provided by the invention achieves higher recognition accuracy, precision, and efficiency, and has greater practicality and application value.

Brief description of the drawings

Fig. 1 is the overall flow chart of the layout recognition method of the invention.

Fig. 2 is a schematic diagram of the official document in Embodiment 1 of the invention.

Fig. 3 is a schematic diagram of the layout information extracted in Embodiment 1 of the invention.

Fig. 4 is a schematic diagram of the official document in Embodiment 2 of the invention.

Fig. 5 is a schematic diagram of the layout information extracted in Embodiment 2 of the invention.

Detailed description of the embodiments

Taking Chinese official documents as an example, embodiments of the invention are described in detail below with reference to the accompanying drawings.

The overall flow of the layout recognition method of the invention is shown in Fig. 1 and comprises five steps:

1. Preprocess the scanned image of the official document: resize the image, remove blur, correct skew, and so on, to facilitate layout recognition of the official document. The processing is as follows:

(1) Salt-and-pepper noise removal. Following the idea of switching filtering, the invention uses the max-min operator as the salt-and-pepper noise detector. An adaptive neighborhood window scans the image row by row from left to right, and the pixel at the window center is tested for noise. If the gray value of the pixel lies between the maximum and the minimum, the pixel is considered not to be polluted by noise. If the gray value equals an extreme value, the pixel may be polluted by salt-and-pepper noise; an improved method is then used to make a further determination, and the result of the operation is used as the replacement value for the pixel.
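A minimal sketch of a switching filter with a max-min detector is given below. It simplifies the description in two ways that should be read as assumptions: the window is a fixed 3x3 neighborhood rather than an adaptive one, and the unspecified "improved method" for the second test is replaced by a plain comparison against the neighborhood median with a hypothetical threshold `t`.

```python
from statistics import median
from typing import List


def switching_filter(img: List[List[int]], t: int = 40) -> List[List[int]]:
    """Replace suspected salt-and-pepper pixels in a grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            # Neighbours of (y, x) in a 3x3 window clipped at the image border.
            neigh = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if (j, i) != (y, x)
            ]
            if not neigh:
                continue  # 1x1 image: nothing to compare against
            centre = img[y][x]
            # Max-min detector: a value strictly between the local extremes
            # is treated as clean and left untouched.
            if min(neigh) < centre < max(neigh):
                continue
            # Second test (assumed): only pixels far from the neighbourhood
            # median are treated as noise and replaced by that median.
            m = int(median(neigh))
            if abs(centre - m) > t:
                out[y][x] = m
    return out


noisy = [[100, 100, 100], [100, 255, 100], [100, 100, 100]]
print(switching_filter(noisy))
```

Because the detector skips every pixel that lies strictly between the local extremes, smooth gradients pass through unchanged and only isolated outliers are rewritten.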

(2) Removal of the seal in the part above the title. Canny edge detection is used to find contours. Based on values trained from a number of samples, when the area of an edge contour exceeds a certain threshold, the contour is very likely a seal and can be removed.

(3) Skew correction. The number of black pixels in the image is accumulated into histograms by projecting along the row and column directions, which gives the horizontal and vertical projections. For a skewed image, the mean square deviation of the projection taken along the skew direction of the text is the largest. The document image is therefore rotated through a range of angles at a fixed angular step, the projection of each rotated image is computed, and the rotation angle that maximizes the mean square deviation of the projection is taken as the skew angle.
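The search over rotation angles can be sketched without an imaging library by rotating the coordinates of the black pixels instead of the image. This is an illustration only: the function names, the angle range `max_angle`, and the step `step` are assumptions, and the variance of the row histogram is used as the mean square deviation.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) coordinate of a black pixel


def projection_variance(points: Sequence[Point], angle_deg: float) -> float:
    """Variance of the horizontal projection after rotating the points by angle_deg."""
    a = math.radians(angle_deg)
    s, c = math.sin(a), math.cos(a)
    # Only the rotated row coordinate matters for a horizontal projection.
    rows = Counter(round(x * s + y * c) for x, y in points)
    lo, hi = min(rows), max(rows)
    counts = [rows.get(r, 0) for r in range(lo, hi + 1)]
    mean = sum(counts) / len(counts)
    return sum((n - mean) ** 2 for n in counts) / len(counts)


def estimate_skew(points: Sequence[Point], max_angle: float = 5.0, step: float = 0.5) -> float:
    """Return the rotation angle (degrees) that maximizes the projection variance."""
    steps = int(round(2 * max_angle / step))
    angles = [-max_angle + k * step for k in range(steps + 1)]
    return max(angles, key=lambda ang: projection_variance(points, ang))


# Three synthetic text lines, each tilted by 2 degrees.
tilt = math.tan(math.radians(2.0))
pts = [(x, y0 + x * tilt) for y0 in (0, 20, 40) for x in range(200)]
print(estimate_skew(pts))
```

The returned value is the rotation that straightens the lines, so in this sketch a page tilted by +2 degrees yields -2.0.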

2. Perform text-line segmentation on the official document according to projection information. For the determined text area, count the black pixels in every row. Find the first place where three consecutive rows each contain more than 3 black pixels, and mark the current row as the start row of the text line. Starting from that row, compute the average black-pixel count of the first eight rows and count the black pixels of every column within these eight rows; the first column whose count is at least 5 is the start column of the text line, and the last column whose count is at least 5 is its end column. Divide the span between the start column and the end column into 5 equal regions. If the black-pixel count in two of the regions is less than 3, mark the current row as the end row of the text line; otherwise continue scanning with the next row. The span between the start row and the end row is the result of the text-line segmentation.
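The row-wise part of this segmentation can be sketched as follows. The sketch is deliberately simplified and is not the full procedure above: it keeps the "more than 3 black pixels for at least three consecutive rows" start test, but it ends a line as soon as a row falls back to 3 or fewer black pixels and omits the column search and the five-region end test. The function name and parameters are assumptions.

```python
from typing import List, Tuple


def segment_lines(
    binary: List[List[int]], min_black: int = 3, min_run: int = 3
) -> List[Tuple[int, int]]:
    """Return inclusive (start_row, end_row) spans of text lines; 1 means black."""
    counts = [sum(row) for row in binary]  # horizontal projection
    lines: List[Tuple[int, int]] = []
    start = None
    # A trailing 0 acts as a sentinel that closes a line touching the bottom edge.
    for y, n in enumerate(counts + [0]):
        if n > min_black:
            if start is None:
                start = y
        elif start is not None:
            # Keep the run only if it spans at least min_run consecutive rows.
            if y - start >= min_run:
                lines.append((start, y - 1))
            start = None
    return lines


blank, ink = [0] * 10, [1] * 10
page = [blank, ink, ink, ink, ink, blank, ink, ink, blank, ink, ink, ink]
print(segment_lines(page))
```

In the sample page the two-row run is discarded as noise because it is shorter than three rows.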

3. Compute the height of every line and, from the alignment and line-height information, determine the lines that make up the body text. Find the minimum font size among the text lines and scan the text lines bottom-up. The last line of the body text is the text line that satisfies the following conditions: its font size is within two pixels of the minimum font size; it is justified or left-aligned; and its spacing after the paragraph is within two pixels of that of the text line that has the minimum font size. Then scan the text lines top-down. The first line of the body text is the text line that satisfies the following conditions: its font size is within two pixels of the minimum font size; it is justified or right-aligned; and its spacing after the paragraph is within two pixels of that of the text line that has the minimum font size. If the first or last line of the body text cannot be found, the first line is marked as 0 and the last line is marked as the final text line. The part above the body text found in this way is the region on which layout recognition is performed.
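The two scans can be sketched over a list of per-line records. The record type `TextLine`, its field names, and the tolerance parameter `tol` are hypothetical, and the fallback is read here as applying to both boundaries whenever either one is missing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextLine:
    font_size: int    # character height in pixels
    align: str        # "left", "right", "center" or "justify"
    space_after: int  # spacing after the line in pixels


def find_body(lines: List[TextLine], tol: int = 2) -> Tuple[int, int]:
    """Return (first, last) indices of the body text among the text lines."""
    ref = min(lines, key=lambda l: l.font_size)  # line with the minimum font size

    def similar(l: TextLine) -> bool:
        return (abs(l.font_size - ref.font_size) <= tol
                and abs(l.space_after - ref.space_after) <= tol)

    # Bottom-up scan for the last body line: justified or left-aligned.
    last = next((i for i in range(len(lines) - 1, -1, -1)
                 if similar(lines[i]) and lines[i].align in ("justify", "left")), None)
    # Top-down scan for the first body line: justified or right-aligned.
    first = next((i for i, l in enumerate(lines)
                  if similar(l) and l.align in ("justify", "right")), None)
    if first is None or last is None:
        return 0, len(lines) - 1  # fallback stated in the text
    return first, last


page = [
    TextLine(40, "center", 30),   # issuing authority mark
    TextLine(24, "center", 20),   # title
    TextLine(16, "justify", 8),
    TextLine(16, "justify", 8),
    TextLine(16, "left", 8),
]
print(find_body(page))
```

Everything before the returned first index is then the part above the body text that step 4 divides into regions.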

4. Determine the regions according to connectivity information, perform text-line segmentation within each region, and save information such as the text-line height and number of lines in the region, the start position of the region, and the alignment of the region relative to the whole scanned image. The steps are as follows:

(1) Perform horizontal projection on the region above the body text to form text lines and pre-divide the regions.

a) Denoise the horizontal projection to remove the influence of straight lines and isolated points: filter out runs of consecutive projection rows whose length is less than or equal to 7, and filter out runs whose length is greater than 7 and less than or equal to 10 and whose mean horizontal projection value is less than or equal to 20.

b) Merge projected text lines into regions. Scan the horizontal projection result from top to bottom and apply the following rules: (1) if two consecutive projected text lines have the same font size (the absolute difference is less than or equal to 2) and the line gap is less than or equal to 2 times the font size, merge the two lines into one region; (2) if two consecutive lines have similar font sizes (the absolute difference is greater than 2 and less than or equal to 4) and the line gap is less than or equal to 1 times the font size, merge the two lines into one region; (3) lines are also merged when the lower line has a larger font size than the upper line, the difference is less than or equal to 10, the line gap is less than or equal to 1 times the font size, and the line gap between the third and second lines and the font sizes of the third and first lines also satisfy the first two rules.
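The first two merging rules can be sketched as below. This is a partial illustration: rule (3) is omitted, the font size of a line is approximated by the height of its row span, and using the smaller of the two font sizes as the reference for the gap test is an assumption, since the text does not say which line's font size is meant.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (top_row, bottom_row) of a projected text line, inclusive


def should_merge(upper: Span, lower: Span) -> bool:
    """Apply merging rules (1) and (2) to two consecutive projected text lines."""
    size_u = upper[1] - upper[0] + 1
    size_l = lower[1] - lower[0] + 1
    gap = lower[0] - upper[1] - 1       # blank rows between the two lines
    diff = abs(size_u - size_l)
    base = min(size_u, size_l)          # assumed reference font size
    if diff <= 2:                       # rule (1): same font size
        return gap <= 2 * base
    if diff <= 4:                       # rule (2): similar font size
        return gap <= base
    return False


def merge_lines(lines: List[Span]) -> List[List[Span]]:
    """Group top-to-bottom text lines into regions."""
    regions: List[List[Span]] = []
    for line in lines:
        if regions and should_merge(regions[-1][-1], line):
            regions[-1].append(line)
        else:
            regions.append([line])
    return regions


spans = [(0, 39), (60, 99), (200, 219), (230, 249), (300, 323)]
print(merge_lines(spans))
```

Each line is compared only with the last line of the current region, which matches a single top-to-bottom pass over the projection.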

(2) Finalize the division of each pre-divided region.

a) Perform vertical projection on the region, denoise the projection result, and store the column start position, end position, and width of the region.

b) Divide the region into text lines and record the text information: perform horizontal projection on the region, denoise the projection result, redetermine the text-line information, and record the details of the text lines in the region.

c) Judge whether the vertical projection contains a large blank (a large blank is a run of consecutive white points at least 10 times the line height of the region). If so, go to d); if not, go to e).

d) Divide the region into several regions at the large blanks.

i. Determine the start and end positions of the rows and columns, the height, and the width of each region after division.

ii. Perform horizontal projection on each region after division, denoise the projection result, redetermine the text-line information, and record the details of the text lines in the region.

e) Examine the text lines in the region to judge whether the region is a case of several text lines corresponding to one text line.

iii. Pre-divide the region into three subspaces (left, middle, and right). The left subspace runs from the left start of the region to 1/3 of the region length, the middle subspace from 1/3 to 2/3, and the right subspace from 2/3 to the end of the region.

iv. Perform horizontal projection on each of the three subspaces and denoise the projection results.

v. Record the text-line information of each subspace (number of text lines, start and end positions, line height, and line gap).

vi. Judge the relation between the text lines in the three subspaces and the whole region. The special case is the following: the right subspace contains one text line, and at least one of the left and middle subspaces contains two or more text lines; and the text line in the right subspace takes up the height of the whole region (more than 95%) or lies in the middle of the horizontal projection of the region. This case needs special handling: go to f). Otherwise, finish.

f) Handle the case of several text lines corresponding to one text line.

i. Make the part with several text lines the region, and make the remaining part with one text line the attached subregion of that region. Determine the current region and its attached subregion (according to the vertical projection).

ii. Check whether the current region can be merged with the previous region; the merging principle is similar to b) in (1). If it can, merge; if not, continue.

iii. Check whether the current region can be merged with the next region; the merging principle is similar to b) in (1). If it can, merge; if not, continue.

iv. Determine the start and end positions of the rows and columns, the height, and the width of the region after merging or checking.

v. Perform horizontal projection on the region, denoise the projection result, redetermine the text-line information, and record the details of the text lines in the region.

Once all regions of the current official document have been determined, traverse each region to obtain the layout information: the font size in the region, its rank, and the alignment of the region are extracted as the layout information.
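Steps c) and d) of the region division, which split a region wherever its vertical projection contains a large blank, can be sketched as follows. The function name and the `factor` parameter are assumptions; the value 10 is the multiple of the line height given in step c).

```python
from typing import List, Tuple


def split_at_blanks(
    col_counts: List[int], line_height: int, factor: int = 10
) -> List[Tuple[int, int]]:
    """Split a region into inclusive (start_col, end_col) parts at large blanks.

    col_counts is the vertical projection: black pixels per column.
    """
    min_gap = factor * line_height  # a "large blank" is at least this many empty columns
    parts: List[Tuple[int, int]] = []
    start = None
    last_ink = None
    for x, n in enumerate(col_counts):
        if n == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run since the last inked column is large: close the part.
            parts.append((start, last_ink))
            start = x
        last_ink = x
    if start is not None:
        parts.append((start, last_ink))
    return parts


projection = [1] * 5 + [0] * 30 + [2] * 4 + [0] * 3 + [1] * 2
print(split_at_blanks(projection, line_height=2))
```

Small gaps, such as the spaces between characters, stay inside one part because they are shorter than the threshold.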

5. Match the information retained above against the rules in the layout feature library (including position matching and keyword matching). If a match is found, extract the layout information through the layout feature library. If no layout serial number matches, match each recognized region against the preset title word set, department word set, and document-number word set to obtain the layout recognition result.
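The fallback branch, used when no serial number matches, can be sketched as a keyword search over the OCR text of each region. The word sets, field names, and sample strings below are invented placeholders for illustration; the actual word sets are not given in the text.

```python
from typing import Dict, List, Set

# Hypothetical word sets; real ones would hold typical title words,
# department names and document-number markers.
WORD_SETS: Dict[str, Set[str]] = {
    "title": {"Notice", "Decision"},
    "issuing_department": {"Department", "Bureau"},
    "document_number": {"No."},
}


def match_by_keywords(
    region_texts: List[str], word_sets: Dict[str, Set[str]] = WORD_SETS
) -> Dict[str, str]:
    """Assign to each field the first region whose text contains one of its words."""
    result: Dict[str, str] = {}
    for field, words in word_sets.items():
        for text in region_texts:
            if any(word in text for word in words):
                result[field] = text
                break  # keep the first matching region for this field
    return result


regions = ["Example Provincial Department", "Doc No. 12", "Notice on an example matter"]
print(match_by_keywords(regions))
```

A field with no matching region is simply absent from the result, so the caller can tell which attributes the document lacks.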

Embodiment 1

An official document of the Anhui Provincial Department of Environmental Protection is shown in Fig. 2, and the layout information obtained by layout detection and extraction is shown in Fig. 3.

The image is first divided into regions, which yields the serial number and the OCR result of each region. These are matched against the layout library by the method described above. The match hits the first sample image among the layouts (hit id=0 in Fig. 3), and information is extracted according to the rule of the hit layout.

Embodiment 2

An official document of the National Audit Office is shown in Fig. 4, and the layout information obtained by layout detection and extraction is shown in Fig. 5.

Claims

1. A layout recognition method for document images, comprising the following steps:

1) generating a layout feature library from the layout images of different document samples;

2) scanning a document to be recognized to obtain a scanned image;

3) performing text-line segmentation on the scanned image and determining the body text of the document to be recognized;

4) dividing the part above the body text of the document to be recognized into regions, and obtaining the layout information of each region;

5) matching the layout information obtained in step 4) against the layout information in the layout feature library; if a match is found, extracting the corresponding layout information from the layout feature library; if no match is found, matching the layout information of each region against preset layout word sets to obtain a layout recognition result.

2. The layout recognition method for document images according to claim 1, characterized in that the layout feature library stores the layout content of the different document samples and a layout serial number generated from the relative character height and the alignment of the layout content.

3. The layout recognition method for document images according to claim 1, characterized in that step 2) further comprises preprocessing the scanned image.

4. The layout recognition method for document images according to claim 3, characterized in that the preprocessing comprises denoising and skew correction.

5. The layout recognition method for document images according to claim 4, characterized in that the denoising comprises removing ink marks and removing seals.

6. The layout recognition method for document images according to claim 1, characterized in that in step 3) text-line segmentation is performed on the scanned image according to projection information, and the cut positions are determined from the texture features of the black and white pixels.

7. The layout recognition method for document images according to claim 6, characterized in that the last line of the body text is searched for bottom-up, and then a first line of the body text that matches the last line is searched for top-down; if the first or last line of the body text cannot be found, the first line is marked as 0 and the last line is marked as the final text line; the span between the first line and the last line is the result of the text-line segmentation.

8. The layout recognition method for document images according to claim 1, characterized in that in step 4) lines with the same character height, line spacing, and alignment are placed in the same region, and if, within one region, the left side contains several text lines and the right side contains only one, the region is divided again, with the right-hand text line treated as a subregion of the region.

9. The layout recognition method for document images according to claim 1, characterized in that in step 4) the divided regions produce a layout serial number, which is generated from the alignment and the relative character height.

10. The layout recognition method for document images according to claim 1, characterized in that in step 4) the layout information comprises the font size in the region, its rank, and the alignment of the region relative to the whole scanned image.