CN106326854B - Paragraph recognition method for fixed-layout documents - Google Patents

Publication number
CN106326854B
Authority
CN
China
Prior art keywords
page
line
text
paragraph
marker space
Prior art date
Application number
CN201610694835.0A
Other languages
Chinese (zh)
Other versions
CN106326854A (en)
Inventor
孙上斌
王海
刘伟平
刘晓龙
Original Assignee
掌阅科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 掌阅科技股份有限公司
Priority to CN201610694835.0A
Publication of CN106326854A
Application granted
Publication of CN106326854B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/00442 Document analysis and understanding; Document recognition
    • G06K9/00469 Document understanding by extracting the logical structure, e.g. chapters, sections, columns, titles, paragraphs, captions, page number, and identifying its elements, e.g. author, keywords, ZIP code, money amount

Abstract

The present invention relates to a paragraph recognition method for fixed-layout (format) documents, comprising: 1) performing text line recognition on a page of the fixed-layout document; 2) scanning the page with scan lines, identifying each blank separator region in the page according to whether a scan line intersects any text line, and cutting the page into multiple text blocks along the blank separator regions; 3) cutting each text block into paragraphs. The invention can accurately recognize the paragraphs in a fixed-layout document and can provide the order of those paragraphs.

Description

Paragraph recognition method for fixed-layout documents

Technical field

The invention belongs to the technical field of layout recognition. Specifically, it relates to a technique for converting a fixed-layout (format) document into a reflowable (streaming) document.

Background art

Traditional published books, newspapers, periodicals, magazines, and other reading media mainly consist of fixed-layout documents. A fixed-layout document has a fixed page layout that does not reflow, that is, what you see is what you get. In use, its rendering does not change with the software or hardware environment or with the operator, and its format, layout, fonts, and other aspects remain exactly the same as those of the paper document.

On the other hand, the mobile Internet is developing rapidly and mobile phones have become widespread. People are increasingly moving from reading on paper to reading on mobile phones. Because traditional reading media mainly consist of fixed-layout documents, they cannot meet the need for a reflowable reading experience on mobile devices of different sizes. Converting fixed-layout documents into reflowable documents therefore carries the traditional fixed-layout reading experience over to mobile reading.

To convert a fixed-layout document into a reflowable document, the paragraphs of the fixed-layout document must first be recognized, that is, it must be determined which words and sentences form each paragraph and how the paragraphs are ordered. However, the layouts of fixed-layout documents vary widely, which makes automatic recognition by computer difficult. For example, pictures of different sizes are often inserted into the page, which interrupts the continuity of the text; the text on a page is sometimes set horizontally and sometimes vertically; and, for layout reasons, a large region is often skipped between consecutive passages of text. All of these characteristics make it difficult for a computer to recognize the paragraphs on a page automatically.

Therefore, a solution for recognizing paragraphs in fixed-layout documents is urgently needed.

Summary of the invention

The object of the present invention is to provide a solution for recognizing paragraphs in fixed-layout documents.

The present invention provides a paragraph recognition method for fixed-layout documents, comprising the following steps:

1) performing text line recognition on a page of the fixed-layout document;

2) scanning the page with scan lines, identifying each blank separator region in the page according to whether a scan line intersects any text line, and cutting the page into multiple text blocks along the blank separator regions;

3) cutting each text block into paragraphs.

Wherein, in step 2), scanning the page with scan lines comprises scanning the page horizontally with a vertical scan line and scanning the page vertically with a horizontal scan line, and the blank separator regions comprise vertical blank separator regions and horizontal blank separator regions.

Wherein, in step 2), the page is cut into multiple text blocks along the blank separator regions as follows: the page is cut repeatedly using the blank separator regions, and the blank separator region with the larger separation distance is used first.

Wherein, step 2) comprises the following substeps:

21) scanning the current page horizontally with a vertical scan line, obtaining the regions in which valid scan lines occur consecutively during the horizontal scan, and regarding these regions as vertical blank separator regions, a valid scan line being a scan line that intersects no text line; and finding the largest vertical blank separator region, which has the maximum horizontal length MaxHLine;

22) scanning the current page vertically with a horizontal scan line, obtaining the regions in which valid scan lines occur consecutively during the vertical scan, and regarding these regions as horizontal blank separator regions; and finding the largest horizontal blank separator region, which has the maximum vertical length MaxVLine;

23) comparing the maximum horizontal length MaxHLine of the vertical blank separator regions with the maximum vertical length MaxVLine of the horizontal blank separator regions:

if MaxHLine > MaxVLine and MaxHLine > 0, cutting the current page vertically along the vertical blank separator region corresponding to MaxHLine to obtain two subpages;

if MaxHLine < MaxVLine and MaxVLine > 0, cutting the current page horizontally along the horizontal blank separator region corresponding to MaxVLine to obtain two subpages;

if MaxHLine = 0 and MaxVLine = 0, the current page cannot be cut further, and processing of the current page ends;

24) sorting the subpages obtained by the cut in step 23), then taking each subpage in turn as the new current page and returning to step 21); this recursion is repeated until no subpage can be cut further, at which point the sorted text blocks are obtained directly.

Wherein, in step 24), at each cut, the two resulting subpages are sorted according to their left-right or top-bottom positions.

Wherein, in step 24), the order of all text blocks of the whole page is obtained from the order of the two subpages produced at each cut.

Wherein, step 1) comprises: extracting all characters in the page of the fixed-layout document together with their position information, and merging the characters according to that position information using a line recognition algorithm to obtain the corresponding text lines.

Wherein, in step 1), the line recognition algorithm comprises the following substeps:

Step 11) for the object set of the current page to be recognized, calculating the distances between characters according to their positions and finding the two closest characters, the objects in the object set comprising characters and text lines;

Step 12) merging the two characters found into a text line LA, deleting the merged characters from the object set of the current page to be recognized, and adding text line LA to the object set; then obtaining the direction information of text line LA from the positional relationship of the two characters, and further generating the basic object data of text line LA, the basic object data comprising the font size and outline of the text line;

Step 13) traversing all characters in the object set of the current page to be recognized and finding the character WB closest to text line LA;

Step 14) determining, according to font size, text direction, and outline, whether merging text line LA with character WB is reasonable; if not, returning to step 11); otherwise, merging text line LA and character WB into a new line LC and continuing with step 15);

Step 15) taking the new line LC as the new current text line LA and returning to step 13) to start the next round of processing;

Steps 11) to 15) are repeated until all characters in the object set of the page to be recognized have been merged into text lines.

Wherein, step 14) comprises the following substeps:

Step 141) comparing the font size of the characters in text line LA with that of the character WB found; if the font size difference exceeds a preset threshold, returning to step 11); otherwise, continuing with step 142);

Step 142) merging text line LA and the character WB found into a new line LC, and checking whether the new line LC has the same direction as the original text line LA; if the new line LC contains characters with different directions, or the direction of the new line LC differs from that of the original text line LA, releasing the new line LC and returning to step 11); otherwise, continuing with step 143);

Step 143) determining from the outlines whether the new line LC overlaps any other object; if it does, the merge that produced the new line LC is invalid, the new line LC is released, and the method returns to step 11); if it does not, proceeding to step 15).

Wherein, step 3) comprises: for each text block, recognizing each paragraph according to the line spacing and to whether text indentation exists at the start or end of a line; merging the paragraphs inside the ordered text blocks in order to generate an ordered paragraph sequence; and examining the two adjacent paragraphs between each pair of adjacent ordered text blocks, and merging these two adjacent paragraphs when they have the same font and neither of them is a complete paragraph.

Compared with the prior art, the present invention has the following technical effects:

1. The invention can accurately recognize the paragraphs in a fixed-layout document.

2. The invention can provide the order of the paragraphs.

Brief description of the drawings

Fig. 1 shows the overall flow of the paragraph recognition method in one embodiment of the invention;

Fig. 2 shows the flow of text line recognition in one embodiment of the invention;

Fig. 3 is a schematic diagram of an example text line recognition result for one page of a PDF document in one embodiment of the invention;

Fig. 4 is a schematic diagram of the horizontal scan with a vertical scan line during block recognition in one embodiment of the invention;

Fig. 5 is a schematic diagram of the vertical scan with a horizontal scan line during block recognition in one embodiment of the invention;

Fig. 6 is a schematic diagram of a block recognition result in one embodiment of the invention;

Fig. 7 is a flowchart of paragraph recognition based on the recognized blocks in one embodiment of the invention.

Detailed description of the embodiments

According to one embodiment of the invention, a paragraph recognition method for converting a fixed-layout document to reflowable typesetting is provided. Fig. 1 shows the overall flow of the method. With reference to Fig. 1, the method comprises the following steps:

Step 1: Read the document metadata and perform line recognition. The document metadata comprises the basic elements that make up the fixed-layout document, such as characters and pictures. Processing a fixed-layout document starts from processing this basic metadata. Methods for extracting individual characters and individual pictures from a fixed-layout document are prior art and are not described here.

One main difference between a fixed-layout document and a reflowable document is that the former has no order information in the strict sense: all character information consists purely of position information (for example, the coordinates of each character), and the order in which characters are stored in the fixed-layout document can serve only as a reference, not as strict order information. Existing fixed-layout recognition schemes can therefore usually obtain only scattered, unordered characters. To recognize paragraphs, the document metadata (especially the characters) must first be put in order. In this embodiment, scattered characters are organized into lines by line recognition based on the position of each character, which orders the characters within each line.

Fig. 2 shows the line recognition flow in one embodiment of the invention, which comprises the following steps:

Step 11: For the object set of the current page to be recognized, calculate the distances between characters according to their positions and find the two closest characters. Here, the objects comprise characters and text lines.

Step 12: Merge the two characters found into a text line LA, then obtain the direction information of text line LA from the positional relationship of the two characters, for example whether it is a horizontal or a vertical text line. The basic object data of text line LA, such as its font size (which indicates the character size) and outline, can also be generated at the same time. In this embodiment, the outline of a text line is the smallest rectangle that encloses all characters of that text line.
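
The direction and outline described in step 12 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Box` type, the function names, and the rule of comparing centre offsets to decide the direction are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle (hypothetical type): (x0, y0) and (x1, y1) are opposite corners."""
    x0: float
    y0: float
    x1: float
    y1: float


def outline(a: Box, b: Box) -> Box:
    """Smallest rectangle that encloses both boxes (the outline of the merged line)."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def direction(a: Box, b: Box) -> str:
    """Assumed rule: the line is horizontal if the two centres differ more in x than in y."""
    dx = abs((b.x0 + b.x1) - (a.x0 + a.x1)) / 2
    dy = abs((b.y0 + b.y1) - (a.y0 + a.y1)) / 2
    return "horizontal" if dx >= dy else "vertical"
```

Two characters that sit side by side yield a horizontal line whose outline spans both of them; two characters stacked on top of each other yield a vertical line.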

In this step, the merged characters are also deleted from the object set of the current page to be recognized, and text line LA is added to the object set. In one embodiment, the object set of the current page to be recognized can be stored and represented as an inter-object distance matrix, in which each row represents one object, each column also represents one object, and each element is the distance between the object represented by its row and the object represented by its column.
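
An inter-object distance matrix of this kind might look like the sketch below. The text does not say how the distance between two objects is measured, so the shortest gap between bounding rectangles is used here as an assumption; the function names are hypothetical.

```python
import math


def box_distance(a, b):
    """Shortest gap between two rectangles given as (x0, y0, x1, y1); 0 if they touch or overlap."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return math.hypot(dx, dy)


def distance_matrix(boxes):
    """Row i, column j holds the distance between object i and object j."""
    return [[box_distance(a, b) for b in boxes] for a in boxes]


def closest_pair(matrix):
    """Indices (i, j), i < j, of the two closest objects; None if fewer than two objects."""
    best = None
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if best is None or matrix[i][j] < matrix[best[0]][best[1]]:
                best = (i, j)
    return best
```

With three character boxes, two on the same line and one far below, `closest_pair` returns the indices of the two characters on the same line, which is the pair that step 11 would merge first.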

Step 13: Traverse all characters in the object set of the current page to be recognized, find the character WB closest to text line LA, and use character WB as the input object of the subsequent processing.

Step 14: Compare the font size of the characters in text line LA with that of the character WB found. If the font size difference exceeds a preset threshold (for example, 25%), return to step 11; otherwise, continue with step 15.

In this step, if the font size difference exceeds the preset threshold, character WB is not suitable for merging into text line LA, so the method returns to step 11 to find the two closest characters in the object set of the current page to be recognized again and merge them into a new text line. Note that the characters already merged have been removed from the object set of the page to be recognized by this point, so the new text line is formed from the remaining characters.
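
The font size check of step 14 can be illustrated with a short sketch. The text gives 25% as an example threshold but does not define how the difference is normalized; measuring it relative to the larger of the two sizes is an assumption of this example, as is the function name.

```python
def font_sizes_compatible(line_size: float, char_size: float, threshold: float = 0.25) -> bool:
    """True if the relative font size difference does not exceed the threshold.

    Assumption: the difference is measured relative to the larger of the two sizes.
    """
    larger = max(line_size, char_size)
    if larger <= 0:
        return False
    return abs(line_size - char_size) / larger <= threshold
```

For instance, a 12 pt line accepts an 11 pt character but rejects an 8 pt one, since the latter differs by a third of the larger size.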

Step 15: Merge text line LA and the character WB found into a new line LC, and check whether the new line LC has the same direction as the original text line LA. If the new line LC contains characters with different directions, or the direction of the new line LC differs from that of the original text line LA, then the new line LC and the original text line LA have different attributes, so the merge that produced the new line LC is invalid. In that case, release the new line LC and return to step 11 to find the two closest characters in the object set of the current page to be recognized again and merge them into a new text line. Otherwise, continue with step 16.

Step 16: Determine whether the new line LC overlaps any other object (overlap, also called intersection, means that the outline of the new line LC and the outline of another object share part or all of their area). If an overlap occurs, the new line LC is considered not to be independent and the merge that produced it is invalid: release the new line LC and return to step 11 to find the two closest characters in the object set of the current page to be recognized again and merge them into a new text line. If no overlap occurs, the new line LC is considered independent, and the method proceeds to step 17.
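
The overlap test of step 16 amounts to a rectangle intersection check on outlines. The sketch below assumes outlines are `(x0, y0, x1, y1)` tuples and treats rectangles that merely share an edge as not overlapping; both choices, and the function names, are assumptions of the example.

```python
def boxes_overlap(a, b):
    """True if two rectangles (x0, y0, x1, y1) share some area; touching edges do not count."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_is_independent(new_outline, other_outlines):
    """True if the outline of the new line LC overlaps none of the other objects."""
    return not any(boxes_overlap(new_outline, other) for other in other_outlines)
```

A candidate line whose outline would swallow a character belonging to another line fails this check, so the merge is rejected.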

Step 17: The new line LC has passed all checks, so the merge that produced it is considered reasonable. Set LA = LC (that is, take the new line LC as the new current text line LA) and return to step 13 to start the next round of processing. In this step, text line LA in the object set of the current page to be recognized is also replaced with the content of the new line LC. Note, however, that in other embodiments the LA in the object set of the page to be recognized need not be updated in this step; instead, the original content of text line LA in the object set can be replaced with the content of the new line LC once the current text line has been completely recognized, that is, when the check in step 14, 15, or 16 requires a return to step 11.

As can be seen, each time processing returns to step 11, one complete line has been generated. When all characters in the set have been merged into text lines, the loop ends, and the result is the final result of line recognition. After line recognition is complete, the page consists of lines and pictures, as shown in Fig. 3.

It should be noted that the line recognition approach above is not the only one. In other embodiments, other line recognition algorithms can be used, as long as they can recognize text lines from the scattered characters extracted from the fixed-layout page and their position information.

Step 2: Perform block recognition on the current page. After line recognition is complete, the page structure is further divided into multiple text blocks, and the order of the text blocks in the page is determined.

After line recognition is complete, block recognition on the current page comprises the following substeps:

Step 21: Scan the page horizontally with a vertical scan line. During the scan, if the vertical scan line intersects no text line in the page, the position of the vertical scan line at that moment is determined to be a vertical valid line. Fig. 4 is a schematic diagram of the horizontal scan with a vertical scan line, in which the arrow indicates the scanning direction and the dotted line indicates the vertical scan line.

Step 22: Obtain the regions in which vertical valid lines occur consecutively during the horizontal scan. These regions are vertical blank regions; because such regions usually serve as separators between text blocks in a page, they are also called vertical blank separator regions. In Fig. 4, region a is a vertical blank separator region. In this step, the horizontal length of each vertical separator region found in the current scan is measured and the maximum horizontal length MaxHLine is determined; it represents the separation distance between horizontally adjacent blocks.

Step 23: Scan the page vertically with a horizontal scan line. During the scan, if the horizontal scan line intersects no text line in the page, the position of the horizontal scan line at that moment is determined to be a horizontal valid line. Fig. 5 is a schematic diagram of the vertical scan with a horizontal scan line, in which the arrow indicates the scanning direction and the dotted line indicates the horizontal scan line.

Step 24: Obtain the regions in which horizontal valid lines occur consecutively during the vertical scan. These regions are horizontal blank regions; because such regions usually serve as separators between text blocks in a page, they are also called horizontal blank separator regions. In Fig. 5, region b is a horizontal blank separator region. In this step, the vertical length of each horizontal separator region found in the current scan is measured and the maximum vertical length MaxVLine is determined; it represents the separation distance between vertically adjacent blocks.

In this embodiment, a scan line is a line long enough to span the entire page. During scanning, the scan line can move pixel by pixel. This is not the only option, however: in other embodiments, a different step size can be set, provided that it still allows the different separation distances of different blank separator regions to be distinguished.
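
Steps 21 to 24 can be illustrated without moving a scan line pixel by pixel. A full-page vertical scan line misses every text line exactly where no line outline covers that x coordinate, so the blank regions are the gaps in the projection of the outlines onto the axis. The sketch below uses this projection as a swapped-in equivalent of the scan, and it considers only gaps lying between text lines, ignoring page margins; both points, and the function names, are assumptions of the example.

```python
def widest_gap(intervals):
    """Return (width, start, end) of the widest stretch covered by no interval.

    Only gaps lying between intervals are considered (page margins are ignored).
    Returns (0, None, None) if the intervals leave no gap.
    """
    best = (0, None, None)
    reach = None  # largest coordinate covered so far
    for start, end in sorted(intervals):
        if reach is not None and start - reach > best[0]:
            best = (start - reach, reach, start)
        reach = end if reach is None else max(reach, end)
    return best


def max_separators(lines):
    """lines: text line outlines (x0, y0, x1, y1). Returns the MaxHLine and MaxVLine gaps."""
    max_h_line = widest_gap([(x0, x1) for x0, _, x1, _ in lines])
    max_v_line = widest_gap([(y0, y1) for _, y0, _, y1 in lines])
    return max_h_line, max_v_line
```

On a two-column page with a 20-unit gutter and 5 units between lines, the widest vertical blank region has width 20 and the widest horizontal blank region has height 5.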

Step 25: Compare the maximum horizontal length MaxHLine of the vertical blank regions with the maximum vertical length MaxVLine of the horizontal blank regions:

If MaxHLine > MaxVLine and MaxHLine > 0, cut the current page vertically along the vertical valid lines (or vertical blank region) corresponding to MaxHLine to obtain two subpages.

If MaxHLine < MaxVLine and MaxVLine > 0, cut the current page horizontally along the horizontal valid lines (or horizontal blank region) corresponding to MaxVLine to obtain two subpages.

If MaxHLine = 0 and MaxVLine = 0, the current page cannot be cut further, and processing of the current page ends.

Step 26: Sort the subpages obtained by the cut in step 25, then take each subpage in turn as the new current page and return to step 21. This recursion is repeated until no subpage can be cut further. At that point, the text blocks of the whole page have been recognized. Fig. 6 is a schematic diagram of a block recognition result in one embodiment of the invention.

In addition, the sorting rule for subpages in this step is that the upper subpage precedes the lower subpage, and the left subpage precedes the right subpage.

After step 26 has been executed, each resulting subpage represents a text block or another basic element such as a picture. The text blocks can easily be extracted from the basic element information. Moreover, because the subpages are sorted at every cut, the sorted text blocks are obtained directly.
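
The recursive cutting and ordering of steps 21 to 26 can be sketched as below. Several points are assumptions of this example rather than statements from the text: blank regions are found by projecting line outlines onto each axis instead of moving a scan line, y grows downward so that smaller y means higher on the page, a tie between nonzero MaxHLine and MaxVLine is resolved with a horizontal cut, and the hypothetical `min_gap` parameter lets ordinary line spacing be ignored (with `min_gap=0` every gap is a separator, as in the literal description).

```python
def widest_gap(intervals):
    """(width, start, end) of the widest uncovered stretch between intervals, or (0, None, None)."""
    best = (0, None, None)
    reach = None
    for start, end in sorted(intervals):
        if reach is not None and start - reach > best[0]:
            best = (start - reach, reach, start)
        reach = end if reach is None else max(reach, end)
    return best


def split_blocks(lines, min_gap=0):
    """Cut text line outlines (x0, y0, x1, y1) into blocks, returned in reading order."""
    h_width, h_lo, h_hi = widest_gap([(l[0], l[2]) for l in lines])  # MaxHLine
    v_width, v_lo, v_hi = widest_gap([(l[1], l[3]) for l in lines])  # MaxVLine
    if h_width <= min_gap:
        h_width = 0
    if v_width <= min_gap:
        v_width = 0
    if h_width == 0 and v_width == 0:
        return [lines]  # cannot be cut further: this subpage is one block
    if h_width > v_width:
        # Vertical cut: the left subpage precedes the right subpage.
        first = [l for l in lines if l[2] <= h_lo]
        second = [l for l in lines if l[0] >= h_hi]
    else:
        # Horizontal cut: the upper subpage precedes the lower subpage.
        first = [l for l in lines if l[3] <= v_lo]
        second = [l for l in lines if l[1] >= v_hi]
    return split_blocks(first, min_gap) + split_blocks(second, min_gap)
```

For a page with a full-width title above two columns, the first cut is horizontal (title, then body) and the second is vertical (left column, then right column), so the blocks come out as title, left column, right column.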

Step 3: Perform paragraph recognition on the ordered text blocks obtained in step 2. Paragraph recognition mainly divides each text block into individual paragraphs, as shown in Fig. 6. Each text block is an input to paragraph recognition, and the output is the individual paragraphs cut from within the block.

Fig. 7 shows the flow of the paragraph recognition method in one embodiment, which comprises the following substeps:

Step 31: First create a paragraph PB. Paragraph PB is a container of lines, that is, it is made up of lines. Initially, paragraph PB is empty.

Step 32: Take a line LA from the current text block.

Step 33: Determine whether PB is empty. If so, execute step 34; if not, execute step 35.

Step 34: LA is added directly to PB, and processing jumps to step 37.

Step 35: Decide whether to add line LA to paragraph PB according to whether the shortest distance from LA to PB is greater than the line spacing within PB. The lines in PB are separated by a spacing, the line spacing. If the shortest distance from LA to PB is greater than the line spacing of PB, LA is not suitable for adding to PB; the recognition of the current paragraph PB is then considered finished, and the method returns to step 31 to recognize the next paragraph. Otherwise, continue with step 36.

Step 36: The lines that make up paragraph PB stand in a positional relationship to one another. If the left side (or top) of the newly added line is indented by at least 2em (1em = the width of one character) relative to the left side (or top) of the last line of the paragraph, then LA is the start of a new paragraph relative to PB: LA is not added to PB, the recognition of the current paragraph PB is considered finished, and the method returns to step 31 to recognize the next paragraph. Otherwise, continue with step 37.

Step 37: Add LA to PB and return to step 32 to take the next line for processing.

After all lines have been processed, the paragraphs generated are the final required paragraphs, and their order is their order within the text block.
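
Steps 31 to 37 can be sketched for horizontal text as follows. The example makes assumptions that the text leaves open: lines are `(x0, y0, x1, y1)` outlines already ordered from top to bottom, the line spacing of PB is taken from its last two lines and the spacing test is skipped while PB holds a single line, 1em is approximated by the line height, and a line rejected by step 35 or 36 becomes the first line of the next paragraph. The function names are hypothetical.

```python
def starts_new_paragraph(pb, la, indent_em=2.0):
    """Steps 35 and 36: decide whether line LA opens a new paragraph after PB."""
    last = pb[-1]
    distance = la[1] - last[3]  # vertical gap between PB and LA
    if len(pb) >= 2:
        spacing = pb[-1][1] - pb[-2][3]  # line spacing inside PB
        if distance > spacing:  # step 35
            return True
    em = last[3] - last[1]  # assumption: 1em equals the line height
    return la[0] - last[0] >= indent_em * em  # step 36


def split_paragraphs(lines):
    """Split the ordered lines of one text block into paragraphs (lists of lines)."""
    paragraphs, pb = [], []  # step 31: PB starts empty
    for la in lines:  # step 32
        if pb and starts_new_paragraph(pb, la):
            paragraphs.append(pb)
            pb = []  # back to step 31
        pb.append(la)  # steps 34 and 37
    if pb:
        paragraphs.append(pb)
    return paragraphs
```

On real data the strict comparison in step 35 would probably need a small tolerance, since line spacing extracted from a PDF is rarely identical from line to line.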

Step 4: Continued-paragraph processing. Step 2 identifies and generates an ordered sequence of text blocks, and step 3 splits each text block into paragraphs that are also ordered. Continued-paragraph processing therefore handles the paragraph relationships between blocks. The process is as follows:

Step 41: First merge the paragraphs inside the ordered blocks in order to generate an ordered paragraph sequence, so that the paragraphs of the whole page are ordered.

Step 42: In the ordered paragraph sequence, take the adjacent paragraphs on either side of each boundary between adjacent ordered text blocks and examine them; if the two paragraphs can be merged, merge them. The merge condition is that the two paragraphs have the same font and neither is a complete paragraph. After the continued paragraphs of all adjacent ordered text blocks have been merged, the ordered paragraph sequence is the final result.

In a preferred embodiment, step 42 comprises the following substeps:

Step 421: For two adjacent text blocks, take the last paragraph A of the preceding text block and the first paragraph B of the following text block.

Step 422: Compare the fonts of paragraph A and paragraph B. If the fonts differ, paragraphs A and B are certainly not a continued paragraph and are not merged; if they are the same, continue with step 423.

Step 423: Determine whether paragraph A is a head paragraph. A head paragraph is the first part of a complete paragraph but is not itself a complete paragraph. If paragraph A is not a head paragraph, paragraphs A and B are not merged; if paragraph A is a head paragraph, continue with step 424. In a specific implementation, this can be decided by checking whether the right side of the last line of paragraph A is indented relative to the right side of the other lines of the paragraph: if it is, paragraph A is considered not to be a head paragraph; if it is not, paragraph A is considered a head paragraph.

Step 424: Determine whether paragraph B is a tail paragraph. A tail paragraph is the latter part of a complete paragraph but is not itself a complete paragraph. If paragraph B is not a tail paragraph, paragraphs A and B are not merged; if paragraph B is a tail paragraph, continue with step 425. In a specific implementation, this can be decided by checking whether the left side of the first line of paragraph B is indented relative to the left side of the other lines of the paragraph: if it is, paragraph B is considered not to be a tail paragraph; if it is not, paragraph B is considered a tail paragraph.

Step 425: Mark paragraph A and paragraph B as a continued paragraph, so that when the ordered text units are exported to a reflowable file, paragraph A and paragraph B are automatically merged into one paragraph.
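
The checks of steps 421 to 425 can be sketched as follows for horizontal text. The data layout is hypothetical: each paragraph is a dict with a `font` name and a list of line outlines `(x0, y0, x1, y1)`. The tolerance `tol` and the treatment of single-line paragraphs (no indentation can be detected, so they pass both checks) are assumptions of the example.

```python
def is_head(paragraph, tol=1.0):
    """Step 423: the last line is not indented on the right, so the paragraph is unfinished."""
    lines = paragraph["lines"]
    if len(lines) < 2:
        return True  # assumption: nothing to compare against
    right_edge = max(line[2] for line in lines[:-1])
    return right_edge - lines[-1][2] <= tol


def is_tail(paragraph, tol=1.0):
    """Step 424: the first line is not indented on the left, so the paragraph continues another."""
    lines = paragraph["lines"]
    if len(lines) < 2:
        return True  # assumption: nothing to compare against
    left_edge = min(line[0] for line in lines[1:])
    return lines[0][0] - left_edge <= tol


def continued_pairs(blocks):
    """Return each index i where the last paragraph of block i continues into block i + 1."""
    marks = []
    for i in range(len(blocks) - 1):
        if not blocks[i] or not blocks[i + 1]:
            continue
        a, b = blocks[i][-1], blocks[i + 1][0]  # step 421
        if a["font"] == b["font"] and is_head(a) and is_tail(b):  # steps 422 to 424
            marks.append(i)  # step 425
    return marks
```

A paragraph that fills its last line to the right margin at the bottom of one column, followed by an unindented paragraph in the same font at the top of the next column, is marked as a continued paragraph; a paragraph whose first line is indented is not.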

Finally, it should be noted that the above embodiments only describe the technical solution of the invention and do not limit it. The invention can be extended in application to other modifications, variations, applications, and embodiments, and all such modifications, variations, applications, and embodiments are considered to fall within the spirit and teaching of the invention.

Claims (8)

1. A format document paragraph recognition method, characterized in that it comprises the following steps:
1) performing literal line recognition on a page of the format document;
2) scanning the page with scan lines, identifying each blank marker space in the page according to whether the scan lines intersect a literal line, and cutting the page into multiple character blocks according to the blank marker spaces, wherein a scan line is a line long enough to run through the entire page;
3) dividing each character block into paragraphs,
wherein, in step 2), scanning the page with scan lines comprises scanning the page horizontally with a vertical scan line and scanning the page vertically with a horizontal scan line, and the blank marker spaces comprise vertical blank marker spaces and horizontal blank marker spaces,
and step 2) comprises the following sub-steps:
21) scanning the current page horizontally with a vertical scan line, obtaining the regions in which effective scan lines occur consecutively during the horizontal scan, and regarding these regions as vertical blank marker spaces, an effective scan line being a scan line that intersects no literal line; and finding the largest vertical blank marker space, which has the maximum horizontal length MaxHLine;
22) scanning the current page vertically with a horizontal scan line, obtaining the regions in which effective scan lines occur consecutively during the vertical scan, and regarding these regions as horizontal blank marker spaces; and finding the largest horizontal blank marker space, which has the maximum vertical length MaxVLine;
23) comparing the maximum horizontal length MaxHLine of the vertical blank marker spaces with the maximum vertical length MaxVLine of the horizontal blank marker spaces:
if MaxHLine > MaxVLine and MaxHLine > 0, cutting the current page vertically along the vertical blank marker space corresponding to the maximum horizontal length MaxHLine, obtaining two subpages;
if MaxHLine < MaxVLine and MaxVLine > 0, cutting the current page horizontally along the horizontal blank marker space corresponding to the maximum vertical length MaxVLine, obtaining two subpages;
if MaxHLine = 0 and MaxVLine = 0, the current page cannot be cut any further, and processing of the current page ends;
24) sorting the subpages obtained by the cutting in step 23), then taking each subpage in turn as the new current page and returning to step 21) for processing; this recursion is repeated until none of the subpages can be cut any further, at which point the sorted character blocks are obtained directly.
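The recursive cutting of sub-steps 21) to 24) can be illustrated with a minimal sketch. It is not the patented implementation: the names `widest_gap`, `xy_cut` and `min_gap` are hypothetical, literal lines are assumed to be axis-aligned boxes, page margins are ignored (only gaps between literal lines count), gaps narrower than `min_gap` are treated as zero so that ordinary line spacing does not trigger a cut, and a tie between MaxHLine and MaxVLine (which claim 1 does not address) is resolved here as a horizontal cut.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); y grows downward


def widest_gap(spans: List[Tuple[float, float]], min_gap: float) -> Tuple[float, float, float]:
    """Return (width, start, end) of the widest uncovered gap between 1-D spans.

    Gaps narrower than min_gap are ignored; (0, 0, 0) means no usable gap.
    """
    best = (0.0, 0.0, 0.0)
    covered_to = None
    for lo, hi in sorted(spans):
        if covered_to is not None:
            gap = lo - covered_to
            if gap >= min_gap and gap > best[0]:
                best = (gap, covered_to, lo)
        covered_to = hi if covered_to is None else max(covered_to, hi)
    return best


def xy_cut(lines: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Recursively cut a page, given as literal-line boxes, into ordered blocks."""
    if not lines:
        return []
    # Sub-step 21): widest vertical blank marker space (a gap along the x axis).
    max_h, vx0, vx1 = widest_gap([(b[0], b[2]) for b in lines], min_gap)
    # Sub-step 22): widest horizontal blank marker space (a gap along the y axis).
    max_v, hy0, hy1 = widest_gap([(b[1], b[3]) for b in lines], min_gap)
    # Sub-step 23): no blank marker space left, so this is one character block.
    if max_h == 0 and max_v == 0:
        return [lines]
    if max_h > max_v:
        # Vertical cut: the left subpage is ordered before the right one.
        first = [b for b in lines if b[2] <= vx0]
        second = [b for b in lines if b[0] >= vx1]
    else:
        # Horizontal cut: the upper subpage is ordered before the lower one.
        # Ties also land here, which is an assumption of this sketch.
        first = [b for b in lines if b[3] <= hy0]
        second = [b for b in lines if b[1] >= hy1]
    # Sub-step 24): recurse on the sorted subpages and concatenate the results.
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)
```

Because each cut orders its two subpages before recursing, the concatenated result already lists the character blocks in reading order, and because the widest gap is always taken first, wide separations such as column gutters are cut before narrower ones.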
2. The format document paragraph recognition method according to claim 1, characterized in that, in step 2), the page is cut into multiple character blocks with the blank marker spaces as follows: the page is cut repeatedly, using one blank marker space for each cut, wherein blank marker spaces with a wider separation distance are used for cutting first.
3. The format document paragraph recognition method according to claim 1, characterized in that, in step 24), at each cut, the two subpages obtained by the cut are sorted according to their left-right or top-bottom positions.
4. The format document paragraph recognition method according to claim 3, characterized in that, in step 24), the order of all character blocks of the entire page is obtained from the order of the two subpages obtained at each cut.
5. The format document paragraph recognition method according to claim 1, characterized in that step 1) comprises: extracting all texts in the page of the format document together with the position information of those texts, and merging the texts according to the position information of each text on the basis of a row recognition algorithm, obtaining the corresponding literal lines.
6. The format document paragraph recognition method according to claim 5, characterized in that, in step 1), the row recognition algorithm comprises the following sub-steps:
Step 11) for the object set of the current page to be recognized, calculating the distances between texts according to the position of each text in the set, and finding the two texts closest to each other, wherein the objects in the object set include texts and literal lines;
Step 12) merging the two texts found into a literal line LA, deleting the merged texts from the object set of the current page to be recognized, and adding literal line LA to the object set; then obtaining the direction information of literal line LA from the positional relationship of the two texts, and further generating the basic object data of literal line LA, the basic object data including the font size and outline of the literal line;
Step 13) traversing all texts in the object set of the current page to be recognized, and finding the text WB closest in position to literal line LA;
Step 14) judging, according to font size, text direction and outline, whether merging literal line LA with text WB is reasonable; if it is not reasonable, returning to step 11); otherwise, merging literal line LA and text WB into a new line LC and continuing with step 15);
Step 15) taking new line LC as the new current literal line LA, and returning to step 13) to start the next round of processing;
steps 11) to 15) above are repeated until all texts in the object set of the page to be recognized have been merged into literal lines.
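The greedy merging loop of steps 11) to 15) might be sketched as follows. This is only an illustration under stated assumptions: the names `Text`, `recognize_lines` and `simple_check` are hypothetical, distance is measured between box centers, the reasonableness test of step 14) is replaced by a simple placeholder (same font size, roughly the same vertical position), texts are assumed to run horizontally, and a single leftover text is treated as a line of its own.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot
from typing import Callable, List, Tuple


@dataclass
class Text:
    ch: str
    x: float     # left edge
    y: float     # top edge
    w: float     # width
    h: float     # height
    size: float  # font size

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def dist(a: Text, b: Text) -> float:
    """Distance between two texts, measured between box centers (an assumption)."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def line_dist(line: List[Text], t: Text) -> float:
    """Distance from a text to the nearest member of a line."""
    return min(dist(m, t) for m in line)


def simple_check(line: List[Text], t: Text) -> bool:
    """Placeholder for step 14): same font size, roughly the same vertical position."""
    ref = line[0]
    return abs(ref.size - t.size) <= 0.5 and abs(ref.center[1] - t.center[1]) <= ref.h / 2


def recognize_lines(texts: List[Text],
                    ok: Callable[[List[Text], Text], bool] = simple_check) -> List[List[Text]]:
    pool = list(texts)  # texts not yet merged into any literal line
    lines: List[List[Text]] = []
    while len(pool) >= 2:
        # Steps 11) and 12): merge the two closest texts into a new line LA.
        a, b = min(combinations(pool, 2), key=lambda p: dist(p[0], p[1]))
        pool.remove(a)
        pool.remove(b)
        la = [a, b]
        while pool:
            # Step 13): find the remaining text WB nearest to LA.
            wb = min(pool, key=lambda t: line_dist(la, t))
            # Step 14): if the merge is not reasonable, go back to step 11).
            if not ok(la, wb):
                break
            # Step 15): the merged line LC becomes the new LA.
            pool.remove(wb)
            la.append(wb)
        lines.append(sorted(la, key=lambda t: t.x))  # horizontal text assumed
    lines.extend([t] for t in pool)  # leftover single text forms its own line
    return lines
```

The `ok` callback keeps the loop independent of how "reasonable" is decided, so a stricter test based on font size, direction and outline can be passed in without changing the loop itself.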
7. The format document paragraph recognition method according to claim 6, characterized in that step 14) comprises the following sub-steps:
Step 141) comparing the font size of the texts in literal line LA with that of the found text WB; if the font size difference exceeds a preset threshold, returning to step 11); otherwise, continuing with step 142);
Step 142) merging literal line LA and the found text WB into a new line LC, and checking whether new line LC has the same direction as the original literal line LA; if new line LC contains texts of differing directions, or the direction of new line LC differs from that of the original literal line LA, releasing new line LC and returning to step 11); otherwise, continuing with step 143);
Step 143) judging, on the basis of outline, whether new line LC overlaps any other object; if an overlap occurs, the merge into new line LC is invalid, new line LC is released, and the process returns to step 11); if no overlap occurs, proceeding to step 15).
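The three checks of steps 141) to 143) could look like the sketch below. The names `Obj`, `direction`, `overlaps`, `merge_is_reasonable` and `size_tol` are hypothetical. Two simplifications are assumptions of this sketch rather than statements of the claim: a line's direction is inferred from the shape of its bounding box (wider than tall means horizontal), and the outline is approximated by the bounding box.

```python
from typing import List, NamedTuple


class Obj(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    size: float = 0.0  # font size; left at 0 for non-text objects


def bbox(objs: List[Obj]) -> Obj:
    """Bounding box of a group of objects, used here as its outline."""
    return Obj(min(o.x0 for o in objs), min(o.y0 for o in objs),
               max(o.x1 for o in objs), max(o.y1 for o in objs))


def direction(objs: List[Obj]) -> str:
    """Direction of a line, inferred from the shape of its bounding box."""
    b = bbox(objs)
    return "horizontal" if (b.x1 - b.x0) >= (b.y1 - b.y0) else "vertical"


def overlaps(a: Obj, b: Obj) -> bool:
    """True if two boxes share interior area (touching edges do not count)."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def merge_is_reasonable(la: List[Obj], wb: Obj, others: List[Obj],
                        size_tol: float = 1.0) -> bool:
    # Step 141): font size difference must stay within the preset threshold.
    if any(abs(t.size - wb.size) > size_tol for t in la):
        return False
    # Step 142): the tentative new line LC must keep the direction of LA.
    lc = la + [wb]
    if direction(lc) != direction(la):
        return False
    # Step 143): the outline of LC must not overlap any other object.
    outline = bbox(lc)
    return not any(overlaps(outline, o) for o in others)
```

Here `others` is meant to hold every object on the page that is not part of LA or WB. Returning `False` corresponds to releasing LC and going back to step 11).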
8. The format document paragraph recognition method according to claim 4, characterized in that step 3) comprises: for each character block, identifying each paragraph according to the line spacing and whether text indentation is present at the start or end of a line; merging the paragraphs within the ordered character blocks in sequence, generating one ordered paragraph sequence; and checking the two adjacent paragraphs lying between each pair of adjacent ordered character blocks, and merging these two adjacent paragraphs when they have the same font and neither of them is a complete paragraph.
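A minimal sketch of the paragraph step in claim 8 follows. The names `Line`, `Paragraph`, `split_block`, `merge_blocks`, `max_gap` and `tol` are hypothetical. The sketch assumes each block's left and right edges are known, and it reads "not a complete paragraph" as follows: the last paragraph of a block is incomplete when its final line runs to the right edge, and the first paragraph of the next block is incomplete when its first line is not indented. That reading is an assumption, as is the requirement that every block contains at least one line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    x0: float  # left end of the line
    x1: float  # right end of the line
    y: float   # vertical position (grows downward)
    font: str = "body"


@dataclass
class Paragraph:
    lines: List[Line]
    starts_indented: bool  # first line is indented: the paragraph has a real start
    ends_short: bool       # last line stops early: the paragraph has a real end

    @property
    def font(self) -> str:
        return self.lines[0].font


def split_block(lines: List[Line], left: float, right: float,
                max_gap: float, tol: float = 1.0) -> List[Paragraph]:
    """Split one character block (lines ordered top to bottom) into paragraphs."""
    groups: List[List[Line]] = []
    cur: List[Line] = []
    for prev, ln in zip([None] + lines[:-1], lines):
        new_para = prev is not None and (
            ln.y - prev.y > max_gap    # unusually large line spacing
            or ln.x0 > left + tol      # indentation at the start of this line
            or prev.x1 < right - tol   # previous line ended before the right edge
        )
        if new_para:
            groups.append(cur)
            cur = []
        cur.append(ln)
    if cur:
        groups.append(cur)
    return [Paragraph(g, g[0].x0 > left + tol, g[-1].x1 < right - tol) for g in groups]


def merge_blocks(blocks: List[List[Paragraph]]) -> List[Paragraph]:
    """Concatenate the paragraphs of ordered blocks, joining split paragraphs."""
    out: List[Paragraph] = []
    for paras in blocks:
        for i, p in enumerate(paras):
            joins_previous = (i == 0 and bool(out) and out[-1].font == p.font
                              and not out[-1].ends_short and not p.starts_indented)
            if joins_previous:
                last = out[-1]
                out[-1] = Paragraph(last.lines + p.lines, last.starts_indented, p.ends_short)
            else:
                out.append(p)
    return out
```

With two columns, a paragraph that runs off the bottom of the left column and continues at the top of the right column is rejoined by `merge_blocks`, while paragraphs that end or start cleanly at the column boundary are left separate.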
CN201610694835.0A 2016-08-19 2016-08-19 A kind of format document paragraph recognition methods CN106326854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610694835.0A CN106326854B (en) 2016-08-19 2016-08-19 A kind of format document paragraph recognition methods

Publications (2)

Publication Number Publication Date
CN106326854A (en) 2017-01-11
CN106326854B (en) 2019-09-06

Family

ID=57744794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610694835.0A CN106326854B (en) 2016-08-19 2016-08-19 A kind of format document paragraph recognition methods

Country Status (1)

Country Link
CN (1) CN106326854B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802884B (en) * 2017-02-17 2020-09-22 同方知网(北京)技术有限公司 Method for fragmenting text of layout document
CN106980607B (en) * 2017-03-31 2018-06-22 掌阅科技股份有限公司 Paragraph recognition methods, device and terminal device
CN107391457B (en) * 2017-07-26 2020-10-27 成都科来软件有限公司 Document segmentation method and device based on text line
CN107798321A (en) * 2017-12-04 2018-03-13 海南云江科技有限公司 A kind of examination paper analysis method and computing device
CN110126458B (en) * 2019-04-01 2021-01-05 桂林市朗谷科技有限公司 Automatic PCB screen printing adjusting method and device and storage medium
CN110717323B (en) * 2019-10-17 2020-07-31 北京幻想纵横网络技术有限公司 Document seal dividing method and device, terminal and computer readable storage medium
CN112101317A (en) * 2020-11-17 2020-12-18 深圳壹账通智能科技有限公司 Page direction identification method, device, equipment and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW275115B (en) * 1994-08-31 1996-05-01 Telecomm Lab Dgt Motc Intelligent analyzing and processing system for missive-table document
CN101149790A (en) * 2007-11-14 2008-03-26 哈尔滨工程大学 Chinese printing style formula identification method
US7471826B1 (en) * 2008-03-31 2008-12-30 International Business Machines Corporation Character segmentation by slices
CN101789122A (en) * 2009-01-22 2010-07-28 佳能株式会社 Method and system for correcting distorted document image
CN103617422A (en) * 2013-10-29 2014-03-05 浙江工业大学 A social relation management method based on business card recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268127B (en) * 2014-09-22 2018-02-09 同方知网(北京)技术有限公司 A kind of method of electronics shelves layout files reading order analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
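Chinese Page Layout Analysis System Based on Background Spacing (基于背景间隔的中文版面分析系统); Yang Ning (杨宁); China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology; 2002-12-15 (No. 02); pp. 33-41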

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant