CN104951429A - Recognition method and device for page headers and page footers of format electronic document - Google Patents

Info

Publication number
CN104951429A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410117009.0A
Other languages
Chinese (zh)
Inventor
吴运俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201410117009.0A
Publication of CN104951429A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for recognizing page headers and page footers in a format electronic document. The method comprises: parsing multiple pages of the format electronic document to obtain the text content of every text line on each page; traversing the text content of every text line on each page and judging whether each text line has the characteristics of a page header or footer; and determining, from the judgment results, the text lines in which the page headers and footers are located. The method infers whether a given line is a page header or footer from the similarity of the content that appears on that line across multiple pages, together with the type area (page body region) inferred from that similarity. Because the method does not rely on a single fixed definition of header and footer features or positions, it covers a much larger share of existing documents and achieves high recognition accuracy.

Description

Recognition method and device for page headers and page footers of a format electronic document
Technical field
The present application relates to the technical field of document recognition, and in particular to a method and device for recognizing page headers and footers in a format electronic document.
Background
With the spread of handheld terminal devices, demand for reading on them keeps growing. As content carriers, most current electronic documents are PDF-based format electronic files converted from typesetting tools and typeset documents. The pages of such files are usually large and are poorly suited to reading on handheld terminals or other small-screen devices. The file formats better suited to reading on handheld devices are currently reflowable (stream-based) formats such as EPUB (Electronic Publication). With such a format, the page count and layout of the document are rearranged during reading, and the reader application also needs to filter out the document's page headers and footers automatically, so that the user can read the document continuously and have a better reading experience. To filter out headers and footers in this way, a program must solve the problem of automatically recognizing them in a format document file.
Several methods are commonly used to recognize page headers and footers. One uses a horizontal rule at the top of the page to locate them. Another locates them from the layout of text blocks on the page: for example, if a small block appears at the top and at the bottom of the page with a large block in between, the top and bottom blocks are taken to be the header and footer regions. Both methods place heavy demands on the features of the document; if the document lacks the corresponding feature, recognition accuracy is hard to guarantee. The horizontal-rule method only suits documents that have such a rule, and the layout-based method only works for documents in which the spacing between the headers or footers and the body text is distinctive and the headers and footers appear only at the top and bottom of the page. The technical problem that urgently needs solving is therefore how to recognize content such as headers and footers in a format electronic document more accurately, so that the content of the document can be distinguished and displayed more accurately.
Summary of the invention
The present application provides a method and device for recognizing page headers and footers in a format electronic document, which greatly increase coverage of existing documents and achieve high recognition accuracy.
The present application provides the following solutions:
A method for recognizing page headers and footers in a format electronic document, comprising:
parsing multiple pages of the format electronic document respectively to obtain the text content of each text line contained in each page;
traversing the text content of each text line in each page and judging whether each text line has the features of a page header or footer;
determining, according to the judgment results, the text lines in which the page headers and footers are located.
A method for recognizing page headers and footers in a format electronic document, comprising:
parsing multiple pages of the format electronic document respectively to obtain the text content of each text column contained in each page;
traversing the text content of each text column in each page and judging whether each text column has the features of a page header or footer;
determining, according to the judgment results, the text columns in which the page headers and footers are located.
A device for recognizing page headers and footers in a format electronic document, comprising:
a document parsing unit, configured to parse multiple pages of the format electronic document respectively to obtain the text content of each text line contained in each page;
a text line feature judging unit, configured to traverse the text content of each text line in each page and judge whether each text line has the features of a page header or footer;
a header and footer determining unit, configured to determine, according to the judgment results, the text lines in which the page headers and footers are located.
A device for recognizing page headers and footers in a format electronic document, comprising:
a document parsing unit, configured to parse multiple pages of the format electronic document respectively to obtain the text content of each text column contained in each page;
a text column feature judging unit, configured to traverse the text content of each text column in each page and judge whether each text column has the features of a page header or footer;
a header and footer determining unit, configured to determine, according to the judgment results, the text columns in which the page headers and footers are located.
According to the specific embodiments provided by the present application, the application discloses the following technical effects:
With the embodiments of the present application, when a format electronic document is to be displayed, multiple pages of the document can be parsed respectively to obtain the text content of each text line contained in each page; the text content of each text line in each page is traversed to judge whether each text line has the text features of a page header or footer; and the text lines in which the headers and footers are located are determined from the judgment results. The text features of header and footer lines are thus used to recognize the headers and footers in the document effectively. The method combines the similarity of the content appearing on a given line across multiple pages with the type area (page body region) inferred from that similarity to recognize whether a given line is header or footer content. Because it does not rely on a single fixed definition of header and footer features or positions, it greatly increases coverage of existing documents and achieves high recognition accuracy.
Of course, a product implementing the present application need not achieve all of the above advantages at the same time.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of character coordinates in a format electronic document in an embodiment of the present application;
Fig. 3 is a flowchart of another method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of the device provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of another device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application fall within the scope of protection of the present application.
In the embodiments of the present application, to recognize the headers and footers in a format electronic document more accurately and to make the recognition method more general, the characters in the document can first be recognized and the judgment then made in units of text lines. Because a header or footer is itself a specific text line, this approach is not limited by special formatting in the file, and headers and footers can be recognized effectively even when the page carries no obvious mark such as a horizontal rule. Details follow.
Embodiment one:
Referring to Fig. 1, Embodiment one of the present application provides a flowchart of a method for recognizing page headers and footers in a format electronic document. As shown in the figure, the method may include the following steps:
S101: Parse multiple pages of the format electronic document respectively to obtain the text content of each text line contained in each page.
In the embodiments of the present application, when a format electronic document needs to be processed before it is displayed, the document can first be parsed to obtain the text content of each text line contained in each of its pages. The content of a page in a format electronic document is usually mostly text, as in an electronic novel. In some documents, however, the text is contained in images, as in a format electronic document produced by scanning; in that case, text recognition can first be performed on the images to extract the text. Images that contain no text are mostly illustrations and can be recognized, located, or filtered directly as image elements.
While parsing the document to obtain the text content of each text line in each page, the text content can be recognized character by character. To make header and footer recognition more efficient and the later determination more accurate, the characters obtained by parsing can be merged line by line during parsing. Specifically, the text in the electronic document is first parsed character by character, which identifies each character (for example, a word character or a punctuation mark) together with its position information. As for position information, in scripts such as Chinese each character generally occupies the same area when typeset and can be represented by a rectangular box. As shown in Fig. 2, for example, each character in the document lies in its own rectangular box (the box is not drawn when the document is actually displayed), so the position of each character can be represented by the position of its box. Specifically, the position information can be expressed as two-dimensional coordinates of the character in the page, and for each character a minimum x coordinate, minimum y coordinate, maximum x coordinate, and maximum y coordinate can be determined. In the rectangular box shown in Fig. 2, for example, all points on side AD share the same x coordinate, which is the minimum x coordinate of the character; all points on side AB share the same y coordinate, which is the minimum y coordinate of the character; all points on side BC share the same x coordinate, which is the maximum x coordinate of the character; and all points on side CD share the same y coordinate, which is the maximum y coordinate of the character. In other words, the minimum x coordinate, minimum y coordinate, maximum x coordinate, and maximum y coordinate uniquely determine the position of a rectangular box, and hence the position of the character inside it.
In short, parsing multiple pages of a format electronic document yields the character content of each text line contained in each page together with the position data of each character. During this process, the characters can be divided into text lines according to their position data; for example, characters with the same minimum y coordinate or the same maximum y coordinate can be identified as belonging to the same text line.
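As an illustration only, the division of characters into text lines can be sketched in Python. The `Char` record, the `group_into_lines` function, and the tolerance `tol` are assumptions made for this sketch and do not appear in the application, which speaks only of characters sharing the same minimum or maximum y coordinate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """One parsed character and its rectangular box (hypothetical record)."""
    text: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def group_into_lines(chars: List[Char], tol: float = 0.5) -> List[str]:
    """Group characters whose minimum y coordinates agree within `tol`."""
    lines: List[List[Char]] = []
    # Sorting by min_y puts characters of the same line next to each other.
    for ch in sorted(chars, key=lambda c: c.min_y):
        if lines and abs(lines[-1][0].min_y - ch.min_y) <= tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within a line, characters are read from left to right (increasing x).
    return ["".join(c.text for c in sorted(line, key=lambda c: c.min_x))
            for line in lines]
```

A small tolerance is used instead of exact equality because coordinates extracted from real files are often floating-point values; setting `tol=0` reproduces the strict rule in the text. Lines are returned in order of increasing minimum y, which may be top-down or bottom-up depending on the coordinate system of the parser.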
S102: Traverse the text content of each text line in each page and judge whether each text line has the features of a page header or footer.
Header and footer content in a format electronic document has certain characteristics: it recurs across multiple pages, its position is fixed, and its content is identical or similar. For example, some headers or footers contain the same document title. Some documents include a page number alongside the document title, in which case the headers or footers of different pages are not exactly identical but are highly similar. Some headers and footers consist entirely of digits. These characteristics can be used to recognize headers and footers. In the method provided by the embodiments of the present application, the text content of each text line in each page can be traversed to obtain the text features of the lines in the document, and it is then judged whether each text line has the features of a header or footer. Specific examples follow.
First, in one implementation, after a text line is obtained, it can be judged whether any other page contains a text line whose similarity to the current text line meets a preset condition; if so, the current text line is judged to have the text features of a header or footer. This approach exploits the fact that headers and footers recur across the pages of a document. To judge whether a given line is a header or footer, a certain number of lines of text starting from the current text line can be taken as a comparison sample and compared with the current line; if some line in the sample is identical or highly similar to the current line, the current text line is judged to have the text features of a header or footer. The number of lines sampled can usually be more than twice the maximum number of lines on a single page of the document. The reason is that the goal is to find, in the comparison sample, a line identical or highly similar to the current line, and headers and footers recur on different pages; for the comparison to be effective, the sample must extend beyond the current page, that is, it must usually be large enough to reach at least the same position on the next page.
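A minimal sketch of this first implementation follows, under stated assumptions. The function name `find_repeated_lines`, the `threshold` value, and the `window` parameter are hypothetical. The comparison sample is simplified to whole neighbouring pages rather than a fixed number of following lines, and the standard-library `difflib.SequenceMatcher` is used only as a stand-in similarity measure; it is not the algorithm named in the application.

```python
from difflib import SequenceMatcher
from typing import List, Set, Tuple


def find_repeated_lines(pages: List[List[str]],
                        threshold: float = 0.9,
                        window: int = 1) -> Set[Tuple[int, int]]:
    """Return (page index, line index) pairs whose text recurs on a nearby page."""
    flagged: Set[Tuple[int, int]] = set()
    for p, page in enumerate(pages):
        lo = max(0, p - window)
        hi = min(len(pages), p + window + 1)
        # Comparison sample: every line on the neighbouring pages.
        others = [line for q in range(lo, hi) if q != p for line in pages[q]]
        for i, line in enumerate(page):
            if any(SequenceMatcher(None, line, other).ratio() >= threshold
                   for other in others):
                flagged.add((p, i))
    return flagged
```

In this sketch a line is flagged as soon as one sufficiently similar line is found on a neighbouring page, which mirrors the "identical or highly similar" test in the text without taking line positions into account.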
In addition, within a format electronic document, or within some of its pages, headers and footers account for a small share of the content compared with the body text. Not every line therefore needs to be judged; certain characteristics of headers and footers can be used to screen the current lines that need to be compared. For example, a line made up purely of digits may be a page number in a header or footer. As another example, statistics over a large number of documents show that certain punctuation marks rarely occur in the headers and footers of most format electronic documents: header and footer text seldom contains the comma, full stop, question mark, or other marks typically used to delimit sentences and phrases. Such punctuation marks can therefore be used to screen the text lines to be judged, and only the lines that pass the screening are then compared for similarity with text lines on other pages. That is, it can first be judged whether the current text line contains punctuation marks, and then whether any other page contains a text line whose similarity to the current text line meets the preset condition; if so, the current text line is judged to have the features of a header or footer.
Specifically, while traversing the text lines of each page, it can first be judged whether the current text line contains punctuation marks. If it does not, text lines that contain no punctuation marks are obtained as target text lines from a preset number of pages before and/or after the current page. The current text line is compared with each target text line to obtain the similarity between them. If the similarity meets the preset condition, the current text line is judged to have the text features of a header or footer. Using the presence or absence of punctuation marks in this way screens suspected header and footer content in advance, and the same screening is applied to the target lines used as the comparison sample. Whether the current line has the text features of a header or footer is then inferred back from the comparison between target line and current line, which reduces the workload and makes header and footer recognition more efficient.
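The punctuation screening can be sketched as follows. The set `SENTENCE_PUNCTUATION`, the function names, and the `window` parameter (the preset number of pages) are assumptions for illustration; the application does not list the exact punctuation marks.

```python
from typing import List

# Hypothetical set of sentence-delimiting marks, Western and Chinese forms.
SENTENCE_PUNCTUATION = set(",.?!;，。？！；")


def has_sentence_punctuation(line: str) -> bool:
    """True if the line contains any sentence-delimiting punctuation mark."""
    return any(ch in SENTENCE_PUNCTUATION for ch in line)


def candidate_targets(pages: List[List[str]],
                      page_index: int,
                      window: int = 2) -> List[str]:
    """Collect punctuation-free lines from up to `window` pages on each side."""
    lo = max(0, page_index - window)
    hi = min(len(pages), page_index + window + 1)
    return [line
            for i in range(lo, hi) if i != page_index
            for line in pages[i]
            if not has_sentence_punctuation(line)]
```

Only a current line for which `has_sentence_punctuation` returns `False` would then be compared against the lines returned by `candidate_targets`, so most body lines are skipped before any similarity is computed.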
After the target lines for comparison are obtained, the current line can be compared with a target line to obtain the similarity between the current text line and the target text line on another page. In the comparison, it can first be judged whether the text content of the current text line and the target text line is identical, for instance by comparing the two lines character by character to see whether the characters at each position are the same. If they are identical, the similarity between the two lines in text content is judged to meet the preset condition. If they are not exactly identical, a preset algorithm is used to obtain the similarity in text content. Specifically, the similarity can be determined from the total number of characters in the current line or target line and the number of characters the two have in common. The preset algorithm compares the characters of the current line and the target line to obtain their text similarity; the Levenshtein distance (edit distance) algorithm, for example, can be used. The Levenshtein distance between two strings is the minimum number of edit operations needed to turn one into the other, and it can be used to measure the similarity of two strings. With the characters of the current line and the target line as input, a string comparison based on the Levenshtein distance outputs a score for the similarity between the two.
In comparing the current line with the target line to obtain their text similarity, the following steps can be taken. First, the characters of the two lines are compared one by one in string order to judge whether the text of the current line is identical to that of the target line. If it is, the similarity is 100% and the Levenshtein distance comparison is unnecessary. If the two texts are not exactly identical, the Levenshtein distance algorithm is used to obtain the text similarity of the current line and the target line.
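The two-step comparison can be sketched as below. The Levenshtein distance is computed with the usual dynamic-programming recurrence; converting the distance to a score by dividing by the longer length is an assumption of this sketch, since the application does not give the exact formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = curr
    return prev[-1]


def similarity(current: str, target: str) -> float:
    """Return 1.0 for identical text, else a Levenshtein-based score in [0, 1]."""
    if current == target:
        # Identical lines: no need to run the edit-distance computation.
        return 1.0
    longest = max(len(current), len(target))
    return 1.0 - levenshtein(current, target) / longest
```

With this normalization, two headers that differ only in a one-digit page number score close to 1, while unrelated lines score much lower.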
The comparison described above mainly considers the similarity between the text content of two text lines. The position data of the characters in the current and target text lines on their respective pages can also be taken into account to make the similarity judgment more accurate. Specifically, when the text content of the current text line and the target text line is judged not to be exactly identical, the position data of the characters contained in each line on its own page is obtained. If the difference between the position data of the current text line and that of the target text line is smaller than a preset threshold, the step of using the preset algorithm to obtain the similarity in text content is triggered. Otherwise, if the difference is greater than the preset threshold, the similarity between the current text line and the target text line is judged not to meet the preset condition.
This exploits the fact that headers and footers occupy relatively fixed positions on different pages (for example, the header may always appear in the upper right corner of each page). Within one electronic document the position of the header or footer is relatively consistent from page to page, so if the current line and the target line are in different positions, the two can be considered dissimilar. To obtain the position information of a text line, the first and last characters of the line can first be determined; the region of the rectangular box bounded by the minimum x coordinate of the first character, the maximum x coordinate of the last character, and the minimum and maximum y coordinates of the line is then the position of the line. To compare the position data of the current line and the target line, the minimum horizontal and vertical coordinates of each line relative to the origin of its own page can be obtained and the height difference between the two calculated. If the height difference falls outside a certain range, for example exceeds a certain threshold, the similarity of the two is taken to be 0. If the height difference does not exceed the threshold, the preset algorithm is then used to obtain the similarity. Judging from the position data whether the current line and the target line are in the same or nearby positions before running the preset algorithm further improves the efficiency of header and footer recognition.
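A sketch of the position gate follows. The `Line` record, the `max_offset` threshold, and the function name are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for the preset similarity algorithm, which this sketch does not reproduce.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Line:
    """A text line and its bounding box relative to the page origin (hypothetical)."""
    text: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def gated_similarity(current: Line, target: Line, max_offset: float = 5.0) -> float:
    """Similarity of two lines, forced to 0 when their heights differ too much."""
    if current.text == target.text:
        return 1.0                      # identical text: no further checks
    if abs(current.min_y - target.min_y) > max_offset:
        return 0.0                      # too far apart vertically: dissimilar
    # Stand-in for the preset algorithm (for example, an edit-distance score).
    return SequenceMatcher(None, current.text, target.text).ratio()
```

Because the height check is a single subtraction, it rejects most body lines before the more expensive string comparison runs, which is the efficiency gain the text describes.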
After the similarity between the current line and the target line is obtained, it can be compared with a preset threshold. Because the header and footer content on the pages of one electronic document is identical or highly similar, a relatively high threshold can be chosen to make header and footer recognition more precise. The form of the preset threshold can be set to match the form in which the similarity is expressed. For example, when the similarity is expressed as a percentage (represented in the computer as a floating-point number), the preset threshold is also a floating-point number, such as 0.96; when the text similarity is expressed as a natural number between 1 and 100, the preset threshold is also a natural number between 1 and 100, such as 96. If the similarity between the current line and the target line exceeds the preset threshold, the current line can be determined to have the text features of a header or footer.
After the text lines that have header and footer features are determined, they can be taken directly as headers and footers. To further improve the accuracy of header and footer recognition, however, after all text lines with header and footer text features have been obtained across all pages, the region occupied by the body text in each page can be computed, and the lines with header and footer features can be judged again against that region. Specifically, text lines that have the text features of a header or footer are designated suspected header/footer lines. For each page of the document, the region occupied by the body content of the page is determined from the position data of all text lines in the page that are not suspected header/footer lines. Then, from the position data of the characters in each suspected header/footer line, it is judged whether the line falls within the body content region of its page. If it does, the suspected line is judged not to be a header/footer line; otherwise, it is judged to be a header/footer line.
Specifically, the text lines on all pages of the document that do not have header and footer features can be taken as first content. The first content is traversed to obtain the two-dimensional coordinates of each of its lines in each page of the electronic document, and the minimum horizontal coordinate, minimum vertical coordinate, maximum horizontal coordinate, and maximum vertical coordinate among those coordinates are computed. These four values determine the position of the type area of the electronic document. The type area is the region occupied by the body content; it can be used to judge the text lines with header and footer features again, which further improves the accuracy of recognizing headers and footers in format electronic documents.
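The second check against the body region can be sketched as follows. The `Box` tuple layout and the function names are assumptions of this sketch, and "falls within" is interpreted here as full containment of the line's box in the body region.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def body_region(boxes: List[Box]) -> Box:
    """Smallest box enclosing all lines that are not suspected headers/footers."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def inside(box: Box, region: Box) -> bool:
    """True if `box` lies entirely within `region`."""
    return (box[0] >= region[0] and box[1] >= region[1]
            and box[2] <= region[2] and box[3] <= region[3])


def confirm_header_footer(suspects: List[Box], non_suspects: List[Box]) -> List[Box]:
    """Keep only the suspected lines that fall outside the body region."""
    if not non_suspects:
        return list(suspects)  # no body text to compare against
    region = body_region(non_suspects)
    return [s for s in suspects if not inside(s, region)]
```

A repeated line in the middle of the page, such as a recurring table caption, is thereby demoted to body text, while a repeated line above or below the body region stays classified as a header or footer.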
S103: Determine, according to the judgment results, the text lines in which the page headers and footers are located.
Once it has been judged whether each text line in the format electronic document has the text features of a header or footer, the text lines that have those features can be treated as headers and footers. When the document is displayed, for example, the text lines containing headers and footers can be hidden so that only the body content is shown.
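As a small illustration of the display step, the sketch below drops the lines identified as headers or footers and keeps the rest. The representation of the result as a set of (page index, line index) pairs and the function name `visible_lines` are assumptions of this sketch.

```python
from typing import List, Set, Tuple


def visible_lines(pages: List[List[str]],
                  flagged: Set[Tuple[int, int]]) -> List[List[str]]:
    """Return each page's lines with the flagged header/footer lines removed."""
    return [[line for i, line in enumerate(page) if (p, i) not in flagged]
            for p, page in enumerate(pages)]
```

The remaining lines can then be reflowed into a continuous text for a small screen.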
The method for recognizing page headers and footers in a format electronic document provided by the present application has been described in detail above. With this method, when a format electronic document is to be displayed, multiple pages of the document can be parsed respectively to obtain the text content of each text line contained in each page; the text content of each text line in each page is traversed to judge whether each text line has the text features of a page header or footer; and the text lines in which the headers and footers are located are determined from the judgment results. The text features of header and footer lines are thus used to recognize the headers and footers in the document effectively. The method combines the similarity of the content appearing on a given line across multiple pages with the type area (page body region) inferred from that similarity to recognize whether a given line is a header or footer. Because it does not rely on a single fixed definition of header and footer features or positions, it greatly increases coverage of existing documents and achieves high recognition accuracy.
Embodiment two
Embodiment one above provides a method for recognizing headers and footers mainly for electronic documents whose characters are arranged horizontally. In practice there may also be electronic texts whose characters are arranged vertically. In a vertically arranged electronic document, the characters of one column can be merged into a column of text according to the coordinates of each character. In the overall process, the characters are divided into columns according to the position data of each character on its page, the text content of each text column contained in each page is obtained from the division result, and it is then judged whether each text column has the features of a header or footer. This is described in detail below.
Embodiment two of the present application also provides a method for recognizing page headers and footers in a format electronic document, suitable for format electronic documents whose text is arranged vertically. As shown in Fig. 3, the method may include the following steps:
S301: Parse multiple pages of the format electronic document respectively to obtain the text content of each text column contained in each page;
S302: Traverse the text content of each text column in each page and judge whether each text column has the text features of a page header or footer;
S303: Determine, according to the judgment results, the text columns in which the page headers and footers are located.
The method provided by this embodiment can use the text features of headers and footers to recognize effectively the headers and footers in a format electronic document whose text is arranged vertically. Note also that the method of Embodiment two and the method of Embodiment one can be cross-referenced, so the details are not repeated here.
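The column division for vertically arranged text can be sketched in the same spirit as the line division of Embodiment one. The `Char` record, the `group_into_columns` function, and the tolerance `tol` are assumptions of this sketch; it also assumes that y increases in the reading direction within a column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """One parsed character and its rectangular box (hypothetical record)."""
    text: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def group_into_columns(chars: List[Char], tol: float = 0.5) -> List[str]:
    """Group characters whose minimum x coordinates agree within `tol`."""
    columns: List[List[Char]] = []
    # Sorting by min_x puts characters of the same column next to each other.
    for ch in sorted(chars, key=lambda c: c.min_x):
        if columns and abs(columns[-1][0].min_x - ch.min_x) <= tol:
            columns[-1].append(ch)
        else:
            columns.append([ch])
    # Within a column, characters are read in order of increasing y.
    return ["".join(c.text for c in sorted(col, key=lambda c: c.min_y))
            for col in columns]
```

Columns are returned in order of increasing x; a reader for right-to-left vertical text would reverse that order, but the header and footer test itself only needs the text of each column.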
Corresponding to the header and footer recognition method provided by embodiment one of the present application, a header and footer recognition device for a format electronic document is also provided. As shown in Fig. 4, the device comprises:
Document parsing unit 401, configured to parse multiple pages of the format electronic document respectively and obtain the text content of each text line contained in each page;
Text line feature judging unit 402, configured to traverse the text content of each text line in each page and judge whether each text line meets the features of a header or footer;
Header and footer determining unit 403, configured to determine the text lines where the headers and footers are located according to the judgment results.
For the current text line in the current page, the text line feature judging unit 402 may comprise:
A first text feature judging sub-unit, configured to judge whether the other pages contain a text line whose similarity to the current text line meets a preset condition, and if so, to determine that the current text line meets the text features of a header or footer.
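The judgment made by the first sub-unit can be sketched as below. The source does not name the similarity measure or the preset condition, so this sketch assumes Python's standard `difflib.SequenceMatcher` ratio and a threshold of 0.8; `meets_feature` is a hypothetical name.

```python
from difflib import SequenceMatcher


def meets_feature(current_line, other_pages, threshold=0.8):
    """Return True if any text line on any other page is similar enough.

    other_pages is a list of pages, each a list of text-line strings.
    The measure and the 0.8 threshold are assumptions for illustration.
    """
    for page in other_pages:
        for line in page:
            if SequenceMatcher(None, current_line, line).ratio() >= threshold:
                return True
    return False
```

A running header that differs only in its page number, such as "Annual Report 2013 - Page 1" and "Annual Report 2013 - Page 2", passes this check, while ordinary body text rarely has a close match on another page.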
Alternatively, for the current text line in the current page, the text line feature judging unit 402 may comprise:
A second text feature judging sub-unit, configured to judge whether the current text line contains punctuation marks and whether the other pages contain a text line whose similarity to the current text line meets a preset condition; if the current text line contains no punctuation marks and such a text line exists, the current text line is determined to meet the text features of a header or footer.
The second text feature judging sub-unit may specifically comprise:
A symbol judging sub-unit, configured to judge whether the current text line contains punctuation marks;
A target text line determining sub-unit, configured to, if the current text line contains no punctuation marks, obtain the text lines that contain no punctuation marks from a preset number of pages before and/or after the current page as target text lines;
A comparison sub-unit, configured to compare the current text line with a target text line and obtain the similarity between them;
A judging sub-unit, configured to determine that the current text line meets the text features of a header or footer if the similarity meets the preset condition.
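The four sub-units above can be sketched together as follows. The set of punctuation marks, the window of two pages, the 0.8 threshold, and the use of `difflib.SequenceMatcher` are all assumptions made for illustration, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

# Assumed set of sentence punctuation; the source does not enumerate the marks.
SENTENCE_PUNCTUATION = set(",.;:?!，。、；：？！")


def has_punctuation(line):
    """Symbol judgment: does the line contain any punctuation mark?"""
    return any(ch in SENTENCE_PUNCTUATION for ch in line)


def target_lines(pages, page_index, window=2):
    """Punctuation-free lines from up to `window` pages before and after."""
    lo = max(0, page_index - window)
    hi = min(len(pages), page_index + window + 1)
    return [line
            for i in range(lo, hi) if i != page_index
            for line in pages[i] if not has_punctuation(line)]


def meets_text_feature(pages, page_index, line, window=2, threshold=0.8):
    """Compare a punctuation-free line against the target lines."""
    if has_punctuation(line):
        return False
    return any(SequenceMatcher(None, line, t).ratio() >= threshold
               for t in target_lines(pages, page_index, window))
```

Filtering on punctuation first keeps the number of comparisons small, since most body lines contain punctuation and are never compared at all.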
Specifically, the similarity between the current text line and a target text line in another page may be obtained by the following units:
A text content judging unit, configured to judge whether the text content of the current text line and that of the target text line are identical;
A first determining unit, configured to determine, if they are identical, that the similarity in text content between the current text line and the target text line meets the preset condition;
A similarity calculating unit, configured to use a preset algorithm, if they are not identical, to obtain the similarity in text content between the current text line and the target text line.
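The two-stage comparison above can be sketched as below. The source only speaks of a preset algorithm; this sketch substitutes a normalized Levenshtein edit distance as one possible choice, and `edit_distance` and `similarity` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(current, target):
    """Return 1.0 for identical text, else 1 - distance / longer length."""
    if current == target:
        return 1.0  # identical content needs no further computation
    longest = max(len(current), len(target))
    return 1.0 - edit_distance(current, target) / longest
```

Checking for identical content first avoids the quadratic distance computation for the common case of a header that repeats verbatim on every page.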
In another implementation, the device may further comprise:
A position data acquiring unit, configured to obtain, if the text content of the current text line and that of the target text line are determined not to be identical, the position data of the characters contained in the current text line and in the target text line within their respective pages;
A triggering unit, configured to trigger the step of using the preset algorithm to obtain the similarity in text content between the current text line and the target text line if the gap between the position data of the current text line and that of the target text line is less than a preset threshold;
A second determining unit, configured to determine, otherwise, if the gap is greater than the threshold, that the similarity between the current text line and the target text line does not meet the preset condition.
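The position check that gates the similarity computation can be sketched as follows. The sketch reduces the position data to a single vertical coordinate per line and uses `difflib.SequenceMatcher` in place of the preset algorithm; both simplifications, the `max_gap` and `threshold` values, and the function name are assumptions. The source does not say how a gap exactly equal to the threshold is handled, and the sketch treats it as failing.

```python
from difflib import SequenceMatcher


def similarity_meets_condition(cur_text, cur_y, tgt_text, tgt_y,
                               max_gap=5.0, threshold=0.8):
    """Decide whether two lines on different pages are similar enough."""
    if cur_text == tgt_text:
        return True  # identical content: the condition is met directly
    if abs(cur_y - tgt_y) >= max_gap:
        return False  # lines sit at different heights: skip the comparison
    # Positions are close, so run the (assumed) similarity algorithm.
    return SequenceMatcher(None, cur_text, tgt_text).ratio() >= threshold
```

Because headers and footers normally appear at nearly the same position on every page, the position check cheaply rules out lines that merely resemble each other in wording.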
In a specific implementation, the header and footer determining unit 403 may comprise:
A suspected header and footer determining sub-unit, configured to determine the text lines that meet the text features of a header or footer as suspected header or footer lines;
A body region determining sub-unit, configured to determine, in each page respectively, the region occupied by the body content of the page according to the position data of all text lines in the page that are not suspected header or footer lines;
A judging sub-unit, configured to judge, according to the position data of the characters in a suspected header or footer line, whether the suspected line falls within the region occupied by the body content of its page;
A header and footer determining sub-unit, configured to determine the suspected line to be a non-header/footer line if the judgment result of the judging sub-unit is yes, and otherwise to determine the suspected line to be a header or footer line.
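The confirmation step performed by these sub-units can be sketched for a single page as below. Each line is represented by its text and its top and bottom coordinates, which is an assumed simplification of the position data; `confirm_headers_footers` is a hypothetical name. The source does not cover a page on which every line is suspected, and the sketch keeps all suspects in that case.

```python
def confirm_headers_footers(lines, suspected):
    """Keep only suspected lines that lie outside the page's body region.

    lines: list of (text, top, bottom) for one page, y growing downward.
    suspected: set of indices into `lines` that met the text features.
    Returns the set of indices confirmed as header or footer lines.
    """
    body = [(top, bottom) for i, (_, top, bottom) in enumerate(lines)
            if i not in suspected]
    if not body:
        return set(suspected)  # assumption: no body region to test against
    body_top = min(top for top, _ in body)
    body_bottom = max(bottom for _, bottom in body)
    confirmed = set()
    for i in suspected:
        _, top, bottom = lines[i]
        inside_body = top >= body_top and bottom <= body_bottom
        if not inside_body:
            confirmed.add(i)
    return confirmed
```

This step filters out lines such as repeated table headings, which recur across pages like a header does but sit between ordinary body lines.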
In addition, the document parsing unit 401 may specifically comprise:
A recognition sub-unit, configured to parse the format electronic document and identify the content of each character in the format electronic document and the position data of each character in the page where it is located;
A text line dividing sub-unit, configured to divide the characters into lines according to the position data of each character in the page where it is located, and to obtain the text content of each text line contained in each page according to the division result.
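The text line division can be sketched as below for one page. The character tuples, the `y_tolerance` value, and the function name `divide_into_lines` are assumptions for illustration; a real parser would supply the character content and position data.

```python
def divide_into_lines(chars, y_tolerance=2.0):
    """Divide one page's characters into text lines by vertical position.

    chars: iterable of (text, x, y, height) tuples, y growing downward.
    Returns a list of (line_text, top, bottom), ordered from top to bottom.
    """
    lines = []  # each entry: {"y": reference y, "chars": [...]}
    for ch in sorted(chars, key=lambda c: c[2]):
        if lines and abs(ch[2] - lines[-1]["y"]) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"y": ch[2], "chars": [ch]})
    result = []
    for line in lines:
        ordered = sorted(line["chars"], key=lambda c: c[1])  # left to right
        text = "".join(c[0] for c in ordered)
        top = min(c[2] for c in ordered)
        bottom = max(c[2] + c[3] for c in ordered)
        result.append((text, top, bottom))
    return result
```

Keeping the top and bottom of each line alongside its text makes the position data available to later steps, such as comparing line positions across pages or locating the body region.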
Corresponding to the header and footer recognition method provided by embodiment two of the present application, an embodiment of the present application further provides a header and footer recognition device for a format electronic document. Referring to Fig. 5, the device may comprise:
Document parsing unit 501, configured to parse multiple pages of the format electronic document respectively and obtain the text content of each text column contained in each page;
Text column feature judging unit 502, configured to traverse the text content of each text column in each page and judge whether each text column meets the features of a header or footer;
Header and footer determining unit 503, configured to determine the text columns where the headers and footers are located according to the judgment results.
The header and footer recognition device provided by the embodiments of the present application has been introduced above. With this device, when a format electronic document is displayed, multiple pages of the document can be parsed respectively to obtain the text content of each text line contained in each page; the text content of each text line in each page is traversed to judge whether each text line meets the text features of a header or footer; and the text lines where the headers and footers are located are determined according to the judgment results. The text features of header and footer lines are thus used to effectively recognize the headers and footers in the format electronic document. Whether a given line in the document is a header or footer can be inferred backward from the similarity of the content of that line across multiple pages, together with the pages obtained from that similarity. Because the method does not rely on a fixed definition of the feature values or positions of headers and footers, it achieves high recognition accuracy.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For parts that are identical or similar among the embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The header and footer recognition method and device for a format electronic document provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A method for recognizing headers and footers of a format electronic document, characterized by comprising:
parsing multiple pages of the format electronic document respectively, and obtaining the text content of each text line contained in each page;
traversing the text content of each text line in each page, and judging whether each text line meets the features of a header or footer;
determining the text lines where the headers and footers are located according to the judgment results.
2. The method according to claim 1, characterized in that, for the current text line in the current page, whether it meets the features of a header or footer is judged in the following manner:
judging whether the other pages contain a text line whose similarity to the current text line meets a preset condition, and if so, determining that the current text line meets the features of a header or footer.
3. The method according to claim 1, characterized in that, for the current text line in the current page, whether it meets the features of a header or footer is judged in the following manner:
judging whether the current text line contains punctuation marks, and whether the other pages contain a text line whose similarity to the current text line meets a preset condition, and if so, determining that the current text line meets the features of a header or footer.
4. The method according to claim 3, characterized in that said judging whether the current text line contains punctuation marks, and whether the other pages contain a text line whose similarity to the current text line meets a preset condition, comprises:
judging whether the current text line contains punctuation marks;
if the current text line contains no punctuation marks, obtaining the text lines that contain no punctuation marks from a preset number of pages before and/or after the current page as target text lines;
comparing said current text line with said target text line, and obtaining the similarity between said current text line and said target text line;
if said similarity meets the preset condition, determining that the current text line meets the text features of a header or footer.
5. The method according to any one of claims 2 to 4, characterized in that the similarity between the current text line and a target text line in another page is obtained in the following manner:
judging whether the text content of said current text line and that of said target text line are identical;
if they are identical, determining that the similarity in text content between the current text line and said target text line meets the preset condition;
if they are not identical, using a preset algorithm to obtain the similarity in text content between said current text line and said target text line.
6. The method according to claim 5, characterized by further comprising:
if the text content of the current text line and that of said target text line are determined not to be identical, obtaining the position data of the characters contained in said current text line and in said target text line within their respective pages;
if the gap between said position data of said current text line and of said target text line is less than a preset threshold, triggering said step of using a preset algorithm to obtain the similarity in text content between said current text line and said target text line;
otherwise, if said gap is greater than said threshold, determining that the similarity between the current text line and said target text line does not meet the preset condition.
7. The method according to any one of claims 1 to 5, characterized in that said determining the text lines where the headers and footers are located according to the judgment results comprises:
determining the text lines that meet the text features of a header or footer as suspected header or footer lines;
in each page respectively, determining the region occupied by the body content of the page according to the position data of all text lines in the page that are not suspected header or footer lines;
judging, according to the position data of the characters in said suspected header or footer line, whether said suspected header or footer line falls within the region occupied by the body content of its page;
if so, determining the suspected header or footer line to be a non-header/footer line, and otherwise determining the suspected header or footer line to be a header or footer line.
8. The method according to claim 1, characterized in that said parsing multiple pages of the format electronic document respectively, and obtaining the text content of each text line contained in each page, comprises:
parsing the format electronic document, and identifying the content of each character in said format electronic document and the position data of each character in the page where it is located;
dividing said characters according to the position data of each character in the page where it is located, and obtaining the text content of each text line contained in each page according to the division result.
9. A method for recognizing headers and footers of a format electronic document, characterized by comprising:
parsing multiple pages of the format electronic document respectively, and obtaining the text content of each text column contained in each page;
traversing the text content of each text column in each page, and judging whether each text column meets the features of a header or footer;
determining the text columns where the headers and footers are located according to the judgment results.
10. A device for recognizing headers and footers of a format electronic document, characterized by comprising:
a document parsing unit, configured to parse multiple pages of the format electronic document respectively and obtain the text content of each text line contained in each page;
a text line feature judging unit, configured to traverse the text content of each text line in each page and judge whether each text line meets the features of a header or footer;
a header and footer determining unit, configured to determine the text lines where the headers and footers are located according to the judgment results.
11. A device for recognizing headers and footers of a format electronic document, characterized by comprising:
a document parsing unit, configured to parse multiple pages of the format electronic document respectively and obtain the text content of each text column contained in each page;
a text column feature judging unit, configured to traverse the text content of each text column in each page and judge whether each text column meets the features of a header or footer;
a header and footer determining unit, configured to determine the text columns where the headers and footers are located according to the judgment results.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410117009.0A CN104951429A (en) 2014-03-26 2014-03-26 Recognition method and device for page headers and page footers of format electronic document

Publications (1)

Publication Number Publication Date
CN104951429A true CN104951429A (en) 2015-09-30

Family

ID=54166092

Country Status (1)

Country Link
CN (1) CN104951429A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004110398A (en) * 2002-09-18 2004-04-08 Ricoh Co Ltd Document image feature detecting method, detecting program, recording medium, and document image feature detecting device
CN101017479A (en) * 2007-02-09 2007-08-15 北京大学 Method for automatically identifying digital document type page
CN101404006A (en) * 2008-11-10 2009-04-08 金蝶软件(中国)有限公司 Method, device, system and equipment for regulating output position of page header or footer
JP4328666B2 (en) * 2004-05-12 2009-09-09 キヤノン株式会社 Document processing device
CN102081732A (en) * 2010-12-29 2011-06-01 方正国际软件有限公司 Method and system for recognizing format template

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document
CN107526619A (en) * 2017-09-04 2017-12-29 江苏中威科技软件系统有限公司 The load mode of format data stream file
CN107526619B (en) * 2017-09-04 2019-01-25 江苏中威科技软件系统有限公司 The loading method of format data stream file
CN109978044A (en) * 2019-03-20 2019-07-05 广州云测信息技术有限公司 The training method and device of training data generation method and device and model
CN110543810A (en) * 2019-06-28 2019-12-06 南京智录信息科技有限公司 Technology for completely identifying header and footer of PDF (Portable document Format) file
CN110704570A (en) * 2019-08-13 2020-01-17 北京众信博雅科技有限公司 Continuous page layout document structured information extraction method
CN112329426A (en) * 2020-11-12 2021-02-05 北京方正印捷数码技术有限公司 Header and footer identification method, apparatus, device and medium for electronic file
CN112329426B (en) * 2020-11-12 2024-05-28 北京方正印捷数码技术有限公司 Method, device, equipment and medium for recognizing header and footer of electronic file
CN113033360A (en) * 2021-03-12 2021-06-25 理光图像技术(上海)有限公司 Document image recognition device and method
CN113221507A (en) * 2021-05-28 2021-08-06 掌阅科技股份有限公司 Document editing operation synchronization method, computing device and storage medium
CN116090417A (en) * 2023-04-11 2023-05-09 福昕鲲鹏(北京)信息科技有限公司 Layout document text selection rendering method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150930