A method for automatically extracting layout information from digital newspapers
Technical field
The present invention relates to pattern recognition technology in the field of computer information processing, and specifically to a method for automatically extracting layout information from digital newspapers.
Background
With the development of information technology and the Internet, the digitization of newspapers and periodicals is accelerating. Internet technology lets the public browse digitized newspaper content quickly and conveniently, makes full use of a newspaper publisher's resources, spreads news faster and more widely, and makes the publisher's website more attractive to readers.
At present, digital newspapers are indexed by organizing the content of each newspaper, for example by tagging every page with its layout information: publication date, page number, and page name. Because these data appear in different forms on different newspaper layouts, indexing tools have considerable difficulty extracting them automatically, so the publication date, page number, and page name are generally extracted by manual indexing.
Manual indexing is slow and limits throughput when large volumes of page data must be processed in batches. A way to automatically extract this information, which is always present on the page, is therefore needed to improve the efficiency of indexing digital newspapers.
Summary of the invention
The object of the invention is to address the shortcomings of current indexing of digital newspapers by providing a method for automatically extracting layout information from digital newspapers. The method combines spatial and semantic information to evaluate content, and automatically extracts the date, page name, and page number from a newspaper page.
The technical solution of the invention is as follows. A method for automatically extracting layout information from digital newspapers comprises the following steps:
(1) merging the independent text items on the page and organizing them into several content blocks;
(2) selecting candidate content blocks according to the positions that may contain the required layout information;
(3) screening the candidate content blocks obtained in step (2) by the features of date content, judging whether each is a content block containing the publication date, and extracting the content block that contains the publication date;
(4) screening the candidate content blocks obtained in step (2) by the features of page number content, judging whether each is a content block containing the page number, and extracting the content block that contains the page number;
(5) screening the candidate content blocks obtained in step (2) by the features of page name content, judging whether each is a content block containing the page name, and extracting the content block that contains the page name.
Further, in the method described above, the positions in step (2) that may contain the required layout information include the upper-left corner, the left side, the upper-right corner, and the top of the page.
Further, in the method described above, when judging in step (3) whether a block contains the publication date, coarse matching is performed first, followed by fine matching; if fine matching fails, a general matching rule is used. Among the matching results, the content block closest to the top of the page is selected.
Further, in the method described above, the date content feature used for coarse matching in step (3) is any one of the following:
1. xxxx year xx month xx day, week x, where "day" and "week" are separated by 0-2 characters;
2. xxxx.xx.xx, week x, where "week" and the preceding "xx" are separated by 0-2 characters;
3. xxxx year xx month, week x, where "month" and "week" are separated by 0-8 characters;
4. xxxx.xx, week x, where "week" and the preceding "xx" are separated by 0-8 characters;
where xxxx is 1-4 characters, xx is 1-2 characters, x is 1 character, and each character is a digit from the set {0123456789 123456789}.
Further, in the method described above, the date content feature used for fine matching in step (3) is any one of the following:
1. xxxx year xx month, week x, where "month" and "week" are separated by 0-8 characters;
2. xxxx.xx, week x, where "week" and the preceding "xx" are separated by 0-8 characters;
where xxxx is 1-4 characters, xx is 1-2 characters, x is 1 character, and each character is a digit from the set {0123456789 123456789}.
Further, in the method described above, the date content feature used by the general matching rule in step (3) is any one of the following:
1. xxxx year xx month;
2. xxxx.xx;
where xxxx is 1-4 characters, xx is 1-2 characters, and each character is a digit from the set {0123456789 123456789}.
Further, in the method described above, if in step (3) none of the candidate content blocks satisfies the date content features, all candidate content blocks are merged, and the merged content block is judged again against the date content features.
Further, in the method described above, if in step (4) a candidate content block exhibits any two of the following page number content features:
1. "this issue xx newspaper xx fold xx page",
where the xx in "xx newspaper" is any number of arbitrary characters, the xx in "xx fold" is any number of arbitrary characters, and the xx in "xx page" is 1-3 arbitrary characters;
2. "xx issue", "xx number",
where xx is any 1-5 characters;
3. a lunar calendar date is present;
then the content block contains page number information, and the page number is the front page.
A lunar calendar date is recognized by the following features:
A) it begins with the two characters for "lunar calendar";
B) the year in the date is an ordered pair of characters, one from the Heavenly Stems [甲乙丙丁戊己庚辛壬癸] and one from the Earthly Branches [子丑寅卯辰巳午未申酉戌亥];
C) the month is 1-3 characters.
Further, in the method described above, if in step (4) a candidate content block exhibits any one of the following page number content features:
1. "xx page",
where xx is 1-3 characters;
2. a letter followed by digits, or digits with no letter, with no more than three digits;
then the content block contains page number information.
Further, in the method described above, in step (5) a candidate content block is judged to contain the page name if it exhibits the following page name content features:
1. the content block intersects, in the x direction or in the y direction, the content block already determined to contain the publication date or the content block containing the page number;
2. the content of the block is a single line, its font size is greater than 15, and it contains between 2 and 9 characters;
3. the horizontal position of the block lies between 30% and 70% of the page width, and its vertical position lies between 5% and 30% of the page height.
Further, in the method described above, if in step (5) several candidate content blocks exhibit the page name content features, the block positioned highest on the page is selected.
The beneficial effects of the invention are as follows. Using the position and semantic information of the relevant content on a newspaper page, the invention automatically extracts the publication date, page number, and page name from the page. This simple, automated operation improves efficiency when large volumes of page data are processed in batches, reduces the workload of staff, and makes the indexing of digital newspapers faster and more accurate.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention.
Fig. 2 is a schematic diagram of the independent text items extracted from a digital newspaper page.
Fig. 3 is a schematic diagram of the content blocks formed by merging the independent text items on the page.
Embodiments
The invention is described in detail below with reference to the drawings and specific embodiments.
The invention is applied to the extraction of layout information during PDF analysis. An automatic merging method is first used to merge the independent text items on the page into content blocks, and information is then extracted according to the position and content of these blocks. The automatic merging method is described in the patent application "A PDF-based indexing method for complex page layouts" (200710179938.4); for details, refer to the specification of that application, which is not repeated here. With this method, the independent text items shown in Fig. 2 are merged into the content blocks shown in Fig. 3.
The positions that may contain layout information are fairly specific. After the text has been merged into content blocks, candidate content blocks are therefore selected according to the positions that may contain the required layout information; in general these are the upper-left corner, the left side, the upper-right corner, and the top of the page. The publication date, page number, and page name are then extracted in turn. The matching in the actual program uses regular expressions.
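As an illustration only, the position-based selection of candidate blocks might be sketched as follows. The `Block` class, the `filter_candidates` function, the top-left coordinate origin, and the band widths `top_frac` and `left_frac` are all assumptions made for this sketch; the description above names the page regions but gives no numeric thresholds.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (origin assumed at the top-left corner of the page)
    x1: float  # right edge
    y1: float  # bottom edge


def filter_candidates(blocks, page_w, page_h, top_frac=0.30, left_frac=0.20):
    """Keep blocks that lie in the top band or the left band of the page.

    The top band covers the upper-left corner, the top, and the upper-right
    corner; the left band covers the left side. Both fractions are assumed
    values, not taken from the description.
    """
    candidates = []
    for b in blocks:
        in_top_band = b.y0 <= top_frac * page_h
        in_left_band = b.x1 <= left_frac * page_w
        if in_top_band or in_left_band:
            candidates.append(b)
    return candidates


page_blocks = [
    Block("2008年3月5日 星期三", 40, 30, 300, 60),  # upper-left corner
    Block("A12", 900, 30, 960, 60),                 # upper-right corner
    Block("竖排报头", 20, 500, 120, 900),            # left side
    Block("正文段落", 300, 700, 900, 1200),          # body text, rejected
]
candidates = filter_candidates(page_blocks, page_w=1000, page_h=1400)
print([b.text for b in candidates])
```

Two overlapping bands are used rather than four separate regions because the three top regions differ only in horizontal position, which the later content checks do not depend on.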
I. Obtaining the publication date of the page
The candidate blocks are screened against specific content matching rules to judge whether each is a date-type content block. When judging whether a block contains the publication date, coarse matching is performed first, followed by fine matching; if fine matching fails, a general matching rule is used. Among the matching results, the content block closest to the top of the page is selected.
The date content feature used for coarse matching is any one of the following:
1. xxxx year xx month xx day, week x, where "day" and "week" are separated by 0-2 characters;
2. xxxx.xx.xx, week x, where "week" and the preceding "xx" are separated by 0-2 characters;
3. xxxx year xx month, week x, where "month" and "week" are separated by 0-8 characters;
4. xxxx.xx, week x, where "week" and the preceding "xx" are separated by 0-8 characters;
where xxxx is 1-4 characters, xx is 1-2 characters, x is 1 character, and each character is a digit from the set {0123456789 123456789}.
The date content feature used for fine matching is any one of the following:
1. xxxx year xx month, week x, where "month" and "week" are separated by 0-8 characters;
2. xxxx.xx, week x, where "week" and the preceding "xx" are separated by 0-8 characters;
where xxxx is 1-4 characters, xx is 1-2 characters, x is 1 character, and each character is a digit from the set {0123456789 123456789}.
The date content feature used by the general matching rule is any one of the following:
1. xxxx year xx month;
2. xxxx.xx;
where xxxx is 1-4 characters, xx is 1-2 characters, and each character is a digit from the set {0123456789 123456789}.
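The three tiers of date rules can be expressed as regular expressions. The sketch below is one possible reading, not the program of the invention. It assumes that the date words "year", "month", "day", and "week" appear in the newspaper text as 年, 月, 日, and 星期, and that the second run of digits in the character set stands for the Chinese numerals 一 to 九. The function name `find_date_block` and the `(top_y, text)` tuple layout are hypothetical.

```python
import re

# Assumed digit alphabet: ASCII digits plus Chinese numerals 1-9.
DIGITS = "0-9一二三四五六七八九"
Y = f"[{DIGITS}]{{1,4}}"   # "xxxx": 1-4 characters
M = f"[{DIGITS}]{{1,2}}"   # "xx":   1-2 characters
X = f"[{DIGITS}]"          # "x":    1 character

COARSE = [
    re.compile(rf"{Y}年{M}月{M}日.{{0,2}}?星期{X}"),
    re.compile(rf"{Y}\.{M}\.{M}.{{0,2}}?星期{X}"),
    re.compile(rf"{Y}年{M}月.{{0,8}}?星期{X}"),
    re.compile(rf"{Y}\.{M}.{{0,8}}?星期{X}"),
]
FINE = COARSE[2:]  # the fine rules listed above repeat coarse rules 3 and 4
GENERAL = [re.compile(rf"{Y}年{M}月"), re.compile(rf"{Y}\.{M}")]


def matches_any(patterns, text):
    return any(p.search(text) for p in patterns)


def find_date_block(blocks):
    """blocks: list of (top_y, text); a smaller top_y is nearer the page top.

    Coarse match first, then fine match; fall back to the general rule when
    no block passes the fine match. Returns the topmost hit, or None.
    """
    coarse_hits = [b for b in blocks if matches_any(COARSE, b[1])]
    hits = [b for b in coarse_hits if matches_any(FINE, b[1])]
    if not hits:
        hits = [b for b in blocks if matches_any(GENERAL, b[1])]
    return min(hits, key=lambda b: b[0]) if hits else None


sample = [(120.0, "2008年3月5日 星期三"), (40.0, "今日12版"), (300.0, "2008年3月")]
print(find_date_block(sample))
```

Here the block at `top_y` 120.0 is selected because it is the only one that passes the coarse and fine rules; the block containing only "2008年3月" would be considered only if the general rule were reached.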
If none of the candidate blocks satisfies the matching conditions, the date may have been split across several candidate blocks, so these candidate blocks need to be merged. The merging preferably uses the approach described in the patent application "A PDF-based indexing method for complex page layouts". Blocks are merged according to their positional relationships, following the normal reading order as far as possible. After merging, the start and end positions of the matched string are obtained from the result of the coarse match, and the date text is extracted from them. A target content block found without merging may contain other characters that were wrongly merged in before or after the date, so it must be split to extract the date text it contains.
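The merge-then-extract step might look like the following sketch. The merge here is a plain sort by position, used as a stand-in for the merging method of the referenced patent application, which is not reproduced. The pattern `DATE`, the function `merge_and_extract`, the `(top_y, left_x, text)` tuple layout, and the Chinese date words 年, 月, 日, and 星期 are assumptions.

```python
import re

# One coarse rule only, with an assumed digit alphabet (ASCII digits plus
# the Chinese numerals 1-9) and assumed Chinese date words.
D = "[0-9一二三四五六七八九]"
DATE = re.compile(rf"{D}{{1,4}}年{D}{{1,2}}月{D}{{1,2}}日.{{0,2}}?星期{D}")


def merge_and_extract(blocks):
    """blocks: list of (top_y, left_x, text) tuples.

    Concatenate the blocks in reading order (top to bottom, then left to
    right), then cut the date out using the match's start and end positions.
    """
    ordered = sorted(blocks, key=lambda b: (b[0], b[1]))
    merged = "".join(b[2] for b in ordered)
    m = DATE.search(merged)
    if m is None:
        return None
    return merged[m.start():m.end()]


# A date split over three blocks on the same line, given out of order.
fragments = [
    (10, 200, "星期三 今日8版"),
    (10, 50, "某某晚报 2008年3月"),
    (10, 120, "5日"),
]
print(merge_and_extract(fragments))
```

The same start and end positions also serve the splitting case: when a single block carries stray characters before or after the date, slicing at the match boundaries discards them.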
II. Obtaining the page number of the page
After the publication date of the page has been obtained, the page number is extracted. If a candidate content block exhibits any two of the following page number content features:
1. "this issue xx newspaper xx fold xx page",
where the xx in "xx newspaper" is any number of arbitrary characters, the xx in "xx fold" is any number of arbitrary characters, and the xx in "xx page" is 1-3 arbitrary characters;
2. "xx issue", "xx number",
where xx is any 1-5 characters;
3. a lunar calendar date is present;
then the content block contains page number information, and the page number is the front page.
A lunar calendar date is recognized by the following features:
A) it begins with the two characters for "lunar calendar";
B) the year in the date is an ordered pair of characters, one from the Heavenly Stems [甲乙丙丁戊己庚辛壬癸] and one from the Earthly Branches [子丑寅卯辰巳午未申酉戌亥];
C) the month is 1-3 characters.
If the page is not the front page, the blocks are screened against the following features:
1. "xx page",
where xx is 1-3 characters;
2. a letter followed by digits, or digits with no letter, with no more than three digits;
If a content block exhibits any one of these page number content features, it contains page number information.
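The two-stage page number decision above can be sketched as follows. The Chinese wording in every pattern (本期, 报, 叠, 版, 期, 号, 农历, 年, 月) is an assumption about how these features appear in print. The issue number and the "xx page" prefix are narrowed to digits and letters to keep the sketch from matching unrelated words, although the text above allows other characters. The function name `classify_page_number` is hypothetical.

```python
import re

STEMS = "甲乙丙丁戊己庚辛壬癸"         # Heavenly Stems
BRANCHES = "子丑寅卯辰巳午未申酉戌亥"  # Earthly Branches

# Front-page features (assumed Chinese wording).
F_SUMMARY = re.compile(r"本期.*报.*叠.{1,3}版")
F_ISSUE = re.compile(r"[0-9]{1,5}[期号]")
F_LUNAR = re.compile(rf"农历[{STEMS}][{BRANCHES}]年.{{1,3}}月")

# Ordinary page features.
P_PAGE = re.compile(r"[A-Za-z0-9]{1,3}版")
P_CODE = re.compile(r"(?<![A-Za-z0-9])[A-Za-z]?[0-9]{1,3}(?![0-9])")


def classify_page_number(text):
    """Return "front page", the matched page marker, or None."""
    front_features = sum(
        1 for p in (F_SUMMARY, F_ISSUE, F_LUNAR) if p.search(text)
    )
    if front_features >= 2:
        return "front page"
    m = P_PAGE.search(text) or P_CODE.search(text)
    return m.group(0) if m else None


for text in ["农历戊子年正月廿八 第19001期", "A12", "第8版", "体育新闻"]:
    print(text, "->", classify_page_number(text))
```

The lookarounds in `P_CODE` enforce the three-digit limit, so a four-digit year such as 2008 is not mistaken for a page code.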
III. Finding the page name
The content blocks are screened against the following features:
1. a page name block must generally intersect the page number block or the date block in the x direction or the y direction; a block that intersects neither is not a page name block;
2. the content of a page name block is a single line, its font size is greater than 15, and it contains between 2 and 9 characters;
3. the horizontal position of a page name block generally lies between 30% and 70% of the page width, and its vertical position generally lies between 5% and 30% of the page height.
After screening against these features, if several candidate blocks remain, the block positioned highest on the page is selected.
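A minimal sketch of this screening is given below. It assumes that all three features must hold, that "intersecting in the x or y direction" means the blocks' projections onto that axis overlap, that a block's position is taken at its center, and that the page origin is at the top-left corner. The names `Block`, `overlaps`, `is_page_name`, and `pick_page_name` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left
    y0: float  # top (origin assumed at the top-left corner of the page)
    x1: float  # right
    y1: float  # bottom
    font_size: float


def overlaps(a0, a1, b0, b1):
    """True when the intervals [a0, a1] and [b0, b1] overlap."""
    return a0 <= b1 and b0 <= a1


def is_page_name(b, anchors, page_w, page_h):
    """anchors: the date and page number blocks found in the earlier steps."""
    aligned = any(
        overlaps(b.x0, b.x1, a.x0, a.x1) or overlaps(b.y0, b.y1, a.y0, a.y1)
        for a in anchors
    )
    single_line = "\n" not in b.text
    cx = (b.x0 + b.x1) / 2 / page_w  # horizontal center as a page fraction
    cy = (b.y0 + b.y1) / 2 / page_h  # vertical center as a page fraction
    return (
        aligned
        and single_line
        and b.font_size > 15
        and 2 <= len(b.text) <= 9
        and 0.30 <= cx <= 0.70
        and 0.05 <= cy <= 0.30
    )


def pick_page_name(blocks, anchors, page_w, page_h):
    hits = [b for b in blocks if is_page_name(b, anchors, page_w, page_h)]
    return min(hits, key=lambda b: b.y0) if hits else None  # topmost block


date_block = Block("2008年3月5日", 50, 60, 250, 90, 10)
blocks = [
    Block("体育新闻", 400, 70, 600, 110, 24),  # passes every check
    Block("要闻", 450, 75, 550, 105, 22),      # passes, but sits lower
    Block("国际", 420, 200, 580, 240, 20),     # overlaps the date on no axis
    Block("小字说明", 400, 65, 600, 85, 9),    # font too small
]
best = pick_page_name(blocks, [date_block], page_w=1000, page_h=1400)
print(best.text if best else None)
```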
The method of the invention is not limited to the embodiments described above. Other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the technical innovation of the invention.