CN100568221C - A method for recovering the text reading order of a newspaper layout - Google Patents
- Publication number
- CN100568221C, CNB2004100914343A, CN200410091434A
- Authority
- CN
- China
- Prior art keywords
- text block
- sequence
- block
- body text
- vertex
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention belongs to document layout understanding within intelligent text and graphics information processing, and specifically relates to a content-based method for recovering the text reading order of a newspaper layout. Prior-art processing of complex newspaper layouts loses the reading order and yields content that is not separated into independent articles. To address this defect, the invention is the first to model the problem mathematically with graph theory: the adjacency relations between text blocks are expressed as a directed graph, the directed graph is split and converted into a weighted bipartite graph, the edge weights of the bipartite graph are computed with natural language processing techniques, and several continuous sequences are obtained by optimal matching. Each sequence is then divided into subsequences according to the style information of the text blocks, and concatenating the content of the blocks of a subsequence yields the text stream, in reading order, of one independent article. Because semantic, spatial and style information are all used, the accuracy of reading-order recovery is greatly improved and the result is independent per article. The method can be applied to layout understanding and structural reconstruction of styled documents.
Description
Technical field
The invention belongs to document layout understanding within intelligent text and graphics information processing, and specifically relates to a method for recovering the text reading order of a newspaper layout.
Background art
With the development of information technology and the emergence of new media formats, cross-media publishing has developed rapidly thanks to its advantages: information is shared, disseminated conveniently and efficiently, expressed in rich forms, and the strengths of different media complement one another. An XML-based digital asset management system is the core of cross-media publishing. In traditional information dissemination, however, the form in which information exists depends directly on the form of the terminal medium, which hinders cross-media publishing. Newspapers in particular are enormous in number, go back a long way, have complex styles, poorly separated content and an ambiguous reading order, so structuring them as XML is the most difficult. How to recover coherent, independent article text streams from the ambiguous and interdependent spatial relations of the text in such complex layouts, and to represent them in XML with semantic information, is the problem that newspaper data assets face in achieving cross-media publishing. Reading-order recovery for newspaper layouts is the method for solving these technical problems.
At present, mainstream OCR digitisation software, when processing styled document layouts, ignores the recovery of reading order and semantic structure and converts the pages into styled electronic documents such as PDF or HTML for republication. This hinders information reuse and further processing such as retrieval, utilisation, trading, rewriting, supplementing and collation; multi-article newspaper layouts in particular lack a per-article reading order and structure, which makes reuse even harder. There are two main classes of reading-order recovery methods. The first uses style and spatial relation information, as in "Layout analysis, understanding and reconstruction of complex Chinese newspapers" (Chen Ming, Ding Xiaoqing, Liang Jian, Journal of Tsinghua University (Science and Technology), Vol. 41, No. 1, 2001, pp. 29-32, 59) and "Integrated Algorithms for Newspaper Page Decomposition and Article Tracking" (B. Gatos, S. L. Mantzaris, K. V. Chandrinos, A. Tsigris, S. J. Perantonis, Proceedings of the Fifth International Conference on Document Analysis and Recognition, 1999, pp. 559-562). These methods treat the newspaper layout as a set of independent text blocks and apply rules, based on the principle that the style within one article is homogeneous, to merge text blocks and determine their reading order. Rule-based methods can only handle layouts with simple style and spatial relations, such as books and journal articles; the diversity of newspaper layouts and the dependencies between page objects make reading-order recovery between the text blocks of a complex layout too inaccurate when only style and spatial rules are used. The second class uses semantic and spatial relation information. In 2002, Aiello M, Monz C, Todoran L et al., in "Document understanding for a broad class of documents" (International Journal on Document Analysis and Recognition, 2002, 5(1): 1-16), first disclosed a method that uses semantic information to determine reading order: all possible reading orders are enumerated as permutations, and the best result is selected according to a part-of-speech weight formula. However, the time complexity grows exponentially with the number of text blocks, independent per-article reading orders cannot be extracted, and too little semantic information is used, which hurts accuracy. None of these techniques makes full use of the various kinds of latent information in a newspaper layout document to obtain a more accurate reading order, let alone forms a unified mathematical model.
Summary of the invention
In view of the problems in the prior art, the object of the invention is to provide a method for recovering the text reading order of a newspaper layout. The method can effectively recover the reading order of a newspaper layout document and can segment it into independent reading orders per article, which greatly improves reading-order accuracy and facilitates further XML semantic structuring.
To achieve this object, the technical solution adopted by the invention is a method for recovering the text reading order of a newspaper layout, comprising the following steps:
(1) Read in a document carrying style and layout information and perform layout analysis: merge text of identical style into text blocks and classify the blocks into body text blocks and non-body text blocks. The spatial relation of the text inside a block is simple, so the text within a block is joined, following the rule that left is read before right and top before bottom, into a text stream with reading order that serves as the content of the block. A non-body text block is isolated from its surrounding blocks, so its reading order relative to other blocks need not be considered; the core of the processing is the reading order between the contents of the body text blocks;
(2) Take the body text blocks as vertices and the left-right adjacency relations of the blocks as directed edges to build the horizontal adjacency digraph; take the blocks as vertices and the top-bottom adjacency relations of the blocks as directed edges to build the vertical adjacency digraph. From these two digraphs, build the spatial-order digraph according to the spatial-order rules, which are defined as follows: if body text block L is a predecessor of body text block m in the horizontal or vertical adjacency digraph, then L precedes m in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and body text block n is a predecessor of m in the vertical adjacency digraph, then L precedes n in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and L is a predecessor of n in the vertical adjacency digraph, then n precedes m in spatial order;
(3) Split and convert the spatial-order digraph to construct a weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from, among others, the relevance of the contents of the body text blocks corresponding to the two end vertices of the edge, the local activity of overlapping words, the word-formation degree of the last word and the first word, and the part-of-speech transition rate;
(4) Perform optimal matching on the weighted bipartite graph and determine several continuous, totally ordered sequences of body text blocks from the result of the optimal matching;
(5) Divide each body text block sequence into several subsequences according to the style information and the semantic association information of the text blocks. Concatenating the content of the text blocks of a subsequence in order yields a text stream that is the recovered independent reading order of a single article.
Further, to give the invention a better effect:
In step (4), when the reading order is recovered, the Kuhn-Munkres algorithm for optimal matching in graph theory is applied to content-based reading-order recovery.
In step (1), documents carrying style and layout information include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software such as Founder FeiTeng. Style information mainly means that every character carries position and size information. Layout analysis merges characters of identical style into text blocks bottom-up, following the principle of local style homogeneity. Text blocks are classified into body text blocks and non-body text blocks according to the character style and the number of lines of the block. The spatial relations among page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line.
In step (3), the two vertex sets X and Y of the weighted bipartite graph each contain all vertices of the spatial-order digraph. The edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are respectively the out-vertex and the in-vertex of an edge of the spatial-order digraph, then there is also an edge between them in the weighted bipartite graph. The edge weights of the weighted bipartite graph are computed with natural language processing techniques:
(1) The relevance of the contents $d_1$ and $d_2$ corresponding to vertex a of X and vertex b of Y: $\mathrm{Similarity}(d_1, d_2) = \mathrm{cosine}(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$;
(2) The local lexical activity of $d_1$ and $d_2$: $\mathrm{Active}(d_1, d_2)$ = (number of overlapping words of $d_1$ and $d_2$) / (sum of the distribution degrees of the overlapping-word chains);
(3) The word-formation degree WordTrans of the last word $w_1$ of $d_1$ and the first word $w_2$ of $d_2$: if the string $w_1 w_2$ is a word in the dictionary, WordTrans is defined as 1, otherwise as 0;
(4) The part-of-speech transition rate between the part of speech pos1 of the last word $w_1$ of $d_1$ and the part of speech pos2 of the first word $w_2$ of $d_2$: $\mathrm{PosTrans} = P(\mathrm{pos1}\,\mathrm{pos2} \mid \mathrm{pos2}) = \mathrm{freq}(\mathrm{pos1}, \mathrm{pos2}) / \mathrm{freq}(\mathrm{pos1})$, where freq(pos1, pos2) is the number of co-occurrences of pos1 and pos2 in the corpus and freq(pos1) is the number of occurrences of pos1 in the corpus;
Edge weight $= \alpha_1 \cdot \mathrm{Similarity} + \alpha_2 \cdot \mathrm{Active} + \alpha_3 \cdot \mathrm{WordTrans} + \alpha_4 \cdot \mathrm{PosTrans}$, with $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
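The four terms and their weighted sum can be sketched in Python as below. This is a minimal illustration, not the patent's implementation: the function names, the token-list representation of block content, the term-frequency vectors used for the cosine, and the equal default weights are all assumptions. Active is passed in as a precomputed number because the text does not spell out how the distribution degree of an overlapping-word chain is computed.

```python
import math
from collections import Counter

def similarity(d1, d2):
    """Cosine of the term-frequency vectors of two tokenised blocks (assumed representation)."""
    v1, v2 = Counter(d1), Counter(d2)
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def word_trans(d1, d2, dictionary):
    """1 if last word of d1 + first word of d2 is a dictionary word, else 0."""
    return 1.0 if d1 and d2 and (d1[-1] + d2[0]) in dictionary else 0.0

def pos_trans(pos1, pos2, pair_freq, pos_freq):
    """freq(pos1, pos2) / freq(pos1), using corpus counts supplied by the caller."""
    if not pos_freq.get(pos1):
        return 0.0
    return pair_freq.get((pos1, pos2), 0) / pos_freq[pos1]

def edge_weight(sim, active, wt, pt, alphas=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four terms; the alphas must sum to 1 (equal values are hypothetical)."""
    assert abs(sum(alphas) - 1.0) < 1e-9
    a1, a2, a3, a4 = alphas
    return a1 * sim + a2 * active + a3 * wt + a4 * pt
```

In practice the alphas would be tuned on labelled pages; the text only requires that they sum to 1.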
In step (4), the Kuhn-Munkres algorithm is used to perform optimal matching on the weighted bipartite graph. The algorithm is as follows:
1) Give the initial labels $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y, E_l)$, and a matching M in $G_l$;
3) If M saturates all nodes of X, then M is the optimal matching of G and the computation ends; otherwise go to the next step;
4) Find an M-unsaturated node $x_0$ in X and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where A and B are two sets;
5) If $N_{G_l}(A) = B$, go to step 9); otherwise go to the next step, where $N_{G_l}(A) \subseteq Y$ is the set of nodes adjacent to the nodes in A;
6) Find a node $y \in N_{G_l}(A) - B$;
7) If y is M-saturated, find the match z of y, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;
8) There is an augmenting path P from $x_0$ to y; let $M \leftarrow M \oplus E(P)$ and go to step 3);
9) Compute the value a as follows: $a = \min_{x_i \in A,\ y_j \notin B} \{ l(x_i) + l(y_j) - \omega_{ij} \}$. Revise the labels: $l'(v) = l(v) - a$ for $v \in A$, $l'(v) = l(v) + a$ for $v \in B$, and $l'(v) = l(v)$ otherwise. Obtain $E_{l'}$ and $G_{l'}$ from $l'$;
10) $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, go to step 6);
Based on the optimal matching result M, several continuous, totally ordered sequences of body text blocks are determined. The sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of saturated vertices matched by M, and vertex b of X and vertex c of Y are a pair of saturated vertices matched by M, then vertex a → vertex b → vertex c forms a sequence; this is applied recursively until the sequence cannot be extended any further. A new sequence is then generated from the vertices not yet in any sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence.
In step (5), each body text block sequence is further divided into several subsequences according to the style information of the text blocks, such as column width and column gutter, and according to semantic association information. Within each subsequence the blocks have equal column width and equal column gutter, and the edge weight in the bipartite graph between adjacent blocks is greater than a threshold. Concatenating the content of the text blocks of a subsequence in order yields a text stream that is the recovered independent reading order of a single article.
The effect of the invention is that, targeting the characteristics of newspaper layout documents, it proposes a new content-based method for recovering the text reading order of a newspaper layout. The method makes effective use of the semantic information, spatial relation information and style information in a newspaper layout document and models the reading-order problem with a graph-theoretic mathematical model. It not only recovers the reading order but also yields text streams that are independent per article, so that the content of the newspaper page becomes independent of its style. This greatly improves reading-order accuracy and makes information extraction and XML structuring of newspaper layouts easier, so that historical data assets can be reused and republished across media. The method can be widely applied in intelligent text and graphics information processing such as layout understanding, for example the layout understanding and structuring of styled documents such as paper media, PS, PDF, Word and InDesign documents.
The invention achieves such a significant technical effect for the following reasons:
1. The invention is the first to model the problem of recovering the reading order between body text blocks mathematically, using the optimal matching theory of graph theory;
2. Spatial continuity is a necessary condition for continuity of the text stream, so the spatial adjacency relations between text blocks are expressed as edges of a digraph in order to reduce the search space;
3. The digraph is split and converted into a weighted bipartite graph so that the most probable reading-order sequences can be selected quantitatively;
4. Because the most essential criterion for continuity of the text stream is content, natural language processing techniques are used: at word level, the word-formation degree of the last word of the preceding block and the first word of the following block; at sentence level, the part-of-speech transition rate from the last word of the preceding block to the first word of the following block; at paragraph level, the relevance of the content and the local activity of overlapping words. Together these determine whether two text blocks are consecutive in reading order. Their linear weighted sum serves as the edge weight of the bipartite graph, and the Kuhn-Munkres matching algorithm yields several continuous text block sequences;
5. Each sequence is not yet independent per article. Newspaper layouts are homogeneous in style within an article and heterogeneous between articles, and the text blocks of each article share a consistent topic. Using column width, column gutter and semantic relevance, each text block sequence is therefore divided into several continuous subsequences; concatenating the content of the text blocks of a subsequence gives the text stream, in reading order, of one independent article.
Brief description of the drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a schematic diagram of a newspaper page after layout analysis;
Fig. 3 is a schematic diagram of the horizontal adjacency digraph of the body text blocks of the newspaper layout document;
Fig. 4 is a schematic diagram of the vertical adjacency digraph of the body text blocks of the newspaper layout document;
Fig. 5 is a schematic diagram of the bipartite graph obtained by splitting and converting the horizontal and vertical adjacency digraphs;
Fig. 6 is a schematic diagram of the result of the Kuhn-Munkres optimal matching algorithm;
Fig. 7 is a schematic diagram of the newspaper page after the reading order has been recovered.
Embodiment
The invention is described further below with reference to the drawings and an embodiment.
In this embodiment, a newspaper document obtained by scanning and OCR is used as example data. As shown in Fig. 1, a method for recovering the text reading order of a newspaper layout comprises the following steps:
Step 1: Read in a document carrying style and layout information. Such documents include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software such as Founder FeiTeng. Style information mainly means that every character carries position and size information. Layout analysis merges characters of identical style into text blocks bottom-up, following the principle of local style homogeneity. Text blocks are classified into body text blocks and non-body text blocks according to the character style and the number of lines of the block. As shown in Fig. 2, solid rectangles represent body text blocks, the numbers are their identifiers, and dashed lines represent non-body text blocks. The spatial relations among page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line; following the rule that left is read before right and top before bottom, the text within a block is joined into a text stream with reading order that serves as the content of the block. A non-body text block is isolated from its surrounding blocks, so its reading order relative to other blocks need not be considered; the core of the processing is the reading order between the contents of the body text blocks.
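The in-block rule (top before bottom, left before right) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Char` record, its field names and the `line_tolerance` parameter are hypothetical, horizontal writing is assumed, and characters are grouped into lines by comparing their top edges.

```python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x: float       # left edge of the character box
    y: float       # top edge of the character box (grows downward)
    height: float  # height of the character box

def block_text(chars, line_tolerance=0.5):
    """Join the characters of one block: upper lines first, left to right within a line."""
    lines = []
    for ch in sorted(chars, key=lambda c: c.y):
        # A character joins the current line if its top edge is close to the line's first character.
        if lines and abs(ch.y - lines[-1][0].y) <= line_tolerance * ch.height:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return "".join(c.text for line in lines for c in sorted(line, key=lambda c: c.x))

chars = [Char("b", 10, 0, 10), Char("a", 0, 1, 10), Char("d", 10, 20, 10), Char("c", 0, 20, 10)]
print(block_text(chars))  # prints "abcd"
```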
Step 2: Take the body text blocks as vertices and the left-right adjacency relations of the blocks as directed edges to build the horizontal adjacency digraph, as shown in Fig. 3; take the blocks as vertices and the top-bottom adjacency relations of the blocks as directed edges to build the vertical adjacency digraph, as shown in Fig. 4. From these two digraphs, build the spatial-order digraph according to the spatial-order rules, which are defined as follows: if body text block L is a predecessor of body text block m in the horizontal or vertical adjacency digraph, then L precedes m in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and body text block n is a predecessor of m in the vertical adjacency digraph, then L precedes n in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and L is a predecessor of n in the vertical adjacency digraph, then n precedes m in spatial order.
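The three spatial-order rules can be applied mechanically to the two adjacency digraphs. The sketch below assumes each digraph is given as a set of (predecessor, successor) pairs of block identifiers; the function name and this representation are illustrative choices, not taken from the text.

```python
def spatial_order_digraph(horizontal, vertical):
    """Build the spatial-order edge set from the horizontal and vertical adjacency edge sets."""
    # Rule 1: every horizontal or vertical adjacency edge is a spatial-order edge.
    edges = set(horizontal) | set(vertical)
    for (l, m) in horizontal:
        for (p, q) in vertical:
            # Rule 2: l is left of m and p is above m, so l precedes p.
            if q == m and p != l:
                edges.add((l, p))
            # Rule 3: m is right of l and q is below l, so q precedes m.
            if p == l and q != m:
                edges.add((q, m))
    return edges

# Block 1 has block 2 to its right and block 3 below it: rule 3 adds 3 -> 2.
print(sorted(spatial_order_digraph({(1, 2)}, {(1, 3)})))  # prints [(1, 2), (1, 3), (3, 2)]
```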
Step 3: Split and convert the spatial-order digraph to construct a weighted bipartite graph, as shown in Fig. 5, where "f text" denotes a predecessor vertex in reading order and "t text" denotes a successor vertex in reading order. The two vertex sets X and Y of the weighted bipartite graph each contain all vertices of the spatial-order digraph. The edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are respectively the out-vertex and the in-vertex of an edge of the spatial-order digraph, then there is also an edge between them in the weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from, among others, the relevance of the contents of the body text blocks corresponding to the two end vertices of the edge, the local activity of overlapping words, the word-formation degree of the last word and the first word, and the part-of-speech transition rate. They are calculated as follows:
(1) The relevance of the contents $d_1$ and $d_2$ corresponding to vertex a of X and vertex b of Y: $\mathrm{Similarity}(d_1, d_2) = \mathrm{cosine}(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$;
(2) The local lexical activity of $d_1$ and $d_2$: $\mathrm{Active}(d_1, d_2)$ = (number of overlapping words of $d_1$ and $d_2$) / (sum of the distribution degrees of the overlapping-word chains);
(3) The word-formation degree WordTrans of the last word $w_1$ of $d_1$ and the first word $w_2$ of $d_2$: if the string $w_1 w_2$ is a word in the dictionary, WordTrans is defined as 1, otherwise as 0;
(4) The part-of-speech transition rate between the part of speech pos1 of the last word $w_1$ of $d_1$ and the part of speech pos2 of the first word $w_2$ of $d_2$: $\mathrm{PosTrans} = P(\mathrm{pos1}\,\mathrm{pos2} \mid \mathrm{pos2}) = \mathrm{freq}(\mathrm{pos1}, \mathrm{pos2}) / \mathrm{freq}(\mathrm{pos1})$, where freq(pos1, pos2) is the number of co-occurrences of pos1 and pos2 in the corpus and freq(pos1) is the number of occurrences of pos1 in the corpus;
Edge weight $= \alpha_1 \cdot \mathrm{Similarity} + \alpha_2 \cdot \mathrm{Active} + \alpha_3 \cdot \mathrm{WordTrans} + \alpha_4 \cdot \mathrm{PosTrans}$, with $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
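The splitting of the spatial-order digraph into a bipartite graph, described at the start of this step, can be sketched as below. Each block is duplicated into an "f" copy (predecessor side, set X) and a "t" copy (successor side, set Y), mirroring the labels of Fig. 5. The function name, the tuple encoding of the copies and the `weight` callback (a stand-in for the edge weight formula) are assumptions made for illustration.

```python
def build_weighted_bipartite(vertices, order_edges, weight):
    """Split each block into an 'f' copy in X and a 't' copy in Y and weight the edges.

    vertices:    iterable of block ids (all vertices of the spatial-order digraph)
    order_edges: set of (a, b) pairs, a preceding b in spatial order
    weight:      callable (a, b) -> float, hypothetical stand-in for the edge weight formula
    """
    X = [("f", v) for v in vertices]
    Y = [("t", v) for v in vertices]
    # An edge exists only where the spatial-order digraph has an edge from a to b.
    W = {(("f", a), ("t", b)): weight(a, b) for (a, b) in order_edges}
    return X, Y, W

X, Y, W = build_weighted_bipartite([1, 2, 3], {(1, 2), (1, 3), (3, 2)}, lambda a, b: 0.5)
print(len(X), len(Y), len(W))  # prints "3 3 3"
```

Restricting the bipartite edges to spatial-order edges is what keeps the search space small: pairs of blocks that are not spatially ordered never become candidates for consecutive reading.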
Step 4: Use the Kuhn-Munkres algorithm to perform optimal matching on the weighted bipartite graph, and determine several continuous, totally ordered sequences of body text blocks from the result of the optimal matching. The Kuhn-Munkres algorithm is as follows:
(1) Give the initial labels $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
(2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y, E_l)$, and a matching M in $G_l$;
(3) If M saturates all nodes of X, then M is the optimal matching of G and the computation ends; otherwise go to the next step;
(4) Find an M-unsaturated node $x_0$ in X and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where A and B are two sets;
(5) If $N_{G_l}(A) = B$, go to step (9); otherwise go to the next step, where $N_{G_l}(A) \subseteq Y$ is the set of nodes adjacent to the nodes in A;
(6) Find a node $y \in N_{G_l}(A) - B$;
(7) If y is M-saturated, find the match z of y, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step (5); otherwise go to the next step;
(8) There is an augmenting path P from $x_0$ to y; let $M \leftarrow M \oplus E(P)$ and go to step (3);
(9) Compute the value a as follows: $a = \min_{x_i \in A,\ y_j \notin B} \{ l(x_i) + l(y_j) - \omega_{ij} \}$. Revise the labels: $l'(v) = l(v) - a$ for $v \in A$, $l'(v) = l(v) + a$ for $v \in B$, and $l'(v) = l(v)$ otherwise. Obtain $E_{l'}$ and $G_{l'}$ from $l'$;
(10) $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, go to step (6);
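A compact Python sketch of the Kuhn-Munkres procedure is given below. It follows the common textbook form (vertex labels, equality subgraph, augmenting-path search, label adjustment by the smallest slack) rather than transcribing the numbered steps literally, and it assumes a square weight matrix: a rectangular problem would first be padded with zero weights to size t = max(n, m), and non-edges of the bipartite graph would also be given weight 0. The function and variable names are illustrative.

```python
def kuhn_munkres(w):
    """Maximum-weight perfect matching on a square matrix w; returns match_x[i] = column of row i."""
    n = len(w)
    lx = [max(row) for row in w]   # initial labels l(x_i) = max_j w_ij
    ly = [0.0] * n                 # initial labels l(y_j) = 0
    match_y = [-1] * n             # match_y[j] = row currently matched to column j
    eps = 1e-9

    def try_augment(i, in_a, in_b):
        """Depth-first search for an augmenting path from row i inside the equality subgraph."""
        in_a[i] = True
        for j in range(n):
            if not in_b[j] and abs(lx[i] + ly[j] - w[i][j]) < eps:
                in_b[j] = True
                if match_y[j] == -1 or try_augment(match_y[j], in_a, in_b):
                    match_y[j] = i
                    return True
        return False

    for x0 in range(n):
        while True:
            in_a, in_b = [False] * n, [False] * n
            if try_augment(x0, in_a, in_b):
                break
            # No augmenting path: shift labels by the smallest slack between A and Y - B.
            a = min(lx[i] + ly[j] - w[i][j]
                    for i in range(n) if in_a[i]
                    for j in range(n) if not in_b[j])
            for i in range(n):
                if in_a[i]:
                    lx[i] -= a
            for j in range(n):
                if in_b[j]:
                    ly[j] += a

    match_x = [-1] * n
    for j, i in enumerate(match_y):
        match_x[i] = j
    return match_x

print(kuhn_munkres([[4, 3], [3, 1]]))  # prints [1, 0]
```

In the example a greedy choice of the largest entry (4) would give total weight 5, whereas the optimal matching pairs row 0 with column 1 and row 1 with column 0 for a total of 6. When the matrix has been padded, matched pairs whose weight is 0 would be discarded afterwards, since they do not correspond to edges of the bipartite graph.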
Based on the optimal matching result M, the sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of saturated vertices matched by M, and vertex b of X and vertex c of Y are a pair of saturated vertices matched by M, then vertex a → vertex b → vertex c forms a sequence; this is applied recursively until the sequence cannot be extended any further. A new sequence is then generated from the vertices not yet in any sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence. The optimal matching shown in Fig. 6 produces 5 sequences in total: 12 → 13 → 14 → 15 → 16 → 17 → 20 → 18 → 21 → 19 → 22 → 1; 23 → 24 → 25; 27 → 28 → 0 → 8 → 9 → 10 → 11 → 4; 26 → 5 → 2 → 6 → 7 → 3; and 29.
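Turning the matching into sequences amounts to following successor links from every block that has no matched predecessor. The sketch below assumes the matching is given as a dictionary from predecessor block to successor block and that it contains no cycles; both the function name and this representation are illustrative.

```python
def chains_from_matching(vertices, matching):
    """Chain matched pairs a -> b -> c into maximal sequences.

    vertices: iterable of all block ids
    matching: dict mapping a block to the block matched as its successor (assumed acyclic)
    """
    has_predecessor = set(matching.values())
    sequences = []
    for v in vertices:
        if v in has_predecessor:
            continue  # v is not the head of a sequence
        seq = [v]
        while seq[-1] in matching:
            seq.append(matching[seq[-1]])
        sequences.append(seq)
    return sequences

# Two of the sequences listed above: 23 -> 24 -> 25 and the isolated block 29.
print(chains_from_matching([23, 24, 25, 29], {23: 24, 24: 25}))  # prints [[23, 24, 25], [29]]
```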
Step 5: Divide each body text block sequence into several subsequences according to the style information and the semantic association information of the text blocks. Within each subsequence the blocks have equal column width and equal column gutter, and the edge weight in the bipartite graph between adjacent blocks is greater than a threshold. Concatenating the content of the text blocks of a subsequence in order yields a text stream that is the recovered independent reading order of a single article. As shown in Fig. 7, there are 9 articles in total, with the reading order indicated by arrows: 12 → 13 → 14 → 15 → 16 → 17 → 20 → 18 → 21 → 19 → 22; 23 → 24 → 25; 27 → 28 → 0; 8 → 9 → 10 → 11 → 4; 2 → 6 → 7 → 3; 1; 5; 26; and 29. Four of the articles contain only one text block each, namely 1, 5, 26 and 29.
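The subsequence split can be sketched as a single pass that cuts a sequence wherever the style changes or the edge weight does not exceed the threshold. The function name, the `style` mapping of block id to (column width, column gutter) and the `weight` callback are assumptions; the style values and weights in the usage example are invented for illustration and are not taken from Fig. 7.

```python
def split_into_articles(sequence, style, weight, threshold):
    """Cut a block sequence into subsequences with equal style and strong semantic links.

    style:  dict block id -> (column_width, column_gutter)
    weight: callable (a, b) -> bipartite edge weight of the adjacent pair a -> b
    """
    if not sequence:
        return []
    articles = [[sequence[0]]]
    for prev, cur in zip(sequence, sequence[1:]):
        if style[prev] == style[cur] and weight(prev, cur) > threshold:
            articles[-1].append(cur)   # same article continues
        else:
            articles.append([cur])     # style change or weak link: new article
    return articles

# Hypothetical styles: blocks 27, 28, 0 share one column width, the rest another.
style = {27: (30, 2), 28: (30, 2), 0: (30, 2), 8: (40, 2), 9: (40, 2), 10: (40, 2), 11: (40, 2), 4: (40, 2)}
print(split_into_articles([27, 28, 0, 8, 9, 10, 11, 4], style, lambda a, b: 0.9, 0.5))
# prints [[27, 28, 0], [8, 9, 10, 11, 4]]
```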
Claims (6)
1. one kind is carried out the method that words reading sequence recovers to newspaper layout, may further comprise the steps:
(1) reads in the document of being with the pattern layout information, carry out printed page analysis, the identical literal of pattern is merged into the literal piece, and be categorized as text literal piece and non-text literal piece, the spatial relationship of the inner literal of literal piece is single, piece in literal connected into word flow with reading order content as piece than right reading earlier, going up than the rule of reading earlier down according to a left side, non-text literal piece is isolated to literal piece on every side, need not to consider and the reading order of other literal pieces that the core of processing is the reading order between text literal piece content;
(2) be the summit with text literal piece, the left and right sides syntople of piece is that directed edge is set up laterally in abutting connection with digraph, with the piece is the summit, the syntople up and down of piece is that directed edge is set up vertically in abutting connection with digraph, set up the spatial sequence digraph based on these two digraphs and according to the spatial sequence rule, the spatial sequence rule definition is: if text literal piece L is being the pioneer of text literal piece m in digraph laterally or vertically, then text literal piece L is better than text literal piece m on spatial sequence; If text literal piece L is being the pioneer of text literal piece m in digraph laterally, and text literal piece n is being the pioneer of text literal piece m in digraph vertically, and then text literal piece L is better than text literal piece n on spatial sequence; If text literal piece L is being the pioneer of text literal piece m in digraph laterally, and text literal piece L is being the pioneer of text literal piece n in digraph vertically, and then text literal piece n is better than text literal piece m on spatial sequence;
(3) the spatial sequence digraph is split conversion, structure weighting bipartite graph, the weights on bipartite graph limit adopt natural language processing technique, become speech degree and part of speech metastatic rate definite by the local liveness of the degree of correlation of the text literal piece content of two summit correspondences on limit, overlapping words, tail speech and head-word;
(4) the weighting bipartite graph is carried out Optimum Matching, determine a plurality of continuous text literal piece total order sequences based on the result of Optimum Matching;
(5) each text literal piece sequence is divided into a plurality of subsequences according to the style information and the semantic association information of literal piece again, the be linked in sequence word flow of formation of the content of subsequence Chinese block promptly is the independently words reading sequence of the single article that recovers to come out.
2, a kind of method of newspaper layout being carried out the words reading sequence recovery as claimed in claim 1, it is characterized in that: in step (4), carry out reading order when recovering, Ku En-Man Kele (Kuhn-Munkres) algorithm of Optimum Matching in the graph theory is used for content-based reading order recovers.
3, a kind of method of newspaper layout being carried out the words reading sequence recovery as claimed in claim 1, it is characterized in that: the document of band pattern layout information comprises that scanning paper medium newspaper and OCR discern the document that the document, PDF, professional software for composing such as the Founder that obtain are soared and generated in the step (1), style information is meant that mainly each word all has position and size information, and printed page analysis is merged into the literal piece to the identical literal of pattern according to local pattern homogeneity principle is bottom-up; The classification foundation literal piece printed words formula of literal piece and line number amount are divided into text literal piece and non-text literal piece, and literal piece inside page object spatial relationship is the vertical syntople between row and the row, the horizontal syntople between interior word and the word of going.
4. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (3), the two vertex sets X and Y of the weighted bipartite graph each contain all vertices of the spatial-order directed graph; the edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are, respectively, the start vertex and end vertex of an edge in the spatial-order directed graph, then an edge between them also exists in the weighted bipartite graph; the edge weights of the weighted bipartite graph are computed using natural language processing techniques:
1) the similarity of the contents $d_1$ and $d_2$ corresponding to vertex a of X and vertex b of Y: $\mathrm{Similarity}(d_1, d_2) = \mathrm{cosine}(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$;
2) the local lexical activity of $d_1$ and $d_2$: $\mathrm{Active}(d_1, d_2)$ = (number of words that overlap between $d_1$ and $d_2$) / (sum of the distribution degrees of the overlapping-word chains);
3) the word-formation degree WordTrans of the tail word $w_1$ of $d_1$ and the head word $w_2$ of $d_2$: if the string $w_1 w_2$ is a word in the dictionary, WordTrans is defined as 1; otherwise it is defined as 0;
4) the part-of-speech transition probability between the part of speech pos1 of the tail word $w_1$ of $d_1$ and the part of speech pos2 of the head word $w_2$ of $d_2$: $\mathrm{PosTrans} = P(\mathrm{pos1}\,\mathrm{pos2} \mid \mathrm{pos2}) = \mathrm{freq}(\mathrm{pos1}, \mathrm{pos2}) / \mathrm{freq}(\mathrm{pos1})$, where $\mathrm{freq}(\mathrm{pos1}, \mathrm{pos2})$ is the number of co-occurrences of pos1 and pos2 in the corpus and $\mathrm{freq}(\mathrm{pos1})$ is the number of occurrences of pos1 in the corpus;
edge weight $= \alpha_1 \cdot \mathrm{Similarity} + \alpha_2 \cdot \mathrm{Active} + \alpha_3 \cdot \mathrm{WordTrans} + \alpha_4 \cdot \mathrm{PosTrans}$, where $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
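The four components above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: documents are taken to be lists of already segmented words, cosine similarity is computed on raw term-frequency vectors, and the Active score is passed in as a number because the distribution degree of overlapping-word chains is not defined in this section. All function names, the equal default coefficients, and the sample dictionary and counts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def cosine_similarity(d1: List[str], d2: List[str]) -> float:
    """Cosine of the term-frequency vectors of two word lists."""
    v1, v2 = Counter(d1), Counter(d2)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


def word_trans(tail_word: str, head_word: str, dictionary: Set[str]) -> float:
    """1 if tail word + head word together form a dictionary word, else 0."""
    return 1.0 if tail_word + head_word in dictionary else 0.0


def pos_trans(
    pos1: str,
    pos2: str,
    pair_freq: Dict[Tuple[str, str], int],
    pos_freq: Dict[str, int],
) -> float:
    """freq(pos1, pos2) / freq(pos1), or 0 when pos1 never occurs."""
    total = pos_freq.get(pos1, 0)
    return pair_freq.get((pos1, pos2), 0) / total if total else 0.0


def edge_weight(
    similarity: float,
    active: float,
    wordtrans: float,
    postrans: float,
    alphas: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of the four scores; the coefficients must sum to 1."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alpha coefficients must sum to 1")
    a1, a2, a3, a4 = alphas
    return a1 * similarity + a2 * active + a3 * wordtrans + a4 * postrans


sim = cosine_similarity(["flood", "relief", "fund"], ["relief", "fund", "total"])
wt = word_trans("新", "闻", {"新闻"})
pt = pos_trans("n", "v", {("n", "v"): 3}, {"n": 12})
print(edge_weight(sim, 0.0, wt, pt))
```

The coefficients would normally be tuned on sample pages; equal values are used here only to keep the example simple.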
5. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (4), the Kuhn-Munkres algorithm is used to perform optimal matching on the weighted bipartite graph, the specific algorithm being as follows:
1) give an initial labeling: [formula missing in source], $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
2) obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y, E_l)$, and a matching $M$ in $G_l$;
3) if $M$ saturates all nodes of $X$, then $M$ is the optimal matching of $G$ and the computation ends; otherwise go to the next step;
4) find an $M$-unsaturated node $x_0$ in $X$, and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where $A$ and $B$ are two sets;
5) if [condition missing in source], go to step 9); otherwise go to the next step, where [symbol missing in source] is the set of nodes adjacent to the nodes in $A$;
6) find a node [formula missing in source];
7) if $y$ is $M$-saturated, find the node $z$ matched to $y$, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;
8) there exists an $M$-augmenting path $P$ from $x_0$ to $y$; let [formula missing in source], and go to step 3);
9) compute the value $a$ as follows: [formula missing in source]; revise the labels: [formula missing in source]; obtain $E_{l'}$ and $G_{l'}$ from $l'$;
10) let $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, and go to step 6);
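For orientation, the following Python sketch shows a compact depth-first variant of the Kuhn-Munkres algorithm. It is not the patent's exact step sequence. Since several label formulas are not reproduced in this text, the sketch assumes the usual textbook choices: initial labels $l(x_i) = \max_j \omega_{ij}$ and $l(y_j) = 0$, and a label revision by the smallest slack $l(x_i) + l(y_j) - \omega_{ij}$ over $x_i$ in $A$ and $y_j$ outside $B$. It also assumes a square matrix in which absent edges are padded with weight 0. The function name `kuhn_munkres` is hypothetical.

```python
from typing import List

EPS = 1e-9


def kuhn_munkres(w: List[List[float]]) -> List[int]:
    """Maximum-weight perfect matching on a square weight matrix.

    w[i][j] is the weight of edge (x_i, y_j); absent edges are assumed to be
    padded with 0. Returns match_x, where match_x[i] = j means x_i is matched
    to y_j.
    """
    t = len(w)
    lx = [max(row) for row in w]  # assumed textbook initial labels for X
    ly = [0.0] * t                # l(y_j) = 0
    match_y = [-1] * t            # match_y[j] = index of the x matched to y_j

    def augment(i: int, in_a: List[bool], in_b: List[bool], slack: List[float]) -> bool:
        """Search for an augmenting path from x_i inside the equality subgraph."""
        in_a[i] = True
        for j in range(t):
            if in_b[j]:
                continue
            gap = lx[i] + ly[j] - w[i][j]
            if gap < EPS:  # edge (x_i, y_j) lies in the equality subgraph
                in_b[j] = True
                if match_y[j] == -1 or augment(match_y[j], in_a, in_b, slack):
                    match_y[j] = i
                    return True
            else:
                slack[j] = min(slack[j], gap)
        return False

    for x0 in range(t):
        while True:
            in_a = [False] * t
            in_b = [False] * t
            slack = [float("inf")] * t
            if augment(x0, in_a, in_b, slack):
                break
            # No augmenting path: revise labels by the smallest slack.
            a = min(slack[j] for j in range(t) if not in_b[j])
            for i in range(t):
                if in_a[i]:
                    lx[i] -= a
            for j in range(t):
                if in_b[j]:
                    ly[j] += a

    match_x = [-1] * t
    for j, i in enumerate(match_y):
        match_x[i] = j
    return match_x


weights = [[4, 3], [3, 1]]
print(kuhn_munkres(weights))  # prints [1, 0]
```

In this toy matrix a greedy choice of the heaviest edge (weight 4) would yield a total of 5, while the optimal matching reaches 6. Pairs matched through a zero-weight padding entry would be discarded before any sequences are built.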
A plurality of contiguous, totally ordered sequences of body-text blocks are determined from the optimal matching result M. The sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of vertices matched in M, and vertex b of X and vertex c of Y are a pair of vertices matched in M, then vertex a → vertex b → vertex c forms a sequence; this rule is applied recursively until the sequence has been extended to its maximum length; new sequences are then generated from the vertices not yet in any sequence, until every vertex belongs to some sequence; the text blocks corresponding to the vertices in each sequence thus form a text-block sequence.
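The sequence-generation rule can be sketched in Python as follows, assuming the matching has already been reduced to a successor map over real edges (a → b when a in X is matched to b in Y). The function name `chains_from_matching` is hypothetical, and the fallback for cycles is an added safeguard rather than something the claim describes.

```python
from typing import Dict, Hashable, List


def chains_from_matching(
    vertices: List[Hashable], succ: Dict[Hashable, Hashable]
) -> List[List[Hashable]]:
    """Link matched pairs a -> b, b -> c, ... into maximal sequences."""
    has_pred = set(succ.values())
    seen = set()
    chains: List[List[Hashable]] = []

    # Start a chain at every vertex that has no predecessor in the matching.
    for start in vertices:
        if start in has_pred or start in seen:
            continue
        chain = []
        v = start
        while v is not None and v not in seen:
            chain.append(v)
            seen.add(v)
            v = succ.get(v)
        chains.append(chain)

    # Safeguard: any vertex still unseen lies on a cycle; cut it arbitrarily.
    for start in vertices:
        if start in seen:
            continue
        chain = []
        v = start
        while v not in seen:
            chain.append(v)
            seen.add(v)
            v = succ[v]
        chains.append(chain)

    return chains


order = chains_from_matching(
    ["a", "b", "c", "d", "e", "f"], {"a": "b", "b": "c", "d": "e"}
)
print(order)  # prints [['a', 'b', 'c'], ['d', 'e'], ['f']]
```

A block with neither a matched successor nor a matched predecessor, such as `f` here, ends up as a sequence of length one.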
6. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (5), each body-text block sequence is further divided into a plurality of subsequences according to the style information of the text blocks, such as equal column width and equal column gutter, and according to semantic-association information; each subsequence has the properties of equal column width and equal column gutter, and the weight of the bipartite-graph edge corresponding to each pair of adjacent text blocks in the subsequence is greater than a threshold; the character stream formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered reading order of a single independent article.
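A minimal Python sketch of this splitting rule is given below. It cuts a sequence wherever two adjacent blocks differ in column width or column gutter, or wherever their bipartite edge weight does not exceed the threshold. The `Block` type, the tolerance `tol` used to compare measurements, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    name: str
    column_width: float
    column_gutter: float


def split_sequence(
    seq: List[Block],
    weights: Dict[Tuple[str, str], float],
    threshold: float,
    tol: float = 0.5,
) -> List[List[Block]]:
    """Split a block sequence into subsequences that likely form one article."""
    if not seq:
        return []
    parts = [[seq[0]]]
    for prev, cur in zip(seq, seq[1:]):
        same_layout = (
            abs(prev.column_width - cur.column_width) <= tol
            and abs(prev.column_gutter - cur.column_gutter) <= tol
        )
        linked = weights.get((prev.name, cur.name), 0.0) > threshold
        if same_layout and linked:
            parts[-1].append(cur)  # same article continues
        else:
            parts.append([cur])    # layout or semantic break: new subsequence
    return parts


sequence = [
    Block("a", 50.0, 4.0),
    Block("b", 50.0, 4.0),
    Block("c", 50.0, 4.0),
    Block("d", 70.0, 4.0),
]
edge_weights = {("a", "b"): 0.8, ("b", "c"): 0.1, ("c", "d"): 0.9}
parts = split_sequence(sequence, edge_weights, threshold=0.3)
print([[b.name for b in part] for part in parts])  # prints [['a', 'b'], ['c'], ['d']]
```

Here the weak link between `b` and `c` triggers a semantic break, and the wider column of `d` triggers a layout break even though its edge weight is high.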
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2004100914343A CN100568221C (en) | 2004-11-22 | 2004-11-22 | A kind of method of newspaper layout being carried out the words reading sequence recovery |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2004100914343A CN100568221C (en) | 2004-11-22 | 2004-11-22 | A kind of method of newspaper layout being carried out the words reading sequence recovery |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1604075A CN1604075A (en) | 2005-04-06 |
CN100568221C true CN100568221C (en) | 2009-12-09 |
Family
ID=34667256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2004100914343A Expired - Fee Related CN100568221C (en) | 2004-11-22 | 2004-11-22 | A kind of method of newspaper layout being carried out the words reading sequence recovery |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100568221C (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8365072B2 (en) | 2009-01-02 | 2013-01-29 | Apple Inc. | Identification of compound graphic elements in an unstructured document |
CN101866418B (en) * | 2009-04-17 | 2013-02-27 | 株式会社理光 | Method and equipment for determining file reading sequences |
CN102541826B (en) * | 2010-12-27 | 2014-08-06 | 北大方正集团有限公司 | Text block content reorganizing method and device |
CN102073862B (en) * | 2011-02-18 | 2013-04-17 | 山东山大鸥玛软件有限公司 | Method for quickly calculating layout structure of document image |
CN103488619B (en) * | 2013-07-05 | 2017-05-24 | 百度在线网络技术(北京)有限公司 | Method and device for processing document file |
CN104268127B (en) * | 2014-09-22 | 2018-02-09 | 同方知网(北京)技术有限公司 | A kind of method of electronics shelves layout files reading order analysis |
CN106096592B (en) * | 2016-07-22 | 2019-05-24 | 浙江大学 | A kind of printed page analysis method of digital book |
CN108268429B (en) * | 2017-06-15 | 2021-08-06 | 阿里巴巴(中国)有限公司 | Method and device for determining network literature chapters |
CN109274681B (en) * | 2018-10-25 | 2021-11-16 | 深圳壹账通智能科技有限公司 | Information synchronization method and device, storage medium and server |
CN110209765B (en) * | 2019-05-23 | 2021-03-30 | 武汉绿色网络信息服务有限责任公司 | Method and device for searching keywords according to meanings |
CN113221743B (en) * | 2021-05-12 | 2024-01-12 | 北京百度网讯科技有限公司 | Table analysis method, apparatus, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN1604075A (en) | 2005-04-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
Gatterbauer et al. | Towards domain-independent information extraction from web tables | |
Wang et al. | Document zone content classification and its performance evaluation | |
Akimushkin et al. | On the role of words in the network structure of texts: Application to authorship attribution | |
CN106201465A (en) | Software project personalized recommendation method towards open source community | |
CN102043851A (en) | Multiple-document automatic abstracting method based on frequent itemset | |
CN111177591A (en) | Knowledge graph-based Web data optimization method facing visualization demand | |
CN100568221C (en) | A kind of method of newspaper layout being carried out the words reading sequence recovery | |
CN109145260A (en) | A kind of text information extraction method | |
Rastan et al. | Texus: A task-based approach for table extraction and understanding | |
CN103810251A (en) | Method and device for extracting text | |
CN112559656A (en) | Method for constructing affair map based on hydrologic events | |
CN109657114B (en) | Method for extracting webpage semi-structured data | |
CN114997288A (en) | Design resource association method | |
Du et al. | Exploiting syntactic structure for better language modeling: A syntactic distance approach | |
CN116205211A (en) | Document level resume analysis method based on large-scale pre-training generation model | |
CN1604073A (en) | Method for conducting title and text logic connection for newspaper pages | |
CN102591931A (en) | Recognition and extraction method for webpage data records based on tree weight | |
CN109582958B (en) | Disaster story line construction method and device | |
CN103699568A (en) | Method for extracting hyponymy relation of field terms from wikipedia | |
de Oliveira et al. | A syntactic-relationship approach to construct well-informative knowledge graphs representation | |
Gao et al. | Newspaper article reconstruction using ant colony optimization and bipartite graph | |
CN101436194B (en) | Text multiple-accuracy representing method based on data excavating technology | |
Phan et al. | Automated data extraction from the web with conditional models | |
CN111859887A (en) | Scientific and technological news automatic writing system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20091209 |