CN100568221C - A method for recovering the text reading order of a newspaper layout - Google Patents


Publication number
CN100568221C
Authority
CN
China
Prior art keywords
text block
sequence
block
body text
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100914343A
Other languages
Chinese (zh)
Other versions
CN1604075A (en
Inventor
贾娟
陈晓鸥
陈堃銶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University
Priority to CNB2004100914343A
Publication of CN1604075A
Application granted
Publication of CN100568221C
Expired - Fee Related
Anticipated expiration

Abstract

The invention belongs to document layout understanding technology in the field of intelligent text and graphic information processing, and specifically relates to a content-based method for recovering the text reading order of a newspaper layout. Prior-art processing of complex newspaper layouts loses the reading order and yields content that is not separated into independent articles. To address this defect, the invention is the first to model the problem mathematically with graph theory: the adjacency relations between text blocks are expressed as a directed graph; the directed graph is split and converted into a weighted bipartite graph; natural language processing techniques are used to compute the edge weights of the bipartite graph; an optimal matching yields several continuous sequences; and each sequence is further divided into subsequences according to the style information of the text blocks. Concatenating the contents of the blocks in a subsequence gives the text flow, in reading order, of one independent article. Because semantic, spatial-relation and style information are all used, the accuracy of reading order recovery is greatly improved and the result is independent at the article level. The method can be applied to layout understanding and structural reconstruction of styled documents.

Description

A method for recovering the text reading order of a newspaper layout
Technical field
The invention belongs to document layout understanding technology in the field of intelligent text and graphic information processing, and specifically relates to a method for recovering the text reading order of a newspaper layout.
Background art
With the development of information technology and the emergence of new media formats, cross-media publishing has developed rapidly thanks to its advantages: shared information, convenient and efficient dissemination, rich forms of expression, and the complementary strengths of multiple media. An XML-based digital asset management system is the core of cross-media publishing. In traditional information dissemination, however, the form in which information exists depends directly on the form of the terminal medium, which makes cross-media publishing difficult. Newspapers in particular are enormous in number, go back a long way, and have complex styles, poor content independence and ambiguous reading order, so structuring them as XML is the most difficult case. How to recover coherent, independent article text flows from the ambiguous and interdependent spatial relations of text in such complex newspaper layouts, and how to represent them in XML with semantic information, is the problem that newspaper data assets face in achieving cross-media publishing. Reading order recovery for newspaper layouts is precisely the method for solving these technical problems.
At present, mainstream OCR digitization software, when processing the layout of styled documents, ignores the recovery of reading order and semantic structure and converts the document into a styled electronic document such as PDF or HTML for republication. This hampers information reuse and deep processing such as retrieval, utilization, trading, rewriting, supplementing and collation. For multi-article newspaper layouts in particular, the lack of per-article reading order and structure makes reuse even harder. There are two main classes of methods for reading order recovery. The first uses style and spatial-relation information, as in the document "Layout analysis, understanding and reconstruction of complex Chinese newspapers" (authors Chen Ming, Ding Xiaoqing, Liang Jian; Journal of Tsinghua University (Natural Science Edition), vol. 41, no. 1, 2001, pp. 29-32, 59) and the document "Integrated Algorithms for Newspaper Page Decomposition and Article Tracking" (authors B. Gatos, S.L. Mantzaris, K.V. Chandrinos, A. Tsigris, S.J. Perantonis; Proceedings of the Fifth International Conference on Document Analysis and Recognition, 1999, pp. 559-562). These methods treat the newspaper layout as a set of independent text blocks and apply rules, based on the principle that an article has a homogeneous style, to merge text blocks and determine their reading order. Rule-based methods can only handle layouts whose style and spatial relations are simple, such as books and journal articles; the diversity of newspaper layouts and the dependencies between their objects make the accuracy of reading order recovery between the text blocks of a complex layout too low when only style and spatial rules are used. The second class uses semantic and spatial-relation information. In 2002, Aiello M, Monz C, Todoran L et al., in the document "Document understanding for a broad class of documents" (International Journal on Document Analysis and Recognition, 2002, 5(1): 1-16), first disclosed a method that uses semantic information to determine reading order: all possible reading orders are enumerated as permutations, and the best result is then selected according to a part-of-speech weight formula. However, the time complexity grows exponentially with the number of text blocks, independent reading orders cannot be extracted, and too little semantic information is used, which hurts accuracy. None of the above techniques makes full use of the various kinds of latent information in a newspaper layout document to obtain a more accurate reading order, let alone forms a unified mathematical model.
Summary of the invention
In view of the problems in the prior art, the object of the invention is to provide a method for recovering the text reading order of a newspaper layout. The method can effectively recover the reading order of a newspaper layout document and can segment that order into independent reading orders, one per article, thereby greatly improving reading order accuracy and facilitating further XML semantic structuring.
To achieve the above object, the technical solution adopted by the invention is a method for recovering the text reading order of a newspaper layout, comprising the following steps:
(1) Read in a document carrying style and layout information and perform layout analysis: characters of identical style are merged into text blocks, which are classified into body text blocks and non-body text blocks. The spatial relations among the characters inside a text block are simple, so the characters in a block are joined, following the rule that left is read before right and top before bottom, into a text flow in reading order that serves as the content of the block. Non-body text blocks are isolated from the surrounding text blocks, and their reading order relative to other text blocks need not be considered; the core of the processing is the reading order among the contents of body text blocks;
(2) Taking body text blocks as vertices and the left-right adjacency relations of the blocks as directed edges, build a horizontal adjacency digraph; taking the blocks as vertices and their top-bottom adjacency relations as directed edges, build a vertical adjacency digraph. From these two digraphs, build a spatial order digraph according to the spatial order rule, which is defined as follows: if body text block L is a predecessor of body text block m in the horizontal or vertical adjacency digraph, then L precedes m in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and body text block n is a predecessor of m in the vertical adjacency digraph, then L precedes n in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and L is a predecessor of body text block n in the vertical adjacency digraph, then n precedes m in spatial order;
(3) Split and convert the spatial order digraph to construct a weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from the relevance of the contents of the body text blocks corresponding to the two end vertices of the edge, the local activity of their overlapping words, the word-formation degree of the last word and the first word, the part-of-speech transition probability, and so on;
(4) Perform optimal matching on the weighted bipartite graph, and determine several continuous, totally ordered sequences of body text blocks from the matching result;
(5) Divide each body text block sequence into several subsequences according to the style information and semantic association information of the text blocks. The text flow formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article.
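The spatial order rule of step (2) can be illustrated with a short Python sketch. It is only one possible reading of the rule: it assumes that each of the three clauses contributes a directed edge to the spatial order digraph, and the function name `spatial_order_edges` and the edge-set representation are hypothetical, not taken from the patent.

```python
def spatial_order_edges(horizontal, vertical):
    """Build the edge set of a spatial order digraph.

    horizontal, vertical: sets of (u, v) pairs meaning that block u is the
    predecessor of block v in the horizontal / vertical adjacency digraph.
    Assumption: every clause of the spatial order rule adds one edge.
    """
    # Clause 1: a predecessor in either digraph precedes its successor.
    edges = set(horizontal) | set(vertical)
    for (l, m) in horizontal:
        for (p, q) in vertical:
            # Clause 2: L -> m horizontally and n -> m vertically gives L before n.
            if q == m and p != l:
                edges.add((l, p))
            # Clause 3: L -> m horizontally and L -> n vertically gives n before m.
            if p == l and q != m:
                edges.add((q, m))
    return edges


# Tiny illustrative layout with four hypothetical block ids.
example = spatial_order_edges({(1, 2)}, {(3, 2), (1, 4)})
```

In this toy input, block 1 is left of block 2, block 3 is above block 2, and block 1 is above block 4, so the sketch adds the derived edges 1 → 3 and 4 → 2 to the three adjacency edges.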
Further, to give the invention a better effect:
In step (4), when recovering the reading order, the Kuhn-Munkres algorithm for optimal matching in graph theory is applied to content-based reading order recovery.
In step (1), documents carrying style and layout information include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software such as Founder FeiTeng (FIT). Style information mainly means that every character has position and size information. Layout analysis merges characters of identical style into text blocks bottom-up, according to the principle of local style homogeneity. Text blocks are classified into body text blocks and non-body text blocks according to their character style and number of lines. The spatial relations among page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line.
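As a minimal sketch of how the characters inside one text block might be joined into a text flow (left before right, top before bottom), the following Python function groups characters into lines by their vertical position and sorts each line by horizontal position. The tuple layout `(char, x, y, size)`, the tolerance parameter `line_tol` and the function name `block_text` are assumptions made for illustration; horizontal writing with y growing downward is also assumed.

```python
def block_text(chars, line_tol=0.5):
    """Join the characters of one block into reading order.

    chars: list of (char, x, y, size) tuples; y grows downward.
    Characters whose y differs from the first character of the current line
    by at most line_tol * size are treated as lying on the same line.
    """
    ordered = sorted(chars, key=lambda c: (c[2], c[1]))
    lines, current, current_y = [], [], None
    for ch, x, y, size in ordered:
        if current_y is None or abs(y - current_y) > line_tol * size:
            if current:
                lines.append(current)
            current, current_y = [], y  # start a new line
        current.append((x, ch))
    if current:
        lines.append(current)
    # Within each line, read from left to right.
    return "".join(ch for line in lines for _, ch in sorted(line))


sample = [("b", 2, 0.1, 1), ("a", 1, 0.0, 1), ("c", 1, 1.2, 1), ("d", 2, 1.0, 1)]
text = block_text(sample)
```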
In step (3), the two vertex sets X and Y of the weighted bipartite graph each contain all the vertices of the spatial order digraph. The edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are respectively the tail (out-vertex) and the head (in-vertex) of an edge in the spatial order digraph, then an edge between them also exists in the weighted bipartite graph. The edge weights of the weighted bipartite graph are computed with natural language processing techniques:
(1) The relevance of the contents $d_1$ and $d_2$ corresponding to vertex a of X and vertex b of Y: $\mathrm{Similarity}(d_1, d_2) = \mathrm{cosine}(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$;
(2) The local lexical activity of $d_1$ and $d_2$: $\mathrm{Active}(d_1, d_2)$ = (number of overlapping words of $d_1$ and $d_2$) / (sum of the distribution degrees of the overlapping-word chains);
(3) The word-formation degree WordTrans of the last word $w_1$ of $d_1$ and the first word $w_2$ of $d_2$: if the string $w_1 w_2$ is a word in the dictionary, WordTrans is defined as 1; otherwise it is defined as 0;
(4) The part-of-speech transition probability between the part of speech pos1 of the last word $w_1$ of $d_1$ and the part of speech pos2 of the first word $w_2$ of $d_2$: $\mathrm{PosTrans} = P(\mathrm{pos1}\,\mathrm{pos2} \mid \mathrm{pos2}) = \mathrm{freq}(\mathrm{pos1}, \mathrm{pos2}) / \mathrm{freq}(\mathrm{pos1})$, where freq(pos1, pos2) is the number of co-occurrences of pos1 and pos2 in the corpus and freq(pos1) is the number of occurrences of pos1 in the corpus;
Edge weight $= \alpha_1 \cdot \mathrm{Similarity} + \alpha_2 \cdot \mathrm{Active} + \alpha_3 \cdot \mathrm{WordTrans} + \alpha_4 \cdot \mathrm{PosTrans}$, with $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
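The four terms and their linear combination can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: it assumes block contents are already segmented into word lists, uses raw term counts as the vectors for the cosine, takes the Active term as a precomputed input because the text does not define the distribution degree of an overlapping-word chain, and uses equal coefficients purely as a placeholder. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def cosine(d1, d2):
    """Cosine similarity of two word lists, using raw term counts."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
        math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


def word_trans(d1, d2, dictionary):
    """1.0 if last word of d1 + first word of d2 is a dictionary word, else 0.0."""
    return 1.0 if d1 and d2 and (d1[-1] + d2[0]) in dictionary else 0.0


def pos_trans(pos1, pos2, pair_freq, pos_freq):
    """freq(pos1, pos2) / freq(pos1), taken from corpus counts."""
    total = pos_freq.get(pos1, 0)
    return pair_freq.get((pos1, pos2), 0) / total if total else 0.0


def edge_weight(similarity, active, wtrans, ptrans,
                alphas=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four terms; the coefficients must sum to 1."""
    if not math.isclose(sum(alphas), 1.0):
        raise ValueError("coefficients must sum to 1")
    return sum(a * v for a, v in zip(alphas, (similarity, active, wtrans, ptrans)))


weight = edge_weight(1.0, 0.0, 1.0, 0.5)
```

In practice the coefficients would be tuned on labelled layouts; the sketch only shows how the terms fit together.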
In step (4), the Kuhn-Munkres algorithm is used to perform optimal matching on the weighted bipartite graph. The algorithm is as follows:
1) Give the initial labeling $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y_k, E_l)$, and a matching M in $G_l$;
3) If M saturates all vertices of X, then M is the optimal matching of G and the computation ends; otherwise go to the next step;
4) Find an M-unsaturated vertex $x_0$ in X and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where A and B are two sets;
5) If $N_{G_l}(A) = B$, go to step 9); otherwise go to the next step, where $N_{G_l}(A) \subseteq Y_k$ is the set of vertices adjacent to the vertices in A;
6) Find a vertex $y \in N_{G_l}(A) - B$;
7) If y is M-saturated, find the vertex z matched with y, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;
8) There exists an augmenting path P from $x_0$ to y; let $M \leftarrow M \oplus E(P)$ and go to step 3);
9) Compute the value a as $a = \min_{x_i \in A,\; y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - \omega_{ij} \}$ and modify the labeling:
Figure C20041009143400086
Compute $E_{l'}$ and $G_{l'}$ from $l'$;
10) Let $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, and go to step 6);
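The labeling procedure above can be sketched in Python. This is a minimal sketch under stated assumptions, not the patent's own implementation: it expects a square matrix of non-negative weights (a rectangular problem would first be padded with zeros so that t = max(n, m)), and, because the label-update formula of step 9) is available here only as a figure reference, it uses the standard Kuhn-Munkres update, subtracting a from the labels of vertices in A and adding a to the labels of vertices in B. The names `kuhn_munkres`, `try_augment` and `match_y` are illustrative.

```python
import math


def kuhn_munkres(w):
    """Maximum-weight perfect matching on a square weight matrix w.

    Returns match_y, where match_y[j] is the index i of the X vertex
    matched with Y vertex j.
    """
    n = len(w)
    lx = [max(row) for row in w]   # step 1: l(x_i) = max_j w_ij
    ly = [0.0] * n                 # step 1: l(y_j) = 0
    match_y = [-1] * n

    def try_augment(i, in_a, in_b, slack):
        """Search the equality subgraph for an augmenting path from x_i."""
        in_a[i] = True
        for j in range(n):
            if in_b[j]:
                continue
            gap = lx[i] + ly[j] - w[i][j]
            if math.isclose(gap, 0.0, abs_tol=1e-9):   # edge of E_l
                in_b[j] = True
                if match_y[j] == -1 or try_augment(match_y[j], in_a, in_b, slack):
                    match_y[j] = i                       # flip along the path
                    return True
            else:
                slack[j] = min(slack[j], gap)
        return False

    for x0 in range(n):            # step 4: pick an unsaturated x_0
        while True:
            in_a, in_b = [False] * n, [False] * n
            slack = [float("inf")] * n
            if try_augment(x0, in_a, in_b, slack):
                break
            # step 9: a = min over x in A, y not in B of l(x) + l(y) - w
            a = min(slack[j] for j in range(n) if not in_b[j])
            for i in range(n):
                if in_a[i]:
                    lx[i] -= a
            for j in range(n):
                if in_b[j]:
                    ly[j] += a
    return match_y


weights = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
matching = kuhn_munkres(weights)
```

For the small matrix shown, the only assignment with total weight 9 pairs row 0 with column 1, row 1 with column 2 and row 2 with column 0.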
Based on the optimal matching result M, several continuous, totally ordered sequences of body text blocks are determined. The sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of saturated vertices matched in M, and vertex b of X and vertex c of Y are a pair of saturated vertices matched in M, then vertex a → vertex b → vertex c forms a sequence. The sequence is extended recursively in this way until it is as long as possible; a new sequence is then generated from the vertices not in that sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence.
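The chaining of matched pairs into sequences can be sketched as below. The sketch assumes that `pairs` holds only matched pairs that correspond to real edges of the bipartite graph, written as (a, b) for "block b directly follows block a", and that these pairs contain no cycle; the function name `chains_from_matching` is hypothetical.

```python
def chains_from_matching(pairs, vertices):
    """Turn matched (predecessor, successor) pairs into maximal sequences.

    pairs: iterable of (a, b) meaning block b directly follows block a.
    vertices: all body text block ids, in the order sequences should start.
    Assumption: the pairs are acyclic; blocks on a cycle would be skipped.
    """
    succ = dict(pairs)
    has_pred = set(succ.values())
    chains = []
    for v in vertices:
        if v in has_pred:
            continue               # v continues some other sequence
        chain = [v]
        while chain[-1] in succ:   # extend until no successor remains
            chain.append(succ[chain[-1]])
        chains.append(chain)
    return chains


demo = chains_from_matching([(23, 24), (24, 25), (12, 13)],
                            [12, 13, 23, 24, 25, 29])
```

A block that is matched to nothing, such as block 29 in the demo input, ends up as a sequence of length one.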
In step (5), each body text block sequence is further divided into several subsequences according to the style information of the text blocks, such as column width and column gutter, and according to semantic association information. Each subsequence has the following property: its blocks have equal column width and equal column gutter, and the edge weight in the bipartite graph corresponding to each pair of adjacent text blocks in the subsequence is greater than a threshold. The text flow formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article.
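A minimal sketch of this splitting step, assuming that the style of each block is summarized as a (column width, column gutter) pair compared by exact equality and that edge weights are looked up in a dictionary, could look as follows. The names `split_sequence`, `style`, `weight` and `threshold` are illustrative, and real layout data would probably need a tolerance when comparing widths.

```python
def split_sequence(seq, style, weight, threshold):
    """Split one block sequence into per-article subsequences.

    seq: list of block ids in matched order.
    style: dict block id -> (column_width, column_gutter).
    weight: dict (a, b) -> edge weight of the bipartite graph.
    Adjacent blocks stay together only if their styles are equal and
    their edge weight exceeds the threshold.
    """
    if not seq:
        return []
    parts = [[seq[0]]]
    for prev, cur in zip(seq, seq[1:]):
        same_style = style[prev] == style[cur]
        linked = weight.get((prev, cur), 0.0) > threshold
        if same_style and linked:
            parts[-1].append(cur)
        else:
            parts.append([cur])    # start a new article
    return parts


parts = split_sequence(
    [26, 5, 2, 6],
    {26: (10, 1), 5: (12, 1), 2: (8, 1), 6: (8, 1)},
    {(26, 5): 0.9, (5, 2): 0.9, (2, 6): 0.7},
    threshold=0.5,
)
```

The widths, gutters and weights in the demo call are invented values chosen only to exercise both split conditions.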
The effect of the invention is to propose, in view of the characteristics of newspaper layout documents, a new content-based method for recovering the text reading order of a newspaper layout. The method makes effective use of the semantic information, spatial-relation information and style information in a newspaper layout document and models the reading order problem with a graph-theoretic mathematical model. It not only recovers the reading order but also yields text flows that are independent per article, so that the content of the newspaper layout becomes independent of its style. This greatly improves reading order accuracy and better supports information extraction and XML structuring of newspaper layouts, so that historical data assets can be reused and republished across media. The method can be widely applied in intelligent text and graphic information processing fields such as layout understanding, for example the layout understanding and structuring of styled documents such as paper media, PS, PDF, Word and InDesign documents.
The reasons why the invention achieves such a significant technical effect are:
1. The invention is the first to model the problem of reading order recovery between body text blocks mathematically with the graph-theoretic theory of optimal matching;
2. It uses the fact that spatial continuity is a necessary condition for text flow continuity, expressing the spatial adjacency relations between text blocks as edges of a digraph in order to reduce the search space;
3. The digraph is split and converted into a weighted bipartite graph so that the most probable reading order sequence can be selected quantitatively;
4. Because the most essential criterion for text flow continuity is content, natural language processing techniques are used to decide whether two text blocks are consecutive in reading order: at the word level, the word-formation degree of the last word of the preceding block and the first word of the following block; at the sentence level, the part-of-speech transition probability between the last word of the preceding block and the first word of the following block; at the paragraph level, the relevance of the contents and the local activity of overlapping words. Their linear weighting serves as the edge weight of the bipartite graph, and the Kuhn-Munkres matching algorithm yields several continuous text block sequences;
5. Each such sequence is not yet independent per article. Newspaper layouts are homogeneous in style within an article and heterogeneous between articles, and the text blocks of each article are consistent in topic. Based on these characteristics, each text block sequence is divided into several continuous subsequences using column width, column gutter and semantic relevance information. Concatenating the contents of the text blocks in each subsequence gives the independent text flow, in reading order, of one article.
Description of drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a schematic diagram of a newspaper after layout analysis;
Fig. 3 is a schematic diagram of the horizontal adjacency digraph of the body text blocks of a newspaper layout document;
Fig. 4 is a schematic diagram of the vertical adjacency digraph of the body text blocks of a newspaper layout document;
Fig. 5 is a schematic diagram of the bipartite graph obtained by splitting and converting the horizontal and vertical adjacency digraphs;
Fig. 6 is a schematic diagram of the result of the Kuhn-Munkres optimal matching algorithm;
Fig. 7 is a schematic diagram of the newspaper after the reading order has been recovered.
Embodiment
The invention is further described below with reference to the drawings and an embodiment.
In this embodiment, a newspaper document obtained by OCR scanning is used as example data. As shown in Fig. 1, a method for recovering the text reading order of a newspaper layout comprises the following steps:
One. Read in a document carrying style and layout information, including documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software such as Founder FeiTeng (FIT). Style information mainly means that every character has position and size information. Layout analysis merges characters of identical style into text blocks bottom-up, according to the principle of local style homogeneity. Text blocks are classified into body text blocks and non-body text blocks according to their character style and number of lines. As shown in Fig. 2, solid rectangles represent body text blocks, the numbers are their identifiers, and dashed lines represent non-body text blocks. The spatial relations among page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line; following the rule that left is read before right and top before bottom, the characters in a block are joined into a text flow in reading order that serves as the content of the block. Non-body text blocks are isolated from the surrounding text blocks, and their reading order relative to other text blocks need not be considered; the core of the processing is the reading order among the contents of body text blocks.
Two. Taking body text blocks as vertices and the left-right adjacency relations of the blocks as directed edges, build a horizontal adjacency digraph, as shown in Fig. 3; taking the blocks as vertices and their top-bottom adjacency relations as directed edges, build a vertical adjacency digraph, as shown in Fig. 4. From these two digraphs, build a spatial order digraph according to the spatial order rule, which is defined as follows: if body text block L is a predecessor of body text block m in the horizontal or vertical adjacency digraph, then L precedes m in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and body text block n is a predecessor of m in the vertical adjacency digraph, then L precedes n in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and L is a predecessor of body text block n in the vertical adjacency digraph, then n precedes m in spatial order.
Three. Split and convert the spatial order digraph to construct a weighted bipartite graph, as shown in Fig. 5, where "f text" denotes a predecessor vertex in reading order and "t text" denotes a successor vertex in reading order. The two vertex sets X and Y of the weighted bipartite graph each contain all the vertices of the spatial order digraph. The edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are respectively the tail (out-vertex) and the head (in-vertex) of an edge in the spatial order digraph, then an edge between them also exists in the weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from the relevance of the contents of the body text blocks corresponding to the two end vertices of the edge, the local activity of their overlapping words, the word-formation degree of the last word and the first word, the part-of-speech transition probability, and so on. They are computed as follows:
(1) The relevance of the contents $d_1$ and $d_2$ corresponding to vertex a of X and vertex b of Y: $\mathrm{Similarity}(d_1, d_2) = \mathrm{cosine}(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$;
(2) The local lexical activity of $d_1$ and $d_2$: $\mathrm{Active}(d_1, d_2)$ = (number of overlapping words of $d_1$ and $d_2$) / (sum of the distribution degrees of the overlapping-word chains);
(3) The word-formation degree WordTrans of the last word $w_1$ of $d_1$ and the first word $w_2$ of $d_2$: if the string $w_1 w_2$ is a word in the dictionary, WordTrans is defined as 1; otherwise it is defined as 0;
(4) The part-of-speech transition probability between the part of speech pos1 of the last word $w_1$ of $d_1$ and the part of speech pos2 of the first word $w_2$ of $d_2$: $\mathrm{PosTrans} = P(\mathrm{pos1}\,\mathrm{pos2} \mid \mathrm{pos2}) = \mathrm{freq}(\mathrm{pos1}, \mathrm{pos2}) / \mathrm{freq}(\mathrm{pos1})$, where freq(pos1, pos2) is the number of co-occurrences of pos1 and pos2 in the corpus and freq(pos1) is the number of occurrences of pos1 in the corpus;
Edge weight $= \alpha_1 \cdot \mathrm{Similarity} + \alpha_2 \cdot \mathrm{Active} + \alpha_3 \cdot \mathrm{WordTrans} + \alpha_4 \cdot \mathrm{PosTrans}$, with $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
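The split of the spatial order digraph into the two vertex sets of step Three can be sketched in Python as follows. The sketch represents the bipartite graph as a square weight matrix and, as an assumption not stated in the text, stores weight 0 where the digraph has no edge; the labels "f" and "t" mirror the "f text" and "t text" vertices of Fig. 5, while `build_bipartite` and `weight_fn` are hypothetical names.

```python
def build_bipartite(vertices, order_edges, weight_fn):
    """Split a spatial order digraph into a weighted bipartite graph.

    vertices: list of body text block ids.
    order_edges: iterable of (a, b) edges of the spatial order digraph.
    weight_fn: callable (a, b) -> edge weight.
    Returns X, Y and a matrix w where w[i][j] is the weight of the edge
    from X[i] to Y[j], or 0.0 if the digraph has no such edge (assumption).
    """
    x = [("f", v) for v in vertices]   # predecessor copies of every block
    y = [("t", v) for v in vertices]   # successor copies of every block
    index = {v: i for i, v in enumerate(vertices)}
    w = [[0.0] * len(vertices) for _ in vertices]
    for a, b in order_edges:
        w[index[a]][index[b]] = weight_fn(a, b)
    return x, y, w


xs, ys, matrix = build_bipartite([1, 2, 3], {(1, 2), (2, 3)}, lambda a, b: 0.5)
```

A matrix of this shape is convenient because it can be handed directly to an assignment solver; matched pairs whose weight is 0 would then be discarded as not being real edges.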
Four. Use the Kuhn-Munkres algorithm to perform optimal matching on the weighted bipartite graph, and determine several continuous, totally ordered sequences of body text blocks from the matching result. The Kuhn-Munkres algorithm is as follows:
(1) Give the initial labeling $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
(2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y_k, E_l)$, and a matching M in $G_l$;
(3) If M saturates all vertices of X, then M is the optimal matching of G and the computation ends; otherwise go to the next step;
(4) Find an M-unsaturated vertex $x_0$ in X and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where A and B are two sets;
(5) If $N_{G_l}(A) = B$, go to step (9); otherwise go to the next step, where $N_{G_l}(A) \subseteq Y_k$ is the set of vertices adjacent to the vertices in A;
(6) Find a vertex $y \in N_{G_l}(A) - B$;
(7) If y is M-saturated, find the vertex z matched with y, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step (5); otherwise go to the next step;
(8) There exists an augmenting path P from $x_0$ to y; let $M \leftarrow M \oplus E(P)$ and go to step (3);
(9) Compute the value a as $a = \min_{x_i \in A,\; y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - \omega_{ij} \}$ and modify the labeling:
Figure C20041009143400117
Compute $E_{l'}$ and $G_{l'}$ from $l'$;
(10) Let $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, and go to step (6);
Based on the optimal matching result M, sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of saturated vertices matched in M, and vertex b of X and vertex c of Y are a pair of saturated vertices matched in M, then vertex a → vertex b → vertex c forms a sequence. The sequence is extended recursively in this way until it is as long as possible; a new sequence is then generated from the vertices not in that sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence. The optimal matching shown in Fig. 6 generates 5 sequences in total: 12 → 13 → 14 → 15 → 16 → 17 → 20 → 18 → 21 → 19 → 22 → 1; 23 → 24 → 25; 27 → 28 → 0 → 8 → 9 → 10 → 11 → 4; 26 → 5 → 2 → 6 → 7 → 3; and 29.
Five. Divide each body text block sequence into several subsequences according to the style information and semantic association information of the text blocks. Each subsequence has the following property: its blocks have equal column width and equal column gutter, and the edge weight in the bipartite graph corresponding to each pair of adjacent text blocks in the subsequence is greater than a threshold. The text flow formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article. As shown in Fig. 7, there are 9 articles in total, with the reading order indicated by arrows: 12 → 13 → 14 → 15 → 16 → 17 → 20 → 18 → 21 → 19 → 22; 23 → 24 → 25; 27 → 28 → 0; 8 → 9 → 10 → 11 → 4; 2 → 6 → 7 → 3; 1; 5; 26; and 29. Four of these articles contain only a single text block each, namely 1, 5, 26 and 29.

Claims (6)

1. A method for recovering the text reading order of a newspaper layout, comprising the following steps:
(1) Read in a document carrying style and layout information and perform layout analysis: characters of identical style are merged into text blocks, which are classified into body text blocks and non-body text blocks; the spatial relations among the characters inside a text block are simple, so the characters in a block are joined, following the rule that left is read before right and top before bottom, into a text flow in reading order that serves as the content of the block; non-body text blocks are isolated from the surrounding text blocks, and their reading order relative to other text blocks need not be considered; the core of the processing is the reading order among the contents of body text blocks;
(2) Taking body text blocks as vertices and the left-right adjacency relations of the blocks as directed edges, build a horizontal adjacency digraph; taking the blocks as vertices and their top-bottom adjacency relations as directed edges, build a vertical adjacency digraph; from these two digraphs, build a spatial order digraph according to the spatial order rule, which is defined as follows: if body text block L is a predecessor of body text block m in the horizontal or vertical adjacency digraph, then L precedes m in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and body text block n is a predecessor of m in the vertical adjacency digraph, then L precedes n in spatial order; if L is a predecessor of m in the horizontal adjacency digraph and L is a predecessor of body text block n in the vertical adjacency digraph, then n precedes m in spatial order;
(3) Split and convert the spatial order digraph to construct a weighted bipartite graph; the edge weights of the bipartite graph are determined with natural language processing techniques from the relevance of the contents of the body text blocks corresponding to the two end vertices of the edge, the local activity of their overlapping words, the word-formation degree of the last word and the first word, and the part-of-speech transition probability;
(4) Perform optimal matching on the weighted bipartite graph, and determine several continuous, totally ordered sequences of body text blocks from the matching result;
(5) Divide each body text block sequence into several subsequences according to the style information and semantic association information of the text blocks; the text flow formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article.
2, a kind of method of newspaper layout being carried out the words reading sequence recovery as claimed in claim 1, it is characterized in that: in step (4), carry out reading order when recovering, Ku En-Man Kele (Kuhn-Munkres) algorithm of Optimum Matching in the graph theory is used for content-based reading order recovers.
3, a kind of method of newspaper layout being carried out the words reading sequence recovery as claimed in claim 1, it is characterized in that: the document of band pattern layout information comprises that scanning paper medium newspaper and OCR discern the document that the document, PDF, professional software for composing such as the Founder that obtain are soared and generated in the step (1), style information is meant that mainly each word all has position and size information, and printed page analysis is merged into the literal piece to the identical literal of pattern according to local pattern homogeneity principle is bottom-up; The classification foundation literal piece printed words formula of literal piece and line number amount are divided into text literal piece and non-text literal piece, and literal piece inside page object spatial relationship is the vertical syntople between row and the row, the horizontal syntople between interior word and the word of going.
4. The method for recovering the text reading order of a newspaper layout as claimed in claim 1, characterized in that: in step (3), the two vertex sets X and Y of the weighted bipartite graph each contain all vertices of the spatial-order directed graph; the edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are respectively the out-vertex and the in-vertex of an edge of the spatial-order directed graph, then an edge between them also exists in the weighted bipartite graph; the edge weights of the weighted bipartite graph are computed with natural language processing techniques:
1) the degree of correlation of the contents $d_1$ and $d_2$ corresponding to vertex a of X and vertex b of Y: $Similarity(d_1, d_2) = cosine(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$;
2) the local lexical activity of $d_1$ and $d_2$: $Active(d_1, d_2)$ = (number of overlapping words of $d_1$ and $d_2$) / (sum of the distribution degrees of the overlapping-word chains);
3) the word-formation degree $WordTrans$ of the tail word $w_1$ of $d_1$ and the head word $w_2$ of $d_2$: if the string $w_1 w_2$ is a word in the dictionary, $WordTrans$ is defined as 1, otherwise as 0;
4) the part-of-speech transition probability between the part of speech $pos1$ of the tail word $w_1$ of $d_1$ and the part of speech $pos2$ of the head word $w_2$ of $d_2$: $PosTrans = P(pos1\,pos2 \mid pos2) = freq(pos1, pos2) / freq(pos1)$, where $freq(pos1, pos2)$ is the number of co-occurrences of $pos1$ and $pos2$ in the corpus and $freq(pos1)$ is the number of occurrences of $pos1$ in the corpus;
Edge weight $= \alpha_1 \cdot Similarity + \alpha_2 \cdot Active + \alpha_3 \cdot WordTrans + \alpha_4 \cdot PosTrans$, with $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
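The weight computation of claim 4 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names, the bag-of-words vectors used for the cosine, and the equal default coefficients are all hypothetical. The Active term is passed in as a precomputed number, because the claim does not define the distribution degree of an overlapping-word chain.

```python
import math
from collections import Counter


def cosine_similarity(d1_tokens, d2_tokens):
    """Cosine of the term-frequency vectors of two text blocks (assumed representation)."""
    v1, v2 = Counter(d1_tokens), Counter(d2_tokens)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def word_trans(tail_word, head_word, dictionary):
    """1 if the tail word of d1 joined to the head word of d2 is a dictionary word, else 0."""
    return 1.0 if tail_word + head_word in dictionary else 0.0


def pos_trans(pos1, pos2, pair_freq, pos_freq):
    """freq(pos1, pos2) / freq(pos1), taken from corpus counts supplied by the caller."""
    denom = pos_freq.get(pos1, 0)
    return pair_freq.get((pos1, pos2), 0) / denom if denom else 0.0


def edge_weight(similarity, active, wordtrans, postrans,
                alphas=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four terms; the coefficients must sum to 1."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    a1, a2, a3, a4 = alphas
    return a1 * similarity + a2 * active + a3 * wordtrans + a4 * postrans
```

In practice the coefficients would be tuned on sample layouts; the equal values above are only placeholders.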
5. The method for recovering the text reading order of a newspaper layout as claimed in claim 1, characterized in that: in step (4), the Kuhn-Munkres algorithm is used to perform optimal matching on the weighted bipartite graph, the specific algorithm being as follows:
1) give the initial labels $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
2) obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y_k, E_l)$, and a matching $M$ in $G_l$;
3) if $M$ saturates all nodes of $X$, then $M$ is the optimal matching of $G$ and the computation ends; otherwise go to the next step;
4) find an $M$-unsaturated point $x_0$ in $X$ and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where $A$ and $B$ are two sets;
5) if $N_{G_l}(A) = B$, go to step 9), otherwise go to the next step, where $N_{G_l}(A) \subseteq Y_k$ is the set of nodes adjacent to the nodes in $A$;
6) find a node $y \in N_{G_l}(A) - B$;
7) if $y$ is an $M$-saturated point, find the match point $z$ of $y$, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;
8) there exists an augmenting path $P$ from $x_0$ to $y$; let $M \leftarrow M \oplus E(P)$ and go to step 3);
9) compute the value $a = \min_{x_i \in A,\; y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - \omega_{ij} \}$ and revise the labels to $l'$:
(label-update formula, Figure C2004100914340004C1)
compute $E_{l'}$ and $G_{l'}$ from $l'$;
10) let $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, and go to step 6);
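The labelling procedure of steps 1) to 10) can be sketched in Python as follows. This is a compact textbook-style variant rather than a transcription of the claim: it grows the alternating tree by depth-first search, recomputes the slack values after every label revision, and assumes the usual label update (subtract a from the labels of A, add a to the labels of B), since the update formula appears only as a figure in the source. The function name, the square weight matrix, and the convention that a weight of 0 marks a missing edge are assumptions made for the illustration.

```python
import math

EPS = 1e-9  # tolerance for testing l(x) + l(y) == w on floating-point weights


def kuhn_munkres(w):
    """Maximum-weight matching on a square matrix w (w[i][j] = 0 means no edge).

    Returns a dict {i: j} of matched pairs whose weight is positive.
    """
    n = len(w)
    lx = [max(row) for row in w]   # step 1: l(x_i) = max_j w_ij
    ly = [0.0] * n                 # step 1: l(y_j) = 0
    match_y = [-1] * n             # match_y[j] = index of the x matched to y_j

    def augment(i, in_a, in_b, slack):
        """Depth-first search for an augmenting path in the equality subgraph."""
        in_a[i] = True
        for j in range(n):
            if in_b[j]:
                continue
            gap = lx[i] + ly[j] - w[i][j]
            if gap <= EPS:  # edge (x_i, y_j) belongs to the equality subgraph
                in_b[j] = True
                if match_y[j] == -1 or augment(match_y[j], in_a, in_b, slack):
                    match_y[j] = i
                    return True
            elif gap < slack[j]:
                slack[j] = gap
        return False

    for root in range(n):
        while True:
            in_a, in_b = [False] * n, [False] * n
            slack = [math.inf] * n
            if augment(root, in_a, in_b, slack):
                break
            # step 9: smallest gap between the tree side A and the nodes outside B
            a = min(slack[j] for j in range(n) if not in_b[j])
            for k in range(n):
                if in_a[k]:
                    lx[k] -= a
                if in_b[k]:
                    ly[k] += a

    return {match_y[j]: j for j in range(n) if w[match_y[j]][j] > 0}
```

For example, `kuhn_munkres([[4, 3], [4, 1]])` pairs row 0 with column 1 and row 1 with column 0, for a total weight of 7 instead of the greedy total of 5.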
Based on the result M of the optimal matching, a plurality of continuous, totally ordered sequences of body-text blocks is determined. The sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of saturated points matched in M, and vertex b of X and vertex c of Y are also a pair of saturated points matched in M, then vertex a → vertex b → vertex c form a sequence; this step is applied recursively until the sequence can grow no longer; new sequences are then generated from the vertices not yet in any sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence.
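The sequence generation described above amounts to following matched pairs as successor links. A minimal Python sketch, assuming the matching is given as a hypothetical dictionary `successor` that maps each block to the block matched after it:

```python
def chains_from_matching(vertices, successor):
    """Chain matched pairs a -> b, b -> c, ... into maximal sequences.

    vertices:  iterable of all block identifiers
    successor: dict mapping a block to the block matched as its follower
    """
    vertices = list(vertices)
    has_predecessor = set(successor.values())
    visited = set()
    sequences = []
    # Start from blocks that follow no other block; the second pass over all
    # vertices picks up anything left over (for example blocks on a cycle).
    starts = [v for v in vertices if v not in has_predecessor] + vertices
    for start in starts:
        if start in visited:
            continue
        sequence = []
        v = start
        while v is not None and v not in visited:
            visited.add(v)
            sequence.append(v)
            v = successor.get(v)
        sequences.append(sequence)
    return sequences
```

Every vertex ends up in exactly one sequence, and an unmatched block forms a sequence of length one.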
6. The method for recovering the text reading order of a newspaper layout as claimed in claim 1, characterized in that: in step (5), each body-text block sequence is further divided into a plurality of subsequences according to style information of the text blocks, such as column width and column gutter, together with semantic association information; within each subsequence all text blocks have equal column width and equal column gutter, and the edge weight in the bipartite graph corresponding to each pair of adjacent text blocks in the subsequence is greater than a threshold; the text stream formed by concatenating, in order, the content of the text blocks in a subsequence is the recovered, independent text reading order of a single article.
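The splitting rule of claim 6 can be sketched as a single pass over each sequence. The data layout is an assumption made for the illustration: `blocks` maps a block identifier to a hypothetical `(column_width, column_gutter)` pair, `weights` maps an ordered pair of blocks to its bipartite edge weight, and `tol` is an invented tolerance for comparing measurements.

```python
def split_sequence(sequence, blocks, weights, threshold, tol=1.0):
    """Cut a block sequence wherever style or semantic continuity breaks.

    Two adjacent blocks stay in the same subsequence only if their column
    width and gutter agree within `tol` and their edge weight exceeds
    `threshold`.
    """
    if not sequence:
        return []
    subsequences = [[sequence[0]]]
    for prev, cur in zip(sequence, sequence[1:]):
        prev_width, prev_gutter = blocks[prev]
        cur_width, cur_gutter = blocks[cur]
        same_style = (abs(prev_width - cur_width) <= tol
                      and abs(prev_gutter - cur_gutter) <= tol)
        related = weights.get((prev, cur), 0.0) > threshold
        if same_style and related:
            subsequences[-1].append(cur)
        else:
            subsequences.append([cur])
    return subsequences
```

Each returned subsequence then corresponds to one candidate article whose blocks are read in order.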
CNB2004100914343A 2004-11-22 2004-11-22 A kind of method of newspaper layout being carried out the words reading sequence recovery Expired - Fee Related CN100568221C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100914343A CN100568221C (en) 2004-11-22 2004-11-22 A kind of method of newspaper layout being carried out the words reading sequence recovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100914343A CN100568221C (en) 2004-11-22 2004-11-22 A kind of method of newspaper layout being carried out the words reading sequence recovery

Publications (2)

Publication Number Publication Date
CN1604075A CN1604075A (en) 2005-04-06
CN100568221C true CN100568221C (en) 2009-12-09

Family

ID=34667256

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100914343A Expired - Fee Related CN100568221C (en) 2004-11-22 2004-11-22 A kind of method of newspaper layout being carried out the words reading sequence recovery

Country Status (1)

Country Link
CN (1) CN100568221C (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8365072B2 (en) 2009-01-02 2013-01-29 Apple Inc. Identification of compound graphic elements in an unstructured document
CN101866418B (en) * 2009-04-17 2013-02-27 株式会社理光 Method and equipment for determining file reading sequences
CN102541826B (en) * 2010-12-27 2014-08-06 北大方正集团有限公司 Text block content reorganizing method and device
CN102073862B (en) * 2011-02-18 2013-04-17 山东山大鸥玛软件有限公司 Method for quickly calculating layout structure of document image
CN103488619B (en) * 2013-07-05 2017-05-24 百度在线网络技术(北京)有限公司 Method and device for processing document file
CN104268127B (en) * 2014-09-22 2018-02-09 同方知网(北京)技术有限公司 A kind of method of electronics shelves layout files reading order analysis
CN106096592B (en) * 2016-07-22 2019-05-24 浙江大学 A kind of printed page analysis method of digital book
CN108268429B (en) * 2017-06-15 2021-08-06 阿里巴巴(中国)有限公司 Method and device for determining network literature chapters
CN109274681B (en) * 2018-10-25 2021-11-16 深圳壹账通智能科技有限公司 Information synchronization method and device, storage medium and server
CN110209765B (en) * 2019-05-23 2021-03-30 武汉绿色网络信息服务有限责任公司 Method and device for searching keywords according to meanings
CN113221743B (en) * 2021-05-12 2024-01-12 北京百度网讯科技有限公司 Table analysis method, apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN1604075A (en) 2005-04-06

Similar Documents

Publication Publication Date Title
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
Gatterbauer et al. Towards domain-independent information extraction from web tables
Wang et al. Document zone content classification and its performance evaluation
Akimushkin et al. On the role of words in the network structure of texts: Application to authorship attribution
CN106201465A (en) Software project personalized recommendation method towards open source community
CN102043851A (en) Multiple-document automatic abstracting method based on frequent itemset
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN100568221C (en) A kind of method of newspaper layout being carried out the words reading sequence recovery
CN109145260A (en) A kind of text information extraction method
Rastan et al. Texus: A task-based approach for table extraction and understanding
CN103810251A (en) Method and device for extracting text
CN112559656A (en) Method for constructing affair map based on hydrologic events
CN109657114B (en) Method for extracting webpage semi-structured data
CN114997288A (en) Design resource association method
Du et al. Exploiting syntactic structure for better language modeling: A syntactic distance approach
CN116205211A (en) Document level resume analysis method based on large-scale pre-training generation model
CN1604073A (en) Method for conducting title and text logic connection for newspaper pages
CN102591931A (en) Recognition and extraction method for webpage data records based on tree weight
CN109582958B (en) Disaster story line construction method and device
CN103699568A (en) Method for extracting hyponymy relation of field terms from wikipedia
de Oliveira et al. A syntactic-relationship approach to construct well-informative knowledge graphs representation
Gao et al. Newspaper article reconstruction using ant colony optimization and bipartite graph
CN101436194B (en) Text multiple-accuracy representing method based on data excavating technology
Phan et al. Automated data extraction from the web with conditional models
CN111859887A (en) Scientific and technological news automatic writing system based on deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091209