CN100568221C

CN100568221C - Method for recovering the text reading order of a newspaper layout

Info

Publication number: CN100568221C
Application number: CNB2004100914343A
Authority: CN
Inventors: 贾娟; 陈晓鸥; 陈堃銶
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University
Current assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University
Priority date: 2004-11-22
Filing date: 2004-11-22
Publication date: 2009-12-09
Anticipated expiration: 2024-11-22
Also published as: CN1604075A

Abstract

The invention belongs to document layout understanding technology within intelligent text and graphic information processing, and specifically relates to a content-based method for recovering the text reading order of a newspaper layout. Existing techniques, when applied to complex newspaper layouts, lose the reading order and produce content that is not separated into independent articles. To address these defects, the invention for the first time models the problem mathematically with graph theory: the adjacency relations between text blocks are expressed as a directed graph, the directed graph is split and converted into a weighted bipartite graph, the edge weights of the bipartite graph are computed with natural language processing techniques, and several continuous sequences are obtained by optimal matching. Each sequence is then divided into several subsequences according to the style information of the text blocks, and concatenating the contents of the blocks in a subsequence yields the text stream of one independent article in reading order. Because semantic, spatial-relationship and style information are all used, the accuracy of reading order recovery is greatly improved and the result is independent at the article level. The method can be applied to layout understanding and structural reconstruction of styled documents.

Description

Method for recovering the text reading order of a newspaper layout

Technical field

The invention belongs to document layout understanding technology within intelligent text and graphic information processing, and specifically relates to a method for recovering the text reading order of a newspaper layout.

Background art

With the development of information technology and the emergence of new media formats, cross-media publishing has developed rapidly thanks to advantages such as information sharing, convenient and efficient dissemination, rich forms of expression, and the complementary strengths of multiple media. A digital asset management system based on XML is the core of cross-media publishing. In traditional information dissemination, however, the form in which information exists depends directly on the form of the terminal medium, which makes cross-media publishing difficult. Newspapers in particular are enormous in quantity, go back a long way in history, and have complex styles, poor content independence and ambiguous reading order, so their XML structuring is the most difficult. How to recover coherent, independent article text streams carrying semantic information from the ambiguous and interdependent spatial relations of text in such complex layouts, and how to represent them in XML, is the problem that newspaper data assets face in achieving cross-media publishing. Reading order recovery for newspaper layouts is the method for solving these technical problems.

At present, mainstream OCR digitization software, when processing the layout of styled documents, ignores the recovery of reading order and semantic structure and converts the documents into styled electronic formats such as PDF and HTML for republication. This hinders information reuse and deep processing such as retrieval, utilization, trading, rewriting, supplementing and organizing. For multi-article newspaper layouts in particular, the lack of per-article reading order and structure makes reuse even harder. There are two main classes of methods for reading order recovery.

The first class uses style and spatial-relationship information, as in "Layout analysis, understanding and reconstruction of complex Chinese newspapers" (Chen Ming, Ding Xiaoqing, Liang Jian, Journal of Tsinghua University (Natural Science Edition), vol. 41, no. 1, 2001, pp. 29-32, 59) and "Integrated Algorithms for Newspaper Page Decomposition and Article Tracking" (B. Gatos, S. L. Mantzaris, K. V. Chandrinos, A. Tsigris, S. J. Perantonis, Proceedings of the Fifth International Conference on Document Analysis and Recognition, 1999, pp. 559-562). These methods treat the newspaper layout as a set of independent text blocks and apply rules, based on the principle that one article has a homogeneous style, to merge text blocks and determine their reading order. Rule-based methods can only handle layouts with simple styles and spatial relations, such as books and journal articles. The diversity of newspaper layouts and the dependencies among their objects make the accuracy of reading order recovery between text blocks too low when only style and spatial rules are used.

The second class uses semantic and spatial-relationship information. In 2002, Aiello M., Monz C., Todoran L. et al., in "Document understanding for a broad class of documents" (International Journal on Document Analysis and Recognition, 2002, 5(1): 1-16), first disclosed a method that uses semantic information to determine reading order: all possible reading orders are enumerated as permutations, and the best one is selected with a part-of-speech weighting formula. However, the time complexity grows exponentially with the number of text blocks, independent per-article reading orders cannot be extracted, and too little semantic information is used, which hurts accuracy. None of these techniques makes full use of the various kinds of latent information in a newspaper layout document to obtain a more accurate reading order, and none forms a unified mathematical model.

Summary of the invention

In view of the problems in the prior art, the object of the invention is to provide a method for recovering the text reading order of a newspaper layout. The method can effectively recover the reading order of a newspaper layout document and segment it into independent reading orders with the article as the unit, thereby greatly improving reading order accuracy and facilitating subsequent XML semantic structuring.

To achieve this object, the technical solution adopted by the invention is a method for recovering the text reading order of a newspaper layout, comprising the following steps:

(1) Read in a document carrying style and layout information and perform layout analysis: merge characters of identical style into text blocks, and classify the blocks as body-text blocks or non-body-text blocks. The spatial relations among the characters inside a text block are simple, so the characters in a block are joined, following the rule that left is read before right and top before bottom, into a text stream in reading order that serves as the content of the block. A non-body-text block is isolated from the surrounding text blocks, so its reading order relative to other blocks need not be considered; the core of the processing is the reading order among the contents of body-text blocks;

(2) Taking body-text blocks as vertices and the left-right adjacency relations between blocks as directed edges, build the horizontal adjacency directed graph; taking the blocks as vertices and the top-bottom adjacency relations between blocks as directed edges, build the vertical adjacency directed graph; then build the spatial order directed graph from these two graphs according to the spatial order rules, which are defined as follows. If body-text block L is a predecessor of body-text block m in the horizontal or the vertical adjacency directed graph, then L precedes m in spatial order. If L is a predecessor of m in the horizontal adjacency directed graph and body-text block n is a predecessor of m in the vertical adjacency directed graph, then L precedes n in spatial order. If L is a predecessor of m in the horizontal adjacency directed graph and L is a predecessor of body-text block n in the vertical adjacency directed graph, then n precedes m in spatial order;

(3) Split and convert the spatial order directed graph to construct a weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from the contents of the body-text blocks corresponding to the two endpoints of an edge: the relevance of the contents, the local activity of their overlapping words, and the word-formation degree and part-of-speech transition rate of the last word of one block and the first word of the other, among others;

(4) Perform optimal matching on the weighted bipartite graph, and determine several continuous, totally ordered sequences of body-text blocks from the matching result;

(5) Divide each body-text block sequence into several subsequences according to the style information and semantic association information of the text blocks. The text stream formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article.
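The three spatial order rules of step (2) can be sketched in a few lines. This is a minimal illustration, assuming each adjacency directed graph is given as a set of (predecessor, successor) pairs of block identifiers; the function name `spatial_order_edges` is hypothetical and not part of the source.

```python
def spatial_order_edges(horizontal, vertical):
    """Build the spatial order digraph from the two adjacency digraphs.

    horizontal and vertical are sets of (predecessor, successor) pairs.
    An edge (a, b) in the result means "block a precedes block b".
    """
    # Rule 1: a predecessor in either adjacency digraph precedes its successor.
    order = set(horizontal) | set(vertical)
    for left, m in horizontal:
        for upper, lower in vertical:
            # Rule 2: `left` is left of m and `upper` is above m,
            # so `left` precedes `upper`.
            if lower == m and upper != left:
                order.add((left, upper))
            # Rule 3: `left` is left of m and `left` is above `lower`,
            # so `lower` precedes m.
            if upper == left and lower != m:
                order.add((lower, m))
    return order
```

For example, with `horizontal = {("L", "m")}` and `vertical = {("n", "m")}`, the second rule adds the edge `("L", "n")` to the two adjacency edges.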

Further, to obtain a better effect:

In step (4), when the reading order is recovered, the Kuhn-Munkres algorithm for optimal matching in graph theory is applied to content-based reading order recovery.

In step (1), documents carrying style and layout information include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents produced by professional typesetting software such as Founder FIT. Style information mainly means that every character has position and size information. Layout analysis merges characters of identical style into text blocks bottom-up, following the principle of local style homogeneity. Text blocks are classified as body-text blocks or non-body-text blocks according to their font style and number of lines. The spatial relations between page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line.
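The within-block rule (upper lines before lower lines, left before right within a line) can be illustrated as follows. This is only a sketch under stated assumptions: the `Char` record, the `block_text` name and the half-character-height tolerance for grouping characters into lines are all hypothetical, and horizontal text with the y axis growing downwards is assumed.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float     # left edge of the character
    y: float     # top edge of the character (y grows downwards)
    size: float  # character height


def block_text(chars):
    """Join the characters of one block: upper lines first, then left to right."""
    lines = []  # each entry is [reference_y, [characters on that line]]
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        # Assumed tolerance: a character joins the current line when its top
        # edge lies within half a character height of the line's reference y.
        if lines and abs(ch.y - lines[-1][0]) <= 0.5 * ch.size:
            lines[-1][1].append(ch)
        else:
            lines.append([ch.y, [ch]])
    return "".join(
        c.text for _, line in lines for c in sorted(line, key=lambda c: c.x)
    )
```

The resulting string plays the role of the block content that later steps compare semantically.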

In step (3), the two vertex sets X and Y of the weighted bipartite graph each contain all the vertices of the spatial order directed graph. The edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are the start vertex and end vertex of an edge of the spatial order directed graph, then there is also an edge between them in the weighted bipartite graph. The edge weights of the weighted bipartite graph are computed with natural language processing techniques:

(1) The relevance of the contents d₁ and d₂ corresponding to vertex a of X and vertex b of Y: Similarity(d₁, d₂) = cosine(d₁, d₂) = (d₁ · d₂) / (‖d₁‖ ‖d₂‖);

(2) The local lexical activity of d₁ and d₂: Active(d₁, d₂) = (number of overlapping words of d₁ and d₂) / (sum of the distribution degrees of the overlapping-word chains);

(3) The word-formation degree WordTrans of the last word w₁ of d₁ and the first word w₂ of d₂: if the string w₁w₂ is a word in the dictionary, WordTrans is defined as 1; otherwise it is defined as 0;

(4) The part-of-speech transition rate from the part of speech pos1 of the last word w₁ of d₁ to the part of speech pos2 of the first word w₂ of d₂: PosTrans = P(pos1pos2|pos2) = freq(pos1, pos2) / freq(pos1), where freq(pos1, pos2) is the number of co-occurrences of pos1 and pos2 in the corpus and freq(pos1) is the number of occurrences of pos1 in the corpus;

Edge weight = α₁ × Similarity + α₂ × Active + α₃ × WordTrans + α₄ × PosTrans, where α₁ + α₂ + α₃ + α₄ = 1.
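The weight terms can be sketched as below. Several things here are assumptions rather than details from the source: contents are taken to be already segmented into word lists, the cosine is computed on raw term-frequency vectors, the equal default coefficients are arbitrary, and the Active term is passed in as a number because the text does not define the chain distribution degree precisely enough to implement. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(words1, words2):
    """Cosine similarity of two bag-of-words term-frequency vectors."""
    v1, v2 = Counter(words1), Counter(words2)
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


def word_trans(tail_word, head_word, dictionary):
    """1 if the last word of d1 and the first word of d2 join into a dictionary word."""
    return 1.0 if tail_word + head_word in dictionary else 0.0


def pos_trans(pos1, pos2, pair_freq, pos_freq):
    """freq(pos1, pos2) / freq(pos1), using counts taken from a corpus."""
    total = pos_freq.get(pos1, 0)
    return pair_freq.get((pos1, pos2), 0) / total if total else 0.0


def edge_weight(similarity, active, wordtrans, postrans,
                alphas=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four terms; the coefficients must sum to 1."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must sum to 1")
    a1, a2, a3, a4 = alphas
    return a1 * similarity + a2 * active + a3 * wordtrans + a4 * postrans
```

In practice the coefficients would be tuned on sample pages; the source only requires that they sum to 1.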

In step (4), optimal matching is performed on the weighted bipartite graph with the Kuhn-Munkres algorithm. The algorithm is as follows:

1) Choose the initial labelling

l(x_i) = \max_{j} ω_{ij},

l(y_j) = 0, i, j = 1, 2, ..., t, t = max(n, m);

2) Find the edge set E_l = {(x_i, y_j) | l(x_i) + l(y_j) = ω_{ij}}, the graph G_l = (X, Y_k, E_l), and a matching M in G_l;

3) If M saturates all vertices of X, then M is an optimal matching of G and the computation ends; otherwise go to the next step;

4) Find an M-unsaturated vertex x₀ in X, and set A ← {x₀}, B ← ∅, where A and B are two sets;

5) If

N_{G_l}(A) = B,

go to step 9); otherwise go to the next step, where

N_{G_l}(A) \subseteq Y_k

is the set of vertices adjacent to the vertices in A;

6) Find a vertex

y \in N_{G_l}(A) - B;

7) If y is M-saturated, find the vertex z matched with y, set A ← A ∪ {z}, B ← B ∪ {y}, and go to step 5); otherwise go to the next step;

8) There is an augmenting path P from x₀ to y; set

M \leftarrow M \oplus E(P),

and go to step 3);

9) Compute the value a as

a = \min_{x_i \in A,\ y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - ω_{ij} \},

revise the labelling to obtain l', and compute E_{l'} and G_{l'} from l';

10) Set l ← l', G_l ← G_{l'}, and go to step 6).
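The revised labelling l' of step 9) is not written out in this text. In the usual textbook statement of the Kuhn-Munkres algorithm it takes the following form, which is given here as an assumption and not as the formula of the patent itself; A, B, a, ω_{ij} and Y_k are as defined in the steps above.

```latex
l'(v) =
\begin{cases}
l(v) - a, & v \in A, \\
l(v) + a, & v \in B, \\
l(v),     & \text{otherwise},
\end{cases}
\qquad
E_{l'} = \{ (x_i, y_j) \mid l'(x_i) + l'(y_j) = \omega_{ij} \},
\qquad
G_{l'} = (X, Y_k, E_{l'}).
```

With this choice every edge of G_l between A and B stays in G_{l'}, and at least one new edge from A to a vertex outside B enters it, so the search of step 6) can continue.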

From the optimal matching result M, several continuous, totally ordered sequences of body-text blocks are determined. The sequences are generated as follows: if vertex a of X and vertex b of Y are a pair of vertices matched by M, and vertex b of X and vertex c of Y are a pair of vertices matched by M, then vertex a → vertex b → vertex c forms a sequence, and this is applied recursively until the sequence cannot be extended further. A new sequence is then generated from the vertices that are not yet in any sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence.
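The chaining of matched pairs into sequences can be sketched as follows, assuming the matching is given as a dictionary from each block to the block matched as its successor. The name `chain_sequences` and the handling of cycles are illustrative assumptions.

```python
def chain_sequences(vertices, matching):
    """Chain matched (predecessor, successor) pairs into maximal sequences.

    matching maps a block to the block matched as its successor; each block
    is assumed to appear at most once as a key and at most once as a value.
    """
    successor = dict(matching)
    has_predecessor = set(successor.values())
    sequences, seen = [], set()

    def walk(start):
        # Follow successor links until the chain ends or closes on itself.
        seq, v = [], start
        while v is not None and v not in seen:
            seq.append(v)
            seen.add(v)
            v = successor.get(v)
        return seq

    # First start from blocks that have no predecessor in the matching.
    for v in vertices:
        if v not in has_predecessor and v not in seen:
            sequences.append(walk(v))
    # Assumption: any remaining blocks lie on a cycle, which is cut at the
    # first block encountered.
    for v in vertices:
        if v not in seen:
            sequences.append(walk(v))
    return sequences
```

A block that is matched to nothing ends up as a sequence of length one, which corresponds to the single-block sequences in the result.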

In step (5), each body-text block sequence is further divided into several subsequences according to style information of the text blocks, such as column width and column gutter, together with semantic association information. Each subsequence has the property that its blocks have equal column width and equal column gutter, and that the weights of the bipartite-graph edges corresponding to adjacent text blocks in the subsequence are greater than a threshold. The text stream formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article.
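The splitting rule can be sketched as below. The data layout is an assumption: `style` maps each block to a (column width, column gutter) pair, `weight` maps an ordered pair of blocks to its bipartite edge weight, and the threshold value is left to the caller. Exact equality is used for the style comparison for brevity; real measurements would likely need a tolerance.

```python
def split_sequence(sequence, style, weight, threshold):
    """Cut a block sequence wherever style changes or the edge weight is too low."""
    if not sequence:
        return []
    chapters = [[sequence[0]]]
    for prev, cur in zip(sequence, sequence[1:]):
        same_style = style[prev] == style[cur]
        linked = weight.get((prev, cur), 0.0) > threshold
        if same_style and linked:
            chapters[-1].append(cur)   # same article continues
        else:
            chapters.append([cur])     # start a new article
    return chapters
```

Each returned list corresponds to one article, and joining the contents of its blocks in order gives that article's text stream.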

The effect of the invention is that, addressing the characteristics of newspaper layout documents, it proposes a new content-based method for recovering the text reading order of a newspaper layout. The method makes effective use of the semantic, spatial-relationship and style information in a newspaper layout document and models the reading order problem with a graph-theoretic mathematical model. It not only recovers the reading order but also yields text streams that are independent per article, so that the content of the newspaper layout becomes independent of its style. This greatly improves reading order accuracy and better supports information extraction and XML structuring of newspaper layouts, enabling the reuse of historical data assets and their cross-media republication. The method can be widely applied in intelligent text and graphic information processing fields such as layout understanding, for example the layout understanding and structuring of styled documents such as paper media, PS, PDF, Word and InDesign documents.

The invention achieves such a marked technical effect for the following reasons:

1. The invention for the first time models the problem of recovering the reading order between body-text blocks mathematically, using the theory of optimal matching in graph theory;

2. It uses the fact that spatial continuity is a necessary condition for continuity of the text stream: the spatial adjacency relations between text blocks are expressed as edges of a directed graph, which reduces the search space;

3. The directed graph is split and converted into a weighted bipartite graph, so that the most probable reading order sequences can be selected quantitatively;

4. Since the most essential criterion for continuity of a text stream is content, natural language processing techniques are used to decide whether two text blocks are consecutive in reading order: at the word level, the word-formation degree of the last word of the preceding block and the first word of the following block; at the sentence level, the part-of-speech transition rate from the last word of the preceding block to the first word of the following block; at the paragraph level, the relevance of the contents and the local activity of overlapping words; and so on. Their linear weighted sum serves as the edge weight of the bipartite graph, and several continuous text block sequences are obtained with the Kuhn-Munkres matching algorithm;

5. Each such sequence is not yet separated by article. Newspaper layouts have the characteristics that style is homogeneous within an article and heterogeneous between articles, and that the text blocks of each article share a consistent topic. Accordingly, each text block sequence is divided into several continuous subsequences using column width, column gutter and semantic relevance information, and concatenating the contents of the text blocks of each subsequence yields the independent text stream of one article in reading order.

Description of drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a schematic diagram of a newspaper page after layout analysis;

Fig. 3 is a schematic diagram of the horizontal adjacency directed graph of the body-text blocks of the newspaper layout document;

Fig. 4 is a schematic diagram of the vertical adjacency directed graph of the body-text blocks of the newspaper layout document;

Fig. 5 is a schematic diagram of the bipartite graph obtained by splitting and converting the horizontal and vertical adjacency directed graphs;

Fig. 6 is a schematic diagram of the result of the Kuhn-Munkres optimal matching algorithm;

Fig. 7 is a schematic diagram of the newspaper page after the reading order has been recovered.

Embodiment

The invention is further described below with reference to the drawings and an embodiment.

In this embodiment, a newspaper document obtained by scanning and OCR is used as the example data. As shown in Fig. 1, the method for recovering the text reading order of a newspaper layout comprises the following steps:

Step one. Read in a document carrying style and layout information, for example a document obtained by scanning a paper newspaper and applying OCR, a PDF file, or a document produced by professional typesetting software such as Founder FIT. Style information mainly means that every character has position and size information. Layout analysis merges characters of identical style into text blocks bottom-up, following the principle of local style homogeneity. Text blocks are classified as body-text blocks or non-body-text blocks according to their font style and number of lines. As shown in Fig. 2, solid rectangles denote body-text blocks, the numbers are their identifiers, and dashed lines denote non-body-text blocks. The spatial relations between page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line. Following the rule that left is read before right and top before bottom, the characters in a block are joined into a text stream in reading order that serves as the content of the block. A non-body-text block is isolated from the surrounding text blocks, so its reading order relative to other blocks need not be considered; the core of the processing is the reading order among the contents of body-text blocks.

Step two. Taking body-text blocks as vertices and the left-right adjacency relations between blocks as directed edges, build the horizontal adjacency directed graph, as shown in Fig. 3. Taking the blocks as vertices and the top-bottom adjacency relations between blocks as directed edges, build the vertical adjacency directed graph, as shown in Fig. 4. Then build the spatial order directed graph from these two graphs according to the spatial order rules, which are defined as follows. If body-text block L is a predecessor of body-text block m in the horizontal or the vertical adjacency directed graph, then L precedes m in spatial order. If L is a predecessor of m in the horizontal adjacency directed graph and body-text block n is a predecessor of m in the vertical adjacency directed graph, then L precedes n in spatial order. If L is a predecessor of m in the horizontal adjacency directed graph and L is a predecessor of body-text block n in the vertical adjacency directed graph, then n precedes m in spatial order.

Step three. Split and convert the spatial order directed graph to construct a weighted bipartite graph, as shown in Fig. 5, where "f text" denotes a predecessor vertex in reading order and "t text" denotes a successor vertex in reading order. The two vertex sets X and Y of the weighted bipartite graph each contain all the vertices of the spatial order directed graph. The edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are the start vertex and end vertex of an edge of the spatial order directed graph, then there is also an edge between them in the weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from the contents of the body-text blocks corresponding to the two endpoints of an edge: the relevance of the contents, the local activity of their overlapping words, and the word-formation degree and part-of-speech transition rate of the last word and first word, among others. They are computed with the weight formulas given above for step (3).
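The split of step three can be sketched as follows: every block appears once as a predecessor copy ("f") in X and once as a successor copy ("t") in Y, and each edge of the spatial order directed graph becomes one weighted bipartite edge. The function name and the `weight_fn` callback are hypothetical; the callback stands in for the weight computation described for step (3).

```python
def split_to_bipartite(vertices, order_edges, weight_fn):
    """Split a spatial order digraph into a weighted bipartite graph.

    vertices    -- list of block identifiers
    order_edges -- iterable of (a, b) pairs meaning "a precedes b"
    weight_fn   -- hypothetical callback returning the weight of (a, b)
    """
    x_side = [("f", v) for v in vertices]  # predecessor copies
    y_side = [("t", v) for v in vertices]  # successor copies
    edges = {
        (("f", a), ("t", b)): weight_fn(a, b) for a, b in order_edges
    }
    return x_side, y_side, edges
```

Because X and Y hold copies of the same blocks, a matching on this graph assigns each block at most one successor and at most one predecessor.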

Step four. Perform optimal matching on the weighted bipartite graph with the Kuhn-Munkres algorithm, and determine several continuous, totally ordered sequences of body-text blocks from the matching result. The Kuhn-Munkres algorithm is as follows:

(1) Choose the initial labelling

l(x_i) = \max_{j} ω_{ij},

l(y_j) = 0, i, j = 1, 2, ..., t, t = max(n, m);

(2) Find the edge set E_l = {(x_i, y_j) | l(x_i) + l(y_j) = ω_{ij}}, the graph G_l = (X, Y_k, E_l), and a matching M in G_l;

(3) If M saturates all vertices of X, then M is an optimal matching of G and the computation ends; otherwise go to the next step;

(4) Find an M-unsaturated vertex x₀ in X, and set A ← {x₀}, B ← ∅, where A and B are two sets;

(5) If

N_{G_l}(A) = B,

go to step (9); otherwise go to the next step, where

N_{G_l}(A) \subseteq Y_k

is the set of vertices adjacent to the vertices in A;

(6) Find a vertex

y \in N_{G_l}(A) - B;

(7) If y is M-saturated, find the vertex z matched with y, set A ← A ∪ {z}, B ← B ∪ {y}, and go to step (5); otherwise go to the next step;

(8) There is an augmenting path P from x₀ to y; set

M \leftarrow M \oplus E(P),

and go to step (3);

(9) Compute the value a as

a = \min_{x_i \in A,\ y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - ω_{ij} \},

revise the labelling to obtain l', and compute E_{l'} and G_{l'} from l';

(10) Set l ← l', G_l ← G_{l'}, and go to step (6).

Sequences are generated from the optimal matching result M as follows: if vertex a of X and vertex b of Y are a pair of vertices matched by M, and vertex b of X and vertex c of Y are a pair of vertices matched by M, then vertex a → vertex b → vertex c forms a sequence, and this is applied recursively until the sequence cannot be extended further. A new sequence is then generated from the vertices that are not yet in any sequence, until every vertex belongs to some sequence. The text blocks corresponding to the vertices of each sequence then form a text block sequence. The optimal matching shown in Fig. 6 generates five sequences in total: 12 → 13 → 14 → 15 → 16 → 17 → 20 → 18 → 21 → 19 → 22 → 1; 23 → 24 → 25; 27 → 28 → 0 → 8 → 9 → 10 → 11 → 4; 26 → 5 → 2 → 6 → 7 → 3; and 29.
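A maximum-weight matching of the kind used in step four can be computed with the sketch below. It uses the compact potentials-based O(n³) formulation of the Hungarian (Kuhn-Munkres) method rather than the literal ten-step procedure listed above, and it makes two assumptions that are not in the source: vertices are indexed 0 to n-1 on both sides, and missing edges are treated as weight 0 and dropped from the result. The function name is hypothetical.

```python
def max_weight_matching(n, weights):
    """Maximum-weight matching on an n x n bipartite graph.

    weights maps (i, j) to the weight of the edge from X-vertex i to
    Y-vertex j. Returns a dict {i: j} restricted to edges present in weights.
    """
    inf = float("inf")
    # Minimising negated weights is the same as maximising weights.
    cost = [[-weights.get((i, j), 0.0) for j in range(n)] for i in range(n)]
    u = [0.0] * (n + 1)   # potentials (labels) of the X side
    v = [0.0] * (n + 1)   # potentials (labels) of the Y side
    p = [0] * (n + 1)     # p[j]: 1-based row currently matched to column j
    way = [0] * (n + 1)   # back-pointers along the alternating path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Revise the potentials by the smallest slack found.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached an unmatched column
                break
        while j0:            # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {
        p[j] - 1: j - 1
        for j in range(1, n + 1)
        if (p[j] - 1, j - 1) in weights
    }
```

For the weights `{(0, 1): 0.875, (0, 2): 0.5, (1, 2): 0.75, (2, 0): 0.125}` with `n = 3`, the call returns the matching 0 → 1, 1 → 2, 2 → 0. Padding with zero-weight edges mirrors the t = max(n, m) convention in the initial labelling.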

Step five. Divide each body-text block sequence into several subsequences according to the style information and semantic association information of the text blocks. Each subsequence has the property that its blocks have equal column width and equal column gutter, and that the weights of the bipartite-graph edges corresponding to adjacent text blocks in the subsequence are greater than a threshold. The text stream formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article. As shown in Fig. 7, there are nine articles in total, with the reading order indicated by arrows: 12 → 13 → 14 → 15 → 16 → 17 → 20 → 18 → 21 → 19 → 22; 23 → 24 → 25; 27 → 28 → 0; 8 → 9 → 10 → 11 → 4; 2 → 6 → 7 → 3; 1; 5; 26; and 29. Four of the articles contain only a single text block each, namely 1, 5, 26 and 29.

Claims

1. A method for recovering the text reading order of a newspaper layout, comprising the following steps:

(1) Read in a document carrying style and layout information and perform layout analysis: merge characters of identical style into text blocks, and classify the blocks as body-text blocks or non-body-text blocks. The spatial relations among the characters inside a text block are simple, so the characters in a block are joined, following the rule that left is read before right and top before bottom, into a text stream in reading order that serves as the content of the block. A non-body-text block is isolated from the surrounding text blocks, so its reading order relative to other blocks need not be considered; the core of the processing is the reading order among the contents of body-text blocks;

(3) Split and convert the spatial order directed graph to construct a weighted bipartite graph. The edge weights of the bipartite graph are determined with natural language processing techniques from the contents of the body-text blocks corresponding to the two endpoints of an edge: the relevance of the contents, the local activity of their overlapping words, and the word-formation degree and part-of-speech transition rate of the last word and first word;

2. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (4), when the reading order is recovered, the Kuhn-Munkres algorithm for optimal matching in graph theory is applied to content-based reading order recovery.

3. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (1), documents carrying style and layout information include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents produced by professional typesetting software such as Founder FIT; style information mainly means that every character has position and size information; layout analysis merges characters of identical style into text blocks bottom-up, following the principle of local style homogeneity; text blocks are classified as body-text blocks or non-body-text blocks according to their font style and number of lines; and the spatial relations between page objects inside a text block are the vertical adjacency between lines and the horizontal adjacency between characters within a line.

4. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (3), the two vertex sets X and Y of the weighted bipartite graph each contain all the vertices of the spatial order directed graph; the edges of the weighted bipartite graph satisfy the following condition: if vertex a of X and vertex b of Y are the start vertex and end vertex of an edge of the spatial order directed graph, then there is also an edge between them in the weighted bipartite graph; and the edge weights of the weighted bipartite graph are computed with natural language processing techniques:

1) The relevance of the contents d₁ and d₂ corresponding to vertex a of X and vertex b of Y: Similarity(d₁, d₂) = cosine(d₁, d₂) = (d₁ · d₂) / (‖d₁‖ ‖d₂‖);

2) The local lexical activity of d₁ and d₂: Active(d₁, d₂) = (number of overlapping words of d₁ and d₂) / (sum of the distribution degrees of the overlapping-word chains);

3) The word-formation degree WordTrans of the last word w₁ of d₁ and the first word w₂ of d₂: if the string w₁w₂ is a word in the dictionary, WordTrans is defined as 1; otherwise it is defined as 0;

4) The part-of-speech transition rate from the part of speech pos1 of the last word w₁ of d₁ to the part of speech pos2 of the first word w₂ of d₂: PosTrans = P(pos1pos2|pos2) = freq(pos1, pos2) / freq(pos1), where freq(pos1, pos2) is the number of co-occurrences of pos1 and pos2 in the corpus and freq(pos1) is the number of occurrences of pos1 in the corpus;

5. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (4), optimal matching is performed on the weighted bipartite graph with the Kuhn-Munkres algorithm, the algorithm being as follows:

1) Choose the initial labelling

l(x_i) = \max_{j} ω_{ij},

l(y_j) = 0, i, j = 1, 2, ..., t, t = max(n, m);

5) If

N_{G_l}(A) = B,

go to step 9); otherwise go to the next step, where

N_{G_l}(A) \subseteq Y_k

is the set of vertices adjacent to the vertices in A;

6) Find a vertex

y \in N_{G_l}(A) - B;

8) There is an augmenting path P from x₀ to y; set

M \leftarrow M \oplus E(P),

and go to step 3);

9) Compute the value a as

a = \min_{x_i \in A,\ y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - ω_{ij} \},

revise the labelling to obtain l', and compute E_{l'} and G_{l'} from l';

10) Set l ← l', G_l ← G_{l'}, and go to step 6).

6. The method for recovering the text reading order of a newspaper layout according to claim 1, characterized in that: in step (5), each body-text block sequence is further divided into several subsequences according to style information of the text blocks, such as column width and column gutter, together with semantic association information; each subsequence has the property that its blocks have equal column width and equal column gutter, and that the weights of the bipartite-graph edges corresponding to adjacent text blocks in the subsequence are greater than a threshold; and the text stream formed by concatenating, in order, the contents of the text blocks in a subsequence is the recovered independent text reading order of a single article.