CN1320481C

CN1320481C - Method for conducting title and text logic connection for newspaper pages

Info

Publication number: CN1320481C
Application number: CNB2004100914324A
Authority: CN
Inventors: 贾娟; 陈晓鸥; 陈堃銶
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University
Current assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University
Priority date: 2004-11-22
Filing date: 2004-11-22
Publication date: 2007-06-06
Anticipated expiration: 2024-11-22
Also published as: CN1604073A

Abstract

The invention belongs to the field of intelligent text and graphics information processing, and in particular relates to a method for logically associating titles with body text in a newspaper layout. Existing layout understanding techniques classify the logical objects of a layout using style information alone and do not extract the semantic structure of newspaper layouts that contain multiple articles and multiple titles. To address this defect, the invention first builds a mathematical model using graph theory: a bipartite graph matching model describes the one-to-one correspondence, at matching granularity, between the set of non-body-text regions and the set of body-text regions, and a weighted bipartite graph is built from their spatial relationships. For the first time, natural language processing is used to compute the edge weights of the bipartite graph, and the pairs of saturated vertices in the optimal matching are taken as the successfully associated titles and body texts. By combining the Kuhn-Munkres optimal matching algorithm with artificial intelligence to solve the title-text logical association problem, the method achieves high matching accuracy and can be used for structuring historical data and for metadata extraction.

Description

A method for logically associating titles with body text in a newspaper layout

Technical field

The invention belongs to the field of intelligent text and graphics information processing, and specifically relates to a method for logically associating titles with body text in a newspaper layout.

Background technology

Headlines play an important role in content management tasks such as classification and retrieval, and both Dublin Core and NewsML treat the title as an important item of metadata. In cross-media publishing in particular, the title is a key element of the metadata and of the XML information structure, and whether it is correctly associated with its body text directly affects the reuse and further processing of information in a digital asset management system, such as retrieval, publishing, and hyperlinking. Logical association means classifying each text block laid out on the two-dimensional space of the newspaper page by its semantic function as title, body text, header, speech, and so on, and then associating the title and body text that express the same news item as one structure. As a traditional media format, a newspaper differs from books and magazines in that its information is densely packed: several articles are typeset on one page. To improve readability, each article has a title that summarizes its content. In position, the title is embedded in the article region or adjoins it; in appearance, it is made eye-catching by features such as spanning the full column width and using a bold, enlarged font. However, in newspaper layouts on all kinds of carriers, such as paper, typesetting software files, and PDF, the article body has no inherent structural association with its title; the two are merely laid out side by side in the layout space. Title position is arbitrary, font size and horizontal or vertical orientation are not fixed, and one title may be close to several body text blocks, so deciding which body text a title matches is ambiguous. Other title-like blocks, such as headers and speech, have the same style as titles, so text blocks cannot be classified correctly using style information alone.

In addition, people associate body text and titles through visual thinking and semantics, but a computer cannot directly "understand" this structural association from the information. Because the volume of historical newspaper assets is huge, manual assistance is both too time-consuming and too costly, so there is an urgent need for computers to perform the logical association of titles and body text automatically during layout understanding and structural reconstruction.

Associating titles with body text must alternate with the logical classification of text blocks: the text blocks are first roughly classified into non-body blocks and body blocks, logical association is then performed, and the matching result is finally used to determine which non-body text blocks are real titles. At present, however, the logical classification of titles is carried out independently using style information, as in "Document page similarity based on layout visual saliency: Application to query by example and document classification" (Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, 1208-1212). The table of contents (TOC) extraction method in "Automated Detection and Segmentation of Table of Contents Page from Document Images" (S. Mandal, S. P. Chowdhury and A. K. Das, Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, 398-402) is suited only to book layouts and cannot handle complex newspaper layouts. The matching-model rule method in "Layout analysis, understanding and reconstruction of complex Chinese newspapers" (Chen Ming, Ding Xiaoqing, Liang Jian, Journal of Tsinghua University (Science and Technology), 2001, vol. 41, no. 1, pp. 29-32, 59) handles only common cases with regular regions: when the body region is irregularly shaped, or when the positional relationship between title and body is complex and not described by the matching model, it cannot match correctly, and the ambiguity that arises when one title adjoins several articles leads to wrong matches. The prior art lacks a unified mathematical model for quantitatively evaluating the overall quality of a matching and takes no account of semantic information; style and position information alone are not enough to process complex newspaper layouts. Because logically associating title and body text during layout reconstruction is the inverse of writing a title for the body text during layout generation, title generation methods from natural language processing are worth drawing on, for example "Description of the UAM system for generating very short summaries at DUC-2004" (Enrique Alfonseca, Jose Maria Guirao, Antonio Moreno-Sandoval, Document Understanding Conference 2004).

Summary of the invention

In view of the unsatisfactory title matching results of the prior art on newspaper layouts, the object of the invention is to provide a method for logically associating titles with body text in a newspaper layout. The method extracts the article structure of a newspaper layout and greatly improves title matching.

To achieve this object, the invention adopts the following technical solution: a method for logically associating titles with body text in a newspaper layout, comprising the following steps:

(1) Read in the newspaper document after layout analysis, classify each text block as a body text block or a non-body text block according to its font style and the number of lines in the block, and divide the body text blocks into several article regions with independent content according to reading order and block style;

(2) Build a weighted bipartite graph. The two vertex sets of the bipartite graph contain all non-body text blocks and all article regions respectively, and the edges of the bipartite graph correspond to the adjacency between non-body text blocks and article regions in the two-dimensional space of the page;

(3) The edge weights of the bipartite graph are computed with natural language processing from the semantics of the non-body text block content and the article region content corresponding to the two vertices of each edge. The method exploits the fact that a title summarizes the topic of the article. Lexical analysis of the text in the body text blocks yields word set a, containing m distinct words, and the dispersion and co-reference degree of each word in word set a are computed: the dispersion is the distance between the sentences in which the word last and first occurs in the article body, and the co-reference degree is the number of times the word occurs in the article. Likewise, lexical analysis of the text in the non-body text block yields word set b, containing n distinct words, and the relative dispersion and relative co-reference degree of each word of word set b in the article body are computed: the relative dispersion is the distance between the sentences in which the word last and first occurs in the article body, and the relative co-reference degree is the number of times the word occurs in the article. The sum of the n largest dispersions in word set a is the total dispersion of word set a, and the sum of the n largest co-reference degrees in word set a is the total co-reference degree of word set a; the sum of all relative dispersions in word set b is the total relative dispersion of word set b, and the sum of all relative co-reference degrees in word set b is the total relative co-reference degree of word set b. The dispersion coefficient is the total relative dispersion of word set b divided by the total dispersion of word set a, and the co-reference coefficient is the total relative co-reference degree of word set b divided by the total co-reference degree of word set a. The word coverage of the title over the article body is the number of words of word set b that occur in the article body divided by the number of all words in word set b. The edge weight is a linear weighted combination of the dispersion coefficient, the co-reference coefficient, and the word coverage;

(4) Apply the Kuhn-Munkres algorithm to find the optimal matching of the weighted bipartite graph. The content of the non-body text block corresponding to each saturated vertex in the non-body text block vertex set of the optimal matching is a title, and the saturated vertex in the article region vertex set joined to it by a matching edge corresponds to the article body logically associated with that title. The two are output as the title item and the body text item of the article's XML structure, respectively.
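The edge weight of step (3) can be illustrated with a minimal Python sketch. Several points are assumptions rather than details given in the text: the function name `edge_weight` is hypothetical, lexical analysis is assumed to have been done already (sentences arrive as lists of words), the "distance" between first and last occurrence is taken to be the difference of sentence indices, and the three coefficients of the linear combination default to equal values because the text does not state them.

```python
from typing import Dict, List


def edge_weight(title_words: List[str],
                sentences: List[List[str]],
                w_disp: float = 1 / 3,
                w_coref: float = 1 / 3,
                w_cov: float = 1 / 3) -> float:
    """Weight of the edge between one non-body block and one article region.

    title_words: words of the non-body text block (source of word set b).
    sentences:   article body as a list of sentences, each a list of words
                 (source of word set a).
    The three w_* coefficients are assumed values, not taken from the text.
    """
    first: Dict[str, int] = {}
    last: Dict[str, int] = {}
    count: Dict[str, int] = {}
    for idx, sentence in enumerate(sentences):
        for word in sentence:
            first.setdefault(word, idx)       # sentence of first occurrence
            last[word] = idx                  # sentence of last occurrence
            count[word] = count.get(word, 0) + 1

    word_set_b = list(dict.fromkeys(title_words))   # n distinct title words
    n = len(word_set_b)
    if n == 0 or not count:
        return 0.0

    # Dispersion of every word of word set a (assumed: sentence-index distance).
    dispersion = {w: last[w] - first[w] for w in count}

    # Totals for word set a: sums of the n largest values.
    total_disp_a = sum(sorted(dispersion.values(), reverse=True)[:n])
    total_coref_a = sum(sorted(count.values(), reverse=True)[:n])

    # Totals for word set b, measured inside the article body.
    total_disp_b = sum(dispersion.get(w, 0) for w in word_set_b)
    total_coref_b = sum(count.get(w, 0) for w in word_set_b)

    disp_coeff = total_disp_b / total_disp_a if total_disp_a else 0.0
    coref_coeff = total_coref_b / total_coref_a if total_coref_a else 0.0
    coverage = sum(1 for w in word_set_b if w in count) / n

    return w_disp * disp_coeff + w_coref * coref_coeff + w_cov * coverage
```

Under these assumptions, a title whose words are the most frequent and most widely spread words of an article receives a weight close to 1, while a title that shares no word with the article receives 0.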

Logical association, as used above, means classifying each text block laid out on the two-dimensional space of the newspaper page by its semantic function as title, body text, header, speech, and so on, and then associating the title and body text that express the same news item as one structure. To associate titles with body text, the theory, algorithms, and results of bipartite graphs in graph theory are applied to a measure of how well the content of one text block summarizes and covers another; specifically, the Kuhn-Munkres optimal matching algorithm from graph theory is used for content-based logical association of titles and body text.
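As a small illustration of outputting an associated title and body text as one structure, the following Python sketch serializes matched pairs to XML. The element names `page`, `article`, `title`, and `text`, and the function name `articles_to_xml`, are hypothetical; the text only states that the title item and the body text item are output in an XML article structure.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def articles_to_xml(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (title, body text) pairs, one <article> element per pair.

    Element names are illustrative assumptions, not a schema from the text.
    """
    root = ET.Element("page")
    for title, body in pairs:
        article = ET.SubElement(root, "article")
        ET.SubElement(article, "title").text = title
        ET.SubElement(article, "text").text = body
    return ET.tostring(root, encoding="unicode")
```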

The effect of the invention is as follows: with this method, an information processing apparatus can effectively extract the article structure of a newspaper layout, and the matching of body text and titles in the layout is greatly improved. Modelling the problem and simulating human reasoning give a high matching accuracy, so the method can be widely used for structuring historical data and for metadata extraction in digital asset management systems.

The invention has this effect because, in view of the complexity of text regions in newspaper layouts and the variety of positional relationships between text blocks, it proposes a new method for logically associating titles with body text in a newspaper layout. The invention uses a bipartite graph matching model to describe accurately the one-to-one correspondence between title and body text at matching granularity. It uses style information to classify the text blocks of the layout into a non-body set and a body set, and builds the initial bipartite graph from the spatial relationships between the elements of the two sets. In particular, it adopts natural language processing for the first time: taking both extractive and summarizing kinds of abstract into account, it computes the title's semantic summary coverage of the body text from the length and dispersion of co-reference word chains, and uses this coverage, which serves as the criterion for logical association between non-body and body blocks, as the edge weight of the weighted bipartite graph. After optimal matching, the edges joining saturated vertices are the association relations between titles and body texts.
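Building the initial bipartite graph from spatial relationships can be sketched as follows. This is only an illustration under stated assumptions: blocks are represented by axis-aligned bounding boxes, an article region is a list of such boxes, and two boxes count as adjacent when they overlap or lie within a tolerance `gap` of each other. The names `Rect`, `adjacent`, and `build_edges`, and the default tolerance, are hypothetical; the text does not specify how adjacency is tested.

```python
from typing import Dict, List, Tuple

# Bounding box as (left, top, right, bottom) in page coordinates (assumed).
Rect = Tuple[float, float, float, float]


def adjacent(a: Rect, b: Rect, gap: float = 10.0) -> bool:
    """True if b overlaps a, or lies within `gap` units of a, on the page."""
    return (a[0] - gap <= b[2] and b[0] <= a[2] + gap and
            a[1] - gap <= b[3] and b[1] <= a[3] + gap)


def build_edges(non_body_blocks: Dict[int, Rect],
                article_regions: Dict[int, List[Rect]],
                gap: float = 10.0) -> List[Tuple[int, int]]:
    """Edges (block id, region id) of the initial bipartite graph.

    A non-body block is joined to an article region when it is adjacent to,
    or embedded in, at least one body text block of that region.
    """
    edges = []
    for block_id, rect in non_body_blocks.items():
        for region_id, rects in article_regions.items():
            if any(adjacent(rect, r, gap) for r in rects):
                edges.append((block_id, region_id))
    return edges
```

Each edge produced this way would then receive a semantic weight before optimal matching is applied.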

Description of drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a schematic diagram of a newspaper page after layout analysis and classification;

Fig. 3 is a schematic diagram of the newspaper page with article regions after the reading order has been recovered;

Fig. 4 is a schematic diagram of the bipartite graph generated from the adjacency between non-body text blocks and article regions;

Fig. 5 is a schematic diagram of the result of the Kuhn-Munkres optimal matching algorithm.

Embodiment

The invention is further described below with reference to the accompanying drawings. The flowchart of the invention is shown in Fig. 1:

(1) Read in the newspaper document after layout analysis. Newspaper documents include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software such as Founder FeiTeng. Layout analysis divides the page bottom-up into block regions, which are physically classified as text blocks and image blocks. Each text block is classified as a body text block or a non-body text block according to its font style and the number of lines in the block; as shown in Fig. 2, solid rectangles denote body text blocks and dashed rectangles denote non-body text blocks. The adjacency of the body text blocks is expressed as a directed graph, which is split and converted into a weighted bipartite graph; the edge weights are computed with natural language processing, and optimal matching yields several continuous sequences. Each sequence is further divided into subsequences according to the style information of the text blocks. The merged region corresponding to a subsequence is an independent article region, and the text stream formed by concatenating the corresponding content is the content of the article region. As shown in Fig. 3, arrows indicate the reading order, each continuous arrow sequence joins body text blocks into an article region, circled numbers are the numbers of article regions, and plain numbers are the numbers of non-body text blocks;

(2) Build a weighted bipartite graph. The two vertex sets of the bipartite graph contain all non-body text blocks and all article regions respectively, and the edges of the bipartite graph correspond to the adjacency between non-body text blocks and article regions in the two-dimensional space of the page. As shown in Fig. 4, the left vertex set represents the non-body text blocks and the right vertex set represents the article regions;

(3) The edge weights of the bipartite graph are computed with natural language processing from the semantics of the non-body text block content and the article region content corresponding to the two vertices of each edge. The method exploits the fact that a title summarizes the topic of the article. Lexical analysis of the text in the body text blocks yields word set a, containing m distinct words, and the dispersion and co-reference degree of each word in word set a are computed: the dispersion is the distance between the sentences in which the word last and first occurs in the article body, and the co-reference degree is the number of times the word occurs in the article. Likewise, lexical analysis of the text in the non-body text block yields word set b, containing n distinct words, and the relative dispersion and relative co-reference degree of each word of word set b in the article body are computed: the relative dispersion is the distance between the sentences in which the word last and first occurs in the article body, and the relative co-reference degree is the number of times the word occurs in the article. The sum of the n largest dispersions in word set a is the total dispersion of word set a, and the sum of the n largest co-reference degrees in word set a is the total co-reference degree of word set a; the sum of all relative dispersions in word set b is the total relative dispersion of word set b, and the sum of all relative co-reference degrees in word set b is the total relative co-reference degree of word set b. The dispersion coefficient is the total relative dispersion of word set b divided by the total dispersion of word set a, and the co-reference coefficient is the total relative co-reference degree of word set b divided by the total co-reference degree of word set a. The word coverage of the title over the article body is the number of words of word set b that occur in the article body divided by the number of all words in word set b. The edge weight is a linear weighted combination of the dispersion coefficient, the co-reference coefficient, and the word coverage;

(4) Apply the Kuhn-Munkres algorithm to find the optimal matching of the weighted bipartite graph. The content of the non-body text block corresponding to each saturated vertex in the non-body text block vertex set of the optimal matching is a title, and the saturated vertex in the article region vertex set joined to it by a matching edge corresponds to the article body logically associated with that title. As shown in Fig. 5, for each matching edge the left vertex represents a title and the right vertex represents the article body logically associated with it; for example, title 6 and body text 7 are components of the same news item. The two are output as the title item and the body text item of the article's XML structure, respectively. A text block corresponding to an unsaturated vertex of the optimal matching is neither a title nor body text but other content in the layout, such as a header or speech. The method thus solves the logical classification of page objects and completes the logical association of titles and body text at the same time. The Kuhn-Munkres algorithm for computing the optimal matching is as follows:

1) Give the initial labels

$$l(x_i) = \max_j \omega_{ij}, \quad l(y_j) = 0, \quad i, j = 1, 2, \ldots, t, \quad t = \max(n, m);$$

2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y_k, E_l)$, and a matching $M$ in $G_l$;

3) If $M$ saturates all vertices of $X$, then $M$ is the optimal matching of $G$ and the computation ends; otherwise go to the next step;

4) Find an $M$-unsaturated vertex $x_0$ in $X$, and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where $A$ and $B$ are two sets;

5) If

$$N_{G_l}(A) = B,$$

go to step 9); otherwise go to the next step, where

$$N_{G_l}(A) \subseteq Y_k$$

is the set of vertices adjacent to the vertices of $A$;

6) Find a vertex

$$y \in N_{G_l}(A) - B;$$

7) If $y$ is $M$-saturated, find the vertex $z$ matched to $y$, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;

8) There is an augmenting path $P$ from $x_0$ to $y$; let $M \leftarrow M \oplus E(P)$ and go to step 3);

9) Compute the value $a$ as follows:

$$a = \min_{x_i \in A,\; y_j \notin N_{G_l}(A)} \{\, l(x_i) + l(y_j) - \omega_{ij} \,\},$$

Revise the labels:

$$l'(v) = \begin{cases} l(v) - a, & v \in A \\ l(v) + a, & v \in B \\ l(v), & \text{otherwise} \end{cases}$$

Compute $E_{l'}$ and $G_{l'}$ from $l'$;

10) $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, go to step 6).
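The steps above can be sketched in Python as follows. This is a compact variant written for illustration, not the exact control flow of the listing: the alternating-tree search of steps 4) to 8) is done by a recursive helper, and after each label revision the search restarts from the same unsaturated vertex. The function names, the tolerance `EPS`, and the rule that a pair of weight 0 is treated as unmatched are assumptions. The weight matrix is padded with zeros to size t = max(n, m), as in step 1).

```python
from typing import Dict, List

EPS = 1e-9  # tolerance for testing l(x) + l(y) == w on floats (assumed)


def kuhn_munkres(w: List[List[float]]) -> List[int]:
    """Maximum-weight perfect matching on a square matrix; returns match_y,
    where match_y[j] is the row matched to column j."""
    t = len(w)
    lx = [max(row) for row in w]      # step 1: l(x_i) = max_j w_ij
    ly = [0.0] * t                    # step 1: l(y_j) = 0
    match_y = [-1] * t

    def augment(x: int, in_a: List[bool], in_b: List[bool]) -> bool:
        # Grow the alternating tree from x using only equality-subgraph edges.
        in_a[x] = True
        for y in range(t):
            if not in_b[y] and abs(lx[x] + ly[y] - w[x][y]) < EPS:
                in_b[y] = True
                if match_y[y] == -1 or augment(match_y[y], in_a, in_b):
                    match_y[y] = x    # flip edges along the augmenting path
                    return True
        return False

    for x0 in range(t):               # step 4: next unsaturated vertex of X
        while True:
            in_a, in_b = [False] * t, [False] * t
            if augment(x0, in_a, in_b):
                break
            # Step 9: smallest slack between A and the columns outside B.
            a = min(lx[i] + ly[j] - w[i][j]
                    for i in range(t) if in_a[i]
                    for j in range(t) if not in_b[j])
            for i in range(t):
                if in_a[i]:
                    lx[i] -= a
            for j in range(t):
                if in_b[j]:
                    ly[j] += a
    return match_y


def match_titles(weights: List[List[float]]) -> Dict[int, int]:
    """weights[i][j]: weight between non-body block i and article region j
    (0 where there is no edge). Returns {block index: region index}."""
    n = len(weights)
    m = len(weights[0]) if n else 0
    t = max(n, m)
    padded = [[weights[i][j] if i < n and j < m else 0.0 for j in range(t)]
              for i in range(t)]
    match_y = kuhn_munkres(padded) if t else []
    # Pairs on padded or zero-weight edges are treated as unsaturated vertices.
    return {i: j for j, i in enumerate(match_y)
            if i < n and j < m and weights[i][j] > 0}
```

In this sketch, blocks missing from the returned mapping correspond to the unsaturated vertices, that is, blocks that are neither titles nor body text.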

Claims

1. A method for logically associating titles with body text in a newspaper layout, comprising the following steps:

(3) the edge weights of the bipartite graph are computed with natural language processing from the semantics of the non-body text block content and the article region content corresponding to the two vertices of each edge; the method exploits the fact that a title summarizes the topic of the article; lexical analysis of the text in the body text blocks yields word set a, containing m distinct words, and the dispersion and co-reference degree of each word in word set a are computed, the dispersion being the distance between the sentences in which the word last and first occurs in the article body, and the co-reference degree being the number of times the word occurs in the article; likewise, lexical analysis of the text in the non-body text block yields word set b, containing n distinct words, and the relative dispersion and relative co-reference degree of each word of word set b in the article body are computed, the relative dispersion being the distance between the sentences in which the word last and first occurs in the article body, and the relative co-reference degree being the number of times the word occurs in the article; the sum of the n largest dispersions in word set a is the total dispersion of word set a, the sum of the n largest co-reference degrees in word set a is the total co-reference degree of word set a, the sum of all relative dispersions in word set b is the total relative dispersion of word set b, and the sum of all relative co-reference degrees in word set b is the total relative co-reference degree of word set b; the dispersion coefficient is the total relative dispersion of word set b divided by the total dispersion of word set a, and the co-reference coefficient is the total relative co-reference degree of word set b divided by the total co-reference degree of word set a; the word coverage of the title over the article body is the number of words of word set b that occur in the article body divided by the number of all words in word set b; the edge weight is a linear weighted combination of the dispersion coefficient, the co-reference coefficient, and the word coverage;

(4) apply the Kuhn-Munkres algorithm to find the optimal matching of the weighted bipartite graph; the content of the non-body text block corresponding to each saturated vertex in the non-body text block vertex set of the optimal matching is a title, and the saturated vertex in the article region vertex set joined to it by a matching edge corresponds to the article body logically associated with that title; the two are output as the title item and the body text item of the article's XML structure, respectively;

wherein the logical association means classifying each text block laid out on the two-dimensional space of the newspaper page by its semantic function as title, body text, header, or speech, and then associating the title and body text that express the same news item as one structure.

2. The method for logically associating titles with body text in a newspaper layout according to claim 1, characterized in that: in step (1), the newspaper documents include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software; layout analysis divides the page bottom-up into block regions, which are physically classified as text blocks and image blocks; each text block is classified as a body text block or a non-body text block according to its font style and the number of lines in the block; the adjacency of the body text blocks is expressed as a directed graph, which is split and converted into a weighted bipartite graph; the edge weights of the bipartite graph are computed with natural language processing, and optimal matching yields several continuous sequences; each sequence is further divided into subsequences according to the style information of the text blocks; the merged region corresponding to a subsequence is an independent article region, and the text stream formed by concatenating the corresponding content is the content of the article region.

3. The method for logically associating titles with body text in a newspaper layout according to claim 1, characterized in that: in step (4), a text block corresponding to an unsaturated vertex of the optimal matching is neither a title nor body text but a header or speech in the layout, so that the logical classification of page objects is solved and the logical association of titles and body text is completed at the same time; the Kuhn-Munkres algorithm for computing the optimal matching is as follows: