CN101751455B - Method for automatically generating title by adopting artificial intelligence technology

Info

Publication number: CN101751455B
Application number: CN2009101570162A
Authority: CN
Inventors: 徐颂华; 杨少辉; 刘智满
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2009-12-31
Filing date: 2009-12-31
Publication date: 2011-09-21
Anticipated expiration: 2029-12-31
Also published as: CN101751455A

Abstract

The invention discloses a method for automatically generating a title using artificial intelligence technology. The method applies machine learning to word features derived from background knowledge related to the text. It comprises the following steps: generating a query for the text with a text-to-query conversion technique; searching the full text of Wikipedia with the query; defining new word features from the structure of the returned articles, obtained by analysis, and from the genre of the text; running a machine learning method on the generated features to extract candidate title words from the text; and clustering the words to generate the final title. The method introduces Wikipedia background knowledge into the recognition of candidate title words, makes full use of the various kinds of structural information in Wikipedia, defines word features using the genre of the text, and generates titles automatically by computer.

Description

Method for automatically generating a title using artificial intelligence technology

Technical field

The present invention relates to the fields of data mining and artificial intelligence, and in particular to a method for automatically generating a title using artificial intelligence technology.

Background technology

A large body of work addresses automatic title generation. A 2003 paper in the Proceedings of HLT-NAACL ("Hedge Trimmer: a parse-and-trim approach to headline generation") introduced a semantics-based method for generating article titles. A 2004 paper in the Proceedings of the Document Understanding Conference ("BBN/UMD at DUC 2004: Topiary") combined semantics-based sentence compression with statistically selected title words to produce article titles. A 2004 paper in the Proceedings of ACL ("Template-filtered headline summarization") introduced a template-based method for producing article titles. In 2001, the Proceedings of the Second International Conference on Computational Linguistics and Intelligent Text

In general, the related work we have examined uses only information from the article itself to derive statistical rules, and then generates titles from those rules.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a method for automatically generating a title using artificial intelligence technology.

The method for automatically generating a title using artificial intelligence technology comprises the following steps:

1) Acquire background knowledge for the text: a text-to-query conversion technique produces the query corresponding to the text. Important sentences in the text are detected and selected, meaningless words are removed, and the remaining words are reduced to their base forms; the result is the generated query. The full-text search engine Zettair takes this query as input, searches Wikipedia, and returns a set of Wikipedia articles;

2) Analyse the returned set of Wikipedia articles and extract valuable information from it: for each returned Wikipedia article, analyse its structure, extract four kinds of structural information (incoming links, outgoing links, categories, and the infobox), and form the corresponding sets;

3) Define new word features using Wikipedia's structural information and the genre of the article. Word features are defined from three aspects: features derived from Wikipedia background knowledge, features derived from the genre information of the article, and features derived from the article itself. Together they form a feature space;

4) On the word feature space produced above, apply a support vector machine to perform machine learning and obtain a trained model, then use this model to extract candidate title words from the text;

5) Use a clustering algorithm to join the extracted words together, and process the resulting title with grammar rules so that it meets the fluency requirement.

Step 1) comprises:

A) Build a graph over the sentences of the text: each node represents a sentence, each edge represents a relation between two sentences, and the edge weight is determined by the similarity of the two sentences. Use this graph to detect the important sentences in the text;

B) Each key node found by the computation represents a key sentence; remove the meaningless words from these sentences according to a stop-word list;

C) Reduce the words remaining after step B) to their base forms, and form the query corresponding to the article from these words;

D) Feed the generated query to the full-text search engine Zettair, which runs over Wikipedia. Zettair returns Wikipedia articles according to their relevance to the query and ranks them by relevance, giving a set of related Wikipedia articles.

Step 2) comprises:

E) For each article in the set, extract its incoming links to produce an incoming-link set. An incoming link links an article elsewhere in Wikipedia to the current article; the MediaWiki API is used to obtain the set of all incoming links of an article;

F) For each article in the set, extract its outgoing links to form an outgoing-link set. An outgoing link points from the current article to another location in Wikipedia. Outgoing links appear as hyperlinks in the body of the article, so the outgoing-link set is obtained by extracting all hyperlinks in the article;

G) For each article, extract its category information and form a category set;

H) For each article that contains an infobox, extract the parameter values in the infobox to form a set of infobox parameter values, and discard the infobox parameter names.

Step 3) comprises:

I) For each link in the incoming-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to compute the incoming-link feature of the candidate word;

J) For each link in the outgoing-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to obtain the outgoing-link feature of the candidate word;

K) For each element of the category set of a Wikipedia article, use Wikipedia's category tree to obtain its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to obtain the category feature of the candidate word;

L) For each element of the infobox parameter-value set of a Wikipedia article, use WordNet to obtain its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to obtain the infobox feature of the candidate word;

M) Extract the genre features of the article, including appearance features, character features, and structural features, and measure the genre similarity of two articles;

N) Use an article collection covering many genres. Given an article, find the 300 articles in the collection with the closest genre similarity, extract their titles, and remove the meaningless words. For each remaining word, count its occurrences and compute the genre similarity between the word and the article;

O) Also use some widely used word features: the frequency of the word in the article, the position of the word in the article, whether the word is a proper name or place name, the word length, and whether the word appears in a summarising sentence.
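The word-frequency feature in item O) is measured in the embodiment with a normalised tf.idf. The following is a minimal sketch of one common variant, assuming `log(N/df)` as the inverse document frequency and cosine normalisation; the function name `tfidf` and the toy corpus are hypothetical, and the exact weighting used by the invention is not specified in this text.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Cosine-normalised tf.idf weights for the words of one document.

    Assumption: idf = log(N / df); the source only says "normalised tf.idf".
    """
    n_docs = len(corpus)
    raw: Dict[str, float] = {}
    for word, freq in Counter(doc).items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for d in corpus if word in d)
        raw[word] = freq * math.log(n_docs / df) if df else 0.0
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: (v / norm if norm else 0.0) for w, v in raw.items()}


corpus = [["title", "word", "word"], ["title", "text"], ["title", "model"]]
weights = tfidf(corpus[0], corpus)
```

A word that occurs in every document ("title" here) receives weight zero, which is the usual motivation for combining term frequency with inverse document frequency.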

Step 4) comprises:

P) Treat keyword extraction as a classification problem: run a support vector machine on the feature space produced above to divide the candidate words into keywords and non-keywords;

Q) When training the support vector machine, use words that appear in the title as positive examples and all other words as negative examples. Train a support vector model, then use this model to extract keywords;

R) Sort the extracted keywords according to the magnitude of the machine learning output, with a parameter controlling how many keywords are extracted; the higher a candidate word ranks, the more likely it is to be a keyword.

Step 5) comprises:

S) Mark the identified candidate title words in the text; each candidate word, together with the words to its left and right, forms a small block, and two small blocks that are connected are merged into a larger block;

T) When no more blocks can be merged, identify the largest block in the text and use its words as the title. If the title-length requirement is not met, identify the next-largest block and add its words to the title, repeating until the length requirement is met;

U) To further improve the readability of the title, apply a set of grammar rules to optimise it; POS (part-of-speech) tags are also used to optimise the title. The title after these two optimisations is the final title.

Compared with the prior art, the invention has the following beneficial effects:

(1) Wikipedia background knowledge is introduced into the process of identifying candidate title words;

(2) the various kinds of structural information in Wikipedia are fully used;

(3) word features are defined using the genre information of the article.

Description of drawings

Fig. 1 is a software flowchart of the method for automatically generating a title using artificial intelligence technology;

Fig. 2 is a flowchart of acquiring background knowledge from Wikipedia;

Fig. 3 is a schematic diagram of candidate title words identified from a text according to the invention;

Fig. 4 is a flowchart of the clustering process for candidate words according to the invention;

Fig. 5 is a schematic diagram of an example of automatic title generation according to the invention.

Embodiment

As shown in Fig. 1, the flow of a system implementing the invention comprises: acquiring article background knowledge 101; analysing the structure of the returned Wikipedia text 102; defining new word features from Wikipedia structure and genre 103; identifying candidate title words by machine learning 104; and clustering and optimising to form the final title 105.

Acquiring article background knowledge 101: in this example, this part comprises the following steps:

(A) Detect the key sentences in the article, as follows:

1) Treat each sentence of the article as a node in a graph, so that the article yields one graph. This step adopts the key-sentence detection algorithm published in the Proceedings of EMNLP in 2004 ("TextRank: Bringing order into texts", 233-242, 2004).

2) The algorithm builds a sentence-based graph in which each node represents a sentence, each edge represents a relation between two sentences, and the edge weight is determined by the similarity of the two sentences. The similarity between sentences is computed from the words in the two sentences, using WordNet to account for the similarity between words. The function that computes the similarity between two sentences is defined as follows:

\mathrm{Similarity}(S_i, S_j) = \frac{\sum_{W_p \in S_i} \sum_{W_q \in S_j} \sigma_1(W_p, W_q)}{\log(|S_i|) + \log(|S_j|)}

where S denotes a sentence, W denotes a word in a sentence, |S| denotes the number of words in the sentence, and σ_1(W_p, W_q) is the WordNet-based similarity between two words.

3) The similarity between words is measured with WordNet, using the WordNet-based word similarity method proposed in a paper published in the Proceedings of AAAI in 2004 ("WordNet::Similarity - Measuring the Relatedness of Concepts", Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004).
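Step (A) can be illustrated with a short sketch: the sentence similarity function defined above, followed by a weighted PageRank-style iteration in the spirit of TextRank. This is an illustration under stated assumptions, not the patented implementation: `exact_match` is a toy stand-in for the WordNet measure σ_1, and the damping factor 0.85 and the fixed iteration count are assumed values that the text does not specify.

```python
import math
from typing import Callable, List, Sequence

WordSim = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Toy stand-in for the WordNet-based measure sigma_1 (assumption)."""
    return 1.0 if a == b else 0.0


def sentence_similarity(s_i: Sequence[str], s_j: Sequence[str],
                        word_sim: WordSim = exact_match) -> float:
    """Summed pairwise word similarity over log(|S_i|) + log(|S_j|)."""
    if not s_i or not s_j:
        return 0.0
    denom = math.log(len(s_i)) + math.log(len(s_j))
    if denom == 0.0:  # both sentences contain exactly one word
        return 0.0
    return sum(word_sim(p, q) for p in s_i for q in s_j) / denom


def rank_sentences(sentences: List[Sequence[str]],
                   word_sim: WordSim = exact_match,
                   damping: float = 0.85, iters: int = 50) -> List[float]:
    """Score sentences by weighted PageRank on the similarity graph."""
    n = len(sentences)
    # Edge weights; a sentence has no edge to itself.
    w = [[0.0 if i == j else
          sentence_similarity(sentences[i], sentences[j], word_sim)
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


sentences = [["oil", "prices", "rose"],
             ["oil", "fell", "today"],
             ["prices", "climbed", "again"]]
scores = rank_sentences(sentences)
```

In this toy input the first sentence shares a word with each of the other two, so it receives the highest score and would be selected as the key sentence.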

(B) Process the key sentences detected in step (A) to obtain the corresponding query, as follows:

1) Remove the meaningless words from the key sentences, using the stop-word list published in ACM Forum in 1989 ("A stop list for general text", ACM Forum, 24 (1-2): 19-21, 1989).

2) Reduce the remaining words to their base forms, and form the query corresponding to the article from the result.
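Step (B) amounts to stop-word removal followed by reduction to base forms. The sketch below shows the shape of that pipeline only: `STOP_WORDS` is a tiny hypothetical list rather than the published stop list, and `naive_stem` is a crude suffix stripper assumed here in place of whatever stemmer or lemmatiser the invention actually uses.

```python
import re
from typing import Iterable, List, Set

# Hypothetical miniature stop list; the source cites a published list instead.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "in", "on", "and",
                        "is", "are", "to"}


def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (illustrative assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_query(key_sentences: Iterable[str]) -> List[str]:
    """Turn key sentences into a list of query terms."""
    query: List[str] = []
    for sentence in key_sentences:
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token not in STOP_WORDS:
                query.append(naive_stem(token))
    return query


query = build_query(["The engines are searching indexed articles"])
```

The resulting term list would then be passed to the full-text search engine as the query for the article.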

(C) Search Wikipedia with the generated query, as follows:

1) Perform a full-text search of Wikipedia with the generated query. This step uses the full-text search engine Zettair, described in a paper published in the Proceedings of the Text Retrieval Conference in 2004 ("RMIT University at TREC 2004", Proceedings Text Retrieval Conference), to search Wikipedia and return a series of relevant article titles.

2) Sort the returned articles by their relevance to the query and take the top N, which gives a set of related Wikipedia articles; the value of N is adjustable.

Analysing the structure of the returned Wikipedia text 102: in this example, this part comprises the following steps:

(D) Extract the link structure, comprising incoming links and outgoing links, from the Wikipedia articles, as follows:

1) An incoming link links an article elsewhere in Wikipedia to the current article. This step uses the MediaWiki API, described in a paper published in the Proceedings of ISWC in 2006 ("Semantic MediaWiki", Proceedings of the 5th International Semantic Web Conference, 935-942, 2006), to obtain the set of all incoming links of an article.

2) An outgoing link points from the current article to another location in Wikipedia. Outgoing links appear as hyperlinks in the body of the article, so the outgoing-link set of an article is obtained by extracting all hyperlinks in it.
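Extracting outgoing links, as described in item 2), can be sketched with a regular expression over the article source. This assumes the article is available as MediaWiki wikitext, where internal links are written `[[Target]]` or `[[Target|label]]`; the pattern, the skipped namespace prefixes, and the function name `outgoing_links` are illustrative choices, and nested or templated links are not handled.

```python
import re
from typing import List

# [[Target]], [[Target|label]] or [[Target#Section]]; captures only Target.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Namespaces that are not ordinary article links (assumed, not exhaustive).
SKIPPED_PREFIXES = ("Category:", "File:", "Image:")


def outgoing_links(wikitext: str) -> List[str]:
    """Return the distinct article titles linked from the wikitext, in order."""
    seen = set()
    targets: List[str] = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if target.startswith(SKIPPED_PREFIXES) or target in seen:
            continue
        seen.add(target)
        targets.append(target)
    return targets


sample = ("[[Hangzhou]] is home to [[Zhejiang University|ZJU]]. "
          "[[Category:Cities]] See [[Hangzhou]] and [[West Lake#History]].")
links = outgoing_links(sample)
```

Duplicates are dropped because the text speaks of a link set; whether repeated links should instead be counted is left open here.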

(E) Extract category information and infobox parameter information from the Wikipedia articles, as follows:

1) The category structure is a key feature of Wikipedia: it groups related articles together for the reader's convenience. In this step we extract the category information of each article and form a category set.

2) The infobox in a Wikipedia article is a summary of the important information in the article. For each article that contains an infobox, extract the parameter values in the infobox to form a set of infobox parameter values, and discard the infobox parameter names.
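Step (E) can be sketched in the same way, again assuming raw MediaWiki wikitext as input. The functions `categories` and `infobox_values` are hypothetical helpers; the infobox parser below splits fields on `|` and therefore does not handle values that themselves contain piped links or nested templates.

```python
import re
from typing import List

CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def categories(wikitext: str) -> List[str]:
    """Collect the category names declared in the article source."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


def infobox_values(wikitext: str) -> List[str]:
    """Return infobox parameter values, discarding the parameter names."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return []
    # Walk forward counting {{ and }} to find the end of the template.
    depth, i, end = 0, start, -1
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                end = i - 2
                break
        else:
            i += 1
    if end == -1:  # unbalanced template
        return []
    values: List[str] = []
    for field in wikitext[start + 2:end].split("|")[1:]:
        if "=" in field:
            value = field.split("=", 1)[1].strip()
            if value:
                values.append(value)
    return values


page = ("{{Infobox university\n| name = Zhejiang University\n"
        "| city = Hangzhou\n| motto = \n}}\nText. "
        "[[Category:Universities in China]] [[Category:Hangzhou|*]]")
cats = categories(page)
vals = infobox_values(page)
```

Empty parameters such as `motto` above contribute nothing to the value set.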

Defining new word features from Wikipedia structure and genre 103: in this example, this part comprises the following steps:

(F) Define new word features from the structural information of the Wikipedia articles, as follows:

1) For each link in the incoming-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to compute the incoming-link feature of the candidate word. The feature value is computed by the following function:

S_I(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{k \in IT(p_r)} \sigma_1(x_i, k) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |IT(p_r)|}

where Π denotes the set of Wikipedia articles obtained earlier, x_i denotes a candidate word, p_r denotes an article in Π, z(p_r) denotes the relevance score returned by the full-text search engine Zettair, σ_1 is the WordNet-based similarity between two words, |·| denotes the number of elements in a set, and IT denotes the incoming-link set.

2) For each link in the outgoing-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to compute the outgoing-link feature of the candidate word.

S_O(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{k \in OT(p_r)} \sigma_1(x_i, k) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |OT(p_r)|}

where OT denotes the outgoing-link set; the other symbols are as defined in item 1).

3) For each element of the category set of a Wikipedia article, use Wikipedia's category tree to compute its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to compute the category feature of the candidate word.

S_C(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{c \in C(p_r)} \sigma_2(x_i, c) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |C(p_r)|}

where C denotes the category set of a Wikipedia article and σ_2 is the similarity between two words computed from Wikipedia's category tree. The other symbols are as defined in item 1).

4) For each element of the infobox parameter-value set of a Wikipedia article, use WordNet to compute its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to compute the infobox feature of the candidate word.

S_F(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{k \in IV(p_r)} \sigma_1(x_i, k) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |IV(p_r)|}

where IV denotes the infobox parameter-value set of a Wikipedia article; the other symbols are as defined in item 1).
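The four features in step (F) share one form: a relevance-weighted sum of similarities divided by a relevance-weighted element count. A minimal sketch of that shared form follows. The function name `structure_feature` and the `(score, elements)` pair representation are assumptions for illustration, and `exact_match` is a toy stand-in for σ_1 or σ_2.

```python
from typing import Callable, List, Sequence, Tuple


def exact_match(a: str, b: str) -> float:
    """Toy stand-in for the similarity measures sigma_1 / sigma_2."""
    return 1.0 if a == b else 0.0


def structure_feature(
    candidate: str,
    articles: List[Tuple[float, Sequence[str]]],
    word_sim: Callable[[str, str], float] = exact_match,
) -> float:
    """Compute S(x_i, Pi) for one kind of structural set.

    Each entry of `articles` pairs the search-engine score z(p_r) with the
    article's structural set (incoming links, outgoing links, categories,
    or infobox values).
    """
    numerator = 0.0
    denominator = 0.0
    for score, elements in articles:
        numerator += score * sum(word_sim(candidate, k) for k in elements)
        denominator += score * len(elements)
    return numerator / denominator if denominator > 0 else 0.0


# Two retrieved articles with scores 2.0 and 1.0 and their link sets.
retrieved = [(2.0, ["title", "news"]), (1.0, ["title"])]
feature = structure_feature("title", retrieved)
```

Passing the incoming-link sets yields S_I, the outgoing-link sets S_O, the category sets (with a category-tree similarity) S_C, and the infobox value sets S_F.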

(G) Define new word features from the genre information of the article, as follows:

1) Extract the genre features of the article to determine its genre. This step uses the method proposed in a paper published in 2001 ("The form is the substance: classification of genres in text", Proceedings of the Workshop on Human Language Technology and Knowledge Management, 1-8, 2001), which determines the genre of an article from multiple features, including structural features, character features, and appearance features.

2) Measure the genre similarity of two articles, using the method proposed by J. G. Stewart in a 2008 PhD dissertation ("Genre Oriented Summarization").

3) Define the genre fitness of a word for an article from the number of times the word occurs in article titles. This method uses an article collection covering many genres. Given an article, find the 300 articles in the collection with the closest genre similarity, extract their titles, and remove the meaningless words from the titles. For each remaining word, count its occurrences and compute the genre similarity between the word and the article. The genre-based word weighting function is defined as:

WO(w_i) = \sum_{k=1}^{n} \theta(d_j, d_{j,k})

where θ is the function from item 2) that measures the genre similarity of two articles, and d_{j,k} are the 300 articles whose genre is most similar to that of d_j.

4) Based on the result of item 3), a genre-based word frequency function is further defined:

WF(w_k) = \frac{WO(w_k)}{\sum_{t=1}^{m} WO(w_t)}

where m is the total number of words that occur in the titles of the 300 articles. Based on the two formulas above, the genre fitness feature of a word is then defined by the following function:

\gamma(w_i, d_j) = \sum_{k=1}^{m} WF(w_k) \, \sigma_1(w_k, w_i)
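The three functions in step (G) can be sketched together. One reading is assumed here that the text does not state outright: the sum in WO runs over those neighbouring articles whose title contains the word, so each title occurrence contributes the genre similarity θ of its article. The names `genre_word_frequency` and `genre_fitness` are hypothetical, and `exact_match` again stands in for σ_1.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def exact_match(a: str, b: str) -> float:
    """Toy stand-in for the WordNet-based measure sigma_1."""
    return 1.0 if a == b else 0.0


def genre_word_frequency(
    neighbours: List[Tuple[float, Sequence[str]]],
) -> Dict[str, float]:
    """Compute WF from (theta, title_words) pairs of the nearest articles."""
    wo: Dict[str, float] = defaultdict(float)
    for theta, title_words in neighbours:
        for word in title_words:
            wo[word] += theta  # WO: genre-similarity-weighted occurrences
    total = sum(wo.values())
    return {w: v / total for w, v in wo.items()} if total > 0 else {}


def genre_fitness(
    candidate: str,
    wf: Dict[str, float],
    word_sim: Callable[[str, str], float] = exact_match,
) -> float:
    """Compute gamma: WF-weighted similarity of title words to the candidate."""
    return sum(weight * word_sim(w, candidate) for w, weight in wf.items())


# Two neighbouring articles (the method uses 300) with stop words removed.
neighbours = [(0.5, ["market", "news"]), (0.25, ["market"])]
wf = genre_word_frequency(neighbours)
fitness = genre_fitness("market", wf)
```

With a real similarity measure, words that merely resemble frequent title words of the genre would also receive a non-zero fitness.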

(H) Use some widely used word features, as follows:

1) Compute the frequency feature of a word from how often it occurs in the article, using a normalised tf.idf to measure word frequency. The tf.idf value is computed with the method proposed in a 1987 technical report ("Term-weighting approaches in automatic text retrieval", Technical report, 1987).

2) Use the positions and number of occurrences of a word in the article to define its first-occurrence feature, average-occurrence feature, and last-occurrence feature. Whether a word is a proper name or place name is also used as a feature, as is the relative length of the word. Finally, if a word appears together with a summarising phrase such as "in summary" or "in conclusion", its summary feature is set to 1, and otherwise to 0.

Identifying candidate title words by machine learning 104: a support vector machine is run on the feature space produced above to divide the candidate words into keywords and non-keywords. During training, words that appear in the title serve as positive examples and all other words as negative examples. The training data take the form (F(w_1), y_1), ..., (F(w_n), y_n), where F(w_j) is the feature vector of the j-th word and y_j is the class label of that word, with value 1 or -1: 1 denotes a keyword and -1 a non-keyword. A support vector model is then trained and used to extract keywords. The extracted keywords are ranked by the magnitude of the model's output; the higher a candidate word ranks, the more likely it is to be a keyword. The number of keywords extracted is controlled by the parameter M.
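The data preparation and ranking around the support vector machine can be sketched without committing to a particular SVM library. In the sketch below, `make_training_set` builds the labelled pairs (F(w_j), y_j) and `top_m_keywords` keeps the M highest-scoring candidates; both names are hypothetical, and `decision_value` stands for whatever signed output the trained model produces, which is assumed rather than taken from the text.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def make_training_set(
    words: Iterable[str],
    feature_of: Callable[[str], Sequence[float]],
    title_words: Set[str],
) -> List[Tuple[Sequence[float], int]]:
    """Label words in the title +1 (keyword) and all others -1."""
    return [(feature_of(w), 1 if w in title_words else -1) for w in words]


def top_m_keywords(
    words: Iterable[str],
    decision_value: Callable[[str], float],
    m: int,
) -> List[str]:
    """Rank candidates by model output and keep the best M."""
    ranked = sorted(words, key=decision_value, reverse=True)
    return ranked[:m]


# Toy example: word length as the only feature, made-up model outputs.
training = make_training_set(["market", "fell"], lambda w: [len(w)], {"market"})
outputs = {"market": 1.2, "the": -0.8, "rally": 0.4, "today": -0.1}
keywords = top_m_keywords(list(outputs), outputs.get, 2)
```

In practice the feature vector would hold the structural, genre, and conventional features defined in steps (F) to (H).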

Clustering and optimising to form the title 105: in this example, this part comprises the following steps:

(I) Cluster the identified keywords to form a preliminary title, as follows:

1) Cluster the identified keywords. This step uses the method proposed in a paper published in the Proceedings of HLT/NAACL in 2003 ("Headline Summarization at ISI", Proceedings of HLT/NAACL Workshop on Automatic Summarization/DUC 2003, 2003) to cluster the keywords and form a preliminary title.

2) Identify the largest cluster window in the text and use the words in this window as the title. If the title-length requirement is not met, identify the next-largest window and add its words to the title, repeating until the length requirement is met.
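The window clustering in step (I) can be sketched as interval merging. Following the summary of the invention, each keyword together with its left and right neighbours forms a small block, and touching or overlapping blocks are merged. The function name `build_title`, the use of `max_len` as the length requirement, and the final truncation to `max_len` words are assumptions made for this illustration.

```python
from typing import List, Sequence, Set


def build_title(tokens: Sequence[str], keywords: Set[str],
                max_len: int) -> List[str]:
    """Assemble a preliminary title from the largest keyword blocks."""
    spans: List[List[int]] = []
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        # A keyword plus its immediate neighbours forms a small block.
        lo, hi = max(0, i - 1), min(len(tokens) - 1, i + 1)
        if spans and lo <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], hi)  # merge with previous block
        else:
            spans.append([lo, hi])
    # Largest block first; equal-sized blocks keep their text order.
    spans.sort(key=lambda s: s[1] - s[0], reverse=True)
    title: List[str] = []
    for lo, hi in spans:
        if len(title) >= max_len:
            break
        title.extend(tokens[lo:hi + 1])
    return title[:max_len]


text = "stocks fell sharply on monday as oil prices rose again".split()
short = build_title(text, {"stocks", "oil", "prices"}, 4)
longer = build_title(text, {"stocks", "oil", "prices"}, 6)
```

Here the blocks around "oil" and "prices" merge into one four-word window, which alone satisfies a length of 4; asking for 6 words pulls in the next-largest block as well.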

(J) Optimise the preliminary title, as follows:

1) Optimise the title with grammar rules to improve readability. This step uses the grammar rules proposed in a paper published at HLT/NAACL in 2003 ("Headline Summarization at ISI", Proceedings of HLT/NAACL Workshop on Automatic Summarization/DUC 2003, 2003).

2) Optimise the title using the POS (part-of-speech) tags of its words to improve readability. The POS tags are computed with the method of a paper published in AI Magazine in 1997 ("Statistical Techniques for Natural Language Parsing", AI Magazine, 18 (4): 33-44, 1997).

Claims

1. A method for automatically generating a title using artificial intelligence technology, comprising the following steps:

3) Define new word features using Wikipedia's structural information and the genre of the article. Word features are defined from three aspects: features derived from Wikipedia background knowledge, features derived from the genre information of the article, and features derived from the article itself. Together they form a feature space;

5) Use a clustering algorithm to join the extracted words together, and process the resulting title with grammar rules so that it meets the fluency requirement;

Step 1) comprises:

D) Feed the generated query to the full-text search engine Zettair, which runs over Wikipedia. Zettair returns Wikipedia articles according to their relevance to the query and ranks them by relevance, giving a set of related Wikipedia articles;

Step 2) comprises:

E) For each article in the set, extract its incoming links to produce an incoming-link set. An incoming link links an article elsewhere in Wikipedia to the current article; the MediaWiki API is used to obtain the set of all incoming links of an article;

G) For each article, extract its category information and form a category set;

H) For each article that contains an infobox, extract the parameter values in the infobox to form a set of infobox parameter values, and discard the infobox parameter names;

Step 3) comprises:

K) For each element of the category set of a Wikipedia article, use Wikipedia's category tree to obtain its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to obtain the category feature of the candidate word;

L) For each element of the infobox parameter-value set of a Wikipedia article, use WordNet to obtain its similarity with the candidate word, and also take into account the score the article received from the full-text search engine, to obtain the infobox feature of the candidate word;

M) Extract the genre features of the article, including appearance features, character features, and structural features, and measure the genre similarity of two articles;

N) Use an article collection covering many genres. Given an article, find the 300 articles in the collection with the closest genre similarity, extract their titles, and remove the meaningless words. For each remaining word, count its occurrences and compute the genre similarity between the word and the article;

O) Also use some widely used word features: the frequency of the word in the article, the position of the word in the article, whether the word is a proper name or place name, the word length, and whether the word appears in a summarising sentence;

Step 4) comprises:

Q) When training the support vector machine, use words that appear in the title as positive examples and all other words as negative examples. Train a support vector model, then use this model to extract keywords;

R) Sort the extracted keywords according to the magnitude of the machine learning output, with a parameter M controlling how many keywords are extracted; the higher a candidate word ranks, the more likely it is to be a keyword.

2. The method for automatically generating a title using artificial intelligence technology according to claim 1, characterized in that step 5) is: