CN101751455B - Method for automatically generating title by adopting artificial intelligence technology - Google Patents

Publication number
CN101751455B
Authority
CN
China
Prior art keywords
word
article
wikipedia
text
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009101570162A
Other languages
Chinese (zh)
Other versions
CN101751455A (en)
Inventor
徐颂华
杨少辉
刘智满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN2009101570162A
Publication of CN101751455A
Application granted
Publication of CN101751455B
Current legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically generating a title using artificial intelligence technology. The method applies machine learning to word features derived from background knowledge related to the text. It comprises the following steps: generating a query for the text with a text-to-query conversion technique; searching the full text of Wikipedia with the query; defining new word features from the structure of the retrieved articles and from the genre of the text; running a machine learning method on the resulting features to extract candidate title words from the text; and clustering the words to generate the final title. The method introduces Wikipedia background knowledge into the recognition of candidate title words, makes full use of the various kinds of structural information in Wikipedia, defines word features using the genre of the text, and allows titles to be generated automatically by a computer.

Description

Method for automatically generating titles using artificial intelligence technology
Technical field
The present invention relates to the fields of data mining and artificial intelligence, and in particular to a method for automatically generating titles using artificial intelligence technology.
Background art
A large body of work addresses automatic title generation. A 2003 paper in the Proceedings of HLT-NAACL ("Hedge trimmer: a parse-and-trim approach to headline generation") introduced a semantics-based method for generating article titles. A 2004 paper in the Proceedings of the Document Understanding Conference ("BBN/UMD at DUC 2004: Topiary") combined semantics-based sentence compression with statistically selected title words to produce article titles. A 2004 paper in the Proceedings of ACL ("Template-filtered headline summarization") introduced a template-based method for producing article titles. The 2001 Proceedings of the Second International Conference on Computational Linguistics and Intelligent Text
In general, the related work we have observed uses only information from the article itself to derive statistical rules, and then generates titles based on those rules.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a method for automatically generating titles using artificial intelligence technology.
The method for automatically generating titles using artificial intelligence technology comprises the following steps:
1) Acquire background knowledge for the text. A text-to-query conversion technique produces the query corresponding to the text: the important sentences in the text are detected and selected, stop words are removed, and the remaining words are reduced to their base forms; the result is the generated query. The full-text search engine Zettair takes this query as input, searches Wikipedia, and returns a set of Wikipedia articles;
2) Analyze the returned set of Wikipedia articles and extract useful information from it. For each returned Wikipedia article, analyze its structure, extract four kinds of structural information (incoming links, outgoing links, categories, and the infobox), and form the corresponding sets;
3) Define new word features using the structural information of Wikipedia and the genre of the article. Word features are defined from three sources: features derived from Wikipedia background knowledge, features derived from the genre information of the article, and features derived from the article itself. Together they form a feature space;
4) On the word feature space produced above, perform machine learning with a support vector machine to obtain a trained model, and use this model to extract candidate title words from the text;
5) Use a clustering algorithm to join the extracted words together, and apply grammar rules to the resulting title so that it meets the fluency requirement.
Step 1) is as follows:
a) Build a graph from the sentences in the text. Each node represents a sentence and each edge represents a relation between two sentences; the weight of an edge is determined by the similarity of the two sentences. This graph is used to detect the important sentences in the text;
b) Each key node found by the computation represents a key sentence; the stop words in each key sentence are then removed according to a stop-word list;
c) Reduce the words remaining after step b) to their base forms, and combine them into the query corresponding to the article;
d) Submit the query to the full-text search engine Zettair, which runs over Wikipedia and returns Wikipedia articles according to their relevance to the query; the results, ranked by relevance, form a set of related Wikipedia articles.
Step 2) is as follows:
e) For each article in the set, extract its incoming links to produce an incoming-link set. An incoming link connects an article elsewhere in Wikipedia to the current article; the MediaWiki API is used to obtain the full set of incoming links of an article;
f) For each article in the set, extract its outgoing links to form an outgoing-link set. An outgoing link points from the current article to another location in Wikipedia and appears as a hyperlink in the article text, so the outgoing-link set is obtained by extracting all hyperlinks in the article;
g) For each article, extract its category information to form a category set;
h) For each article that contains an infobox, extract the parameter values of the infobox to form a set of infobox parameter values; the infobox parameter names are discarded.
Step 3) is as follows:
i) For each link in the incoming-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the incoming-link feature of the candidate word;
j) For each link in the outgoing-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the outgoing-link feature of the candidate word;
k) For each element of the category set of a Wikipedia article, use the Wikipedia category hierarchy to obtain its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the category feature of the candidate word;
l) For each element of the infobox parameter value set of a Wikipedia article, use WordNet to obtain its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the infobox feature of the candidate word;
m) Extract the genre features of the article, including presentation features, character-level features, and structural features, and use them to measure the genre similarity of two articles;
n) Use a collection of articles covering many genres. Given an article, find the 300 articles in the collection that are closest to it in genre, extract their titles, and remove the stop words. For each remaining word, count its occurrences and compute the genre fitness between the word and the article;
o) Also use several widely used word features: the frequency of the word in the article, the position of the word in the article, whether the word is a proper name or place name, the length of the word, and whether the word appears in a summarizing sentence.
Step 4) is as follows:
p) Treat keyword extraction as a classification problem: run the support vector machine algorithm on the feature space produced above to classify candidate words as keywords or non-keywords;
q) When training the support vector machine, use words that appear in the title as positive examples and all other words as negative examples; train a support vector model and use it to extract keywords;
r) Sort the extracted keywords by the magnitude of the signal output by the learned model; the higher a candidate word ranks, the more likely it is to be a keyword. A parameter controls the number of keywords extracted.
Step 5) is as follows:
s) Mark the identified candidate title words in the text. Each candidate word, together with the words to its left and right, forms a small block; if two small blocks are connected, merge them into a larger block;
t) When no more blocks can be merged, identify the largest block in the text and use its words as the title. If the title length requirement is not met, identify the next largest block and add its words to the title, repeating until the length requirement is met;
u) To further improve readability, apply a set of grammar rules to optimize the generated title; POS tags are also used to optimize it. The title that results from these two optimizations is the final title.
Compared with the prior art, the invention has the following beneficial effects:
(1) it introduces Wikipedia background knowledge into the identification of candidate title words;
(2) it makes full use of the various kinds of structural information in Wikipedia;
(3) it uses the genre information of the article to define word features.
Description of drawings
Fig. 1 is the software flow chart of the method for automatically generating titles using artificial intelligence technology;
Fig. 2 is the flow chart for obtaining background knowledge from Wikipedia;
Fig. 3 is a schematic diagram of candidate title words identified in a text according to the invention;
Fig. 4 is the flow chart of the clustering process for candidate words according to the invention;
Fig. 5 is a schematic diagram of an example of automatic title generation according to the invention.
Embodiment
As shown in Fig. 1, the flow of the system implementing the invention comprises: article background knowledge acquisition 101; analysis of the structure of the returned Wikipedia articles 102; definition of new word features from Wikipedia structure and genre 103; identification of candidate title words by machine learning 104; and clustering and optimization to form the final title 105.
Article background knowledge acquisition 101: in this example, this part comprises the following steps:
(A) Detect the key sentences in the article. The steps are detailed as follows:
1) Treat each sentence in the article as a node of a graph, so that one graph is produced per article. This method adopts the key-sentence detection algorithm published in the Proceedings of EMNLP in 2004 ("TextRank: Bringing order into texts", 233-242, 2004).
2) The algorithm builds a sentence-based graph: nodes represent sentences, edges represent relations between sentences, and the weight of an edge is determined by the similarity of the two sentences. Sentence similarity is computed from the words of the two sentences, using WordNet to account for the similarity between words. The similarity between two sentences is defined as:
$$\mathrm{Similarity}(S_i, S_j) = \frac{\sum_{W_p \in S_i} \sum_{W_q \in S_j} \sigma_1(W_p, W_q)}{\log(|S_i|) + \log(|S_j|)}$$
where $S$ denotes a sentence, $W$ denotes a word in a sentence, $|\cdot|$ denotes the number of words in a sentence, and $\sigma_1(W_p, W_q)$ measures the similarity between two words using WordNet.
3) Word similarity is measured with WordNet. This method uses a WordNet-based word similarity measure proposed in a paper published in the Proceedings of AAAI in 2004 ("WordNet::Similarity - measuring the relatedness of concepts", Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004).
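Step (A) can be illustrated with a minimal Python sketch. It is not the patented implementation: `word_sim` is a hypothetical exact-match stand-in for the WordNet measure, and the damping factor and iteration count are assumed values for a weighted PageRank-style iteration in the spirit of TextRank.

```python
import math

def word_sim(a, b):
    # Hypothetical stand-in for the WordNet-based measure: exact match only.
    return 1.0 if a == b else 0.0

def sentence_similarity(s_i, s_j):
    # Sum of pairwise word similarities, normalised by the log sentence lengths.
    denom = math.log(len(s_i)) + math.log(len(s_j))
    if denom <= 0:
        return 0.0
    return sum(word_sim(p, q) for p in s_i for q in s_j) / denom

def rank_sentences(sentences, damping=0.85, iterations=50):
    # Weighted PageRank-style iteration over the sentence graph.
    n = len(sentences)
    w = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

sentences = [
    ["wikipedia", "provides", "background", "knowledge"],
    ["background", "knowledge", "improves", "title", "generation"],
    ["title", "generation", "uses", "wikipedia", "knowledge"],
    ["unrelated", "remark", "here", "today"],
]
scores = rank_sentences(sentences)
```

In this toy input the last sentence shares no words with the others, so it receives the lowest score; the highest-scoring sentences would be taken as key sentences.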
(B) Process the key sentences detected in step (A) to obtain the corresponding query. The steps are detailed as follows:
1) Remove the stop words from the key sentences. This method uses the stop-word list published in ACM Forum in 1989 ("A stop list for general text", ACM Forum, 24 (1-2): 19-21, 1989) to remove the stop words from the sentences.
2) Reduce the remaining words to their base forms; the processed result forms the query corresponding to the article.
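A rough sketch of step (B) in Python follows. The stop list here is a tiny illustrative set, not the 1989 list cited above, and `naive_stem` is a hypothetical suffix stripper standing in for whatever stemmer or lemmatizer an implementation would actually use.

```python
# Tiny illustrative stop list (assumption; not the published list).
STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to", "are"}

def naive_stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer or lemmatizer.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_query(key_sentences):
    # Drop stop words, reduce the rest to base forms, keep first occurrences.
    terms = []
    for sentence in key_sentences:
        for token in sentence.lower().split():
            token = token.strip(".,;:!?\"'()")
            if token and token not in STOP_WORDS:
                stem = naive_stem(token)
                if stem not in terms:
                    terms.append(stem)
    return " ".join(terms)

query = build_query([
    "The engines are searching articles.",
    "Articles in the encyclopedia.",
])
print(query)  # engine search article encyclopedia
```

The resulting string is what would be handed to the full-text search engine in step (C).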
(C) Search Wikipedia with the generated query. The steps are detailed as follows:
1) Perform a full-text search of Wikipedia with the generated query. This method uses the full-text search engine Zettair, described in a paper published in the Proceedings of the Text Retrieval Conference in 2004 ("RMIT University at TREC 2004", Proceedings Text Retrieval Conference), to search the full text of Wikipedia and return a series of relevant article titles.
2) Sort the returned articles by their relevance to the query and take the top N articles; this yields a set of related Wikipedia articles. The value of N is adjustable.
Analysis of the structure of the returned Wikipedia articles 102: in this example, this part comprises the following steps:
(D) Extract the link structure from the Wikipedia articles, including incoming links and outgoing links. The steps are detailed as follows:
1) An incoming link connects an article elsewhere in Wikipedia to the current article. This method uses the MediaWiki API described in a paper published in the Proceedings of ISWC in 2006 ("Semantic MediaWiki", Proceedings of the 5th International Semantic Web Conference, 935-942, 2006) to obtain the full set of incoming links of an article.
2) An outgoing link points from the current article to another location in Wikipedia. Outgoing links appear as hyperlinks in the article text, so the outgoing-link set of an article is obtained by extracting all hyperlinks in the article.
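The outgoing-link extraction in item 2) might be sketched as below, assuming the article is available as raw wikitext in which internal links are written `[[Target|label]]`. The regular expression and the choice to skip namespaced targets (such as `Category:`) are assumptions made for illustration; incoming links are not covered because they require a query against a live MediaWiki instance.

```python
import re

# Captures the link target, ignoring an optional #section and |label.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def outgoing_links(wikitext):
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if ":" in target:  # skip namespaced links such as Category: or File:
            continue
        if target and target not in links:
            links.append(target)
    return links

sample = ("[[Zhejiang University|ZJU]] is in [[Hangzhou]], "
          "[[China#Geography|China]]. [[Category:Universities]]")
print(outgoing_links(sample))  # ['Zhejiang University', 'Hangzhou', 'China']
```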
(E) Extract category information and infobox parameter information from the Wikipedia articles. The steps are detailed as follows:
1) The category structure is an important feature of Wikipedia: it groups related articles together for the reader's convenience. In this step we extract the category information of each article to form a category set.
2) The infobox of a Wikipedia article summarizes the important information in the article. For each article that contains an infobox, the parameter values of the infobox are extracted to form a set of infobox parameter values, and the infobox parameter names are discarded.
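As an illustration of step (E), the following Python sketch pulls category names and infobox parameter values out of raw wikitext. It assumes a simple infobox with one `| name = value` field per line and no nested templates; real articles would need a proper wikitext parser.

```python
import re

def categories(wikitext):
    # Category names, ignoring an optional sort key after "|".
    found = re.findall(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]", wikitext)
    return [c.strip() for c in found]

def infobox_values(wikitext):
    values, inside = [], False
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("{{Infobox"):
            inside = True
        elif inside and line == "}}":
            break
        elif inside and line.startswith("|") and "=" in line:
            _name, value = line[1:].split("=", 1)  # parameter name is discarded
            if value.strip():
                values.append(value.strip())
    return values

sample = """{{Infobox university
| name = Zhejiang University
| established = 1897
| city = Hangzhou
}}
Text. [[Category:Universities in China]] [[Category:Hangzhou|*]]"""

print(infobox_values(sample))  # ['Zhejiang University', '1897', 'Hangzhou']
print(categories(sample))      # ['Universities in China', 'Hangzhou']
```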
Definition of new word features from Wikipedia structure and genre 103: in this example, this part comprises the following steps:
(F) Define new word features using the structural information of the Wikipedia articles. The steps are detailed as follows:
1) For each link in the incoming-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the incoming-link feature of the candidate word. The feature value is computed by the following function:
$$S_I(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{k \in IT(p_r)} \sigma_1(x_i, k) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |IT(p_r)|}$$
where $\Pi$ denotes the set of Wikipedia articles obtained earlier, $x_i$ denotes a candidate word, $p_r$ denotes an article in $\Pi$, $z(p_r)$ denotes the relevance score returned by the full-text search engine Zettair, $\sigma_1$ measures the similarity between two words using WordNet, $|\cdot|$ denotes the number of elements in a set, and $IT$ denotes the incoming-link set.
2) For each link in the outgoing-link structure of a Wikipedia article, use WordNet to compare its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the outgoing-link feature of the candidate word:
$$S_O(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{k \in OT(p_r)} \sigma_1(x_i, k) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |OT(p_r)|}$$
where $OT$ denotes the outgoing-link set; the other symbols are as defined in 1).
3) For each element of the category set of a Wikipedia article, use the Wikipedia category hierarchy to compute its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the category feature of the candidate word:
$$S_C(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{c \in C(p_r)} \sigma_2(x_i, c) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |C(p_r)|}$$
where $C$ denotes the category set of a Wikipedia article and $\sigma_2$ computes the similarity between two words using the Wikipedia category hierarchy; the other symbols are as defined in 1).
4) For each element of the infobox parameter value set of a Wikipedia article, use WordNet to compute its similarity with the candidate word, and take into account the score the full-text search engine returned for the article, to compute the infobox feature of the candidate word:
$$S_F(x_i, \Pi) = \frac{\sum_{p_r \in \Pi} \left[ z(p_r) \cdot \sum_{k \in IV(p_r)} \sigma_1(x_i, k) \right]}{\sum_{p_r \in \Pi} z(p_r) \cdot |IV(p_r)|}$$
where $IV$ denotes the set of infobox parameter values of a Wikipedia article; the other symbols are as defined in 1).
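The four structural features share one form: a relevance-weighted average of word similarities over a per-article set. A minimal Python sketch of that shared form is given below; `structure_feature` and `exact_match` are hypothetical names, and `exact_match` merely stands in for the WordNet or category-based similarity functions.

```python
def structure_feature(candidate, articles, word_sim):
    # articles: list of (relevance_score, items), where items is the
    # incoming-link, outgoing-link, category, or infobox-value set of one article.
    numerator = 0.0
    denominator = 0.0
    for score, items in articles:
        numerator += score * sum(word_sim(candidate, k) for k in items)
        denominator += score * len(items)
    return numerator / denominator if denominator else 0.0

def exact_match(a, b):
    # Hypothetical stand-in for a real word similarity measure.
    return 1.0 if a.lower() == b.lower() else 0.0

retrieved = [
    (2.0, ["university", "china"]),
    (1.0, ["hangzhou", "university", "lake", "city"]),
]
value = structure_feature("university", retrieved, exact_match)
print(value)  # 0.375
```

Here the numerator is 2.0 * 1 + 1.0 * 1 = 3 and the denominator is 2.0 * 2 + 1.0 * 4 = 8. Passing a different item set and similarity function yields each of the four features.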
(G) Define new word features using the genre information of the article. The steps are detailed as follows:
1) Extract the genre features of the article to determine its genre. This method uses the approach proposed in a paper published in 2001 ("The form is the substance: classification of genres in text", Proceedings of the Workshop on Human Language Technology and Knowledge Management, 1-8, 2001), which determines the genre of an article from multiple features, including structural features, character-level features, and presentation features.
2) Measure the genre similarity of two articles. This method uses an approach proposed by J. G. Stewart in a 2008 PhD dissertation ("Genre Oriented Summarization") to measure the genre similarity between articles.
3) Define the genre fitness between a word and an article from the number of times the word occurs in article titles. This method uses a collection of articles covering many genres. Given an article, find the 300 articles in the collection that are closest to it in genre, extract their titles, and remove the stop words from the titles. For each remaining word, count its occurrences and compute the genre fitness between the word and the article. The genre-based word weight function is defined as:
$$WO(w_i) = \sum_{k=1}^{n} \theta(d_j, d_{j,k})$$
where $\theta$ is the function from 2) that measures the genre similarity of two articles, and the $d_{j,k}$ are the 300 articles closest in genre to $d_j$.
4) Based on the result of 3), a genre-based word frequency function is further defined:
$$WF(w_k) = \frac{WO(w_k)}{\sum_{t=1}^{m} WO(w_t)}$$
where $m$ is the number of words that occur in the titles of the 300 articles. Based on the two formulas above, the genre fitness feature of a word is defined as:
$$\gamma(w_i, d_j) = \sum_{k=1}^{m} WF(w_k)\,\sigma_1(w_k, w_i)$$
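One possible reading of the three formulas in (G) is sketched below in Python. The text does not say exactly which neighbours contribute to the sum for a given word; the sketch assumes that a neighbour contributes its genre similarity once for every title word it contains. The function name `genre_fitness` and the exact-match similarity are likewise illustrative assumptions.

```python
def genre_fitness(candidate, neighbours, word_sim):
    # neighbours: list of (theta, title_words), where theta is the genre
    # similarity between the target article and one of its nearest articles.
    wo = {}
    for theta, title_words in neighbours:
        for w in set(title_words):
            # Assumption: each title containing w adds theta to WO(w).
            wo[w] = wo.get(w, 0.0) + theta
    total = sum(wo.values())
    if total == 0:
        return 0.0
    # WF(w) = WO(w) / total; the feature sums WF(w) * similarity(w, candidate).
    return sum((weight / total) * word_sim(w, candidate)
               for w, weight in wo.items())

def exact_match(a, b):
    # Hypothetical stand-in for the WordNet-based measure.
    return 1.0 if a == b else 0.0

neighbours = [
    (0.5, ["market", "report"]),
    (0.25, ["market", "news"]),
]
fitness = genre_fitness("market", neighbours, exact_match)
print(fitness)  # 0.5
```

In the example, the weight of "market" is 0.75 out of a total of 1.5, so its normalised frequency, and hence the fitness of the candidate "market" under exact matching, is 0.5.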
(H) Use several widely used word features. The steps are detailed as follows:
1) Compute the frequency feature of a word from the frequency with which it occurs in the article. Normalized tf.idf is used to measure word frequency; this method computes the tf.idf value as proposed in a 1987 technical report ("Term-weighting approaches in automatic text retrieval", Technical report, 1987).
2) Use the positions and number of occurrences of a word in the article to define its first-occurrence feature, average-occurrence feature, and last-occurrence feature. Whether the word is a proper name or place name is also used as a feature, as is the relative length of the word. Finally, if a word appears together with a summarizing phrase such as "in summary" or "in conclusion", its summary feature is defined as 1; otherwise it is 0.
Identification of candidate title words by machine learning 104: the support vector machine algorithm is run on the feature space produced above to classify candidate words as keywords or non-keywords. When training the support vector machine, words that appear in the title are used as positive examples and all other words as negative examples. The training data has the form $(F(w_1), y_1), \ldots, (F(w_n), y_n)$, where $F(w_j)$ is the feature vector of the $j$-th word and $y_j$ is the class label of the word, with value 1 or -1: 1 denotes a keyword and -1 a non-keyword. A support vector model is then trained and used to extract keywords. The extracted keywords are sorted by the magnitude of the signal output by the learned model; the higher a candidate word ranks, the more likely it is to be a keyword. The number of keywords extracted is controlled by the parameter M.
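The classification and ranking step could look roughly like the Python sketch below. The text does not specify the kernel or solver, so this is only an assumption: a linear SVM trained by plain subgradient descent on the hinge loss with L2 regularisation, with two-dimensional toy feature vectors. All names, learning-rate values, and data are hypothetical.

```python
def decision(x, w, b):
    # Signed score of the linear model; larger means more keyword-like.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(samples, labels, lam=0.01, eta=0.1, epochs=200):
    # Subgradient descent on hinge loss with L2 regularisation (assumed solver).
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            violated = y * decision(x, w, b) < 1
            w = [wi * (1 - eta * lam) for wi in w]
            if violated:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def top_m_keywords(candidates, w, b, m):
    # Rank candidate words by decision value and keep the best m.
    ranked = sorted(candidates,
                    key=lambda word: decision(candidates[word], w, b),
                    reverse=True)
    return ranked[:m]

# Toy training data: label 1 for title words, -1 for other words.
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = [1, 1, -1, -1]
w, b = train_linear_svm(train_x, train_y)

candidates = {
    "wikipedia": [0.85, 0.9],
    "title": [0.9, 0.7],
    "the": [0.1, 0.1],
    "very": [0.2, 0.15],
}
keywords = top_m_keywords(candidates, w, b, m=2)
```

The parameter `m` plays the role of M in the text: it caps how many of the top-ranked candidates are kept as keywords.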
Clustering and optimization to form the title 105: in this example, this part comprises the following steps:
(I) Cluster the identified keywords to form a preliminary title. The steps are detailed as follows:
1) Cluster the identified keywords. This method uses the approach proposed in a paper published in 2003 ("Headline Summarization at ISI", Proceedings of the HLT/NAACL Workshop on Automatic Summarization/DUC 2003, 2003) to cluster the keywords and form a preliminary title.
2) Identify the largest cluster window in the text and use the words in this window as the title. If the title length requirement is not met, identify the next largest window and add its words to the title, until the length requirement is met.
(J) title to preliminary generation is optimized, and details are as follows for its step:
1) utilize some syntax rules to optimize title, strengthen readable, this method has been used one piece of article (" Headline Summarization at ISI " that the HLT/NAACL magazine was announced in 2003, Proceedings ofHLT/NAACL workshop on Automatic Summarization/DUC2003,2003) syntax rule that is proposed is carried out the Optimizing operation of title.
2) utilize the POS label of word to optimize title, strengthen readable, this method has been used one piece of article (" Statistical Techniques for Natural Language Parsing " that the AI magazine was announced in 1997, AIMagazine, 18 (4): 33-44,1997) method is calculated the POS label of word.
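The window-based clustering of step (I) can be sketched roughly as follows. This is not the cited ISI method itself, only a minimal illustration under stated assumptions: each keyword occurrence is expanded by one word on each side, touching or overlapping windows are merged, the length requirement is read as a minimum word count, and selected windows are emitted in text order. The names `build_blocks` and `preliminary_title` are hypothetical.

```python
from typing import List, Set, Tuple


def build_blocks(tokens: List[str], keywords: Set[str]) -> List[Tuple[int, int]]:
    """Return merged [start, end) index ranges around keyword occurrences."""
    # One small window per keyword: the word plus its left and right neighbour.
    spans = [(max(i - 1, 0), min(i + 2, len(tokens)))
             for i, t in enumerate(tokens) if t.lower() in keywords]
    merged: List[Tuple[int, int]] = []
    for start, end in spans:  # spans are already ordered by start index
        if merged and start <= merged[-1][1]:
            # Window touches or overlaps the previous one: merge them.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def preliminary_title(tokens: List[str], keywords: Set[str],
                      min_len: int) -> List[str]:
    """Take the largest window, then the next largest, until min_len words."""
    blocks = sorted(build_blocks(tokens, keywords),
                    key=lambda b: b[1] - b[0], reverse=True)
    chosen: List[Tuple[int, int]] = []
    length = 0
    for block in blocks:
        chosen.append(block)
        length += block[1] - block[0]
        if length >= min_len:
            break
    chosen.sort()  # assumption: keep the original word order of the text
    return [tok for start, end in chosen for tok in tokens[start:end]]
```

The output of `preliminary_title` corresponds to the preliminary title that step (J) would then refine with grammar rules and POS tags.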

Claims (2)

1. A method for automatically generating a title using artificial intelligence technology, characterized by comprising the following steps:
1) acquiring background knowledge for the text: a text-to-query conversion technique is used to produce the query corresponding to the text, in which the important sentences in the text are detected and selected, meaningless words are then removed, and the remaining words are reduced to their base forms, the result being the generated query; the full-text search engine Zettair takes this query as input and searches Wikipedia, returning a set of Wikipedia articles;
2) analyzing the returned set of Wikipedia articles and extracting valuable information from it: for each returned Wikipedia article, its structure is analyzed and four different kinds of structural information are extracted, namely incoming links, outgoing links, categories, and infobox, and corresponding sets are formed;
3) defining new word features using the structural information of Wikipedia and the genre of the article: word features are defined from three aspects, namely features produced using the background knowledge of Wikipedia, features produced from the genre information of the article, and features produced from the information in the article itself, which together form a feature space;
4) based on the word feature space produced above, performing machine learning with a support vector machine to obtain a trained model, and using this model to extract candidate title words from the text;
5) using a clustering algorithm to join the extracted words together, and applying grammar rules to the resulting title so that it meets the requirement of fluency;
Said step 1) comprises:
A) constructing a graph from the sentences in the text, in which the nodes represent sentences, the edges connecting the nodes represent relations between sentences, and the weight of an edge is determined by the similarity of the two sentences, and using this graph to detect the important sentences in the text;
B) each key node obtained by the computation representing a key sentence, removing the meaningless words from the sentence according to a list of meaningless words;
C) reducing the words processed in step B) to their base forms, and then forming the query corresponding to the article from the remaining words;
D) inputting the generated query into the full-text search engine Zettair, which runs on Wikipedia, returns Wikipedia articles according to their relevance to the query, and ranks them by relevance, yielding a set of relevant Wikipedia articles;
Said step 2) comprises:
E) for each article in the set, extracting its incoming links to produce an incoming link set, an incoming link being a link from an article elsewhere in Wikipedia to the current article, and using the MediaWiki API to obtain the full set of incoming links of an article;
F) for each article in the set, extracting its outgoing links to form an outgoing link set, an outgoing link pointing from the current article to another location in Wikipedia; outgoing links exist as hyperlinks in the body of the article, and the outgoing link set of the article is obtained by extracting all hyperlinks in the article;
G) extracting the category information of each article to form a category set;
H) for each article that contains an infobox, extracting the parameter values in the infobox to form a set of infobox parameter values, while discarding the parameter names of the infobox;
Said step 3) comprises:
I) for each link in the incoming link structure of a Wikipedia article, using WordNet to compare its similarity with the candidate word, while also taking into account the score with which the full-text search engine returned the article, to compute the incoming link feature of the candidate word;
J) for each link in the outgoing link structure of a Wikipedia article, using WordNet to compare its similarity with the candidate word, while also taking into account the score with which the full-text search engine returned the article, to obtain the outgoing link feature of the candidate word;
K) for each element of the category set of a Wikipedia article, using the category hierarchy of Wikipedia to compute its similarity with the candidate word, while also taking into account the score of the article in the full-text search engine, to obtain the category feature of the candidate word;
L) for each element of the infobox parameter value set of a Wikipedia article, using WordNet to obtain its similarity with the candidate word, while also taking into account the score of the article in the full-text search engine, to obtain the infobox feature of the candidate word;
M) extracting article genre features, including appearance features, character features, and structural features, to measure the genre similarity of two articles;
N) using an article collection that covers many genres: given an article, finding in the collection the 300 articles closest to it in genre similarity, extracting their titles, removing the meaningless words from them, and, for each remaining word, computing its number of occurrences and the genre similarity between the word and the article;
O) additionally using a number of widely used word features: the frequency of the word in the article, the position of the word in the article, whether the word is a proper name or place name, the word length, and whether the word appears in a summarizing sentence;
Said step 4) comprises:
P) treating keyword extraction as a classification problem, running a support vector machine algorithm on the text feature space produced above, and classifying candidate words into keywords and non-keywords;
Q) when training with the support vector machine algorithm, using the words that appear in the title as positive examples and all other words as negative examples, then training a support vector model and using this model to extract keywords;
R) ranking the extracted keywords by the magnitude of the signal produced by machine learning, with a parameter M controlling the number of keywords extracted; the higher a candidate word ranks, the more likely it is to be a keyword.
2. The method for automatically generating a title using artificial intelligence technology according to claim 1, characterized in that said step 5) comprises:
S) marking the identified candidate title words in the text, each forming a small block together with the words to its left and right, and merging two small blocks into a larger block if they are adjacent;
T) when no more blocks can be merged, identifying the largest block in the text and using the words in this block as the title; if the title length requirement is not met, identifying the next largest block and adding its words to the title, until the length requirement is satisfied;
U) to further improve the readability of the title, applying a number of grammar rules to optimize the generated title, and also using POS tags to optimize the title; the title after these two optimizations is the final title.
CN2009101570162A 2009-12-31 2009-12-31 Method for automatically generating title by adopting artificial intelligence technology Expired - Fee Related CN101751455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101570162A CN101751455B (en) 2009-12-31 2009-12-31 Method for automatically generating title by adopting artificial intelligence technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009101570162A CN101751455B (en) 2009-12-31 2009-12-31 Method for automatically generating title by adopting artificial intelligence technology

Publications (2)

Publication Number Publication Date
CN101751455A CN101751455A (en) 2010-06-23
CN101751455B true CN101751455B (en) 2011-09-21

Family

ID=42478443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101570162A Expired - Fee Related CN101751455B (en) 2009-12-31 2009-12-31 Method for automatically generating title by adopting artificial intelligence technology

Country Status (1)

Country Link
CN (1) CN101751455B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650943A (en) * 2016-10-28 2017-05-10 北京百度网讯科技有限公司 Auxiliary writing method and apparatus based on artificial intelligence

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5197774B2 (en) * 2011-01-18 2013-05-15 株式会社東芝 Learning device, determination device, learning method, determination method, learning program, and determination program
CN105468933B (en) * 2014-08-28 2018-06-15 深圳先进技术研究院 biological data analysis method and system
CN106503002A (en) * 2015-09-07 2017-03-15 张晓晔 A kind of method for substituting title display of commodity main information with some labels
CN106383817B (en) * 2016-09-29 2019-07-02 北京理工大学 Utilize the Article Titles generation method of distributed semantic information
CN107203509B (en) * 2017-04-20 2023-06-20 北京拓尔思信息技术股份有限公司 Title generation method and device
CN107832299B (en) * 2017-11-17 2021-11-23 北京百度网讯科技有限公司 Title rewriting processing method and device based on artificial intelligence and readable medium
CN108549813A (en) * 2018-03-02 2018-09-18 彭根 Method of discrimination, device and pocessor and storage media
CN110555196B (en) * 2018-05-30 2023-07-18 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for automatically generating article
US10831803B2 (en) * 2018-07-26 2020-11-10 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for true product word recognition
CN110516227A (en) * 2019-03-28 2019-11-29 苏州八叉树智能科技有限公司 Title text generation method, device, electronic equipment and computer-readable medium
CN112732946B (en) * 2019-10-12 2023-04-18 四川医枢科技有限责任公司 Modular data analysis and database establishment method for medical literature
CN111737985B (en) * 2020-07-27 2021-02-12 江苏联著实业股份有限公司 Method and device for extracting process system from article title hierarchical structure
CN113918685A (en) * 2021-12-13 2022-01-11 中电云数智科技有限公司 Keyword extraction method and device

Also Published As

Publication number Publication date
CN101751455A (en) 2010-06-23

Similar Documents

Publication Publication Date Title
CN101751455B (en) Method for automatically generating title by adopting artificial intelligence technology
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN104699763B (en) The text similarity gauging system of multiple features fusion
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN103473283B (en) Method for matching textual cases
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN101719129A (en) Method for automatically extracting key words by adopting artificial intelligence technology
CN110298032A (en) Text classification corpus labeling training system
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN111581474B (en) Evaluation object extraction method of case-related microblog comments based on multi-head attention system
CN101127042A (en) Sensibility classification method based on language model
CN104484380A (en) Personalized search method and personalized search device
CN105512687A (en) Emotion classification model training and textual emotion polarity analysis method and system
CN101430695A (en) Automatic generation of ontologies using word affinities
CN104636465A (en) Webpage abstract generating methods and displaying methods and corresponding devices
CN105302793A (en) Method for automatically evaluating scientific and technical literature novelty by utilizing computer
CN104408093A (en) News event element extracting method and device
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN102637192A (en) Method for answering with natural language
CN104199965A (en) Semantic information retrieval method
CN101097570A (en) Advertisement classification method capable of automatic recognizing classified advertisement type
CN102737013A (en) Device and method for identifying statement emotion based on dependency relation
CN109960756A (en) Media event information inductive method
CN110851584B (en) Legal provision accurate recommendation system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110921

Termination date: 20131231