JP2010537286A5


Info

Publication number: JP2010537286A5
Application number: JP2010521289A
Authority: JP
Filing date: 2008-08-25
Publication date: 2011-10-13
Anticipated expiration: 2028-08-25

Claims

1. A computer-implemented method comprising:
Calculating a topic divergence value that is substantially proportional to a ratio of a first topic word distribution in a topic document corpus to a second topic word distribution in a document corpus;
Calculating a candidate topic word divergence value for a candidate topic word that is substantially proportional to a ratio of a first distribution of the candidate topic word in the topic document corpus to a second distribution of the candidate topic word in the document corpus; and
Determining whether the candidate topic word is a new topic word for the topic based on the candidate topic word divergence value and the topic divergence value;
Wherein:
The topic document corpus is a corpus of topic documents related to a topic;
The document corpus is a corpus of documents including the topic documents and other documents; and
The candidate topic word is a candidate for a topic word in a topic dictionary of the topic.
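As an illustration only, the divergence value of a word can be sketched as the ratio of its relative frequency in the topic document corpus to its relative frequency in the whole document corpus. The function names, the toy documents, and the `floor` guard against division by zero are assumptions made for this sketch; the claims only require a value "substantially proportional" to such a ratio.

```python
from collections import Counter


def word_distribution(docs):
    """Relative frequency of each word across a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def divergence(word, topic_dist, corpus_dist, floor=1e-9):
    """Ratio of topic-corpus frequency to whole-corpus frequency.

    `floor` is a hypothetical guard for words missing from the corpus distribution.
    """
    return topic_dist.get(word, 0.0) / max(corpus_dist.get(word, 0.0), floor)


# Toy data: the document corpus contains the topic documents plus other documents.
topic_docs = [["goal", "striker", "goal"], ["striker", "league", "the"]]
other_docs = [["the", "market", "the"], ["the", "league", "vote"]]
all_docs = topic_docs + other_docs

topic_dist = word_distribution(topic_docs)
corpus_dist = word_distribution(all_docs)

# Words concentrated in the topic documents score above 1,
# words spread evenly or concentrated elsewhere score at or below 1.
goal_value = divergence("goal", topic_dist, corpus_dist)
the_value = divergence("the", topic_dist, corpus_dist)
```

In this toy corpus "goal" occurs only in topic documents, so its ratio is higher than that of "the", which occurs mostly outside them.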

2. The method of claim 1, further comprising selecting an existing word in the topic dictionary for the topic as a topic word for which the topic divergence value is calculated.

3. The method of claim 1, wherein calculating the topic divergence value comprises:
Selecting topic words for the topic;
Calculating a topic word divergence value for each of the topic words that is substantially proportional to a ratio of a first distribution of that topic word in the topic document corpus to a second distribution of that topic word in the document corpus; and
Calculating the topic divergence value based on a central tendency of the topic word divergence values.
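A minimal sketch of the central-tendency step, assuming per-word divergence values have already been computed. The claim does not say which central tendency is used, so the mean is taken by default here and the median can be passed in instead; the function name and the sample values are hypothetical.

```python
from statistics import mean, median


def topic_divergence(topic_words, word_divergence, central=mean):
    """Collapse the divergence values of existing topic words into one topic value."""
    values = [word_divergence[w] for w in topic_words if w in word_divergence]
    if not values:
        raise ValueError("no divergence values available for the topic words")
    return central(values)


# Hypothetical per-word divergence values for illustration.
word_divergence = {"goal": 2.0, "striker": 2.0, "league": 1.0, "the": 0.5}

mean_value = topic_divergence(["goal", "striker", "league"], word_divergence)
median_value = topic_divergence(["goal", "striker", "league"], word_divergence, central=median)
```

The median is less sensitive to a single extreme topic word than the mean, which may matter when the topic dictionary is small.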

4. The method of claim 1, wherein the first distribution of the candidate topic word in the topic document corpus is proportional to a ratio of a distribution of the candidate topic word in the topic document corpus to a value based on a logarithm of the distribution.

5. The method of claim 1, wherein determining whether the candidate topic word is a new topic word comprises determining that the candidate topic word is a new topic word when the candidate topic word divergence value is larger than the topic divergence value.

6. The method of claim 1, further comprising storing the candidate topic word in the topic dictionary if the candidate topic word is determined to be a new topic word.
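The decision and storage steps can be sketched as a simple threshold test followed by a dictionary update. The function name, the use of a Python `set` as the topic dictionary, and the sample values are assumptions for illustration.

```python
def update_topic_dictionary(topic_dictionary, candidate_divergence, topic_divergence_value):
    """Add every candidate whose divergence value exceeds the topic divergence value.

    Returns the list of newly stored words, in the order they were examined.
    """
    added = []
    for word, value in candidate_divergence.items():
        if word in topic_dictionary:
            continue  # already a topic word, nothing to decide
        if value > topic_divergence_value:
            topic_dictionary.add(word)
            added.append(word)
    return added


# Hypothetical values for illustration.
topic_dictionary = {"goal", "striker"}
candidate_divergence = {"offside": 1.9, "market": 0.4, "goal": 2.0}

new_words = update_topic_dictionary(topic_dictionary, candidate_divergence, 1.5)
```

Here only "offside" is stored: "market" falls below the topic divergence value and "goal" is already in the dictionary.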

7. The method of claim 1, further comprising:
Identifying documents related to the topic in the document corpus;
Generating document clusters related to the topic;
Identifying words in each of the document clusters; and
Selecting candidate topic words from the identified words in each of the document clusters.

8. The method of claim 1, further comprising:
Calculating first word frequencies for existing words and a candidate word in a training corpus comprising a first subset of the document corpus, the candidate word being defined by a sequence of component words, each of which is an existing word in a dictionary;
Calculating second word frequencies for the component words and the candidate word in a development corpus comprising a second subset of the document corpus;
Calculating a candidate word entropy measure based on the second word frequency of the candidate word and the first word frequencies of the component words and the candidate word;
Calculating an existing word entropy measure based on the second word frequencies of the component words and the first word frequencies of the component words and the candidate word; and
Determining that the candidate word is a candidate topic word if the candidate word entropy measure exceeds the existing word entropy measure.

9. The method of claim 8, wherein:
Calculating the first word frequencies for the existing words and the candidate word in the training corpus comprises training a language model for the probabilities of the existing words and the candidate word in the training corpus; and
Calculating the second word frequencies for the component words and the candidate word in the development corpus comprises calculating a word count value for each of the component words and the candidate word in the development corpus.

10. The method of claim 9, wherein:
Calculating the candidate word entropy measure based on the second word frequency of the candidate word and the first word frequencies of the component words and the candidate word comprises:
Calculating a first logarithmic value based on the probabilities of the candidate word and the component words; and
Calculating the candidate word entropy measure based on the word count value of the candidate word and the first logarithmic value; and
Calculating the existing word entropy measure based on the second word frequencies of the component words and the first word frequencies of the component words and the candidate word comprises:
Calculating a second logarithmic value based on the probabilities of the candidate word and the component words; and
Calculating the existing word entropy measure based on the word count values of the component words and the second logarithmic value.
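The claims state only that each entropy measure combines a word count with a logarithmic value; they do not give the formulas. The sketch below is one possible instantiation and should be read as an assumption, not as the claimed computation: the candidate measure weights the candidate's count by the log ratio of its probability to the product of its component probabilities, and the existing word measure weights each component's count by the log of how much its probability would shrink if the candidate's probability mass were removed from it. All names and numbers are hypothetical.

```python
import math


def candidate_measure(count_candidate, p_candidate, p_components):
    """count(candidate) * log(p(candidate) / product of p(component)). Assumed form."""
    joint = math.prod(p_components)
    return count_candidate * math.log(p_candidate / joint)


def existing_measure(component_counts, p_candidate, p_components):
    """Sum of count(component) * log(p(component) / (p(component) - p(candidate))). Assumed form."""
    total = 0.0
    for count, p in zip(component_counts, p_components):
        if p <= p_candidate:
            raise ValueError("component probability must exceed candidate probability")
        total += count * math.log(p / (p - p_candidate))
    return total


# Hypothetical training-corpus probabilities and development-corpus counts.
p_candidate = 0.01
p_components = (0.02, 0.03)
count_candidate = 50
component_counts = (60, 90)

cand = candidate_measure(count_candidate, p_candidate, p_components)
exist = existing_measure(component_counts, p_candidate, p_components)
is_new_word = cand > exist
```

With these made-up numbers the candidate measure exceeds the existing word measure, so the candidate would be treated as a new word under the comparison in claim 8.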

11. The method of claim 1, wherein the candidate topic word comprises one or more Hanzi characters.

12. A computer-implemented method comprising:
Selecting a topic dictionary comprising topic words related to a topic;
Calculating a topic word divergence value based on the topic words, a document corpus, and a topic document corpus;
Calculating a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus; and
Determining whether the candidate topic word is a new topic word for the topic based on the candidate topic word divergence value and the topic word divergence value;
Wherein:
The topic words relate to the topic;
The document corpus is a corpus of documents including topic documents and other documents;
The topic document corpus is a corpus of the topic documents related to the topic; and
The candidate topic word is a candidate for a topic word in the topic dictionary.

13. The method of claim 12, further comprising storing the candidate topic word in the topic dictionary if the candidate topic word is determined to be a new topic word.

14. The method of claim 12, wherein calculating the topic word divergence value comprises:
Selecting existing topic words in the topic dictionary;
Calculating an existing topic word divergence value for each of the existing topic words based on the document corpus and the topic document corpus; and
Calculating the topic word divergence value based on a central tendency of the existing topic word divergence values.

15. The method of claim 12, wherein calculating the candidate topic word divergence value for the candidate topic word based on the document corpus and the topic document corpus comprises:
Calculating a first probability associated with the candidate topic word in the topic document corpus;
Calculating a second probability associated with the candidate topic word in the document corpus; and
Calculating the candidate topic word divergence value based on a ratio of the first probability to a product of the second probability and a logarithmic value based on the first probability.
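Read literally, the last step of this claim describes a relation of the following shape. The symbols are introduced here for illustration and do not appear in the claims: $P_T(w)$ is the first probability (candidate word $w$ in the topic document corpus), $P(w)$ is the second probability (the same word in the document corpus), and $\ell(\cdot)$ stands for the unspecified "logarithmic value based on the first probability".

```latex
D(w) \;\propto\; \frac{P_T(w)}{P(w)\,\ell\big(P_T(w)\big)}
```

The claim does not fix the base of the logarithm, the exact form of $\ell$, or the constant of proportionality.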

16. The method of claim 12, wherein the candidate topic word comprises one or more Hanzi characters.

17. An apparatus comprising software stored in a non-transitory computer readable medium, wherein:
The software comprises computer readable instructions;
The computer readable instructions are executable by a computer processing device and, upon such execution, cause the computer processing device to:
Calculate a topic word divergence value based on a topic word, a document corpus, and a topic document corpus;
Calculate a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus; and
Determine, based on the candidate topic word divergence value and the topic word divergence value, whether the candidate topic word is a topic word for the topic and, if the candidate topic word is determined to be a topic word, store the candidate topic word in the topic dictionary;
The topic word is a word in a topic dictionary related to the topic;
The document corpus is a corpus of documents including topic documents and other documents;
The topic document corpus is a corpus of the topic documents related to the topic; and
The candidate topic word is a candidate for a topic word in the topic dictionary.

18. A system comprising:
A data store;
A topic word processing module; and
A dictionary updater module;
Wherein:
The data store stores a topic dictionary with topic words related to a topic;
The topic word processing module is configured to:
Calculate a topic word divergence value based on a topic word that is a word in the topic dictionary related to the topic, a document corpus that is a corpus of documents including topic documents and other documents, and a topic document corpus that is a corpus of the topic documents related to the topic;
Select a candidate topic word as a candidate for a topic word in the topic dictionary;
Calculate a candidate topic word divergence value for the candidate topic word based on the document corpus and the topic document corpus; and
Determine, based on the candidate topic word divergence value and the topic word divergence value, whether the candidate topic word is a topic word for the topic; and
The dictionary updater module is configured to store the candidate topic word in the topic dictionary when it is determined that the candidate topic word is a topic word.
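One way to picture the three components is as three small classes. This is a sketch under stated assumptions rather than the claimed system: the class and method names are hypothetical, the divergence value is taken as a plain frequency ratio, and the topic word divergence value is taken as the mean over the existing topic words.

```python
from collections import Counter
from statistics import mean


class DataStore:
    """Holds one topic dictionary (a set of topic words) per topic name."""

    def __init__(self):
        self.topic_dictionaries = {}


class TopicWordProcessor:
    """Computes divergence values from a topic document corpus and a document corpus."""

    def __init__(self, topic_docs, all_docs):
        self.topic_dist = self._distribution(topic_docs)
        self.corpus_dist = self._distribution(all_docs)

    @staticmethod
    def _distribution(docs):
        counts = Counter(word for doc in docs for word in doc)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    def divergence(self, word):
        corpus_p = self.corpus_dist.get(word, 0.0)
        if corpus_p == 0.0:
            return 0.0  # unseen word: treated as non-divergent in this sketch
        return self.topic_dist.get(word, 0.0) / corpus_p

    def is_topic_word(self, candidate, topic_words):
        topic_value = mean(self.divergence(w) for w in topic_words)
        return self.divergence(candidate) > topic_value


class DictionaryUpdater:
    """Stores accepted candidates in the data store's topic dictionary."""

    def __init__(self, store):
        self.store = store

    def store_word(self, topic, word):
        self.store.topic_dictionaries.setdefault(topic, set()).add(word)


# Toy usage with made-up documents.
topic_docs = [["goal", "striker", "goal", "offside"], ["striker", "league", "the", "offside"]]
other_docs = [["the", "market", "the"], ["the", "league", "vote"]]

store = DataStore()
store.topic_dictionaries["football"] = {"goal", "league"}
processor = TopicWordProcessor(topic_docs, topic_docs + other_docs)
updater = DictionaryUpdater(store)

for candidate in ["offside", "the"]:
    if processor.is_topic_word(candidate, store.topic_dictionaries["football"]):
        updater.store_word("football", candidate)
```

Keeping the updater separate from the processor mirrors the split in the claim: one component decides, the other writes to the data store.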

19. The system of claim 18, wherein the topic word processing module is configured to:
Calculate a first probability associated with the candidate topic word in the topic document corpus;
Calculate a second probability associated with the candidate topic word in the document corpus; and
Calculate the candidate topic word divergence value based on a ratio of the first probability to a product of the second probability and a logarithmic value based on the first probability.

20. A method comprising:
Calculating a divergence threshold for a topic document corpus;
Calculating a candidate word divergence value for a candidate word; and
Determining that the candidate word is a topic word for the topic if the candidate word divergence value exceeds the divergence threshold;
Wherein:
The divergence threshold is proportional to a ratio of a first topic word probability for a topic word in the topic document corpus to a second topic word probability for the topic word in a document corpus;
The topic document corpus is a corpus of topic documents related to a topic;
The topic word is a word in a topic dictionary related to the topic;
The document corpus is a corpus of documents including the topic documents and other documents; and
The candidate word divergence value is proportional to a ratio of a first candidate word probability for the candidate word associated with the topic document corpus to a second candidate word probability for the candidate word associated with the document corpus.

21. A system comprising:
Means for calculating a topic divergence value;
Means for calculating a candidate topic word divergence value for a candidate topic word; and
Means for determining whether the candidate topic word is a new topic word for the topic based on the candidate topic word divergence value and the topic divergence value;
Wherein:
The topic divergence value is substantially proportional to a ratio of a first topic word distribution in a topic document corpus to a second topic word distribution in a document corpus;
The topic document corpus is a corpus of topic documents related to a topic;
The document corpus is a corpus of documents including the topic documents and other documents;
The candidate topic word divergence value is substantially proportional to a ratio of a first distribution of the candidate topic word in the topic document corpus to a second distribution of the candidate topic word in the document corpus; and
The candidate topic word is a candidate for a topic word in a topic dictionary of the topic.

22. A system comprising:
Means for selecting a topic dictionary comprising topic words related to a topic;
Means for calculating a topic word divergence value based on a topic word, a document corpus, and a topic document corpus;
Means for calculating a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus; and
Means for determining whether the candidate topic word is a new topic word for the topic based on the candidate topic word divergence value and the topic word divergence value;
Wherein:
The topic word is a word in the topic dictionary;
The document corpus is a corpus of documents including topic documents and other documents;
The topic document corpus is a corpus of the topic documents related to the topic; and
The candidate topic word is a candidate for a topic word in the topic dictionary.

23. A computing device comprising:
Means for calculating a topic word divergence value based on a topic word, a document corpus, and a topic document corpus;
Means for calculating a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus;
Means for determining whether the candidate topic word is a topic word based on the candidate topic word divergence value and the topic word divergence value; and
Means for storing the candidate topic word in the topic dictionary if the candidate topic word is determined to be a topic word;
Wherein:
The topic word is a word in a topic dictionary related to a topic;
The document corpus is a corpus of documents including topic documents and other documents;
The topic document corpus is a corpus of the topic documents related to the topic; and
The candidate topic word is a candidate for a word in the topic dictionary.

24. A system comprising:
Means for calculating a divergence threshold for a topic document corpus;
Means for calculating a candidate word divergence value for a candidate word; and
Means for determining that the candidate word is a topic word for the topic if the candidate word divergence value exceeds the divergence threshold;
Wherein:
The divergence threshold is proportional to a ratio of a first topic word probability for a topic word to a second topic word probability for the topic word in a document corpus;
The topic document corpus is a corpus of topic documents related to a topic;
The topic word is a word in a topic dictionary related to the topic;
The document corpus is a corpus of documents including the topic documents and other documents; and
The candidate word divergence value is proportional to a ratio of a first candidate word probability for the candidate word associated with the topic document corpus to a second candidate word probability for the candidate word associated with the document corpus.

25. A computer-implemented method comprising:
Calculating first word frequencies for existing words and a candidate word in a training corpus, the candidate word being defined by a sequence of component words, each of which is an existing word in a dictionary;
Calculating second word frequencies for the component words and the candidate word in a development corpus;
Calculating a candidate word entropy-related measure based on the second word frequency of the candidate word and the first word frequencies of the component words and the candidate word;
Calculating an existing word entropy-related measure based on the second word frequencies of the component words and the first word frequencies of the component words and the candidate word; and
Determining that the candidate word is a new word if the candidate word entropy-related measure exceeds the existing word entropy-related measure.

26. The method of claim 25, wherein the training corpus and the development corpus comprise web documents.

27. The method of claim 25, further comprising adding the candidate word to an existing word dictionary if the candidate word is determined to be a new word.

28. The method of claim 25, wherein:
Calculating the first word frequencies comprises training a language model for the probabilities of the existing words and the candidate word in the training corpus; and
Calculating the second word frequencies comprises calculating a word count value for each of the component words and the candidate word in the development corpus.

29. The method of claim 25, wherein:
Calculating the candidate word entropy-related measure comprises:
Calculating a first logarithmic value based on the probabilities of the candidate word and the component words; and
Calculating the candidate word entropy-related measure based on the word count value of the candidate word and the first logarithmic value; and
Calculating the existing word entropy-related measure comprises:
Calculating a second logarithmic value based on the probabilities of the candidate word and the component words; and
Calculating the existing word entropy-related measure based on the word count values of the component words and the second logarithmic value.

30. The method of claim 25, wherein each word comprises one or more Hanzi characters.

31. The method of claim 25, wherein each word comprises one or more ideographic characters.

32. The method of claim 25, further comprising the step of updating the dictionary with the candidate word if the candidate word is determined to be a new word.

33. A computer-implemented method comprising:
Calculating first word probabilities for existing words in a first corpus and for a candidate word defined by a sequence of component words, each of which is an existing word in a dictionary;
Calculating second word probabilities for the component words and the candidate word in a second corpus;
Calculating a first entropy-related value based on the second word probability of the candidate word and the first word probabilities of the candidate word and the component words;
Calculating a second entropy-related value based on the second word probabilities of the component words and the first word probabilities of the candidate word and the component words; and
Determining that the candidate word is a new word if the first entropy-related value exceeds the second entropy-related value.

34. The method of claim 33, wherein identifying a word corpus comprises identifying a web document.

35. The method of claim 33, wherein calculating the first word probabilities comprises training a language model on the first corpus with respect to the word probabilities of the existing words and the candidate word in the first corpus, and calculating the second word probabilities comprises calculating a word count value for each of the component words and the candidate word.

36. The method of claim 35, wherein:
Calculating the first entropy-related value comprises:
Calculating a first logarithmic value based on the first word probabilities of the candidate word and the component words; and
Calculating the first entropy-related value based on the word count value of the candidate word and the first logarithmic value; and
Calculating the second entropy-related value comprises:
Calculating a second logarithmic value based on the first word probabilities of the candidate word and the component words; and
Calculating the second entropy-related value based on the word count values of the component words and the second logarithmic value.

37. The method of claim 33, wherein each word comprises one or more Hanzi characters.

38. A computer-implemented method comprising:
Dividing a collection of web documents into a training corpus and a development corpus;
Training a language model on the training corpus with respect to a first word probability of words in the training corpus;
Counting the number of occurrences of a candidate word and of two or more corresponding words in the development corpus;
Calculating a first value based on the number of occurrences of the candidate word in the development corpus and the first word probability;
Calculating a second value based on the number of occurrences of the two or more corresponding words in the development corpus and the first word probability;
Comparing the first value to the second value; and
Determining whether the candidate word is a new word based on the comparison;
Wherein the words in the training corpus include the candidate word, which is defined by a sequence of two or more corresponding words in the training corpus that are existing words in a dictionary.
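The data-preparation steps of this claim (dividing the collection, estimating word probabilities on the training part, counting occurrences in the development part) can be sketched as follows. A crude relative-frequency estimate stands in for the language model, which is an assumption of this sketch; the function names and the toy documents are likewise hypothetical.

```python
from collections import Counter


def split_corpus(docs, train_size):
    """First `train_size` documents form the training corpus, the rest the development corpus."""
    return docs[:train_size], docs[train_size:]


def count_sequence(docs, sequence):
    """Number of times the word sequence occurs contiguously in the tokenized documents."""
    seq = tuple(sequence)
    n = len(seq)
    return sum(
        1
        for doc in docs
        for i in range(len(doc) - n + 1)
        if tuple(doc[i:i + n]) == seq
    )


def estimate_probabilities(train_docs, candidate):
    """Relative-frequency estimates for single words and for the candidate sequence.

    Both are divided by the training token count; this is not a normalized model.
    """
    counts = Counter(word for doc in train_docs for word in doc)
    total = sum(counts.values())
    probs = {word: count / total for word, count in counts.items()}
    probs[tuple(candidate)] = count_sequence(train_docs, candidate) / total
    return probs


docs = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["york", "new"],
    ["new", "car"],
    ["new", "york"],
]
candidate = ("new", "york")

train_docs, dev_docs = split_corpus(docs, train_size=3)
probs = estimate_probabilities(train_docs, candidate)

dev_candidate_count = count_sequence(dev_docs, candidate)
dev_component_counts = [count_sequence(dev_docs, [word]) for word in candidate]
```

The probabilities and counts produced here are the inputs that the first value and the second value in the claim would be computed from.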

39. The method of claim 38, further comprising adding the candidate word to the dictionary if the candidate word is determined to be a new word.

40. The method of claim 38, wherein training a language model on the training corpus with respect to a first word probability of words in the training corpus comprises training an n-gram language model.

41. The method of claim 40, wherein:
Calculating the first value based on the number of occurrences of the candidate word in the development corpus and the first word probability comprises:
Calculating a first logarithmic value based on the first word probability of the candidate word and the first word probability of the two or more corresponding words; and
Multiplying the first logarithmic value by the counted number of occurrences of the candidate word; and
Calculating the second value based on the number of occurrences of the two or more corresponding words in the development corpus and the first word probability comprises:
Calculating a second logarithmic value based on the first word probability of the candidate word and the first word probability of the two or more corresponding words; and
Multiplying the second logarithmic value by the counted number of occurrences of the two or more corresponding words.

42. The method of claim 41, wherein each of the words comprises one or more Hanzi characters.

43. A system comprising:
A word processing module comprising computer instructions stored in a computer readable medium that, when executed by a computing device, cause the computing device to access a word corpus, divide the word corpus into a training corpus and a development corpus, and generate:
First word probabilities for words stored in the training corpus, the words including a candidate word comprising two or more corresponding words; and
Second word probabilities for the words in the development corpus; and
A new word analyzer module comprising computer instructions stored in a computer readable medium that, when executed by a computing device, cause the computing device to process the first word probabilities and the second word probabilities and generate:
A first value based on the first word probabilities for the candidate word and the two or more corresponding words and on the second word probability for the candidate word; and
A second value based on the first word probabilities for the candidate word and the two or more corresponding words and on the second word probabilities for the two or more corresponding words;
Wherein the system is further configured to compare the first value with the second value and determine whether the candidate word is a new word based on the comparison.

44. The system of claim 43, further comprising a dictionary updater module comprising computer instructions stored in a computer readable medium that, when executed by a computing device, cause the computing device to update a dictionary with the identified new word.

45. The system of claim 43, wherein the word processing module comprises an n-gram language model.

46. The system of claim 43, wherein the first value and the second value are entropy-related values.

47. The system of claim 44, wherein the word corpus comprises a web document.

48. The system of claim 43, wherein the word processing module comprises a Hanzi character processing module.

49. The system of claim 48, wherein each word comprises one or more Hanzi characters.

50. An apparatus comprising software stored in a computer readable medium, wherein:
The software comprises computer readable instructions that are executable by a computer processing device; and
The computer readable instructions, when executed, cause the computer processing device to:
Calculate first word frequencies for existing words and a candidate word in a training corpus, the candidate word being defined by a sequence of component words, each of which is an existing word in a dictionary;
Calculate second word frequencies for the component words and the candidate word in a development corpus;
Calculate a candidate word entropy-related measure based on the second word frequency of the candidate word and the first word frequencies of the component words and the candidate word;
Calculate an existing word entropy-related measure based on the second word frequencies of the component words and the first word frequencies of the component words and the candidate word; and
Determine that the candidate word is a new word if the candidate word entropy-related measure exceeds the existing word entropy-related measure.

51. A system comprising:
Means for calculating first word probabilities for existing words in a first corpus and for a candidate word defined by component words, each of which is an existing word in a dictionary;
Means for calculating second word probabilities for the component words and the candidate word in a second corpus;
Means for calculating a first entropy-related value based on the second word probability of the candidate word and the first word probabilities of the candidate word and the component words;
Means for calculating a second entropy-related value based on the second word probabilities of the component words and the first word probabilities of the candidate word and the component words; and
Means for determining whether the candidate word is a new word based on a comparison between the first entropy-related value and the second entropy-related value.

52. A system comprising:
A word processing means configured to access a word corpus, divide the word corpus into a training corpus and a development corpus, and generate:
First word probabilities for words stored in the training corpus, the words including a candidate word comprising two or more corresponding words; and
Second word probabilities for the words in the development corpus; and
A new word analyzer means configured to receive the first word probabilities and the second word probabilities and generate:
A first value based on the first word probabilities for the candidate word and the two or more corresponding words and on the second word probability for the candidate word; and
A second value based on the first word probabilities for the candidate word and the two or more corresponding words and on the second word probabilities for the two or more corresponding words;
Wherein the system is further configured to compare the first value and the second value and determine whether the candidate word is a new word based on the comparison.