WO2009123260A1 - Co-occurrence Dictionary Creation System and Scoring System - Google Patents
- Publication number
- WO2009123260A1 (PCT/JP2009/056804)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- occurrence
- typicality
- score
- relationship
- collected
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Definitions
- the present invention relates to a co-occurrence dictionary generation system, a scoring system, a co-occurrence dictionary generation method, a scoring method, and a program.
- This application claims priority based on Japanese Patent Application No. 2008-094980, filed in Japan on April 1, 2008, and Japanese Patent Application No. 2008-124254, filed in Japan on May 12, 2008, the contents of which are incorporated herein.
- word co-occurrence information is often used.
- An example of a conventional co-occurrence dictionary creation system is described in Patent Document 1.
- The co-occurrence dictionary creation system of Patent Document 1 includes: a document analysis unit that analyzes a given document set; a word extraction unit that extracts the words present in the document set and stores them in a storage device; a unit that extracts the word chains present in the document set and stores them in the storage device; a co-occurrence number detection unit that detects the number of co-occurrences between each word and word chain and stores it in the storage device; and a conceptual information quantification unit that computes the co-occurrence degree from the co-occurrence counts and, based on it, quantifies the conceptual information of each word and stores it in the storage device.
- A word chain is a sequence of n consecutive words (n is 2 or more) in the document.
- In Patent Document 1, each sentence in the document set is first subjected to morphological analysis. Next, all words and word chains (chains of two or more words) are extracted from the morphological analysis results and stored in the storage device. Next, for each extracted independent word (noun, pronoun, verb, adjective, adverb) or word chain, the co-occurrence number detection unit extracts the independent words or word chains that co-occur with it and counts the number of occurrences. The co-occurrence number detection unit sends the count result to the conceptual information quantification unit. Here, an appearance is counted when a word or word chain co-occurs within a predetermined document range.
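The counting scheme just described, in which every pair that co-occurs within the chosen document range is counted, can be sketched in a few lines. This is an illustrative reconstruction, not code from Patent Document 1; the function name and the sentence-level range are assumptions:

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(sentences):
    """Prior-art style counting: every unordered pair of words appearing
    together in the same document range (here, a sentence) is counted,
    regardless of whether the pair has any grammatical relation."""
    counts = Counter()
    for words in sentences:
        # sorted(set(...)) gives each unordered pair a canonical order
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts

# The curry example from the text: all six pairs are collected,
# including semantically unrelated ones such as ("curry", "salty").
counts = count_cooccurrences([["curry", "spicy", "fukujinzuke", "salty"]])
print(counts[("curry", "salty")])  # 1
print(len(counts))                 # 6
```

This illustrates the first problem discussed below: pairs with no semantic relation are stored alongside meaningful ones.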
- the “predetermined document range” is any one of a document, a paragraph, and a sentence.
- the conceptual information quantification unit calculates the degree of co-occurrence of each of the extracted words or word chains with each of the words or word chains.
- The co-occurrence degree is, for example, a value obtained by dividing the number of co-occurrences by the number of appearances of one of the words constituting the co-occurrence pair.
- the first problem in the prior art is that it is difficult to generate a high-quality co-occurrence dictionary.
- The reason is that the co-occurrence dictionary creation system described in Patent Document 1 collects all co-occurrences within a fixed range such as a document, paragraph, or sentence, so many of the collected co-occurrences have no semantic relation. For example, consider acquiring co-occurrence information from the sentence "curry is spicy, but fukujinzuke is salty".
- In Patent Document 1, pairs such as "curry, spicy", "curry, fukujinzuke", "fukujinzuke, salty", "curry, salty", and "fukujinzuke, spicy" are all acquired as co-occurrences.
- The second problem in the prior art is that storing the co-occurrence information requires a large amount of storage area, which inflates the storage capacity of the co-occurrence dictionary.
- The reason is that in the co-occurrence dictionary creation system of Patent Document 1, the number of word chains in a document set grows rapidly as the chain length n increases, so a large number of expressions consisting of multiple words (referred to as compound expressions) must be stored.
- JP 2006-215850 A
- Akiko Aizawa, "Similarity Scales Based on Co-occurrence", Operations Research, Vol. 52, No. 11, pp. 706-718, 2007.
- T. Hofmann, "Probabilistic Latent Semantic Indexing", Proc. of SIGIR'99, pp. 50-57, 1999.
- M. A. Hearst, "TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages", Computational Linguistics, Vol. 23, No. 1, pp. 33-64, 1997.
- The present invention has been made in view of the above problems, and its object is to provide a co-occurrence dictionary generation system, a scoring system, a co-occurrence dictionary generation method, a scoring method, and a program that can create a co-occurrence dictionary in consideration of semantic relationships.
- Another object of the present invention is to provide a co-occurrence dictionary generation system, a scoring system, a co-occurrence dictionary generation method, a scoring method, and a program capable of creating a co-occurrence dictionary that needs only a small storage area for compound expressions, by extracting only meaningful compound expressions.
- A co-occurrence dictionary generation system includes: a language analysis unit that performs morphological analysis on text, identifies phrases, and analyzes the dependency between phrases; a co-occurrence relationship collection unit that collects, as co-occurrence relationships, the co-occurrence of nouns in the text, the dependency between nouns and predicates, and the dependency between predicates; a co-occurrence score calculation unit that calculates a co-occurrence score for each co-occurrence relationship based on its collected frequency; and a co-occurrence dictionary storage unit that stores a co-occurrence dictionary describing the correspondence between each co-occurrence relationship and its calculated co-occurrence score.
- the unit constituting the co-occurrence relationship is a phrase
- Noun phrases and nouns, and predicate phrases and predicates, need not be distinguished from each other here. For this reason, "phrase" may be omitted below.
- When the word "word" is specified, it represents only a word, not a phrase.
- Since the unit of co-occurrence collection is the phrase, only meaningful compound expressions are extracted, and a co-occurrence dictionary that needs only a small storage area for compound expressions can be created.
- FIG. 1 is a general block diagram of an information processing system that implements a system according to an embodiment of the present invention.
- FIG. 1 is a block diagram showing the configuration of the first exemplary embodiment of the present invention.
- The first embodiment of the present invention comprises a corpus input unit 1 that inputs the text from which co-occurrence relationships are collected, a storage device 2 that stores the text and the generated co-occurrence dictionary, a data processing device 3 that operates under program control, and a co-occurrence dictionary display unit 4 that displays the contents of the generated co-occurrence dictionary.
- the corpus input unit 1 causes the corpus storage unit 20 to store text data that is a collection source of co-occurrence relationships.
- the corpus is composed of “text” representing a text body and “ID” representing an identifier of each data.
- the “ID” may be specified in advance, or may be automatically assigned, for example, by assigning an ID so as to be an integer serial number in the order of input.
- the storage device 2 includes a corpus storage unit 20 and a co-occurrence dictionary storage unit 21.
- the corpus storage unit 20 stores the text data input by the corpus input unit 1.
- the co-occurrence dictionary storage unit 21 stores the co-occurrence dictionary generated by the co-occurrence dictionary generation unit 30.
- the data processing device 3 includes a co-occurrence dictionary generation unit 30 and a co-occurrence dictionary output unit 31.
- the co-occurrence dictionary generation unit 30 includes a language analysis unit 300, a co-occurrence relation collection unit 301, and a co-occurrence score calculation unit 302.
- the language analysis unit 300 reads text data from the corpus storage unit 20 and performs morphological analysis, phrase identification, and dependency analysis between phrases on each text data.
- the language analysis unit 300 outputs the analysis result to the co-occurrence relationship collection unit 301.
- The co-occurrence relationship collection unit 301 collects co-occurrence relationships based on the analysis result of the language analysis unit 300. In addition, the co-occurrence relationship collection unit 301 acquires the nouns and predicates and the frequency of each co-occurrence relationship. It then outputs the collected co-occurrence relationships and their frequencies to the co-occurrence score calculation unit 302.
- the co-occurrence relationship between nouns is collected when each noun co-occurs in a predetermined document range.
- the predetermined document range is one of a document, a paragraph, and a sentence.
- When the co-occurrence score calculation unit 302 receives the nouns, predicates, and co-occurrence relationships together with their frequencies, it calculates a co-occurrence score for each co-occurrence relationship. Then, the co-occurrence score calculation unit 302 stores each co-occurrence relationship and its calculated co-occurrence score in the co-occurrence dictionary storage unit 21.
- The co-occurrence score expresses the degree to which two words are used together, and is calculated so that the score is higher for pairs that are more likely to be used together. Any method of calculating co-occurrence strength can be used for the co-occurrence score.
- the frequency may be used as it is as a co-occurrence score.
- the logarithm of the frequency may be taken as the co-occurrence score so that the high-frequency co-occurrence relationship does not become too advantageous.
- The co-occurrence score may also be a value obtained by dividing the frequency of the co-occurrence relationship by the frequency of one of its two words, or by the sum of both frequencies. Note that words that are more likely to be used together have a stronger semantic relation, and words that are less likely to be used together have a weaker one.
- A Dice coefficient, pointwise mutual information, a Jaccard coefficient, or other measures of co-occurrence strength described in Non-Patent Document 1 may also be used.
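For concreteness, these three measures can be computed directly from frequencies. A minimal sketch (the function names are mine; f_xy is the pair frequency, f_x and f_y the single-word frequencies, and n the total number of observations):

```python
import math

def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)

def jaccard(f_xy, f_x, f_y):
    """Jaccard coefficient: f(x, y) / (f(x) + f(y) - f(x, y))."""
    return f_xy / (f_x + f_y - f_xy)

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log of P(x, y) / (P(x) * P(y)),
    with probabilities estimated from frequencies over n observations."""
    return math.log((f_xy / n) / ((f_x / n) * (f_y / n)))

print(round(dice(30, 66, 110), 2))     # 0.34
print(round(jaccard(30, 66, 110), 2))  # 0.21
```

All three rise with the pair frequency and fall as the individual word frequencies grow, which matches the requirement stated above.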
- Alternatively, the method of Non-Patent Document 2, which estimates the co-occurrence probability of any two words from a set of co-occurrence relationships, may be used.
- In the method of Non-Patent Document 2, the co-occurrence probability of two words w_i and w_j (1 ≤ i, j ≤ n, where n is the number of word types constituting the co-occurrence relationships) is modeled as P(w_i, w_j) = Σ_k P(z_k) P(w_i | z_k) P(w_j | z_k). Here, k in z_k is a subscript, and Σ is an operator that sums over all k. Each z_k is a cluster of words whose distributions of co-occurring words are similar, and the number of clusters is specified by the user. P(z_k) is the appearance probability of each cluster, and P(w_i | z_k) and P(w_j | z_k) are the generation probabilities of w_i and w_j, respectively, when the cluster z_k appears. In Non-Patent Document 2, these parameters are estimated from the set of collected co-occurrence relationships.
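Once the parameters of this mixture are known, evaluating the modeled co-occurrence probability is a single sum. A minimal sketch (the function name and toy parameters are mine; Non-Patent Document 2 obtains the parameters by fitting the model to the collected co-occurrence relationships, which is not shown here):

```python
def cooccurrence_probability(p_z, p_w_given_z, i, j):
    """P(w_i, w_j) = sum over k of P(z_k) * P(w_i|z_k) * P(w_j|z_k).
    p_z[k] is the cluster prior P(z_k); p_w_given_z[k][i] is P(w_i|z_k)."""
    return sum(p_z[k] * p_w_given_z[k][i] * p_w_given_z[k][j]
               for k in range(len(p_z)))

# Two clusters, two words: 0.5*0.8*0.2 + 0.5*0.1*0.9 = 0.125
p_z = [0.5, 0.5]
p_w_given_z = [[0.8, 0.2], [0.1, 0.9]]
print(round(cooccurrence_probability(p_z, p_w_given_z, 0, 1), 3))  # 0.125
```

The estimated probability can then serve directly as the co-occurrence score of the pair (w_i, w_j).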
- the co-occurrence dictionary output unit 31 reads out the co-occurrence relationship described in the co-occurrence dictionary from the co-occurrence dictionary storage unit 21 and the co-occurrence score, and outputs them to the co-occurrence dictionary display unit 4.
- the co-occurrence dictionary output unit 31 may sort and output the co-occurrence relationships in descending order or ascending order of the co-occurrence scores.
- the co-occurrence dictionary output unit 31 may specify at least one word and output only the co-occurrence relationship including the input word.
- The co-occurrence dictionary output unit 31 may output only co-occurrence relationships whose co-occurrence score is greater than or equal to, less than or equal to, or greater than a certain value.
- the co-occurrence dictionary display unit 4 displays the co-occurrence relationship output by the co-occurrence dictionary output unit 31 together with the co-occurrence score.
- The co-occurrence dictionary generation unit 30 uses the phrase, which is the minimum unit of sentence meaning, as the unit of co-occurrence collection.
- Further, the co-occurrence dictionary generation unit 30 limits the co-occurrence of nouns with predicates, and of predicates with predicates, to dependency relationships. Therefore, the amount of collected co-occurrence relationships with no semantic relation is reduced, and a high-quality, low-capacity co-occurrence dictionary can be created.
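The collection policy of this embodiment, noun-noun pairs anywhere in the sentence but predicate pairs only along dependency edges, can be sketched as follows. The data layout (a list of (headword, kind) phrases plus modifier-to-head index pairs) is my own simplification of the language analysis output:

```python
from collections import Counter
from itertools import combinations

def collect_relations(phrases, dependencies):
    """phrases: list of (headword, kind), kind being "noun" or "pred".
    dependencies: list of (modifier_index, head_index) pairs.
    Noun-noun co-occurrence is collected across the whole sentence;
    any pair involving a predicate only along a dependency edge."""
    counts = Counter()
    nouns = sorted({w for w, kind in phrases if kind == "noun"})
    for a, b in combinations(nouns, 2):
        counts[(a, b)] += 1  # noun-noun: any pair in the sentence
    for m, h in dependencies:
        if phrases[m][1] == "pred" or phrases[h][1] == "pred":
            # predicate pairs: counted only along dependency edges
            counts[tuple(sorted((phrases[m][0], phrases[h][0])))] += 1
    return counts

# "steep" -> "staircase" and "staircase" -> "many", as in the
# "steep staircases" example discussed later in the text
phrases = [("staircase", "noun"), ("steep", "pred"), ("many", "pred")]
deps = [(1, 0), (0, 2)]
print(collect_relations(phrases, deps))
```

Pairs such as ("steep", "many"), which have no dependency between them, are never counted, which is exactly how the semantically unrelated co-occurrences are filtered out.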
- FIG. 3 is an example of data stored in the corpus storage unit 20.
- FIG. 3 includes three pieces of document data.
- the text data whose ID is 1 is “This amusement zone is narrow, dark and fun, and looks pretty interesting”.
- The language analysis unit 300 reads the text data from the corpus storage unit 20 and performs morphological analysis, phrase identification, and dependency analysis between phrases (step S2 in FIG. 2). This is described concretely with reference to FIG. 4, which shows the result of linguistic analysis of the text "It seems to be a castle from the Edo period, and, perhaps because the structure remains as it was, there are many steep staircases."
- step S101 the language analysis unit 300 performs morphological analysis.
- step S101 is also referred to as morphological analysis.
- In step S102, the results of the morphological analysis are grouped into phrase units, and whether each phrase is a noun phrase or a predicate phrase is identified (step S102).
- step S102 is also referred to as phrase identification.
- Whether each phrase is a noun phrase or a predicate phrase is determined by scanning the morphemes from the end of the phrase and checking the part of speech of the first independent word found: if it is a noun, the phrase is a noun phrase; if it is a verb or adjective, the phrase is a predicate phrase.
- step S103 the dependency relationship of phrases is analyzed (step S103).
- step S103 is also referred to as dependency analysis.
- the dependency relationship is represented by an arrow. For example, “in the Edo period” relates to “like a castle”, and “staircase” relates to “many”.
- The co-occurrence relationship collection unit 301 collects co-occurrence relationships, nouns, and predicates from the analysis result of the language analysis unit 300 and calculates their frequencies (step S3 in FIG. 2).
- the co-occurrence relation collecting unit 301 records the collected co-occurrence relations, nouns, and predicates, and the calculated frequency.
- FIG. 5 is an example in which nouns, predicates, and co-occurrence relationships are collected from the result of FIG. 4.
- The co-occurrence relationship collection unit 301 removes function words from each phrase when collecting the co-occurrence relationships. For example, the particle "no" in "Edo-jidai-no" ("of the Edo period") is removed, leaving "Edo period".
- Predicates are restored to their base form using the result of morphological analysis; for example, an inflected form of a predicate is replaced by its dictionary form.
- The co-occurrence of nouns, the dependency between nouns and predicates, and the dependency between predicates are collected and counted.
- the frequency of nouns and predicates alone is also recorded.
- no direction is defined for the co-occurrence relationship.
- A co-occurrence relationship composed of the same two words is treated as one type by fixing the order of the two words according to their character-code values.
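This canonicalization amounts to ordering each pair before using it as a dictionary key. A one-line sketch (the function name is mine; Python compares strings by code-point value, which matches the character-code ordering in the text):

```python
def normalize_pair(w1, w2):
    """Order the two words of an undirected co-occurrence pair by
    character-code value so that either argument order yields the
    same dictionary key."""
    return (w1, w2) if w1 <= w2 else (w2, w1)

# ("Edo period", "castle") and ("castle", "Edo period") become one key
print(normalize_pair("castle", "Edo period"))
print(normalize_pair("Edo period", "castle") == normalize_pair("castle", "Edo period"))  # True
```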
- The co-occurrence score calculation unit 302 calculates a co-occurrence score representing the co-occurrence strength of each co-occurrence relationship based on the results collected by the co-occurrence relationship collection unit 301 (step S4 in FIG. 2). Then, the co-occurrence score calculation unit 302 stores the co-occurrence relationships and co-occurrence scores in the co-occurrence dictionary storage unit 21.
- FIG. 6 is an example of the output of the co-occurrence relationship collection unit 301. The operation of the co-occurrence score calculation unit 302 is described using the data of FIG. 6, with the Dice coefficient adopted as the co-occurrence score. In the data of FIG. 6, the frequency of "Edo period, castle" is 30, the frequency of "Edo period" is 66, and the frequency of "castle" is 110, so the Dice coefficient is 2 × 30 / (66 + 110) ≈ 0.34.
- the co-occurrence score calculation unit 302 performs the same process for all co-occurrence relationships.
- the co-occurrence score calculation unit 302 associates the two words constituting the co-occurrence relationship with the calculated co-occurrence score, and stores them in the co-occurrence dictionary storage unit 21.
- the co-occurrence dictionary display unit 4 displays the data of the co-occurrence dictionary read from the co-occurrence dictionary storage unit 21 by the co-occurrence dictionary output unit 31 (step S5 in FIG. 2).
- FIG. 7 is a display example of data stored in the co-occurrence dictionary storage unit 21.
- FIG. 7 displays all co-occurrence relationships having “Edo period”. Referring to FIG. 7, it can be seen that the co-occurrence score of “Edo period, castle” is 0.34. Also, comparing the co-occurrence scores of “Edo period, castle” and “Edo period, structure” shows that the semantic relationship of “Edo period, castle” is stronger.
- As described above, the language analysis unit 300 performs morphological analysis, phrase identification, and dependency analysis between phrases. The co-occurrence relationship collection unit 301 then collects the co-occurrence of noun phrases, the dependency between noun phrases and predicate phrases, and the dependency between predicate phrases. The co-occurrence score calculation unit 302 then calculates the co-occurrence score of each co-occurrence relationship based on its collected frequency. As a result, co-occurrence relationships involving predicates are narrowed down to dependency relationships, so a co-occurrence dictionary can be generated from co-occurrence relationships with strong semantic relations.
- Since the phrase is used as the unit of co-occurrence collection, co-occurrence relationships with weak semantic relations are eliminated.
- a co-occurrence dictionary with a small storage area can be generated.
- A phrase is "a sentence divided into parts as small as possible within the range where the meaning can still be understood." If the collection unit is the phrase, no compound expression that is not a meaningful unit is produced, so the storage capacity of the co-occurrence dictionary can be reduced accordingly. Also, by collecting co-occurrences in semantic units, co-occurrence relationships that do not reflect the meaning of the sentence are not collected, and a high-quality co-occurrence dictionary can be generated while reducing the storage area.
- For example, consider the sentence "The search engine can search documents containing keywords at high speed." Morphological analysis divides it into individual words, roughly "search / engine / keyword / contain / document / high speed / search / can" (function words and parts of speech are omitted here). Phrase identification then groups these into phrases: "the search engine / keywords / containing / documents / at high speed / can search".
- If the word chain is the basic unit, compound expressions that are not meaningful units, such as chains straddling phrase boundaries like "at high speed can" or "containing documents", are also collected.
- Likewise, co-occurrences with weak semantic relations, such as "engine, document" and "engine, keyword", are collected.
- In contrast, with the phrase as the unit, co-occurrence relationships that appropriately reflect the meaning of the sentence, such as "search engine, document" and "search engine, keyword", are collected.
- FIG. 8 is a block diagram showing the configuration of the second exemplary embodiment of the present invention.
- the second embodiment of the present invention is different from the first embodiment (FIG. 1) in that a data processing device 5 is provided instead of the data processing device 3.
- the data processing device 5 is different from the data processing device 3 in that a co-occurrence dictionary generation unit 50 is provided instead of the co-occurrence dictionary generation unit 30.
- The co-occurrence dictionary generation unit 50 differs in that it further includes a topic division unit 500 in addition to the language analysis unit 300, the co-occurrence relationship collection unit 301, and the co-occurrence score calculation unit 302.
- the language analysis unit 300 reads text data from the corpus storage unit 20 and performs morphological analysis, phrase identification, and dependency analysis between phrases on each text data. Then, the language analysis unit 300 outputs the analysis result to the topic dividing unit 500.
- The topic division unit 500 detects the topic change points of each text data from the analysis result of the language analysis unit 300, divides the analysis result at each change point, and outputs the result to the co-occurrence relationship collection unit 301. Co-occurrence relationships between nouns of different topics have weak semantic relations; by dividing the text into topics before passing it to the co-occurrence relationship collection unit 301, it becomes possible to collect co-occurrence relationships that are more semantically related.
- The topic division unit 500 can use any means capable of dividing the text based on the results of morphological analysis, phrase identification, and dependency analysis. For example, it may split between adjacent sentences unless n or more types of nouns are shared between them, on the assumption that sentences on the same topic use the same nouns. In the example below, no noun is shared between "Yesterday, the Nikkei average plunged, but was it due to the influence of foreign investors?" and the sentence that follows it, so the text is divided there.
- The topic division unit 500 may also divide at the appearance of expressions that signal a change of topic, such as "to change the subject", "by the way", or "incidentally". Further, the topic division unit 500 may divide when no conjunction exists at the beginning of a sentence, since the presence of a conjunction suggests a connection to the preceding sentence, and its absence suggests a separate topic.
- The topic division unit 500 can also use the technique of Non-Patent Document 3, in which a word sequence is treated as a pseudo-paragraph, the word overlap between two adjacent pseudo-paragraphs is measured, and points where the overlap drops are taken as topic change points.
- the co-occurrence relationship collection unit 301 has the same function as the co-occurrence relationship collection unit 301 in the first embodiment except that the co-occurrence relationship is collected for each analysis result divided at the topic change point.
- Steps S11 and S12 in FIG. 9 are the same as steps S1 and S2 in FIG. 2.
- The topic division unit 500 receives the analysis result of the language analysis unit 300 and detects the topic change points of the text. Then, the topic division unit 500 divides the analysis result at the detected change points (step S13 in FIG. 9) and outputs the result to the co-occurrence relationship collection unit 301. In this example, the topic division unit 500 splits between adjacent sentences whose nouns do not overlap. Suppose the text to be split is: "1) I have been checking the Nikkei average because of my recent interest in investment. 2) Yesterday, the Nikkei average plunged, but was it due to the influence of foreign investors? 3) I am getting hungry. 4) Let's go to a convenience store." Here, 1) to 4) are sentence numbers given for explanation and are not actually written in the text.
- More specifically, the topic division unit 500 counts the number of noun types shared by two adjacent sentences, and splits between sentences that do not share two or more noun types.
- the noun of each sentence can be extracted from the output of the language analysis unit 300.
- As a result, the topic division unit 500 divides the input text into three parts: "I have been checking the Nikkei average because of my recent interest in investment. Yesterday, the Nikkei average plunged, but was it due to the influence of foreign investors?", "I am getting hungry.", and "Let's go to a convenience store."
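The noun-overlap splitting illustrated above can be sketched as follows. The function operates on one noun set per sentence; the name, the data layout, and the configurable threshold n are my own (the toy example uses a threshold of one shared noun):

```python
def split_by_noun_overlap(sentence_nouns, n=1):
    """Group consecutive sentence indices into topic segments, starting
    a new segment whenever fewer than n noun types are shared between
    adjacent sentences."""
    segments = [[0]]
    for i in range(1, len(sentence_nouns)):
        if len(sentence_nouns[i - 1] & sentence_nouns[i]) < n:
            segments.append([])  # topic change point: start a new segment
        segments[-1].append(i)
    return segments

# Noun sets for sentences 1) to 4) of the example above
nouns = [
    {"investment", "Nikkei average"},
    {"Nikkei average", "investor"},
    {"stomach"},
    {"convenience store"},
]
print(split_by_noun_overlap(nouns))  # [[0, 1], [2], [3]]
```

Sentences 1) and 2) share "Nikkei average" and stay in one segment; 3) and 4) share nothing with their predecessors and each start a new segment, matching the three-way split above.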
- The subsequent steps are the same as steps S3 to S5 in FIG. 2, and their description is therefore omitted.
- This embodiment has the following effect in addition to the effects of the first embodiment: by having the topic division unit 500, the co-occurrence of nouns can be collected only within the same topic, so the co-occurrence dictionary can be generated from co-occurrence relationships with even stronger semantic relations. Note that the co-occurrence of nouns with predicates and of predicates with predicates is naturally limited to dependency relationships within a sentence, so those co-occurrence relationships have strong semantic relations regardless of whether topic division is performed.
- FIG. 10 is a block diagram showing the configuration of the third exemplary embodiment of the present invention.
- The third embodiment of the present invention differs from the first embodiment (FIG. 1) in that the storage device 2, the data processing device 3, and the co-occurrence dictionary display unit 4 are replaced with a storage device 9, a data processing device 7, and a text data display unit 8.
- the third embodiment is different from the first embodiment in that a text data input unit 6 is provided.
- The storage device 9 differs from the storage device 2 in that, in addition to the corpus storage unit 20 and the co-occurrence dictionary storage unit 21, it further includes a text data storage unit 22 and a text data storage unit 23 with typicality score.
- The data processing device 7 differs in that it includes a co-occurrence dictionary generation unit 70, a typicality scoring unit 71, and a text data selection unit 72 in place of the co-occurrence dictionary generation unit 30 and the co-occurrence dictionary output unit 31.
- the co-occurrence dictionary generation unit 70 generates a co-occurrence dictionary based on the text that is the collection source of the co-occurrence relationship stored in the corpus storage unit 20 by the corpus input unit 1 and stores the co-occurrence dictionary in the co-occurrence dictionary storage unit 21.
- the co-occurrence dictionary generation unit 70 has the same configuration as the co-occurrence dictionary generation unit 30 or the same configuration as the co-occurrence dictionary generation unit 50 in the second embodiment.
- The text data input unit 6 stores, in the text data storage unit 22, the text data to which a typicality score is to be assigned using the co-occurrence dictionary.
- the text data includes “text” representing a text body, “ID” representing an identifier of each data, and “initial score” in which a score of typicality designated in advance is set.
- the “ID” may be specified in advance, or may be automatically assigned, for example, by assigning an ID so as to be an integer serial number in the order of input.
- the “text” may be a document or a relationship composed of a plurality of words extracted by some method.
- A higher "initial score" indicates a higher evaluation. When the "initial score" is not required or not given, all entries are set to the same value, such as 0 or 1.
- The text data input unit 6 may automatically input the output of other natural language processing systems, such as kana-kanji conversion candidates, information search results, and information extraction results. In that case, the "initial score" may be the score given by each system: a kana-kanji conversion candidate score, the confidence that an information extraction device assigns to an extraction result, a search engine's relevance score, or the reciprocal of the rank.
- the typicality scoring unit 71 reads the text data stored in the text data storage unit 22 and the co-occurrence dictionary data stored in the co-occurrence dictionary storage unit 21. Then, the typicality scoring unit 71 extracts the co-occurrence relationship from each text data, and calculates the typicality score of each text data from the co-occurrence score of the co-occurrence relationship of each text data and the initial score. The typicality scoring unit 71 stores each text and its typicality score in the text data storage unit 23 with typicality score.
- the typicality score is calculated such that the higher the co-occurrence score and the initial score, the higher the typicality score.
- the typicality score may be the sum or product of each co-occurrence score and the initial score, or a combination of sum and product.
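One concrete reading of this combination rule, as a sketch (the function and the "sum"/"product" switch are my own illustration of the options named above):

```python
def typicality_score(initial_score, cooccurrence_scores, mode="sum"):
    """Combine the initial score of a text with the co-occurrence
    scores of the relations extracted from it. Both options below make
    the typicality score rise with the initial score and with the
    co-occurrence scores, as required by the text."""
    evidence = sum(cooccurrence_scores)
    if mode == "sum":
        return initial_score + evidence
    return initial_score * evidence  # "product" combination

print(round(typicality_score(1.0, [0.34, 0.12]), 2))             # 1.46
print(round(typicality_score(1.0, [0.34, 0.12], "product"), 2))  # 0.46
```

A mixed sum-and-product combination, also permitted by the text, would simply interleave the two operations.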
- The text data selection unit 72 reads the texts and their typicality scores from the text data storage unit 23 with typicality score.
- The text data selection unit 72 then selects and orders the text data based on the magnitude or value of the typicality scores, and outputs them to the text data display unit 8.
- the text data display unit 8 displays the text data selected by the text data selection unit 72 based on the typicality of the contents together with the typicality score.
- This embodiment has a function of creating a co-occurrence dictionary and a function of assigning a typicality score to the target text data using the created co-occurrence dictionary.
- the operation of the function for creating the co-occurrence dictionary is the same as the operation for creating the co-occurrence dictionary in the first embodiment or the second embodiment. Therefore, the operation after the co-occurrence dictionary is created will be described below.
- The text data input unit 6 stores, in the text data storage unit 22, the text data to which a typicality score is to be assigned using the co-occurrence dictionary (step S21 in FIG. 11).
- 12A and 12B are examples of data stored in the text data storage unit 22 by the text data input unit 6.
- FIG. 12A is a diagram illustrating an example of an extraction result of the information extraction device.
- FIG. 12B is a diagram illustrating an example of a kana-kanji conversion candidate.
- FIG. 12A shows an information extraction result obtained by extracting a relationship consisting of three words, what (object), what point (attribute), and how (evaluation) was from text data.
- FIG. 12B shows kana-kanji conversion candidates for "I went to amusement park A."
- the typicality scoring unit 71 reads the text data from the text data storage unit 22.
- the typicality scoring unit 71 then extracts a co-occurrence relationship from each text data (step S22 in FIG. 11).
- the typicality scoring unit 71 performs the same processing as the language analysis unit 300 on each read text and collects co-occurrence relationships in the same manner as the co-occurrence relationship collection unit 301. That is, the typicality scoring unit 71 performs morphological analysis on the text data to identify phrases and analyzes the dependencies between the phrases. Then, the typicality scoring unit 71 collects, on a phrase basis, the co-occurrences of nouns in the text data, the dependencies between nouns and predicates, and the dependencies between predicates as co-occurrence relationships.
- the co-occurrence relationships to be collected may be limited, instead of taking all combinations of words.
- in FIG. 12A, the "attribute" is an evaluation viewpoint of the "object", and the "evaluation" evaluates the "attribute"; the "evaluation" does not directly evaluate the "object" itself. That is, in FIG. 12A, the collection may be limited to the two co-occurrence relationships "object, attribute" and "attribute, evaluation".
- in the following, the case where the two relationships "object, attribute" and "attribute, evaluation" are extracted from FIG. 12A as co-occurrence relationships is described.
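The limited collection described here can be sketched as follows. The example values and function name are illustrative assumptions; only the pairing rule ("object, attribute" and "attribute, evaluation", but not "object, evaluation") comes from the text:

```python
# Illustrative sketch: from an extracted (object, attribute, evaluation)
# triple, collect only the two semantically linked pairs, as described
# for FIG. 12A. The pair (object, evaluation) is deliberately skipped
# because the evaluation does not directly evaluate the object itself.

def limited_cooccurrences(obj, attribute, evaluation):
    return [(obj, attribute), (attribute, evaluation)]

pairs = limited_cooccurrences("amusement park A", "roller coaster", "fun")
```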
- the typicality scoring unit 71 reads the co-occurrence dictionary from the co-occurrence dictionary storage unit 21. Then, the typicality scoring unit 71 acquires the co-occurrence score of each co-occurrence relationship extracted in step S22 of FIG. 11 (step S23 in FIG. 11).
- FIG. 13 is a diagram illustrating an example of a co-occurrence dictionary stored in the co-occurrence dictionary storage unit 21.
- the data in the co-occurrence dictionary storage unit 21 is created by either the first or the second embodiment of the present invention.
- the typicality scoring unit 71 calculates a typicality score for each text based on the co-occurrence relationships of each text extracted in step S22, the initial score of each text read in step S22, and the co-occurrence score of each co-occurrence relationship acquired in step S23 (step S24 in FIG. 11). Then, the typicality scoring unit 71 stores each text and its typicality score in the text data storage unit 23 with typicality score.
- taking the initial score of ID 1 as an example, the typicality score is the sum of the initial score and the co-occurrence score of each co-occurrence relationship.
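This sum can be sketched as follows. This is a hedged illustration of step S24 only; the dictionary contents, the frozenset key convention, and the choice of treating unknown pairs as contributing 0 are assumptions for the example, not specified by the patent:

```python
# Sketch of step S24: the typicality score is the initial score plus the
# co-occurrence score of each extracted co-occurrence relationship, looked
# up in the co-occurrence dictionary. Pairs are keyed undirected here
# (frozenset), and pairs absent from the dictionary contribute 0.0.

def typicality_score(initial_score, pairs, cooc_dict):
    return initial_score + sum(cooc_dict.get(frozenset(p), 0.0) for p in pairs)

cooc_dict = {frozenset(("amusement park", "went")): 0.4,
             frozenset(("amusement park", "roller coaster")): 0.3}
score = typicality_score(1.0,
                         [("amusement park", "went"),
                          ("amusement park", "roller coaster")],
                         cooc_dict)   # 1.0 + 0.4 + 0.3
```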
- FIG. 14A is a diagram illustrating an example of the typicality score of the information extraction result.
- FIG. 14B is a diagram illustrating an example of a typicality score of a kana-kanji conversion candidate.
- the typicality scoring unit 71 calculates a typicality score from the data of FIGS. 12A and 12B stored in the text data storage unit 22 and the data of FIG. 13 stored in the co-occurrence dictionary storage unit 21.
- the typicality scoring unit 71 stores the typicality score and text data in the text data storage unit 23 with typicality score.
- the text data display unit 8 displays the text selected by the text data selection unit 72 (step S26 in FIG. 11).
- when the text to which a typicality is to be assigned is a sentence, the typicality scoring unit 71 performs morphological analysis on the text to identify phrases and analyzes the dependencies between the phrases. Then, the typicality scoring unit 71 collects, on a phrase basis, the co-occurrences of nouns in the text, the dependencies between nouns and predicates, and the dependencies between predicates as co-occurrence relationships. The typicality scoring unit 71 then obtains the co-occurrence score corresponding to each collected co-occurrence relationship from the co-occurrence dictionary and calculates the degree of typicality of the text content. Therefore, the semantic degree of typicality of the text content can be calculated with higher accuracy.
- the typicality scoring unit 71 collects, among the combinations of words, those combinations that are meaningful as co-occurrence relationships. Then, the typicality scoring unit 71 obtains the co-occurrence score corresponding to each collected co-occurrence relationship from the co-occurrence dictionary and calculates the degree of typicality of the text content. Therefore, the semantic degree of typicality of the text content can be calculated with higher accuracy. The collection need not be limited to meaningful combinations of words; even in that case, a certain degree of accuracy is obtained because the co-occurrence dictionary used was generated from co-occurrence relationships with high semantic relevance.
- FIG. 15 is a block diagram showing a configuration of the fourth exemplary embodiment of the present invention.
- the fourth embodiment of the present invention differs from the third embodiment (FIG. 10) in that a storage device 10 and a data processing device 11 are provided instead of the storage device 9 and the data processing device 7.
- the fourth embodiment is different from the third embodiment in that the corpus input unit 1 is not provided.
- the storage device 10 is different from the storage device 9 in that it does not include the corpus storage unit 20.
- the data processing device 11 is different from the data processing device 7 in that the co-occurrence dictionary generation unit 70 is not provided.
- the fourth embodiment also differs from the third embodiment in that a co-occurrence dictionary created in advance by the co-occurrence dictionary generation unit 30 of the first embodiment or the co-occurrence dictionary generation unit 50 of the second embodiment is stored beforehand in the co-occurrence dictionary storage unit 21.
- since the co-occurrence dictionary is stored in the co-occurrence dictionary storage unit 21 in advance, there is no operation for creating the co-occurrence dictionary.
- the other operations, that is, the operation in which the typicality scoring unit 71 assigns typicality to text data using the co-occurrence dictionary stored in the co-occurrence dictionary storage unit 21, and the operation of selecting the text to be displayed on the text data display unit 8 based on the typicality score, are the same as in the third embodiment. Their descriptions are therefore omitted.
- according to the fourth embodiment, the same effects as in the third embodiment can be obtained, and at the same time the semantic degree of typicality of the contents of the text data can be calculated at high speed.
- the reason is that the time for generating the co-occurrence dictionary is eliminated by using a co-occurrence dictionary created in advance.
- the present invention can be realized not only as hardware that implements the above functions but also by a computer and a program.
- the program is provided by being recorded on a computer-readable recording medium such as a magnetic disk or a semiconductor memory.
- the program is read by the computer when the computer is started up.
- the read program controls the operation of the computer.
- the program causes the computer to function as each functional unit on the data processing device in each of the above-described embodiments, and causes the above-described processing steps to be executed.
- FIG. 16 is a general block diagram of an information processing system that implements a system according to each embodiment of the present invention.
- the information processing system illustrated in FIG. 16 includes a processor 3000, a program memory 3001, and a storage medium 3002.
- as the storage medium 3002, a RAM or a magnetic storage medium such as a hard disk can be used.
- the program memory 3001 stores a program for executing processing steps performed by the data processing apparatus according to any one of the first to fourth embodiments.
- the processor 3000 operates according to this program.
- the storage medium 3002 is used as a storage device in the first to fourth embodiments.
- the present invention can be applied to systems that create a co-occurrence dictionary used for semantic analysis of natural language, such as dependency analysis, document proofreading, kana-kanji conversion, evaluation of the semantic consistency of information extraction results, and evaluation of the semantic typicality of text.
Abstract
Description
This application claims priority based on Japanese Patent Application No. 2008-094980 filed in Japan on April 1, 2008 and Japanese Patent Application No. 2008-124254 filed in Japan on May 12, 2008, the contents of which are incorporated herein.
In the present invention, the unit constituting a co-occurrence relationship is the phrase (bunsetsu), so there is no need to distinguish a noun phrase from a noun or a predicate phrase from a predicate, and "phrase" may therefore be omitted in the notation. Only where "word" is explicitly stated does the term refer to a word rather than a phrase.
2, 9, 10: storage device,
3, 5, 7, 11: data processing device,
8: text data display unit,
20: corpus storage unit,
21: co-occurrence dictionary storage unit,
22: text data storage unit,
23: text data storage unit with typicality score,
30, 70: co-occurrence dictionary generation unit,
71: typicality scoring unit,
72: text data selection unit,
300: language analysis unit,
301: co-occurrence relationship collection unit,
302: co-occurrence score calculation unit,
500: topic division unit,
3000: processor,
3001: program memory,
3002: storage medium
A first embodiment for carrying out the present invention will be described in detail with reference to the drawings.
The first embodiment of the present invention has a corpus input unit 1 that inputs the text from which co-occurrence relationships are collected, a storage device 2 that stores the text and the generated co-occurrence dictionary, a data processing device 3 that operates under program control, and a co-occurrence dictionary display unit 4 that displays the contents of the generated co-occurrence dictionary.
Relationships that co-occur in a biased manner are considered to be deeply related semantically. Therefore, the co-occurrence score may be a value obtained by dividing the frequency of a co-occurrence relationship by the frequency of one of its two words, or by the sum of the frequencies of both words.
Note that the semantic relevance is higher between words that tend to be used together and lower between words that are rarely used together.
FIG. 4 shows the result of linguistic analysis of the text of ID 2 in FIG. 3, 「江戸時代の城らしいが、構造が昔のままなのか、妙に急な階段が多い。」 ("It seems to be a castle from the Edo period, but perhaps because the structure is unchanged from old times, there are oddly many steep stairs.").
Next, the results of the morphological analysis are grouped into phrase (bunsetsu) units, and each phrase is identified as either a noun phrase or a predicate phrase (step S102). The processing of step S102 is also called phrase identification. Whether a phrase is a noun phrase or a predicate phrase is determined by searching the morphemes from the end of the phrase and taking the part of speech of the first independent word found: if a noun is found first, the phrase is a noun phrase; if a predicate is found first, it is a predicate phrase.
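The phrase-identification rule of step S102 can be sketched as follows. The tuple representation of morphemes is a toy assumption for illustration; a real system would take this information from the morphological analyzer's output:

```python
# Sketch of step S102: scan the morphemes of a bunsetsu from the end;
# the part of speech of the first independent (content) word found
# decides whether it is a noun phrase or a predicate phrase.
# Morphemes are modeled as (surface, pos, is_independent) tuples.

def phrase_type(morphemes):
    for surface, pos, independent in reversed(morphemes):
        if independent:
            return "noun_phrase" if pos == "noun" else "predicate_phrase"
    return "unknown"

# 「江戸時代の」 = noun 「江戸時代」 + particle 「の」 -> noun phrase
bunsetsu = [("江戸時代", "noun", True), ("の", "particle", False)]
kind = phrase_type(bunsetsu)
```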
Finally, the dependency relationships between the phrases are analyzed (step S103). The processing of step S103 is also called dependency analysis.
In the diagram showing the processing result of step S103 (the bottom diagram of FIG. 4), dependency relationships are represented by arrows. For example, 「江戸時代の」 ("of the Edo period") depends on 「城らしいが」 ("seems to be a castle, but"), and 「階段が」 ("stairs") depends on 「多い。」 ("are many").
FIG. 5 is an example in which nouns, predicates, and co-occurrence relationships have been collected from the result of FIG. 4. When collecting co-occurrence relationships, the co-occurrence relationship collection unit 301 removes function words from each phrase. For example, since the 「の」 of 「江戸時代の」 is a particle, the phrase becomes 「江戸時代」.
Predicates are restored to their base form using the result of the morphological analysis. For example, 「妙に」 becomes 「妙だ」. After these processes, co-occurrences of nouns, dependencies between nouns and predicates, and dependencies between predicates are collected and their frequencies counted.
If required for calculating co-occurrence scores, the frequencies of individual nouns and predicates are also recorded. In this embodiment of the invention, no direction is defined for a co-occurrence relationship; that is, the order of the two words is fixed, for example by the magnitude of their character codes, so that a co-occurrence relationship composed of the same two words always takes a single form.
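The undirected-pair convention described here can be sketched as follows; lexicographic string comparison is used as a stand-in for the character-code ordering the text mentions, and the sample pairs are illustrative:

```python
# Sketch: the two words of a co-occurrence relation are put into a single
# canonical order so that ("A","B") and ("B","A") count as the same
# undirected pair, then pair frequencies are counted.
from collections import Counter

def canonical_pair(w1, w2):
    return (w1, w2) if w1 <= w2 else (w2, w1)

freq = Counter()
for a, b in [("城", "江戸時代"), ("江戸時代", "城"), ("階段", "多い")]:
    freq[canonical_pair(a, b)] += 1
# the two orderings of 「江戸時代」/「城」 collapse into one entry
```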
FIG. 6 is an example of the output of the co-occurrence relationship collection unit 301. The operation of the co-occurrence score calculation unit 302 will be described using the data of FIG. 6 as an example. In this example, the Dice coefficient is adopted as the method for calculating the co-occurrence score. Specifically, in the data of FIG. 6, the frequency of 「江戸時代,城」 is 30, the frequency of 「江戸時代」 is 66, and the frequency of 「城」 is 110, so the Dice coefficient of 「江戸時代,城」 can be calculated as 2 × 30 / (66 + 110) ≈ 0.34. The co-occurrence score calculation unit 302 performs the same processing for all co-occurrence relationships and stores the two words constituting each co-occurrence relationship in the co-occurrence dictionary storage unit 21 in association with the calculated co-occurrence score.
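The Dice-coefficient calculation worked through above can be written in a few lines. This is only an illustration of the formula 2 × freq(a,b) / (freq(a) + freq(b)) with the frequencies given for 「江戸時代」 and 「城」, not the patented implementation:

```python
# Dice coefficient of a co-occurrence pair, as used by the co-occurrence
# score calculation unit 302 in this example:
#   dice = 2 * freq(pair) / (freq(word_a) + freq(word_b))

def dice(pair_freq, freq_a, freq_b):
    return 2.0 * pair_freq / (freq_a + freq_b)

score = dice(30, 66, 110)   # 2*30/(66+110), approx. 0.34
```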
FIG. 7 is a display example of the data stored in the co-occurrence dictionary storage unit 21, showing all co-occurrence relationships containing 「江戸時代」. Referring to FIG. 7, the co-occurrence score of 「江戸時代,城」 is 0.34. Comparing the co-occurrence scores of 「江戸時代,城」 and 「江戸時代,構造」 shows that the semantic relationship of 「江戸時代,城」 is the stronger.
On the other hand, the phrase segmentation is 「検索エンジンは/高速に/キーワードを/含む/文書を/探す/ことが/できる」 ("A search engine can quickly find documents containing a keyword"). If word chains were taken as the basic unit, meaningless compound expressions such as 「は高速」 and 「含む文書」 would be collected.
At the word level, co-occurrences with low semantic relevance, such as 「エンジン、文書」 ("engine, document") and 「エンジン、キーワード」 ("engine, keyword"), are collected. At the phrase level, on the other hand, co-occurrence relationships that properly reflect the meaning of the sentence, such as 「検索エンジン、文書」 ("search engine, document") and 「検索エンジン、キーワード」 ("search engine, keyword"), can be collected.
Next, a second embodiment of the present invention will be described in detail with reference to the drawings.
The second embodiment of the present invention differs from the first embodiment (FIG. 1) in that a data processing device 5 is provided instead of the data processing device 3. The data processing device 5 differs from the data processing device 3 in that a co-occurrence dictionary generation unit 50 is provided instead of the co-occurrence dictionary generation unit 30. The co-occurrence dictionary generation unit 50 differs from the co-occurrence dictionary generation unit 30 in that it further includes a topic division unit 500 in addition to the language analysis unit 300, the co-occurrence relationship collection unit 301, and the co-occurrence score calculation unit 302.
In this example, the topic division unit 500 inserts a division between adjacent sentences unless two or more types of noun are shared between them. For example, suppose the text to be divided is: "1) Recently I have become interested in investment, so I have started checking the Nikkei average. 2) Yesterday the Nikkei average plunged; was it the influence of foreign investors? 3) I am getting hungry. 4) I will go to the convenience store." The numbers 1) to 4) are attached to the sentences for explanation and are not actually written in the text.
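The topic-split rule described here can be sketched as follows. The sentence noun sets below are simplified illustrations of the example text (real use would take the nouns from the morphological analysis step), and the function name is an assumption:

```python
# Sketch of the topic division rule: insert a boundary between adjacent
# sentences when fewer than two noun types are shared between them.
# Each sentence is pre-reduced to its set of nouns.

def split_points(noun_sets, min_overlap=2):
    return [i for i in range(1, len(noun_sets))
            if len(noun_sets[i - 1] & noun_sets[i]) < min_overlap]

sentences = [{"投資", "日経平均", "興味"},        # 1) simplified noun sets
             {"日経平均", "投資", "海外"},        # 2)
             {"腹"},                              # 3)
             {"コンビニ"}]                        # 4)
boundaries = split_points(sentences)
# sentences 1) and 2) share two nouns and stay in one topic;
# boundaries fall before sentences 3) and 4)
```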
Next, a third embodiment of the present invention will be described in detail with reference to the drawings.
The third embodiment of the present invention differs from the first embodiment (FIG. 1) in that a storage device 9, a data processing device 7, and a text data display unit 8 are provided instead of the storage device 2, the data processing device 3, and the co-occurrence dictionary display unit 4. The third embodiment also differs from the first embodiment in that it includes a text data input unit 6.
The "ID" may be designated in advance, or may be assigned automatically, for example as sequential integers in input order. The "text" may be a document, or a relationship consisting of several words extracted by some method.
Here, the typicality score is calculated so as to become higher as the co-occurrence scores and the initial score become higher. For example, the typicality score may be the sum or the product of the co-occurrence scores and the initial score, or a combination of sum and product.
FIGS. 12A and 12B show examples of data stored in the text data storage unit 22 by the text data input unit 6. FIG. 12A is a diagram illustrating an example of an extraction result of an information extraction device. FIG. 12B is a diagram illustrating an example of kana-kanji conversion candidates. FIG. 12A shows an information extraction result in which a relationship consisting of three words, namely what (object), in what respect (attribute), and how it was (evaluation), has been extracted from text data. FIG. 12B shows the kana-kanji conversion candidates for 「いった」 ("went") in 「遊園地Aにいった」 ("went to amusement park A").
FIG. 13 is a diagram illustrating an example of the co-occurrence dictionary stored in the co-occurrence dictionary storage unit 21. The data in the co-occurrence dictionary storage unit 21 is created by either the first or the second embodiment of the present invention.
The typicality scoring unit 71 calculates a typicality score from the data of FIGS. 12A and 12B stored in the text data storage unit 22 and the data of FIG. 13 stored in the co-occurrence dictionary storage unit 21. The typicality scoring unit 71 stores the typicality score and the text data in the text data storage unit 23 with typicality score.
The collection need not be limited to meaningful combinations of words. Even in that case, a certain degree of accuracy is obtained because the co-occurrence dictionary used was generated from co-occurrence relationships with high semantic relevance.
Next, a fourth embodiment of the present invention will be described in detail with reference to the drawings.
The fourth embodiment of the present invention differs from the third embodiment (FIG. 10) in that a storage device 10 and a data processing device 11 are provided instead of the storage device 9 and the data processing device 7. The fourth embodiment also differs from the third embodiment in that it does not include the corpus input unit 1.
Claims (45)
- A co-occurrence dictionary generation system comprising: a language analysis unit that morphologically analyzes text, identifies phrases, and analyzes dependencies between the phrases; a co-occurrence relationship collection unit that collects, on a phrase basis, co-occurrences of nouns in the text, dependencies between nouns and predicates, and dependencies between predicates as co-occurrence relationships; a co-occurrence score calculation unit that calculates a co-occurrence score of each co-occurrence relationship based on the frequencies of the collected co-occurrence relationships; and a co-occurrence dictionary storage unit that stores a co-occurrence dictionary describing the correspondence between the calculated co-occurrence scores and the co-occurrence relationships.
- The co-occurrence dictionary generation system according to claim 1, wherein the co-occurrence score calculation unit estimates, based on the co-occurrence relationships collected by the co-occurrence relationship collection unit and their frequencies, the co-occurrence probability of any two words constituting a collected co-occurrence relationship, and uses the estimate as the co-occurrence score.
- The co-occurrence dictionary generation system according to claim 1, wherein the co-occurrence score calculation unit uses as the co-occurrence score, based on the co-occurrence relationships collected by the co-occurrence relationship collection unit and their frequencies, the frequency of a co-occurrence relationship normalized by the appearance frequency of one or both of the two words constituting the co-occurrence relationship.
- The co-occurrence dictionary generation system according to claim 1, wherein the co-occurrence score calculation unit uses the frequencies of the co-occurrence relationships collected by the co-occurrence relationship collection unit as the co-occurrence scores.
- The co-occurrence dictionary generation system according to claim 1, further comprising a topic division unit that detects topic change points of the text based on the analysis result of the language analysis unit and divides the analysis result, wherein the co-occurrence relationship collection unit collects co-occurrence relationships for each analysis result divided at the topic change points.
- The co-occurrence dictionary generation system according to claim 1, further comprising a typicality scoring unit that collects co-occurrence relationships in a typicality-assignment target text, obtains co-occurrence scores corresponding to the collected co-occurrence relationships from the co-occurrence dictionary, and calculates the degree of typicality of the content of the typicality-assignment target text.
- The co-occurrence dictionary generation system according to claim 6, wherein the typicality scoring unit morphologically analyzes the typicality-assignment target text, identifies phrases, analyzes dependencies between the phrases, and collects, on a phrase basis, co-occurrences of nouns in the typicality-assignment target text, dependencies between nouns and predicates, and dependencies between predicates as co-occurrence relationships.
- The co-occurrence dictionary generation system according to claim 6, wherein the typicality scoring unit collects combinations of the plurality of words constituting the typicality-assignment target text as co-occurrence relationships.
- The co-occurrence dictionary generation system according to claim 6, wherein the typicality scoring unit collects, among the combinations of the plurality of words constituting the typicality-assignment target text, those combinations of words that are meaningful as co-occurrence relationships.
- The co-occurrence dictionary generation system according to claim 6, wherein, when an initial score has been assigned to the typicality-assignment target text, the typicality scoring unit calculates the degree of typicality from the co-occurrence scores obtained from the co-occurrence dictionary and the initial score.
- A scoring system comprising: a co-occurrence dictionary in which co-occurrences of nouns collected from text, dependencies between nouns and predicates, and dependencies between predicates are taken as co-occurrence relationships, values calculated based on the frequencies of the collected co-occurrence relationships are taken as co-occurrence scores, and the correspondence between the co-occurrence relationships and the co-occurrence scores is described; and a typicality scoring unit that collects co-occurrence relationships in a typicality-assignment target text, obtains co-occurrence scores corresponding to the collected co-occurrence relationships from the co-occurrence dictionary, and calculates the degree of typicality of the content of the typicality-assignment target text.
- The scoring system according to claim 11, wherein the typicality scoring unit morphologically analyzes the typicality-assignment target text, identifies phrases, analyzes dependencies between the phrases, and collects, on a phrase basis, co-occurrences of nouns in the typicality-assignment target text, dependencies between nouns and predicates, and dependencies between predicates as co-occurrence relationships.
- The scoring system according to claim 11, wherein the typicality scoring unit collects combinations of the plurality of words constituting the typicality-assignment target text as co-occurrence relationships.
- The scoring system according to claim 11, wherein the typicality scoring unit collects, among the combinations of the plurality of words constituting the typicality-assignment target text, those combinations of words that are meaningful as co-occurrence relationships.
- The scoring system according to claim 11, wherein, when an initial score has been assigned to the typicality-assignment target text, the typicality scoring unit calculates the degree of typicality from the co-occurrence scores obtained from the co-occurrence dictionary and the initial score.
- A co-occurrence dictionary generation method comprising: a language analysis step of morphologically analyzing text, identifying phrases, and analyzing dependencies between the phrases; a co-occurrence relationship collection step of collecting, on a phrase basis, co-occurrences of nouns in the text, dependencies between nouns and predicates, and dependencies between predicates as co-occurrence relationships; a co-occurrence score calculation step of calculating a co-occurrence score of each co-occurrence relationship based on the frequencies of the collected co-occurrence relationships; and a co-occurrence dictionary storing step of storing, in a co-occurrence dictionary storage unit, a co-occurrence dictionary describing the correspondence between the calculated co-occurrence scores and the co-occurrence relationships.
- The co-occurrence dictionary generation method according to claim 16, wherein, in the co-occurrence score calculation step, the co-occurrence probability of any two words constituting a collected co-occurrence relationship is estimated based on the co-occurrence relationships collected in the co-occurrence relationship collection step and their frequencies, and the estimate is used as the co-occurrence score.
- The co-occurrence dictionary generation method according to claim 16, wherein, in the co-occurrence score calculation step, the frequency of a co-occurrence relationship normalized by the appearance frequency of one or both of the two words constituting the co-occurrence relationship, based on the co-occurrence relationships collected in the co-occurrence relationship collection step and their frequencies, is used as the co-occurrence score.
- The co-occurrence dictionary generation method according to claim 16, wherein, in the co-occurrence score calculation step, the frequencies of the co-occurrence relationships collected in the co-occurrence relationship collection step are used as the co-occurrence scores.
- The co-occurrence dictionary generation method according to claim 16, further comprising a topic division step of detecting topic change points of the text based on the analysis result of the language analysis step and dividing the analysis result, wherein, in the co-occurrence relationship collection step, co-occurrence relationships are collected for each analysis result divided at the topic change points.
- The co-occurrence dictionary generation method according to claim 16, further comprising a typicality scoring step of collecting co-occurrence relationships in a typicality-assignment target text, obtaining co-occurrence scores corresponding to the collected co-occurrence relationships from the co-occurrence dictionary, and calculating the degree of typicality of the content of the typicality-assignment target text.
- The co-occurrence dictionary generation method according to claim 21, wherein, in the typicality scoring step, the typicality-assignment target text is morphologically analyzed, phrases are identified, dependencies between the phrases are analyzed, and co-occurrences of nouns in the typicality-assignment target text, dependencies between nouns and predicates, and dependencies between predicates are collected on a phrase basis as co-occurrence relationships.
- The co-occurrence dictionary generation method according to claim 21, wherein, in the typicality scoring step, combinations of the plurality of words constituting the typicality-assignment target text are collected as co-occurrence relationships.
- The co-occurrence dictionary generation method according to claim 21, wherein, in the typicality scoring step, among the combinations of the plurality of words constituting the typicality-assignment target text, those combinations of words that are meaningful are collected as co-occurrence relationships.
- The co-occurrence dictionary generation method according to claim 21, wherein, in the typicality scoring step, when an initial score has been assigned to the typicality-assignment target text, the degree of typicality is calculated from the co-occurrence scores obtained from the co-occurrence dictionary and the initial score.
- A scoring method comprising a typicality scoring step in which an information processing apparatus having a co-occurrence dictionary, in which co-occurrences of nouns collected from text, dependencies between nouns and predicates, and dependencies between predicates are taken as co-occurrence relationships, values calculated based on the frequencies of the collected co-occurrence relationships are taken as co-occurrence scores, and the correspondence between the co-occurrence relationships and the co-occurrence scores is described, collects co-occurrence relationships in a typicality-assignment target text, obtains co-occurrence scores corresponding to the collected co-occurrence relationships from the co-occurrence dictionary, and calculates the degree of typicality of the content of the typicality-assignment target text.
- The scoring method according to claim 26, wherein, in the typicality scoring step, the typicality-assignment target text is morphologically analyzed, phrases are identified, dependencies between the phrases are analyzed, and co-occurrences of nouns in the typicality-assignment target text, dependencies between nouns and predicates, and dependencies between predicates are collected on a phrase basis as co-occurrence relationships.
- The scoring method according to claim 26, wherein, in the typicality scoring step, combinations of the plurality of words constituting the typicality-assignment target text are collected as co-occurrence relationships.
- The scoring method according to claim 26, wherein, in the typicality scoring step, among the combinations of the plurality of words constituting the typicality-assignment target text, those combinations of words that are meaningful are collected as co-occurrence relationships.
- The scoring method according to claim 26, wherein, in the typicality scoring step, when an initial score has been assigned to the typicality-assignment target text, the degree of typicality is calculated from the co-occurrence scores obtained from the co-occurrence dictionary and the initial score.
- A program for causing an information processing apparatus to execute: a language analysis step of morphologically analyzing text, identifying phrases, and analyzing dependencies between the phrases; a co-occurrence relationship collection step of collecting, on a phrase basis, co-occurrences of nouns in the text, dependencies between nouns and predicates, and dependencies between predicates as co-occurrence relationships; a co-occurrence score calculation step of calculating a co-occurrence score of each co-occurrence relationship based on the frequencies of the collected co-occurrence relationships; and a co-occurrence dictionary storing step of storing, in a co-occurrence dictionary storage unit, a co-occurrence dictionary describing the correspondence between the calculated co-occurrence scores and the co-occurrence relationships.
- The program according to claim 31, wherein, in the co-occurrence score calculation step, the co-occurrence probability of any two words constituting a collected co-occurrence relationship is estimated based on the co-occurrence relationships collected in the co-occurrence relationship collection step and their frequencies, and the estimate is used as the co-occurrence score.
- The program according to claim 31, wherein, in the co-occurrence score calculation step, the frequency of a co-occurrence relationship normalized by the appearance frequency of one or both of the two words constituting the co-occurrence relationship, based on the co-occurrence relationships collected in the co-occurrence relationship collection step and their frequencies, is used as the co-occurrence score.
- The program according to claim 31, wherein, in the co-occurrence score calculation step, the frequencies of the co-occurrence relationships collected in the co-occurrence relationship collection step are used as the co-occurrence scores.
- The program according to claim 31, causing the information processing apparatus to further execute a topic division step of detecting topic change points of the text based on the analysis result of the language analysis step and dividing the analysis result, wherein, in the co-occurrence relationship collection step, co-occurrence relationships are collected for each analysis result divided at the topic change points.
- The program according to claim 31, causing the information processing apparatus to further execute a typicality scoring step of collecting co-occurrence relationships in a typicality-assignment target text, obtaining co-occurrence scores corresponding to the collected co-occurrence relationships from the co-occurrence dictionary, and calculating the degree of typicality of the content of the typicality-assignment target text.
- The program according to claim 36, wherein, in the typicality scoring step, the typicality-assignment target text is morphologically analyzed, phrases are identified, dependencies between the phrases are analyzed, and co-occurrences of nouns in the typicality-assignment target text, dependencies between nouns and predicates, and dependencies between predicates are collected on a phrase basis as co-occurrence relationships.
- The program according to claim 36, wherein, in the typicality scoring step, combinations of the plurality of words constituting the typicality-assignment target text are collected as co-occurrence relationships.
- The program according to claim 36, wherein, in the typicality scoring step, among the combinations of the plurality of words constituting the typicality-assignment target text, those combinations of words that are meaningful are collected as co-occurrence relationships.
- The program according to claim 36, wherein, in the typicality scoring step, when an initial score has been assigned to the typicality-assignment target text, the degree of typicality is calculated from the co-occurrence scores obtained from the co-occurrence dictionary and the initial score.
- A program for causing an information processing apparatus having a co-occurrence dictionary, in which co-occurrences of nouns collected from text, dependencies between nouns and predicates, and dependencies between predicates are taken as co-occurrence relationships, values calculated based on the frequencies of the collected co-occurrence relationships are taken as co-occurrence scores, and the correspondence between the co-occurrence relationships and the co-occurrence scores is described, to execute a typicality scoring step of collecting co-occurrence relationships in a typicality-assignment target text, obtaining co-occurrence scores corresponding to the collected co-occurrence relationships from the co-occurrence dictionary, and calculating the degree of typicality of the content of the typicality-assignment target text.
- The program according to claim 41, wherein, in the typicality scoring step, the typicality-assignment target text is morphologically analyzed, phrases are identified, dependencies between the phrases are analyzed, and co-occurrences of nouns in the typicality-assignment target text, dependencies between nouns and predicates, and dependencies between predicates are collected on a phrase basis as co-occurrence relationships.
- The program according to claim 41, wherein, in the typicality scoring step, combinations of the plurality of words constituting the typicality-assignment target text are collected as co-occurrence relationships.
- The program according to claim 41, wherein, in the typicality scoring step, among the combinations of the plurality of words constituting the typicality-assignment target text, those combinations of words that are meaningful are collected as co-occurrence relationships.
- The program according to claim 41, wherein, in the typicality scoring step, when an initial score has been assigned to the typicality-assignment target text, the degree of typicality is calculated from the co-occurrence scores obtained from the co-occurrence dictionary and the initial score.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/922,320 US8443008B2 (en) | 2008-04-01 | 2009-04-01 | Cooccurrence dictionary creating system, scoring system, cooccurrence dictionary creating method, scoring method, and program thereof |
JP2010505973A JP5321583B2 (ja) | 2008-04-01 | 2009-04-01 | 共起辞書生成システム、スコアリングシステム、共起辞書生成方法、スコアリング方法及びプログラム |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008094980 | 2008-04-01 | ||
JP2008-094980 | 2008-04-01 | ||
JP2008-124254 | 2008-05-12 | ||
JP2008124254 | 2008-05-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009123260A1 true WO2009123260A1 (ja) | 2009-10-08 |
Family
ID=41135627
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/056804 WO2009123260A1 (ja) | 2008-04-01 | 2009-04-01 | 共起辞書作成システムおよびスコアリングシステム |
Country Status (3)
Country | Link |
---|---|
US (1) | US8443008B2 (ja) |
JP (1) | JP5321583B2 (ja) |
WO (1) | WO2009123260A1 (ja) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013171382A (ja) * | 2012-02-20 | 2013-09-02 | Nec Corp | 共起辞書作成装置 |
US9547645B2 (en) | 2014-01-22 | 2017-01-17 | Fujitsu Limited | Machine translation apparatus, translation method, and translation system |
WO2018029791A1 (ja) * | 2016-08-09 | 2018-02-15 | 楽天株式会社 | キーワード抽出システム、キーワード抽出方法およびプログラム |
JP2018049478A (ja) * | 2016-09-21 | 2018-03-29 | 日本電信電話株式会社 | テキスト分析方法、テキスト分析装置、及びプログラム |
JP7032582B1 (ja) | 2021-01-29 | 2022-03-08 | Kpmgコンサルティング株式会社 | 情報解析プログラム、情報解析方法及び情報解析装置 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101818717B1 (ko) * | 2011-09-27 | 2018-01-15 | 네이버 주식회사 | 컨셉 키워드 확장 데이터 셋을 이용한 검색방법, 장치 및 컴퓨터로 판독 가능한 기록매체 |
JP5870790B2 (ja) * | 2012-03-19 | 2016-03-01 | 富士通株式会社 | 文章校正装置、及び文章校正方法 |
CN104685493A (zh) * | 2012-09-27 | 2015-06-03 | 日本电气株式会社 | 用于监视文本信息的字典创建装置、用于监视文本信息的字典创建方法和用于监视文本信息的字典创建程序 |
JP6237168B2 (ja) * | 2013-12-02 | 2017-11-29 | 富士ゼロックス株式会社 | 情報処理装置及び情報処理プログラム |
US9684694B2 (en) * | 2014-09-23 | 2017-06-20 | International Business Machines Corporation | Identifying and scoring data values |
US11531811B2 (en) * | 2020-07-23 | 2022-12-20 | Hitachi, Ltd. | Method and system for extracting keywords from text |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08329090A (ja) * | 1995-05-30 | 1996-12-13 | Oki Electric Ind Co Ltd | 共起辞書装置、共起辞書データ作成方法及び文解析システム |
JP2003132059A (ja) * | 2001-10-19 | 2003-05-09 | Seiko Epson Corp | 言語文を用いた検索装置、検索システム、検索方法、プログラム、および記録媒体 |
JP2006072483A (ja) * | 2004-08-31 | 2006-03-16 | Toshiba Corp | プログラム及び文書処理装置並びに文書処理方法 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1049549A (ja) * | 1996-05-29 | 1998-02-20 | Matsushita Electric Ind Co Ltd | 文書検索装置 |
JP2006215850A (ja) | 2005-02-04 | 2006-08-17 | Nippon Telegr & Teleph Corp <Ntt> | 概念情報データベース作成装置、概念情報データベース作成方法、プログラムおよび記録媒体 |
-
2009
- 2009-04-01 WO PCT/JP2009/056804 patent/WO2009123260A1/ja active Application Filing
- 2009-04-01 JP JP2010505973A patent/JP5321583B2/ja active Active
- 2009-04-01 US US12/922,320 patent/US8443008B2/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08329090A (ja) * | 1995-05-30 | 1996-12-13 | Oki Electric Ind Co Ltd | 共起辞書装置、共起辞書データ作成方法及び文解析システム |
JP2003132059A (ja) * | 2001-10-19 | 2003-05-09 | Seiko Epson Corp | 言語文を用いた検索装置、検索システム、検索方法、プログラム、および記録媒体 |
JP2006072483A (ja) * | 2004-08-31 | 2006-03-16 | Toshiba Corp | プログラム及び文書処理装置並びに文書処理方法 |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013171382A (ja) * | 2012-02-20 | 2013-09-02 | Nec Corp | 共起辞書作成装置 |
US9547645B2 (en) | 2014-01-22 | 2017-01-17 | Fujitsu Limited | Machine translation apparatus, translation method, and translation system |
WO2018029791A1 (ja) * | 2016-08-09 | 2018-02-15 | 楽天株式会社 | キーワード抽出システム、キーワード抽出方法およびプログラム |
JPWO2018029791A1 (ja) * | 2016-08-09 | 2018-08-09 | 楽天株式会社 | キーワード抽出システム、キーワード抽出方法およびプログラム |
JP2018049478A (ja) * | 2016-09-21 | 2018-03-29 | 日本電信電話株式会社 | テキスト分析方法、テキスト分析装置、及びプログラム |
JP7032582B1 (ja) | 2021-01-29 | 2022-03-08 | Kpmgコンサルティング株式会社 | 情報解析プログラム、情報解析方法及び情報解析装置 |
JP2022117019A (ja) * | 2021-01-29 | 2022-08-10 | Kpmgコンサルティング株式会社 | 情報解析プログラム、情報解析方法及び情報解析装置 |
Also Published As
Publication number | Publication date |
---|---|
JP5321583B2 (ja) | 2013-10-23 |
JPWO2009123260A1 (ja) | 2011-07-28 |
US20110055228A1 (en) | 2011-03-03 |
US8443008B2 (en) | 2013-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5321583B2 (ja) | 共起辞書生成システム、スコアリングシステム、共起辞書生成方法、スコアリング方法及びプログラム | |
Hanselowski et al. | Ukp-athene: Multi-sentence textual entailment for claim verification | |
Singh et al. | Text stemming: Approaches, applications, and challenges | |
US8346795B2 (en) | System and method for guiding entity-based searching | |
Castellví et al. | Automatic term detection | |
JP4571404B2 (ja) | データ処理方法、データ処理システムおよびプログラム | |
US20070073745A1 (en) | Similarity metric for semantic profiling | |
US20050080613A1 (en) | System and method for processing text utilizing a suite of disambiguation techniques | |
US20070073678A1 (en) | Semantic document profiling | |
CN114254653A (zh) | 一种科技项目文本语义抽取与表示分析方法 | |
JPH03172966A (ja) | 類似文書検索装置 | |
JP2011227688A (ja) | テキストコーパスにおける2つのエンティティ間の関係抽出方法及び装置 | |
JP4534666B2 (ja) | テキスト文検索装置及びテキスト文検索プログラム | |
US9164981B2 (en) | Information processing apparatus, information processing method, and program | |
JP2014106665A (ja) | 文書検索装置、文書検索方法 | |
Krishna et al. | A hybrid method for query based automatic summarization system | |
Sanyal et al. | Natural language processing technique for generation of SQL queries dynamically | |
Quan et al. | Combine sentiment lexicon and dependency parsing for sentiment classification | |
JP5447368B2 (ja) | 新規事例生成装置、新規事例生成方法及び新規事例生成用プログラム | |
Srinivas et al. | Heuristics and parse ranking | |
Ullah et al. | Pattern and semantic analysis to improve unsupervised techniques for opinion target identification | |
Tadesse et al. | Event extraction from unstructured amharic text | |
Patel et al. | Influence of Gujarati STEmmeR in supervised learning of web page categorization | |
Ung et al. | Combination of features for vietnamese news multi-document summarization | |
JP2019003270A (ja) | 学習装置、映像検索装置、方法、及びプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09727204 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 12922320 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010505973 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09727204 Country of ref document: EP Kind code of ref document: A1 |